
DX-06
17th International Workshop on Principles of Diagnosis
Peñaranda de Duero, Burgos, Spain
June 26-28, 2006
Edited by Carlos Alonso González, Teresa Escobet, and Belarmino Pulido


Foreword

The International Workshop on Principles of Diagnosis is an annual event that started in 1989, rooted in the Artificial Intelligence community. Its focus is on theories, principles and computational techniques for diagnosis, monitoring, testing, reconfiguration and repair of complex systems. As in past editions, DX-06 brings together scientists and industrialists with diverse interests in diagnosis and with different backgrounds.

Many people and organisations have contributed to this workshop. We would first like to thank the authors, who have provided the primary material of the workshop: a set of papers of outstanding quality. We also thank the Programme Committee members for the time and effort devoted to reviewing the papers and for their help in selecting the contributions. Special thanks go to the invited speakers, César Barta from Iberespacio and Andrés Marcos from Deimos-Space, for accepting to contribute to the workshop and for the time spent preparing their talks.

Prior to this DX edition, we held the second edition of the Summer School on Fault Detection and Diagnosis of Complex Systems. The main goal of this intensive seminar is to introduce PhD students and companies to the advanced techniques currently used for fault detection and diagnosis. We especially thank Gautam Biswas, Luca Console, and Louise Travé-Massuyès for their contribution to the success of the School.

This workshop would not have been possible without the support of the local organisation teams. In Valladolid, Aníbal Bregón, Isaac Moro, Oscar Prieto, and Arancha Simón worked very hard on every task we asked them to help with. In Burgos, Juanjo Rodríguez, Jesús Maudes, and Santiago Villalba provided invaluable help with the conference location. And Alberto Calvete created the workshop web-site.

We would like to acknowledge the contribution of different local, regional and national institutions and organizations: Ayuntamiento de Peñaranda de Duero, Departamento de Informática of Universidad de Valladolid, Universidad de Burgos, Junta de Castilla y León, Asociación Española para la Inteligencia Artificial, Ministerio de Educación y Ciencia, Caja Círculo (Burgos) and Consejo Regulador Ribera de Duero. We are also thankful to our institutions, Universidad de Valladolid, Universidad de Burgos, and UPC, for all the financial and infrastructure support for organising the workshop.

Finally, we want to welcome the DX-06 participants. We hope this workshop will provide each participant with interesting and stimulating presentations and discussions as well as enjoyable social events.

Carlos Alonso, Teresa Escobet and Belarmino Pulido
DX-06 Co-Chairs


Workshop Organization

Programme Committee Co-Chairs
Carlos Alonso González (Universidad de Valladolid, Spain)
Teresa Escobet (Universitat Politècnica de Catalunya, Spain)
Belarmino Pulido (Universidad de Valladolid, Spain)

Programme Committee Members
Gautam Biswas (Vanderbilt University, USA)
Luca Console (University of Turin, Italy)
Marie-Odile Cordier (IRISA, France)
Philippe Dague (Paris South University, France)
Johan de Kleer (Xerox, USA)
Richard Dearden (University of Birmingham, UK)
Michael Hofbaur (Graz University of Technology, Austria)
Sriram Narasimhan (QSS, NASA Ames Research Center, USA)
Pieter Mosterman (The MathWorks, Inc., USA)
Mattias Nyberg (Scania CV, Södertälje, Sweden)
Xavier Olive (Alcatel Space, France)
Yannick Pencolé (LAAS-CNRS, Toulouse, France)
Claudia Picardi (University of Turin, Italy)
Gregory Provan (University College Cork, Ireland)
Martin Sachenbacher (University of Munich, Germany)
Marcel Staroswiecki (University of Lille, France)
Peter Struss (Technical University of Munich, Germany)
Markus Stumptner (University of South Australia, Adelaide, Australia)
Daniele Theseider Dupré (University of Piemonte Orientale, Italy)
Louise Travé-Massuyès (LAAS-CNRS, Toulouse, France)
Brian Williams (MIT, USA)
Franz Wotawa (Graz University of Technology, Austria)
Marina Zanella (University of Brescia, Italy)

Reviewers
Arantza Aldea (Oxford Brookes University, UK)
Matthew Daigle (Vanderbilt University, USA)
Michael Esser (Technische Universität München, Germany)
Maria J. de la Fuente (Universidad de Valladolid, Spain)
Arjan van Gemund (Delft University of Technology, Netherlands)
Xenofon Koutsoukos (Vanderbilt University, USA)
Wolfgang Mayer (University of South Australia, Australia)
Bernhard Peischl (Technische Universität Graz, Austria)
Xavier Pucel (LAAS-CNRS, Toulouse, France)
Indranil Roychoudhury (Vanderbilt University, USA)
Juan J. Rodríguez Diez (Universidad de Burgos, Spain)
Marcos Da Silveira (LAAS-CNRS, Toulouse, France)

Local Organization Committee
Pulido Junquera, Belarmino (Universidad de Valladolid, Spain)
Rodríguez Diez, Juan José (Universidad de Burgos, Spain)
Alonso González, Carlos (Universidad de Valladolid, Spain)
Bregón Bregón, Aníbal (Universidad de Valladolid, Spain)
Maudes Raedo, Jesús (Universidad de Burgos, Spain)
Moro Sancho, Q. Isaac (Universidad de Valladolid, Spain)
Prieto Izquierdo, Oscar (Universidad de Valladolid, Spain)
Simón Hurtado, Arancha (Universidad de Valladolid, Spain)
Villalba Bartolomé, Santiago (Universidad de Burgos, Spain)

Table of Contents

Invited Talks

Automatic Diagnosis in the Space Technology Field. European Programs (César Barta) ... 3
Robust FDI Estimation in Aerospace Applications (Andrés Marcos) ... 5

Papers

Hybrid Systems Diagnosability by abstracting Faulty Continuous Dynamics (Mehdi Bayoudh, Louise Travé-Massuyès, and Xavier Olive) ... 9
Distributed Diagnosis by using a Condensed Local Representation of the Global Diagnoses with Minimal Cardinality (Jonas Biteus, Erik Frisk, and Mattias Nyberg)
Ambiguity Groups Determination for Analog Non-Linear Circuits Diagnosis (Barbara Cannas, Alessandra Fanni, and Augusto Montisci)
Exploiting independence in a decentralised and incremental approach of diagnosis (Marie-Odile Cordier and Alban Grastien)
Multiple Fault Diagnosis in Complex Physical Systems (Matthew Daigle, Xenofon Koutsoukos, and Gautam Biswas)
Improvement of Chronicle-based Monitoring using Temporal Focalization and Hierarchization (Christophe Dousson and Pierre Le Maigat)
Model-based Test Generation for Embedded Software (Michael Esser and Peter Struss)
A Multi-Valued SAT-Based Algorithm for Faster Model-Based Diagnosis (Alexander Feldman, Jurryt Pietersma, and Arjan van Gemund)
A general method for diagnosing axioms (Gerhard Friedrich, Stefan Rass, and Kostyantyn Shchekotykhin)
Robust Fault Detection with State Estimators and Interval Models Using Zonotopes (Pedro Guerra, Vicenç Puig, and Ari Ingimundarson)
Supervision Patterns in Discrete Event Systems Diagnosis (Thierry Jéron, Hervé Marchand, Sophie Pinchinat, and Marie-Odile Cordier)
Primary and Secondary Plan Diagnosis (Femke de Jonge, Nico Roos, and Cees Witteveen)
Getting the Probabilities Right for Measurement Selection (Johan de Kleer)
Runtime Fault Detection and Localization in Component-oriented Software Systems (Bernhard Peischl, Joerg Weber, and Franz Wotawa)
A Bayesian approach to fault isolation with application to diesel engine diagnosis (Anna Pernestål, Mattias Nyberg, and Bo Wahlberg)
Automatic Generation of Benchmark Diagnosis Models (Gregory Provan)
A Bayesian Approach to Efficient Diagnosis of Incipient Faults (Indranil Roychoudhury, Gautam Biswas, and Xenofon Koutsoukos)
Qualitative Domain Abstractions for Time-Varying Systems: an Approach based on Reusable Abstraction Fragments (Gianluca Torta and Pietro Torasso)
Reliability and Diagnostics of Modular Systems: a New Probabilistic Approach (Michael Wachter, Rolf Haenni, and Jacek Jonczy)

Posters

Towards an Entropic Approach for the Analysis of Chronicle Models (Nabil Benayadi, Marc Le Goc, and Philippe Bouché)
Focusing fault localization in model-based diagnosis with case-based reasoning (Anibal Bregon, Belarmino Pulido, M. Aranzazu Simon, Isaac Moro, Oscar Prieto, Juan J. Rodriguez, and Carlos Alonso)
A Framework for Decentralized Qualitative Model-based Diagnosis (Luca Console, Claudia Picardi, and Daniele Theseider Dupré)
Comparing Diagnosability in Continuous and Discrete-Events Systems (Marie-Odile Cordier, Louise Travé-Massuyès, and Xavier Pucel)
On-line diagnosis for Time Petri Nets (G. Jiroveanu, G. B. De Schutter, and R. K. Boel)
Incremental indexing of temporal observations in diagnosis of active systems (Gianfranco Lamperti and Marina Zanella)
Introducing Data Reduction Techniques into Reason Maintenance (Rüdiger Lunde)
A Supervision Architecture to Deal with Disruptive Events in UAV Missions (Rachid El Mafkouk, Jean-François Gabard, and Catherine Tessier)
Debugging Failures in Web Services Coordination (Wolfgang Mayer and Markus Stumptner)
Observer Gain Effect in Linear Interval Observer-based Fault Isolation (Jordi Meseguer, Vicenç Puig, Teresa Escobet, and Joseba Quevedo)
A Generalization of the GDE Minimal-Hitting Set Algorithm to Handle Behavioral Modes (Mattias Nyberg)
Abstract Dependence Models in Software Debugging (Bernhard Peischl, Safeeullah Soomro, and Franz Wotawa)
Robust Fault Detection using Set-membership Estimation and Constraints Satisfaction (Vicenç Puig, Carlos Ocampo-Martínez, Sebastián Tornil, and Ari Ingimundarson)
Hierarchical Modelling and Diagnosis for Embedded Systems (Hervé Ressencourt, Louise Travé-Massuyès, and Jérôme Thomas)
Intermittent Fault Detection through Message Exchanges: a Coherence Based Approach (Siegfried Soldani, Michel Combacau, Jérôme Thomas, and Audine Subias)
Distributed Trace Estimation with Asynchronous Local Clocks and Imperfect Observation Channels (Rong Su and Michel Chaudron) ... 257


Invited Talks


Automatic Diagnosis in the Space Technology Field. European Programs

César Barta
Iberespacio, Tecnología Aeroespacial
C/ Magallanes 1, 1ª Planta, Madrid, Spain

A Next Generation European Reusable Launcher (RLV) is under evaluation. The evolution of the current Ariane 5 ECA as an Expendable (ELV) version is also possible. In any case, the reference missions will include a mission to geostationary orbit covering the market of commercial heavy telecommunication satellites (4 to 6 tons in GEO). Crew transportation shall not be considered as a design criterion. The program development schedule foresees when a Next Generation Launcher will become operational.

The ELV reliability is defined on the basis of a single mission (roughly one hour). Even though the European rocket Ariane is one of the most reliable in the worldwide industry, its history shows several catastrophic mission failures. For Ariane 5, the reliability target figure is 14×10⁻³. The RLV will have to perform several missions (about 100 missions), all with an increased level of reliability (10⁻³). To achieve this reliability increase, a new approach allowing failure detection and recovery for vehicle preservation will have to be implemented.

In the field of spacecraft systems, health monitoring is surveillance by means of sensors and signal processing units that provides a description of the system in order to detect and isolate operational anomalies. An effective Health Monitoring System accomplishes detection and identification of failure causes, whereas an optimised Health Management System (HMS) guarantees the selection of appropriate actions to recover from faulty conditions. In this way, the on-board automatic diagnosis task will be accompanied by the flight control. The HMS autonomy level has to be decided. Health Management should be present during flight, on the ground, and in the maintenance periods between flights. It applies across the entire life cycle of the vehicle, beginning in the earliest phases of design. For the reference missions, an HMS has two primary functions: to increase safety and reliability, and to decrease maintenance turn-around time and cost. To carry out all of these functions, the HMS will have two components: the on-board and the off-board HM subsystems. The on-board HM subsystem will support the ground mission operation and flight supervision.

The SSME is the only reusable rocket engine. Some diagnostic systems were built for it and a laboratory test-bed was built at NASA. Researchers demonstrated successful real-time fault detection and isolation by a model-based reactive autonomous system. Deep Space One and the X-37 IVHM are other NASA experiments. Several approaches to the HMS concept have had limited success. At one extreme, the most usual implementation is based on rule algorithms that check critical levels of a set of variables. Practice shows that an important number of false alarms are raised (Ariane 5 flight L501 is an example). At the other extreme, knowledge implemented in expert systems often cannot manage the whole spectrum of possible failures, leading to blackouts in the decision loops (the Deep Space 1 Remote Agent deadlock). To fulfil the above functions, several development lines may be investigated: structural and turbomachinery health monitoring; knowledge-based, case-based, machine-learning or model-based approaches; advanced sensors; rule-based diagnostics; etc.


Robust FDI Estimation in Aerospace Applications

Andrés Marcos, Ph.D.
Advanced Projects Division (Simulation and Control)
DEIMOS Space S.L.
Ronda de Poniente 19, Edificio Fiteni VI, P2, 2
Tres Cantos, Madrid, Spain
andres.marcos@deimos-space.com

In this talk we present an overview of the current state of the art in model-based robust fault detection and isolation (FDI) for guidance, navigation and control (GNC) in the aerospace world. Robust FDI is a critical component of aerospace systems due to the autonomous character of these systems (interplanetary missions with long communication delays, re-entry vehicle black-outs, satellite eclipses, etcetera), the environments where they operate (atmospheric, exo-atmospheric and space) and the aggressive system dynamics (high rotational and translational components). The standard approach in aeronautics and aerospace applications has been to use hardware redundancy and voting schemes, but nowadays there is a trend to reduce this redundancy by using advanced diagnostic techniques in conjunction with fault tolerant control (FTC) approaches. A summary of recent missions and projects where FDI/FTC has played a major role is presented, together with a discussion of the specific problems and requirements of this type of application and of the robust FDI techniques currently used.


Papers / Posters


Hybrid Systems Diagnosability by Abstracting Faulty Continuous Dynamics

Mehdi Bayoudh and Louise Travé-Massuyès
LAAS-CNRS, Toulouse, France

Xavier Olive
Alcatel Alenia Space, France

Abstract

On-line model-based reconfiguration is generally used to improve the ability of a system to tolerate faults. Recovery after fault occurrence relies on allowing the system to proceed with its mission from a new known nominal state. In this paper, we consider on-line reconfiguration from a novel point of view, having in mind to use reconfiguration actions to disambiguate the tracked estimated system state, i.e. to produce a more precise diagnosis. The choice of the best suited reconfiguration action(s) must hence be guided by the diagnosability properties of the system. However, the diagnosability conditions known for continuous systems (CS) on the one hand and for discrete event systems (DES) on the other hand cannot be applied directly because of the hybrid nature of the systems that we consider. Our work proposes a framework for analyzing the diagnosability of a hybrid system that stands on recent results establishing the formal equivalence of diagnosability definitions for DES and CS. The approach relies on merging the fault signatures exhibited at the continuous level into the Mode Automaton that represents the discrete dynamics of the system, so that DES diagnosability analysis can be performed on the resulting Behavior Automaton and the corresponding diagnoser. When the state of the system is ambiguous, an analysis of the diagnoser allows us to point at reconfiguration actions that safely move the system into a mode reducing ambiguity.

1 Introduction

Embedded systems found in today's cars, aircraft and space vehicles are characterized by a mix of hardware and software components and limited instrumentation. They hence undergo complex hybrid dynamics that can only be partially observed, which makes their on-board monitoring and diagnosis tricky. They generally require the use of stochastic and/or uncertain approaches which provide a belief state, or in other words an ambiguous diagnosis [Hofbaur and Williams, 2004][Benazera et al., 2002][Williams and Nayak, 1999]. In many cases, testing the system on-line can be an interesting option to produce a more precise diagnosis. For instance, in the space domain, specific commands are often applied by the ground segment to get more information about the state of a faulty spacecraft. This kind of testing, which we call active diagnosis, involves reconfiguring the system so that new symptoms are exhibited through the existing sensor instrumentation. The choice of the best suited reconfiguration action(s) must hence be guided by the diagnosability properties of the system. Diagnosability analysis proves to be a requisite for several other tasks such as instrumentation design, end-of-line testing, etc., and has received a lot of attention from the Model-Based Diagnosis community in the last few years, both for the analysis of Continuous Systems (CS) and Discrete Event Systems (DES) [Struss and Dressler, 2003][Console et al., 2000][Travé-Massuyès et al., 2004][Sampath et al., 1995][Pencolé, 2004]. However, the diagnosability conditions known for continuous systems (CS) on the one hand, and for discrete event systems (DES) on the other hand, cannot be applied directly when the system has hybrid dynamics.
We rely on recent results establishing the formal equivalence of diagnosability definitions for DES and CS [Cordier et al., 2006] and propose to abstract the faulty continuous dynamics of a hybrid automaton to produce an enriched discrete automaton that accounts for fault models. Fault models are obtained from fault signatures exhibited by the continuous dynamics constraints, which are interpreted in terms of events. DES diagnosability analysis can be performed on the resulting Behavior Automaton and the corresponding diagnoser. When the state of the system is ambiguous, an analysis of the diagnoser allows us to point at reconfiguration actions that safely move the system into a mode reducing ambiguity.

The paper is organized as follows. Section 2 introduces the hybrid modeling framework used for tracking the states of the system. Section 3 provides an insight into fault signatures as defined for continuous systems and gives the intuitions guiding our contribution. Section 4 then introduces the main DES diagnosability notions. Section 5 presents the procedure for building the Behavior Automaton from the Mode Automaton and the fault signatures exhibited at the continuous behavior level. Section 6 presents some criteria for hybrid systems diagnosability. Finally, section 7 illustrates our approach with a motivational example and shows how reconfiguration actions can increase diagnosability. Related work, perspectives for future work and a concluding discussion are provided in section 8.

2 Hybrid modeling framework

Embedded systems combine continuous dynamics with discrete events (which can be commanded or spontaneous). Hence, the hybrid formalism is appropriate for modeling such complex dynamic systems. Like in [Benazera et al., 2002][Bénazéra and Travé-Massuyès, 2003], a hybrid system is described by a hybrid automaton defined as a tuple S = (X, Q, Σ, T, C, (x_0, q_0)), where:

- X is the set of continuous variables, which includes observable and non-observable variables. These variables are linked by constraints that vary from one mode to another.
- Q is the set of system states. Each state q_i ∈ Q represents a functional mode of the system.
- Σ is the set of events. Events correspond to command value switches, spontaneous mode changes and fault events. Σ_o ⊆ Σ is the set of observable events. Without loss of generality we assume that fault events are unobservable.
- T: Q × Σ → Q is the transition function.
- C is the set of constraints, which may be qualitative or quantitative. Associating a subset of constraints C_i ⊆ C to the functional mode q_i allows one to describe the system's behavior in this mode.
- (x_0, q_0) is the initial condition.

The discrete part of the hybrid automaton, given by M = (Q, Σ, T, q_0), is a discrete automaton that describes the discrete dynamics of the system, i.e. the possible evolutions between operating modes in Q. We refer to this automaton as the Mode Automaton. Modes include nominal and fault modes as well as an unknown mode which stands for all the non-anticipated faulty situations. The unknown mode has no specified underlying behavior and hence no associated constraints.

3 Fault Signatures

The constraints in each mode q_i can be brought back to a set of consistency tests. Following the parity space approach, consistency tests take the form of analytical redundancy relations (the set S_ARR(q_i)) obtained by eliminating non-observable variables [Cordier et al., 2004]. An ARR can be expressed as r = 0, where r is called the residual of the ARR. The ARRs are constraints that only contain observable variables. They can be determined off-line and then evaluated on-line with the incoming observations, allowing one to check the consistency of the observed behavior against the predicted system behavior. They are satisfied if the observed behavior satisfies the model constraints, in which case the associated residuals are zero. In the opposite case, all or some of the residuals are non-zero. The set of residuals hence yields a boolean fault-indicator tuple. The expected boolean value pattern for a given fault provides the fault signature. In our hybrid framework, the set of ARRs associated with each functional system mode is generally different, although some ARRs may be shared. A fault hence manifests itself by the fact that a subset of residuals switches to a non-zero value, whereas other residuals may switch from an undetermined value to zero.

Definition 1. Given a set [r_1, ..., r_n] of n residuals and a set F = [F_1, F_2, ..., F_m] of m faults, the signature of a fault F_j is given by the binary vector FS_j = [s_1j, ..., s_nj]^T, where s_ij = 1 if some components affected by F_j are involved in ARR_i, and s_ij = 0 otherwise.

Residuals and fault signatures provide abstracted information about the continuous dynamics of the system, which is sufficient for characterizing the system's nominal or faulty state. When a fault occurs, fault signatures can be interpreted in terms of events referring to the residuals switching values.
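To make Definition 1 concrete, the following minimal sketch shows how a tuple of numeric residuals can be abstracted into a boolean pattern and matched against a signature table; the fault names, signatures and detection threshold are illustrative assumptions, not taken from the paper.

# Minimal sketch of Definition 1: matching an observed boolean residual
# pattern against fault signatures. Fault names, signatures and the
# detection threshold are illustrative assumptions.

# FS_j = [s_1j, ..., s_nj]^T: s_ij = 1 if fault F_j affects ARR_i.
SIGNATURES = {
    "F1": (1, 1, 0),
    "F2": (0, 1, 1),
}

def observed_pattern(residuals, tol=1e-3):
    """Abstract numeric residuals into the boolean fault-indicator tuple."""
    return tuple(int(abs(r) > tol) for r in residuals)

def matching_faults(residuals):
    """Faults whose signature equals the observed pattern."""
    pattern = observed_pattern(residuals)
    return [f for f, sig in SIGNATURES.items() if sig == pattern]

print(matching_faults([0.2, 0.15, 0.0]))   # -> ['F1']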
Our goal is to take advantage of this event-driven information to enrich the system's Mode Automaton, abstracting the continuous dynamics of the hybrid automaton into an extended discrete automaton that we call the Behavior Automaton. The diagnosability of the hybrid system can thus be analyzed from the Behavior Automaton, by using discrete event systems criteria [Sampath et al., 1995].

4 DES Diagnosability Analysis

4.1 Diagnosability definition

Diagnosability is the property of a system and its observables, i.e. the set of all the possible observations, that guarantees that a set of anticipated fault situations can be assessed and distinguished. Diagnosability definitions have been provided independently for CS [Travé-Massuyès et al., 2004][Struss and Dressler, 2003][Frisk et al., 2003] and for DES [Sampath et al., 1995][Rozé and Cordier, 2002]. However, recent results have proved that the definitions on both sides are formally equivalent [Cordier et al., 2006]. We take advantage of this result and propose to interpret the CS fault signatures, which are the key diagnosability concept, in terms of an automaton that can be merged into the discrete dynamics model. In this way, the diagnosability problem for the hybrid system is brought back to the diagnosability problem for an extended DES system. In consequence, this section restricts the presentation to the DES diagnosability definition and analysis through the so-called diagnoser [Sampath et al., 1995].

A DES is modeled by a finite state machine M = (Q, Σ, T, q_0) where Q is the set of states, Σ is the set of events, T: Q × Σ → Q is the transition function and q_0 the initial state, as already defined in section 2. The event set Σ is partitioned as Σ = Σ_uo ∪ Σ_o, where Σ_uo is the unobservable event set and Σ_o the observable event set. Observable events are system commands or events generated from the sensors. In our approach, these latter observable events are the residual value switches. We consider Σ_F ⊆ Σ_uo as the set of fault events to be diagnosed. In the DES community, diagnosis consists in the deduction of unobservable fault events from the observable traces generated by the system.

Definition 2. A fault F is diagnosable iff its occurrence is always followed by a finite observable sequence of events that allows us to diagnose F with certainty. The system is said to be diagnosable iff all the anticipated faults are diagnosable.

Formally, let s_F t be a sequence of events (or trajectory) such that s_F ends with the occurrence of F and t is a continuation of s_F. F is diagnosable iff:

∀ trajectory s_F t, ∃ integer n: length(t) ≥ n ⇒ (∀ trajectory s such that P_Σo(s) = P_Σo(s_F t), F occurs in s) [Pencolé, 2004],

where P_Σo is the projection operator on the set of observable events.

4.2 The diagnoser

We assume that M (defined in subsection 4.1) has no unobservable cycles (i.e. cycles containing unobservable events only). The set of fault events Σ_F is partitioned into disjoint sets corresponding to the different failure types F_i: Σ_F = Σ_F1 ∪ Σ_F2 ∪ ... ∪ Σ_Fn, with Σ_Fi ∩ Σ_Fj = ∅ for i ≠ j. The aim of diagnosis is to make inferences about past occurrences of failure types on the basis of the observed events. In order to solve this problem the system model is directly converted into a diagnoser. The diagnoser Diag(M) = (Q_Diag, Σ_Diag, T_Diag, q0_Diag) is a deterministic finite state machine built from the system model M = (Q, Σ, T, q_0) [Sampath et al., 1995]:

- q0_Diag = {(q_0, {})} is the initial state of the diagnoser.
- Σ_Diag = Σ_o is the set of observable events of the system.
- Q_Diag is the set of states of the diagnoser, with Q_Diag ⊆ P(Q × P(Σ_F)), where P(E) denotes the power set of E. The states of the diagnoser provide the set of diagnosis candidates as a set of couples whose first element refers to a state of the original system and whose second element is a label providing the set of faults on the path leading to this state. For example, when the diagnoser is in the state q_Diag = {(q_1, {}), (q_2, {F_1, F_3})}, it means that the system M is in one of the states q_1, q_2, as developed in Table 1. (We do not use the "ambiguous" label of [Sampath et al., 1995]; instead, we explicitly give the set of faulty system modes.)
- T_Diag is the diagnoser transition function, built by a recursive process that computes all the reachable states from the diagnoser initial state and propagates the diagnosis information. For more details see [Sampath et al., 1995].

System mode | Diagnosis | Comments
q_1 | {} | the system may be in the nominal state q_1 (no faults)
q_2 | {{F_1, F_3}} | the system may be in the faulty state q_2 with diagnosis {F_1, F_3} (the label F_i means that at least one fault of type Σ_Fi has occurred)

Table 1: The {F_1, F_3}-uncertain state of the diagnoser.

Definition 3. Given a diagnoser state q_Diag ∈ Q_Diag, this state is F_i-uncertain iff F_i does not belong to all the labels of the state whereas F_i belongs to at least one label of the state.

Theorem 1. The system M is not diagnosable iff the associated diagnoser Diag(M) contains an uncertain cycle, i.e. a cycle in which there is at least one F_i-uncertain diagnoser state for some F_i and whose states also define a cycle in the original system M [Sampath et al., 1995].
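As an illustration of the diagnoser construction and of Definition 3, here is a small sketch of the subset construction on a toy mode automaton; the states, events and single fault type are invented for the example and are not the automaton of this paper.

# Sketch of the diagnoser construction of [Sampath et al., 1995] on a toy
# automaton. States, events and the fault f1 are illustrative only.
TRANS = {                      # (state, event) -> successor state
    ("q0", "f1"): "q1",        # f1: unobservable fault event
    ("q0", "a"): "q2",
    ("q1", "a"): "q3",
}
OBSERVABLE = {"a"}
FAULTS = {"f1"}

def uo_closure(pairs):
    """Saturate (state, fault-label) pairs along unobservable events."""
    pairs, frontier = set(pairs), list(pairs)
    while frontier:
        q, labels = frontier.pop()
        for (src, e), dst in TRANS.items():
            if src == q and e not in OBSERVABLE:
                new_labels = labels | {e} if e in FAULTS else labels
                nxt = (dst, frozenset(new_labels))
                if nxt not in pairs:
                    pairs.add(nxt)
                    frontier.append(nxt)
    return frozenset(pairs)

def step(diag_state, obs):
    """Diagnoser transition on one observable event."""
    moved = {(TRANS[(q, obs)], labels)
             for q, labels in diag_state if (q, obs) in TRANS}
    return uo_closure(moved)

q0_diag = uo_closure({("q0", frozenset())})
print(sorted((q, sorted(lbls)) for q, lbls in step(q0_diag, "a")))
# -> [('q2', []), ('q3', ['f1'])]: an {f1}-uncertain diagnoser state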
5 Building the Behavior Automaton

At the continuous level, a fault manifests itself as anticipated in its fault signature, which reduces the detection task to detecting the violation of a subset of ARRs. In this paper, we propose to model the (nominal and faulty) continuous behavior of the hybrid system based on events referring to the set of ARRs associated to the different modes.

For each mode of the system (nominal and faulty), we associate a so-called M-Behavior Automaton constructed from the knowledge of the residuals that must switch value when transitioning to this mode: these residuals include a subset of the residuals associated to the departure modes, which switch to a non-zero value, and the residuals of the current mode, which must switch to zero. Notice that the same procedure is applicable indifferently to transitions triggered by command events, fault events or spontaneous events. Every M-Behavior Automaton state hence evolves with the occurrence of residual value switches that define a set of events. The M-Behavior Automata capture all the information necessary to determine the unobservable events (fault or spontaneous) that occurred at the transition between modes by the analysis of their observable trajectories. The system's Behavior Automaton is obtained as an extension of the Mode Automaton by the M-Behavior Automata for each mode.

Let S_ARR(q_i) = {ARR_i1, ARR_i2, ..., ARR_iN(q_i)} be the set of ARRs associated to mode q_i and Sr(q_i) = {r_i1, r_i2, ..., r_iN(q_i)} the associated set of residuals, where N(q_i) is the number of ARRs in mode q_i. We denote by Sr_system = ∪_i Sr(q_i) the set of all residuals of the system, and by D = {0, 1, und} the residual value domain, where und stands for the undefined value, used to represent the case where the associated ARR is not defined in a given mode.

Now, let us define the function e, which associates an event to every residual value switch:

e: Sr_system × (D × D \ Diag_{D×D}) → Σ_behav
(r_ij, l, k) ↦ e_ij^lk

where Diag_{D×D} is the set {(und, und), (0, 0), (1, 1)}. The event e_ij^lk is hence associated to the residual r_ij switching from value l to value k.

Remark. The system changes from one mode q_i1 to another mode q_i2 iff at least one event e_i1j^01, 1 ≤ j ≤ N(q_i1), has occurred and all events e_i2j^und0, 1 ≤ j ≤ N(q_i2), have occurred. In this paper, we deal with the general case in which the order of occurrence of these events is not specified. Additional temporal information [Puig et al., 2005] permits one to specify the order of event occurrence.
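The event function e of this section can be sketched in a few lines; the residual names and mode instances below are hypothetical.

# Sketch of the event function e: residual value switches over
# D = {0, 1, 'und'} become abstract events e_rij^lk. Residual names and
# the mode instances below are hypothetical.
D = {0, 1, "und"}

def e(residual, old, new):
    """Event associated to a residual switching from value old to new."""
    assert old in D and new in D and old != new
    return f"e_{residual}^{old}{new}"

def switch_events(before, after):
    """Events raised by the residuals defined in the arrival mode.
    (Residuals that simply become undefined are ignored for brevity.)"""
    return [e(r, before.get(r, "und"), after[r])
            for r in after if before.get(r, "und") != after[r]]

# Entering a hypothetical fault mode: r2 fires (0 -> 1) and r8 becomes
# defined at zero (und -> 0), as in the Remark above.
print(switch_events({"r1": 0, "r2": 0}, {"r2": 1, "r8": 0}))
# -> ['e_r2^01', 'e_r8^und0']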

Definition 4. The M-Behavior Automaton for a given system mode q_i (either nominal or faulty) is defined as Mbehav_i = (Q_behav_i, Σ_behav_i, T_behav_i, q0_behav_i), where:

- Q_behav_i is the set of M-Behavior Automaton states; each state q_i,k ∈ Q_behav_i is characterized by an instance of the global set of residuals Sr_system, and the trajectories exhibit the different possible orders of the residual switches.
- Σ_behav_i ⊆ Σ_behav is the set of events; each event corresponds to one residual value switch.

As stated before, the system's Behavior Automaton is obtained as an extension of the Mode Automaton by the M-Behavior Automata for each mode. This procedure allows us to generate the system's Behavior Automaton in a mode-driven way, avoiding the enumeration of all possible states (see Figure 4). Formally, the system's Behavior Automaton is the synchronous product of the automata defining all the possible residual value switches, in which all non-accessible states and impossible transitions, defined by the Mode Automaton, are discarded. The proof of equivalence is not provided in this paper.

6 Hybrid Diagnosability Analysis

In this section, the definition of diagnosability is extended to hybrid systems and sufficient conditions applying separately to the discrete part and to the continuous part of the system are provided. The necessary and sufficient condition is then given in terms of the Behavior Automaton built in section 5.

Definition 5. A hybrid system is said to be diagnosable iff all the anticipated faults are diagnosable. A fault F_i is diagnosable iff its occurrence is always followed by a sequence of observed events, intermingled with continuous variable (or corresponding residual) observations, that permits one to distinguish F_i from all F_j, j ≠ i, with certainty.

6.1 Conditions for Hybrid Diagnosability

Proposition 1. The hybrid system S = (X, Q, Σ, T, C, (x_0, q_0)) is diagnosable if its discrete part M = (Q, Σ, T, q_0) (see section 2) is diagnosable according to the DES diagnosability conditions of Theorem 1.

The proof of this proposition is trivial. In practice, the discrete part of the hybrid system is rarely diagnosable because the Mode Automaton does not include explicit information about the events that occur after the occurrence of a fault. Diagnosability can only be decided on the basis of the observation of command events and is generally not achieved.

Proposition 2. The hybrid system S = (X, Q, Σ, T, C, (x_0, q_0)) is diagnosable if the underlying continuous systems corresponding to every mode are all diagnosable according to the continuous systems diagnosability conditions [Travé-Massuyès et al., 2004].

The proof of this proposition is trivial. In practice, the underlying continuous systems are seldom all diagnosable for every mode, and only weak diagnosability is achieved [Struss and Dressler, 2003][Travé-Massuyès et al., 2004].

Proposition 3. The hybrid system S = (X, Q, Σ, T, C, (x_0, q_0)) is diagnosable iff its Behavior Automaton is diagnosable according to the DES diagnosability conditions of Theorem 1. (The proof of this proposition is not provided in this paper.)

The Behavior Automaton is a discrete event model enriched with an abstraction of the continuous behaviors that combines the two aspects (continuous and discrete) of the hybrid system. In fact, the original discrete events provide important temporal information about the order in which continuous signatures are expected after the occurrence of a fault.
For example, they can be key in discriminating two faults having the same continuous signature but occurring in different modes (after different command events).

7 Motivational Example

Figure 1: Example circuit (components R1-R4, sources E1 and E2, currents I1 and I2, switch sw).

In this section, we take as an example the electrical circuit of Figure 1, which has two nominal operating modes N1 and N2, commanded by a switch sw. Without loss of generality, we only deal with the faulty modes involving the components R1 and R2 (see Figure 2). In other words, we assume that components R3 and R4 are always in nominal mode. For the sake of simplicity, we adopt the single fault assumption. Two cases will be analyzed: first, the voltages E1, E2 and the currents I1, I2 are observable, which is referred to as Case 1; second, E2 becomes unobservable, which is referred to as Case 2.

7.1 Computation of Analytical Redundancy Relations for Case 1

The ARRs of the system are computed in the two nominal modes and in the fault modes R1-short-circuit, R1-opened-circuit, R2-short-circuit and R2-opened-circuit, as given in Table 2. In each fault mode, the commands on and off allow one to switch between the two possible configurations of the system (sw=off and sw=on). In the following, for the sake of simplicity, the ARRs are not indexed w.r.t. modes as in section 5. This notation exhibits the shared ARRs in a clear way.

7.2 Diagnosability analysis for Case 1

In Case 1, the system is diagnosable according to the continuous diagnosability criterion, which proves the diagnosability property of the whole hybrid system by Proposition 2.

Indeed, it is easy to see that the set of ARRs is always different from one mode to another, implying that the continuous fault signatures are different.

Figure 2: Mode Automaton of the system: nominal modes N1 and N2 (connected by the on/off commands), fault modes R1-opened-circuit (f1), R1-short-circuit (f2), R2-opened-circuit (f3) and R2-short-circuit (f4), and the unknown mode (transitions to the unknown mode, labeled fu, are not all represented to avoid overprint).

Table 2: ARRs classified by mode, with R = R1 + R4 + R1R4/R3, R' = R2 + R4 + R2R4/R3 and α = R4/R3.

- N1: ARR1: E1obs = (R1 + R3) I1obs + R3 I2obs; ARR2: I2obs = (R1/R2) I1obs; ARR3: (1 + α) E1obs = E2obs + R I1obs + R4 I2obs
- N2: ARR2: I2obs = (R1/R2) I1obs; ARR4: (1 + α) E1obs = E2obs + R' I2obs
- R1 opened circuit (f1): (sw=off) ARR5: I1obs = 0; (sw=on) ARR5 and ARR6: E1obs = (R2 + R3) I2obs
- R1 short circuit (f2): (sw=off) ARR8: I2obs = 0; (sw=on) ARR7: (1 + α) E1obs = E2obs + R4 I1obs, ARR8 and ARR9: E1obs = R3 I1obs
- R2 opened circuit (f3): (sw=off) ARR8: I2obs = 0; (sw=on) ARR8, ARR10: (1 + α) E1obs = E2obs + R I1obs and ARR11: E1obs = (R1 + R3) I1obs
- R2 short circuit (f4): (sw=off) ARR5: I1obs = 0; (sw=on) ARR5, ARR12: (1 + α) E1obs = E2obs + R4 I2obs and ARR13: E1obs = R3 I2obs

7.3 Diagnosability analysis for Case 2

In Case 2, the residuals r3, r4, r7, r10 and r12 become non-computable, and consequently their corresponding events e_3^lk, e_4^lk, e_7^lk, e_10^lk and e_12^lk, (l, k) ∈ D × D \ Diag_{D×D}, must be removed from Σ_behav. The updated set of ARRs leads to a non-diagnosable underlying continuous system. Since the discrete part is not diagnosable on its own either, neither of the sufficient conditions given in Propositions 1 and 2 is fulfilled. Diagnosability analysis must hence be performed through the Behavior Automaton and its corresponding diagnoser. The R1-short-circuit and R2-opened-circuit Behavior Automata are given in Figures 3 and 4, respectively.

Figure 3: The R1-short-circuit Behavior Automaton for Case 2 (states q30-q34 and q40-q42, with the events e_r2^01, e_r8^und0, e_r9^und0 and the on/off commands).

Table 3 illustrates the mapping between the occurrence of events related to residuals and the resulting residual values in every state of the R2-opened-circuit Behavior Automaton.

Table 3: Mapping between the modes of the R2-opened-circuit Behavior Automaton (Figure 4) and the instances of the residuals r1, r2, r8, r11, (r5, r6) and (r9, r13) in Case 2.

Figure 4: The R2-opened-circuit Behavior Automaton for Case 2 (states q10-q14 and q20-q22, with the events e_r2^01, e_r8^und0, e_r11^und0 and the on/off commands).

Figure 5: Part of the whole system diagnoser for Case 2.

The diagnoser has been built, and part of it is given in Figure 5. The diagnoser shows that the hybrid system is not diagnosable according to the hybrid diagnosability condition of Proposition 3, because there is an uncertain on-off cycle containing the F_2-uncertain and F_3-uncertain diagnoser states. However, from the state {(q12, {f3}), (q32, {f2})} of this cycle, there are paths leading to the determined states {(q14, {f3})} and {(q34, {f2})}. {(q12, {f3}), (q32, {f2})} is reachable by transitioning on an event which corresponds to the command sw=off, which means that it is controllable. In an ambiguous diagnosis situation within this cycle, an active diagnosis action, namely the off command, can hence be applied to reach a diagnosable configuration (in which the continuous fault signatures of the modes R1-short-circuit and R2-opened-circuit become different). From the point of view of active diagnosis, it is important to notice the difference between uncertain diagnoser cycles with and without command events. The latter are definitively not diagnosable, but the former can be turned diagnosable when reconfiguration is permitted.

8 Discussion and conclusions

This paper proposes a framework for analyzing the diagnosability of a hybrid system which stands on recent results establishing the formal equivalence of diagnosability definitions for DES and CS. The approach relies on merging the fault signatures exhibited at the continuous level into the Mode Automaton that represents the discrete dynamics of the system, so that DES diagnosability analysis can be performed on the resulting Behavior Automaton and the corresponding diagnoser. When the state of the system is ambiguous, an analysis of the diagnoser allows us to point at reconfiguration actions that safely move the system into a mode reducing ambiguity.

To our knowledge there is no existing work proposing a method to analyze the diagnosability of a hybrid system. The method that we propose interprets the continuous dynamics of the system in terms of events and gives a procedure to merge this knowledge into the discrete dynamics model. Our approach can be related to the work by Lunze, which uses Quantized Automata [Lunze, 2000a][Lunze, 2000b]. Lunze starts with a continuous system and discretizes the continuous variable value domains. From this discretization, he is able to produce a behavior automaton that accounts for all the variable value switches. The behavior automaton that he produces is hence oriented towards behavior prediction and simulation purposes, and its semantics are quite different from those of the behavior automaton that we produce.
In our case, we have pursued the goal of obtaining the same kind of behavior automaton as used by the model-based DES diagnosis community [Sampath et al., 1995][Puig et al., 2005][Lamperti and Zanella, 2002], so that their results can then be applied as is. For this purpose, the abstraction of the continuous dynamics is performed from the continuous subspaces that characterize the different modes of the system and the switches undergone by the system state. The subspaces are generated by the ARRs, and the switches correspond to value switches of their corresponding residuals. In this framework, fault signatures are uniquely defined, which is not the case when the abstraction is based on a state-variable value partitioning of the state space, as used by Lunze [Lunze, 2000a].

This paper is a continuation of the work done by the French Imalaia group [Cordier et al., 2004] and the Bridge Task Group within the MONET Network of Excellence. It hence uses the knowledge and results obtained by these two groups and establishes yet another bridge between two model-based communities, namely the continuous and the DES model-based communities.

Future work will be devoted to the problem of using diagnosability assessment for selecting the best reconfiguration action. This problem goes beyond selecting and applying a discrete action. Indeed, some physical constraints may require planning a sequence of actions, and the hybrid nature of the system may call for hybrid control.

References

[Bénazéra and Travé-Massuyès, 2003] E. Bénazéra and L. Travé-Massuyès. The consistency approach to the on-line prediction of hybrid system configurations. IFAC Conference on Analysis and Design of Hybrid Systems (ADHS'03), Saint-Malo, France, 2003.

[Benazera et al., 2002] E. Benazera, L. Travé-Massuyès, and P. Dague. State tracking of uncertain hybrid concurrent systems. In 13th International Workshop on Principles of Diagnosis (DX'02), Semmering, Austria, 2002.

[Console et al., 2000] L. Console, C. Picardi, and M. Ribaudo. Diagnosis and diagnosability analysis using PEPA. In Proceedings of the 14th European Conference on Artificial Intelligence (ECAI'00), Berlin, Germany, 2000.

[Cordier et al., 2004] M.O. Cordier, P. Dague, F. Lévy, J. Montmain, M. Staroswiecki, and L. Travé-Massuyès. Conflicts versus analytical redundancy relations: a comparative analysis of the model-based diagnostic approach from the artificial intelligence and automatic control perspectives. IEEE Transactions on Systems, Man and Cybernetics, Part B, 34, 2004.

[Cordier et al., 2006] M.O. Cordier, L. Travé-Massuyès, and X. Pucel. Comparing diagnosability criteria in continuous systems and discrete event systems. In Proceedings of the 17th International Workshop on Principles of Diagnosis (DX'06), Burgos, Spain, 2006.

[Frisk et al., 2003] E. Frisk, D. Düştegör, M. Krysander, and V. Cocquempot. Improving fault isolability properties by structural analysis of faulty behavior models: application to the DAMADICS benchmark problem. In Proceedings of IFAC Safeprocess'03, Washington, USA, 2003.

[Hofbaur and Williams, 2004] M.W. Hofbaur and B.C. Williams. Hybrid estimation of complex systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 34(5), 2004.

[Lamperti and Zanella, 2002] Gianfranco Lamperti and Marina Zanella. Diagnosis of discrete-event systems from uncertain temporal observations. Artificial Intelligence, 137(1-2):91-163, 2002.

[Lunze, 2000a] Jan Lunze. Diagnosis of quantized systems based on a timed discrete-event model. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 30(3), 2000.

[Lunze, 2000b] Jan Lunze. Diagnosis of quantized systems by means of timed discrete-event representations. In HSCC, 2000.

[Pencolé, 2004] Y. Pencolé. Diagnosability analysis of distributed discrete event systems. In Proceedings of the 16th European Conference on Artificial Intelligence (ECAI 2004), pages 43-47, 2004.

[Puig et al., 2005] V. Puig, J. Quevedo, T. Escobet, and B. Pulido. On the integration of fault detection and isolation in model-based fault diagnosis. In Proceedings of the 16th International Workshop on Principles of Diagnosis (DX'05), 2005.
[Rozé and Cordier, 2002] L. Rozé and M.-O. Cordier. Diagnosing discrete-event systems: extending the diagnoser approach to deal with telecommunication networks. Journal on Discrete-Event Dynamic Systems: Theory and Applications (JDEDS), 12(1):43-81, 2002.

[Sampath et al., 1995] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D. Teneketzis. Diagnosability of discrete-event systems. IEEE Transactions on Automatic Control, 40, 1995.

[Struss and Dressler, 2003] P. Struss and O. Dressler. A toolbox integrating model-based diagnosability analysis and automated generation of diagnostics. In Proceedings of the 14th International Workshop on Principles of Diagnosis (DX'03), Washington DC, USA, 2003.

[Travé-Massuyès et al., 2004] L. Travé-Massuyès, T. Escobet, and X. Olive. Diagnosability analysis based on component-supported analytical redundancy relations. Rapport LAAS N04080; to appear in IEEE Transactions on Systems, Man and Cybernetics, Part A, 2004.

[Williams and Nayak, 1999] Brian C. Williams and P. Pandurang Nayak. A model-based approach to reactive self-configuring systems. In Jack Minker, editor, Workshop on Logic-Based Artificial Intelligence, Washington, DC, 1999. Computer Science Department, University of Maryland, College Park, Maryland.


Towards an Entropic Approach for the Analysis of Chronicle Models

Nabil Benayadi, Marc Le Goc and Philippe Bouché
Laboratoire des Sciences de l'Information et des Systèmes (LSIS), UMR CNRS
Université Paul Cézanne Aix-Marseille III
Avenue Escadrille Normandie Niemen, 13397 Marseille Cedex 20, France
{nabil.benayadi, marc.legoc, philippe.bouche}@lsis.org

Abstract

This paper proposes to use informational entropy to analyze chronicle models discovered by a stochastic approach [Le Goc et al., 2005, 2006]. The aim is to define a method and an algorithm to analyze a chronicle model used to predict the occurrence of a particular discrete event class in a diagnosis task for dynamic systems. This is a classification problem over sequences of discrete event occurrences. This problem is rarely tackled in the literature, especially when the resulting model must be interpretable by humans. We propose an algorithm based on an informational entropy criterion that constructs a continuous-time decision tree from which a chronicle model is deduced. The latter identifies the main discrete event classes that suffice to predict the occurrence of a particular discrete event class. In this paper we show on an example how such an entropic criterion complements the stochastic approach and provides an operational tool for analyzing a given chronicle model.

1 Introduction

This paper presents an algorithm to analyze a chronicle model provided by an expert or, as in our case, generated by the BJT algorithm [Le Goc et al., 2005], which is based on a stochastic modeling of continuous-time discrete event sequences. The chronicle models generated by the BJT algorithm are used in a diagnosis task in order to predict the occurrence of a particular discrete event class. A chronicle model is a set of time-constrained binary relations between discrete event classes. The aim of this work is to define a method and a tool to analyze the contribution of the discrete event classes of such chronicle models to the prediction of the occurrence of a particular discrete event class. For that purpose, we propose to use an informational entropy criterion to build a continuous-time decision tree that describes the way the occurrences of a particular discrete event class are generated. Our approach is inspired by the TemporalID3 algorithm of [Console and Picardi, 2003]. Our algorithm, called CTID3 (ID3 with Continuous Time values), uses a chronicle model to build a set of sequences that are labeled OK or KO according to whether or not they lead to an occurrence of a given target discrete event class. The algorithm works in three steps: (1) the representation of the sequences in the form of a continuous-time decision table, (2) the construction of a continuous-time decision tree, and (3) the deduction of a chronicle model from the temporal decision tree. This chronicle model specifies the discrete event classes that contribute to the prediction of the occurrence of the target class. This then provides a means to analyze the initial chronicle model. In this paper, the method is assessed with a set of theoretical sequences corresponding to the prediction of a certain event class B.

The next section discusses research in the temporal knowledge discovery domain and related works. Section 3 presents the definition of decision trees, describes the well-known ID3 algorithm [Quinlan, 1986], and introduces the extension to timed data proposed in [Console and Picardi, 2003]. Section 4 describes the problem statement.
Our algorithm, CTID3, is presented in section 5 with a theoretical example. The conclusion is dedicated to presenting our future works.

2 Related Works

Classification is one of the most typical tasks in supervised learning, but it has not received much attention in the temporal domain [Antunes and Oliveira, 2001]. In fact, there are relatively few applications based on temporal classification in the literature, notably when the resulting model must be interpretable by humans. The main contributions are decision trees [Breiman et al., 1984; Murthy, 1998], which are largely used for the classification of temporal data [Kadous, 1999], [Drucker and Hubner, 2002], [Rodriguez and Alonso, 2004]. Their popularity comes from their capacity to produce interpretable, comprehensible classification models. Timed data is still a basic problem in knowledge discovery for model-based diagnosis applications. The use of temporal knowledge is required in a large range of domains, from engineering to medicine or marketing, for example. In the engineering domain, the diagnosis of a dynamic system such as a telecommunication network or an industrial production system aims at predicting malfunctions from the flow of alarms generated by the equipment of the supervision system. These alarms must be filtered in order to preserve only those which are the most interesting.

[Mannila et al., 1997] proposes a set of methods allowing the extraction of frequent patterns (called episodes) from a sequence of alarms. Episodes are said to be parallel when the alarms are not ordered and sequential when they are completely ordered. These methods are inspired by those developed in the marketing domain. [Agrawal and Srikant, 1994] defined an approach allowing the extraction of sequential patterns over a large database of customer transactions in order to identify the groups of the most sold articles, where each transaction consists of a customer id, a transaction time and the items bought in the transaction. They proposed three algorithms (AprioriAll, AprioriSome and DynamicSome) to solve this problem, all being extensions to temporal data of the well-known Apriori algorithm [Agrawal et al., 1993]. This approach was applied to the analysis of the alarms generated by a telecommunication network [Hatonen et al., 1996a, 1996b]. The temporal constraint is solely a maximum global interval (the observation window) covering the duration of the episodes. This interval must be defined by the user, whereas the temporal constraints between the elements of the episodes are completely ignored. [Cordier and Dousson, 2000] and [Dousson and Duong, 1999] propose an approach to discover chronicle models from a log of timed alarms that aims at exhibiting recurring phenomena. This work completes that of Mannila with the introduction of temporal constraints between alarms. However, the sequentiality of the alarms is not directly expressed in this approach, and the temporal constraints are calculated with an ad hoc heuristic. Mannila et al. [2002] drew a conclusion about the main results obtained using Apriori-like algorithms, namely that these algorithms identify very local relations between events. To avoid this problem, Mannila invites the use of other algorithms adopting a more global point of view, and proposes to investigate Markov theory.

[Le Goc et al., 2005, 2006] propose a stochastic approach that aims at building chronicle models from a sequence of discrete event classes generated by a knowledge-based system. This approach is based on the representation of a sequence of discrete event classes in the dual forms of a Markov chain and a superposition of Poisson processes. It provides global chronicle models with operational timed constraints that can be used in a prediction task for diagnosis. These chronicle models can, however, contain event classes that have a maximum probability in the Markov chain but bear no meaningful relation to the event to predict. This approach therefore requires tools to analyze the contribution of the discrete event occurrences that lead to an occurrence of the class to predict. An informational entropic criterion can be used to this end. Such a criterion is used to build decision trees [Quinlan, 1986]. [Geurts and Wehenkel, 1998] proposes an extension of these binary decision trees to timed data, called temporal decision trees, for the early prediction of electric power system blackouts, using a large database of random power system scenarios generated by Monte-Carlo simulation. Each scenario is described by temporal variables and sequences of events describing the dynamics of the system as observed from real-time measurements. The aim is to derive models that are as simple as possible to detect a blackout problem in the system.
More recently, [Console and Picardi, 2003] used temporal decision trees to diagnose the behavior of dynamic systems in order to decide on possible correction actions to carry out. Inspired by [Geurts and Wehenkel, 1998], they extend the ID3 algorithm [Quinlan, 1986] to temporal data, calling the result TemporalID3, in order to produce compact n-ary temporal decision trees from a set of problematic situations. We propose to adapt this algorithm with the aim of using an informational entropic criterion to analyze a chronicle model.

3 Temporal Decision Trees

Decision trees are used to implement classification problem-solving methods that can be used in diagnostic tasks. Each node of the tree corresponds to a variable. A node can have as many descendants as the number of values taken by the variable. The leaves correspond to the different decisions. Formally, a decision tree is a structure T = <r, N, E, L> where:

- N = N_I ∪ N_L is the union of a set N_I = {x_i} of internal nodes indicating a variable x_i and a set N_L = {a_i} of leaf nodes indicating a decision a_i,
- r ∈ N is the root node of the tree,
- E ⊆ N_I × N is a set of arcs (an arc corresponding to a value v_j of a variable x_i),
- L is a labeling function defined over N ∪ E which returns: the name of the variable x_i associated to a node of N_I, or the decision a_i associated to a leaf node of N_L, or the value v_j associated to an arc of E.

The ID3 algorithm uses the informational entropy of a set Σ of cases to build a minimal decision tree. A case s is a collection of values v_j taken by a set of variables x, leading to a particular decision a_i. At each node, ID3 chooses the variable x which minimizes the entropy ξ(x, Σ) over the set of cases Σ:

ξ(x, Σ) = Σ_{j=1..k} P(x = v_j) ξ(Σ_{x=v_j})
ξ(Σ_{x=v_j}) = − Σ_{i=1..n} P(a_i; Σ_{x=v_j}) log2(P(a_i; Σ_{x=v_j}))     (1)
Σ_{x=v_j} = {s ∈ Σ | the variable x has value v_j in s}

where ξ(Σ_{x=v_j}) is the entropy of the partition Σ_{x=v_j} and P(a_i; Σ_{x=v_j}) is the probability of the decision a_i in this partition.

Table 1: Example of a temporal decision table.

Situation | x_1 (t0, t1, t2, t3) | x_2 (t0, t1, t2, t3) | x_3 (t0, t1, t2, t3) | Decision | Dl
s_1 | n, v, n, n | h, h, n, n | l, v, v, v | a_1 | t3
s_2 | h, h, v, - | h, n, n, - | h, n, n, - | a_2 | t2
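A few lines of Python make formula (1) concrete; the tiny set of cases below is illustrative and is not the data of Table 1.

import math

# Sketch of the ID3 entropy criterion of formula (1). The cases are
# illustrative assumptions, not the situations of Table 1.
def entropy(decisions):
    """Entropy of the decisions in one partition Sigma_{x=v_j}."""
    n = len(decisions)
    return -sum((decisions.count(a) / n) * math.log2(decisions.count(a) / n)
                for a in set(decisions))

def xi(cases, var):
    """xi(x, Sigma): expected entropy after splitting on variable var."""
    total = 0.0
    for v in {c[var] for c in cases}:
        part = [c["decision"] for c in cases if c[var] == v]
        total += len(part) / len(cases) * entropy(part)
    return total

cases = [{"x1": "n", "decision": "a1"},
         {"x1": "h", "decision": "a2"},
         {"x1": "n", "decision": "a1"}]
print(xi(cases, "x1"))   # -> 0.0: splitting on x1 separates the decisions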

According to [Console and Picardi, 2003], a temporal decision tree is a decision tree where a node is a couple (x_i, t_k), x_i indicating a variable and t_k the observation date of its value, and where an arc defines a value v_j of x_i at the date t_k (i.e. x_i(t_k) = v_j). It is a structure T = <r, N, E, L, Tp> endowed with a time-labeling function Tp: N_I → ℝ⁺ which gives the date associated with an internal node. The training set is a collection of situations S = {s_{e=0,...,m}}. A situation s_e is the set of values v_j taken by a set of variables X = {x_i} at every observation date t_k, leading to a particular decision a_n (Table 1). A situation s_e refers to a discrete time clock where t_k = kT, k ∈ ℕ and T ∈ ℝ⁺, T being the discretization period. The variables x_1, x_2, x_3 of Table 1 take the qualitative values n, h, l, or v at the observation dates t0, t1, t2, t3. In the first situation s_1 (resp. s_2), the decision a_1 (resp. a_2) must be taken at the latest at date t3 (resp. t2). This date is called the "limit" because knowledge of the variables' values beyond this date is useless for the decision-making.

TemporalID3 is an extension of ID3 to timed data following a discrete time clock structure, used to build temporal decision trees [Console and Picardi, 2003]. A partition P_e is a subset of S containing situations that are identical on a time interval: ∀t_k ∈ [t_min, t_max], ∀x ∈ X, ∀s_i, s_j ∈ P_e, Ttd[s_i, x, t_k] = Ttd[s_j, x, t_k] ⟹ s_i ≡ s_j. Thus, at every observation date t, S is partitioned into a set of partitions P_t = {P_{e=0,...,m}}:

P_t = \{ P_e \mid P_e \subseteq S,\ \forall t_k \in [t, Dl(P_e)],\ \forall s_i, s_j \in P_e,\ \forall x \in X,\ Ttd[s_i,x,t_k] = Ttd[s_j,x,t_k] \}    (2)
Dl(P_e) = \min\{ Dl(s_l) \mid s_l \in P_e \}

TemporalID3 builds a tree by seeking a time interval which maximizes a criterion related to the number of partitions. Then, like ID3, TemporalID3 chooses the couple (x_i, t) that minimizes the entropy criterion (3) in this interval and creates the corresponding node:

\xi((x_i,t),S) = \sum_{j=1}^{k} P(x_i(t) = v_j)\,\xi(S_{x_i(t)=v_j})    (3)

All the values of the variables at every date before t are eliminated from the table, including x_i(t); then TemporalID3 repeats its treatment until one of two termination conditions is met: all the situations of S are classified, or there are no valid observations x(t) for splitting S.

4 Problem Statement

In this section, we define the problem of analyzing chronicle models. First, we give a brief overview of the spatial discretization principle and the notion of chronicle model based on the formal description of event classes introduced in [Le Goc, 2004] and extended in [Le Goc et al., 2005]. Second, we examine the problem of the chronicle models generated by the stochastic approach.

A sequence ω_n = {o_k}_{k=0,...,m−1}, ω_n ∈ Ω, is an ordered set of m occurrences o_k(t_k, x, r_m) of discrete events e_k(x, r_m), where x ∈ X is the name of a real variable, r_m ∈ R_x = {r_k}_{k=0,...,n} is an interval index of values for x(t), and t_k ∈ Γ = {t_i}, t_i ∈ ℝ, is the time of assignment of the index r_m to the variable x, x(t_k) = r_m. The occurrences are timed with a continuous clock structure (i.e. t_{k−2} − t_{k−1} ≠ t_{k−1} − t_k):

\forall t \in \mathbb{R},\ \forall r_m \in R_x,\ t_{k-1} < t_k,\quad (x(t_{k-1}) \neq r_m) \wedge (x(t_k) = r_m) \Rightarrow o_k(t_k, x, r_m)    (4)

Let E_vt = {e_k}, k ∈ ℕ, be a set of events defined over X × R and Γ = {t_i}, t_i ∈ ℝ, a set of occurrence times defined over ℝ; we note E_o = {o_k}, k ∈ ℕ, with o_k(t_k, x, r_m), a set of occurrences of discrete events defined over Γ × E_vt.
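To make equation (4) concrete, here is a minimal sketch (our own naming, not the authors' code) that turns a time-stamped series of interval indices for one variable into discrete event occurrences: an occurrence is emitted exactly when the index changes.

```python
def occurrences(variable, samples):
    """Emit occurrences o_k(t_k, x, r_m) per equation (4): one occurrence each
    time the interval index of x(t) changes. `samples` is a time-ordered list
    of (date, index) pairs on a continuous clock."""
    occs = []
    previous = None
    for t, r in samples:
        if r != previous:  # x(t_{k-1}) != r_m and x(t_k) = r_m
            occs.append((t, variable, r))
            previous = r
    return occs

# Toy usage: x enters index 1 at t=0.3, index 2 at t=1.7, index 1 again at t=2.4.
print(occurrences("x", [(0.3, 1), (0.9, 1), (1.7, 2), (2.4, 1)]))
# -> [(0.3, 'x', 1), (1.7, 'x', 2), (2.4, 'x', 1)]
```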
Let d be a function that provides the date of an occurrence:

d : E_o \to \Gamma,\quad \forall o_k(t_k, x, r_m) \in E_o,\ d(o_k) = t_k    (5)

A sequence ω_i = {o_k}_{k=0,...,m−1} defines a subset Γ_{ω_i} = {t_j} of dates, so Ω = {ω_i}_{i=0,...,n−1} defines the subset Γ = ∪_{i=0,...,n} Γ_{ω_i}:

\forall \omega_i \in \Omega,\ \forall o_k \in \omega_i,\quad d(o_k) \in \Gamma_{\omega_i} \subseteq \Gamma    (6)

A couple (o_k, o_{k+1}) of two successive occurrences related to the same variable x describes the temporal evolution of the discrete function x(t), defined on ℕ:

\forall o_k(t_k,x,r_m),\ o_{k+1}(t_{k+1},x,r_n) \in \omega,\quad (o_k,o_{k+1}) \Rightarrow (\forall t \in [t_k, t_{k+1}[,\ x(t)=r_m) \wedge (x(t_{k+1})=r_n)    (7)

A discrete event class is a set C_j = {e_i} of discrete events e_i(x, r_m). We will use the notation e_i :: C_j to denote that the discrete event e_i belongs to the class C_j. By extension, we will note o_k :: C_j an occurrence of a discrete event that belongs to the class C_j. A binary relation R(C^i, C^o, [τ⁻, τ⁺]) describes an oriented, time-constrained relation between two discrete event classes. [τ⁻, τ⁺] is the time window for observing an occurrence of the output class C^o after an occurrence of the input class C^i:

R(C^i, C^o, [\tau^-, \tau^+]) \Rightarrow \exists o_n, o_k \in \omega \in \Omega,\ (o_k :: C^i) \wedge (o_n :: C^o) \wedge (d(o_n) - d(o_k) \in [\tau^-, \tau^+])    (8)

A chronicle model is a set of binary relations with timed constraints between classes of discrete events. For example, a chronicle model M_3 = {R_12(C^1, C^2, [τ⁻_12, τ⁺_12]), R_23(C^2, C^3, [τ⁻_23, τ⁺_23])} defines two binary relations between three discrete event classes and means that there exist at least three occurrences in Ω so that:

\exists o_k, o_n, o_m \in \omega \in \Omega,\ (o_k :: C^1) \wedge (o_n :: C^2) \wedge (o_m :: C^3) \wedge (d(o_n) - d(o_k) \in [\tau^-_{12}, \tau^+_{12}]) \wedge (d(o_m) - d(o_n) \in [\tau^-_{23}, \tau^+_{23}])    (9)

Such chronicle models can be used to predict the occurrence of the final event classes, like C^3 in the M_3 chronicle model. To this aim, rules like the following can be used in a diagnosis task:

\forall \omega \in \Omega,\ \forall o_k, o_n \in \omega,\ (o_k :: C^1) \wedge (o_n :: C^2) \wedge (d(o_n) - d(o_k) \in [\tau^-_{12}, \tau^+_{12}]) \Rightarrow \exists o_m \in \omega,\ (o_m :: C^3) \wedge (d(o_m) - d(o_n) \in [\tau^-_{23}, \tau^+_{23}])    (10)

[Le Goc et al., 2005] propose an algorithm called BJT to discover such chronicle models from a sequence of discrete event occurrences. When the occurrences of the discrete event classes are independent and distributed according to an exponential distribution, a sequence can be modeled under the dual form of a homogeneous continuous time Markov chain and its corresponding superposition of Poisson processes. The BJT algorithm uses these two representations to build chronicle models. This constitutes the basis of the stochastic approach (see [Le Goc et al., 2006] for a global discussion of the stochastic approach).
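The following sketch illustrates how a chain-shaped chronicle model such as M_3 can be matched against a timed sequence, in the spirit of relations (8)-(10). It is a toy recognizer under our own naming, not the BJT algorithm of [Le Goc et al., 2005].

```python
def matches(chronicle, sequence):
    """Enumerate sub-sequences of `sequence` satisfying a chain-shaped
    chronicle model. `chronicle` is a list of (class_in, class_out, tau_min,
    tau_max) relations chained as in M_3; `sequence` is a time-ordered list of
    (date, event_class) occurrences."""
    def extend(prefix, remaining):
        if not remaining:
            return [prefix]
        _, c_out, lo, hi = remaining[0]
        last_date = prefix[-1][0]
        found = []
        for date, cls in sequence:
            # relation satisfied when d(o_out) - d(o_in) lies in [tau-, tau+]
            if cls == c_out and lo <= date - last_date <= hi:
                found += extend(prefix + [(date, cls)], remaining[1:])
        return found

    first_class = chronicle[0][0]
    results = []
    for date, cls in sequence:
        if cls == first_class:
            results += extend([(date, cls)], chronicle)
    return results

# Toy usage with M = {R(A, B, [1, 3]), R(B, C, [0, 2])}:
seq = [(0.0, "A"), (1.5, "B"), (2.0, "C"), (5.0, "B")]
print(matches([("A", "B", 1, 3), ("B", "C", 0, 2)], seq))
# -> [[(0.0, 'A'), (1.5, 'B'), (2.0, 'C')]]
```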

The anticipation ratio is a criterion allowing the measurement of the quality of the chronicle models produced by the stochastic approach. It relates the number of sub-sequences respecting the sequential and temporal constraints of a chronicle model, called Ok sequences (the positive predictions of the target class) and noted ω_n^Ok, to the number of sub-sequences respecting the constraints of this model deprived of the last binary relation concerning the class to be predicted, called Ko sequences (the false predictions of the target class) and noted ω_n^Ko:

anticipation\ ratio = \frac{nbr(\omega_n^{Ok})}{nbr(\omega_n^{Ok}) + nbr(\omega_n^{Ko})}    (11)

For example, let us consider a sequence ω_0 of occurrences of the classes C = {A, B, C, D} describing the evolution of four variables X = {v_a, v_b, v_c, v_d} in a given time period: ω_0 = {(0.8774, B), (1.9313, A), (2.8625, C), (3.8718, A), (4.4063, B), (4.7837, D), (6.0282, B), (6.0874, C), (6.2531, A), (8.0034, D), (8.4572, A), (9.2311, A), (9.4742, C), (9.5447, B), (9.8285, A), ( , B), ( , A), ( , B), ( , A), (13.621, B), ( , C), ( , D), ( , A), ( , B), ( , A), ( , C), ( , D), ( , A), ( , B), ( , A), ( , C), ( , B), ( , A), ( , A), ( , D), ( , C), ( , B), ( , A), ( , A), ( , C), ( , A), (26.882, B), ( , B), ( , D), ( , A), ( , B), ( , C), ( , A), ( , A), ( , B), (33.798, A), ( , C), ( , B), ( , A), ( , A), ( , A), ( , B), ( , D), (38.31, B), ( , C), ( , A), ( , B), ( , A), (41.992, C), ( , A), ( , B), ( , A), ( , B), ( , C), ( , C)}. Let us also consider a chronicle model deduced from this sequence with the stochastic approach (Figure 1).

[Figure 1: Chronicle model for the B class: a chain of timed binary relations over the classes A, C, A, B; its first relation is R(A, C, [0, 2.14]).]

This chronicle model defines a training set Ω = {ω_n} containing 8 sub-sequences ω_n^Ok and 16 sub-sequences ω_n^Ko (Figure 2).

[Figure 2: Sequences Ok (top) and Ko (bottom).]

Ideally, a chronicle model would have a ratio of 100% (nbr(ω_n^Ko) = 0). In practice, because the occurrences are generated according to a Poisson distribution, this ratio cannot be 100%. [Le Goc et al., 2005] consider that a chronicle model with an anticipation ratio greater than 50% is operational for diagnosis tasks. Our aim is then to define the means to analyze, and eventually to improve, the anticipation ratio of an operational chronicle model. The idea consists in using a temporal supervised classification method to identify the discrete event classes of a given chronicle model that allow deciding whether a sequence is Ok or not. The method must provide a model that is: (1) temporal, to take into account the temporal nature of the occurrences, (2) distinctive, to characterize only the Ok sequences, (3) interpretable, to be able to explain the reasons for the decision. The temporal decision trees of [Console and Picardi, 2003], based on [Geurts and Wehenkel, 1998], present these properties but must be adapted to data that are timed according to a continuous clock structure.
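For illustration, the anticipation ratio (11) reduces to a one-line computation once the Ok and Ko sub-sequences have been counted; the sketch below (our own naming) applies it to the counts of the example above.

```python
def anticipation_ratio(n_ok, n_ko):
    """Equation (11): fraction of constraint-respecting sub-sequences that
    actually end with an occurrence of the target class."""
    return n_ok / (n_ok + n_ko)

# With the 8 Ok and 16 Ko sub-sequences of the example:
print(anticipation_ratio(8, 16))  # -> 0.333..., below the 50% operational bound
```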
5 Continuous Time ID3

This section introduces the CTID3 algorithm (Continuous Time ID3) to classify a set Ω = Ω^Ok ∪ Ω^Ko, where Ω^Ok = {ω_n^Ok} contains a set of Ok sequences and Ω^Ko = {ω_n^Ko} contains a set of Ko sequences. CTID3 works in three stages: (1) the representation of the sequences in a temporal decision table, (2) the construction of the temporal decision tree, and (3) the extraction of the chronicle model from the temporal decision tree. This section details these stages.

5.1 Sequence Representation

The objective of this stage is to build a decision table similar to Table 1 containing a set of training sequences Ω = Ω^Ok ∪ Ω^Ko. This consists mainly in passing from data timed with a continuous time clock structure to data timed with a discrete time clock structure. The first step re-dates the occurrences of a sequence in a time relative to the date of its last occurrence. Let t_max be the date of the last occurrence in a sequence ω_i defining a set Γ_{ω_i} of dates: t_max = max{t_k | t_k ∈ Γ_{ω_i}}. The new date of an occurrence in a sequence ω_i is given by: ∀o_k ∈ ω_i, d(o_k) := t_max − d(o_k). The sequences of Ω can now be analyzed from the end to the beginning, i.e. in the opposite direction of time, so that the limit date Dl(ω_n) of a sequence ω_n is the largest date in a re-dated sequence: Dl(ω_n) = max{t_k | o ∈ ω_n, d(o) = t_k}. By extension, the global limit date Dl(Ω) of a set of sequences Ω = {ω_n} is the smallest limit date: Dl(Ω) = min{Dl(ω_n) | ω_n ∈ Ω}.

[Table 2: Continuous time temporal decision table: one row per variable (v_a, v_b, v_c, v_d) and re-dated observation time (from 0.0 to 4.5), one column per sequence ω_1, ω_2, ..., ω_24; each cell gives the class observed (A, B, C, D) or C? when the value is unknown, and the last rows give the decision (Ok or Ko) and the limit date Dl of each sequence.]

The set of sequences Ω = {ω_n} defines a set X = ∪ X_{ω_i} of variables x_i and a set Γ = ∪ Γ_{ω_i} of observation times t_k. In the example of Figure 2, X contains 4 variables and Γ contains 174 observation times. The construction of a temporal decision table such as Table 1 requires knowing the value of each variable x_i of X at each observation time t_k of Γ. But because equation (4) does not allow these values to be deduced from a given sequence ω_i, the decision table must be completed with event occurrences of the form o_k(t, x, ?), or o_k :: C?, where "C?" denotes the class of unknown values (Table 2).
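The re-dating step of Section 5.1 can be sketched as follows (our own naming, not the authors' code); each occurrence date is replaced by its distance to the last occurrence, so that sequences are read backwards in time.

```python
def redate(sequence):
    """Re-date a sequence relative to its last occurrence (Section 5.1):
    d(o_k) := t_max - d(o_k), so sequences are read from the end backwards.
    `sequence` is a list of (date, event_class) pairs."""
    t_max = max(date for date, _ in sequence)
    return sorted((t_max - date, cls) for date, cls in sequence)

# Toy usage: the last occurrence gets date 0.0, the first the largest date.
print(redate([(1.93, "A"), (2.86, "C"), (4.40, "B")]))
# -> [(0.0, 'B'), (1.54, 'C'), (2.47, 'A')] up to floating-point rounding
```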

5.2 Continuous Time Temporal Decision Tree

The continuous time temporal decision tree associated with the set of discrete event sequences Ω = {ω_n} can then be built by applying the TemporalID3 algorithm to the continuous time temporal decision table (Table 2). Note that the TemporalID3 algorithm has a number of prerequisites, two of which are relevant in our case: the compatibility criterion, which guarantees the recognition of the sequences before their limit date, and the minimal entropy criterion, which guarantees the minimization of the size of the resulting temporal decision tree. The proofs for the TemporalID3 algorithm are provided in [Console and Picardi, 2003].

[Figure 3: Continuous time temporal decision tree: internal nodes are couples (variable, date) such as (v_a, ·) and (v_c, ·), arcs are labeled with classes (A, B, C, C?), and leaves carry the decisions Ok and Ko.]

Applied to Table 2, TemporalID3 builds the continuous time temporal decision tree of Figure 3. This tree provides the different decisions that lead to an occurrence of the B discrete event class according to Ω. A decision corresponds to a node and an output arc, which together specify a triplet of the form (x_i, C_j, t_k), where x_i is the name of a variable, C_j denotes a discrete event class, and t_k is the latest date at which an instance of C_j must occur.

5.3 Chronicle Model Extraction

A sequence of such decisions can be used to analyze the corresponding chronicle model. To this aim, the idea is to transform this tree into a set of chronicle models. To deduce the chronicle model from a continuous time temporal decision tree, we first prune the branches leading to a Ko decision. Next, we prune the branches with at least one arc labeled with the C? class. This provides a tree whose branches can be interpreted as a set of binary relations between discrete event classes. The time constraints are then estimated from the limit times of two successive nodes and the average inter-occurrence delay of the corresponding discrete event classes. Because a continuous time temporal decision tree is built from a decision table where time is reversed, the construction of the model goes from the leaves to the root according to the following algorithm:

For each arc (n_1 → n_2) ∈ E of the tree, where n_2 is the child of n_1, we create a node in the chronicle model whose name is the event class L(n_1 → n_2).

For each couple of nodes (n_c1, n_c2) of the chronicle model, created respectively from two successive arcs (n_2 → n_3) and (n_1 → n_2), we create the arc (n_c1, n_c2) whose value is a temporal interval [τ_1, τ_2] calculated in the following way (where λ provides the average inter-occurrence delay of a discrete event class):

τ_1 = (d(n_2) − d(n_1)) − λ(n_c2)
τ_2 = (d(n_2) − d(n_1)) + λ(n_c1)

This algorithm produces the chronicle model of Figure 4 from the bold branch ((v_a, ), (v_c, ), (Ok)) of the tree of Figure 3.

[Figure 4: New chronicle model: C →[0.52, 0.79] A →[0.47, 0.66] B.]

It can be seen that the relation R(A, C, [0, 2.14]) of the model of Figure 1, produced by the stochastic approach, no longer appears in the new chronicle model produced by the entropic approach. This suggests that this relation brings only little information for predicting an occurrence of the target class B. The temporal constraints are stronger because Ω is made up only of sequences respecting the timed constraints of the stochastic model.
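The interval computation of Section 5.3 can be illustrated with a small helper (our own naming; the lambda values below are made up for the example, chosen to land near the first interval of Figure 4):

```python
def arc_interval(d_n1, d_n2, lam_c1, lam_c2):
    """Temporal interval [tau_1, tau_2] of a chronicle arc built from two
    successive tree nodes n_1, n_2 (Section 5.3):
      tau_1 = (d(n_2) - d(n_1)) - lambda(n_c2)
      tau_2 = (d(n_2) - d(n_1)) + lambda(n_c1)
    where lambda(.) is the average inter-occurrence delay of a class."""
    delta = d_n2 - d_n1
    return (delta - lam_c2, delta + lam_c1)

# With node dates 0.66 apart and made-up lambdas 0.13 and 0.14, the result is
# approximately [0.52, 0.79].
print(arc_interval(0.0, 0.66, 0.13, 0.14))
```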
6 Conclusion

This article proposes the CTID3 algorithm to compute chronicle models from a set of discrete event sequences. This algorithm transforms a set of sequences of discrete event classes into a continuous time temporal decision table, which allows the TemporalID3 algorithm of [Console and Picardi, 2003] to be applied to construct a continuous time temporal decision tree. Such a tree describes the sequence of decisions that leads to the occurrence of a particular discrete event class. Because a decision in such a tree corresponds to an occurrence of a discrete event class, CTID3 can build a set of chronicle models describing the relations between the discrete event classes that contribute to the occurrences of a given discrete event class. When the set of discrete event classes is provided by means of a given chronicle model, the chronicle models produced by the CTID3 algorithm can be used to analyze the given chronicle model. The main advantage of this approach is that the entropic criterion allows the identification of the event classes that are the most significant for predicting the occurrence of a particular event class. This property is particularly important when the chronicle models are produced by learning algorithms like the one used in the stochastic approach of [Le Goc et al., 2005]. This first result invites us to combine the entropic and stochastic approaches to discover temporal knowledge with a stronger predictive power. The CTID3 algorithm is implemented and will be applied to the discovery of the road models of the STMicroelectronics company.

References

[Agrawal et al., 1993] Agrawal, Rakesh, Imielinski, Tomasz, and Swami, Arun. Mining Association Rules between Sets of Items in Large Databases. Proc. ACM SIGMOD Int'l Conf. on Management of Data, May 1993.
[Agrawal and Srikant, 1994] Agrawal, Rakesh and Srikant, Ramakrishnan. Mining Sequential Patterns. Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan. Expanded version available as IBM Research Report RJ9910.
[Antunes and Oliveira, 2001] Antunes, Claudia and Oliveira, Arlindo. Temporal Data Mining: an Overview. In KDD Workshop on Temporal Data Mining, pages 1-13, San Francisco, 2001.
[Breiman et al., 1984] Breiman, Leo, Friedman, Jerome H., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. Belmont, CA: Chapman & Hall, 1984.
[Console and Picardi, 2003] Console, Luca, Picardi, Claudia, and Theseider Dupré, Daniele. Temporal Decision Trees: Model-based Diagnosis of Dynamic Systems On-Board. Journal of Artificial Intelligence Research, 19, 2003.
[Cordier and Dousson, 2000] Cordier, M.O. and Dousson, C. Alarm Driven Monitoring Based on Chronicles. Proceedings of SafeProcess 2000, Budapest, Hungary, 2000.
[Dousson and Duong, 1999] Dousson, Christophe and Vu Duong, Thang. Discovering Chronicles with Numerical Time Constraints from Alarm Logs for Monitoring Dynamic Systems. 16th International Joint Conference on Artificial Intelligence (IJCAI'99), 1999.
[Drucker et al., 2001] Drucker, Christian, Hubner, Sebastian, Visser, Ubbo, and Weland, H. Georg. As Time Goes by: Using Time Series Based Decision Tree Induction to Analyze the Behaviour of Opponent Players. RoboCup 2001: Robot Soccer World Cup V, LNAI. Berlin: Springer-Verlag.
[Frydman et al., 2001] Frydman, Claudia, Le Goc, Marc, Torres, Lucile, and Giambiasi, Norbert. Knowledge-Based Diagnosis in SACHEM using DEVS Models. Special Issue of Transactions of the Society for Modeling and Simulation International (SCS) on Recent Advances in DEVS Methodology, Tag Gon Kim Ed., Volume 18, n. 3, 2001.
[Geurts and Wehenkel, 1998] Geurts, Pierre and Wehenkel, Louis. Early Prediction of Electric Power System Blackouts by Temporal Machine Learning. In Proc. of the ICML-AAAI'98 Workshop on AI Approaches to Time-Series Analysis, Madison (Wisconsin), 1998.
[Hatonen et al., 1996a] Hatonen, Kimmo, Klemettinen, Mika, Mannila, Heikki, Ronkainen, Pirjo, and Toivonen, Hannu. Knowledge Discovery from Telecommunication Network Alarm Databases. In 12th International Conference on Data Engineering (ICDE'96), New Orleans, LA, 1996.
[Hatonen et al., 1996b] Hatonen, Kimmo, Klemettinen, Mika, Mannila, Heikki, Ronkainen, Pirjo, and Toivonen, Hannu. TASA: Telecommunication Alarm Sequence Analyzer, or How to Enjoy Faults in Your Network. In 1996 IEEE Network Operations and Management Symposium (NOMS'96), Kyoto, Japan, 1996.
[Kadous, 1999] Kadous, M. Waleed. Learning Comprehensible Descriptions of Multivariate Time Series. Proceedings of the Sixteenth International Conference on Machine Learning. San Francisco: Morgan Kaufmann, 1999.
[Le Goc, 2004] Le Goc, Marc. SACHEM, a Real Time Intelligent Diagnosis System based on the Discrete Event Paradigm. Simulation, The Society for Modeling and Simulation International Ed., Volume 80, 2004.
[Le Goc et al., 2005] Le Goc, Marc, Bouché, Philippe, and Giambiasi, Norbert.
Stochastic Modeling of Continuous Time Discrete Event Sequences for Diagnosis. In: 16th International Workshop on Principles of Diagnosis (DX-05), Pacific Grove, California, USA, 1-3 June 2005.
[Le Goc et al., 2006] Le Goc, Marc, Bouché, Philippe, and Giambiasi, Norbert. Temporal Abstraction of Timed Alarm Sequences for Diagnosis. In COGIS'06, International Conference on Cognitive Systems with Interactive Sensors, Paris, France, March 15-17, 2006.
[Mannila et al., 1995] Mannila, H., Toivonen, H., and Verkamo, I. Discovering Frequent Episodes in Sequences. First International Conference on Knowledge Discovery and Data Mining (KDD'95), 1995.
[Mannila et al., 1997] Mannila, Heikki, Toivonen, Hannu, and Verkamo, A. Inkeri. Discovery of Frequent Episodes in Event Sequences. Data Mining and Knowledge Discovery, 1(3), 1997.
[Mannila, 2002] Mannila, Heikki. Local and Global Methods in Data Mining: Basic Techniques and Open Problems. 29th International Colloquium on Automata, Languages and Programming, Volume 2380, pages 57-68, Malaga, Spain, 2002.
[Murthy, 1998] Murthy, K. Sreerama. Automatic Construction of Decision Trees from Data: A Multi-disciplinary Survey. Data Mining and Knowledge Discovery, vol. 2, no. 4, 1998.
[Quinlan, 1986] Quinlan, J. Ross. Induction of Decision Trees. Machine Learning, 1, 1986.
[Rodriguez and Alonso, 2004] Rodriguez, Juan J. and Alonso, Carlos J. Interval and Dynamic Time Warping-based Decision Trees. Proceedings of the 2004 ACM Symposium on Applied Computing.

Distributed Diagnosis by using a Condensed Local Representation of the Global Diagnoses with Minimal Cardinality

Jonas Biteus, Dept. of Electrical Engineering, Linköpings universitet, Sweden. Erik Frisk, Dept. of Electrical Engineering, Linköpings universitet, Sweden. Mattias Nyberg, Power-train Division, Scania AB, Sweden.
Corresponding author: biteus@isy.liu.se. Address: Vehicular Systems, Electrical Engineering, Linköpings universitet, Linköping, Sweden.

Abstract

The set of diagnoses is commonly calculated in consistency based diagnosis, where a diagnosis includes a set of faulty components. In some applications, the search for diagnoses is reduced to the set of diagnoses with minimal cardinality. In distributed systems, local diagnoses are calculated in each agent, and global diagnoses are calculated for the complete system. The key contribution of the present paper is an algorithm that synchronizes the local diagnoses in each agent such that these represent the global diagnoses with minimal cardinality. The resulting diagnoses only include faulty components used by the specific agent, and are therefore a condensed local representation of the global diagnoses with minimal cardinality.

1 Introduction

This paper considers distributed systems that consist of a set of agents, where an agent is a more or less independent software entity, connected to each other via some network [Hayes, 1999; Weiss, 1999]. In distributed systems, the diagnoses can be divided into two different types: global diagnoses, which are diagnoses for the complete distributed system, and local diagnoses, which are diagnoses for a single agent [Roos et al., 2002]. It is an advantage to have the set of minimal global diagnoses in each agent. However, an agent only has an interest in knowing the fault status of the components used by that agent, since the other components do not affect the specific agent. Consider for example a global diagnosis that consists of a set of components that have been found to be faulty. An agent that uses some components in its operation is interested in knowing if any of these components are included in the global diagnosis. The agent, however, does not have an interest in the fault status of the rest of the components in the global diagnosis. In some applications, the calculation of diagnoses is focused on some smaller set of diagnoses, for example the most probable diagnoses [de Kleer, 1991] or the diagnoses with minimal cardinality [de Kleer, 1990].

The key contribution of the present paper is an algorithm that synchronizes the minimal local diagnoses in one agent with the minimal local diagnoses in the other agents, such that the result is a set of diagnoses with minimal cardinality. Each resulting diagnosis is a subset of some global diagnosis, and only components used by the specific agent are included in the resulting diagnosis. Since only the components used by the specific agent are included, both the size and the number of the resulting diagnoses with minimal cardinality are reduced compared to the set of global diagnoses with minimal cardinality. The resulting diagnoses with minimal cardinality are therefore a condensed local representation of the global diagnoses with minimal cardinality, and are here denoted condensed diagnoses with minimal cardinality. By reducing the size and the number, the algorithm achieves a low computational load, low memory usage, and low network load.
The algorithm is distributed such that it can handle both changes in the number of agents and the exchange of single agents. The algorithm is described in Section 5, using the framework for distributed diagnosis presented in Sections 3 and 4. Our work has been inspired by diagnosis in distributed embedded systems used in automotive vehicles. These systems typically consist of precomputed diagnostic tests that are evaluated in different agents, which in the automotive industry correspond to electronic control units (ECUs). Sets of conflicts are generated when the diagnostic tests are evaluated in the ECUs, and the ECUs then compute sets of minimal local diagnoses. These embedded distributed systems typically consist of ECUs with both limited processing power and limited RAM memory. The algorithm presented here is therefore constructed such that it requires low processing power and low memory usage. In these systems, it should be possible to exchange, add, or remove ECUs without having to make any changes to the diagnostic software. The algorithm presented in this paper is therefore constructed such that it can handle such changes. Requirements on diagnostic systems used in automotive vehicles are discussed in Section 2.

1.1 Related Work

Most research, such as [Reiter, 1987], has been aimed at the centralized diagnosis problem. These methods can also be used for distributed systems by letting a central diagnostic agent collect conflicts from the system and then calculate the minimal global diagnoses. It is not always suitable to use

a dedicated central diagnostic agent, due to, for example, limited computing resources in each agent, robustness against agent disconnection, and the possibility of adding new agents to the network. There therefore exist algorithms, see for example [Provan, 2002], that compute the minimal global diagnoses in cooperation between the agents. These algorithms aim at the complete set of global diagnoses, while the method presented here aims at the set of global diagnoses with minimal cardinality. There also exist algorithms where the agents update the sets of local diagnoses such that these are consistent with the global diagnoses [Roos et al., 2003]. That method does not guarantee that a combination of the agents' local minimal diagnoses is also a global minimal diagnosis. However, for every global minimal diagnosis, there is a combination of local minimal diagnoses. The updated sets of local diagnoses represent the global diagnoses without actually computing the complete set of global diagnoses. The method presented in this paper updates the local diagnoses such that these represent the global diagnoses with minimal cardinality.

In [Biteus et al., 2005], a method was presented that calculated the global diagnoses with minimal cardinality by transmitting the minimal local diagnoses from one agent to another, which adds its minimal local diagnoses, then transmits the result to the next agent, etc. Even though the method is efficient, it might not give a sufficiently distributed diagnostic system, since it requires a lot of cooperation between the agents. In [Biteus et al., 2005], the computational burden when calculating the global diagnoses with minimal cardinality was reduced by partitioning the system into two or more sub-systems whose minimal local diagnoses did not share components with each other. The partitioning approach has similarities with the tree reduction technique used in [Wotawa, 2001] and can also be used for the algorithm presented here. Related to this work is also the EU-funded project Multi-Agents-based Diagnostic Data Acquisition and Management in Complex Systems (MAGIC), which develops an architecture useful for distributed diagnosis [Köppen-Seliger et al., 2003]. The project discusses protocols for network communication, control algorithms, and other aspects of the integration of diagnosis in distributed systems.

2 Requirements on Diagnostic Systems used in the Automotive Industry

To better understand the industrial demands on diagnosis, the distributed diagnostic systems used in Scania AB heavy-duty trucks have been analyzed. These systems consist of ECUs connected to each other via a controller area network (CAN). The software embedded in the ECUs is primarily used for control and monitoring.

2.1 An Example of a Distributed System

One configuration of the distributed system in Scania's heavy-duty vehicles is shown in Figure 1. The system includes three separate CAN buses: the red, the yellow, and the green. Each of the ECUs is connected to sensors and actuators, and both sensor values and control signals can be shared with the other ECUs over the network.
[Figure 1: The distributed system in current Scania heavy-duty vehicles: three CAN buses (green, yellow, and red) plus a diagnostic bus connect ECUs such as the coordinator system (COO), engine management system (EMS), gearbox management system (GMS), brake management system (BMS), instrument cluster system (ICL), audio system (AUS), crash safety system (CSS), and automatic climate control (ACC), with ISO11992 connections to the trailer.]

One example of an ECU is the engine management system, which is connected to sensors and actuators related to the engine. There can be up to about 30 ECUs in the system, depending on the type of the truck, and roughly between 4 and 110 components are diagnosed by each ECU. The ECUs' CPUs typically have a clock speed of 8 to 64 MHz and a RAM capacity of about 4 to 150 kB. CAN buses can typically transfer 100 to 500 kbit/s. As these numbers indicate, there is not much computational, memory, or network capacity available, especially considering that the ECUs should be used for both control and diagnosis.

2.2 Overall Requirements on Distributed Systems

A distributed system that can present the same information to users as if it were a centralized system can be denoted transparent [Tanenbaum and van Steen, 2002]. Considering fault diagnosis, one interpretation of transparency is that the minimal diagnoses presented by the distributed diagnostic system should be the same as those presented by a centralized diagnostic system, meaning that the minimal global diagnoses should be presented, not only the minimal local diagnoses. Another interpretation of transparency is that, even if one ECU fails to deliver its minimal local diagnoses, the remaining system should still be able to deliver the minimal global diagnoses. This means that the diagnostic processes should be distributed among the ECUs, or, if a centralized diagnostic ECU is used, backup ECUs should exist.

If it is possible to increase or decrease the size of the system without changes in the software, the system can be said to be scalable [Tanenbaum and van Steen, 2002]. Considering a truck, it should be possible to attach new parts, including new ECUs, to the network without having to change the software in the ECUs. If it is possible to exchange an ECU for, for example, a new version without having to change the software in the other ECUs, the system can be said to be interoperable [Tanenbaum and van Steen, 2002]. This is especially important in automotive systems, where it frequently occurs that parts are replaced by parts from other manufacturers.

2.3 Requirement Conclusions

An algorithm for distributed diagnosis used in automotive vehicles should require limited processing power, memory usage, and network load. The algorithm should further result in a transparent, scalable, and interoperable system. The algorithm presented in this paper synchronizes the minimal local diagnoses, which results in a transparent, scalable, and interoperable system. The condensed diagnoses with minimal cardinality do not include components that are not of interest, and this reduces the computational load, memory usage, and network load.

3 Consistency Based Diagnosis

A system consists of a set of components C, where a component is something that can be diagnosed. This not only includes components directly connected to the agents, such as sensors and actuators, but also components shared between the agents, e.g. cables and pipes. To reduce the complexity of the diagnostic system, it is sometimes preferable to only consider the abnormal mode AB and the not abnormal mode ¬AB, where the AB mode does not have a model. This means that the minimal diagnosis hypothesis is fulfilled [de Kleer et al., 1992], and therefore the notation in, for example, GDE will be employed. A diagnosis is a set of components D ⊆ C such that the components' abnormal behaviors, the remaining components' normal behaviors, the system description, and the observations are consistent. Since the minimal diagnosis hypothesis is fulfilled and D is a diagnosis, all supersets of D are also diagnoses. Further, a diagnosis D is a minimal diagnosis if there is no proper subset D' ⊂ D where D' is a diagnosis [de Kleer et al., 1992].

An evaluation of a diagnostic test results in a conflict if some components, checked by the test, have been found to behave abnormally. A conflict is a set of components π ⊆ C such that the components' normal behaviors, the system description, and the observations are inconsistent. A set D ⊆ C is a diagnosis if and only if it has a non-empty intersection with every conflict in a set of conflicts. A consequence of this is that the set of minimal diagnoses is exactly determined by the set of minimal conflicts [de Kleer et al., 1992].

In some cases, it is computationally intractable to calculate the complete set of minimal diagnoses. To reduce the computational cost, the search can be focused on the diagnoses with minimal cardinality, as described in for example [de Kleer, 1990]. Let 𝔻 be a set of diagnoses; then the set of minimal cardinality diagnoses is the set

D^{mc} = \{ D \in \mathbb{D} : |D| = \min_{D' \in \mathbb{D}} |D'| \}
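The minimal cardinality filter D^mc above is straightforward to express in code; a minimal sketch with our own naming:

```python
def minimal_cardinality(diagnoses):
    """D^mc: keep only the diagnoses of smallest cardinality."""
    smallest = min(len(d) for d in diagnoses)
    return [d for d in diagnoses if len(d) == smallest]

# Toy usage with diagnoses as sets of component names:
print(minimal_cardinality([{"A", "B"}, {"C"}, {"D"}]))  # -> [{'C'}, {'D'}]
```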
4 Distributed Diagnosis

This section presents the framework for distributed systems that will be used to describe how condensed diagnoses with minimal cardinality can be calculated.

[Figure 2: Agents, network, components, and diagnosis: two agents A_1 and A_2 connected by a network; A_1 is physically connected to the sensor components A and B, A_2 to C and D, and an output signal from A_2 is an input to A_1.]

4.1 Outputs, Inputs, and Components in Distributed Systems

A distributed system consists of a set of agents A. A local diagnosis is determined by the conflicts in a single agent, while a global diagnosis is determined by all agents' conflicts. Here, the complete set of components is partitioned into private components P ⊆ C and common components G ⊆ C, where P ∩ G = ∅. A private component is only used by one agent, while a common component is used by two or more agents. The set of private components is partitioned into different sets belonging to different agents such that, for two such sets, P_{A_i} ∩ P_{A_j} = ∅, where A_i, A_j ∈ A. The set X_A is the subset of X that is used in agent A.

In addition to components, an agent in a distributed system should also be able to diagnose inputs from other agents. The outputs are values from sensors, to actuators, or from calculations, which are made available to the other agents over the network. The complete set of signals S, a set of inputs IN ⊆ S, and a set of outputs OUT ⊆ S are used here. Each output σ ∈ OUT is connected to a subset of inputs Γ ⊆ IN.

Example 1: Figure 2 shows a typical layout of agents and components. The system consists of two agents, a network, and four sensor components, A to D. The sensors A and B are physically connected to agent A_1, while the sensors C and D are connected to A_2. A diagnosis in agent A_1 could for example include the components A, B, and C, connected with dashed lines. Agent A_1 indirectly diagnoses sensor C through the signal transmitted over the network.

An output might depend on other components, such as sensors, and the information about this relationship is stored in the output's assumptions.

Definition 1 (Assumption) Let s be a signal which is an output from agent A or an input that is connected to the output from agent A. Let the set ass(s) ⊆ P_A ∪ G ∪ IN_A. If s is abnormal if and only if some non-empty subset C' ⊆ ass(s) is abnormal, then ass(s) is the set of assumptions for s.

Each output depends on components and other inputs. This dependency can be propagated to a set consisting only of components.

Definition 2 (Dependency) Let s ∈ S be a signal; then the dependency for s is

dep(s) = (ass(s) \cap C) \cup \bigcup_{t \in ass(s) \cap S} dep(t)
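A direct transcription of Definition 2 follows (our own naming, not the authors' code). Since dep(·) is defined implicitly, the sketch guards against loops among signal assumptions with a visited set, as the text below points out is necessary.

```python
def dep(s, ass, components, visited=None):
    """Definition 2: dep(s) = (ass(s) ∩ C) ∪ U_{t in ass(s) ∩ S} dep(t).
    `ass` maps each signal to its assumption set and `components` is C; the
    `visited` set guards against loops among signal assumptions."""
    visited = visited if visited is not None else set()
    result = set()
    for item in ass.get(s, set()):
        if item in components:
            result.add(item)
        elif item not in visited:  # item is a signal: propagate its dependency
            visited.add(item)
            result |= dep(item, ass, components, visited)
    return result

# Toy usage: output s depends on sensor A and signal t; t depends on sensor B.
ass = {"s": {"A", "t"}, "t": {"B"}}
print(dep("s", ass, components={"A", "B"}))  # -> {'A', 'B'} (order may vary)
```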

[Figure 3: An example of a two agent system: two agents A_1 and A_2 connected by a network, with private components, a common component G, and a signal s.]

Since the dep(·) function is defined implicitly, the possibility of loops has to be considered in an implementation.

Example 2: Continuation of Example 1. The assumption of the output is ass(output) = {C}. The dependency equals the assumption, since the assumption does not include any signal.

4.2 Diagnoses and Conflicts on Components and Inputs

An agent should state diagnoses that include both components and inputs. The diagnoses in Section 3 cannot be used directly, since these only include components. Instead, a component-input diagnosis is defined on the set C ∪ IN.

Definition 3 (Component-input diagnosis) A set D = C' ∪ Γ, with C' ⊆ C and Γ ⊆ IN, is a component-input diagnosis if the set C' ∪ Ĉ, where ∀i ∈ Γ : Ĉ ∩ dep(i) ≠ ∅, is a diagnosis.

A component-input diagnosis will simply be denoted a diagnosis when no misunderstanding is possible. As with diagnoses, conflicts can also be defined on C ∪ IN.

4.3 Condensed Diagnoses Representing Global Diagnoses

It was mentioned in the introduction that the condensed diagnoses with minimal cardinality should be calculated in each agent.

Definition 4 (Condensed diagnosis) Let 𝔻 be a set of global diagnoses. The tuple ⟨D, k⟩, where D ⊆ P_A ∪ G ∪ IN_A and k ∈ ℤ, is a condensed diagnosis in agent A if there exists D' ∈ 𝔻 such that
|D| + k = |D'|,
D ∩ P = D' ∩ P_A,
D ∩ G = D' ∩ G,
D ∩ IN = {i : dep(i) ∩ (D'\D) ≠ ∅, i ∈ IN}.

The condensed diagnosis ⟨D, k⟩ in agent A is a tuple where the set D represents the subset of some global diagnosis, D' in the definition, including the components used by agent A. The variable k represents the components not included in D but included in D'. Interpretation of the different requirements for a condensed diagnosis: |D| + k = |D'| means that the cardinality plus k should equal that of the global diagnosis; D ∩ P = D' ∩ P_A means that D should only include private components used by agent A; D ∩ G = D' ∩ G means that the common components should be included; D ∩ IN = {i : dep(i) ∩ (D'\D) ≠ ∅, i ∈ IN} means that inputs which might be faulty, due to their dependency on some faulty components, should be included in D.

Example 3: Consider the system shown in Figure 3. There exists a signal s whose dependency is dep(s) = {B}, represented by the dotted line. The sets of private components are P_{A_1} = {A} and P_{A_2} = {B, C}. The set of common components is G = {G}. Let {A, B, C, G} be a global diagnosis. A condensed diagnosis in agent A_1 is ⟨{A, G, s}, 1⟩. Component A is included since it is a private component in A_1, G since it is a common component, and s since it depends on the faulty component B. Component C is represented by k = 1, since it does not affect agent A_1.

The minimal cardinality condensed diagnoses are the condensed diagnoses where each one is a subset of some minimal cardinality global diagnosis, i.e., in Definition 4, the set of global diagnoses 𝔻 is the set of minimal cardinality global diagnoses. The objective of the algorithm described in the next section is to calculate the set of minimal cardinality condensed diagnoses in each agent.
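Definition 4 can be checked mechanically against a single global diagnosis; the sketch below is a literal transcription of its four conditions under our own naming, with the toy data of Example 3, not the synchronization algorithm of the next section.

```python
def is_condensed(D, k, D_glob, P, P_A, G, IN, dep):
    """Check the four conditions of Definition 4 for a candidate <D, k>
    against one global diagnosis D_glob. All arguments are sets except k;
    `dep` maps an input to its dependency set."""
    return (len(D) + k == len(D_glob)                 # |D| + k = |D'|
            and D & P == D_glob & P_A                 # private part
            and D & G == D_glob & G                   # common part
            and D & IN == {i for i in IN if dep[i] & (D_glob - D)})

# Example 3: global diagnosis {A, B, C, G}, candidate <{A, G, s}, 1>.
print(is_condensed({"A", "G", "s"}, 1, {"A", "B", "C", "G"},
                   P={"A", "B", "C"}, P_A={"A"}, G={"G"}, IN={"s"},
                   dep={"s": {"B"}}))  # -> True
```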
5 Algorithm for Calculating the Minimal Cardinality Condensed Diagnoses

The main idea of the algorithm presented in this section is that each agent first finds and then transmits the subset of its minimal local diagnoses that other agents might be interested in. Each agent then receives the transmitted diagnoses and merges them with its own set of minimal local diagnoses, resulting in the minimal cardinality condensed diagnoses. The transmitting part is described in Section 5.2, the receiving and merging in Section 5.3, and finally the main algorithm in Section 5.4. In the algorithms, D is some diagnosis, Γ ⊆ IN, Ω ⊆ OUT, P' ⊆ P, and G' ⊆ G.

5.1 Outputs Dependent on Inputs

The algorithm, as written, requires that ass(s) ⊆ C, i.e. that an output's value does not depend on any signal. This can be fulfilled for a general system, where ass(s) ⊆ C ∪ IN, by replacing the assumptions with the corresponding dependencies, such that ass(s) := dep(s) ⊆ C.

5.2 Transmit Diagnoses

The first step is to find and transmit the subset of minimal local diagnoses that is of interest to the other agents. A minimal local diagnosis should be transmitted if it includes common components, inputs, or components that outputs depend on, since these might affect other agents. For each diagnosis that should be transmitted, the private components can be removed, since these do not directly affect other agents. If some outputs depend on any of the removed private components, these outputs are instead added to the diagnosis. The minimal local diagnoses that are not transmitted on the network can be represented by a variable n ∈ ℕ, which is the minimal cardinality of the non-transmitted local diagnoses. The agents receiving the diagnoses will then be aware that there exist one or more non-transmitted local diagnoses with cardinality n.

Algorithm 1 Transmit diagnoses transmit(A, m).
Require: A set of minimal local diagnoses D_A, limit m.
Ensure: Set TX broadcast on the network.
1: D_TX := {D ∈ D_A : |D| ≤ m}
2: D_TX := {D ∈ D_TX : D ∩ (IN ∪ G ∪ (∪_{σ∈OUT_A} ass(σ))) ≠ ∅}
3: n := min_{D ∈ D_A \ D_TX} |D|
4: TX := {⟨D', k⟩ : D ∈ D_TX, D = P' ∪ G' ∪ Γ, D' = G' ∪ Γ ∪ Ω, Ω = {σ ∈ OUT_A : ass(σ) ∩ P' ≠ ∅}, k = |P'| − |Ω|}
5: if D_TX ≠ D_A then
6:   TX := TX ∪ {⟨∅, n⟩}
7: end if
8: Broadcast TX on the network.

Algorithm 1 performs the steps described above. Since the minimal cardinality condensed diagnoses are searched for, the algorithm accepts a maximum limit m on the cardinality of the local diagnoses to be transmitted. Row 2 decides which diagnoses should be transmitted. Row 4 constructs a tuple including the diagnosis without private components and a variable k representing the removed private components.

Example 4: Consider the system shown in Figure 4, where objects connected to the agents with solid lines are included in some minimal local diagnosis, while those connected with dashed lines are not. The sets of private components are P_{A_1} = {A}, P_{A_2} = {B, C, D}, and P_{A_3} = {E}. The set of common components is G = {G}. There exist two signals with assumptions ass(s_1) = {B} and ass(s_2) = {E}. Assume that the following set of minimal local diagnoses has been calculated in agent A_2: D_{A_2} = {{C, s_2}, {B, C}, {G, C}, {D}}. Using Algorithm 1 with m ≥ 2, the set D_TX = {{C, s_2}, {B, C}, {G, C}} is first calculated. The variable n = 1, i.e. the cardinality of {D}. The transmitted set of tuples is TX = {⟨{s_2}, 1⟩, ⟨{s_1}, 1⟩, ⟨{G}, 1⟩, ⟨∅, 1⟩}, where the private components have been removed. This set will represent D_{A_2} in agents A_1 and A_3.

5.3 Receive and Merge Diagnoses

The second step is to receive the transmitted sets, transform them into an appropriate form, and then calculate the minimal cardinality condensed diagnoses. If a received diagnosis includes a signal s that is an output from the receiving agent, then the receiver knows which components s depends on, and s is therefore replaced with the assumption ass(s). After the replacement, the minimal local diagnoses and the received diagnoses can be merged to form a set of condensed diagnoses. If a condensed diagnosis includes several inputs, and several of these inputs depend on the same component, then the cardinality of the condensed diagnosis will not be correct. Consider for example two signals s_1 and s_2 depending on component A. The cardinality of ⟨{s_1, s_2}, 0⟩ is two, while the cardinality of the corresponding global diagnosis {A} is one. The condensed diagnosis should be ⟨{s_1, s_2}, −1⟩, where the minus one is a compensation.

[Figure 4: An example of a three agent system: agents A_1, A_2, and A_3 connected by a network; private components A (A_1), B, C, D (A_2), and E (A_3); common component G; signals s_1 and s_2.]

Algorithm 2 performs the steps described above. The algorithm transforms the received sets of tuples by replacing inputs that are outputs from the current agent with the components in the outputs' assumptions (rows 1-3). The function MHS(M) calculates the minimal hitting sets for the collection of sets M; for example, MHS({{A, B}, {B, C}}) = {{A, C}, {B}}. To be able to calculate the compensation discussed above, the set of inputs is partitioned into the set Γ̃_j, which contains the outputs that were added to the diagnosis when A_j transmitted the tuple, and the set Γ̄_j, which contains the inputs to agent A_j without the outputs from A_i. In row 5, the received sets of tuples are merged.
In row 6, the condensed diagnoses are compensated for signals depending on the same components. Finally, in row 7, the condensed diagnoses which do not have minimal cardinality are removed.

Algorithm 2 Calculate the minimal cardinality condensed diagnoses condense(A_i).
Require: Received sets TX_{A_j} from all agents A_{j≠i}, as a result of evaluating transmit(A_j, m). Set of minimal local diagnoses D_{A_i}.
Ensure: Set of minimal cardinality condensed diagnoses D_{A_i}^s.
1: for all j ≠ i do
2:   RX_{A_j} := {⟨P̄ ∪ Ḡ ∪ G' ∪ Γ̄ ∪ Γ̃, k⟩ : ⟨G' ∪ Γ ∪ Ω, k⟩ ∈ TX_{A_j}, Γ̃ = Ω, Ω̂ = Γ ∩ OUT_{A_i}, Γ̄ = Γ \ Ω̂, H ∈ MHS(∪_{σ∈Ω̂} {ass(σ)}), P̄ = H ∩ P, Ḡ = H ∩ G}
3: end for
4: RX_{A_i} := {⟨D, 0⟩ : D ∈ D_{A_i}}
5: D_{A_i}^s := {⟨H, k⟩ : ⟨D_j, k_j⟩ ∈ RX_{A_j}, H = ∪_j D_j, k = Σ_j k_j}
6: D_{A_i}^s := {⟨D, k + comp(D)⟩ : ⟨D, k⟩ ∈ D_{A_i}^s}
7: D_{A_i}^s := {⟨D, k⟩ ∈ D_{A_i}^s : |D| + k = min_{⟨D',k'⟩ ∈ D_{A_i}^s} (|D'| + k')}

The value of k + comp(D) should be the difference between the cardinality of D and a corresponding minimal cardinality global diagnosis.
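Algorithms 2 and 3 both rely on the MHS(·) function. A brute-force sketch (our own naming; exponential, but adequate for sets of the size used in the examples) reproducing the example MHS({{A, B}, {B, C}}) = {{A, C}, {B}}:

```python
from itertools import combinations

def mhs(collection):
    """Minimal hitting sets of a collection of sets, by brute force: enumerate
    candidates smallest-first and keep those that hit every set and contain no
    smaller hitting set already kept."""
    universe = sorted(set().union(*collection))
    hitting = []
    for size in range(1, len(universe) + 1):
        for cand in combinations(universe, size):
            c = set(cand)
            if all(c & s for s in collection) and not any(h <= c for h in hitting):
                hitting.append(c)
    return hitting

# The example from Section 5.3 (set order may vary):
print(mhs([{"A", "B"}, {"B", "C"}]))  # -> [{'B'}, {'A', 'C'}]
```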

Algorithm 3 Function comp(D).
Input: Diagnosis D. Output: Variable k.
1: Each diagnosis is constructed such that D = (P_i ∪ G_i ∪ Γ_i) ∪ ∪_{j≠i} (P̄_j ∪ G_j ∪ Ḡ_j ∪ Γ̄_j ∪ Γ̃_j)
2: Z^j := (Γ_i ∪ ∪_{k≠i} Γ̄_k) ∩ OUT_{A_j}
3: k^j_21 := min_{H ∈ MHS(∪_{σ∈Z^j} {ass(σ)})} |H ∩ P| − |Z^j|
4: k^j_22 := min_{H ∈ MHS(∪_{σ∈Γ̃_j ∪ Z^j} {ass(σ)})} |P ∩ H| − |Γ̃_j ∪ Z^j|
5: k_2 := Σ_j (k^j_21 + k^j_22)
6: Z^G := G_i ∪ ∪_{j≠i} (G_j ∪ Ḡ_j)
7: k_3 := min_{H ∈ MHS(∪_{σ ∈ ∪_{j≠i} Z^j} {ass(σ)})} |G ∩ H| − |Z^G ∩ H|
8: k := k_2 + k_3

The variable k is calculated in Algorithm 1, while the function comp(D) is given by Algorithm 3. In the algorithm, k_2 is the compensation for signals depending on the same private components, while k_3 is the compensation for signals depending on the same common components. The following example illustrates comp(D).

Example 5: Consider an agent A_2 with outputs s_1 and s_2, private components P_{A_2} = {A}, and assumptions ass(s_1) = {A} and ass(s_2) = {A}. Let a minimal local diagnosis in agent A_1 be D = {s_1, s_2}, and let A_2 have an empty set of minimal local diagnoses. The minimal cardinality condensed diagnosis in A_2, before evaluating comp(D), is in this case ⟨{s_1, s_2}, 0⟩. A minimal cardinality global diagnosis is {A}, and the cardinality of the minimal cardinality condensed diagnosis is therefore not correct: |{s_1, s_2}| + 0 ≠ |{A}|. Using comp(D), row 1 gives that Γ̃_2 = {s_1, s_2}. Row 3 gives that k^2_21 = |{A}| − |{s_1, s_2}| = −1. The result is the minimal cardinality condensed diagnosis ⟨{s_1, s_2}, −1⟩, which has the correct cardinality.

Assume now that agent A_2 has the minimal local diagnosis {A}, and agent A_1 has an empty set of minimal local diagnoses. Agent A_2 transmits the set {⟨{s_1, s_2}, −1⟩}, which is the set of minimal cardinality condensed diagnoses in A_1. Using comp(D), the set Γ̃_2 = {s_1, s_2} and the result is that k_2 = 0. In the second example, the compensation was done in the transmitting agent, while in the first example, the compensation had to be done in the receiving agent.

How computationally difficult is the comp(D) function? Consider the special case where ass(σ) ⊆ P_{A_i} and ass(σ_k) ∩ ass(σ_l) = ∅, which means that a signal only depends on private components and that no two signals depend on the same component. In this simplified case, comp(D) = 0. If ass(σ) ⊆ P, i.e. a signal only depends on private components, then the variable k_3 = 0 while k_2 might be some non-zero value. The more connections that exist between the signals' dependencies, and the more common components that exist, the more computationally complex comp(D) will be.

The following lemma shows how the minimal cardinality condensed diagnoses can be calculated.

Lemma 1 Let D^mc be the set of minimal cardinality global diagnoses, and let transmit(A_j, ·) be evaluated for all agents A_j ∈ A. Let D_{A_i}^s be the result after evaluating Algorithm 2 in agent A_i. Then D_{A_i}^s is the set of minimal cardinality condensed diagnoses.

Proof The complete proof is given in [Biteus et al., 2006]. Outline of the proof: a diagnosis ⟨D^e, k⟩ ∈ D_{A_i}^s is by construction a condensed diagnosis if |D^e| + k = |D^g|, where e refers to condensed and g to global.
First, show the cardinality of D^g ∈ D^mc; second, show the cardinality of D^e in ⟨D^e, k⟩ ∈ D_{A_i}^s; finally, show that the cardinalities are equal. It is found that |D^e| + X = |D^g|, where X = Σ_{j≠i} (k^j_1 + k^j_2) + k_3, and k^j_1 is identified as the variable k in Algorithm 1 in agent A_j. The variables k^j_2 and k_3 are identified as the corresponding variables in Algorithm 3.

Example 6: Continuation of Example 4. Assume that agent A_2 has received the following sets of tuples from agents A_1 and A_3:
TX_{A_1} = {⟨{s_1}, 0⟩, ⟨∅, 1⟩}
TX_{A_3} = {⟨{s_2}, 0⟩}
Using Algorithm 2, the sets
RX_{A_1} = {⟨{B}, 0⟩, ⟨∅, 1⟩}
RX_{A_3} = {⟨{s_2}, 0⟩}
are calculated. Assume that agent A_2 has the following set of minimal local diagnoses: D_{A_2} = {{C, s_2}, {B, C}, {G, C}, {D}}, which is transformed by the algorithm into the set RX_{A_2} = {⟨{C, s_2}, 0⟩, ⟨{B, C}, 0⟩, ⟨{G, C}, 0⟩, ⟨{D}, 0⟩}. The RX sets are merged, resulting in the set D_{A_2}^s = {⟨{B, C, s_2}, 0⟩, ⟨{s_2, C}, 1⟩, ⟨{s_2, D}, 1⟩}, which is the set of minimal cardinality condensed diagnoses.

5.4 Main Algorithm

Lemma 1 shows that the sets of minimal cardinality condensed diagnoses can be calculated with Algorithms 1 to 3. However, it is possible to reduce the computational burden by using the cardinality limit m in Algorithm 1.

Algorithm 4 Main algorithm.
Require: Set of minimal local diagnoses D_A in all agents.
Result: Set of minimal cardinality condensed diagnoses.
1: Decide with voting: m_1 = max_{A∈A} min_{D∈D_A} |D|
2: ∀A ∈ A, evaluate transmit(A, m_1)
3: ∀A ∈ A, evaluate condense(A)
4: Decide with voting: m_2 = max_{A∈A} min_{D∈D_A^s} |D|
5: ∀A ∈ A, evaluate transmit(A, m_2)
6: ∀A ∈ A, evaluate condense(A)

Algorithm 4 first calculates a lower bound m_1 on the cardinality of the minimal cardinality global diagnoses. The minimal cardinality condensed diagnoses are then calculated with this lower bound as input to transmit(·). The algorithm then computes an upper bound m_2 on the cardinality of the global diagnoses with minimal cardinality.

[Figure 5: A system including relatively few signals compared to the number of components: three agents A_1, A_2, and A_3 with private components, a common component G, and a single signal s.]

Since m_2 is the cardinality of a global diagnosis, it is known that the cardinality of a minimal cardinality global diagnosis is less than or equal to m_2. The local diagnoses with cardinality greater than m_2 can therefore not be part of a minimal cardinality global diagnosis, and can therefore be ignored. The result is described in Theorem 1. The result after evaluating Algorithm 4 is that all agents have a set of minimal cardinality condensed diagnoses D_A^s. The reason for using Algorithm 4 is that it is, in most cases, more efficient than using the algorithms as described in Lemma 1. Even though Algorithm 4 is written as two separate parts, the result of part one should, in an implementation, be used when calculating part two. The correctness of the algorithm is shown in Theorem 1.

Theorem 1 Same assumptions as in Lemma 1, but let D_{A_i}^s be the result after evaluating Algorithm 4 in agent A_i. Then the set D_{A_i}^s is the set of minimal cardinality condensed diagnoses.

Proof Follows from Lemma 1.

6 Example using the Algorithms

Two examples will be studied in this section.

Example 7: Consider the system shown in Figure 5. It includes three agents with the sets of private components P_{A_1} = {A, H, I}, P_{A_2} = {B, C, D}, and P_{A_3} = {E, F, J}, the set of common components G = {G}, and the set of signals {s}. The following sets of conflicts have been detected: Π_{A_1} = {{A, H, G}, {A, I, G}}, Π_{A_2} = {{C}, {D}}, and Π_{A_3} = {{F, J}}. The sets of minimal local diagnoses are calculated from the conflicts, resulting in the sets
D_{A_1} = {{A}, {G}, {H, I}}
D_{A_2} = {{C, D}}
D_{A_3} = {{F}, {J}}
The following sets of tuples are transmitted to agent A_1 from A_2 and A_3:
TX_{A_2} = {⟨∅, 2⟩}
TX_{A_3} = {⟨∅, 1⟩}
The received sets in agent A_1 are
RX_{A_1} = {⟨{H, I}, 0⟩, ⟨{A}, 0⟩, ⟨{G}, 0⟩}
RX_{A_2} = {⟨∅, 2⟩}
RX_{A_3} = {⟨∅, 1⟩}
The system includes three agents with the sets of private components P A1 = {A}, P A2 = {B, C, D}, P A1 = {E, F }, set of common components G = {G}, and the set of signals {s 1, s 2, s 3 }. The assumptions are ass(s 1 ) = {B, C}, ass(s 2 ) = {C, D}, ass(s 3 ) = E. The following sets of conflicts have been detected, Π A1 = {{s 1, A, G}, {s 3, A, G}}, Π A2 = {{s 3, D}}, and Π A3 = {{E, F }}. The minimal local diagnoses is calculated from the object conflicts resulting in the sets D A1 = {{s 1, s 3 }, {A}, {G}} D A2 = {{s 3 }, {D}} D A3 = {{E}, {F }}. The following sets of tuples are transmitted to agent A 1 from A 2 and A 3. TX A2 = { {s 3 }, 0, {s 2 }, 0 } TX A3 = { {s 3 }, 0,, 1 } DX'06 - Peñaranda de Duero, Burgos (Spain) 29

The received sets in agent A_1 are
RX_{A_1} = {⟨{s_1, s_3}, 0⟩, ⟨{A}, 0⟩, ⟨{G}, 0⟩}
RX_{A_2} = {⟨{s_3}, 0⟩, ⟨{s_2}, 0⟩}
RX_{A_3} = {⟨{s_3}, 0⟩, ⟨∅, 1⟩}
The resulting set of minimal cardinality condensed diagnoses is
D_{A_1}^s = {⟨{s_1, s_3}, 0⟩, ⟨{s_1, s_2, s_3}, −1⟩, ⟨{A, s_3}, 0⟩, ⟨{G, s_3}, 0⟩}
where the −1 in the second condensed diagnosis means that the true cardinality is one less than the cardinality of the set {s_1, s_2, s_3}. To be able to verify the correctness of the minimal cardinality condensed diagnoses, the minimal cardinality global diagnoses are calculated. The set of conflicts, not including signals, is Π = {{B, C, A, G}, {E, A, G}, {E, D}, {E, F}}, and the set of minimal cardinality global diagnoses is D^mc = {{B, E}, {A, E}, {C, E}, {G, E}}. Are the minimal cardinality condensed diagnoses correct? Consider the condensed diagnosis D = ⟨{s_1, s_3}, 0⟩ and the minimal cardinality global diagnosis D' = {B, E}. Using Definition 4 to verify that D is a condensed diagnosis: |D| + 0 = |D'|, D ∩ P = D' ∩ P_{A_1} = ∅, and D ∩ G = D' ∩ G = ∅. Further, D ∩ IN = {i : dep(i) ∩ (D'\D) ≠ ∅, i ∈ IN} = {s_1, s_3}, since ass(s_1) ∩ D' ≠ ∅ and ass(s_3) ∩ D' ≠ ∅. This shows that D is a condensed diagnosis, and since D' is a minimal cardinality global diagnosis, D is a minimal cardinality condensed diagnosis.

As can be seen in the above example, quite a few calculations had to be performed compared to the calculation of the minimal cardinality global diagnoses. If a system has a high proportion of components used by several agents, the minimal cardinality condensed diagnoses will include relatively many components. The reduction of the size and the number of diagnoses will in this case be limited, and the efficiency of the algorithm reduced, as was seen in the example. Considering automotive systems, the ECUs typically have a large number of private components compared to both the number of inputs and the number of common components. The algorithm is therefore well suited to these systems.

7 Conclusions

The objective when designing the algorithm described in Section 5 was to obtain a diagnostic algorithm that requires low processing power, low memory usage, and low network load, and that results in a transparent, scalable, and interoperable distributed system, see Section 2. An algorithm has been presented in Section 5 that synchronizes the minimal local diagnoses in a distributed system. The result is a set of minimal cardinality condensed diagnoses, where each minimal cardinality condensed diagnosis is a subset of some minimal cardinality global diagnosis, see Theorem 1 and Lemma 1. The minimal cardinality condensed diagnoses only include components that are used by the specific agent and are therefore a local condensed representation of the minimal cardinality global diagnoses.

A diagnostic system using the algorithm presented in this paper is transparent, since the loss of an agent would only mean that this agent does not transmit its minimal local diagnoses on the network. It is scalable, since it is directly possible to add new ECUs to the network. The algorithm requires low processing load and memory usage, since unwanted private components have been removed from the condensed diagnoses with minimal cardinality.

References

[Biteus et al., 2005] J. Biteus, M. Jensen, and M. Nyberg. Distributed diagnosis for embedded systems in automotive vehicles. In Proceedings of IFAC World Congress 06, Prague, Czech Republic.
[Biteus et al., 2006] Jonas Biteus, Erik Frisk, and Mattias Nyberg.
Condensed representation of global diagnoses with minimal cardinality in local diagnoses extended version. Technical report, Dept. of Electrical Engineering, Linköpings Universitet, Sweden, To be published. [de Kleer et al., 1992] J. de Kleer, A. K. Mackworth, and R. Reiter. Characterizing diagnoses and systems. Artificial Intelligence, 56, [de Kleer, 1990] J. de Kleer. Using crude probability estimates to guide diagnosis. Artificial Intelligence, 45: , [de Kleer, 1991] J. de Kleer. Focusing on probable diagnoses. In Proceedings of 9th National Conf. on Artificial Intelligence, Anaheim, U.S.A., [Hayes, 1999] C. C. Hayes. Agents in a nutshell-a very brief introduction. Knowledge and Data Engineering, IEEE Transactions on, 11(1): , Jan/Feb [Köppen-Seliger et al., 2003] B. Köppen-Seliger, T. Marcu, M. Capobianco, S. Gentil, M. Albert, and S. Latzel. MAGIC: An integrated approach for diagnostic data management and operator support. In Proceedings of IFAC Safeprocess 03, Washington, U.S.A., [Provan, 2002] G. Provan. A model-based diagnosis framework for distributed systems. In 13th International Workshop on Principles of Diagnosis, Semmering, Austria, May [Reiter, 1987] R. Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57 95, Apr [Roos et al., 2002] N. Roos, A. ten Teije, A. Bos, and C. Witteveen. An analysis of multi-agent diagnosis. In Proceedings of the Conference on Autonomous Agents and Mult-Agent Systems, Bologna, Italy, Jul [Roos et al., 2003] N. Roos, A. ten Teije, and C. Witteveen. A protocol for multi-agent diagnosis with spatially distributed knowledge. In Proceedings of 2nd Conference on Autonomous Agents and Mult-Agent Systems, Australia, Jul [Tanenbaum and van Steen, 2002] Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems. Prentice Hall, [Weiss, 1999] G. Weiss, editor. Multiagent systems : a modern approach to distributed artificial intelligence. MIT Press, Cambridge, Mass., U.S.A, [Wotawa, 2001] F. Wotawa. A variant of Reiter s hitting-set algorithm. Information Processing Letters, 79, DX'06 - Peñaranda de Duero, Burgos (Spain)

Focusing fault localization in model-based diagnosis with case-based reasoning

Aníbal Bregón, Belarmino Pulido, M. Aránzazu Simón, Isaac Moro, Oscar Prieto, Juan J. Rodríguez Diez, Carlos Alonso-González
Intelligent Systems Group (GSI), Department of Computer Science, E.T.S.I. Informática, University of Valladolid, Valladolid, Spain
Department of Civil Engineering, University of Burgos, Burgos, Spain

Abstract

Consistency-based diagnosis automatically provides fault detection and localization capabilities, using just models of correct behavior. However, it may exhibit a lack of discrimination power. Knowledge about fault modes can be added to tackle the problem. Unfortunately, it brings additional complexity issues, since it becomes necessary to discriminate among a maximum of K^M mode assignments, for M components and K possible fault modes per component. Usually, some kind of heuristic information is included in the diagnosis process to focus the model-based diagnostician. In this work we study the combination of a consistency-based diagnosis system with a case-based reasoning system. The consistency-based diagnosis system performs fault detection and localization. The CBR system provides an accurate indication of the most probable fault mode at early stages of the localization process.

Keywords: Fault Diagnosis, Model-based Diagnosis, Case-Based Reasoning

1 Introduction

Consistency-based diagnosis is the most used approach to model-based diagnosis in the Artificial Intelligence community, also known as the DX approach. One of its main advantages is that it just requires correct-behavior models to perform fault detection and localization [Hamscher et al., 1992]. However, this feature may lead to the "every component involved in the conflict" syndrome [Dressler, 1996]: we just know that some component has failed, but there is almost no discriminative power. This is especially true when diagnosing dynamic systems with scarce observability. Even more, pure consistency-based diagnosis can lead to logically sound, but physically impossible, diagnoses [Dressler and Struss, 1996]. Usually, to overcome this drawback, knowledge about fault modes is introduced. This can be done in several different ways. For instance, we could rely upon an abductive approach to model-based diagnosis [Poole, 1989], in which knowledge about faults is used to explain observations. However, that approach can provide inconsistent results from a logical point of view. Since we want to retain logical soundness in our results, we have explored only approaches within the pure consistency-based setting: fault modes are rejected if the faulty behavior estimation is not consistent with observations, and just the consistent fault models remain. Even within that setting, knowledge about fault modes can be introduced in different, and complementary, ways:

Non-predictive approaches have little knowledge about fault modes. Two examples are physical impossibility [Friedrich et al., 1990], which just describes impossible physical behavior, and non-intermittency [Raiman et al., 1991], which takes advantage of information about non-intermittent faults.

Predictive approaches use models of fault modes to estimate faulty behavior, as in Sherlock [de Kleer and Williams, 1989] or GDE+ [Struss and Dressler, 1989]. Based on such estimation, non-consistent fault modes are rejected. In these approaches, a fault mode can only be confirmed if every other fault mode has been rejected and there is no unknown fault mode.
This is the selected approach. Nevertheless, the increase in discriminative power has a price. Diagnosis must discriminate among 2^N behavioral mode assignments when just the correct mode, ok(·), and the incorrect mode, ¬ok(·), are present for N components. When M behavioral modes are allowed, diagnosis must discriminate among M^N mode assignments. This is the problem faced by any model-based diagnosis proposal that attempts fault identification [Dressler and Struss, 1996]. Since the pure approach is infeasible in real systems for practical reasons, many approaches have been proposed in recent years to deal with the complexity issue. However, to the best of our knowledge, there is no general architecture suitable for any kind of system. In fact, many approaches just perform fault detection and localization, or rely upon some kind of heuristic which helps focusing the diagnosis task. This will also be our approach. In the recent past, we first proposed a consistency-based diagnosis architecture which combined model-based and expert reasoning for the diagnosis of industrial continuous systems [Pulido et al., 2001]. In that proposal, we used expert knowledge to derive a classification tree containing temporal information [Pulido, 2001].

Later on, we were able to induce a set of rule-based classifiers through machine-learning techniques [Pulido et al., 2005]. In this article we introduce preliminary work on introducing case-based reasoning within the proposed architecture.

The organization of this paper is as follows. First, we introduce the compilation technique used to perform consistency-based diagnosis, which is the basis for our model-based diagnosis system. Then, we provide a brief summary of case-based reasoning and introduce the developed case-based reasoning system. Afterwards, we explain how the CBR system has been included in our diagnosis architecture, and show some results on a case study plant. Finally, we discuss the results and draw some conclusions.

2 Consistency-based diagnosis using possible conflicts

Model-based diagnosis has been traditionally associated with the Control Theory approach, usually known as the FDI approach. Recently, there is a strong ongoing research effort to provide a common framework for both the DX and FDI approaches, defined as the BRIDGE framework [Biswas et al., 2004]. In this context, the work by Cordier et al. [Cordier et al., 2004] has established a common theoretical ground where proposals from the FDI and DX communities can be compared, clearly stating the underlying hypotheses. Our proposal fits in this common framework. The computation of possible conflicts is a compilation technique which, under certain assumptions, is equivalent to on-line conflict calculation in GDE, and to off-line generation of ARRs in the Control Theory approach to diagnosis [Pulido and Alonso, 2000; Pulido and Alonso González, 2004]. We include a brief summary of this approach for the sake of self-containment.

The main idea behind the possible conflict concept is that the set of subsystems capable of generating a conflict can be identified off-line (in FDI terms, a possible conflict would represent an ARR which could be used for fault detection and isolation). This identification can be done in three steps. The first one generates an abstract representation of the system as a hypergraph. In this representation there is only information about the constraints in the models and their relationship to known and unknown variables in those models. The second step looks for minimal over-constrained sets of relations, which are essential for model-based diagnosis. These subsystems, called minimal evaluation chains, represent a necessary condition for a conflict to exist. Each minimal evaluation chain, which is a partial sub-hypergraph of the original system description, needs to be solved using local propagation criteria alone. In the third step, extra knowledge is added to fulfill that requirement: we specify each possible way a constraint can be solved by means of local propagation. As a consequence, each minimal evaluation chain generates a directed and-or graph. In each and-or graph, a search is conducted for every possible way the system can be solved using local propagation. Each possible way is called a minimal evaluation model, and it can predict the behavior of a part of the whole system. Moreover, since conflicts will arise only when models are evaluated with available observations, the set of constraints in a minimal evaluation model is called a possible conflict. Those models can be used to perform fault detection.
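To make the off-line step concrete, the sketch below is a deliberately naive structural stand-in (ours, not the paper's hypergraph and and-or graph procedure): each constraint is abstracted to the set of unknown variables it relates, and minimal constraint subsets with more constraints than unknowns are enumerated as rough candidates for minimal evaluation chains.

    from itertools import combinations

    def possible_conflict_candidates(constraints):
        """constraints: dict name -> set of unknown variables it relates.
        Returns minimal constraint subsets with more constraints than
        unknowns (a rough structural stand-in for evaluation chains)."""
        names = sorted(constraints)
        found = []
        for size in range(1, len(names) + 1):
            for subset in combinations(names, size):
                unknowns = set().union(*(constraints[c] for c in subset))
                if len(subset) > len(unknowns) and \
                   not any(set(prev) <= set(subset) for prev in found):
                    found.append(subset)
        return found

    # Toy model: three constraints over two unknowns x1 and x2.
    toy = {'c1': {'x1'}, 'c2': {'x1', 'x2'}, 'c3': {'x2'}}
    print(possible_conflict_candidates(toy))  # -> [('c1', 'c2', 'c3')]

The actual technique additionally requires each candidate to be solvable by local propagation alone, which this sketch ignores.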
If there is a discrepancy between the predictions from those models and the current observations, the possible conflict would be responsible for such a discrepancy and should be confirmed as a real conflict. Afterwards, diagnosis candidates are obtained from conflicts following Reiter's theory (in DX terminology, candidates are faulty components). A detailed description of consistency-based diagnosis with possible conflicts can be found in [Pulido and Alonso, 2000; Pulido et al., 2001].

3 Case-Based Reasoning

3.1 Case-Based Reasoning fundamentals

Case-Based Reasoning (CBR) is a way to solve problems by remembering past similar situations and reusing the information and knowledge about those situations [Kolodner, 1993]. CBR uses the information stored in a case base to infer the solution for new problems. CBR proposes a four-step cycle, which Aamodt and Plaza [Aamodt and Plaza, 1994] describe as: retrieve, reuse, revise, and retain. The first task in the CBR cycle is to retrieve one or more similar cases from the case library where former experiences are stored. Hence, it is necessary to have a retrieval algorithm and a similarity measure that will be used to bring back a set of similar cases. Aamodt and Plaza describe the reuse task focusing on two aspects: first, the differences between the past and the current case; second, what part of the retrieved case can be transferred to the new case. In some cases the reuse task consists of copying the past solution to the new case, but in other cases this solution cannot be directly applied and has to be adapted. In the revision task, the solution for the new problem has to be tested. Retainment is the last task in the CBR cycle: the new case and the solution for this case, obtained in the reuse stage, are stored in order to be used in the future.

3.2 CBR for diagnosis

CBR has been used for diagnosis tasks in different domains ([Lenz et al., 1998] and [Price, 1999] provide several examples). More precisely, we want to do diagnosis as classification of time series (as in [Colomer et al., 2003]). Moreover, since we rely upon a model-based diagnosis system for fault detection and localization, we are just concerned with an early diagnosis problem. In this case, early diagnosis of temporal data series means that we must do early fault identification, i.e., with incomplete data. When symptoms exhibit different dynamics, they will manifest at different times.
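A schematic rendering of the four-step cycle in code may help; the class and method names below are ours, and the revise step is deliberately left to the caller (in this paper, revision amounts to checking the proposed fault label against the consistency-based results):

    class CaseBase:
        """Minimal retrieve-reuse-retain skeleton (illustrative only)."""
        def __init__(self, distance):
            self.cases = []           # list of (series, fault_label) pairs
            self.distance = distance  # dissimilarity between two series

        def retrieve(self, query, k=1):
            # Bring back the k stored cases most similar to the query.
            ranked = sorted(self.cases,
                            key=lambda case: self.distance(query, case[0]))
            return ranked[:k]

        def reuse(self, neighbors):
            # Adapt by majority vote among the retrieved neighbors' labels.
            labels = [label for _, label in neighbors]
            return max(set(labels), key=labels.count)

        def retain(self, series, revised_label):
            # Store the revised case for future problem solving.
            self.cases.append((series, revised_label))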

Hence, faults can only be completely identified when the whole series is available. Consequently, with incomplete data, different faults can be consistent with the current observations. In this context, CBR must provide a set of feasible faults, just after fault detection, according to the available observations. Since we want to test a CBR system just as a complementary tool to model-based diagnosis, we have decided to develop our own tool. As in any CBR system, we looked for a case structure which appropriately stores knowledge from past experiences and facilitates each step in the CBR cycle. In our system we must study temporal data series to find the actual fault. Therefore, our cases are made up of the set of available measurements over a period of time, and a label for the class of the fault. The source of those data will be described in Section 4.

System architecture

CBR is mainly intended for those domains where there is neither explicit knowledge nor a reasoning model, just past experiences and their related solutions. To test the approach we rely upon simulated faults. The architecture of the system can be seen in Figure 1. To perform an appropriate classification we have a configurable tool that simulates the different faults that could happen in the industrial plant and stores this information in a simulation experiment database. From each experiment we build a WEKA file [Witten and Frank, 2005]; WEKA is widely used in problem solving in AI. Using these WEKA files, and the class of the fault, a case is built.

Figure 1: Architecture of the CBR system.

Most CBR systems use a retrieving mechanism based on local distance, and most of them use the K-Nearest Neighbors algorithm. Applying this algorithm we also need some kind of similarity measure. In this work, we have chosen the K-Nearest Neighbors algorithm as the retrieval algorithm and three different similarity measures: Euclidean distance, Manhattan distance, and Dynamic Time Warping (DTW). DTW [Bellman, 1957; Keogh and Ratanamahatana, 2005] is a technique that makes it possible to obtain a more robust dissimilarity measure between two sequences with different lengths which are not exactly aligned in the time axis. In the considered application the time series are multivariate, that is, we have multi-dimensional series. In order to use DTW with this kind of data, we apply DTW to each variable, and the dissimilarity between two multivariate series is then calculated as the average of the dissimilarities for each variable. In order to work out the distance d(q_i, c_j) between two points q_i and c_j of the series, we use three different kinds of metrics:

Linear: |q_i - c_j|
Quadratic: (q_i - c_j)^2
Valley: 10·(1 - exp(-(q_i - c_j)^2 / 6))

The reuse method used in this work is based on the K-nearest neighbors: we can choose the number of neighbors to be used and adapt the solution of the new case by voting among the solutions of the retrieved neighbors. In order to obtain a better classification, the system applies leaking and standardization algorithms to the data. Leaking does not allow the numerical series to exceed the allowed limits. Through standardization, all the data are brought to mean equal to 0 and variance equal to 1. A detailed description of the CBR system can be found in [Bregón et al., 2006].

4 Case study

The laboratory plant shown in Figure 2 resembles common features of industrial continuous processes.
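The dissimilarity computation just described can be sketched as follows; this is our own rendering (textbook dynamic-programming DTW with a pluggable point metric, averaged over the variables of a multivariate case), with the valley constants taken from the formula above:

    import math

    def dtw(q, c, metric):
        """Classic O(len(q)*len(c)) dynamic-programming DTW."""
        n, m = len(q), len(c)
        D = [[math.inf] * (m + 1) for _ in range(n + 1)]
        D[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = metric(q[i - 1], c[j - 1])
                D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
        return D[n][m]

    linear    = lambda qi, cj: abs(qi - cj)
    quadratic = lambda qi, cj: (qi - cj) ** 2
    valley    = lambda qi, cj: 10.0 * (1.0 - math.exp(-((qi - cj) ** 2) / 6.0))

    def multivariate_dtw(Q, C, metric):
        """Average of per-variable DTW dissimilarities; Q and C are lists of
        series, one per measured variable."""
        return sum(dtw(q, c, metric) for q, c in zip(Q, C)) / len(Q)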
It is made up of four tanks {T1, ..., T4}, five pumps {P1, ..., P5}, and two PID controllers acting on pumps P1 and P5 to keep the levels of T1 and T4 close to the specified set points. To control the temperature in tanks {T2, T3} we use two resistors {R2, R3}, respectively. In this plant we have eleven different measurements: the levels of tanks T1 and T4 {LT01, LT04}, the values of the PID controllers on pumps {P1, P5} {LC01, LC04}, the in-flow of tank T1 {FT01}, the out-flows of tanks {T2, T3, T4} {FT02, FT03, FT04}, and the temperatures in tanks {T2, T3, T4} {TT02, TT03, TT04}. The actions on pumps {P2, P3, P4} and resistors {R2, R3} are also known. This plant can work in different situations. We have defined three working situations, which are commanded through four different operation protocols. In the operation protocol used in this article, resistor R3 is switched off while resistor R2 is on. Also, pumps {P3, P4} are switched off; hence, just flow FT01 is incoming to tank T1. We have used equations common in the simulation of this kind of process:

1. t_dm: mass balance in tank t.
2. t_de: energy balance in tank t.
3. t_fb: flow from tank t through a pump.
4. t_f: flow from tank t through a pipe.
5. r_p: resistor failure.

Based on these equations we have found the set of possible conflicts shown in Table 1. A schematic instance of the first equation class is sketched below.
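A t_dm constraint relates the level derivative of a tank to its net flow; the cross-section parameter and the flow split in the following sketch are our illustrative assumptions, not the paper's actual model:

    def t1_dm(ft01, ft02, ft03, area=1.0):
        """Schematic mass balance for tank T1 (equation class t_dm):
        d(level)/dt = net volume flow / tank cross-section.
        The flow split and `area` are illustrative assumptions."""
        return (ft01 - ft02 - ft03) / area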

Figure 2: Diagram of the laboratory plant.

In Table 1, the second column shows the set of constraints used in each possible conflict, which are minimal with respect to the set of constraints. The third column shows the involved components (the support for an ARR in BRIDGE terminology, according to Cordier et al. [Cordier et al., 2004]). The fourth column indicates the estimated variable for each possible conflict.

       Constraints              Components        Estimate
PC1    t1_dm, t1_fb1, t1_fb2    T1, P1, P2        LT01
PC2    t1_fb1, t2_dm, t2_f      T1, T2, P1        FT02
PC3    t1_fb1, t2_dm, r2_p      T1, P1, T2, R2    TT02
PC4    t1_fb2, t3_dm, t3_f      T1, P2, T3        FT03
PC5    t1_fb2, t3_dm            T1, P2, T3        TT03
PC6    t4_dm                    T4                LT04
PC7    t4_fb                    T4, P5            FT04

Table 1: Possible conflicts found for the laboratory plant; constraints, components, and the estimated variable for each possible conflict.

For the current protocol we have considered the set of fault modes shown in Table 2. We have also included noise in the measurements, but no sensor failure was studied. The possible conflicts related to the fault modes can be seen in the fault signature matrix shown in Table 3. It should be noticed that these are the fault mode classes which can be distinguished during fault identification. In the fault localization stage, the pairs of faults {f1, f2}, {f4, f11}, {f3, f12}, and {f10, f13} cannot be separately isolated.

Class   Component   Description
f1      T1          Small leakage in tank T1
f2      T1          Big leakage in tank T1
f3      T1          Pipe blockage in T1 (left outflow)
f4      T1          Pipe blockage in T1 (right outflow)
f5      T3          Leakage in tank T3
f6      T3          Pipe blockage in T3 (right outflow)
f7      T2          Leakage in tank T2
f8      T2          Pipe blockage in T2 (left outflow)
f9      T4          Leakage in tank T4
f10     T4          Pipe blockage in T4 (right outflow)
f11     P1          Pump failure
f12     P2          Pump failure
f13     P5          Pump failure
f14     R2          Resistor failure in tank T2

Table 2: Fault modes considered.

4.1 Experimental design

The study was made on a data set made up of several examples obtained from simulations of the different classes of faults that could arise in the plant. We have modeled each fault class with a parameter in the [0, 1] range. We have made twenty simulations for each class of fault. Each simulation lasted 900 seconds. We randomly generate the fault magnitude, and its origin in the interval [180, 300] seconds. We also assume that the system is in a stationary state before the fault appears. The data sampling was one sample per second. However, due to the slow dynamics of the plant, we can select one sample every three seconds without losing discrimination capacity.
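This experimental design maps directly onto a dataset-generation loop; in the sketch below, `simulate_fault` stands in for the plant simulator, which the paper does not specify, and all other names are ours:

    import random

    CLASSES = [f'f{i}' for i in range(1, 15)]   # the 14 fault classes
    SIMS_PER_CLASS = 20
    STEP = 3              # keep one sample every three seconds

    def build_dataset(simulate_fault):
        """simulate_fault(fault, magnitude, t_fault) -> 11 series, 1 Hz, 900 s."""
        cases = []
        for fault in CLASSES:
            for _ in range(SIMS_PER_CLASS):
                magnitude = random.random()            # in [0, 1]
                t_fault = random.randint(180, 300)     # fault origin (s)
                series = simulate_fault(fault, magnitude, t_fault)
                # Downsample: 11 series of 900 points -> 11 series of 300.
                case = [s[::STEP] for s in series]
                cases.append((case, fault))
        return cases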

Table 3: PCs and their related fault modes (fault signature matrix: rows PC1-PC7, columns f1-f14).

Since we just have eleven measurements, each simulation provides eleven series of three hundred numeric elements. Moreover, by decreasing the sampling rate we reduce the processing time for DTW (which is quadratic). Hence, each case in the case base was made up of a label for the single fault plus eleven data series, one for each available measurement.

4.2 Results

We conducted several experiments applying all the described techniques for retrieving and reusing. We applied one-dimensional DTW using the different metrics (linear, quadratic and valley), Euclidean distance, and Manhattan distance. Table 4 shows the classification success achieved using one, three and five neighbors for the reuse task. The results have been obtained using stratified cross-validation with 10 equally sized subsets of the dataset. Since we will use the CBR system for early classification, we only show results for 30, 40 and 50% of the data series length. As we stated before, we randomly generate the fault origin in the interval [180, 300] seconds, and the simulation lasted 900 seconds; therefore, we will have at least 20% to 33% of the data at detection time. In the initial stages of fault isolation, the CBR system will be used to focus the consistency-based diagnosis system as described below.

5 Integration proposal

Consistency-based diagnosis automatically provides fault isolation based on fault detection results. Using possible conflicts, consistency-based diagnosis can easily be done without on-line dependency recording. The proposed diagnosis process incrementally generates the set of candidates consistent with the observations. In the off-line stage, we initially analyze the system and find every possible conflict, pc_i. Then, we build an executable model, SD_pci, for each pc_i. In the on-line stage, we perform a semi-closed loop simulation with each executable model SD_pci:

repeat
  1. simulate(SD_pci, OBS_pci) -> PRED_pci.
  2. if the dissimilarity between PRED_pci and OBS'_pci exceeds delta_pci, confirm pc_i as a real conflict.
  3. update(set of candidates, set of activated pcs).
until every pc_i is activated or the time has elapsed.

OBS_pci denotes the set of input observations available for SD_pci; PRED_pci represents the set of predictions obtained from SD_pci; OBS'_pci denotes the set of output observations for SD_pci; and delta_pci is the maximum value allowed for the dissimilarity between OBS'_pci and PRED_pci. Without further information about fault modes, consistency-based diagnosis will just provide a list of feasible faulty candidates. To improve the accuracy of our system, in previous works we relied upon an induced classifier that provided, upon request, a ranking of the most feasible fault modes based on current and past data [Alonso et al., 2003; Pulido et al., 2005]. Following a similar approach, we want to keep the logical properties of consistency-based diagnosis; therefore, we just use CBR as an indication of the most feasible fault, as an initial stage of fault identification. In this sense, the integration of the new CBR system is straightforward. Every time a conflict detection is done, we incrementally compute the set of candidates consistent with the current observations. Afterwards, we invoke the CBR classifier:

  4. CLASSIFIER_CBR(t_0, set of candidates)

and the set of candidates is ranked according to the CBR output.
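A runnable rendering of this on-line stage, including the classifier call of step 4, could look as follows; `simulate`, `dissimilarity`, `update_candidates` and `classifier_cbr` are assumed interfaces passed in by the caller, not the paper's actual API:

    def online_stage(pcs, get_obs, simulate, dissimilarity,
                     update_candidates, classifier_cbr, thresholds,
                     t0, max_steps=1000):
        """Semi-closed-loop monitoring of possible conflicts.
        pcs: dict mapping each pc_i to its executable model SD_pci."""
        candidates, activated = set(), set()
        for _ in range(max_steps):                    # "or time elapsed"
            for pc, model in pcs.items():
                if pc in activated:
                    continue
                obs_in, obs_out = get_obs(pc)         # OBS_pci and OBS'_pci
                pred = simulate(model, obs_in)        # PRED_pci
                if dissimilarity(pred, obs_out) > thresholds[pc]:
                    activated.add(pc)                 # confirmed real conflict
                    candidates = update_candidates(candidates, activated)
                    # Step 4: rank candidates via the CBR system.
                    candidates = classifier_cbr(t0, candidates)
            if activated == set(pcs):
                break
        return candidates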
CLASSIFIER_CBR(t, c) denotes an invocation of the CBR system with a fragment of the series from t to min(current time, t + maximum series length), and with the set of candidates c. t_0 is the starting time of the series, prior to the first conflict confirmation. Once again, the consistency-based diagnosis system commands the diagnosis results, because we only consider the CBR output which is consistent with the consistency-based diagnosis results.

6 Discussion and Conclusions

In this article we have introduced preliminary work which explores the integration of model-based diagnosis and case-based reasoning. The case-based tool could be used in a real complex system where no fault model is available, but where a collection of past fault diagnosis episodes exists. Currently, we are testing our approach on a laboratory plant, before we validate the final diagnosis system on a real continuous plant. This was the main reason for our CBR system being made up of simulated diagnosis cases. The proposed integration of consistency-based diagnosis with CBR is rather simple: once the set of candidates is updated, subsequent calls to the CBR system provide an order on the set of feasible fault modes. A similar integration procedure was used in previous works, where we used other machine-learning techniques for inducing classifiers [Pulido et al., 2005]; such classifiers were developed using the whole temporal data series, for the same set of faults. The procedure CLASSIFIER_CBR(t_0, set of candidates) does not need the whole series on-line to provide a reasonable classification success of 87.9% using just 40% of the series and quadratic DTW in the retrieval stage. This percentage increases to 91.1% for 50% of the series. The result with 100% of the data series is 91.4%.
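The consistency filter just mentioned, keeping only CBR output compatible with the consistency-based candidates, admits a simple explicit form; this helper and its argument names are ours:

    def rank_candidates(candidates, cbr_ranking, fault_component):
        """Order the candidate components by the CBR fault-mode ranking,
        discarding CBR suggestions inconsistent with the candidate set.
        fault_component maps a fault class (e.g. 'f11') to its component."""
        consistent = [f for f in cbr_ranking
                      if fault_component[f] in candidates]
        ranked = [fault_component[f] for f in consistent]
        # Candidates never suggested by the CBR system keep their place.
        return ranked + [c for c in candidates if c not in ranked]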

Retrieving       One Neighbor             Three Neighbors          Five Neighbors
distance         30%    40%    50%        30%    40%    50%        30%    40%    50%
Linear DTW       51.4%  84.3%  87.1%      48.2%  82.1%  83.6%      46.4%  79.3%  81.4%
Quadratic DTW    56.1%  87.9%  91.1%      57.9%  84.3%  87.1%      53.2%  83.2%  83.6%
Valley DTW       56.1%  87.9%  91.1%      57.9%  84.3%  87.1%      53.2%  83.2%  83.6%
Euclidean        52.1%  83.6%  86.1%      51.4%  80.0%  84.3%      48.6%  78.9%  82.1%
Manhattan        49.3%  81.4%  85.4%      44.3%  80.7%  83.6%      42.1%  76.4%  81.4%

Table 4: Classification success using one-dimensional DTW, Euclidean distance and Manhattan distance, with one, three and five neighbors for the reuse task.

Table 5 shows the classification success achieved using other classifiers for different percentages of the data series: decision trees and support vector machines (SVM, with linear kernel). CBR obtains better results than SVM, but induced rule-based classifiers for early diagnosis still provide better results with respect to error rate at 30, 40 and 50% of the time series. However, with those classifiers it is still necessary to learn a classifier for each percentage of the series. On the other hand, a simple CBR system easily provides a preferred fault mode (or the K preferred modes if we use a K-neighbors retrieval algorithm).

Technique        30%    40%    50%
Decision trees   68.6%  94.3%  91.8%
SVM              44.6%  80.7%  84.6%

Table 5: Classification success using decision trees and SVM.

The main conclusion is that our CBR system is able to provide a valuable ranking of the available fault modes with only 40 to 50% of the data needed to completely identify a fault, assuming faults arise between 20 and 33% of the whole series. Moreover, we just need one classifier for different sizes of the data set to be classified. These two differences also hold with respect to other CBR systems, such as the work by Colomer et al. [Colomer et al., 2003], where there is an abstraction of temporal data into episodes and the whole series is needed. The main advantage of the simple interface proposed between the model-based diagnosis and CBR approaches is that we retain the consistency of our model-based diagnosis system. We use a semi-closed loop simulation of numerical models over a small period of time, in a noisy environment, with uncertainties in the models; we do not perform a crisp comparison between predicted and observed values, but compare a dissimilarity value (a kind of global distance) against a fixed threshold. We empirically select these thresholds to minimize false fault detections. Concerning completeness, our results depend on the simulation models. We can provide fault detection and isolation capabilities based just on correct behavior models, due to our consistency-based diagnosis approach. For fault identification we use CBR to rank the most feasible fault modes; again, we prioritize consistency-based diagnosis results. However, we cannot guarantee the identification of 100% of the fault modes considered; our best result is an 8.6% error rate for a complete fault episode. As future work, we plan to increase the size of the case base; our initial guess is that a bigger case base will improve the results of the K-Nearest Neighbors algorithm for more than one neighbor. We will also need to revise the linear structure of the case base. Our intention is to use the combined result from model-based diagnosis and the classifier as an initial clue for the fault identification stage using consistency-based diagnosis.

Acknowledgments: this work has been partially funded by the Spanish Ministry of Education and Culture, through grant DPI, and by Junta de Castilla y León grant VA088A05.
References

[Aamodt and Plaza, 1994] A. Aamodt and E. Plaza. Case-based reasoning: foundational issues, methodological variations, and system approaches. AI Communications, IOS Press, 7(1):39-59, 1994.
[Alonso et al., 2003] C. Alonso, J.J. Rodriguez, and B. Pulido. Enhancing consistency-based diagnosis with machine-learning techniques. In 14th International Workshop on Principles of Diagnosis, DX03, Washington, D.C., USA, 2003.
[Bellman, 1957] R.E. Bellman. Dynamic Programming. Princeton University Press, Princeton, 1957.
[Biswas et al., 2004] G. Biswas, M.O. Cordier, J. Lunze, L. Travé-Massuyès, and M. Staroswiecki. Diagnosis of complex systems: bridging the methodologies of the FDI and DX communities. IEEE Trans. on Systems, Man, and Cybernetics, Part B: Cybernetics, 34(5), 2004.
[Bregón et al., 2006] A. Bregón, M.A. Simón, J.J. Rodríguez, C. Alonso, B. Pulido, and I. Moro. Early fault classification in dynamic systems using case-based reasoning. In Post-proceedings of the 11th Conference of the Spanish Association for Artificial Intelligence, CAEPIA 05, Santiago de Compostela, Spain, 2005. Accepted; to be published in Lecture Notes in Artificial Intelligence.
[Colomer et al., 2003] J. Colomer, J. Melendez, and F. Gamero. A qualitative case-based approach for situation assessment in dynamic systems. Application in a two tank system. In Proceedings of IFAC-Safeprocess 2003, Washington, USA, 2003.

[Cordier et al., 2004] M.O. Cordier, P. Dague, F. Lévy, J. Montmain, M. Staroswiecki, and L. Travé-Massuyès. Conflicts versus analytical redundancy relations: a comparative analysis of the model-based diagnosis approach from the artificial intelligence and automatic control perspectives. IEEE Trans. on Systems, Man, and Cybernetics, Part B: Cybernetics, 34(5), 2004.
[de Kleer and Williams, 1989] J. de Kleer and B.C. Williams. Diagnosing with behavioral modes. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (IJCAI-89), Detroit, Michigan, USA, 1989.
[Dressler and Struss, 1996] O. Dressler and P. Struss. The consistency-based approach to automated diagnosis of devices. In Gerhard Brewka, editor, Principles of Knowledge Representation. CSLI Publications, Stanford, 1996.
[Dressler, 1996] O. Dressler. On-line diagnosis and monitoring of dynamic systems based on qualitative models and dependency-recording diagnosis engines. In Proceedings of the Twelfth European Conference on Artificial Intelligence (ECAI-96), 1996.
[Friedrich et al., 1990] G. Friedrich, G. Gottlob, and W. Nejdl. Physical impossibility instead of fault models. In Proceedings of the Eighth National Conference on Artificial Intelligence, Boston, Massachusetts, USA, 1990.
[Hamscher et al., 1992] W. Hamscher, L. Console, and J. de Kleer, editors. Readings in Model-Based Diagnosis. Morgan Kaufmann, San Mateo, 1992.
[Keogh and Ratanamahatana, 2005] E. Keogh and C. A. Ratanamahatana. Exact indexing of dynamic time warping. Knowledge and Information Systems, 7(3), 2005.
[Kolodner, 1993] J. Kolodner. Case-Based Reasoning. Morgan Kaufmann Publishers, 1993.
[Lenz et al., 1998] M. Lenz, M. Manago, and E. Auriol. Diagnosis and decision support. In Case-Based Reasoning Technology: From Foundations to Applications, Lecture Notes in Computer Science, Vol. 1400, 1998.
[Poole, 1989] D. Poole. Normality and faults in logic-based diagnosis. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (IJCAI-89), Detroit, Michigan, USA, 1989.
[Price, 1999] C. Price. Computer-Based Diagnostic Systems. Springer Verlag, New York, 1999.
[Pulido and Alonso González, 2004] B. Pulido and C. Alonso González. Possible conflicts: a compilation technique for consistency-based diagnosis. IEEE Trans. on Systems, Man, and Cybernetics, Part B: Cybernetics, 34(5), 2004.
[Pulido and Alonso, 2000] B. Pulido and C. Alonso. An alternative approach to dependency-recording engines in consistency-based diagnosis. In Artificial Intelligence: Methodology, Systems, and Applications, AIMSA-00, volume 1904 of LNAI, Springer Verlag, Berlin, Germany, 2000.
[Pulido et al., 2001] B. Pulido, C. Alonso, and F. Acebes. Lessons learned from diagnosing dynamic systems using possible conflicts and quantitative models. In Engineering of Intelligent Systems, XIV Conf. IEA/AIE-2001, volume 2070 of LNAI, Budapest, Hungary, 2001.
[Pulido et al., 2005] B. Pulido, J.J. Rodriguez Diez, C. Alonso González, O. Prieto, E. Gelso, and F. Acebes. Diagnosis of continuous dynamic systems: integrating consistency-based diagnosis with machine-learning techniques. In XVI IFAC World Congress, Prague, Czech Republic, 2005.
[Pulido, 2001] B. Pulido. Possible conflicts as an alternative to on-line dependency-recording for diagnosing continuous systems (in Spanish). Ph.D. thesis, E.T.S.I. Informática, Universidad de Valladolid, Valladolid, February 2001.
[Raiman et al., 1991] O. Raiman, J. de Kleer, V. Saraswat, and M. H. Shirley. Characterizing non-intermittent faults.
In Proceedings of the Ninth National Conference on Artificial Intelligence, Anaheim, California, USA, 1991.
[Struss and Dressler, 1989] P. Struss and O. Dressler. Physical negation: introducing fault models into the general diagnostic engine. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (IJCAI-89), Detroit, Michigan, USA, 1989.
[Witten and Frank, 2005] I. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques (Second Edition). Morgan Kaufmann, 2005.


Ambiguity Groups Determination for Analog Non Linear Circuits Diagnosis

Barbara Cannas, Alessandra Fanni, and Augusto Montisci
Dipartimento di Ingegneria Elettrica ed Elettronica, University of Cagliari, Piazza d'Armi, Cagliari, Italy
{cannas; fanni; amontisci}@diee.unica.it

Abstract

In this paper, a symbolic procedure for ambiguity group determination, based on the a priori identifiability concept, is proposed. The method starts from an analysis of the occurrence of parameters in the input/output relationship coefficients in order to select the potential canonical ambiguity groups. This first step allows one to strongly reduce the problem complexity. Then, the nonlinear system obtained by imposing the ambiguity conditions is solved, resorting to Gröbner bases theory. These procedures are completely symbolic, hence they do not introduce round-off errors. Furthermore, this approach overcomes the limits of the procedures presented in the literature, which are directly applicable only to linear circuits: the method can be applied both to linear circuits and to nonlinear circuits with algebraic nonlinearities. The application examples concern a well-known linear circuit benchmark and a third-order chaotic circuit.

1 Introduction

In recent years, there has been growing interest in automatic procedures for analog circuit fault diagnosis [Catelani et al., 1987][Stenbakken et al., 1989][Carmassi et al., 1991][Liberatore et al., 1994][Fedi et al., 1998][Fedi et al., 1999][Starzyk et al., 2000][Cannas et al., 2004]. To perform this task it is crucial to have a quantitative measure of circuit solvability, i.e., of the solvability of the nonlinear fault diagnosis equations. The testability concept is strictly linked to the concept of circuit solvability. The most popular definition of analog circuit testability was introduced in [Sen and Saeks, 1979][Chen and Saeks, 1979]. This definition provides a quantitative measure of network testability, i.e., a measure of the solvability of the fault equations. Regardless of which method is actually used to write the fault equations, the testability of a circuit gives the maximum number of faults that a hypothetical diagnostic system could detect if they were simultaneously present in the circuit. Therefore, the results of the testability analysis fix an upper limit to the information that both an FDI and a DX [Cordier et al., 2004] approach can give regarding the fault state of a circuit. For low-testability circuits, an important concept is that of ambiguity groups. Roughly speaking, an ambiguity group is a set of components that, if considered as potentially faulty, do not give a unique solution in the fault location phase. An ambiguity group that does not contain other ambiguity groups is called a Canonical Ambiguity Group (CAG) [Fedi et al., 1998]. One of the first algorithms for CAG determination was presented in [Stenbakken et al., 1989]. Several numerical algorithms and symbolic procedures for evaluating circuit testability and CAGs in the case of linear circuits have been developed by several authors [Catelani et al., 1987][Stenbakken et al., 1989][Starzyk et al., 2000]. However, few contributions are present in the literature regarding nonlinear circuit testability. In [Fedi et al., 1998b] a symbolic approach for testability evaluation of nonlinear analog circuits is presented. The approach is an extension of the methodologies developed for the linear case to circuits where nonlinear components, such as diodes or transistors, are present.
In practice, the procedure for determining the testability value and the ambiguity groups resorts to the substitution of each nonlinear component with its piecewise linear model. It is worth pointing out that the network functions determined in the s domain do not have any meaning, and they cannot be used in the fault location phase to determine the effective values of the circuit components; they give information only about testability and ambiguity group determination. In [Worsman and Wong, 2000] the ambiguity analyses in the dc and ac domains are conducted independently of one another and the results are combined: in the first case, the nonlinear components are replaced with PWL models, whereas in the ac case a small-signal analysis is performed. This paper proposes a symbolic procedure for ambiguity group determination, based on the a priori identifiability concept, that can be applied to both linear and nonlinear circuits. The method starts from an analysis of the occurrence of parameters in the input/output relationship coefficients in order to select the potential CAGs. This first step allows one to strongly reduce the problem complexity.

Then, for each potential CAG, the nonlinear system obtained by imposing the ambiguity conditions is solved, resorting to Gröbner bases [Becker and Weispfenning, 1993].

2 Theory

Let us consider an analog, time-invariant, nonlinear circuit described by:

dx/dt = f[x(t), u(t), F],   y = h[x(t)]   (1)

where x is the n-dimensional state variable vector, u is a scalar input, F = {F1, F2, ..., FL} is the vector of the parameters, L is the number of components, y is a scalar output, and f and h are algebraic function vectors in x. We assume that only one scalar output is measured as y(t) (i.e., the state is not directly available). The literature reports several definitions of CAGs for linear circuits [Fedi et al., 1998][Cannas et al., 2005]. By the first definition of CAG [Fedi et al., 1998], d components F1, F2, ..., Fd belong to a CAG if a variation of d1 < d components causes a variation of the transfer function that is indistinguishable from the one produced by the variation of the other d - d1 components of the group, i.e.,

H1(F1', ..., Fd1') = H2(Fd1+1', ..., Fd')   (2)

In this paper, an equivalent definition of CAG is adopted (see [Cannas et al., 2005] for the proof of the equivalence of the two definitions).

Definition: d components {F1, F2, ..., Fd} belong to a CAG if a variation ΔF ≠ 0 exists that does not affect the transfer function value, i.e., H(F1, ..., Fd) = H(F1 + ΔF1, ..., Fd + ΔFd).

For example, two components belong to a CAG if, for every deviation from the nominal value of one parameter, there exists a deviation of the other which perfectly balances it, so as to have no effect on the transfer function. In the case of nonlinear circuit diagnosis, a definition of CAG is necessary that does not resort to assumptions on the transfer function, which is only defined for linear systems. The most natural way is to refer to the Input/Output Relationship (IOR).

Definition 1: d components {F1, F2, ..., Fd} constitute a CAG iff a variation ΔF ≠ 0 exists such that the IOR does not globally change.

Generally speaking, if d components constitute a CAG, the corresponding system is not identifiable. In fact, system (1) is identifiable through the IOR if: IOR(F1, F2, ..., Fk) = IOR(F1', F2', ..., Fk') iff (F1, F2, ..., Fk) = (F1', F2', ..., Fk') for a finite number of sets {F1, F2, ..., Fk}, i.e., if the components F1, F2, ..., Fk do not belong to a CAG. For a system in the form (1), the IOR is a nonlinear differential polynomial in u, y and their derivatives, and can be expressed as:

z(u, y) = z(y, y', y'', ..., u, u', u'', ..., a1, a2, ...) = 0

where the a_i are the input/output relationship coefficients, which depend on the parameter vector F. The IOR can be determined by resorting to the concept of the Characteristic Set associated with the dynamic state equations. The Characteristic Set was introduced by Ritt [Ritt, 1950] in 1950, and since 1990 it has been widely used for the study of dynamic systems [Ljung and Glad, 1994]. In order to define the Characteristic Set, we need to introduce some concepts of Differential Algebra [Fliess and Glad, 1993]. The peculiarity of the Characteristic Set is that it summarizes all the information contained in the differential equations defining a dynamic system. If one chooses the ranking of the variables and their derivatives

u < u' < u'' < ... < y < x1 < x2 < ... < y' < x1' < x2' < ...

then, for a system of the form (1), the Characteristic Set exhibits n+1 differential polynomials, that is: the IOR, denoted by z(u, y), and n differential polynomials in y, x and u, denoted by the n-dimensional vector Z(u, x, y).
The procedure to identify the IOR may turn out to be rather complex; hence resorting to software to calculate the Characteristic Set is mandatory. However, in some cases it can be easily found by simple inspection and manipulation of the dynamic equations. In this paper, the commercial tool REDUCE has been used, together with the implementation of Ritt's algorithm described in [Audoly et al., 2001].

3 Determination of Canonical Ambiguity Groups

A preliminary analysis of the IOR coefficients allows one to a priori exclude that a group of components constitutes a CAG. For example, two components F1 and F2 cannot constitute a CAG if there is an IOR coefficient in which only one of them (e.g., F1) appears: in this case, a variation of F2 cannot produce the same effect as a variation of F1. Let us consider a circuit composed of K components F1, ..., FK, and let a_j, for j = 1, ..., H, be the IOR coefficients. An incidence matrix Y can be written, with one row per component F1, ..., FK and one column per coefficient a1, ..., aH.

The generic entry of Y is y_ij = 1 if the component F_i appears in the coefficient a_j, and y_ij = 0 otherwise, for i = 1, ..., K and j = 1, ..., H. The inspection of the incidence matrix yields a set of groups that could constitute a CAG. In general, a set of components could constitute a CAG only if, in every coefficient where any of them appears, at least two of them are present, i.e., if:

Σ_{i : F_i ∈ CAG} y_ij ≠ 1,   j = 1, ..., H   (3)

It is worth noting that if Σ_j y_ij = 0 for j = 1, ..., H, the component F_i is not diagnosable at all. For the potential CAGs, further analysis is necessary. Let us consider d components that could belong to a CAG of order d. They actually constitute a CAG iff a variation ΔF_i ≠ 0, i = 1, ..., d, of these components does not modify the IOR (see Definition 1). This is equivalent to solving a problem of constrained identifiability, where the constraints consist of fixing the set of known parameters corresponding to the fault-free components. This problem can be expressed by a system of g equations for each CAG:

a_i(F1, ..., Fk) = a_i(F1', ..., Fk'),   i = 1, ..., g,   subject to F_j' = F_j for every F_j ∉ CAG   (4)

where a_i is the generic coefficient containing at least two components of the CAG, g is the number of such coefficients, and F_i' = F_i + ΔF_i. If the d components constitute a CAG, the corresponding system, where the d fault values are the unknowns, is not identifiable. In particular, the d components constitute a CAG iff the algebraic system (4) has infinite solutions. If the system (4) admits a finite number of solutions, this means that only a finite number of parameter vectors cannot be distinguished from each other. Such a situation is not considered ambiguous, because the probability of having exactly the ambiguous vector of parameter values is virtually null. Furthermore, if such an ambiguity exists and the two indistinguishable vectors correspond to a faulty and to a healthy configuration, one may consider as faulty the configuration that is not. In this work the system (4) has been solved by resorting to Gröbner bases theory [Becker and Weispfenning, 1993], which is at the origin of many symbolic algorithms used to manipulate multivariate polynomials. In particular, the commercial tool REDUCE has been used, which implements Buchberger's algorithm [Buchberger, 1985]. The algorithm makes use of a generalization of the Gauss elimination method for multivariate linear equations and of Euclid's algorithm for single-variable polynomial equations.

4 Examples

In this section the procedure is first applied to a linear circuit and then to a nonlinear circuit, both well-known benchmarks. In both cases, we assume the IOR of the circuit is available. The analysis we perform concerns the coefficients of said IOR, which we assume can be deduced from a set of observations. The method one uses to explicitly describe the relationship between the IOR coefficients and the observations is out of the scope of the present work, and it does not affect the generality of the approach described here. In this perspective, the equations just described are the Analytical Redundancy Relations (ARRs) [Cordier et al., 2004] of the diagnostic system. Furthermore, because the IOR coefficients are described in terms of the parameters of the circuit, it is possible to write the Signature Matrix for single faults [Cordier et al., 2004], which indicates whether a certain parameter affects the residual of a particular ARR.
In fact, as the dependence of the IOR coefficients on the parameters is explicit, the Single Fault Signature Matrix coincides with the incidence matrix of the parameters vs. the IOR coefficients.

4.1 Sallen-Key band-pass filter

The proposed procedure is first illustrated by means of an example retrieved from the literature [Fedi et al., 1999][Cannas et al., 2004]: the Sallen-Key band-pass filter shown in Fig. 1. The IOR relating the input voltage v_in to the output voltage v_out is:

a1·y + a2·(dy/dt) + d²y/dt² = a3·(du/dt)   (5)

Figure 1: The Sallen-Key band-pass filter.

where the coefficients a1, a2 and a3 are rational functions of the conductances G1, ..., G5 and of the capacitances C1 and C2.
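Both screening steps of Section 3 can be scripted symbolically. The sketch below uses Python's sympy in place of the REDUCE implementation used by the authors; the two toy coefficients are illustrative only, not the filter's actual a1, a2, a3:

    import sympy as sp

    G4, G5, C1 = sp.symbols('G4 G5 C1', positive=True)
    G4f, G5f = sp.symbols('G4p G5p', positive=True)  # fault values (unknowns)

    # Toy coefficients in which G4 and G5 only occur through the ratio G4/G5,
    # so {G4, G5} is expected to be a CAG (not the coefficients of Eq. (5)).
    a = [G4 / G5, C1 * G4 / G5]

    # Incidence-matrix screening (condition (3)): a component occurs in a
    # coefficient iff the expression depends on it.
    params = [G4, G5, C1]
    Y = [[int(expr.has(p)) for expr in a] for p in params]
    print(Y)   # rows G4, G5, C1: wherever G4 or G5 appears, both appear

    # Ambiguity test (system (4)): equate each coefficient at nominal and
    # fault values; infinitely many solutions means the group is a CAG.
    eqs = [sp.Eq(expr, expr.subs({G4: G4f, G5: G5f})) for expr in a]
    print(sp.solve(eqs, [G4f, G5f], dict=True))
    # -> [{G4p: G4*G5p/G5}]  a one-parameter family: {G4, G5} is a CAG.

For larger systems, one would compute a Gröbner basis of the difference polynomials (sp.groebner) instead of calling solve, which is closer to the procedure described in the paper.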

In [Fedi et al., 1999] the following CAGs are reported for this circuit: {C1, G2, G3} and {G4, G5}. The incidence matrix Y for this circuit, with rows G1, G2, G3, G4, G5, C1, C2 and columns a1, a2, a3, is reported in Eq. (6). Since the IOR has 3 coefficients, the analysis is performed only for groups of order 2 and 3; in fact, all groups of order greater than 3 contain at least one CAG.

We first evaluate the groups of order 2. On the basis of the inspection of matrix Y it is possible to a priori exclude the ambiguity for a set of couples of components. For example, G1 and G2 cannot constitute a CAG because there is a coefficient a_j for which y_1j + y_2j = 1, i.e., only one of them appears in it (see Eq. (6)). The same reasoning reduces the potential ambiguity groups to the 6 couples marked with a "?" in Table I.

Table I: Sallen-Key filter, potential CAGs of order 2. The six couples not excluded by the incidence-matrix inspection are marked with a ?; excluded couples are marked with a dash.

Each couple constitutes a CAG iff the corresponding system (4) has infinite solutions. For example, consider the couple (G1, C2). The system (4) consists of g = 3 equations, obtained by equating each coefficient a1, a2, a3 evaluated at the nominal parameter values with the same coefficient evaluated at the fault values, where the fault values G1' and C2' are the system unknowns (Eq. (7)). The system (7) admits only one solution, {G1' = G1; C2' = C2}. Thus, G1 and C2 do not constitute a CAG. Let us now consider the couple {G4, G5}. The system (4) consists of g = 2 equations in the fault values G4' and G5' (Eq. (8)). The system (8) admits infinite solutions, with G4' expressed as a function of G5' (Eq. (9)). Thus, G4 and G5 constitute a CAG. By performing this analysis for all the 6 couples in Table I, it turns out that the couple {G4, G5} is the only CAG of order 2.

Now let us consider the groups of order 3. On the basis of the inspection of matrix Y it is possible to a priori exclude the ambiguity for a set of terns of components. For example, {G1, G2, G3} cannot constitute a CAG because there is a coefficient a_j for which y_1j + y_2j + y_3j = 1 (see Eq. (6)). Let us consider the component G2. The potential CAGs of order 3 that contain G2 are shown in Table II.

Table II: Sallen-Key filter, potential CAGs of order 3.

Let us now consider the tern {G2, G1, G4}. The system (4) consists of g = 3 equations in the fault values G2', G1' and G4' (Eq. (10)). The system (10) admits 2 solutions; thus, the group {G2, G1, G4} is not a CAG. Let us now consider the tern {G2, G3, C1}. The system (4) consists of g = 2 equations in the fault values G2', G3' and C1' (Eq. (11)).

The system (11) admits infinite solutions, with the fault values expressed as functions of one another (Eq. (12)). Thus, the group {G2, G3, C1} is a CAG. By performing this analysis for the other potential CAGs, it turns out that the group {G2, G3, C1} is the only CAG of order 3.

Relations such as (9) and (12) give us an important piece of information: they express the relationship between the fault values that can be confused. This information can be combined with further constraints of the actual diagnostic problem in order to reduce the uncertainty of the diagnosis. For example, if the ambiguity exists for a couple of values and one of them is negative, in practice there is no ambiguity. This result represents an important improvement with respect to previous approaches, where the solvability of the fault equations system is evaluated by means of the analysis of the Jacobian matrix [Catelani et al., 1987][Stenbakken et al., 1989][Carmassi et al., 1991][Liberatore et al., 1994][Fedi et al., 1998][Fedi et al., 1999][Starzyk et al., 2000][Cannas et al., 2004]. In that case, no information can be deduced on the relationship between the parameter values of indistinguishable faults.

4.2 Chua's circuit

Chua's circuit (see Fig. 2) is a simple electronic circuit that exhibits classic chaotic behaviour. First introduced in 1983 by Leon O. Chua [Chua, 1993], its ease of construction has made it a ubiquitous real-world example of a chaotic system, leading some to declare it 'a paradigm for chaos'. Chua's circuit consists of two linear capacitors, one linear resistor, one linear inductor and one nonlinear resistor. The nonlinear resistor (Chua's diode) is chosen to have a cubic V/I characteristic of the form [Huang et al., 1996]:

i = γ1·v + γ3·v³

Figure 2: Chua's circuit.

The IOR, taking the voltage v1 as the output, is a third-order differential polynomial in v1 (Eq. (13)) whose coefficients depend on R, C1, C2, L, γ1 and γ3, and which includes cubic terms in v1 due to the nonlinearity. Let us call a_i, with i = 1, ..., 7, the coefficients of equation (13). The incidence matrix Y then has rows R, C1, C2, L, γ1, γ3 and columns a1, ..., a7 (Eq. (14)). The coefficients are one more than the parameters, so all the possible groups of faults have to be considered as potential canonical ambiguity groups. By following the same procedure used in the linear example, simply on the basis of the incidence matrix (14), it is possible to exclude the presence of any CAGs of order 2. Now let us consider the potential CAGs of order 3. The analysis of the Y matrix yields the result reported in Table III:

         R   C1  C2  L   γ1  γ3
R C1     -   -   -   -   -   ?
R C2     -   -   -   ?   -   ?
R L      -   -   -   -   -   -
R γ1     -   -   -   -   -   -
C1 C2    -   -   -   -   -   ?
C1 L     -   -   -   -   -   -
C1 γ1    -   -   -   -   -   ?
C2 L     -   -   -   -   -   -
C2 γ1    -   -   -   -   -   -
L γ1     -   -   -   -   -   -

Table III: Chua's circuit, potential CAGs of order 3.

Only five potential CAGs of order 3 have to be examined: {R, C1, γ3}, {R, C2, L}, {R, C2, γ3}, {C1, C2, γ3}, {C1, γ1, γ3}. On the basis of the Gröbner analysis, none of these groups turns out to be an ambiguity group, as the corresponding equation systems have only one solution. By applying the analysis of the Y matrix, the following potential ambiguity groups of order greater than 3 are obtained: {R, C1, C2, γ3}, {R, C1, L, γ3}, {R, C1, γ1, γ3}, {R, C2, L, γ1}, {C1, C2, L, γ3}, {C1, C2, γ1, γ3}, {C1, L, γ1, γ3} of order 4; {R, C1, C2, L, γ3}, {R, C1, C2, γ1, γ3}, {C1, C2, L, γ1, γ3} of order 5; {R, C1, C2, L, γ1, γ3} of order 6.
The Gröbner analysis has been applied to the corresponding 11 equation systems, and for each of them we obtained only one solution. Thus, the Chua's circuit has no ambiguity groups; theoretically, any variation of one or more parameters of the circuit can be univocally determined.

5 Comments and conclusion

In this paper, a symbolic procedure for canonical ambiguity group determination has been proposed, resorting to the philosophy of a priori identifiability. The method starts with the IOR determination; then an analysis of the occurrence of parameters in the input/output relationship coefficients is performed in order to select the potential CAGs. In the last step, the corresponding parameter identifiability is investigated. The IOR determination is performed by resorting to an implementation of Ritt's algorithm.

The main drawback of the algorithm is that it is directly applicable only to algebraic systems; some transcendent systems (e.g., systems with trigonometric or exponential nonlinear functions) can be analysed by resorting to suitable transformations that cancel the nonlinearity. Thus, the method cannot be used to determine the CAGs in circuits with nonlinear components described by piecewise linear functions, or by some transcendent functions. The identifiability problem has been solved by resorting to Buchberger's algorithm for the calculation of the Gröbner bases. The limit of this algorithm is that it is only applicable to systems with rational algebraic coefficients. An analysis of computational complexity has not been performed to date. It is worth noting that the complexity strongly depends on the chosen ordering of the variables, both for Ritt's algorithm and for Buchberger's algorithm; moreover, a bound for the complexity is not available. The main advantage of the method is that it is independent of the fault value, since it does not introduce any linear approximation, as done in the majority of the symbolic approaches proposed in the literature. Moreover, the analysis of the IOR coefficients instead of the transfer function coefficients allows one to extend to nonlinear systems the applicability of several methods presented in the literature only for linear systems. Future work regards the comparison with the methods presented in the literature in terms of complexity, and the extension of the method to a wider class of nonlinear circuits.

References

[Catelani et al., 1987] M. Catelani, G. Iuculano, A. Liberatore, S. Manetti, and M. Marini. Improvements to numerical testability evaluation. IEEE Trans. Instrum. Meas., 36, Dec 1987.
[Stenbakken et al., 1989] G. N. Stenbakken, T. M. Souders, and G. W. Stewart. Ambiguity groups and testability. IEEE Trans. Instrum. Meas., 38, Oct 1989.
[Carmassi et al., 1991] R. Carmassi, M. Catelani, G. Iuculano, A. Liberatore, S. Manetti, and M. Marini. Analog network testability measurement: a symbolic formulation approach. IEEE Trans. Instrum. Meas., 40, Dec 1991.
[Liberatore et al., 1994] A. Liberatore, S. Manetti, and M.C. Piccirilli. A new efficient method for analog circuit testability measurement. In IEEE Instrumentation and Measurement Technology Conference (IMTC), Hamamatsu, May 10-12, 1994.
[Fedi et al., 1998] G. Fedi, A. Luchetta, S. Manetti, and M. C. Piccirilli. A new symbolic method for analog circuit testability evaluation. IEEE Trans. Instrum. Meas., 47, Apr 1998.
[Fedi et al., 1999] G. Fedi, S. Manetti, M.C. Piccirilli, and J. Starzyk. Determination of an optimum set of testable components in the fault diagnosis of analog linear circuits. IEEE Trans. Circuits and Systems I: Fundamental Theory and Applications, 46, Jul 1999.
[Starzyk et al., 2000] J. A. Starzyk, J. Pang, S. Manetti, M.C. Piccirilli, and G. Fedi. Finding ambiguity groups in low testability analog circuits. IEEE Trans. Circuits and Systems I: Fundamental Theory and Applications, 47, Aug 2000.
[Cannas et al., 2004] B. Cannas, A. Fanni, S. Manetti, A. Montisci, and M.C. Piccirilli. Neural network based analog fault diagnosis using testability analysis. Neural Computing & Applications, 13, 2004.
[Sen and Saeks, 1979] N. Sen and R. Saeks, Fault diagnosis for linear systems via multifrequency measurements, IEEE Trans. Circuits and Systems, vol. 26, pp. –, July 1979.

[Chen and Saeks, 1979] H. Chen and R. Saeks, A search algorithm for the solution of the multifrequency fault diagnosis equations, IEEE Trans. Circuits and Systems, vol. 26, pp. –, July 1979.

[Becker and Weispfenning, 1993] T. Becker and V. Weispfenning, Gröbner Bases: A Computational Approach to Commutative Algebra. New York: Springer-Verlag, 1993.

[Buchberger, 1985] B. Buchberger, Gröbner bases: an algorithmic method in polynomial ideal theory, in Multidimensional Systems Theory, N. K. Bose, ed., D. Reidel Publishing Co., 1985, pp. –.

[Fedi et al., 1997] G. Fedi, R. Giomi, A. Luchetta, S. Manetti, and M.C. Piccirilli, Symbolic algorithm for ambiguity group determination in analog fault diagnosis, in Proc. European Conference on Circuit Theory and Design (ECCTD'97), Budapest, Hungary, Aug. 1997, pp. –.

[Manetti and Piccirilli, 2003] S. Manetti and M.C. Piccirilli, A singular-value decomposition approach for ambiguity group determination in analog circuits, IEEE Trans. Circuits and Systems I: Fundamental Theory and Applications, vol. 50, no. 4, pp. –, 2003.

[Ljung and Glad, 1994] L. Ljung and S.T. Glad, On global identifiability for arbitrary model parameterizations, Automatica, vol. 30, pp. –, 1994.

[Fliess and Glad, 1993] M. Fliess and S. T. Glad, An algebraic approach to linear and nonlinear control, in Essays on Control: Perspectives in the Theory and its Applications, Groningen, vol. 14 of Progr. Systems Control Theory, Boston: Birkhäuser Boston, pp. –, 1993.

[Ritt, 1950] J. F. Ritt, Differential Algebra, American Mathematical Society Colloquium Publications, vol. 33, 1950.

[Worsman and Wong, 2000] M. Worsman and M. W. T. Wong, Non-linear analog circuit fault diagnosis with large change sensitivity, International Journal of Circuit Theory and Applications, vol. 28, pp. –, 2000.

[Fedi et al., 1998b] G. Fedi, R. Giomi, S. Manetti, and M.C. Piccirilli, A symbolic approach for testability evaluation in fault diagnosis of nonlinear analog circuits, in Proc. 1998 IEEE ISCAS, Monterey, USA, June 1998.

[Cannas et al., 2005] B. Cannas, A. Fanni, and A. Montisci, Testability evaluation for analog linear circuits via transfer function analysis, in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), Kobe, Japan, May 23-26, 2005.

[Audoly et al., 2001] S. Audoly, G. Bellu, L. D'Angiò, M. P. Saccomani, and C. Cobelli, Global identifiability of nonlinear models of biological systems, IEEE Trans. Biomed. Eng., vol. 48, pp. –, Jan. 2001.

[Huang et al., 1996] A. Huang, L. Pivka, C.W. Wu, and M. Franz, Chua's equation with cubic nonlinearity, International Journal of Bifurcation and Chaos, vol. 6, pp. –, 1996.

[Chua, 1993] L. O. Chua, Global unfolding of Chua's circuit, IEICE Trans. Fundamentals, pp. –, 1993.


A Framework for Decentralized Qualitative Model-Based Diagnosis

Luca Console, Claudia Picardi
Università di Torino, Dipartimento di Informatica
Corso Svizzera 185, 10149, Torino, Italy

Daniele Theseider Dupré
Università del Piemonte Orientale, Dipartimento di Informatica
Via Bellini 25/g, 15100, Alessandria, Italy

Abstract

In this paper we propose a framework for decentralized model-based diagnosis of complex systems. We consider the case where subsystems are developed independently along with their associated (or embedded) software modules, in particular their diagnostic software. This is useful in those situations where subsystems are developed (possibly by different suppliers) without a priori knowledge of the system in which they will be exploited, or without making assumptions on the role they will play in such a system. We describe a decentralized architecture where subsystems are analyzed by Local Diagnosers, coordinated by a Supervisor. Within the framework, both the Local Diagnosers and the Supervisor can be designed independently of each other, without any advance information on how the subsystems will be connected (provided that they share a common modeling ontology), allowing also for runtime changes in the overall system structure. Local Diagnosers are thus loosely coupled and communicate with the Supervisor via a standard interface, supporting independent implementations.

1 Introduction

The application of model-based diagnosis often has to face the problem of architectural complexity in real physical systems. Decomposition has been recognized as an important leverage to manage this type of complexity. Most approaches, however, have focused on hierarchical decomposition [Genesereth, 1984; Mozetic, 1991], while decentralization has been explored less frequently (see e.g. [Pencolé and Cordier, 2005]). In this paper we propose a decentralized, supervised approach to diagnosis. This approach generalizes the one in [Ardissono et al., 2005], which is tailored to the specific case of diagnosis of Web Services. We assume that a system is formed by subsystems, each one having its own Local Diagnoser, while a Supervisor is associated with the overall system, with the aim of coordinating the Local Diagnosers and producing global diagnoses. Although the details of the approach will be discussed in the following sections, it is important to notice here that we consider a general case where:
- The Supervisor need not have a priori knowledge about the subsystems and their paths of interaction.
- Each Local Diagnoser can adopt its own diagnostic strategy and must only implement a communication interface with the Supervisor.
- Local Diagnosers don't know each other.
- The Supervisor should behave as a Local Diagnoser in case the system is a subsystem of a more complex system (hierarchical scalability).

Several reasons motivate the adoption of such an approach. First of all, it is suitable for distributed systems where the set of subsystems and their paths of interaction vary across time. The assumptions above guarantee that subsystems can be added, removed or replaced independently of each other at any time, that the paths of interaction may even be nondeterministic, and that no one has to be informed of the change (loosely coupled integration). Such a decoupling is important and interesting for several reasons. First, it allows developing Local Diagnosers independently of each other, making it possible to control the design of the diagnostics in a complex system.
This is not only a problem of managing complexity. In many application cases the subsystems are designed by different entities (e.g., suppliers of the company assembling the complex system) and thus are black boxes for the designer of the complex system. Such problems are common in many application domains. For example, in aerospace applications it is very common that systems are assembled from subsystems provided by different suppliers. A simple example is a landing gear, including hydraulic, mechatronic and electronic components that are assembled by the assembler of the aircraft. Each subsystem (e.g., the power transmission, or a hydraulic actuator) is supplied together with its own FMECA and control/diagnostic software, without any information on its internal details. Indeed, in our approach we assume that Local Diagnosers are designed independently of each other, and that the design of a Supervisor requires only that Local Diagnosers implement a common communication interface (which is a reasonable assumption in an assembler-supplier relationship). In this sense a decentralized approach where local diagnosers

communicate with a Supervisor is more flexible than a fully distributed approach, as we can assume that Local Diagnosers have no information or model about the others, and we make no assumption on the diagnosers and the models they use (provided the models share a common ontology).

The paper is organized as follows: in the next section we describe the decentralized architecture we propose. Section 3 explains the assumptions we make on the models used for diagnosis. Then, section 4 discusses how decentralized diagnosis is carried out in our approach. Section 5 concludes with a discussion on related work.

2 Architecture

The architecture we define is a supervised one where:
- A Local Diagnoser is associated with each subsystem. The model of the subsystem is private and is not visible outside the diagnoser.
- A Supervisor of the diagnostic process is associated with the system. It coordinates the work of the diagnosers, optimizing their invocations during a diagnostic session.

The structure of the system (i.e., the subsystems and their connections) may be known a priori to the Supervisor or may be reconstructed by it during the interaction (without any a priori information). This increases the flexibility of the approach, as it allows us to deal with systems whose structure (number of components and their connections) may vary dynamically. The architecture defines a communication interface between the Local Diagnosers and the Supervisor. Local Diagnosers are awakened by the subsystem they are in charge of whenever an anomaly (fault) is detected. They may explain the fault locally or may put the blame on an input received from another subsystem. They communicate their explanations to the Supervisor (without disclosing the details of the local failure(s)). The Supervisor may invoke the Local Diagnoser of the blamed subsystem, asking it to generate explanations. The Supervisor may also ask a Local Diagnoser to check some predictions. The selection of the Local Diagnosers to invoke is made according to rules that optimize invocations, as we will discuss in section 4.2. The architecture is scalable (hierarchical) in the sense that the Supervisor can act as a Local Diagnoser when the system is used as a subsystem in a more complex system. Notice that, in order to communicate, the Supervisor and the Local Diagnosers must speak a common language; in other words, we must assume that they share a modeling ontology that allows them to share information about exchanged quantities. This nevertheless allows each Local Diagnoser to use its own modeling language and assumptions, and even to internally use a different ontology if needed, provided that it is able to map it onto the common one during communications.

3 Models

The approach we propose focuses on Qualitative Models; in this paper we will in particular deal with deviation models [Malik and Struss, 1996], although we will briefly describe how our proposal can be applied to other qualitative models as well. It is worth noting that deviation models proved to be successful in several applications to technical domains (e.g., [Sachenbacher et al., 2000; Picardi et al., 2002; 2004]).

[Figure 1: The Landing Gear section of an Aircraft System]
As we discussed in the previous section, the Supervisor is not aware of specific subsystem models; however, it is aware of their existence and, in order to coordinate the Local Diagnosers and combine the information they provide, it must make some assumptions on the modeling ontology. In this section we discuss these assumptions. First of all, as is common in a large part of model-based diagnosis, we consider component-oriented models, where each component has one correct behavior and one or more faulty behaviors. Each behavior of a component is modeled as a relation over a set of variables. Variables that represent quantities exchanged with other components are either input or output variables. The complete system is specified as a set of assignments of an output variable of one component to an input variable of another one. By Qualitative Model (of a physical system) we mean a model where all variables have a finite, discrete domain, thus representing either a discretization of the corresponding physical quantity, or an abstraction over some of its properties. This implies that, in a qualitative model, the relation that defines a component or system can be expressed extensionally (although in most cases this is not desirable). In this context, each component C can be represented as a unique relation over a set of variables by introducing a distinguished mode variable C.m, that ranges over the set of behavior modes {ok, ab_1, ..., ab_n} of the component itself. As mentioned above, in this paper we particularly focus on deviation models, which means that:
- Each physical quantity q represented in the model has a corresponding variable Δq representing a qualitative abstraction of the deviation of q with respect to its expected value. Common domains for deviation variables are {ok, ab}, expressing whether the value of q is normal or not, and {−, 0, +}, expressing whether q is lower than, equal to, or higher than its expected value. Here we will take this last option; we will therefore be discussing sign-based deviation models.
- Each behavior model expresses relations among deviation variables. For example, an electrical model might state that if the resistance in a circuit is higher than expected, then the intensity of the current flowing in the circuit is lower than expected.

We will now introduce a running example that we will use throughout the paper to illustrate the approach. Figure 1 shows a simplified part of an Aircraft System, namely the part concerned with the Landing Gear.

The picture represents a high-level view of the system: we do not see individual components, but interacting subsystems. Subsystems pictured with a dashed line do not directly take part in the example, but are shown for the sake of completeness. The Hydraulic Extraction System (HES) creates a pressure that mechanically pushes the Landing Gear, thereby extracting it. The HES is also connected to some LEDs in the cockpit that show whether the subsystem is powered up or not. In order to create the pressure, the HES takes power from two sources. The main source is the Power Transmission from the aircraft engine, which actually powers up the main pump of the HES. A secondary source (used to transmit the pilot command from the cockpit, and to light up the LEDs) is the Independent Power Supply, which produces a low-amperage DC current. We will detail parts of this example in the following, as needed to explain the different aspects of the decentralized diagnostic process.

4 Decentralized Diagnostic Process

In this section we describe how the diagnostic process is carried out. As already mentioned, a Supervisor coordinates the activities of several Local Diagnosers {LD_1, ..., LD_n}. In order to obtain a loosely coupled integration, the Supervisor assumes that each LD_i is stateless, that is, every invocation of a Local Diagnoser is independent of the others. Moreover, the Supervisor does not make any assumption on the implementation of each Local Diagnoser. The interaction is thus defined by:
- an interface that each Local Diagnoser must implement, discussed in section 4.1; the role of Local Diagnosers consists in generating hypotheses consistent with their model and observations;
- an algorithm for the Supervisor that computes diagnoses by coordinating the Local Diagnosers through the above interface, explained in section 4.2; the role of the Supervisor is to propagate hypotheses to the neighbors of a Local Diagnoser.

Figure 2 shows an example of this architecture for the system in Figure 1.

[Figure 2: An example of decentralized architecture]

[Figure 3: A closer view of the Hydraulic Extraction System]

4.1 The Extend interface

We assume that each Local Diagnoser LD_i reasons on a model M_i of its subsystem S_i, according to what we discussed in section 3. From the point of view of the Supervisor and its interactions with Local Diagnosers, each model M_i is a relation over a set of variables where:
- mode variables, denoted by S_i.m, express the behavior mode of components in S_i; each of them has thus a different domain;
- input/output variables express deviations of quantities that S_i exchanges with other subsystems.

Of course a model M_i may include additional (internal, and private) variables, and may not be expressed at all as an extensional relation. This is however hidden in the implementation of each Local Diagnoser. The interface between the Supervisor and the Local Diagnosers consists of a single operation called EXTEND that the Supervisor invokes on the Local Diagnosers. Moreover, upon an abnormal observation, Local Diagnosers autonomously execute EXTEND and send the results to the Supervisor.
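In programming terms, the interface can be pictured as a single method. The following sketch is an assumption of ours (the paper does not fix an implementation language); it uses a Python Protocol, with hypotheses as partial assignments, i.e. dicts from variable names to qualitative values:

```python
# A minimal sketch (assumed, not from the paper) of the EXTEND interface.
from typing import Protocol

Assignment = dict[str, str]   # partial assignment: variable name -> value

class LocalDiagnoser(Protocol):
    def extend(self, hypotheses: list[Assignment]) -> list[list[Assignment]]:
        """Return, for each input hypothesis, its minimal admissible
        extensions w.r.t. the local model and observations, restricted to
        input, output and mode variables; an empty list rejects it."""
        ...
```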
The goal of EXTEND is to explain and/or verify the consistency of observations and/or hypotheses made by other Local Diagnosers. Thus, the input to EXTEND when invoked on LD_i is a set of hypotheses on the values of input/output variables of M_i. Such hypotheses are represented as a set of partial assignments over the variables of interest (a partial assignment over a set of variables V is any assignment α whose domain Dom(α) is a subset of V). In the particular case where EXTEND is autonomously executed by a Local Diagnoser upon receiving an abnormal observation from its subsystem, the set of hypotheses contains only the empty assignment (i.e. an assignment with an empty domain). For each hypothesis α received in input, LD_i must first of all check whether α is consistent with the local diagnostic model M_i and the local observations ω_i. Then LD_i must extend α in two directions: backwards, to verify whether that hypothesis needs further assumptions to be supported within M_i and ω_i, and forward, to see whether the hypothesis has consequences on other subsystems that may be verified in order to discard or confirm it. Let us consider as an example the Hydraulic Extraction System, a (simplified) close-up view of which is depicted in Figure 3. Suppose the Local Diagnoser LD_3, associated to the HES, receives in input an assignment that assigns the value − to

its output variable HES.ΔE, representing the mechanical extraction of the Landing Gear. This value means that the Landing Gear did not extract when expected. LD_3 may see that, with respect to the local model, for that piece of data to be less than expected, one of the following must have happened:
- an internal component has failed, for example the pump is stuck or the pipe is leaking;
- one of the two inputs of the pump is wrong, for example the pump is not powered or it has not received the command.

This means that the partial assignment HES.ΔE = − can be extended in four ways: by assigning HES.m = stuck, or HES.m = leaking, or HES.ΔV = −, or HES.ΔC = −. What is important regarding extensions is that they should include everything that can be said under a given hypothesis, but nothing that could be said also without it. In other words, we are interested in knowing whether a given assignment constrains other variables more than the model alone does. The following definition captures and formalizes this notion (the definitions in this section are rephrased from [Ardissono et al., 2005]).

Def. 1 Let M_i be a local model with local observations ω_i, and let γ be an assignment. We say that γ is admissible with respect to M_i and ω_i if:
i. γ is an extension of ω_i;
ii. γ is consistent with M_i;
iii. if M_{i,γ} is the model obtained by constraining M_i with γ, its restriction to the variables not assigned in γ is equivalent to the restriction of M_i itself to the same variables: M_{i,γ}|VAR(M_i)\Dom(γ) ≡ M_i|VAR(M_i)\Dom(γ).

Thus, for each assignment α received in input, the EXTEND operation of a Local Diagnoser LD_i returns a set of assignments E_α that contains all minimal admissible extensions of α with respect to M_i and ω_i, restricted to input, output and mode variables of M_i. Notice that, whenever α is inconsistent with the observations and/or the model, E_α is empty. Minimal admissibility avoids unnecessary commitments on values of variables that are not concretely constrained by the current hypothesis.

In order to illustrate these definitions, let us consider a (again, simplified) model M_P of the pump P alone. The model includes four variables: P.m represents the behaviour mode, P.ΔV the power supply to the pump, P.ΔC the command that turns on the pump, and P.ΔF the flow coming out of the pump. The extensional representation of M_P enumerates, for each mode in {ok, stuck} and each combination of P.ΔC and P.ΔV in {−, 0, +}, the admitted values of P.ΔF.

[Table: extensional representation of M_P]

Now suppose we execute EXTEND on M_P alone, having in input an assignment α such that Dom(α) = {P.ΔF} and α(P.ΔF) = −. Let us first of all verify that α itself is not admissible, by computing M_P|{P.m, P.ΔC, P.ΔV} and M_{P,α}|{P.m, P.ΔC, P.ΔV}: the former contains mode/input combinations that the latter does not, so condition iii of Def. 1 is violated.

[Tables: M_P|{P.m, P.ΔC, P.ΔV} and M_{P,α}|{P.m, P.ΔC, P.ΔV}]

It is easy to see that the minimal admissible extensions of α w.r.t. M_P are the following:

γ_1 = {P.m = stuck, P.ΔF = −}
γ_2 = {P.ΔV = −, P.ΔF = −}
γ_3 = {P.ΔC = −, P.ΔF = −}

γ_1, γ_2 and γ_3 express the three possible explanations for ΔF = −: either the pump is stuck, or it is not powered, or it did not receive the command. For each of the three assignments, unassigned variables are those whose value is irrelevant to the explanation.
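Since a qualitative model can in principle be represented extensionally, the minimal admissible extensions of Def. 1 can be computed by brute force on small models. The following is a minimal sketch of this computation; the ok/stuck behaviors encoded below are a hypothetical reconstruction of the pump model (the paper's exact table is not reproduced here), and the local observations ω are taken to be empty:

```python
# Brute-force sketch (assumed, not the authors' implementation) of EXTEND's
# core: computing the minimal admissible extensions of a partial assignment
# over a small extensional pump model with empty local observations.
from itertools import product

SIGNS = ("-", "0", "+")
VARS = ("m", "dC", "dV", "dF")   # mode, command, power and flow deviations

def pump_rows():
    """Extensional relation M_P: all consistent total assignments."""
    rows = []
    for dC, dV in product(SIGNS, SIGNS):
        # ok mode: flow follows the worst input deviation (assumed behavior)
        dF = "-" if "-" in (dC, dV) else ("+" if "+" in (dC, dV) else "0")
        rows.append({"m": "ok", "dC": dC, "dV": dV, "dF": dF})
        # stuck mode: no flow regardless of the inputs (assumed behavior)
        rows.append({"m": "stuck", "dC": dC, "dV": dV, "dF": "-"})
    return rows

M_P = pump_rows()

def restrict(rows, hidden):
    """Project rows on the variables not in `hidden` (Def. 1, item iii)."""
    return {tuple((v, r[v]) for v in VARS if v not in hidden) for r in rows}

def admissible(gamma):
    sel = [r for r in M_P if all(r[v] == x for v, x in gamma.items())]
    if not sel:                   # item ii: consistency with M_P
        return False
    # item iii: constraining with gamma must not restrict the other variables
    return restrict(sel, gamma) == restrict(M_P, gamma)

def extend(alpha):
    """All minimal admissible extensions of the partial assignment alpha."""
    doms = {"m": ("ok", "stuck"), "dC": SIGNS, "dV": SIGNS, "dF": SIGNS}
    cands = []
    for combo in product(*[(None,) + doms[v] for v in VARS]):
        g = {v: x for v, x in zip(VARS, combo) if x is not None}
        if all(g.get(v) == x for v, x in alpha.items()) and admissible(g):
            cands.append(g)
    return [g for g in cands      # keep only the minimal extensions
            if not any(set(h.items()) < set(g.items()) for h in cands)]

print(extend({"dF": "-"}))
# -> the three minimal explanations: {'dV': '-', 'dF': '-'},
#    {'dC': '-', 'dF': '-'} and {'m': 'stuck', 'dF': '-'}
```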
4.2 The Supervisor algorithm

The Supervisor is started by a Local Diagnoser LD_first that sends to it the results of an EXTEND operation executed as a consequence of an abnormal observation. During its computation, the Supervisor records the following information:
- a set H of assignments, representing the current hypotheses;
- for each assignment α and for each variable x ∈ Dom(α), a modified bit. By mdf(α(x)) we will denote the value of this bit for variable x in assignment α.

Moreover, we assume that, given an input or output variable x that has been mentioned by one of the Local Diagnosers, the Supervisor is able to determine the variable conn(x) connected to it. We do not make any assumption on whether this information is known a priori by the Supervisor, or is retrieved dynamically, or is provided by the Local Diagnoser itself. The Supervisor initializes its data structures with the results of the initial EXTEND operation that has awakened it:
- H contains all the assignments that were sent by LD_first as the result of EXTEND;
- for each α ∈ H, and for each x ∈ Dom(α), the Supervisor extends α to the variable conn(x) connected with x by assigning α(conn(x)) = α(x). Then the Supervisor sets mdf(α(conn(x))) = 1.

Modified bits are used by the Supervisor to understand whether it should invoke a Local Diagnoser or not. The basic rule it uses is the following:

Basic Rule. If a subsystem S_i has a variable x with mdf(α(x)) = 1 for some α, then LD_i is a candidate for invocation and α should be passed on to EXTEND.

There are however two exceptions to this rule, which allow the number of invocations to be reduced:

Vertical Exception. Given a variable x belonging to a subsystem S_i, if all assignments give the same value to x, and either the value is 0 or x is an input variable, then x and its modified bits should not count towards deciding whether LD_i should be invoked or not. This exception avoids a Local Diagnoser being invoked to verify/discard something that has already been assessed to be true. If however the assessed truth concerns an abnormal value for one of S_i's outputs, then LD_i is still asked to give an explanation. This exception is referred to as vertical because it considers the value of a variable throughout different assignments, that is, a column in the assignments table representing H.

Horizontal Exception. Given an assignment α and a subsystem S_i, if mdf(α(x)) = 1 implies α(x) = 0, then α and its modified bits should not count towards deciding whether LD_i should be invoked or not. If LD_i is nevertheless invoked, α should be passed on to EXTEND only if LD_i has never been invoked before. This exception avoids a Local Diagnoser being invoked only to verify that "everything ok" is consistent with its model (which is trivially true). It is called horizontal because it considers the values of different variables in the same assignment, that is, a row in the assignments table representing H.

Notice that when deciding whether the Horizontal Exception can be applied, only variables with modified bits set to 1 are considered. In order to apply it, we do not need the whole assignment to assign value 0, but only the part of it that has never been examined by the proper Local Diagnoser. Thus, if a Local Diagnoser LD_i has already been invoked, it is never invoked again until one of the variables in S_i is assigned a value different from 0. Thanks to the definition of EXTEND, we do not have to worry that a newly assigned 0 is inconsistent with a previously extended assignment.

After initializing its data structures, the Supervisor loops over the following steps:

Step 1: select the next LD_i to invoke. The Supervisor selects one of the Local Diagnosers LD_i for which there is at least one assignment α meeting the Basic Rule with no Exception. If there is none, the loop terminates.

Step 2: invoke EXTEND. If LD_i has never been invoked before in this diagnostic process, then the input to EXTEND is the set of all assignments in H, restricted to the variables of M_i. Otherwise the input consists only of those assignments α that meet the Basic Rule but not the Horizontal Exception.

Step 3: update H. The Supervisor receives the output of EXTEND from LD_i. For each α in input, EXTEND has returned a (possibly empty) set of extensions E_α = {γ_1, ..., γ_k}. Then α is replaced in H by the set of assignments {β_1, ..., β_k}, where β_j is obtained by:
- combining α with each γ_j ∈ E_α;
- extending the result of this combination to connected variables, so that for each x ∈ Dom(γ_j) representing an input/output variable, β_j(conn(x)) = β_j(x).

This implies that rejected assignments, having no extensions, are removed from H.

Step 4: update the mdf bits.
For each assignment β_j added in Step 3, the mdf bits are set as follows: (i) for each variable x not belonging to M_i such that x ∈ Dom(β_j) and x ∉ Dom(α), mdf(β_j(x)) is set to 1; (ii) for each variable x belonging to M_i such that x ∈ Dom(β_j), mdf(β_j(x)) is set to 0; (iii) for any other variable x ∈ Dom(β_j), mdf(β_j(x)) = mdf(α(x)) (variables in Dom(β_j) but not in Dom(α) are all covered by the first two cases).

Notice that the diagnostic process terminates: new requests for EXTEND are generated only if assignments are properly extended, but assignments cannot be extended indefinitely. At the end of the loop, the assignments in H provide consistency-based diagnoses as follows:

Def. 2 Let α be an assignment in H at the end of the diagnostic process. The diagnosis associated with α is the assignment Δ(α) such that:
- Dom(Δ(α)) is the set of all mode variables of all involved models;
- for each x ∈ Dom(α) ∩ Dom(Δ(α)), Δ(α)(x) = α(x);
- for each x ∈ Dom(Δ(α)) \ Dom(α), Δ(α)(x) = ok.

Of these diagnoses, non-minimal ones are discarded. It can be proved that this algorithm computes all minimal consistency-based diagnoses with respect to the model and observations of the part of the system involved in the diagnostic process.

Notice that in the Supervisor algorithm the role of mode variables is rather peculiar. Since these variables are local to a given subsystem, the Supervisor never communicates their value to a Local Diagnoser different from the one originating them. Thus, they are not needed for cross-consistency checks. There are two main reasons why the Supervisor needs to be aware of them:
- They provide the connection between two different invocations of the same Local Diagnoser; by having the Supervisor record them and send them back on subsequent calls, we allow the Local Diagnosers to be stateless.
- The Supervisor needs them to compute globally minimal diagnoses.
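As a rough operational summary, the Supervisor loop can be sketched as follows. This is an assumed reconstruction, not the paper's implementation: the Vertical and Horizontal Exceptions are folded into a single simplified trigger (a modified bit on an abnormal value), and the structural information conn and owner is given as plain dictionaries.

```python
def propagate(a, conn):
    """Copy an assignment's values to the connected variables conn(x)."""
    out = dict(a)
    for x, v in a.items():
        if x in conn:
            out[conn[x]] = v
    return out

def restrict(a, owner, name):
    """Restrict an assignment to the variables of one subsystem."""
    return {x: v for x, v in a.items() if owner.get(x) == name}

def supervise(diagnosers, owner, conn, first, first_results):
    """diagnosers: name -> EXTEND function (list[dict] -> list[list[dict]]);
    owner: variable -> subsystem name; conn: variable -> connected variable;
    first, first_results: awakening diagnoser and its EXTEND output."""
    H = [(propagate(a, conn), {conn[x] for x in a if x in conn})
         for a in first_results]         # pairs (hypothesis, mdf-1 vars)
    invoked = {first}
    while True:
        # Step 1: find diagnosers triggered by a modified abnormal value
        trig = {}
        for i, (a, mdf) in enumerate(H):
            for x in mdf:
                if a[x] != "0" and owner.get(x):
                    trig.setdefault(owner[x], set()).add(i)
        if not trig:
            break
        name, idxs = next(iter(trig.items()))
        # Step 2: a first invocation receives all of H, later ones only
        # the triggering hypotheses
        sel = list(range(len(H))) if name not in invoked else sorted(idxs)
        invoked.add(name)
        ext = diagnosers[name]([restrict(H[i][0], owner, name) for i in sel])
        # Steps 3-4: replace each input by its extensions, propagate values
        # to connected variables and update the modified bits
        newH = [h for i, h in enumerate(H) if i not in set(sel)]
        for i, gammas in zip(sel, ext):
            alpha, mdf = H[i]
            for g in gammas:
                beta = propagate({**alpha, **g}, conn)
                bits = {x for x in beta
                        if x not in alpha and owner.get(x) != name}
                bits |= {x for x in mdf if owner.get(x) != name}
                newH.append((beta, bits))
        H = newH   # hypotheses with no extension are thereby rejected
    return [a for a, _ in H]
```

Termination relies, as in the algorithm above, on EXTEND returning proper extensions only: the bits owned by the invoked diagnoser are cleared at each call, and new bits appear only on newly assigned connected variables.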

Notice however that, for both of these goals, the value of mode variables need not be explicit: in other words, the Local Diagnosers may associate to each fault mode a coded id and use it as a value for mode variables. In this way, information on what has happened inside a subsystem can remain private.

Let us conclude this section with an example of execution of the Supervisor algorithm on the system in Figure 1. The names of the subsystems are shortened as follows: Eng will denote the Engine; DCP the DC Power; PT the Power Transmission; HES the Hydraulic Extraction System; LG the Landing Gear; LED the LEDs. Moreover, we will write S.x to denote variable x belonging to subsystem S, and S.m will denote a variable summarizing the behaviour mode of components in S.

Let us assume that LG observes a − value for LG.ΔE. It autonomously invokes EXTEND, whose output contains only the assignment α_0 = {LG.ΔE = −} (LG.m is left unassigned).

The Supervisor receives this assignment and accordingly initializes its data structures. Figure 4 shows how H changes during the execution of the Supervisor algorithm; in particular, H_0 corresponds to H immediately after initialization. At this stage the only candidate for invocation is LD_3, responsible for the HES. Its input consists of assignment α_0 restricted to HES variables, i.e. {HES.ΔE = −}, which (as already discussed in section 4.1) is extended by LD_3 into: α_01 = {HES.m = leak, HES.ΔE = −}; α_02 = {HES.m = stuck, HES.ΔE = −}; α_03 = {HES.ΔE = −, HES.ΔV = −}; α_04 = {HES.ΔE = −, HES.ΔC = −}.

The Supervisor receives these results and updates H, obtaining H_1 depicted in Figure 4. Now there are two candidates for invocation, LD_2 and LD_1, respectively responsible for the DCP and for the PT. They can be invoked in any order; let us assume that LD_2 is invoked first. Since it is the first invocation, its input consists of all the assignments in H restricted to DCP variables. For the sake of the example, let us assume that the DCP can produce a − on ΔC only by being failed itself, and that as a consequence of this failure also ΔX should be − as well. Then LD_2 extends α_04 (whose restriction is {DCP.ΔC = −}) with {DCP.m = fail, DCP.ΔC = −, DCP.ΔX = −}, while α_0[1,2,3] are left unchanged.

The Supervisor therefore updates H; the result is the set H_2. Now it is LD_1's turn. In this case, we assume that the PT has observed that its input ΔT from the engine is normal. As a consequence, it can explain the − on ΔV only with an internal failure. Thus LD_1 extends α_03 (restricted to {PT.ΔV = −}) with {PT.m = fail, PT.ΔV = −, PT.ΔT = 0}, while each of α_0[1,2,4] is simply extended with the observation {PT.ΔT = 0}.

The set of hypotheses is updated accordingly and the Supervisor obtains H_3. The only candidate for invocation is now LD_3 because, although Eng has some modified bits set to 1, it falls under both Exceptions. Thus LD_3 is invoked again, only on the modified assignment, that is α_04, restricted to HES variables. Now, let us assume that from the point of view of the HES, having ΔX = − necessarily implies that also ΔY = −. The result of EXTEND is then α_04 extended to {HES.ΔE = −, HES.ΔC = −, HES.ΔX = −, HES.ΔY = −}.

H now becomes H_4 in Figure 4. Now LD_4 is invoked; let us assume that the observations on LED report that everything is ok with the LEDs. What happens is that one of the two input assignments is extended, while the other is rejected as inconsistent: the inputs are α_0[1,2,3] (with no constraints on LED variables) and α_04, whose restriction is {LED.ΔY = −}. Since the LEDs are observed to be ok, α_04 is rejected, while each of α_0[1,2,3] is extended with {LED.ΔY = 0}.

Updating H, the Supervisor obtains H_5.
Notice that, due to the Exceptions, there is no Local Diagnoser left to invoke; thus the diagnostic process ends here. The minimal diagnoses are:
- HES.m = leak: the pipe of the Hydraulic Extraction System is leaking (from α_01);
- HES.m = stuck: the pump of the Hydraulic Extraction System is stuck (from α_02);
- PT.m = fail: the Power Transmission is broken (from α_03).

4.3 Other types of models

We briefly discuss here how the approach can be applied to models different from the sign-based deviation models used in our explanation. In general, the algorithm can be applied as-is to any type of deviation model where there is only one value associated with the nominal behavior. Models with variable domains such as {ok, ab} or {−−, −, 0, +, ++} belong to this category. In order to apply the approach to a generic qualitative model, we need to modify the Supervisor algorithm by:
- discarding the Horizontal Exception;
- modifying the Vertical Exception, so that it applies only to input variables.

It is clear that giving up the Horizontal Exception means increasing the number of invocations of the Local Diagnosers. What matters, however, is whether these invocations are useful for the diagnostic process; in other words, whether the level of abstraction in the model is suitable for the diagnostic task. For mixed models, that is, models that combine deviation variables with other types of variables, it may nevertheless be possible to define other exceptions to the Basic Rule, more restrictive than the ones described here, that allow the number of invocations to be reduced. However, each type of model should be analyzed separately to discover whether this optimization is possible.

[Figure 4: The set H in different stages (H_0 to H_5) of the Supervisor algorithm, during a diagnosis for the system in Figure 1]

5 Conclusions

In this paper we discuss a supervised decentralized approach to diagnosis that allows multiple diagnosers to be loosely coupled, in order to deal with architecturally complex systems. The goal of the paper is to show that we can effectively perform the diagnostic task and define intelligent strategies for the Supervisor, even if it is not aware of the internal aspects of the Local Diagnosers, the Local Diagnosers do not know each other, and their paths of interaction change dynamically.

In this paper we do not face the problem of the implementation of the Local Diagnosers. The discussion of efficient algorithms that execute the EXTEND operation is out of the scope of this paper, especially since such algorithms would strongly depend on the modeling language and specific assumptions of each individual Local Diagnoser. It is worth noting, however, that the decentralized approach can be hierarchically applied to subsystems, possibly having the same program play both the role of the Supervisor and of a Local Diagnoser. In this case the advantage of the decentralized approach would be, rather than information hiding, to exploit the Supervisor to choose which parts of the model should be analyzed. At the lowest level (i.e., the level of basic components) this would still require executing EXTEND directly; however, for small models the results of this operation could easily be pre-compiled. Hierarchical scalability is indeed an important feature of our approach, which allows us to integrate hierarchical decomposition as a further means to deal with the complexity of the systems to be diagnosed.

As we already mentioned, the approach proposed in this paper is a generalization of the one proposed in [Ardissono et al., 2005] to deal with diagnosis of Web Services. In particular, the definition of the EXTEND interface is borrowed from [Ardissono et al., 2005], while the Supervisor algorithm is generalized in order to loosen the assumptions and still produce all minimal consistency-based diagnoses. If we consider other approaches in the literature, we see that [Pencolé and Cordier, 2005] has some relevant similarity from the point of view of the diagnostic architecture: a supervisor is in charge of computing global diagnoses exploiting the work of several local diagnosers.
However, this work differs in some significant respects: the system to diagnose is modeled as a discrete-event system; the main problem is to avoid composing the whole model, because this would produce a state-space explosion; as a consequence, the supervisor is aware of the subsystem models, but cannot compose them: it can only compose local diagnoses to obtain a global one; due to the nature of the considered systems, reconstructing global diagnoses is a difficult task, and as such it is one of the main focuses of the paper. Thus [Pencolé and Cordier, 2005] actually focuses on quite

different theoretical and practical problems than those addressed in this paper.

Other papers in the literature deal with completely distributed approaches to diagnosis, where diagnosers communicate with each other and try to reach an agreement, without a supervisor coordinating them. Our choice of a supervised approach was motivated by the need for loosely coupled diagnosers, while a purely distributed approach requires diagnosers to establish diagnostic sessions and exchange a greater amount of information in order to compute diagnoses that are consistent with each other. Nevertheless, the distributed approach proposed in [Roos et al., 2003] has some similarity with ours, because it is based on the idea of diagnosers explaining blames received from others, or verifying hypotheses made by others. Needless to say, the proposed algorithms are completely different, having to deal with a distributed approach. Moreover, in order to reduce computational complexity, the authors introduce some fairly restrictive assumptions on the models (e.g. requiring that explanations imply normal observations, or that output values are either completely undetermined or completely determined), which are not acceptable in our case. Another distributed approach on a similar basis is the one in [Kalech and Kaminka, 2004]; however, in this case the focus is on diagnosing failures in the communication among agents in a team, a problem the authors refer to as social diagnosis. Thus the tackled problem is different from the one we cope with, since it deals with failures arising in the communication between subsystems with different sets of beliefs, rather than with failures happening inside a subsystem and propagating to others. Finally, an interesting distributed approach is tackled in [Provan, 2002]. Here the scenario is rather similar to ours: each diagnoser has a local model and observations, which are not shared with the others. Each local diagnoser computes local minimal diagnoses, and then engages in a conversation to reach a globally sound diagnosis. However, the approach being purely distributed, the solutions are different. In particular, in order to propose a solution with reduced complexity, the author focuses on systems whose connection graph has a tree-like structure. On the contrary, our approach (thanks to the presence of the Supervisor) does not need to make any assumption on the system structure; in fact, the system structure may even change dynamically.

References

[Ardissono et al., 2005] L. Ardissono, L. Console, A. Goy, G. Petrone, C. Picardi, M. Segnan, and D. Theseider Dupré. Cooperative model-based diagnosis of web services. In Proceedings of the Sixteenth International Workshop on the Principles of Diagnosis (DX'05), Pacific Grove, California, 2005.

[Genesereth, 1984] M.R. Genesereth. The use of design descriptions in automated diagnosis. Artificial Intelligence, 24(1-3): –, 1984.

[Kalech and Kaminka, 2004] M. Kalech and G.A. Kaminka. Diagnosing a team of agents: scaling-up. In Proc. 15th Int. Work. on Principles of Diagnosis (DX-04), pages –, 2004.

[Malik and Struss, 1996] A. Malik and P. Struss. Diagnosis of dynamic systems does not necessarily require simulation. In Proc. 7th Int. Work. on Principles of Diagnosis, pages –, 1996.

[Mozetic, 1991] I. Mozetic. Hierarchical model-based diagnosis. Int. J. of Man-Machine Studies, 35(3): –, 1991.

[Pencolé and Cordier, 2005] Y. Pencolé and M.-O. Cordier. A formal framework for the decentralised diagnosis of large scale discrete event systems and its application to telecommunication networks. Artificial Intelligence, 164(1-2), 2005.

[Picardi et al., 2002] C. Picardi, R. Bray, F. Cascio, L. Console, P. Dague, O. Dressler, D. Millet, B. Rehfus, P. Struss, and C. Vallée. IDD: Integrating Diagnosis in the Design of automotive systems. In Proceedings of the 15th European Conference on Artificial Intelligence (ECAI2002), pages –, 2002.

[Picardi et al., 2004] C. Picardi, L. Console, F. Berger, J. Breeman, T. Kanakis, J. Moelands, E. Arbaretier, S. Collas, N. De Domenico, E. Girardelli, O. Dressler, P. Struss, and B. Zilbermen.
AUTAS: a tool for supporting FMECA generation in aeronautic systems. In Proceedings of the 16th European Conference on Artificial Intelligence (ECAI2004), 2004.

[Provan, 2002] G. Provan. A model-based diagnostic framework for distributed systems. In Proc. 13th Int. Work. on Principles of Diagnosis (DX-02), pages 16-22, 2002.

[Roos et al., 2003] N. Roos, A. ten Teije, and C. Witteveen. A protocol for multi-agent diagnosis with spatially distributed knowledge. In 2nd Int. Joint Conference on Autonomous Agents and Multi-Agent Systems (AAMAS-2003), Melbourne, Australia, July 2003.

[Sachenbacher et al., 2000] M. Sachenbacher, P. Struss, and R. Weber. Advances in design and implementation of OBD functions for diesel injection systems based on a qualitative approach to diagnosis. In SAE 2000 World Congress, 2000.

Comparing diagnosability in Continuous and Discrete-Event Systems

Marie-Odile Cordier
INRIA Rennes, France

Louise Travé-Massuyès and Xavier Pucel
LAAS-CNRS, Toulouse, France

Abstract

This paper is concerned with diagnosability analysis, which proves a requisite for several tasks during a system's life cycle. The Model-Based Diagnosis (MBD) community has developed specific approaches for Continuous Systems (CS) and for Discrete Event Systems (DES) in two distinct and parallel tracks. In this paper, the correspondences between the concepts used in the CS and DES approaches are clarified, and it is shown that the diagnosability problem can be brought back to the same formulation using the concept of signatures. These results bridge CS and DES diagnosability and open perspectives for hybrid model-based diagnosis.

1 Introduction

Diagnosis is an increasingly active research domain, which can be approached from different perspectives according to the type of system at hand and the required abstraction level. Although some recent works have considered diagnosis based on hybrid models [Williams and Nayak, 1996; Bénazéra et al., 2002; Bénazéra and Travé-Massuyès, 2003; Gupta et al., 2004], the Model-Based Diagnosis (MBD) community has developed specific approaches for Continuous Systems (CS) and for Discrete Event Systems (DES) in two distinct and parallel tracks. Algorithms for monitoring, diagnosis and diagnosability analysis have been proposed [Sampath et al., 1995; Jiang et al., 2001; Yoo and Lafortune, 2002; Cimatti et al., 2003; Rozé and Cordier, 2002; Jeron et al., 2006; Patton et al., 1989; Staroswiecki and Comtet-Varga, 1999; Frisk et al., 2003; Struss and Dressler, 2003]. The formalisms and tools are quite different: the CS community makes use of algebro-differential equation models or qualitative abstractions, whereas the DES community uses finite-state formalisms. For diagnosability analysis, the CS approaches generally adopt a state-based diagnosis point of view, in the sense that diagnosis is performed on a snapshot of observables, i.e. one observation at a given time point. The DES approaches perform event-based diagnosis and achieve state tracking, which means dynamic diagnosis reasoning achieved across time.

This paper is concerned with diagnosability analysis, which proves a requisite for several tasks during the system's life cycle, in particular instrumentation design, end-of-line testing, testing for diagnosis, etc. In spite of quite different frameworks, it is shown that the diagnosability assessment problem stated on both sides can be brought back to the same formulation, and that common concepts can be proposed for proving the diagnosability definitions equivalent. This result provides solid ground for considering the analysis of hybrid systems diagnosability.

2 DES and CS modelling approaches

This section presents the different theories used to model DESs and CSs. The principles underlying DES and CS model-based diagnosis are given, and diagnosability is introduced on both sides. Both approaches rely on the analysis of the observable consequences of faults, i.e. symptoms. The main difference between the DES and CS diagnosability analysis processes is that the order of appearance of the symptoms is only taken into account in the DES approach. In the CS approach, fault occurrence assumes immediate and simultaneous observation of the symptoms, while DES diagnosis relies on the observation of a sequence of symptoms after the fault occurrence.
Proof is given that, assuming the system is observed for a sufficiently long time, the diagnosability conditions for DES and CS are conceptually equivalent.

2.1 The models

DES model. A DES is modelled by a language L_sys ⊆ E*, where E is the set of system events. L_sys is prefix-closed, and can be described by a regular expression, or generated by a finite state automaton G = (Q, E, T, q_0), where Q is the set of states, E the set of events, T ⊆ Q × E × Q the transition relation and q_0 the initial state. Each trajectory in the automaton corresponds to one word of the language, and represents a sequence of events that may occur in the system. The set of events E is partitioned into observable and unobservable events: E = E_o ∪ E_uo, and a set of fault events E_f ⊆ E_uo is given. The diagnosis process aims at detecting and assessing the occurrence of unobservable fault events from a sequence of observed events. The set OBS is defined as the set of all the possible observable event sequences, i.e., OBS = {(e_1 e_2 ... e_n)} where n is any positive integer.
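As a small illustration of these definitions, the following sketch (assumed notation, not from the paper) represents a DES automaton in Python, together with the projection of a trajectory over the observable events:

```python
# Hypothetical DES with states Q, observable events E_o, unobservable
# events E_uo (containing the fault f1) and a transition function T.
Q = {"q0", "q1", "q2"}
E_o = {"o1", "o2"}
E_uo = {"f1"}
T = {("q0", "o1"): "q1",        # T : Q x E -> Q (deterministic)
     ("q0", "f1"): "q2",
     ("q1", "o1"): "q1",
     ("q2", "o2"): "q2"}
q0 = "q0"

def p_obs(trajectory):
    """Projection over E_o: remove all unobservable events."""
    return tuple(e for e in trajectory if e in E_o)

print(p_obs(("f1", "o2", "o2")))    # -> ('o2', 'o2')
```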

In this article, it is assumed that the automaton is deterministic (T : Q × E → Q is a function), generates a live language (every state has at least one outgoing transition), and contains no cycle of unobservable events. The diagnosis process makes use of a projection operation that removes all unobservable events from a trajectory. The inverse operation is applied to a set of observable event sequences and leads to the diagnoses. A fault is diagnosable when its occurrence is always followed by a bounded observable event sequence that cannot be generated in its absence (see definition 1).

CS model. The behavior model of a CS Σ = (R, V) is generally described by a set of n relations R, which relate a set of m variables V. In a component-oriented model, these relations are associated with the system's physical components, including the sensors. The set R is partitioned into behavioral relations, which correspond to the internal components, and observation relations, which correspond to the sensors. The set of variables V is also partitioned into the set of observed variables O, whose corresponding value tuples are called observations, and the set of unobserved variables, noted X. Observation values, possibly processed into fault indicators, provide a means to characterize the system at a given time. In a pure consistency-based approach, in which only the normal behavior of the system is modelled, the designer may use the model to establish a set of Analytical Redundancy Relations, which can be expressed as a set of residuals. In that case, the observations result in a boolean fault indicator tuple. In the following, we will refer without loss of generality to the observation tuples and define the set OBS as the set of all the possible observation tuples, i.e., OBS = {(o_1, o_2, ..., o_k)} where k is the number of sensors. The observation value pattern is referred to as the observed signature, whereas the expected value patterns for a given fault, obtained from the behavioral model, provide the fault signature. Note that several value patterns may correspond to the same fault, for example when the system undergoes several operating modes. The fault signature is hence defined as the set of all possible observable variable value tuples under the fault. The diagnosis process relies on comparing the observed signature with the fault signatures. Fault signatures also allow one to test fault detectability.

2.2 The set of observables

In the case of DES, observations consist in a sequence of observable events, while in the case of CS, observations consist in a set of values for observable variables, with no ordering. This paper focuses on comparing the notions based on observations that lead to diagnosability, making abstraction of the nature of the observations. It is shown that the concept of signatures can be defined in a way that allows the equivalence of the definitions to be proved. However, this does not imply that any system that is diagnosable when modelled as a DES is diagnosable as a CS, due to the difference in the nature of the observations. The set of observables OBS is defined as the set containing all the observations that are possible for the system. It may represent the observations obtained from a DES (a set of ordered observable events) as well as those from a CS (a set of observable values).

3 Faults, diagnoses and fault signatures

This section contains formal definitions of faults, diagnoses, and fault signatures. The definitions of diagnosability rely on these (see next section).
3.1 Faults and diagnoses

The set of faults F_sys associated with a system is partitioned into n types of faults; the partition is noted F. The following properties hold:
- ∀ F_i, F_j ∈ F, F_i ∩ F_j ≠ ∅ ⇒ i = j;
- ⋃_{i=1..n} F_i = F_sys.

The occurrence of one or several faults of one type is called a single fault. When faults of several types have occurred, the system is said to be under a multiple fault. The set of possible faults that may occur in a system is the power set of F, noted P(F). For example, ∅ describes the absence of faults, {F_i} a single fault, and {F_i, F_j} a multiple fault. All three examples are elements of P(F). Faults are assumed to be permanent. A diagnosis consists in a set of fault candidates. When a diagnosis contains only one fault, it is said to be determinate, while if it contains several faults it is indeterminate. The set of all possible diagnoses is the power set of the set of faults, noted P(P(F)). For example, {∅}, {{F_i}} and {{F_i, F_j}} are determinate diagnoses, while {∅, {F_i}, {F_j, F_k}} is an indeterminate diagnosis indicating that one of the three diagnosis candidates ∅, {F_i} and {F_j, F_k} has occurred.

3.2 Fault signatures

Establishing fault signatures is the main part of our diagnosability analysis process. This concept is commonly used in the CS approach, but less so in DES. The CS notion of fault signature is generalized and extended to DESs, allowing one to write the diagnosability criteria in a unified way. In a general way, one can consider a fault signature as a function Sig associating a set of observables to each fault:

Sig : P(F) → P(OBS)

Continuous systems. The fault signature is a classical concept in the CS approach, usually defined as follows. For a fault f of P(F), let OBS_f be the set of all possible tuples of observed variable values under the fault f (meaning that exactly all the faults in f have occurred, and no faults out of f have occurred), regardless of time. Then:

Sig(f) = OBS_f ∈ P(OBS)

Discrete event systems. Fault signatures are based upon the projection over observable events, which is defined first. It corresponds to what is usually known as observable trajectories in the DES community.

Language projection. The language projection over the set of observable events E_o, noted P_obs, applied to a language L, associates with L the language formed by the words of L restricted

to the letters that are elements of E_o. For example, if L = {e_1, e_1e_3, e_1e_2, e_2e_3, e_1e_2e_3} and E_o = {e_1, e_2}, then P_obs(L) = {e_1, e_1e_2, e_2}. The inverse projection P_obs⁻¹, defined on P(OBS), associates with a set of observable event sequences the set of trajectories (which is a language) whose projections belong to the antecedent set:

∀ O ∈ P(OBS), P_obs⁻¹(O) = { s ∈ L_sys | P_obs({s}) ⊆ O }

Fault language. For each fault f ∈ P(F), the f-language, or L_f, describes all possible trajectories in which f occurs. L_f is defined as the subset of the system automaton's language L_sys restricted to the words containing at least one occurrence of every single fault event composing f, and no occurrence of any other fault event. L_f describes all possible scenarios in which f occurs. The words of the f-language are called f-trajectories.

Fault signature. Because of our particular interest in diagnosability, among the set of f-trajectories we pay special attention to those that can be obtained when the observation temporal window can be arbitrarily extended. This is done by considering, in L_f, only words that end in an infinite cycle. They are defined as the maximal words, and form the maximal f-language L_f^max of the fault. Formally, a trajectory s of L_f belongs to L_f^max if and only if ∃ t, u ∈ E*, s = t·u^∞, where the notation u^∞ refers to the word built as the infinite concatenation of u, i.e., every finite power u^n is a prefix of u^∞. For each fault f ∈ P(F), the projection of the maximal f-language L_f^max over the set of observable events is called the f-signature. Any f-signature is a subset of OBS, as it is solely composed of observable events. With the above definitions, it is possible to define the signature function Sig as the function associating its f-signature with any fault f ∈ P(F):

∀ f ∈ P(F), Sig(f) = f-signature ∈ P(OBS)

4 Diagnosability

Formal definitions of diagnosability according to the DES and CS approaches are now given.

4.1 Discrete Event Systems

We rely here on the (strong) diagnosability definition as defined by [Sampath et al., 1995] (a definition of weak diagnosability is given in [Rozé and Cordier, 2002] for DES and in [Travé-Massuyès et al., 2004] for CS).

DES (strong) diagnosability: a DES is (strongly) diagnosable if and only if:

∀ F_i ∈ F, ∃ n_i ∈ ℕ, ∀ s ∈ L_sys / (F_i ∈ s), ∀ t ∈ E* / (st ∈ L_sys), |t| ≥ n_i ⇒ ∀ u ∈ P_obs⁻¹(P_obs(st)), F_i ∈ u    (1)

where the notation F_i ∈ s means that s contains at least one fault event of F_i. One can notice that the definition is stated with respect to elements of F. The system is required to be diagnosable for each fault type, independently of the fact that they are single or multiple faults.

4.2 Continuous systems

In the CS approach, the classical definition of diagnosability is already given in terms of the fault signature concept, as follows [Travé-Massuyès et al., 2004].

CS (strong) diagnosability: a CS is (strongly) diagnosable if and only if:

∀ f_1, f_2 ∈ P(F), f_1 ≠ f_2 ⇒ Sig(f_1) ∩ Sig(f_2) = ∅    (2)

This definition applies to single or multiple faults and differs from the DES definition in this respect. It is shown in the next section that this difference is not relevant and that the fault signature concept is a unifying concept allowing one to formally compare the two approaches.
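Definition (2) translates directly into a pairwise disjointness test on signatures. Below is a minimal sketch with made-up signatures (sets of residual value patterns, as in the CS approach); the fault names and patterns are illustrative only:

```python
# Direct transcription (sketch) of definition (2): a system is diagnosable
# iff the signatures of distinct faults are disjoint. Faults are frozensets
# of fault types; signatures are sets of observation tuples.
from itertools import combinations

Sig = {
    frozenset():             {("0", "0")},             # no fault
    frozenset({"F1"}):       {("1", "0"), ("1", "1")},
    frozenset({"F2"}):       {("0", "1")},
    frozenset({"F1", "F2"}): {("1", "1")},             # overlaps with {F1}
}

def diagnosable(Sig):
    return all(Sig[f1].isdisjoint(Sig[f2])
               for f1, f2 in combinations(Sig, 2))

print(diagnosable(Sig))   # False: Sig({F1}) and Sig({F1,F2}) share ("1","1")
```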
5 Formal comparison

In this section, we give the proof of equivalence between the diagnosability definitions in the DES and CS approaches. We first prove that the DES definition can be extended to multiple faults, which provides a better insight into the interpretation of the definition.

As noted before, definition (1) is stated for elements of F, which corresponds to considering single faults. Let us extend it to multiple faults. The occurrence of a multiple fault f in a trajectory s is noted: ∀ F_i ∈ f, F_i ∈ s. The diagnosability condition (1) is verified for each F_i ∈ f with possibly different n_i values. Taking the largest of all these n_i values as n_f, it can easily be shown that definition (1) is equivalent to definition (1'), which accounts explicitly for multiple faults f = {F_i}:

∀ f ∈ P(F), ∃ n_f ∈ ℕ, ∀ s ∈ L_sys / (∀ F_i ∈ f, F_i ∈ s), ∀ t ∈ E* / (st ∈ L_sys), |t| ≥ n_f ⇒ [ ∀ u ∈ P_obs⁻¹(P_obs({st})), ∀ F_i ∈ f, F_i ∈ u ]   (1')

This result shows that the DES diagnosability definition can be given in terms of faults (instead of fault types), whether single or multiple, like the CS diagnosability definition.

The equivalence between the diagnosability definitions is now proved by considering the assessment upon absence of faults in a diagnosable discrete event system. Let us consider a diagnosable system, thus verifying (1), and trajectories of arbitrary length, in particular maximal trajectories, which correspond to maximal words as defined in Section 3.2. Let us consider such a maximal trajectory s belonging to the f-language L_f. It means that s contains at least one occurrence of every single fault event composing f and no occurrence of any other fault. s thus belongs to L_f^max, and its projection over the set of observable events belongs to the f-signature. Now suppose that there exists a (maximal) trajectory u such that P_obs({u}) equals P_obs({s}) and that u contains at least one occurrence of a fault F_j which does not belong to f. By (1), all trajectories sharing the observable projection of u contain F_j, which contradicts our hypothesis about s. Thus, there does not exist any trajectory having the same observable projection as s and containing a fault not belonging to f. This proves that

∀ f_1, f_2 ∈ P(F), f_1 ≠ f_2 ⇒ Sig(f_1) ∩ Sig(f_2) = ∅,

which is exactly definition (2) given in Section 4.2 for continuous systems.

6 Operational comparison

This section contains an example that illustrates the concepts introduced before and compares the DES and CS approaches in an operational way. Bridges between state variables in the CS view and events in the DES view are provided, and diagnosability analysis is performed along both the state-based and the dynamic diagnosis approaches.

6.1 Example

[Figure 1: A water flow system, with a pump, two tanks of heights y_1 and y_2, consumers c_1 and c_2, and delays τ_1 and τ_2]

The system represented in Figure 1 is inspired by [Puig et al., 2005]. It is composed of two water tanks, with heights y_1 and y_2, and a pump, connected by a water flow channel. Both tanks supply consumers c_1 and c_2. The delays τ_1, respectively τ_2, correspond to the time needed for the water to reach tank 2 from tank 1, and tank 1 from the pump. The system has two operating modes: pump on and pump off. We consider faults in sensors y_1, y_2, c_1 and c_2, named respectively F_y1, F_y2, F_c1 and F_c2. The example is limited to single faults, and it is assumed that the system does not switch its operating mode between the occurrence of a fault and the apparition of its symptoms, in order to simplify the models of the system.

6.2 Continuous model, state-based diagnosis

The discretized and linearized dynamic equations are:

y_1(t + Δt) = y_1(t) − k_1 c_1(t) + k_2 u_pump(t − τ_2) − k_3 u_out(t)
u_out(t) = k √y_1(t) ≈ k_4 y_1(t)
u_pump = k [a(h − y_2)² + b(h − y_2) + c] ≈ k_5 + k_6 y_2(t)
y_2(t + Δt) = y_2(t) − k_7 c_2(t) + k_8 u_out(t − τ_1) − k_9 u_pump(t)

where Δt is the sampling time. u_pump being the flow through the pump, we can state that when the pump is off, u_pump(t) = 0, which can be achieved by choosing k_5 = k_6 = 0. From these equations, it is possible to predict the values of y_1 and y_2 with:

ŷ_1(t + Δt) = (1 − k_3 k_4) y_1(t) − k_1 c_1(t) + k_2 k_6 y_2(t − τ_2) + k_2 k_5
ŷ_2(t + Δt) = (1 − k_9 k_6) y_2(t) − k_7 c_2(t) + k_8 k_4 y_1(t − τ_1) − k_9 k_5

From the equations above, two consistency tests can be obtained in the form of analytical redundancy relations:

r_1(t + Δt) = y_1(t + Δt) − ŷ_1(t + Δt) = y_1(t + Δt) − [(1 − k_3 k_4) y_1(t) − k_1 c_1(t) + k_2 k_6 y_2(t − τ_2) + k_2 k_5]
r_2(t + Δt) = y_2(t + Δt) − ŷ_2(t + Δt) = y_2(t + Δt) − [(1 − k_9 k_6) y_2(t) − k_7 c_2(t) + k_8 k_4 y_1(t − τ_1) − k_9 k_5]

Using these analytical redundancy relations, and considering that k_5 and k_6 are null when the pump is off, we deduce the fault signature matrices shown in Figure 2.

Figure 2: Fault signature matrices for the system

Pump on mode:       F_y1  F_y2  F_c1  F_c2
r_1                  1     1     1     0
r_2                  1     1     0     1

Pump off mode:      F_y1  F_y2  F_c1  F_c2
r_1                  1     0     1     0
r_2                  1     1     0     1

The fault signature matrices indicate that the system is not diagnosable since, for example, the observable (p_on, r_1 = 1, r_2 = 1) belongs to two fault signatures.
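For illustration, the two analytical redundancy relations can be evaluated directly on sampled data. The sketch below is not from the paper: the parameter values, the treatment of the delays as integer numbers of samples, and the data layout are all invented for the example.

```python
# Illustrative sketch: evaluating r1 and r2 on sampled data. The constants
# k1..k9 and the integer delays tau1, tau2 are assumptions, not paper values.
K = dict(k1=0.1, k2=0.2, k3=0.1, k4=0.5, k5=0.05, k6=0.02,
         k7=0.1, k8=0.3, k9=0.2)

def residuals(y1, y2, c1, c2, t, tau1, tau2, pump_on):
    """Return (r1, r2) at sample t+1; y1, y2, c1, c2 are sample lists."""
    k5 = K["k5"] if pump_on else 0.0  # pump off: u_pump = 0, i.e. k5 = k6 = 0
    k6 = K["k6"] if pump_on else 0.0
    y1_hat = ((1 - K["k3"] * K["k4"]) * y1[t] - K["k1"] * c1[t]
              + K["k2"] * k6 * y2[t - tau2] + K["k2"] * k5)
    y2_hat = ((1 - K["k9"] * k6) * y2[t] - K["k7"] * c2[t]
              + K["k8"] * K["k4"] * y1[t - tau1] - K["k9"] * k5)
    return y1[t + 1] - y1_hat, y2[t + 1] - y2_hat
```

A fault on sensor y_1, for instance, makes both residuals deviate from zero, matching the F_y1 column of the matrices above.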

6.3 Discrete event model, dynamic diagnosis

For the DES model of the system, the following events are used: p_on and p_off, fired when the pump is turned on or off; F_S, fired when a fault occurs on sensor S; r_1 and r_2, fired when the analytical redundancy relations r_1 and r_2 are violated. The automaton is shown in Figure 3. An arc labelled a.b represents two arcs labelled a and b, a leading to a state in which only b may occur.

[Figure 3: Automaton describing the system]

From the automaton, and following Section 3.2, it is possible to build the signatures of all the faults (see Figure 4). Recall that all the events except the faults are observable. The fault signatures are disjoint sets; the system is hence diagnosable.

Figure 4: Fault signatures (discriminant subwords are bolded in the original)

∅     : (p_on.p_off)^ω
F_c1  : (p_on.p_off)*.r_1.(p_on.p_off)^ω and (p_on.p_off)*.p_on.r_1.(p_off.p_on)^ω
F_c2  : (p_on.p_off)*.r_2.(p_on.p_off)^ω and (p_on.p_off)*.p_on.r_2.(p_off.p_on)^ω
F_y1  : (p_on.p_off)*.r_1.r_2.(p_on.p_off)^ω and (p_on.p_off)*.p_on.r_1.r_2.(p_off.p_on)^ω
F_y2  : (p_on.p_off)*.r_2.p_on.r_1.(p_off.p_on)^ω and (p_on.p_off)*.p_on.r_2.r_1.(p_off.p_on)^ω

6.4 Results

This example shows that, although the DES and CS diagnosability definitions are formally equivalent, operational diagnosability assessment critically depends on the nature of the observables. In the CS approach, diagnosability is not achieved, as fault signatures are not disjoint: (p_on, r_1 = 1, r_2 = 1) is a signature for both F_y1 and F_y2, and (p_off, r_1 = 0, r_2 = 1) is a signature for both F_y2 and F_c2. In the DES model, in the pump-on mode, the symptoms r_1 = 1 and r_2 = 1 appear in the order (r_1 r_2) for F_y1 and in the reverse order (r_2 r_1) for F_y2. Taking this order into account permits fault discrimination between F_y1 and F_y2 in dynamic diagnosis. In addition, in the pump-off mode, both F_y2 and F_c2 are followed by the r_2 symptom, but only in the case of F_y2 will a p_on command be followed by the r_1 symptom. Notice that diagnosability rests on the assumption that the pump will be turned on at some time: it is only after the p_on command that the faults can be discriminated.

7 Related work

In the context of continuous systems, diagnosability analysis is stated in terms of detectability and isolability [Chen and Patton, 1994]. [Basseville, 2001] reviews several definitions of fault detectability and isolability and distinguishes two types of definitions, namely intrinsic definitions, which do not make any reference to a particular residual generator, and performance-based definitions. In [Staroswiecki and Comtet-Varga, 1999], the conditions for sensor, actuator and component fault detectability are given for algebraic dynamic systems, and isolability is discussed. Diagnosability analysis for continuous systems is often focussed on finding an optimal sensor placement, as in [Travé-Massuyès et al., 2001], which uses a structural approach, or in [Yan, 2004] and [Tanaka, 1989]. [Frisk et al., 2003] also follow a structural approach and show how different levels of knowledge about the faults may influence the fault isolability properties of the system. In [Travé-Massuyès et al., 2004], a definition of diagnosability in terms of fault signatures is proposed; it is the one used in this paper. In [Struss and Dressler, 2003], the state-based approach is extended to take into account several operating modes, for which state signatures may be different. In this situation strong diagnosability is hardly achieved, and the paper proposes a definition to distinguish different discriminability situations: two faults may be not discriminable, necessarily discriminable, or possibly discriminable, depending on the intersection pattern of their associated observation sets. This work is strongly related to the weak diagnosability definitions provided in [Travé-Massuyès et al., 2004] for CS and [Rozé and Cordier, 2002] for DES. Comparing the formal definitions of weak diagnosability still remains to be done.
In the DES context, the first definitions were proposed in [Sampath et al., 1995]. Checking diagnosability is computationally complex, and polynomial time algorithms have been designed to cope with this problem [Jiang et al., 2001; Yoo and Lafortune, 2002]. In [Cimatti et al., 2003], formal verification of diagnosability is based on model-checking techniques. More recently, [Jeron et al., 2006] proposed a generalization of diagnosability properties to supervision patterns (describing various patterns involving fault events). To our knowledge, there is no existing work comparing and/or unifying the diagnosability approaches coming from the CS and DES communities. Some diagnosis algorithms have been proposed for hybrid systems, but diagnosability conditions have not been exhibited for such systems; this is one of our goals for future work. This paper is a direct continuation of the work done within the Imalaia group, devoted to bridging the gap between the two communities [Cordier et al., 2004] by comparing their respective approaches to model-based diagnosis.

8 Conclusion

In this paper, we propose a formal framework to compare in an adequate way the diagnosability definitions from the CS and DES communities. The signature concept is generalized to trajectories and allows us to prove the equivalence of the diagnosability definitions. The key issue is the way observations are defined: in a static way in the CS approach, and as partially ordered sets (sequences) in the DES approach. On the one hand, when temporal information is necessary to discriminate faults, the DES approach gives better results. On the other hand, it requires waiting a certain amount of time before getting the result. In practical applications, this delay has to be estimated and must be realistic with respect to the existing risks and the decisions to be taken. Another view is to enrich CS signatures with temporal information [Puig et al., 2005]. Having a common diagnosability analysis approach for both state-based and dynamic diagnosis opens interesting perspectives for analysing hybrid systems diagnosability. Some results along this line can be found in [Bayoudh et al., 2006]. Future work will address the extension of the comparison of DES and CS approaches to weak diagnosability definitions (as given in [Travé-Massuyès et al., 2004] for CS and in [Rozé and Cordier, 2002] for DES). This is an important issue because real-world systems are generally weakly but not strongly diagnosable. Hence weak diagnosability is more

relevant than strong diagnosability from a practical point of view.

References

[Basseville, 2001] M. Basseville. On fault detectability and isolability. European Journal of Control, 7(8), 2001.

[Bayoudh et al., 2006] M. Bayoudh, L. Travé-Massuyès, and X. Olive. Hybrid systems diagnosability by abstracting faulty continuous dynamics. In Proceedings of DX'06, 2006.

[Bénazéra and Travé-Massuyès, 2003] E. Bénazéra and L. Travé-Massuyès. The consistency approach to the on-line prediction of hybrid system configurations. In IFAC Conference on Analysis and Design of Hybrid Systems (ADHS'03), Saint-Malo (France), 2003.

[Bénazéra et al., 2002] E. Bénazéra, L. Travé-Massuyès, and P. Dague. State tracking of uncertain hybrid concurrent systems. In Proceedings of the International Workshop on Principles of Diagnosis (DX'02), 2002.

[Chen and Patton, 1994] J. Chen and R.J. Patton. A re-examination of fault detectability and isolability in linear dynamic systems. In Proceedings of the 2nd Safeprocess Symposium, Helsinki (Finland), 1994.

[Cimatti et al., 2003] A. Cimatti, C. Pecheur, and R. Cavada. Formal verification of diagnosability via symbolic model checking. In Proceedings of IJCAI'03, 2003.

[Cordier et al., 2004] M.-O. Cordier, P. Dague, F. Lévy, J. Montmain, M. Staroswiecki, and L. Travé-Massuyès. Conflicts versus analytical redundancy relations: a comparative analysis of the model-based diagnostic approach from the artificial intelligence and automatic control perspectives. IEEE Transactions on Systems, Man and Cybernetics, Part B, 34(5), 2004.

[Frisk et al., 2003] E. Frisk, D. Düştegör, M. Krysander, and V. Cocquempot. Improving fault isolability properties by structural analysis of faulty behavior models: application to the DAMADICS benchmark problem. In Proceedings of IFAC Safeprocess'03, Washington, USA, 2003.

[Gupta et al., 2004] S. Gupta, G. Biswas, and J. Ramirez. An improved algorithm for hybrid diagnosis of complex systems. In Proceedings of DX'04, 2004.

[Jeron et al., 2006] T. Jeron, H. Marchand, and M.-O. Cordier. Motifs de surveillance pour le diagnostic de systèmes à événements discrets. In Proceedings of RFIA 2006, 2006.

[Jiang et al., 2001] S. Jiang, Z. Huang, V. Chandra, and R. Kumar. A polynomial time algorithm for diagnosability of discrete event systems. IEEE Transactions on Automatic Control, 46(8), 2001.

[Patton et al., 1989] R.J. Patton, P. Frank, and R. Clark. Fault Diagnosis in Dynamic Systems: Theory and Applications. Prentice Hall International, London, UK, 1989.

[Puig et al., 2005] V. Puig, J. Quevedo, T. Escobet, and B. Pulido. On the integration of fault detection and isolation in model based fault diagnosis. In Proceedings of DX'05, 2005.

[Rozé and Cordier, 2002] L. Rozé and M.-O. Cordier. Diagnosing discrete-event systems: extending the diagnoser approach to deal with telecommunication networks. Journal of Discrete-Event Dynamic Systems: Theory and Applications (JDEDS), 12(1):43–81, 2002.

[Sampath et al., 1995] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D. Teneketzis. Diagnosability of discrete-event systems. IEEE Transactions on Automatic Control, 40, 1995.

[Staroswiecki and Comtet-Varga, 1999] M. Staroswiecki and G. Comtet-Varga. Fault detectability and isolability in algebraic dynamic systems. In Proceedings of the European Control Conference, 1999.

[Struss and Dressler, 2003] P. Struss and O. Dressler. A toolbox integrating model-based diagnosability analysis and automated generation of diagnostics. In Proceedings of DX'03, 2003.

[Tanaka, 1989] S. Tanaka. Diagnosability of systems and optimal sensor location.
In R.J. Patton, P. Frank, and R. Clark, editors, Fault Diagnosis in Dynamic Systems: Theory and Applications, chapter 5. Prentice Hall International, London, UK, 1989.

[Travé-Massuyès et al., 2001] L. Travé-Massuyès, T. Escobet, and R. Milne. Model-based diagnosability and sensor placement: application to a Frame 6 gas turbine subsystem. In Proceedings of IJCAI'01, 2001.

[Travé-Massuyès et al., 2004] L. Travé-Massuyès, T. Escobet, and X. Olive. Model-based diagnosability. Internal Report LAAS N04080, January 2004, 12p. To appear in IEEE Transactions on Systems, Man and Cybernetics, Part A.

[Williams and Nayak, 1996] B.C. Williams and P.P. Nayak. A model-based approach to reactive self-configuring systems. In Proceedings of AAAI-96, Portland, Oregon, 1996.

[Yan, 2004] Y. Yan. Sensor placement and diagnosability analysis at design stage. In MONET Workshop on Model-Based Systems at ECAI'04, Valencia, Spain, August 2004.

[Yoo and Lafortune, 2002] T. Yoo and S. Lafortune. Polynomial-time verification of diagnosability of partially-observed discrete-event systems. IEEE Transactions on Automatic Control, 47(9), 2002.

Exploiting independence in a decentralised and incremental approach of diagnosis

Marie-Odile Cordier, Irisa / Dream, Rennes, France, cordier@irisa.fr
Alban Grastien, Irisa / Dream, Rennes, France, alban.grastien@rsise.anu.edu.au

Abstract

It is now well known that the size of the model is the bottleneck when using model-based approaches to diagnose complex systems. To answer this problem, decentralised/distributed approaches have been proposed: the global system model is described through its component models as a set of automata, and the global diagnosis is computed from the component diagnoses (also called local diagnoses). Another problem, which is far less considered, is the size of the diagnosis itself. It can also be huge, especially when dealing with uncertain observations. This is why we recently proposed to slice the observation flow into temporal windows and to compute the diagnosis incrementally from these diagnosis slices. In this context, we define in this paper two independence properties (transition and state independence) and we show their relevance for obtaining a tractable representation of the diagnosis. To illustrate the impact on the diagnosis size, experimental results on a toy example are given.

1 Introduction

In this paper, we are concerned with the diagnosis of discrete event systems [Cassandras and Lafortune, 1999], where the system behaviour is modeled by automata. This domain has been active since the seminal work of [Sampath et al., 1996]. Diagnosis consists in finding what happened to the system from existing observations, as in [Baroni et al., 1999; Cordier and Thiébaux, 1994; Console et al., 2000; Lunze, 1999; Rozé and Cordier, 1998; Cordier and Largouët, 2001]. A classical formal way of representing the diagnosis problem is to express it as the synchronised product of the system model automaton and an observation automaton. This formal definition hides the real problem, which is to ensure an efficient computation of the diagnosis when the system is complex and the observations possibly uncertain.

It is now well known that the size of the system model is one bottleneck when using model-based approaches to diagnose complex systems.* To answer this problem, decentralised/distributed approaches have been proposed [Pencolé and Cordier, 2005; Lamperti and Zanella, 2003; Benveniste et al., 2005]. Instead of being explicitly given, the system model is described through its component models in a decentralised way. From these local models, local diagnoses are computed to explain the local observations. When a global decision needs to be taken, a global diagnosis is computed by merging the local diagnoses, in order to take into account the synchronisation events which express the dependency relations that may exist between the components. This merging step can be costly, and merging strategies have been proposed, as in [Pencolé and Cordier, 2005; Lamperti and Zanella, 2003]. The main result gained from these works is the importance of detecting concurrent subsystems in order to limit both the computation time and the representation size of the diagnosis.

A problem which is far less considered is the size of the observation flow, which directly impacts the size of the diagnosis itself. It can, however, also be a problem, especially when dealing with uncertain observations, as already remarked by [Lamperti and Zanella, 2003].

* This work was partially made at NICTA, Canberra (Australia).
Moreover, increasing the observation period decreases the chance of finding independent subsystems. This is why we recently proposed to slice the observation flow into temporal windows and to compute the diagnosis incrementally from these diagnosis slices [Grastien et al., 2005]. The idea is then to detect independent subsystems on these limited subperiods and to exploit these properties to get an economical representation and computation of the diagnosis.

In this context of incremental and decentralised diagnosis, we define in Section 2 two independence properties (transition and state independence) on automata, and we show their relevance for obtaining a tractable representation of the diagnosis. The first one, transition independence, expresses that two models do not share any synchronisation events. The second one, state independence, expresses that when decomposing a model into two submodels, no constraints on their initial states have been lost. We first examine in Section 3 the purely decentralised case and propose to represent the diagnosis by a set of transition-independent diagnoses. We show in Section 4 the specific problem related to the incremental computation and propose to use an abstract description of trajectories, from which the set of final states and the trajectories of the global

diagnosis can be easily retrieved. To illustrate the impact of our proposal on the diagnosis size, experimental results on a toy example are given in Section 5. We conclude by showing that the next step is to automatically find the best slicing points in order to maximally exploit the two independence properties which were defined.

2 Preliminaries and independence properties

We suppose in this paper that the behavioural models are described by automata. We thus begin by giving some definitions concerning automata which are needed in the following sections. Then we define the independence properties that are central in this paper. Lastly, we recall the diagnosis definitions and state some hypotheses.

2.1 Automata, synchronisation and restriction

Automata are used to describe the behavioural models of the system components. Let us recall the definition and introduce the notations.

Definition 1 (Automaton). An automaton is a tuple A = (Q, E, T, I, F) where Q is a (finite) set of states, E is a set of labels, T ⊆ Q × P(E) × Q is a (finite) set of transitions (q, l, q′) with l ⊆ E, I ⊆ Q is the set of initial states, and F ⊆ Q is the set of final states. We suppose that, ∀ q ∈ Q, the transition (q, ∅, q) exists.

A trajectory is a path in the automaton joining an initial state to a final state.

Definition 2 (Trajectory). A trajectory on an automaton A = (Q, E, T, I, F) is a double sequence of states and transitions traj = q_0 −l_1→ q_1 ... −l_n→ q_n such that q_0 ∈ I, q_n ∈ F, and ∀ i, (q_{i−1}, l_i, q_i) ∈ T. The set of trajectories of an automaton A is denoted Traj(A).

In the following, as we are mainly interested in trajectories and in the states passed through, the automata we consider are trim automata [Cassandras and Lafortune, 1999], i.e. automata such that all states belong to at least one trajectory. The trim operation transforms an automaton into its corresponding trim automaton by removing the states (and transitions) that do not belong to any trajectory. Remark that a trim operation does not remove any trajectory. It can however shrink the set of initial states and of final states.

Let us consider the trim automaton in Figure 1. The initial states are represented by an arrow with no origin state, and the final states by a double circle. Any path of the figure joining an initial state to a final state is a trajectory.

[Figure 1: Example of an automaton, with states a to f]

Let us consider the synchronisation of two automata A_1 and A_2. The events which are common to the transition labels of A_1 and A_2, i.e. the elements of E_1 ∩ E_2, are called synchronisation events. To be synchronisable, two transitions must either be labeled by events which are not synchronisation events, or have the same synchronisation events. The synchronisation operation on two automata builds the trim automaton where all the trajectories of both automata which cannot be synchronised according to the synchronisation events are removed.

Definition 3 (Synchronisation of automata). Given two automata A_1 = (Q_1, E_1, T_1, I_1, F_1) and A_2 = (Q_2, E_2, T_2, I_2, F_2), the synchronised automaton of A_1 and A_2, denoted A_1 ⊗ A_2, is the trim automaton A = (Q, E, T, I, F) such that: Q ⊆ Q_1 × Q_2; E = E_1 ∪ E_2; T is the set of transitions ((q_1, q_2), l_1 ∪ l_2, (q′_1, q′_2)) such that (q_1, l_1, q′_1) ∈ T_1, (q_2, l_2, q′_2) ∈ T_2 and l_1 ∩ (E_1 ∩ E_2) = l_2 ∩ (E_1 ∩ E_2); I ⊆ I_1 × I_2; and F ⊆ F_1 × F_2. Because of the trim operation, states (and transitions) can be removed, so the set of states Q is only included in Q_1 × Q_2; in the same way, the initial (resp. final) states of A are included in I_1 × I_2 (resp. F_1 × F_2).

Figure 2 gives an example of synchronisation: the automaton of Figure 1 and the automaton at the top of Figure 2 are synchronised, leading to the automaton at the bottom; the synchronisation events are the events shared by the two automata.

[Figure 2: Example of synchronisation]
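The synchronisation operation can be illustrated by a small sketch. The following Python fragment is ours, not the authors' implementation; for brevity it omits the trim step of Definition 3 (so the initial and final state sets are the full Cartesian products rather than subsets of them), and it assumes the idle transitions (q, ∅, q) of Definition 1 are already present in each transition set, so that unsynchronised moves pair with an idle move.

```python
# Illustrative sketch: automata with set-labeled transitions and their
# synchronisation (Definition 3), without the trim step.

from itertools import product

class Automaton:
    def __init__(self, states, events, transitions, initial, final):
        self.states = states            # set of states Q
        self.events = events            # set of event labels E
        self.transitions = transitions  # set of triples (q, frozenset(l), q')
        self.initial = initial          # set I of initial states
        self.final = final              # set F of final states

def synchronise(a1, a2):
    """A1 (x) A2: pair transitions that agree on the shared events E1 ∩ E2."""
    shared = a1.events & a2.events
    trans = set()
    for (q1, l1, r1), (q2, l2, r2) in product(a1.transitions, a2.transitions):
        if l1 & shared == l2 & shared:
            trans.add(((q1, q2), l1 | l2, (r1, r2)))
    return Automaton(set(product(a1.states, a2.states)),
                     a1.events | a2.events, trans,
                     set(product(a1.initial, a2.initial)),
                     set(product(a1.final, a2.final)))
```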
The automaton in Figure 1 and the automaton on the top of Figure 2 are synchronised leading to the automaton on the bottom. The synchronising events are the H events. 9 ž K Ÿ ^ ak j imh ž ^ ž Ÿ ^ K œb c b# MK Ÿ ^ ž K h e o^ d ikj k 9 K kšb š9 qœ šy k Figure 2: Example of synchronisation ikj ž K f ž K k B 9 k 62 DX'06 - Peñaranda de Duero, Burgos (Spain)

The restriction operation removes from an automaton all the initial states which are not in a specified set of states. Due to the trim operation, all the states and transitions which are no longer accessible are removed as well.

Definition 4 (Restriction). Let A = (Q, E, T, I, F) be an automaton. The restriction of A by a set of states Q′, denoted A[Q′], is the trim automaton obtained from A by replacing its set of initial states by I ∩ Q′.

2.2 Transition- and state-independence

The transition-independence property states that two (or more) automata do not have any transition labeled with synchronisation events.

Definition 5 (Transition-independence). A_1 = (Q_1, E_1, T_1, I_1, F_1) and A_2 = (Q_2, E_2, T_2, I_2, F_2) are transition-independent (TI) iff every label l on a transition of A_1 or A_2 is such that l ∩ (E_1 ∩ E_2) = ∅.

For two TI automata, the synchronisation operation is equivalent to a shuffle operation.

Property 1: Let A_1 and A_2 be two transition-independent automata and A = A_1 ⊗ A_2. The final (resp. initial) states of A correspond exactly to the Cartesian product of the final (resp. initial) states of A_1 and A_2.

[Figure 3: Example of two TI automata and their synchronisation]

Figure 3 gives an example of two automata A_1 and A_2 whose common events are their synchronisation events. Since neither automaton has a transition labeled with such an event, the automata are transition-independent. The synchronisation A = A_1 ⊗ A_2 is represented at the bottom of the figure (for simplicity, the labels on the transitions are not represented). We see that the set of initial (resp. final) states of A is the Cartesian product of the initial (resp. final) states of A_1 and A_2. Figure 1 and Figure 2, in contrast, give an example of two automata that are not transition-independent, as they contain transitions labeled with synchronisation events; there, the set of final states of the synchronisation is only included in the Cartesian product of F_1 and F_2.
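Continuing the hypothetical sketch above (again an illustration, not the paper's code), the transition-independence test of Definition 5 and the product-free final-state computation of Property 1 are direct to express:

```python
def transition_independent(a1, a2):
    """Definition 5: no transition of a1 or a2 carries a shared event."""
    shared = a1.events & a2.events
    return all(not (l & shared)
               for (_, l, _) in a1.transitions | a2.transitions)

def final_states_of_product(a1, a2):
    """Property 1: for TI automata, the final states of A1 (x) A2 are the
    plain Cartesian product, so no synchronisation needs to be computed."""
    assert transition_independent(a1, a2)
    return {(f1, f2) for f1 in a1.final for f2 in a2.final}
```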
In Section 3, we are interested in representing a system model in a decomposed way by the set of its subsystem models, the main property being that it must be possible to retrieve the former from the latter by a composition (synchronisation) operation. In the following, we give the definitions of subsystems and the properties a set of subsystem models must satisfy to be a safe representation of the system model.

Definition 6 (System and subsystems). A system can be described by its set of components. A subsystem is a non-empty set of components. A subsystem model describes the subsystem behaviour and is given by an automaton whose set of labels is the set of events that can occur on this subsystem. Some of these events are shared with other subsystems and are synchronisation events between subsystem models.

Let us now see the properties a set of subsystem models has to satisfy to be a good representation of the system model. We first define what we call a decomposition of an automaton A into two automata.

Definition 7 (Decomposition of A). Two automata A_1 and A_2 are said to be a decomposition of an automaton A iff A = (A_1 ⊗ A_2)[I], where I is the set of initial states of A.

Remark that we do not require to obtain exactly A by synchronising A_1 and A_2, but only a super-automaton of A (i.e. an automaton that contains all the trajectories of A, and possibly more). In general, we thus have that the initial (resp. final) states of A are only included in I_1 × I_2 (resp. F_1 × F_2).

The idea is that, when describing a system (whose model is A) by its subsystems, one has to describe the subsystem behaviours, which is done through the subsystem models (here A_1 and A_2), and the way the subsystems interact, which is done through the synchronisation events. Moreover, this has to be done in a proper way, as stated by Definition 7. But a point is still missing: the constraints existing between the subsystem initial states, which represent the system initial states, can be lost in the decomposition of A. This is why, when composing A_1 and A_2 by a synchronisation operation, we do not always get back exactly A, but an automaton including A. The state-independence property is a property of a decomposition A_1, A_2 which ensures that we get exactly A.

Definition 8 (State-independent decomposition wrt A). Two automata A_1 = (Q_1, E_1, T_1, I_1, F_1) and A_2 = (Q_2, E_2, T_2, I_2, F_2) are said to be a state-independent decomposition wrt A (SI) iff they are a decomposition of A and A = A_1 ⊗ A_2.

Remark that if A_1 and A_2 both have a unique initial state, and if they are a decomposition of A, then they are a state-independent decomposition wrt A.

Let us now suppose two automata A_1 and A_2 which are a state-independent decomposition wrt A and are transition-independent. In this case, due to Property 1, the initial and final states of A can be easily computed as the Cartesian product of the initial and final states of A_1 and A_2. This property means that, when one is mainly interested in these states,

it is not necessary to perform the synchronisation operation on the automata, which is costly in space.

Property 2: Let A_1 and A_2 be two transition-independent automata forming a state-independent decomposition wrt A. The initial (resp. final) states of A are exactly the Cartesian product of the initial (resp. final) states of A_1 and A_2.

When A_1 and A_2 are not a state-independent decomposition wrt A, the only way not to lose any information is to add, as extra information, the initial states of A to the decomposed representation of A.

2.3 Diagnosis

Let us now recall the definitions used in the domain of discrete-event systems diagnosis, where the model of the system is represented by an automaton. In the following, t_0 corresponds to the starting time and t_n to the ending time of the diagnosis.

Definition 9 (Model). The model of the system, denoted Mod, is an automaton. It describes the behaviour of the system, and the trajectories of Mod represent the possible evolutions of the system. The set of initial states I_Mod is the set of possible states at t_0. We suppose, as usual, that F_Mod = Q_Mod (all the states of the system may be final). The set of observable events of Mod is denoted Obs_Mod.

Let us turn to the observations, represented by an automaton whose transition labels are observable events of Obs_Mod.

Definition 10 (Observation automaton). The observation automaton, denoted Obs, is an automaton describing the observations emitted by the system during the period [t_0, t_n].

Even if, usually, the observations are subject to uncertainties, we consider in the following that they are represented as a unique sequence of observable events. This simplifies the presentation, but it can be extended to the case of uncertain observations, as we did for instance in [Grastien et al., 2005].

The diagnosis is a trim automaton describing the possible trajectories on the model of the system compatible with the observations sent by the system during the period [t_0, t_n]. It is formally defined as resulting from the synchronisation operation between the system model Mod and the observation automaton Obs.

Definition 11 (Diagnosis). The diagnosis, denoted Δ, is the trim automaton such that Δ = Mod ⊗ Obs.

3 Improving diagnosis representation in a decentralised approach

Real-world systems can often be seen as a set of (possibly abstract) interconnected components. Each component has a simple behaviour, but the connections between the components can lead to a complex global behaviour. For this reason, the size of a global model of the system is generally untractable, and no global model can be effectively built. To answer this problem, decentralised/distributed approaches have been proposed [Lamperti and Zanella, 2003; Pencolé and Cordier, 2005; Benveniste et al., 2005]. In this article, we consider the decentralised approach of Pencolé and Cordier, pictured in Figure 4. The idea is to describe the system behaviour in a decomposed way. The so-called decentralised model is thus Mod = {Mod_1, ..., Mod_n}, where Mod_i is the behavioural model of the component i. The decentralised model is built to be a decomposition of the global model Mod.
The global model can thus be retrieved by Mod = (Mod_1 ⊗ ... ⊗ Mod_n)[I_Mod], where I_Mod is the set of initial states of Mod. We consider that the global model has a unique initial state (if it does not, an initialization transition can be added to ensure it) and that the component models also have a unique initial state. They are thus a state-independent decomposition wrt Mod, and we have Mod = Mod_1 ⊗ ... ⊗ Mod_n. The observations Obs can generally be decentralised as well: Obs = {Obs_1, ..., Obs_n}, such that Obs_i contains the observations from the component i and such that Obs = Obs_1 ⊗ ... ⊗ Obs_n. Given the local model Mod_i and the local observations Obs_i, it is possible to compute the local diagnosis Δ_i = Mod_i ⊗ Obs_i. These diagnoses represent the local behaviours that are consistent with the local observations. It was shown in [Pencolé and Cordier, 2005] that the decentralised diagnosis {Δ_1, ..., Δ_n} is a decomposition of Δ. As there is a unique initial state, it is also a state-independent decomposition. It is then possible to compute the global diagnosis of the system by merging all the local diagnoses as follows: Δ = Δ_1 ⊗ ... ⊗ Δ_n.

[Figure 4: Principle of the decentralised computation of the diagnosis: each component model is synchronised with its local observations to obtain a local diagnosis, and the local diagnoses are merged]

A first improvement in the diagnosis computation is that, rather than directly merging all the local diagnoses together, it is possible to compute the global diagnosis incrementally, by successive synchronisation operations. Let γ_1 and γ_2 be two disjoint subsystems (possibly components) and let γ = γ_1 ∪ γ_2 be the subsystem that contains exactly γ_1 and γ_2. The subsystem diagnosis Δ_γ can be computed by synchronising the two subsystem diagnoses Δ_{γ_1} and Δ_{γ_2}: Δ_γ = Δ_{γ_1} ⊗ Δ_{γ_2}. The diagnosis of the whole system is obtained when a single subsystem remains.

The next point is that, in spite of the constraints generated by the observations, the size of the global diagnosis can still be large. This is mainly due to the fact that merging concurrent diagnoses corresponds to computing the shuffle of two automata, which is costly in terms of numbers of states and transitions (see for instance Figure 3). A second improvement, to avoid

these costly shuffles, is to represent the system diagnosis as a set of transition-independent subsystem diagnoses.

Definition 12 (Decentralised diagnosis). A decentralised diagnosis is a set of subsystem diagnoses {Δ_{γ_1}, ..., Δ_{γ_k}} such that Δ_{γ_i} is the diagnosis of the subsystem γ_i, {γ_1, ..., γ_k} is a partition of the system, and, ∀ i ≠ j, Δ_{γ_i} and Δ_{γ_j} are transition-independent.

As seen before, a decentralised diagnosis is a decomposition of the global diagnosis. It can thus be computed, if needed, by synchronising all the subsystem diagnoses, or equivalently by a shuffle operation. Its final states can be obtained by a simple Cartesian product of the final states of all the Δ_{γ_i}.

Algorithm 1 shows how to compute a decentralised diagnosis from the local (component) diagnoses. Until all pairs of diagnoses are transition-independent, the algorithm chooses two transition-dependent diagnoses and merges them. Let us remark that the result is not unique and depends on the merging strategy, which is also very important from a computation time point of view. It was proposed in [Pencolé et al., 2001a] to use a dynamic strategy based on first synchronising the subsystem diagnoses which interact the most, in order to remove as many trajectories as possible as early as possible.

Algorithm 1: Computation of a decentralised diagnosis
  input: the local diagnoses {Δ_{γ_1}, ..., Δ_{γ_n}}
  while there exist Δ_{γ_i} and Δ_{γ_j} that are not transition-independent do
    Δ_{γ_i ∪ γ_j} := Δ_{γ_i} ⊗ Δ_{γ_j}
    replace Δ_{γ_i} and Δ_{γ_j} by Δ_{γ_i ∪ γ_j}
  end while
  return: the resulting set of subsystem diagnoses
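A possible rendering of Algorithm 1 in code, reusing the hypothetical synchronise() and transition_independent() helpers sketched earlier; diagnoses are indexed by the subsystem (a frozenset of components) they cover, and the merging strategy here is simply the first dependent pair found, whereas the paper points to the smarter dynamic strategy of [Pencolé et al., 2001a].

```python
# Illustrative sketch of Algorithm 1: merge subsystem diagnoses until all
# remaining pairs are transition-independent.

def decentralised_diagnosis(local_diagnoses):
    """`local_diagnoses` maps frozensets of components to their diagnoses."""
    diags = dict(local_diagnoses)
    while True:
        pair = next(((g1, g2) for g1 in diags for g2 in diags
                     if g1 != g2
                     and not transition_independent(diags[g1], diags[g2])),
                    None)
        if pair is None:
            return diags
        g1, g2 = pair
        merged = synchronise(diags.pop(g1), diags.pop(g2))
        diags[g1 | g2] = merged
```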
4 Improving diagnosis representation in a decentralised and incremental approach

In the previous section, we considered that the diagnosis was computed over a single period. This means that the observation automaton represents the observations from the beginning to the end of the period, and that the diagnosis represents the behaviour during the whole period. We have seen that exploiting transition-independence enables reducing the size of the diagnosis representation. However, when a long period is considered, as may be the case when log files have to be diagnosed, it is very seldom that behaviours are independent, since each component eventually interacts with most of its neighbours. This is why we recently proposed to slice the observations into temporal windows and to incrementally compute the diagnosis for each temporal window [Grastien et al., 2005]. Given these diagnoses on small windows, it can now be expected to find independent behaviours that can be efficiently represented by a decentralised diagnosis.

The problem with the incremental approach is that it becomes difficult to ensure the state-independence property of the decomposition. This property allowed us, due to Property 2 of Section 2, to get the initial and final states of the global diagnosis without computing it explicitly. To keep the benefit of the decentralised representation of the diagnosis, we propose a solution that enables us to get the initial and final states needed for an incremental diagnosis without having to merge diagnoses, even when state-independence is not satisfied. Let us first present a formalism-free generalization of the incremental computation by automaton slicing. We then explain why the state-independence property is lost, and end by proposing a solution to this problem.

4.1 Incremental diagnosis

The incremental diagnosis relies on the notion of temporal windows, first introduced in [Pencolé et al., 2001b]. For a detailed presentation of diagnosis by slices, refer to [Grastien et al., 2005]. Let [t_0, t_n] be the diagnosis period and t_0 < t_1 < ... < t_n be a sequence of dates. The temporal window W_i is the period [t_{i−1}, t_i]. Let Obs_1, ..., Obs_n be a slicing of the observations Obs, where Obs_i contains the observations of window W_i. It is shown in [Grastien et al., 2005] that, given such a slicing, a diagnosis on the period [t_0, t_n] can be computed as a sequence of diagnoses Δ_1, ..., Δ_n corresponding to the windows W_i. It is also shown that, given this sequence of automata, it is possible, if needed, to reconstruct the original automaton by appending the slices. The trajectories can be computed as follows: a trajectory on the sequence of automata is a sequence of trajectories traj_1, ..., traj_n such that, ∀ i, traj_i ∈ Traj(Δ_i), the first state of traj_1 is an initial state, the last state of traj_n is a final state, and the last state of each traj_i is the first state of traj_{i+1}.

Let us now reduce the problem to two slices and suppose we have computed a diagnosis Δ_{i−1} for the period [t_0, t_{i−1}]. We do not presume the way this diagnosis is represented and will come back to this point later. We want to compute the diagnosis Δ_i by taking into account the observations Obs_i of the next temporal window W_i. Let us first see how Δ_i can be computed. We can state that Δ_1 = Mod ⊗ Obs_1 and that, ∀ i ≠ 1, Δ_i = (Mod^I ⊗ Obs_i)[F_{i−1}], where Mod^I is the model in which all the states are initial and F_{i−1} is the set of final states of Δ_{i−1}. The i-th diagnosis of the sequence can theoretically be computed by the synchronisation of the model where all states are initial, Mod^I, with the observations Obs_i of the window. It is however important, from a computational point of view, to restrict the set of initial states by the set of final states F_{i−1} of the previous automaton. It is then possible to describe the diagnosis as the sequence Δ_1, ..., Δ_i. Remark that the set of final states of the sequence is exactly the set of final states of Δ_i.
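The incremental scheme can be sketched as follows; this is an illustration under simplifying assumptions (the trim step is still omitted, so state sets are over-approximations, and product states are taken to be (model state, observation state) pairs as built by the synchronise() sketch above), not the paper's implementation.

```python
# Illustrative sketch of Section 4.1: each window diagnosis is Mod^I
# synchronised with the window's observations, restricted by the final
# states of the previous window.

def restrict_initial(a, keep):
    """Definition 4 without the trim step: keep initial states satisfying
    the predicate `keep`."""
    return Automaton(a.states, a.events, a.transitions,
                     {s for s in a.initial if keep(s)}, a.final)

def incremental_diagnosis(model, obs_slices):
    """Return the sequence of window diagnoses Delta_1, ..., Delta_n."""
    # Mod^I: the model in which every state is initial.
    mod_i = Automaton(model.states, model.events, model.transitions,
                      set(model.states), model.final)
    diagnoses, prev_final = [], set(model.initial)
    for obs in obs_slices:
        delta = restrict_initial(synchronise(mod_i, obs),
                                 lambda s, ok=prev_final: s[0] in ok)
        diagnoses.append(delta)
        # project diagnosis states back onto model states for the next window
        prev_final = {q for (q, _) in delta.final}
    return diagnoses
```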

4.2 Loss of the state-independence property

Our goal is to use, for this sequence of diagnoses, a decentralised computation based on a decentralised model, and a decentralised representation similar to the one proposed in Section 3, based on transition-independent diagnoses. We want to compute Δ_i in a decentralised way, which means that we build the local diagnoses before merging them (see Algorithm 1). The diagnosis of the component j in the temporal window W_i is computed as follows: Δ_i^j = (Mod_j^I ⊗ Obs_i^j)[F_{i−1}^j], where F_{i−1}^j is the projection of the set of final states of Δ_{i−1} on the component j. By Algorithm 1, we get a set of transition-independent subsystem diagnoses. The problem that appears here is that this set is a decomposition of Δ_i, but it cannot be ensured that it is a state-independent decomposition. Contrary to the case of Section 3, it can happen that the links with the initial states of the other components are lost when projecting F_{i−1} on a component.

This is illustrated by Figure 5, which represents the diagnoses of two components. Each component can be either in an ok state or in a faulty state. The figure presents a two-window diagnosis, each window in a box. During the first window, one of the two components failed, but it is not possible to determine which component did. The initial states of each component at the beginning of the second window are obtained by projecting the final states of the first window: they are ok and faulty for one component, and ok and faulty for the other one. Nothing happened during the second window. The algorithm thus proposes the two local diagnoses (top and bottom in Figure 5, right), but the links between the initial states were lost during the projection, and we then get a decomposition of the global diagnosis which is not state-independent: the Cartesian product of the local final states contains, for example, the state in which both components are faulty. To get the exact final states, the only solution would be to synchronise the local diagnoses and then use the restriction operation with the final states of the first window as argument, which is not as economical as expected. We propose below a solution to this problem.

[Figure 5: Example of loss of information in a naive decentralised representation of the incremental diagnosis]

4.3 TI + abstract representation

The solution we propose is to add an abstract representation of the diagnosis to the set of transition-independent subsystem diagnoses. We first define what an abstraction is, and then show that it allows us to keep the benefit of the decentralised representation even when it is not state-independent wrt the global diagnosis, as shown in Section 4.2. An abstraction of an automaton only preserves as states the initial and final states of the original automaton, and abstracts the trajectories existing in the original automaton into a transition labeled by ∅.

Definition 13 (Abstraction). Let A = (Q, E, T, I, F) be a trim automaton. The abstraction of A, denoted Abst(A), is the (trim) automaton A′ = (Q′, E′, T′, I′, F′) where: Q′ = I ∪ F; E′ = ∅; T′ = { (q, ∅, q′) | ∃ traj ∈ Traj(A) whose first state is q and whose last state is q′ }; I′ = I; and F′ = F.

The following two properties can easily be proved.

Property 3: Let A_1 and A_2 be two transition-independent automata. Then Abst(A_1) ⊗ Abst(A_2) = Abst(A_1 ⊗ A_2).

Property 4: Let A_1 and A_2 be two transition-independent automata and let Q″ be a set of states. Then (Abst(A_1) ⊗ Abst(A_2))[Q″] = Abst((A_1 ⊗ A_2)[Q″]).

The main problem with the loss of the state-independence property is that the set of final states can no longer be obtained by a mere Cartesian product of the final states of the subsystem diagnoses.
The abstraction allows us to compute the final states without having to perform the expensive synchronisation of the subsystem diagnoses: the final states are directly computed as the Cartesian product of the final states of the abstractions of the subsystem diagnoses, which is far less expensive.

Let us consider the i-th window W_i. We know the set I_i of initial states of the current window, as they are the final states of the preceding one. This set can be in a decentralised form, i.e. described by sets of states I_i^1, ..., I_i^n such that I_i = I_i^1 × ... × I_i^n. As explained in Section 4.2, the subsystem diagnoses are computed using Algorithm 1, which returns a set {Δ_i^{γ_1}, ..., Δ_i^{γ_k}} of transition-independent diagnoses. We need to get the final states, as they are used to restrict the initial states of the next window; but, in the absence of the state-independence property, they can no longer be computed from the final states of the Δ_i^{γ_j}. To build the abstract representation, we propose to use Algorithm 2. To obtain the set of final states, the idea is, instead of synchronising the transition-independent automata themselves, to synchronise their abstractions. Then a restriction is performed, using the initial states I_i, to get the exact final states F_i. As, at the end, all the abstract subsystem diagnoses composing the abstract representation AbsΔ_i are state-independent, we know that the set of initial states of Δ_i is the set of initial states of AbsΔ_i. Moreover, we have the following property:

Property 5: The set of final states of AbsΔ_i is the set of final states of Δ_i.
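For the two-subsystem case, the abstraction of Definition 13 and the final-state computation it enables might be sketched as follows. This is an illustration built on the earlier hypothetical Automaton and synchronise() sketches; reaches() is a plain graph search, and the set of allowed initial states is assumed to be given explicitly as pairs.

```python
# Illustrative sketch of Definition 13 and of the window final states.

def reaches(a, src, dst):
    """True iff a path over a's transitions leads from src to dst."""
    seen, stack = {src}, [src]
    while stack:
        q = stack.pop()
        if q == dst:
            return True
        for (p, _, r) in a.transitions:
            if p == q and r not in seen:
                seen.add(r)
                stack.append(r)
    return False

def abstraction(a):
    """Abst(A): keep the initial and final states, with one empty-labeled
    transition per (initial, final) pair joined by some trajectory."""
    trans = {(q, frozenset(), q2)
             for q in a.initial for q2 in a.final if reaches(a, q, q2)}
    return Automaton(set(a.initial) | set(a.final), set(), trans,
                     set(a.initial), set(a.final))

def window_final_states(d1, d2, allowed_initial):
    """Exact final states of the window for two TI subsystem diagnoses:
    synchronise the (tiny) abstractions instead of the diagnoses, and keep
    only the final pairs reachable from the allowed initial pairs."""
    a = synchronise(abstraction(d1), abstraction(d2))
    return {q2 for (q, _, q2) in a.transitions if q in allowed_initial}
```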

Algorithm 2: Computation of the abstract representation AbsΔ_i of the diagnosis of W_i
  input: the local diagnoses {Δ_i^{γ_1}, ..., Δ_i^{γ_k}}, the set I_i of initial states of the window
  AbsΔ_i := {Abst(Δ_i^{γ_1}), ..., Abst(Δ_i^{γ_k})}
  while there exist two abstract diagnoses in AbsΔ_i that are not a state-independent decomposition of the corresponding abstract diagnosis do
    replace them in AbsΔ_i by their synchronisation, restricted by the corresponding projection of I_i
  end while
  return: AbsΔ_i

It is then possible to get the set of final states of Δ_i without synchronising the transition-independent subsystem diagnoses. The decentralised representation of the diagnosis on a temporal window is thus the set of its transition-independent subsystem diagnoses, together with the set of its transition- and state-independent abstract diagnoses.

5 Experiments

In this section, we present an experimentation of diagnosis using the decentralised and incremental approach. We first present the system to diagnose, and then give the results.

5.1 System

The system we want to diagnose is a network of interconnected components, as presented in Figure 6.

[Figure 6: Topology of the network]

Each component has the same behaviour: when a fault occurs on a component, it reboots and forces its neighbours to reboot too. When asked to reboot, a component sends the observation IReboot_i (where i is the number of the component), and when the reboot is finished, it sends the observation IAmBack_i. At the beginning of its rebooting process, a component can again be asked to reboot by another component (and then sends the IReboot_i observation again). The model is presented in Figure 7. The reboot! message indicates that reboot is sent to all the neighbours, and the reboot? message indicates that a neighbour sent the reboot message to the component. So, for example, a component with three neighbours has three transitions from the ok state to the rebooting state, one per neighbour, each labeled by the corresponding reboot? event together with the IReboot observation.

[Figure 7: Model of a component]

Let us remark that the decentralised modelling only contains a few states per component, while the global model would contain a number of states close to the product of the component state counts.

5.2 Results

The algorithms were programmed in Java and run on a Linux machine with a 1.73 GHz Intel processor. We deal with 45 observations. The experimental results are given in Table 1. The first experiment was made with a unique temporal window, as presented in Section 3. The computation took more than 26 minutes and produced several automata, one of which is very large. It can be noted that taking the transition-independence property of diagnoses into account in the decentralised representation is interesting, as four independent subsystems are identified; this avoids computing the shuffle of these subsystem diagnoses, which is certainly a very good point. However, due to the length of the window, one of the automata is still very large.

Using the method described in Section 4, the observations are then sliced into temporal windows. The diagnosis was computed in less than one second, producing small automata; the total numbers of states and transitions are only a small fraction of those of the previous automaton.
It confirms that slicing observations is beneficial, in that it increases the number of independent subsystems, and thus of independent diagnoses.

[Table 1: Results of the experiments: numbers of states, transitions and automata, and computation times, for the no-slicing case and the three slicings considered]

Let us now stress the importance of the slicing on the good results of the method. In a third experiment, the first temporal window of the previous experiment was sliced into two. The numbers of states and transitions of the diagnosis increased noticeably, and the computation time increased to 10 seconds. The reason is that one sometimes needs enough observations on a subsystem to conclude that this subsystem did not communicate with another subsystem.

In a fourth experiment, two temporal windows of the first slicing were merged into one unique window. The corresponding computation time then rose to several minutes, and the numbers of states and transitions exploded. This confirms that the slicing operation is a critical operation and that deciding what the best slicing is remains an appealing perspective.

6 Conclusion

In this paper, we consider the diagnosis of discrete-event systems modeled by automata. To avoid the state-explosion problem that appears when dealing with large systems, we use a decentralised computation of the diagnosis. This approach consists in dividing the system into transition-independent subsystems. We show that the global diagnosis can be safely represented by the set of diagnoses of these transition-independent subsystems. An important point is that the initial and final states can be easily computed from this decentralised representation, by relying on the state-independence property which we define. It is then clear that the smaller the transition-independent subsystems are, the better the diagnosis computation is, both in time and in space.

When the period of observation is long, independent subsystems are very seldom found, since each component eventually interacts with most of its neighbours. We thus propose to slice the diagnosis period into temporal windows, in order to get, on these windows, transition-independent subsystems. The problem that appears is that the state-independence property does not hold anymore, and we are then no longer able to get the exact final states. On the one hand, a set of diagnoses for transition-independent but not state-independent subsystems gives us only a superset of the global diagnosis, which is not satisfactory. On the other hand, computing the set of transition-independent and state-independent subsystem diagnoses would be too expensive. We thus propose to keep the decentralised diagnosis representation (a set of transition-independent subsystem diagnoses) and to add an abstract representation of both state- and transition-independent diagnoses, enabling us to compute the final states in an economical and efficient way. We show that we thereby get a safe representation of the global diagnosis.

Some points need to be analysed in more detail. As can be seen in Algorithm 2, it is necessary to have an efficient way to check whether two abstract diagnoses are state-independent or not, and we are currently working on this point. Another concern is the slicing. As shown in Section 5, a bad slicing can lead to very little benefit. An interesting prospect would be to automatically find the best slicing, so as to obtain a diagnosis represented as efficiently as possible. In this article, we considered that the observations were certain and ordered. In real-world systems, this hypothesis generally does not hold, and we proposed to represent the observations by an automaton [Grastien et al., 2005]; the results of this article can be extended to cope with that. A more difficult case is when the observations have to be sliced on-line, while not all of them have been received yet. Finally, since we deal with state spaces that differ from one window to the next, it would be interesting to use these results for reconfigurable systems, whose topology (the set of components and the connections between them) can evolve over time, as considered for instance in [Grastien et al., 2004].

References

[Baroni et al., 1999] P. Baroni, G. Lamperti, P. Pogliano, and M. Zanella.
Diagnosis of large active systems. Artificial Intelligence, 110, 1999.

[Benveniste et al., 2005] A. Benveniste, S. Haar, E. Fabre, and C. Jard. Distributed monitoring of concurrent and asynchronous systems. Discrete Event Dynamic Systems, 15(1):33–84, 2005.

[Cassandras and Lafortune, 1999] C. Cassandras and S. Lafortune. Introduction to Discrete Event Systems. Kluwer Academic Publishers, 1999.

[Console et al., 2000] L. Console, C. Picardi, and M. Ribaudo. Diagnosis and diagnosability using PEPA. In ECAI 2000, 2000.

[Cordier and Largouët, 2001] M.-O. Cordier and Ch. Largouët. Using model-checking techniques for diagnosing discrete-event systems. In Twelfth International Workshop on Principles of Diagnosis (DX-01), pages 39–46, 2001.

[Cordier and Thiébaux, 1994] M.-O. Cordier and S. Thiébaux. Event-based diagnosis for evolutive systems. In DX-94, pages 64–69, 1994.

[Grastien et al., 2004] A. Grastien, M.-O. Cordier, and Ch. Largouët. Extending decentralized discrete-event modelling to diagnose reconfigurable systems. In Fifteenth International Workshop on Principles of Diagnosis (DX-04), pages 75–80, 2004.

[Grastien et al., 2005] A. Grastien, M.-O. Cordier, and Ch. Largouët. Incremental diagnosis of discrete-event systems. In Sixteenth International Workshop on Principles of Diagnosis (DX-05), 2005.

[Lamperti and Zanella, 2003] G. Lamperti and M. Zanella. Diagnosis of Active Systems. Kluwer Academic Publishers, 2003.

[Lunze, 1999] J. Lunze. Discrete-event modeling and diagnosis of quantized dynamical systems. In Tenth International Workshop on Principles of Diagnosis (DX-99), 1999.

[Pencolé and Cordier, 2005] Y. Pencolé and M.-O. Cordier. A formal framework for the decentralised diagnosis of large scale discrete event systems and its application to telecommunication networks. Artificial Intelligence, 164(1-2), 2005.

[Pencolé et al., 2001a] Y. Pencolé, M.-O. Cordier, and L. Rozé. A decentralized model-based diagnostic tool for complex systems. In Thirteenth IEEE International Conference on Tools with Artificial Intelligence (ICTAI'01), 2001.

[Pencolé et al., 2001b] Y. Pencolé, M.-O. Cordier, and L. Rozé. Incremental decentralized diagnosis approach for the supervision of a telecommunication network. In Twelfth International Workshop on Principles of Diagnosis (DX-01), 2001.

[Rozé and Cordier, 1998] L. Rozé and M.-O. Cordier. Diagnosing discrete-event systems: an experiment in telecommunication networks. In Fourth International Workshop on Discrete Event Systems (WODES'98), 1998.

[Sampath et al., 1996] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D. Teneketzis. Failure diagnosis using discrete-event models. IEEE Transactions on Control Systems Technology, 4(2), 1996.

Multiple Fault Diagnosis in Complex Physical Systems

Matthew Daigle, Xenofon Koutsoukos, and Gautam Biswas
Institute for Software Integrated Systems (ISIS), Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN

Abstract

Multiple fault diagnosis is a challenging problem because the number of candidates grows exponentially in the number of faults. In addition, multiple faults in dynamic systems may be hard to detect, because they can mask or compensate for each other's effects. The multiple fault problem is important, since the single fault assumption can lead to incorrect or failed diagnoses when multiple faults occur. We present an approach to simultaneous and cascaded multiple fault diagnosis in dynamical systems. Our approach is based on the TRANSCEND fault isolation scheme, where fault effects are represented as qualitative fault signatures. A notion of multiple fault diagnosability is introduced with respect to most likely minimal candidates. The online fault isolation algorithm explores the candidate space in increasing candidate size to generate minimal candidates. A mobile robot example demonstrates the approach.

1 Introduction

Fault detection and isolation (FDI) is a key component of any safety-critical system. When faults and degradations occur, it is important to quickly identify the fault that occurred, so that corrective actions can be taken in a timely manner and catastrophic situations can be avoided. In general, a number of different failures can happen in complex systems, and the likelihood of multiple faults occurring increases in harsh operating environments. FDI schemes that do not take multiple faults into account run the risk of generating incorrect diagnoses, or even of failing to find a diagnosis after faults occur.

Our approach focuses on multiple fault diagnosis in complex physical systems. It is based on the TRANSCEND framework [Mosterman and Biswas, 1999; Manders et al., 2000], which employs a qualitative approach for the analysis of fault transient behavior. The diagnosis model is used to generate fault signatures, which represent magnitude and higher-order effects of faults on the measurements. Multiple fault diagnosis is a difficult problem in dynamical systems because interactions among fault effects can obscure the fault signatures. In this paper, we provide a systematic scheme for the generation of multiple fault signatures from the single fault signatures. We analyze the multiple fault signatures to define the notion of n-diagnosability, which defines diagnosability with respect to most likely minimal fault sets, where n is the maximum allowed fault multiplicity. We then present an extension to the online fault isolation algorithm of TRANSCEND such that it finds the most likely minimal fault set that is consistent with the observed measurement deviations. If a system is n-diagnosable for some n, the algorithm will isolate a unique multiple fault candidate if n or fewer faults occur.

Previous work in multiple fault diagnosis has concentrated mostly on static systems. The approach in [de Kleer and Williams, 1987] is based on conflict recognition and candidate generation. The system, GDE, utilizes the notion of minimal candidates and chooses the next best measurements to make based on a priori fault probabilities. In our approach, measurements must be selected at design time, and they are used to generate and refine fault hypotheses when deviations from nominal behavior are observed.
The GDE approach parallels the consistency-based diagnosis approach of [Reiter, 1987], an extension of which is presented in [Ng, 1990] to handle diagnosis of devices whose behavior changes over time. The changes are modeled by a set of qualitative simulation states. A similar approach that handles behavioral modes is presented in [Subramanian and Mooney, 1996]. In contrast, our approach applies to continuous-time models and can handle both additive and multiplicative faults. A control theory-based approach based on residual structures is described in [Gertler, 1998]. A residual structure is derived to meet the desired isolation properties. Our approach to multiple fault representation is somewhat analogous, although our residuals map to a richer feature set.

The paper is organized as follows. Section 2 describes the TRANSCEND approach to qualitative fault isolation and presents the example model. Section 3 formulates the representation of multiple faults and a notion of multiple fault diagnosability based on the representation. Section 4 extends the fault isolation algorithm of TRANSCEND to account for multiple faults. Section 5 demonstrates our approach to multiple fault diagnosis. Section 6 concludes the paper.

2 Background

TRANSCEND [Mosterman and Biswas, 1999] is a well-developed methodology for diagnosis of abrupt faults in complex physical systems with continuous dynamics. It employs a qualitative model-based approach for fault isolation. System models are constructed using bond graphs [Karnopp et al., 2000]. Faults are modeled as abrupt and persistent changes in parameter values of components in the bond graph model of the system.

Fault isolation in TRANSCEND is based on a qualitative analysis of the transient dynamics caused by abrupt faults. Deviations in measurement values after a fault occurrence constitute a fault signature, where predicted deviations in magnitude and higher-order derivative values are mapped to {+, 0, -} symbols, which correspond to a deviation above normal, no deviation, and a deviation below normal, respectively. Fault isolation in TRANSCEND utilizes a Temporal Causal Graph (TCG) representation, which can be derived directly from the bond graph model of the system. The TCG captures the causal and temporal relations between system variables. It specifies the signal flow graph of the system in a form where edges are labeled with single component parameter values or direct or inverse proportionality relations. Fault signatures are generated using a forward-propagation algorithm on the TCG to predict the qualitative effects of faults on measurements. The qualitative effect of a fault, + or -, is propagated to all measurement vertices in the TCG to determine fault signatures for each measurement.

We denote the set of all faults as ℱ = {f_1, f_2, ..., f_κ} and the set of all measurements as M = {m_1, m_2, ..., m_λ}. For f ∈ ℱ and m ∈ M, σ_{f,m} is the fault signature for measurement m given that fault f has occurred. Two faults f_i, f_j ∈ ℱ are distinguishable using fault signatures if (∃m ∈ M) σ_{f_i,m} ≠ σ_{f_j,m}.

Relative measurement orderings [Daigle et al., 2005] are an extension to the original TRANSCEND algorithm. The extended algorithm uses predicted temporal orders of measurement deviations to discriminate between faults. This is extended here for multiple fault diagnosis. Like fault signatures, measurement orderings are derived systematically from the TCG. They are based on common subpaths in the model. A measurement ordering is denoted as m_1 ≺_f m_2, meaning that if fault f occurs, measurement m_1 will deviate before measurement m_2. We denote the set of such orderings as Ω_{f_i} for fault f_i ∈ ℱ. Two faults are distinguishable using orderings if their ordering sets are in temporal conflict.

Definition 1 (Temporal Conflict). Ω_{f_i} is in temporal conflict with Ω_{f_j} if (∃m_i, m_j ∈ M) m_i ≺_{f_i} m_j ∧ m_j ≺_{f_j} m_i.

Fault isolation starts with a backward propagation of an observed symbolic deviation to identify initial fault candidates. Once candidate hypotheses are identified, a forward propagation algorithm generates the fault signatures and measurement orderings, i.e., the effects of each hypothesized fault on measurements. Then observed deviations are compared to predictions using a progressive monitoring scheme to discriminate between the fault hypotheses.

Throughout the paper we focus on a mobile robot as an example system. Details of the system model and TCG for this system are described in [Daigle et al., 2006] and very briefly here.

The bond graph is shown in Figure 1. The robot model consists of inertia, capacitor, and resistor elements modeling masses and inertias, mechanical stiffness, and energy dissipation in the system, respectively. The 1-junctions represent the common velocity points, and the 0-junctions common force points. The TCG is given in Figure 2. State variables are circled and measured variables boxed. Edges with a dt specifier imply an integration effect. All other edges are instantaneous.

[Figure 1: Mobile robot bond graph]

[Figure 2: Mobile robot TCG]

Table 1 shows fault signatures for actuator (left: A_L, right: A_R), encoder (left: E_L, right: E_R), and gyroscope (positive bias: G+, negative bias: G-) faults in the mobile robot system. The measurements include the velocity of the left wheel, v_L, the velocity of the right wheel, v_R, and the heading, θ. The first symbol indicates a predicted magnitude change (discontinuity) and the second symbol indicates the first nonzero slope symbol in the measurement. A * indicates an indeterminate effect. It is indistinguishable from a + or a - because it could manifest as either effect. For example, from the TCG we cannot determine whether A_L causes a 0+ or a 0- effect on v_R. Relative measurement orderings are also listed in the table.

Fault   v_L   v_R   θ     Measurement orderings
A_L     0-    0*    0+    v_L ≺ v_R, v_L ≺ θ
A_R     0*    0-    0-    v_R ≺ v_L, v_R ≺ θ
E_L     -+    0*    0-    v_L ≺ v_R, v_L ≺ θ
E_R     0*    -+    0+    v_R ≺ v_L, v_R ≺ θ
G+      0+    0-    +-    θ ≺ v_L, θ ≺ v_R
G-      0-    0+    -+    θ ≺ v_L, θ ≺ v_R

Table 1: Fault signatures for the robot system

3 Multiple Fault Diagnosability

Single faults are isolated by comparing predicted to actual measurement deviations. The predictions depend on which measurements are selected in the system, because different measurements provide different discriminatory information. If the prediction models (fault signatures and measurement orderings) of two faults differ, we say that these two faults are distinguishable.

Definition 2 (Single Fault Distinguishability). Two faults f_i, f_j ∈ ℱ are distinguishable if (∃m ∈ M) σ_{f_i,m} ≠ σ_{f_j,m} or (∃m_i, m_j ∈ M) m_i ≺_{f_i} m_j ∧ m_j ≺_{f_j} m_i.

Definition 3 (Single Fault Diagnosability). A system is single fault diagnosable if (∀f_i, f_j ∈ ℱ) f_i and f_j are distinguishable.

For single faults, the isolation procedure compares the observed measurement deviations over time to those predicted by the fault signatures and measurement orderings. If the system is diagnosable, then there exists a unique fault which is consistent with these deviations. We expand our fault isolation procedure to deal with multiple fault candidates.

Definition 4 (Candidate). A candidate is a set of faults c ⊆ ℱ that is consistent with the observations. The set of all candidates is denoted as C = P(ℱ) and that of all candidates of size n as C(n).

Multiple fault diagnosis algorithms are more complex than single fault diagnosis algorithms for two reasons. First, the effects of a fault could be masked or compensated by the effects of another fault. For example, A_L may occur, causing deviations of 0- on v_L, 0- on v_R, and 0+ on θ. Clearly, these observations are consistent with only A_L occurring. However, if A_R also occurred, but with a smaller magnitude, so that the effects of A_L dominate, the fault sets {A_L} and {A_L, A_R} cannot be distinguished. So, we seek to define diagnosability with respect to most likely minimal candidates.
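For concreteness, Table 1 and the checks of Definitions 1 and 2 can be transcribed directly into code. The following Python sketch is purely illustrative (the dictionary names and the treatment of * as a wildcard are assumptions of this sketch, not the TRANSCEND implementation):

SIGNATURES = {  # fault -> (v_L, v_R, theta); '*' marks an indeterminate symbol
    'AL': ('0-', '0*', '0+'), 'AR': ('0*', '0-', '0-'),
    'EL': ('-+', '0*', '0-'), 'ER': ('0*', '-+', '0+'),
    'G+': ('0+', '0-', '+-'), 'G-': ('0-', '0+', '-+'),
}
ORDERINGS = {  # fault -> set of (earlier, later) measurement pairs
    'AL': {('vL', 'vR'), ('vL', 'th')}, 'AR': {('vR', 'vL'), ('vR', 'th')},
    'EL': {('vL', 'vR'), ('vL', 'th')}, 'ER': {('vR', 'vL'), ('vR', 'th')},
    'G+': {('th', 'vL'), ('th', 'vR')}, 'G-': {('th', 'vL'), ('th', 'vR')},
}

def symbols_differ(a, b):
    # two signature entries differ only where both symbols are determinate
    return any(x != y and '*' not in (x, y) for x, y in zip(a, b))

def temporal_conflict(fi, fj):
    # Definition 1: some ordering of f_i is reversed in f_j
    return any((b, a) in ORDERINGS[fj] for (a, b) in ORDERINGS[fi])

def distinguishable(fi, fj):
    # Definition 2: signatures differ on some measurement, or orderings conflict
    return any(symbols_differ(a, b)
               for a, b in zip(SIGNATURES[fi], SIGNATURES[fj])) \
        or temporal_conflict(fi, fj)

print(distinguishable('AL', 'AR'))  # True: they differ on theta (0+ vs 0-)
print(distinguishable('AL', 'EL'))  # True: they differ on v_L (0- vs -+)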
The second complication in multiple fault diagnosis is that the same multiple fault can manifest in different ways. For example, A_L together with E_L could produce either a 0- effect or a -+ effect on v_L, depending on which fault occurs first and on the fault propagation delays in the system. If E_L occurs first, we will see -+, because discontinuities are observed at the point of fault occurrence. However, if A_L occurs first, we may see either 0- or -+, depending on how soon E_L occurs after A_L. Figure 3 illustrates this point. If E_L occurs close enough to A_L, the deviation caused by A_L may not be detected. The symbol generation on the measurement residual could compute either effect. The second change is also not helpful, because it could be caused either by a new fault or by the dynamics of the original fault.

[Figure 3: Effect of fault occurrence times on symbol generation of residual r(t)]

3.1 Representing Multiple Faults

Taking these issues into account, we represent the effects of multiple faults on a single measurement as the union of the predicted single fault effects. For example, the fault set {A_L, E_L} could manifest either 0- or -+ on v_L, 0- or 0+ on v_R, and 0- or 0+ on θ. A multiple fault signature for a set of faults F ⊆ ℱ, denoted by σ_{F,m}, is an element of the set of possible fault signatures for the faults in F, i.e., Σ_{F,m} = {σ_{f,m} | f ∈ F}. We define a complete fault signature as follows.

Definition 5 (Complete Fault Signature). A complete fault signature for fault f ∈ ℱ, denoted σ_f, is a tuple (σ_{f,m_1}, σ_{f,m_2}, ..., σ_{f,m_λ}) consisting of the signatures of f for each measurement. A complete multiple fault signature for a fault set F ⊆ ℱ is an element of the set of complete fault signatures Σ_F, where an element is denoted as σ_F = (σ_{F,m_1}, σ_{F,m_2}, ..., σ_{F,m_λ}), such that (∀σ_F ∈ Σ_F)(∀σ_{F,m_i} ∈ σ_F) σ_{F,m_i} ∈ Σ_{F,m_i}.

Informally, a complete multiple fault signature for F is a complete signature which can be constructed by choosing and combining signatures for single measurements from faults in the fault set F. As an example, Table 2 shows Σ_{A_L,E_L}.

     v_L   v_R   θ     Realizable?
1    0-    0-    0-    no
2    0-    0-    0+    yes
3    0-    0+    0-    no
4    0-    0+    0+    yes
5    -+    0-    0-    yes
6    -+    0-    0+    no
7    -+    0+    0-    yes
8    -+    0+    0+    no

Table 2: The complete signatures of Σ_{A_L,E_L} and their physical realizability

A complete multiple fault signature can be created by choosing single signatures from 1 to |F| faults, where |F| is the size of the fault set F. As a result, a complete multiple fault signature set will consist of all the complete signatures of the individual faults it contains. Therefore, fault effects due to fault masking and compensation are included. In general, for F′ ⊆ F, we have Σ_{F′} ⊆ Σ_F. This is evidenced in Table 2, e.g., {A_L, E_L} can produce (-+, 0+, 0-), and according to Table 1, so can E_L by itself. The double fault {A_L, E_L} may occur, but the observed deviations may be consistent with A_L or E_L occurring by themselves.
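The construction of Definition 5 is mechanical. The sketch below (illustrative code reusing the SIGNATURES table from the earlier sketch) expands each indeterminate * into both concrete signs, as done for Table 2, and enumerates Σ_F as the cross product of the per-measurement unions:

from itertools import product

def expand(symbol):
    # expand an indeterminate '*' into both concrete signs, e.g. 0* -> 0-, 0+
    return [symbol.replace('*', s) for s in '-+'] if '*' in symbol else [symbol]

def complete_signatures(F):
    # Sigma_{F,m}: per-measurement union of the single-fault effects
    per_measurement = []
    for m in range(3):  # v_L, v_R, theta
        effects = set()
        for f in F:
            effects.update(expand(SIGNATURES[f][m]))
        per_measurement.append(sorted(effects))
    # Sigma_F: every combination of per-measurement choices (Definition 5)
    return sorted(product(*per_measurement))

for sig in complete_signatures({'AL', 'EL'}):
    print(sig)  # the eight rows of Table 2

Only the realizability filter of the next subsection separates rows 2, 4, 5 and 7 of Table 2 from the others.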
3.2 Physically Realizable Fault Signatures

Not all signatures in Σ_F may physically manifest in the system behavior; this is determined by the fault propagation times inherent in the system. The set Σ_F can be constrained by using temporal information in the system model. The resulting set is called the set of physically realizable fault signatures.

Definition 6 (Physical Realizability). The set of physically realizable complete fault signatures for a fault set F, denoted Σ^R_F, is the set of multiple fault signatures for F that are consistent with the TCG model of system behavior.

Whether some σ_F ∈ Σ_F belongs in Σ^R_F can be determined using relative measurement orderings. Consider E_L and G+. Both faults produce discontinuities (-+ or +-) on some measurement. Because discontinuities manifest at the point of fault occurrence, it is not possible for both faults to occur without a discontinuity being observed. We must observe either -+ on v_L, +- on θ, or both. Therefore, (0+, 0-, 0-), for example, should not be in Σ^R_{E_L,G+}.

This notion can be formalized with relative measurement orderings. Essentially, single fault orderings should be obeyed with respect to single fault signatures. If some fault f_i produces a deviation on a measurement m_i before another measurement m_j, and another fault f_j produces a deviation on m_j before m_i, then, if both faults occur, we cannot observe f_i's effect on m_j together with f_j's effect on m_i as the first effects on m_i and m_j. (We are only interested in the first observed measurement deviation, since that is what the symbol generator provides.) To see f_i's effect on m_j, we would have had to observe its effect on m_i first. Similarly, to see f_j's effect on m_i, we would have had to observe its effect on m_j first. For simplicity, we express this constraint in terms of two faults and two measurements. An automata representation is given as Figure 4(a). The top automaton represents the ordering m_1 ≺_{f_1} m_2 and the bottom one m_2 ≺_{f_2} m_1. If f_1 affects m_1 first (event σ_{f_1,m_1}) and f_2 affects m_2 first (event σ_{f_2,m_2}), then we cannot observe both f_1's effect on m_2 and f_2's effect on m_1 as the first deviations on m_1 and m_2. If these are the only two measurements, then, if f_1 and f_2 occur together, we must observe f_1's effect on m_1 or f_2's effect on m_2 as the first deviation on the respective measurements. This property is expressed by the synchronous composition of the two automata, and is stated formally as the following lemma.

[Figure 4: Realizability constraint representations: (a) Constraint 1, (b) Constraint 2]

Lemma 1 (Realizability Constraint 1). For two faults f_i, f_j ∈ ℱ and two measurements m_i, m_j ∈ M, if m_i ≺_{f_i} m_j and m_j ≺_{f_j} m_i, then (∀σ_{{f_i,f_j}} ∈ Σ_{{f_i,f_j}}) σ_{{f_i,f_j}} ∉ Σ^R_{{f_i,f_j}} if σ_{{f_i,f_j},m_i} = σ_{f_j,m_i} ≠ σ_{f_i,m_i} and σ_{{f_i,f_j},m_j} = σ_{f_i,m_j} ≠ σ_{f_j,m_j}.

A related constraint evolves from this information. Consider again the fault set {A_L, E_L}. The orderings predict that both faults manifest in v_L first. Therefore, if v_L deviates as 0-, then A_L will propagate to the rest of the measurements before E_L does, so we will not see any effects inconsistent with A_L; e.g., we will not see 0- on θ. This is because E_L cannot propagate from v_L to θ any faster than A_L can. The physical reasoning behind this constraint is that the ordering m_i ≺_{f_i} m_j implies that, given f_i has occurred, the fastest way to reach m_j is through m_i. So if some other fault f_j reaches m_i first, it will traverse this same path to m_j and cause m_j to deviate from its effect propagating on this path (or from some faster path from f_j to m_j). Therefore, when f_i's effect finally reaches m_i, it cannot propagate to m_j any faster than f_j's had, so we cannot observe its effect on m_j. For simplicity, we express this constraint also in terms of two faults and two measurements. An automata representation is given as Figure 4(b). The top automaton represents the ordering m_1 ≺_{f_1} m_2 and the bottom one represents the constraint that we will only observe the effect on a measurement from one fault. If f_2 affects m_1 first, then we cannot observe f_1's effect on m_2. This property is expressed by the synchronous composition of the two automata, and is stated formally as the following lemma.

Lemma 2 (Realizability Constraint 2). For two faults f_i, f_j ∈ ℱ and two measurements m_i, m_j ∈ M, if m_i ≺_{f_i} m_j, then (∀σ_{{f_i,f_j}} ∈ Σ_{{f_i,f_j}}) σ_{{f_i,f_j}} ∉ Σ^R_{{f_i,f_j}} if σ_{{f_i,f_j},m_i} = σ_{f_j,m_i} ≠ σ_{f_i,m_i} and σ_{{f_i,f_j},m_j} = σ_{f_i,m_j} ≠ σ_{f_j,m_j}.

Table 2 lists the set of physically realizable signatures based on these constraints for {A_L, E_L}. Signatures 1, 3, 6, and 8 are not realizable due to the second constraint.

An additional constraint that we impose is to allow only certain combinations of faults, as this will also limit the number of complete multiple fault signatures. It does not make sense to allow fault sets consisting of multiple changes of the same parameter, because we assume fault effects are persistent. Therefore, examples such as {G+, G-} are not valid candidates. We also employ practical knowledge about systems to limit the size of allowable fault candidate sets. The assumption is that candidates with a large number of faults are highly unlikely; therefore, we assume that the maximum candidate size is n. The set of all fault signatures for fault sets of size at most n is denoted as Σ(n) = {σ_F ∈ Σ_F | F ⊆ ℱ, |F| ≤ n}. The set of all physically realizable fault signatures for fault sets of size at most n is denoted as Σ^R(n) = {σ_F ∈ Σ^R_F | F ⊆ ℱ, |F| ≤ n}.

The realizability constraints can be extended to multiple faults and measurements. A general way to describe the constraints is by using the automata representation. For a given fault set, we can describe its possible set of event trajectories (and thus its physically realizable fault signatures) by taking the synchronous product of all the single fault orderings and the two-state automata that represent a measurement being affected by only one fault. To compute Σ^R(n) from this, we need only restrict the trajectories to those including events from at most n faults. We can also define the measurement orderings that can be created by multiple faults as Ω_{F_i ∪ F_j} = Ω_{F_i} ∩ Ω_{F_j}, for F_i, F_j ⊆ ℱ. That is, only shared measurement orderings will be consistent with both faults occurring in any order. This can be seen in the automata representation of the orderings.
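To illustrate how these constraints prune Σ_F, the following hedged sketch implements the pairwise form of Lemma 2 (which, having the weaker premise and the same conclusion, also covers Lemma 1); it reuses SIGNATURES, ORDERINGS and complete_signatures from the earlier sketches:

M = ('vL', 'vR', 'th')

def matches(observed, predicted):
    # a concrete observed symbol pair matches a prediction with '*' wildcards
    return all(p in ('*', o) for o, p in zip(observed, predicted))

def realizable(F, sig):
    # sig: a concrete complete signature, one symbol pair per measurement
    for fi in F:
        for fj in F:
            if fi == fj:
                continue
            for (mi, mj) in ORDERINGS[fi]:
                i, j = M.index(mi), M.index(mj)
                # Lemma 2: with m_i <_{f_i} m_j, seeing f_j's (and not f_i's)
                # effect first on m_i forbids seeing f_i's effect on m_j
                if (matches(sig[i], SIGNATURES[fj][i])
                        and not matches(sig[i], SIGNATURES[fi][i])
                        and matches(sig[j], SIGNATURES[fi][j])
                        and not matches(sig[j], SIGNATURES[fj][j])):
                    return False
    return True

for sig in complete_signatures({'AL', 'EL'}):
    print(sig, realizable({'AL', 'EL'}, sig))  # False for rows 1, 3, 6 and 8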
3.3 n-Diagnosability

Based on the set of physically realizable multiple fault signatures and the relative measurement orderings for multiple faults, we can define the notion of distinguishability between candidates for multiple faults.

Definition 7 (Multiple Fault Distinguishability). Two fault sets F_i and F_j are distinguishable if Σ^R_{F_i} ∩ Σ^R_{F_j} = ∅ or Ω_{F_i} is in temporal conflict with Ω_{F_j}.

Informally, two fault sets are distinguishable if it is not possible for them to manifest in the system measurements in the same way. We do not, however, define multiple fault diagnosability using this definition. We described previously how, due to fault masking and compensation, a fault set and a superset may manifest in the same way. If so, then, for F′ ⊆ F, Σ^R_{F′} ⊆ Σ^R_F and Ω_F ⊆ Ω_{F′}. We therefore consider diagnosability only with respect to minimal candidates.

Definition 8 (Minimal Candidate). A candidate c is minimal if there does not exist a candidate c′ such that c′ ⊂ c.

In addition to using minimal candidates, we also consider the likelihood of fault occurrence. The assumption is that all faults are equally likely, so candidates of smaller size are more likely than those of larger size. Therefore, the ultimate goal of the fault isolation procedure is to isolate the minimal candidate of smallest size. In general, {f_1, f_2} and {f_3} may both be minimal candidates, because one is not a subset of the other. We consider {f_3} to be the simpler explanation because it is of smaller size. Therefore, the fault isolation procedure does not have to consider less likely candidates when more likely candidates exist.

The main reason for operating with most likely candidates is that fault masking and compensation may prevent us from isolating the true set of faults that has occurred. We do not wish to classify a system as undiagnosable because we cannot distinguish between a candidate and a superset. Like other work, we assume the principle of parsimony [Reiter, 1987] and consider a diagnosis to be the simplest explanation given the observed measurement deviations. The assumption is further supported, in general, by the fact that the probability of failure occurrence decreases significantly as fault size increases. A diagnosis only represents a best-effort result. A diagnosis of {f_1, f_2}, for example, means that at least f_1 and f_2 must have occurred, but does not mean that some other fault f_3 has not also happened; rather, it only implies that f_3 could not have occurred by itself.

Definition 9 (Fault Isolation Procedure). Given a candidate size limit n > 0 and the set of measurement orderings, the fault isolation procedure is a function I : Σ^R(n) → P(C(n)).

Fault isolation operates in a progressive fashion as new measurements deviate. Because only physically realizable fault signatures for candidates of size at most n are given as input, this function will always return a nonempty set of candidates. Multiple fault diagnosability is defined in terms of the fault isolation procedure and the given candidate size limit.

Definition 10 (n-Diagnosability). Given a candidate size limit n, a system is n-diagnosable if, after all measurements have deviated, (∀σ_F ∈ Σ^R(n)) |I(σ_F)| = 1.

Informally, a system is n-diagnosable if, given any physically realizable multiple fault signature for candidates of size at most n, a single minimal candidate of smallest size is isolated. We next describe our fault isolation procedure based on this notion of multiple fault diagnosability.

4 Diagnosing Multiple Faults

We follow the conflict-based approach of [de Kleer and Williams, 1987], where a conflict is defined as a set of assumptions which cannot all be true, and which thus supports a symptom (e.g., ¬(a_1 ∧ a_2 ∧ a_3)). In TRANSCEND, the TCG is used to create a direct mapping from faults to symptoms, i.e., fault signatures and measurement orderings. Instead of using conflicts, we refer to a hypothesis set, which represents all possible faults which can explain a particular symptom.

Definition 11 (Hypothesis Set). A hypothesis set is a set of faults, at least one of which must have occurred, given a particular set of measurement deviations that have occurred.

A hypothesis set is equivalent to a conflict, in that it represents a set of negated assumptions (an assumption being that a certain parameter is not faulty), at least one of which must be true (e.g., a conflict ¬(a_1 ∧ a_2 ∧ a_3) ≡ ¬a_1 ∨ ¬a_2 ∨ ¬a_3 ≡ f_1 ∨ f_2 ∨ f_3, a hypothesis set). Hypothesis sets can be generated directly from the fault signature matrix and the measurement orderings. Given a measurement deviation, we construct the hypothesis set as the set of faults consistent with the deviation. For example, a 0- on v_L, using only fault signatures, produces the hypothesis set {A_L, A_R, E_R, G-}. Any of these faults occurring, or combinations of them, supports the symptom.

Candidate generation proceeds similarly to [de Kleer and Williams, 1987]. As new measurements deviate, new hypothesis sets are generated. These hypothesis sets restrict the possible candidate space and result in a new set of minimal candidates. Given a new hypothesis set, new candidates are formed by adding a single fault from the new hypothesis set. Since a hypothesis set is a set of faults consistent with an observation, these new candidates will also be consistent with the new observation as well as with all old observations covered by the base candidate.

Because n-diagnosability only requires isolating a unique candidate of the smallest size, we introduce a candidate size limit into our procedure. As long as we have a candidate at our current size level, we do not explore candidates of larger size. Further, we only perform this analysis if we eliminate all candidates at the current level.

To illustrate the general approach, consider the fault set {A_L, A_R, E_L, E_R}. The candidate space, which can be represented as a lattice of C, is shown in Figure 5. The candidate size limit is given as n = 2, and the starting size level is n = 1. The first measurement deviation, -+ on v_L, generates the hypothesis set {E_L}, because only that fault can produce that deviation on v_L given that v_R and θ have not yet deviated. We now know that this fault must have occurred. At a later time point, we are given the deviation 0- on v_R. This generates the hypothesis set {A_R, E_L}, because only these faults can cause v_R to deviate that way given that θ has not yet deviated. A_L is not included in this hypothesis set because it did not cause v_L to deviate, so we cannot see its effect on v_R (this relates to the second realizability constraint). At this point, we still have a candidate of size 1, so we do not yet consider any of size 2. If we were to consider the complete fault set, then a deviation of +- for θ would rule out the possibility that E_L by itself occurred, and we would then consider candidates of size 2. If the system is 2-diagnosable, a unique candidate of size 2 will be identified.

[Figure 5: Candidate lattice for fault set {A_L, A_R, E_L, E_R}]

The pseudocode for the online diagnosis algorithm is shown as Algorithm 1. It works as follows. As new measurements deviate, hypothesis sets are formed and the candidate set is refined by eliminating inconsistent candidates. This follows the TRANSCEND approach. Eliminated candidates are saved for later analysis. If a single unique candidate is found during this procedure, the candidate is returned as the most likely minimal candidate, barring any future measurement deviations.
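The candidate refinement and expansion steps of Algorithm 1 below can be prototyped compactly. The following Python sketch is illustrative only; it reuses SIGNATURES, ORDERINGS, M and matches from the earlier sketches, simplifies the bookkeeping of Function 2, and reproduces the executions of Tables 4 and 5 in Section 5:

def hypothesis_set(m, effect, deviated):
    # faults whose signature matches the deviation on m and whose orderings
    # are satisfied by the deviations seen so far (cf. the two constraints)
    hs = set()
    for f in SIGNATURES:
        if not matches(effect, SIGNATURES[f][M.index(m)]):
            continue
        ok = True
        for (a, b) in ORDERINGS[f]:
            if b == m:
                # every measurement f orders before m must already have
                # deviated, and with f's own predicted effect
                if a not in deviated or not matches(
                        deviated[a], SIGNATURES[f][M.index(a)]):
                    ok = False
        if ok:
            hs.add(f)
    return hs

def isolate(observations, n=2):
    # observations: time-ordered (measurement, effect) deviations
    deviated, candidates, eliminated, history = {}, set(), [], []
    for m, effect in observations:
        hs = hypothesis_set(m, effect, deviated)
        deviated[m] = effect
        history.append(hs)
        if not candidates and not eliminated:
            candidates = {frozenset([f]) for f in hs}
            continue
        surviving = {c for c in candidates if c & hs}
        eliminated += [(c, len(history) - 1) for c in candidates - surviving]
        candidates = surviving
        if not candidates:  # expand eliminated candidates (cf. Function 2)
            for c, k in eliminated:
                for f in history[k]:
                    new = frozenset(c | {f})
                    if len(new) <= n and all(new & h for h in history[k:]):
                        candidates.add(new)
    return candidates

print(isolate([('vL', '-+'), ('vR', '0-'), ('th', '+-')]))
# -> {frozenset({'EL', 'G+'})}, as in Table 4
print(isolate([('vL', '0-'), ('vR', '0-'), ('th', '0-')]))
# -> {frozenset({'AL', 'AR'})}, as in Table 5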
Algorithm 1 Fault Isolation
Input: maximum candidate size n
Variables: current candidates list, hypothesis sets list, eliminated candidates list
When a new measurement deviates:
  Form the conflict and record it
  Eliminate inconsistent candidates
  if no candidates are left then
    Expand eliminated candidates to the next size
  end if
  if one candidate is left then
    Return the candidate
  end if

When the candidates at the candidate size level l are all eliminated, the discarded minimal candidates are used to produce new minimal candidates of size l + 1 using the hypothesis sets gathered. This procedure is given as Function 2. For each eliminated candidate, new candidates of size l + 1 are formed using the hypothesis set which caused it to be eliminated. Since the hypothesis set caused the elimination, the hypothesis set and the eliminated candidate have no common fault, so a candidate of size l cannot be constructed. Since new candidates are formed by adding exactly one fault from the hypothesis set, only candidates of size l + 1 are formed. Each new candidate formed is then checked for consistency with the hypothesis sets that were recorded after its base candidate was eliminated. If the new candidate is consistent with all of these, it is added to the current candidate list. If not, it is added to the eliminated candidates list, because applying a new hypothesis set would form a candidate of size l + 2, which we are not considering at that time. If no new candidates are found, then the level is increased and the process is repeated. If the size limit is reached, then an unmodeled fault or a fault combination of size > n has occurred.

Function 2 Expand Candidates
Input: maximum candidate size n
if candidate size limit is exceeded then
  Return failure
end if
for all eliminated candidates of the previous size do
  Construct new candidates using the conflict that caused its elimination
end for
Eliminate candidates inconsistent with the recorded conflicts
if no candidates are left then
  Expand eliminated candidates to the next size
else
  Return candidates
end if

Theorem 1. Algorithm 1 will return a unique most likely minimal candidate if the system is n-diagnosable and a fault combination of size l ≤ n occurs.

Proof. The algorithm never eliminates consistent candidates. The algorithm also only considers larger candidates when no smaller candidate can explain the observations. Therefore, the algorithm will find the smallest set of candidates at any level. If the system is n-diagnosable, then a unique candidate will exist of size ≤ n. If so, at the lowest possible level the algorithm will find a unique candidate.

If n is fixed, the computational complexity of the algorithm is polynomial in the number of single faults, because O(|ℱ|^n) multiple faults are considered. If n is left unspecified, we are limited to a fault multiplicity of |ℱ|. In this case the algorithm is exponential in the number of single faults.

In the single fault algorithm, as soon as a single fault is isolated, it is declared to be the true fault, and future measurement deviations can be ignored. In the case of multiple faults, a single isolated fault does not necessarily indicate the true fault. It only indicates the current simplest diagnosis, given the deviations observed thus far. So, future measurement deviations may result in a better understanding of what faults actually occurred in the system. If there is a unique candidate at any point, the algorithm will return it. Because more measurement deviations can only expand this candidate, the current unique candidate is partially correct. Future deviations may or may not provide a more exact diagnosis.

5 Mobile Robot Example

In this section, we go through a detailed example execution of Algorithm 1. First, however, we must analyze the diagnosability of the system to ensure we will get unique results. We let n = 2 for our analysis. Table 3 lists some of the physically realizable fault signatures for the robot system.

Σ^R(2)          Smallest minimal candidates
(0-,0-,0-)      {A_R} (v_R first) or {A_L,A_R} (v_L first)
(0-,0-,0+)      {A_L} (v_L first) or {A_L,A_R} (v_R first)
(0-,0-,+-)      {A_L,G+}, {A_R,G+} (θ first) or {A_L,G+} (v_L first) or {A_R,G+} (v_R first)
(0-,0-,-+)      {A_L,G-}, {A_R,G-} (θ first) or {A_L,G-} (v_L first) or {A_R,G-} (v_R first)
(0+,0-,0-)      {A_R} (v_R first)
(0+,0-,+-)      {G+} (θ first) or {A_R,G+} (v_R first)
...             ...
(-+,-+,0-)      {E_L,E_R} (v_L or v_R first)
(-+,-+,0+)      {E_L,E_R} (v_L or v_R first)

Table 3: 2-diagnosability analysis for the mobile robot

There are several points to make here. First, the signature (0+, 0-, 0+) is absent. This is because it violates the realizability constraints. There are several double faults which contain this signature in their signature sets. However, this signature is not physically realizable for any of them. Take, for example, {A_L, A_R}. Only A_R can produce 0+ on v_L. Because A_L causes v_L to deviate first (among its measurements), this means that A_R will affect θ first; however, only A_L can produce 0+ on θ. Thus, this signature violates the second realizability constraint for this double fault.

We also see from Table 3 that the system is not 2-diagnosable. If θ deviates first, observing either (0-, 0-, +-) or (0-, 0-, -+) cannot be explained by a single fault, but two double faults are consistent with each.
Orderings do not help either, because even if we see v L or v R deviate next, we do not know if that deviation was due to G + propagating or an actuator fault appearing. Although we cannot distinguish which actuator fault occurred with G +, we still know that G + must have occurred, and that some actuator fault has also occurred. This can sometimes be helpful. We now consider a double fault which is distinguishable, and demonstrate the execution of the algorithm. Table 4 illustrates the approach for {E L,G+ } occurring. First, v L deviates with a -+. Only an encoder fault of the left wheel can produce such a deviation on v L given that no other measurements have deviated, thus the hypothesis set is {E L } which becomes our first candidate. Next, v R deviates with a 0-. Given that θ has not yet deviated, the hypothesis set becomes {A R,E L }. G+ is not included in this hypothesis set because we would have seen θ deviate if it had occurred (constraint 1), and neither is A L, because to observe its effect on v R would mean we would have seen its effect on v L (constraint 2). Since {E L } is consistent with this hypothesis set, it remains a candidate. Next, θ deviates with a +-. The hypothesis set is {G + } since only G + can cause θ to deviate in that way. Since {E L } is not consistent with this hypothesis set, it is eliminated. We now have to expand our eliminated candidates to explain the observations. Since the hypothesis set {G + } eliminated {E L }, we form the new candidate {E L,G+ }. Since all measurements have deviated, we can be sure that this is our smallest minimal candidate. Since {E L,G+ } is distinguishable from all other double faults, the algorithm gives a unique result. We next consider a case where, although the signature is realizable for a single fault, can only be explained by a double fault. The signature(0-,0-,0-) is realizable for A R, DX'06 - Peñaranda de Duero, Burgos (Spain) 75

We next consider a case where a signature, although realizable for a single fault, can only be explained by a double fault. The signature (0-, 0-, 0-) is realizable for A_R; however, if v_R does not deviate first, it cannot be only A_R which has occurred. This signature is realizable for {A_L, A_R}, though, and we show how the algorithm derives this result. Table 5 summarizes the algorithm execution for this case. First, we see v_L deviate with 0-. Only A_L is consistent with v_L deviating first with this effect; thus the hypothesis set is {A_L}. Next, we observe v_R deviate with 0-. Given that θ has not yet deviated, {A_L, A_R} is the hypothesis set for the new observation. E_L is not included because to observe its effect on v_R would mean we would have seen its effect on v_L (constraint 2). Next, we see θ deviate with 0-. Only A_R can cause this (and not E_L, for the previous reason). Therefore {A_L} is eliminated, and we expand the candidate into {A_L, A_R}. Again, we have a unique result.

Observation    Hypothesis set    Candidates    Eliminated
1. v_L: -+     {E_L}             {E_L}
2. v_R: 0-     {A_R, E_L}        {E_L}
3. θ: +-       {G+}                            {E_L}
Apply (3)                        {E_L, G+}

Table 4: Algorithm execution, example 1

Observation    Hypothesis set    Candidates    Eliminated
1. v_L: 0-     {A_L}             {A_L}
2. v_R: 0-     {A_L, A_R}        {A_L}
3. θ: 0-       {A_R}                           {A_L}
Apply (3)                        {A_L, A_R}

Table 5: Algorithm execution, example 2

6 Conclusions

Multiple fault diagnosis in dynamical systems is complex due to fault masking, compensation, and the many ways multiple faults can manifest. We have presented here an approach to qualitative isolation of multiple faults as an extension of the TRANSCEND approach. We described a notion of multiple fault diagnosability defined over smallest minimal candidates, and presented an algorithm to isolate multiple faults based on this notion. We then discussed the 2-diagnosability analysis of a mobile robot system, and illustrated the algorithm on distinguishable double faults.

Future work will address the scalability of the approach to larger systems and explore conditions which satisfy n-diagnosability for a specific n. The notion of dealing with only the smallest l value and moving to the next l value may also be relaxed by taking into account a priori fault probabilities for the different component parameters, for which more efficient candidate generation strategies will be explored, such as conflict-directed A* [Williams and Ragno, to appear]. Exploring fault identification and fault-adaptive control in the presence of multiple faults is also an open area of research.

Acknowledgment

This work was supported in part by NSF CNS and NSF CNS.

References

[Daigle et al., 2005] M. Daigle, X. Koutsoukos, and G. Biswas. Relative measurement orderings in diagnosis of distributed physical systems. In 43rd Annual Allerton Conference on Communication, Control, and Computing, September 2005.

[Daigle et al., 2006] M. Daigle, X. Koutsoukos, and G. Biswas. Distributed diagnosis of coupled mobile robots. In Proceedings of the 2006 IEEE International Conference on Robotics and Automation, May 2006.

[de Kleer and Williams, 1987] J. de Kleer and B. C. Williams. Diagnosing multiple faults. Artificial Intelligence, 32:97-130, 1987.

[Gertler, 1998] J. Gertler. Fault Detection and Diagnosis in Engineering Systems. Marcel Dekker, New York, 1998.

[Karnopp et al., 2000] D. C. Karnopp, D. L. Margolis, and R. C. Rosenberg. System Dynamics: Modeling and Simulation of Mechatronic Systems. John Wiley & Sons, Inc., New York, 3rd edition, 2000.
[Manders et al., 2000] E.-J. Manders, S. Narasimhan, G. Biswas, and P. J. Mosterman. A combined qualitative/quantitative approach for fault isolation in continuous dynamic systems. In SafeProcess 2000, volume 1, Budapest, Hungary, June 2000.

[Mosterman and Biswas, 1999] P. J. Mosterman and G. Biswas. Diagnosis of continuous valued systems in transient operating regions. IEEE Transactions on Systems, Man and Cybernetics, Part A, 29(6), 1999.

[Ng, 1990] H. T. Ng. Model-based, multiple fault diagnosis of time-varying, continuous physical devices. In Sixth Conference on Artificial Intelligence Applications, volume 1, pages 9-15, May 1990.

[Reiter, 1987] R. Reiter. A theory of diagnosis from first principles. In Matthew L. Ginsberg, editor, Readings in Nonmonotonic Reasoning. Morgan Kaufmann, Los Altos, California, 1987.

[Subramanian and Mooney, 1996] S. Subramanian and R. J. Mooney. Qualitative multiple-fault diagnosis of continuous dynamic systems using behavioral modes. In Thirteenth National Conference on Artificial Intelligence (AAAI-96), August 1996.

[Williams and Ragno, to appear] B. C. Williams and R. Ragno. Conflict-directed A* and its role in model-based embedded systems. Journal of Discrete Applied Mathematics, Special Issue on Theory and Applications of Satisfiability Testing, to appear.

Improvement of Chronicle-based Monitoring using Temporal Focalization and Hierarchization

Christophe Dousson and Pierre Le Maigat
France Telecom R&D, 2 avenue Pierre Marzin, Lannion cedex, France
{christophe.dousson,pierre.lemaigat}@orange-ft.com

Abstract

This article falls under the problem of the symbolic monitoring of real-time complex systems such as telecommunications networks or video interpretation systems. Among the various techniques used for on-line monitoring, we are interested here in temporal scenario recognition. In order to reduce the complexity of the recognition and, consequently, to improve its performance, we explore two methods: the first one is focalization on particular events (in practice, uncommon ones) and the second one is factorization of common temporal scenarios in order to perform a hierarchical recognition. In this article, we present both concepts and merge them to perform a focalized hierarchical recognition. This approach merges and generalizes the two main approaches in symbolic recognition of temporal scenarios: the Store Totally Recognized Scenarios (STRS) approach and the Store Partially Recognized Scenarios (SPRS) approach.

1 Introduction

Symbolic scenario recognition arises in the monitoring of dynamic systems in many areas, such as telecommunications network supervision, gas turbine control, healthcare monitoring or automatic video interpretation (for an overview, refer to [Cordier and Dousson, 2000]). Such scenarios could be obtained, among other ways, from experts, by automatic learning [Fessant et al., 2004; Vautier et al., 2005] or by deriving a behavioral model of the system [Guerraz and Dousson, 2004].

Due to the symbolic nature of those scenarios, the engine performing the recognition is rarely directly connected to sensors. Often there is (at least) one dedicated calculus module which transforms the raw data sent by the system into symbolic events. Typically, this module can compute a numerical quantity and send symbolic events when the computed value reaches a given threshold. In cognitive vision, this module is usually a video-processing module which transforms images into symbolic data.

Often those scenarios are a combination of logical and temporal constraints. In those cases, a symbolic scenario recognition engine can process the scenarios uniformly as a set of constraints (like the Event Manager of ILOG/JRules, based on a modified RETE algorithm for processing time constraints [Berstel, 2006]) or separate the processing of temporal data from the rest, as in [Dousson, 2002]. This article mainly deals with this second approach, where temporal constraints are managed by a constraint graph between the relevant time points of the scenarios. By the way, there are two approaches for dealing with temporal constraints: STRS, which recognizes scenarios by an analysis of the past [Rota and Thonnat, 2000], and SPRS, which performs an analysis of the scenarios that can be recognized in the future [Ghallab, 1996].

Two main problems in the SPRS approach are the fact that scenarios have to be bounded in time in order to avoid never-ending expected scenarios (in practice, when working on real-time systems, it is difficult to exhibit a scenario which cannot be bounded in time), and the fact that an SPRS engine has to maintain all partially recognized scenarios, which possibly requires a large amount of memory space, in particular due to the combinatorial explosion in the case of multi-actors, i.e., scenarios with events involving variables.
To partially avoid those drawbacks, the implementation of the SPRS algorithms in [Dousson, 2002] introduces a clock and deadlines, which are used to garbage-collect the pending scenarios, and also introduces variable instantiation (or propagation) mechanisms. On the other hand, the main problem with STRS algorithms is to maintain all previously recognized scenarios. To our knowledge, no work has been published on how long such scenarios should be maintained.

A first attempt to take the benefits of both approaches was made in [Vu et al., 2003]. It consists of a hierarchization of the constraint graph of the scenario. It deals only with graphs where all information about time constraints can be retrieved from a path whose temporal instants can be totally ordered. The hierarchy constructs an imbricated sequence of scenarios containing only two events at a time. The principle of the recognition is, at any instant, to instantiate elementary scenarios and, when an event is integrated in a high-level scenario, to look for previously recognized elementary scenarios.

The purpose of this article is to generalize this method; the starting point will be an SPRS approach and the generalization mixes reasoning on the past and on the future. As a byproduct, the STRS and SPRS methods appear as two extreme kinds of focalized hierarchical recognition.

The next section presents the SPRS approach we use and details some aspects which are relevant to this paper.

Section 3 is dedicated to the temporal focalization used to focus the system on uncommon events. As already said, events could be not only basic events coming directly from the supervised system but also aggregated indicators. So the focalization could be used to control the computation of such indicators on particular temporal windows and to avoid useless computation, as such indicators could themselves be chronicles. Section 4 presents how the hierarchical recognition deals with common subpatterns of chronicles. Finally, we show that both concepts can be merged and experimentally lead to a good improvement of performance; this is the object of Section 5. We conclude in Section 6 with experiments on detecting naive servers in a Reflected Distributed Denial of Service (RDDoS) attack and on the detection of abnormal patterns in cardiac behavior [Carrault et al., 1999].

2 Chronicle Recognition System

Our approach is based on chronicle recognition as proposed in [Dousson, 2002], which falls in the field of SPRS methods. A chronicle is given by a time constraint graph labeled by predicates. This formalism is based on a reified logic (a chronicle is a conjunctive formula). In this article we choose to present the focalization in detail from the time constraint graph point of view. Variables and predicates other than events (persistency, noevent, ...) are not discussed in this paper, but an illustration is provided by the experiments in Section 6.

An instance of an event is a pair (e, t) where t is the date of the event and e is its type. When no ambiguity results, we sometimes do not distinguish between an event and its type. Figure 1 shows a chronicle (more precisely, a chronicle model) which contains four events: the event e (if instantiated) must occur between 2 and 3 units of time after an instantiation of f, and the event g must occur between 0 and 3 units of time after e and between 1 and 4 units of time after e′.

[Figure 1: The chronicle model C: f →[2,3] e; e →[0,3] g; e′ →[1,4] g]

2.1 Recognition Algorithms

Let CRS (Chronicle Recognition System) denote the recognition algorithm. Basically, the mechanism of CRS is, at each incoming event, to try to integrate it in all the pending (and partial) instances of the chronicle model and/or to create a new instance, calculating new forthcoming windows for all the forthcoming events of each instance. An instance of a chronicle model is then a partial instantiation of this model, and the forthcoming window fc(e) of a non-instantiated event e is the extended interval (an extended interval is defined as a union of intervals) where the occurrence of an event could lead to a recognition. (This does not imply that, for all non-instantiated events e, if (e, t) occurs with t ∈ fc(e), then the instance is recognized. This is a difference between the SPRS mechanism, where the integration of events is incremental, and some STRS mechanisms, where integration is made by block in a backward manner.)
2.2 From Clock to assert no more Event. In first implementations of chronicle recognition, a clock was introduced in order to discard impossible instances when the clock is out of one of the forthcoming window of an instance (in other words, when one missing event could never be received from the system in the forthcoming window). In order to take into account some jitter in data transmission a possible delay δ can be taken into account. This delay is equal to the maximum difference observed at reception between two events sending at the same time by the (possibly distributed) system 3. Basically, the event integration algorithm could be written as following: integrate((e, t)); setgarbageclock(t δ) The main drawback is that it implies that the events arrive roughly in a FIFO manner (the allowed jitter is bounded by δ): so, when the FIFO hypothesis should be relaxed (and it is often the case when monitored systems are distributed), δ should be increased and the garbage efficiency decreases. In order to avoid this, instead of a clock, we define the assertion: assertnomoreevent(e, I), where e is an event type and I an extended interval. It specifies to CRS that, from now, it will not receive events of type e with an occurence date in I. We do not describe here the algorithm since it is very similar to the CRS one but, intuitively, all the forthcoming windows of e are reduced (fc(e) \ I) and propagated according to the constraint graph of the chronicle. As in previous CRS, if a forthcoming window becomes empty, the instance difference between SPRS mechanism where integration of events is incremental and some STRS mechanism where integration is made by block in a backward manner. 3 In case of focalization, an other side effect of using clock is the creation of false instances. This will be explained in section DX'06 - Peñaranda de Duero, Burgos (Spain)

2.2 From Clock to assertNoMoreEvent

In the first implementations of chronicle recognition, a clock was introduced in order to discard impossible instances when the clock is outside one of the forthcoming windows of an instance (in other words, when a missing event could never be received from the system within its forthcoming window). In order to take into account some jitter in data transmission, a possible delay δ can be taken into account. This delay is equal to the maximum difference observed at reception between two events sent at the same time by the (possibly distributed) system. Basically, the event integration algorithm could be written as follows:

  integrate((e, t)); setGarbageClock(t - δ)

The main drawback is that this implies that the events arrive roughly in a FIFO manner (the allowed jitter is bounded by δ): so, when the FIFO hypothesis must be relaxed (and it is often the case when the monitored systems are distributed), δ must be increased and the garbage efficiency decreases. (In case of focalization, another side effect of using a clock is the creation of false instances; this is explained below.)

In order to avoid this, instead of a clock, we define the assertion assertNoMoreEvent(e, I), where e is an event type and I an extended interval. It specifies to CRS that, from now on, it will not receive events of type e with an occurrence date in I. We do not describe the algorithm here, since it is very similar to the CRS one, but, intuitively, the forthcoming windows of e are reduced to fc(e) \ I and propagated according to the constraint graph of the chronicle. As in the previous CRS, if a forthcoming window becomes empty, the instance is discarded. So, the previous behavior of CRS is given by the following:

  integrate((e, t)); ∀e′, assertNoMoreEvent(e′, ]-∞, t - δ])

As we allow I to be an extended interval, more complex garbage management can easily be implemented: I could be different from ]-∞, clock], and we could process assertNoMoreEvent(e, [10, 20] ∪ [30, 40]) and, later, assertNoMoreEvent(e, [0, 10]). This mechanism is implemented in CRS by streams (one for each event type) which manage the garbage window (gw) when receiving an assertNoMoreEvent message.

Integrate event (e, t):
  - for stream e do:
    - integrate (e, t) in the instances involving e
    - update gw(e)
    - update fc(e) and propagate
  - for each stream e′ ≠ e do:
    - update gw(e′)
    - update fc(e′) and propagate

This point of view slightly changes the manner in which the engine works: the progress of time is no longer driven by the supervised system but by an internal control; this control is given by the knowledge of the temporal behavior of the system. We will see in the next section how this new feature is used by the temporal focalization.

3 The Temporal Focalization

3.1 General Description

In some cases, not all events have the same status: event f could be very frequent and event e extremely uncommon. Due to this difference of frequency, the recognition of the chronicle could be impossible in practice; indeed, each event f potentially creates a new instance of the chronicle waiting for other events: if a thousand f arrive between 0 and 1, a thousand instances will be created. As event e is extremely uncommon, most of those instances would finally be destroyed. In CRS, the number of instance creations has a great impact on the performance of the recognition. In order to reduce this number, we focalize on event e in the following manner: when events f arrive, we store them in a collector, and we create new instances only when events e occur; then we search in the collector for the f that could be integrated in the instances. Potentially, the number of created instances will be the number of e and not the number of f.

So as not to be limited to uncommon events, we introduce a level for each event type. The principle of the temporal focalization is then the following:

  Begin the integration of events of level n+1 only when there exists an instance such that all events of level between 1 and n have been integrated.

For the example of Figure 2, if e and e′ are uncommon, f is frequent and g is very frequent, we define level(e) = level(e′) = 1, level(f) = 2 and level(g) = 3. The engine will integrate e at t = 3 and e′ at t = 5; then it will search in the collector for the events f between t = 0 and t = 1. The collector finds (f,1) and sends it to the engine, and this leads to the creation of the instance I_3. Technically, we made the choice of sending f not only to the instances waiting for events of level 2 but to all the pending instances. As the number of f sent from the collector to the engine is small, only a small number of instances are created. In our example, under the FIFO hypothesis, (f,1) cannot instantiate a new instance, as the last received event is e′ at t = 5.
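As a toy illustration of this principle (illustrative names and interfaces only; the actual architecture is described next), the following Python sketch parks high-level events in a collector and hands low-level ones directly to the engine:

LEVEL = {'e': 1, "e'": 1, 'f': 2, 'g': 3}

class Collector:
    # stores frequent (high-level) events until the engine asks for them
    def __init__(self):
        self.store = {}
    def collect(self, e, t):
        self.store.setdefault(e, []).append(t)
    def extract(self, e, lo, hi):
        # hand over (once and only once) the stored e's dated in [lo, hi]
        kept, found = [], []
        for t in self.store.get(e, []):
            (found if lo <= t <= hi else kept).append(t)
        self.store[e] = kept
        return found

class Engine:
    def integrate(self, e, t):
        print('CRS integrates', (e, t))

def route(e, t, engine, collector):
    if LEVEL[e] == 1:
        engine.integrate(e, t)   # uncommon, low-level events drive recognition
    else:
        collector.collect(e, t)  # frequent events wait in the collector

engine, collector = Engine(), Collector()
for e, t in [('f', 1), ('e', 3), ("e'", 5)]:
    route(e, t, engine, collector)
# once an instance has integrated all its level-1 events, the engine asks for
# f on its forthcoming window [0, 1] instead of hosting one instance per f:
for t in collector.extract('f', 0, 1):
    engine.integrate('f', t)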
In addition to the recognition engine, we develop several modules (see Figure 3): a module named Event Router, which routes the events coming from the supervised system either to the engine or to the collector; a module named Finished Level Detector, which detects when a particular instance of a chronicle has finished integrating all events of level less than n, then looks at the forthcoming windows of events of level n+1 and sends them to the collector (label '?f [a,b]'); and, finally, a module called No Need Window Calculator, which computes, for all current instances, the windows in which they do not need events of level ≥ 2 and sends this information to the collector (in the figure, the engine does not need f on ]-∞, x] ∪ [y, z]); this window is the intersection of the complements of the forthcoming windows of all instances of all chronicle models involving f.

The collector itself is split into collecting streams: one for each event type of level ≥ 2 (high-level events); in the previous example, there would be one collecting stream for f and one for g. A collecting stream manages three particular temporal windows: i) the assert no more window, anmw, which contains all the occurrence dates for which no more events will be received from now on (this window is used to discard pending instances); ii) the exclusion window, ew, which contains all the occurrence dates that the pending chronicles do not care about (this window is used to discard events from the collector); and iii) the focus window, fw, which contains all the occurrence dates for which an incoming event should be integrated immediately by the engine without being stored by the collector. The following subsections detail how these windows are updated.

[Figure 3: Architecture of the focalized recognition: incoming events go through the Event Router to the recognition engine (CRS) streams or to the collecting streams; the Finished Level Detector requests high-level events on their forthcoming windows ('?f [a,b]'), and the No Need Window Calculator tells the collector on which windows events are not needed]

3.2 The assert no more Window

When the Event Router module receives an event, it executes:

Route event (e, t):
  - send e either to the engine or to the collector
  - ∀e′ of level 1 do:
    - for stream e′, update and propagate gw(e′), fc(e′)
  - ∀f of level ≥ 2 do:
    - for collecting stream f, update anmw(f)

Thus, we do not update the garbage windows of the (engine) streams of high-level events. This operation will be done by message sending from the collector.

3.3 The Exclusion Window

There are two mechanisms for cleaning the collector. The first one is the emission from the recognition engine of a no need event message (for instance, no need f) on a particular window (which is the intersection of the complements of the forthcoming windows of all instances of all chronicle models involving f); then the collector simply cleans the corresponding events and updates the exclusion window ew(f) in the collecting stream f:

Exclude window (f, I):
  - remove all f in the extended interval I
  - ew(f) ← ew(f) ∪ I

The exclusion window expresses the fact that, if an event occurs in this window, the collector does not have to store it, as no chronicle is interested in the event:

Receive event (f, t):
  - if (t ∉ ew(f)) { store (f, t) }

The second case happens when the recognition engine asks for some events on a particular window; the collector sends the found collected events to all instances and cleans them (events in the collector are sent once and only once to all instances):

Extract event f on I:
  - ∀(f, t), t ∈ I, do:
    - send (f, t) to the recognition engine
    - T = I ∩ ]-∞, t] ∩ anmw(f)
    - ew(f) ← ew(f) ∪ T
    - Engine: assertNoMoreEvent(f, ew(f))
  - T = I ∩ anmw(f)
  - ew(f) ← ew(f) ∪ T
  - Engine: assertNoMoreEvent(f, ew(f))

In this algorithm, we assume that the collector sends the events f ordered by date. The extended interval T = I ∩ ]-∞, t] ∩ anmw(f) expresses the fact that, after receiving f at t, the engine will not receive any more f whose date is in I ∩ ]-∞, t], except if the collector can still receive some f before t. As the forthcoming window of an event is decreasing in the subset order, from ]-∞, +∞[ to ∅, the exclusion window is increasing for the same order. Note that the assert no more window and the exclusion window are different and, in general, do not include each other.

3.4 The Focus Window

There are (at least) three reasons for introducing the focus window: the main one is that, in addition to searching the past for some high-level events, we want to predict forthcoming windows for high-level events; the second one is when two events are not totally ordered, for example e →[a,b] f (in the constraint graph) with a < 0, b > 0 and level(f) = 2, level(e) = 1; the third one is when f →[a,b] e, with a, b ≥ 0, with e FIFO but f having a delay greater than b. When (f, t) arrives after e inside fw(f), it must be immediately sent to the engine:

Receive focused event (f, t):
  - send (f, t) to the recognition engine
  - T = fw(f) ∩ ]-∞, t] ∩ anmw(f)
  - ew(f) ← ew(f) ∪ T
  - Engine: assertNoMoreEvent(f, T)
  - fw(f) ← fw(f) \ T

Contrary to the assert no more window and the exclusion window, the focus window is neither increasing nor decreasing for the subset order.
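All of these updates are set operations on extended intervals. A compact Python sketch of such an algebra follows (a simplification that keeps closed endpoints only, whereas the actual windows mix open and closed bounds):

def union(u, v):
    # merge two extended intervals, coalescing overlaps
    out = []
    for lo, hi in sorted(u + v):
        if out and lo <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], hi))
        else:
            out.append((lo, hi))
    return out

def intersect(u, v):
    out = []
    for a0, a1 in u:
        for b0, b1 in v:
            lo, hi = max(a0, b0), min(a1, b1)
            if lo <= hi:
                out.append((lo, hi))
    return sorted(out)

def subtract(u, v):
    # u \ v, e.g. reducing fc(e) on an assertNoMoreEvent(e, I)
    out = list(u)
    for b0, b1 in v:
        nxt = []
        for a0, a1 in out:
            if a0 < b0:
                nxt.append((a0, min(a1, b0)))
            if a1 > b1:
                nxt.append((max(a0, b1), a1))
        out = nxt
    return out

INF = float('inf')
fc_e = [(3, 4)]
print(subtract(fc_e, [(-INF, 5)]))    # [] : instance I_1 can be discarded
print(union([(10, 20)], [(30, 40)]))  # [(10, 20), (30, 40)]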
3.5 Summary

Finally, the whole algorithm is composed of three blocks:

Main Algorithm:
  - ∀(e, t) incoming event:
    - Integrate event (e, t)
    - ∀f of level ≥ 2 do:
      - compute the exclusion window of f: I_f
      - Collector: Exclude window (f, I_f)

Finish Level Detection:
  - if (∃ an instance of C which finishes the integration for level n):
    - ∀f of level n+1 do:
      - I_f = forthcoming window of f
      - Collector:
        - fw(f) ← fw(f) ∪ I_f
        - Extract event (f, I_f)
        - fw(f) ← fw(f) \ ew(f)

Collector integrate (f, t):
  - if (t ∈ fw(f)) { Receive focused event (f, t) }
  - else { Receive event (f, t) }

3.6 Partial Order and Relative Frequencies

In Figure 1, if e is much more frequent than f, there is no need to level the chronicle: if no f has arrived, no e will initiate new instances, and thus the number of instances will be the number of f. So, relative to another event e, a very frequent event f should be leveled only if e ⊀ f (in the partial order induced by the constraints), in particular when e and f are not comparable, for example e →[-1,2] f. We can decompose the instances into two categories: the instances where f is before e, for which the leveling will be particularly efficient, and the other part, where f is after e; in this case the event f should be directly sent to the recognition engine, which is done by the use of the focus window.

Another case where focalization is necessary: suppose a chronicle made of p+1 events, an event a and p events e_1, ..., e_p, where e_i ≺ a. Suppose that the relative frequencies of all the events are similar, but that the sum of the frequencies of the e_i is much greater than that of a. The previous mechanism could be optimized using a hierarchical structure. Note that this is also necessary when more than one occurrence of f is required for the recognition (see Section 6).

4 Hierarchical Recognition of Chronicles

4.1 Definition

A hierarchical chronicle is a pair (C, {h_1, ..., h_n}) where C is the base chronicle and the h_i are the sub-chronicles; we assume that the event types involved in C can take values in the set of sub-chronicle labels {h_1, ..., h_n}. We treat only deterministic hierarchical chronicles, i.e., we do not allow two sub-chronicles to have the same label, so, in the following, we make no distinction between a chronicle and its label. (There is no technical difficulty in considering non-deterministic hierarchical chronicles; the main difference is that expansion, see below, leads to more than one chronicle. This is a possible way to introduce disjunction in the chronicle formalism.) Moreover, we suppose that each h_i has a distinguished event b_i.

The hierarchical chronicle C will be recognized if it is (classically) recognized where the integrated events labeled by a sub-chronicle h_i have the date of an integrated event b_i in a recognized instance of h_i. In other words, when a sub-chronicle h_i is recognized, it sends to all chronicles the event (h_i, date of b_i). In Figure 4, the hierarchical chronicle H = (C, {h, k}) possesses two sub-chronicles h and k; the chronicle k is formed with two types of events: a basic one, f (which comes from the supervised system), and one representing the sub-chronicle h. The distinguished events are respectively e and h. For a log containing (e,0)(g,1)(e,3), an instance of h would be recognized, so an event (h,3) would be sent to the other chronicles. If the engine then receives (f,4), an instance of k is recognized and produces the event (k,3). So, for the log (e,0)(g,1)(e,3)(e,4)(g,4)(f,4)(e,6), C is recognized with the events (h,3)(k,3)(e,4)(h,6) (the event (h,6) is produced by (e,3)(g,4)(e,6)).

[Figure 4: The hierarchical chronicle (C, {h, k})]

[Figure 5: The expansion of the hierarchical chronicle C]

Let C be a chronicle with constraint graph G and with an event h, where h is a sub-chronicle with distinguished element b and with constraint graph V. Expanding the chronicle
C (or its graph) is replacing the node h G by the graph V, specifying that the constraints between b and G\V are the constraints of h G. Let the relation h i h j be defined if h j contains the event type h i. The hierarchical chronicle is structurally consistent if i, h i h i. In this case the expansion of a hierarchical chronicle is well defined (the graph of the figure 5 is the expansion of (C, {h, k})). A hierarchical chronicle C is consistent if there exists a set of events s.t. C is recognized. It is straightforward that a (structurally consistent) hierarchical chronicle is consistent iff when expanding sub-chronicles the obtained constraint graph is consistent. 4.2 Condition for Hierarchizing a Chronicle Hierarchical chronicles can come from two ways: the first one is that the chronicle is initially specified in a hierarchical manner, for example if the architecture of the system is itself hierarchical; the second one is, starting from a flat chronicle, we identify identical patterns inside this chronicle. The next proposition is a necessary and sufficient condition to the hierarchization of a (temporal) pattern and shows we have to take care in pattern factorization: two subgraphs could be identical (with same constraints) but one of them can satisfy the condition and the other not. Proposition 4.1 Let G be a minimal constraint graph and U G a subgraph. Let a U a (distinguished) node and H the hierarchical chronicle defined by G =(G\U) {h} and h = U, where h G has the same constraints as node a. Then H and G recognize the same events iff b U, c G\U, D bc = D ba + D ac (1) where D xy are the time constraints between G s nodes. Proof: It suffices to prove that the minimal graph of the expansion graph, G H, of H coincide with G. This is immediate by noting that when computing the minimal graph g e f g DX'06 - Peñaranda de Duero, Burgos (Spain) 81

of G_H, the constraints of G_H are not reduced, all added constraints are between U and (G\U), and those constraints are redundant.

4.3 Recognitions

In the architecture of the hierarchical engine, instead of using one recognition engine per sub-chronicle, we put all of them together in the same engine, and the streams accept events coming from the system and from recognized instances of sub-chronicles. For example, for the chronicle of figure 4, the stream corresponding to the high-level event type h (see figure 6) accepts events coming from instances of chronicle model h (bold arrow) and is integrated into instances of chronicle models k and C (thin arrows).

Figure 6: Streams for the hierarchical chronicle of figure 4; bold arrows correspond to the feeding of event streams (by incoming events or by events produced by recognized chronicles) and thin arrows correspond to the integration of events by the recognition engine.

5 Focalized Hierarchical Recognition of Chronicles

In this section, we present how focalization and hierarchization can be mixed. In the architecture of the collector, we need to add a module which transforms the exclusion and focus windows of high-level events into those of the different collecting streams. The assert no more window should also be adapted.

Updating the assert no more window. When receiving an event, we update the garbage window of all streams except those corresponding to high-level events, and we update the assert no more window of the collecting streams. For example, when receiving f, we update stream e and collecting streams e, g and f.

Updating the exclusion window. When the collector receives a message sent by the recognition engine, saying it does not need the high-level event f on the extended interval, the collector needs to update the exclusion window of all events in the chronicle f, but it also needs to take into account the exclusion windows of all other high-level events. So, the exclusion window is given by:

ew(e) = Intersection over {f_i containing e} of ew(f_i) / D_{e,b_i}

where D_{e,b_i} is the time constraint from e to the distinguished node b_i of f_i, and where [u,v]/[a,b] = [u-a, v-b] if u-a <= v-b and [u,v]/[a,b] is empty otherwise, with the natural extension to sets of extended intervals.

Updating the focus window. The focus window of the collecting stream of an event e involved in high-level events f_1, ..., f_n is given by the union over i of fw(f_i) + D_{b_i,e}, where D_{b_i,e} is the time constraint from the distinguished node b_i of f_i to e.

Recognition. From the structural point of view, recognition follows the same principle as hierarchical recognition, but with communications (in both directions) between collecting streams and instances. For our example in figure 4, with level(h) = 2 and level(k) = 3, the communications are described in figure 7: events of type e are sent to stream e (and then to C) and to collecting stream e (bold arrows). Collecting streams e and g send required events to instances of h (thin arrows). If h is recognized, an event h is sent to stream h (and then to C) and to collecting stream h, and so on.

Figure 7: Communications between streams, collecting streams and instances.
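As a small worked illustration of the window arithmetic used in the exclusion-window update above, the following Python sketch implements the operation [u,v]/[a,b] and the combination over the chronicles f_i; representing windows as (lo, hi) pairs, and the function names, are our assumptions.

    def shift(window, constraint):
        """[u, v] / [a, b] = [u - a, v - b] if u - a <= v - b, empty otherwise."""
        (u, v), (a, b) = window, constraint
        lo, hi = u - a, v - b
        return (lo, hi) if lo <= hi else None

    def exclusion_window(windows_and_constraints):
        """Intersect ew(f_i) / D_{e,b_i} over all high-level events f_i containing e."""
        shifted = [shift(w, d) for (w, d) in windows_and_constraints]
        if any(s is None for s in shifted):
            return None                    # one empty operand empties the result
        lo = max(s[0] for s in shifted)    # intersection of intervals
        hi = min(s[1] for s in shifted)
        return (lo, hi) if lo <= hi else None

    # e appears in two chronicles; its exclusion window is the intersection
    # of their shifted exclusion windows:
    print(exclusion_window([((10, 30), (0, 5)), ((12, 40), (2, 3))]))  # -> (10, 25)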
Remark: An example of the gain of focalization over the technique presented in [Vu et al., 2003] is that for subgraphs with [0,0] constraints (co-occurrence), we do not have to explore combinations of partial instantiations but only store events and wait for an event in the future in order to extract from the collector the events with the right parameters; thus all the combinatorial explosion of elementary scenarios is avoided.

6 Experimentation

To summarize, the first example shows a major performance gain mainly obtained by the focalization technique, and the gain in the second example is principally due to the hierarchization. These experiments were run under Mac OS X on a 2 x 1 GHz PPC G4 with 1 GB SDRAM. The algorithms are implemented in Java (JRE 1.4).

6.1 Reflected Distributed DoS Attack

In an RDDoS attack, machines with the spoofed IP address of the victim send SYN messages to naive servers. Then, those

servers will reply by sending SYN/ACK messages to the victim, generating massive flooding. A characteristic of the attack is that the SYN traffic to a naive server is low, persistent and, taken alone, does not trigger an alarm. We want to identify the naive servers. In our experiment, information on the global traffic is sent up by some core routers. Preprocessing is done by a particular numerical algorithm which computes the throughput between pairs of IP addresses and sends alarms when this throughput exceeds one of two thresholds: a low one (L events) and a high one (H events). So CRS receives two kinds of events, H[ip_dest] and L[ip_src, ip_dest], where the variables are the IP addresses of the source and the destination of the throughput. The leveled hierarchical chronicle is (the flat one is obtained using expansion):

Chronicle RDDoS[ip_naif]:
  (level 2) SynLow[ip_dest, ip_naif], ts
  (level 1) H[ip_dest], ta
  (level 1) H[ip_dest], tb
  ts [0,120] ta [0,60] tb

Chronicle SynLow[ip_src, ip_dest]:
  L[ip_src, ip_dest], t1
  L[ip_src, ip_dest], t2
  L[ip_src, ip_dest], t3 (distinguished event)
  t1 [0,120] t3, t1 < t2 < t3

We do not discuss here the choice of the thresholds nor the relevance of the chronicle; our aim is to present the performance of focalization in the face of huge amounts of data. The log contains events of both kinds (L and H); its period is 6 min, and the frequencies are 660 L/sec and 3 H/sec. Due to a lack of synchronization between routers, the delay is set to 60 s. In figure 8, we compare the processing times of 5 engines: on the expanded (flat) chronicle, the previous CRS with clock and CRS with garbage window, and on the corresponding hierarchical chronicle, the focalized hierarchical (fh) engine. We also compare to an optimal log where all H events are sent before the L events (only possible on a fixed log). In order to vary the complexity, we change the number of L events in the sub-chronicle SynLow (from 1 L to 5 L).

Figure 8: Comparison of process time for the RDDoS log.

For the chronicle with 3 L events, the results are summarized in the following table:

              processing time   created instances   maximal collector's size
CRS+c         17 min                                n.a.
CRS+gw        14 min                                n.a.
fh-CRS+c      3 min
fh-CRS+gw     7 s
opt. CRS+gw   34 s                                  n.a.

So the number of created instances is considerably reduced when using focalization. The difference between the number of instances when using the clock (3550) and when using the garbage window (1840) is explained on this example: for the leveled chronicle C of figure 1, when f arrives at t = 1, we store it and the clock is set to 1; but when (e,3) arrives, the clock cannot be set to 3, otherwise all current instances would be discarded; so, even under the FIFO hypothesis, it is necessary to set the delay to 3 (the upper bound of the constraint between f and e). By this artifact, a (possibly large) number of false instances is uselessly created. The introduction of this delay also has an impact on the collector's size.

6.2 Mutualisation of Cardiac Patterns

Our second experiment concerns the identification of cardiac patterns [Carrault et al., 1999] (5). The log contains 3600 events of two types, p_wave[?x] and qrs[?x], with ?x in {normal, abnormal}. These two events are extracted from electrocardiograms (ECG) and represent two characteristic events in the heart cycle: the arrival of a P wave and of a QRS complex.
The leveled hierarchical chronicle we use is:

Chronicle test[]:
  (level 1) qrs[abnormal], t1
  (level 2) pattern2[normal], t2
  (level 2) pattern1[normal], t3
  (level 2) pattern1[abnormal], t4
  t1 < t2 < t3 [0,600] t4
  t1 cycle_k t4

Chronicle pattern1[?x]:
  noevent (qrs[*], ts, t1)
  p_wave[], t1 (distinguished event)
  ts [0,600] t1

where the constraint cycle_k is equal to [0, k*2000].

(5) We wish to acknowledge the authors for providing the cardiac chronicles and ECG data.

In the sub-chronicle pattern1, we use the predicate noevent, which means that we do not want an event qrs between ts and t1. The sub-chronicle pattern2 is almost the same as pattern1. In this experiment the frequencies of the events are similar; indeed, the log contains 1987 qrs and 1550 p_wave events. So, when the number of considered cycles k increases, the combinatorics increase greatly and the gain obtained by the hierarchization of common patterns becomes important. The results are presented in figure 9; the triangles represent the number of created instances (right-hand scale). As we said in the introduction, we can see an immediate correlation between this number and the performance of the recognition.

Figure 9: Comparison of process time for the cardiac log.

7 Conclusion

This paper presents two improvements of chronicle recognition. The first one is the focalization on particular events, which allows the system to reason on the past and on the future in a homogeneous manner. The second concerns hierarchical recognition based on subpatterns. We also showed that mixing both improvements increases the efficiency and also fills the gap between the SPRS and STRS approaches, which are completely covered. In practice, our approach is sufficiently adaptive to fine-tune the recognition system. For instance, the order of event integration can differ from the arrival order (for instance, to take into account the relative frequencies). Moreover, leveled events can postpone some expensive numerical computation needed to generate events (by assigning a high level to this kind of event) and avoid useless computation.

Future work will take two directions: the first one is to define a more flexible way to factorize common patterns, since condition (1) (section 4.2) is too restrictive for many cases. As dealing with event frequencies substantially increases the efficiency of CRS, the second direction will focus on online analysis of event frequencies in order to adapt the hierarchical focalization dynamically.

References

[Berstel, 2006] Bruno Berstel. Extending the RETE algorithm for event management. Ninth International Symposium on Temporal Representation and Reasoning (TIME'02), pages 49-51, July 2002. IEEE.

[Carrault et al., 1999] G. Carrault, M.O. Cordier, R. Quiniou, M. Garreau, J.J. Bellanger, and A. Bardou. A model-based approach for learning to identify cardiac arrhythmias. Artificial Intelligence in Medicine and Medical Decision Making, 1620, W. Horn et al., editors, 1999.

[Cordier and Dousson, 2000] M.O. Cordier and C. Dousson. Alarm driven monitoring based on chronicles. In Proc. of the 4th Symposium on Fault Detection Supervision and Safety for Technical Processes (SAFEPROCESS), Budapest, Hungary, June 2000. IFAC, A.M. Eldemayer.

[Dousson, 2002] C. Dousson. Extending and unifying chronicle representation with event counters. In Proc. of the 15th ECAI, Lyon, France, July 2002. F. van Harmelen (ed.), IOS Press.

[Fessant et al., 2004] F. Fessant, C. Dousson, and F. Clérot. Mining of a telecommunication alarm log to improve the discovery of frequent patterns. 4th Industrial Conference on Data Mining (ICDM'04), July 2004.

[Ghallab, 1996] M. Ghallab. On chronicles: representation, on-line recognition and learning. Proc. of the 5th International Conference on Principles of Knowledge Representation and Reasoning (KR-96), November 1996. Morgan Kaufmann.

[Guerraz and Dousson, 2004] B. Guerraz and C. Dousson. Chronicles construction starting from the fault model of the system to diagnose. 15th International Workshop on Principles of Diagnosis (DX'04), pages 51-56, June 2004.

[Rota and Thonnat, 2000] Nathanaël Rota and Monique Thonnat. Activity recognition from video sequences using declarative models. 14th ECAI, August 2000. W. Horn (ed.), IOS Press.

[Vautier et al., 2005] Alexandre Vautier, Marie-Odile Cordier, and René Quiniou. An inductive database for mining temporal patterns in event sequences. Workshop on mining spatio-temporal data (in PKDD05), October 2005.

[Vu et al., 2003] Van-Thin Vu, François Bremond, and Monique Thonnat. Automatic video interpretation: a novel algorithm for temporal scenario recognition. 18th IJCAI, August 2003.

Model-based Test Generation for Embedded Software

M. Esser (1), P. Struss (1,2)
(1) Technische Universität München, Boltzmannstr. 3, D-Garching, Germany
(2) OCC'M Software, Gleissentalstr. 22, D-Deisenhofen, Germany
{esser, struss}@in.tum.de, struss@occm.de

Abstract

Testing embedded software systems on the control units of vehicles is a safety-relevant task, and developing the test suites for performing the tests on test benches is time-consuming. We present the foundations and results of a case study to automate the generation of tests for the control software of vehicle control units, based on a specification of requirements in terms of finite state machines. This case study builds upon our previous work on the generation of tests for physical systems based on relational behavior models. In order to apply the respective algorithms, the finite state machine representation is transformed into a relational model. We present the transformation, the application of the test generation algorithm to a real example, and discuss the results and some specific challenges regarding software testing.

1 Introduction

Over the last decade or so, cars have become a kind of mobile software platform. There are tens of processors (Electronic Control Units, ECUs) on board a vehicle; they communicate with each other via several bus systems, and software has a major influence on the performance and safety of a vehicle. The software embedded in the mechanical, electrical, pneumatic, and hydraulic car subsystems becomes increasingly complex, and it comes in many variants, reflecting the context of different types of vehicles, the manufacturer-specific physical realization, versions over time, etc.

Testing such embedded software becomes increasingly challenging and has been moving away from test drives under various conditions to automated tests performed on test benches which can partly or totally simulate the car as a physical system. But for the reasons stated above, namely the complexity of the software and its variation, generating the test suites becomes demanding and time-consuming and calls for computer support. Automating the generation of such tests, based on a specification of the desired behavior of the software together with the physical system, promises benefits regarding both the required effort and the completeness of the result.

In [Struss 94, 94a], we presented the theoretical and technical foundations for automated test generation for physical systems based on models of their (nominal and faulty) behavior. Such behaviors are represented as (finite) relations over system variables which characterize the possible states under different modes of behavior. On this basis, tests can be computed as sets of stimuli that trigger disjoint projections of the behavior relations to the space of observables.

An extension of this approach to also cover software would be highly beneficial, because it would provide a coherent solution to testing both physical systems and their embedded software. More concretely, the software test could start from a specification of the intended behavior of the entire system (including physical components and software), and the tests could also reflect the particular nature of the embedded software, namely by using stimuli and observations of the physical system rather than of the software system directly.
The case study described in this paper concerns a real-life example (the measurement and computation of the fuel level in a vehicle tank) based on the requirement specification document of a car manufacturer.

We continue by summarizing the basis of our relation-based implementation of test generation. In order to extend it to software, the requirement specification has to be turned into a relational representation. In the respective document, the skeleton of this specification is provided in a state-chart manner. Therefore, section 3 of this paper proposes a behavior specification as a special finite state machine, and section 4 presents the transformation into a relational representation. A major challenge in the application of the test generation algorithm to software is to provide relevant and appropriate fault models against which the software should be tested (section 5). The final sections present the results of the case study and discuss problems and insights.

2 The Background: Model-based Test Generation

In the most general way, testing aims at finding out which hypothesis out of a set H is correct (if any) by stimulating a system in such a way that the available observations of the system responses to the stimuli refute all but one hypothesis (or even all of them). This is captured by the following definition.

Definition (Discriminating Test Input)

Let TI = {ti} be the set of possible test inputs (stimuli), OBS = {obs} the set of possible observations (system responses), and H = {h_i} a set of hypotheses. ti in TI is called a definitely discriminating test input for H if

(i) for all h_i in H there exists obs in OBS such that ti AND h_i AND obs is consistent, and
(ii) for all h_i in H and all obs in OBS: if ti AND h_i AND obs is consistent, then for all h_j != h_i, ti AND h_j AND obs is inconsistent.

ti is a possibly discriminating test input if
(ii') for all h_i in H there exists obs in OBS such that ti AND h_i AND obs is consistent and, for all h_j != h_i, ti AND h_j AND obs is inconsistent.

In this definition, condition (i) expresses that there exists an observable system response for each hypothesis under the test input. It also implies that test inputs are consistent with all hypotheses, i.e. we are able to apply the stimulus, because it is causally independent of the hypotheses. Condition (ii) formulates the requirement that the resulting observation guarantees that at most one hypothesis will not be refuted, while (ii') states that each hypothesis may generate an observation that refutes all others.

While testing for fault identification has to discriminate between each single pair of hypotheses (if possible), testing for confirming (or refuting) a particular hypothesis h_0 requires only discrimination between h_0 and any other hypothesis. Usually, one stimulus is not enough to perform the discrimination task, which motivates the following definition.

Definition (Confirming Test Input Set)
A subset TI' of TI is called a discriminating test input set for H = {h_i} and h_0 in H if, for all h_j with h_0 != h_j, there exists ti_k in TI' such that ti_k is a discriminating test input for {h_0, h_j}. It is called definitely confirming if all ti_k are definitely discriminating, and possibly confirming otherwise. It is called minimal if it has no proper subset TI'' of TI' which is discriminating.

Remark: Refutation of all hypotheses h_j != h_0 implies h_0 only if we assume that the set H is complete, i.e. the disjunction of all h_i holds.

Such logical characterizations (see also [McIlraith-Reiter 92]) are too general to serve as a basis for the development of an appropriate representation and algorithms for test generation. Here (in test generation for physical systems), the hypotheses correspond to assumptions about the correct or possibly faulty behavior of the system to be tested. They are usually given by equations and implemented by constraints, and test inputs and observations can be described as value assignments to system variables.

The system behavior is assumed to be characterized by a vector v_S = (v_1, v_2, v_3, ..., v_n) of system variables with domains DOM(v_S) = DOM(v_1) x DOM(v_2) x ... x DOM(v_n). Then a hypothesis h_i in H is given as a relation R_i, a subset of DOM(v_S). For conformity testing, h_0 is given by R_0 = R_OK, the model of correct behavior. Observations are value assignments to a subvector of the variables, v_obs, and the stimuli are described by assigning values to a vector v_cause of susceptible ("causal" or "input") variables. We make the reasonable assumption that we always know the applied stimulus, which means the causal variables are a subvector of the observable ones: v_cause is contained in v_obs.

Figure 1: Behavior description of a correct and an open resistor.

Example: To illustrate and motivate the following formal treatment, let us consider a trivial task: generate tests that discriminate an open from a correct resistor. The obvious proposed test is to apply a voltage drop which is guaranteed to be non-zero and check whether or not we observe a non-zero current. Fig. 1 displays descriptions of the behavior under each mode in qualitative terms for the stimulus (a pair of voltages) and the observable current.
They capture the information that only applying (qualitatively) different voltages guarantees a non-zero flow in the ok mode, which can be distinguished from the zero current enforced by an open resistor, and hence specifies a definitely discriminating test input.

As suggested by the example, the basic idea underlying model-based test generation ([Struss 94]) is that test inputs are constructed by computing them from the observable differences of the relations that represent the various hypotheses. Fig. 2 illustrates this. Firstly, for testing, only the observables matter. Accordingly, Fig. 2 presents only the projections, p_obs(R_i) and p_obs(R_j), of two relations, R_1 and R_2 (possibly defined over a large set of variables), to the observable variables. The vertical axis represents the causal variables, whereas the horizontal axis shows the other observable variables (representing the observable system response). To construct a (definitely) discriminating test input, we have to avoid stimuli that can lead to the same observable system response for both relations, i.e. stimuli that may lead to an observation in the intersection of p_obs(R_i) and p_obs(R_j), shaded in Fig. 2. In our example, this intersection contains only current = 0. These test inputs are computed by projecting the intersection to the causal variables: p_cause(p_obs(R_i) INTERSECT p_obs(R_j)). This yields the pairs of equal voltages in the example. The complement of this (i.e. the pairs of unequal voltages) is the complete set of all test inputs that are guaranteed to

produce different system responses under the two hypotheses:

DTI_ij = DOM(v_cause) \ p_cause(p_obs(R_i) INTERSECT p_obs(R_j)).

Figure 2: Determining the inputs that do not, possibly, and definitely discriminate between R_1 and R_2.

Lemma 1 If h_i = R_i, h_j = R_j, TI = DOM(v_cause), and OBS = DOM(v_obs), then DTI_ij is the set of definitely discriminating test inputs for {h_i, h_j}.

Please note that we assume that the projections of R_i and R_j cover the entire domain of the causal variables, which corresponds to condition (i) in the definition of the test input. We only mention the fact that, when applying tests in practice, one may have to avoid certain stimuli because they carry the risk of damaging or destroying the system, or of creating catastrophic effects, as long as certain faults have not been ruled out. In this case, the admissible test inputs are given by some set R_adm, a subset of DOM(v_cause), and we obtain

DTI_adm,ij = R_adm \ p_cause(p_obs(R_i) INTERSECT p_obs(R_j)).

In a similar way as DTI_ij, we can compute the set of test inputs that are guaranteed to create indistinguishable observable responses under both hypotheses, i.e. they cannot produce observations in the difference of the relations: (p_obs(R_i) \ p_obs(R_j)) UNION (p_obs(R_j) \ p_obs(R_i)). Then the non-discriminating test inputs are

NTI_ij = DOM(v_cause) \ p_cause((p_obs(R_j) \ p_obs(R_i)) UNION (p_obs(R_i) \ p_obs(R_j))).

All other test inputs may or may not lead to discrimination.

Lemma 2 The set of all possibly discriminating test inputs for a pair of hypotheses {h_i, h_j} is given by PTI_ij = DOM(v_cause) \ (NTI_ij UNION DTI_ij).

The sets DTI_ij for all pairs {h_i, h_j} provide the space for constructing (minimal) discriminating test input sets.

Lemma 3 The (minimal) hitting sets of the set {DTI_0j} are the (minimal) definitely confirming test input sets for H, h_0.

A hitting set of a set of sets {A_i} is defined by having a non-empty intersection with each A_i. (Please note that Lemma 3 has only the purpose of characterizing all discriminating test input sets. Since we need only one test input to perform the test, we are not bothered by the complexity of computing all hitting sets.) This way, the number of tests constructed can be less than the number of hypotheses different from h_0. If the tests have a fixed cost associated with them, then the cheapest test set can be found among the minimal sets. However, it is worth noting that the test input sets are the minimal ones that guarantee the discrimination of h_0 from the hypotheses in H. In practice, only a subset of the tests may have to be executed, because some of them refute more hypotheses than guaranteed (because they are a possibly discriminating test for some other pair of hypotheses) and render other tests unnecessary.

The algorithm has been implemented based on software components of OCC'M's Raz'r ([OCC'M 05]), which provide a representation of, and operations on, relations as ordered multiple decision diagrams (OMDD). The input is given by constraint models of correct and faulty behavior of components, taken from a library, which are aggregated according to a structural description. Finally, we mention that probabilities (of hypotheses and observations) can be used to optimize test sets ([Struss 94a], [Vatcheva-de Jong-Mars 02]).
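To make the projection-based computation of DTI concrete, here is a small Python sketch on explicit finite relations (the actual implementation uses OMDDs, so this is an illustration only). Relations are sets of rows mapping variable names to values; the qualitative resistor encoding below is our assumption.

    from itertools import product

    def project(rel, variables):
        return {tuple(row[v] for v in variables) for row in rel}

    def dti(R_i, R_j, obs_vars, cause_vars, cause_domains):
        # stimuli that may yield an observation consistent with both hypotheses
        common = project(R_i, obs_vars) & project(R_j, obs_vars)
        idx = [obs_vars.index(v) for v in cause_vars]
        blocked = {tuple(row[k] for k in idx) for row in common}
        # DTI_ij = DOM(v_cause) \ p_cause(p_obs(R_i) INTERSECT p_obs(R_j))
        all_inputs = set(product(*(cause_domains[v] for v in cause_vars)))
        return all_inputs - blocked

    # The resistor example of Fig. 1 in qualitative terms: the stimulus is a
    # pair of voltages (v1, v2), the observable response is the current i.
    R_ok   = [{'v1': a, 'v2': b, 'i': 'nonzero' if a != b else 'zero'}
              for a in ('lo', 'hi') for b in ('lo', 'hi')]
    R_open = [{'v1': a, 'v2': b, 'i': 'zero'}
              for a in ('lo', 'hi') for b in ('lo', 'hi')]

    print(dti(R_ok, R_open, ['v1', 'v2', 'i'], ['v1', 'v2'],
              {'v1': ('lo', 'hi'), 'v2': ('lo', 'hi')}))
    # -> {('lo', 'hi'), ('hi', 'lo')}: only unequal voltages discriminate definitely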
3 State Charts for Specification of Software Requirements

State charts and finite state machines (FSM) are frequently used in specifications of software requirements. Figure 3 shows an FSM extracted from a requirement specification produced by an automotive manufacturer. The machine describes a process to detect the refueling of a passenger car: if the car stops for more than 8 seconds and a remarkably higher tank filling is detected, then the software sets the output flag RFD (ReFilling Detected) to true. Otherwise RFD is always false.

Let us define the used type of FSM in a more formal way: an automaton m_a = (E, (I,O,L), (S,A), T, s_0, l_0) is described by

- the set E of events e_1, ..., e_ne,
- the ordered set I of input variables i_1, ..., i_ni,
- the ordered set O of output variables o_1, ..., o_no,
- the ordered set L of local variables l_1, ..., l_nl,
- the set S of control states s_1, ..., s_ns,
- the set A of state expressions a_1, ..., a_ns, defining a relation delta_{a,i} contained in dom(I) x dom(L) x dom(O) x dom(L) for each state s_i,
- the set T of transitions T_1, ..., T_nt with T_i in S x P(dom(E) x dom(I) x dom(L)) x S, where P(X) denotes the power set of X,
- the initial control state s_0 and the vector l_0 of initial values of L.

Each machine has a special local variable l_1 called stime, indicating the time elapsed since the machine entered the current control state. It is special because each time the control state is switched, the variable is reset automatically. Every variable v in I, O or L has a finite domain dom(v).
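A minimal Python sketch of this FSM tuple could look as follows. The two-state machine is a hypothetical toy, not the RFD machine of Figure 3; modeling state expressions as functions from (inputs, locals) to (outputs, new locals), and IF conditions as predicates, are our assumptions.

    toy_fsm = {
        'E':  ['tick', 'go'],
        'I':  ['x'],
        'O':  ['y'],
        'L':  ['stime'],
        'S':  ['s1', 's2'],
        'A':  {  # state expressions a_i: (inputs, locals) -> (outputs, new locals)
            's1': lambda i, l: ({'y': 0}, {'stime': l['stime'] + i['x']}),
            's2': lambda i, l: ({'y': 1}, {'stime': l['stime'] + i['x']}),
        },
        'T':  [  # transitions (s_src, IF, s_dst); IF tests (event, inputs, locals)
            ('s1', lambda e, i, l: e == 'go' and l['stime'] >= 8, 's2'),
        ],
        's0': 's1',
        'l0': {'stime': 0},
    }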

Figure 3: FSM describing the refilling detection in a passenger car.

With the inputs (i_1, ..., i_ni) and the events (e_1, ..., e_ne), the machine produces the outputs (o_1, ..., o_no) according to the following operating sequence:

1. Set t = 0.
2. Evaluate the state expression a_i of the current state s^t = s_i to calculate the new values of the output and local variables: (i^{t+1}, l^t, o^{t+1}, l^{t+1}) in delta_{a,i}.
3. If T contains a transition T_i = (s_src, IF, s_dst) with s_src = s^t and (e^{t+1}, i^{t+1}, l^{t+1}) in IF, then set s^{t+1} = s_dst; otherwise set s^{t+1} = s^t.
4. If s^{t+1} != s^t, then reset stime.
5. Set t = t + 1.
6. Jump to step 2.

In our example, the FSM has two input variables, car_moves and time, one output variable, RFD, stime as the only local variable, and the events nothing, car starts moving, car stops and increased tank filling. The variable time is set according to the time elapsed since the previous event occurred. Its value is always added to the stime variable, which can be used in a precondition of a transition.

Depending on the chosen set of input variables I and events E, the test generation system needs more information in order to produce meaningful tests, because the values of some variables might depend on the occurrence of an event. E.g., if car_moves^t = true, then the event car starts moving cannot occur next. In our example, the following rules are necessary:

car_moves^t = false AND car_moves^{t+1} = true <=> e^t = car starts moving
car_moves^t = true AND car_moves^{t+1} = false <=> e^t = car stops

In the next section, we describe how the FSM is transformed into a relational representation.

4 Transformation of a FSM into a Compositional, Relational Representation

The conversion of an FSM of the described type produces a compositional model, i.e. a model that preserves the structure and the elements of the FSM. As a consequence, a modification of one part of the FSM results in the modification of only one part of the compositional model (as it will turn out, this is not fully accomplished, for fundamental reasons). The compositional model also provides the possibility of relating defects to the various elements (and also of recording and tracing their effects, e.g. in diagnosis).

The basic step is the transformation of the entire FSM into a component C1Step and its internal structure (Figure 4). C1Step takes the state s^t, the values of the local variables l^t, the input vector i^{t+1}, and the event e^{t+1}, and generates the subsequent state s^{t+1}, the new values of the local variables l^{t+1}, and the output vector o^{t+1}, reproducing the calculations of the FSM in one step (one iteration of the listed operating sequence). C1Step consists of the two components CState and CTrans. The former encodes the state expressions delta_{a,i}, while CTrans represents the transitions T_i.

CState constrains s^t, l^t, l^{t+1}, o^{t+1} and i^{t+1} independently of the next event e^{t+1}. It contains ns atomic components Ca_i, one for each state expression a_i, which are placed in parallel (Figure 5). The expressions are conditioned by their respective state and, hence, exactly one component Ca_i defines the proper values of the variables. Hence, a change in one a_i results in the modification of only one component, and a maximum of locality is achieved. Ca_i determines l^{t+1} and o^{t+1} depending on s^t, l^t and i^{t+1} according to a_i.
The relational model R_{Ca_j} of such an atomic component is:

R_{Ca_j} = { (s^t, i^{t+1}, l^t, l^{t+1}, o^{t+1}) | (s^t = s_j AND (i^{t+1}, l^t, o^{t+1}, l^{t+1}) in delta_{a_j}) OR (s^t != s_j AND l^{t+1} in dom(L) AND o^{t+1} in dom(O)) }

CTrans correlates all the variables except the output o^{t+1} and consists of nt parallel atomic components CT_i, one for each transition, and a component CT_Default (Figure 6). Exactly one component CT_i determines s^{t+1} depending on s^t, l^{t+1}, i^{t+1} and e^{t+1} according to T_i. The relational model R_{CT_i} of these atomic components is:

R_{CT_i} = { (s^t, e^{t+1}, i^{t+1}, l^{t+1}, s^{t+1}) | (T_i = (s^t, IF, s^{t+1}) AND (e^{t+1}, i^{t+1}, l^{t+1}) in IF) OR (T_i = (s', IF, s'') AND (s^t != s' OR (e^{t+1}, i^{t+1}, l^{t+1}) not in IF) AND s^{t+1} in S) }

In all cases where no transition is executed, the atomic component CT_Default defines the values according to the automaton definition: the state does not change, s^{t+1} = s^t. Therefore its relation is

R_{CT_Default} = { (s^t, e^{t+1}, i^{t+1}, l^{t+1}, s^{t+1}) | s^{t+1} = s^t AND, for all T_i = (s', IF, s''), (s^t != s' OR (e^{t+1}, i^{t+1}, l^{t+1}) not in IF) }

Now one iteration of the operating sequence can be simulated. To simulate n iterations, C1Step is copied n times and placed in series. But this also shows a limitation: the model can simulate only a fixed number of steps, and the more C1Step components are interconnected, the bigger the model (the relation of the entire model) grows. The number of steps needed for test generation depends on the respective FSM and the failure. In order to

discriminate the ok-model from the failure model, n has to be at least as large as the shortest path along which the effects of the fault become observable. One solution to this problem could be to start with a small number of steps and increase it until the system produces some tests.

A violation of locality becomes evident when the set of transitions is changed, e.g. by deleting, adding, or modifying one. In such cases, not only the respective CT_i component has to be removed, added, or changed, but also the default behavior in CT_Default has to be updated.

5 Fault Models

As described in section 2, our approach to testing is based on trying to confirm the correct behavior by refuting the models of possible faulty behaviors. When testing systems that are composed of physical components only, these models are obtained in a natural way from the fault models of elementary components, which usually have a small set of (qualitatively different) foreseeable misbehaviors due to the underlying physics. Faults due to additional interactions among components are either neglected or have to be anticipated and manifested in the model. In summary, for physical systems, the specific realization of the system determines the possible kinds of misbehavior, and testing compares them to a situation where all components work properly.

In software testing, this does not apply. First, the space of possible faults is not restricted by physical laws, but only by the creativity of the software developer when making mistakes. This space is infinite, and the occurrence of structural faults is the rule rather than the exception. Second, the assumption that the correct functioning of all (software) components assures the achievement of the intended overall behavior does not hold. This marks an important difference between testing physical artifacts and software. For the former, we can usually assume it was designed correctly (which is why correct components together will perform correctly), but for software we cannot. It is just the opposite: testing aims at revealing design faults.

In our application, the situation is complicated by the fact that it starts from the functional requirements rather than a detailed software design or even the code, which might suggest certain types of bugs to check for (e.g. non-termination of a while loop). On the positive side, this may lead to a smaller, qualitatively characterized set of possible misbehaviors.

In our example about the detection of fuel refilling, a failure one might think of is that the software does not poll the car's movement during driving and therefore does not detect a stop. This means the machine stays in its current control state instead of performing T_3. The transition T_3 can be seen as deleted. The construction of such a failure model can be achieved by applying the following operator to the ok-model:

remove-if-condition: (m_a, T_i) -> (m_a') where m_a' = m_a[IF_i <- empty set] and T_i = (s^t, IF_i, s^{t+1}).

The operation m_a[A <- B] results in an FSM m_a' which is equal to m_a except that element A is substituted by B. Another faulty behavior would occur if the software treated an increased tank level after 8 s in standstill exactly as if the car started moving. W.r.t. the FSM, this means executing T_6 instead of T_7. The proper failure model can be constructed by the operator

move-if-condition-to: (m_a, T_i, T_j) -> (m_a') where m_a' = m_a[IF_i <- IF_i UNION IF_j, IF_j <- empty set] and T_i = (s^t, IF_i, s^{t+1}).
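A Python sketch of these two operators, under assumptions, could look as follows. Here the FSM is a dict as in the earlier toy sketch, but with the IF conditions represented extensionally as frozensets of enabling (event, input, locals) combinations, since the operators manipulate them as sets.

    def remove_if_condition(m, i):
        """m_a[IF_i <- empty]: transition T_i can never fire (e.g. a deleted T_3)."""
        faulty = dict(m)
        faulty['T'] = list(m['T'])
        s_src, _, s_dst = faulty['T'][i]
        faulty['T'][i] = (s_src, frozenset(), s_dst)
        return faulty

    def move_if_condition_to(m, i, j):
        """m_a[IF_i <- IF_i UNION IF_j, IF_j <- empty]: T_i fires where T_j should."""
        faulty = dict(m)
        faulty['T'] = list(m['T'])
        s_i, if_i, d_i = faulty['T'][i]
        s_j, if_j, d_j = faulty['T'][j]
        faulty['T'][i] = (s_i, frozenset(if_i | if_j), d_i)
        faulty['T'][j] = (s_j, frozenset(), d_j)
        return faulty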
Figure 4: C1Step and its internal structure.
Figure 5: CState and its internal structure.
Figure 6: CTrans and its internal structure.

6 Results

In this section the discrimination of two failure models from the ok-model is discussed. These failures are:

m_delT3 = remove-if-condition(m_ok, T_3)
m_delT5 = remove-if-condition(m_ok, T_5)

A relational model that simulates 7 steps of the FSM is used here.

Discrimination between m_ok and m_delT3. Only two types of tests are generated to discriminate these two models. Figure 7 lists them, where * stands for any value in the domain of the respective variable. The trajectories caused by the test inputs are shown in Figure 9. The input sequence of the first test could be formulated more naturally as follows:

1. Starting from the initial state, one waits 4 s.
2. Then the car starts moving and,
3. directly after this, it stops again, and
4. one waits again 4 s.
5. After waiting a third time for 4 s,
6. a significant increase of the tank filling is detected.

Discrimination between m_ok and m_delT5. To discriminate these two models, 36 different types of tests are generated. The two tests of the previous discrimination are also among them. Some of them are shown in Figure 8. In test 2, the second and the third events occurring are increased tank filling. These events are unnecessary: without these two steps, the test input still discriminates the fault from the ok-model. The reason that the system generates them is the fixed number of steps of the relational model, so some steps have to be filled with events having no effects but serving as placeholders. This explains why so many different tests are generated.

Figure 7: Tests discriminating m_ok from m_delT3.
Figure 8: Tests discriminating m_ok from m_delT5.

Eliminating unnecessary stimuli is addressed in [Struss 05].

Discrimination of both pairs. The tests discriminating both pairs (m_ok from m_delT3 as well as from m_delT5) are the two from the first discrimination, because these are also in the generated set of the second one. In our example this is not surprising: to distinguish between an ok automaton and a fault automaton where a transition is deleted, one of them always has to reach state S_6, because only there does the output differ from that of the others.

7 Related Work

Comparison to classical test generation approaches. Classical approaches generate tests optimized with respect to a certain coverage criterion like state, transition or MCDC coverage [Beizer95]. In our approach, with carefully chosen sets of failure models, tests will be generated that also achieve classical coverage criteria. To obtain state coverage, for example, a set of failure models M_fail could be constructed as follows: for each state s_i, there exists one failure model m_fail,i in M_fail which differs from the ok-model only in the output of state expression a_i. The outputs of these two models are complementary. For the case that m_ok is a deterministic automaton, the equivalence is proven in [Esser 05].

Comparison with the diagnoser approach. The diagnoser of [Sampath 96] could also be used for test generation (although the authors are not aware of any publication describing this): in this approach a diagnoser is generated from the system model, both FSMs, for calculating diagnoses and diagnosability. The transitions of a diagnoser are labeled with observable events, whereas the states are labeled with the behavior modes consistent with the events that occurred so far. For test generation, the set of observable events could be split into causal and non-causal observable events. Then, the task is to find a causal event sequence where each diagnoser path consistent with these causal events and any possible non-causal observables has either only an ok-label or no ok-label at all. Each

sequence of causal and observable events is a valid test for the modeled failures. We expect that the two approaches can be transformed into each other. The kind of automata used in the diagnoser:

- do not include input/output/local variables; instead, observable/unobservable/failure events are used;
- instead of separate failure models, only one model is used that includes all relevant behavior modes.

A mapping from an FSM A as described in this paper to a state machine A' of the system model for the diagnoser approach could be outlined as follows:

- Each possible value combination of the input and output variables as well as of the event in A is represented by its own observable event E' in A': f_E: (I, O, E) -> E'.
- Each possible combination of a state s and values of the local variables L in A is mapped to its own state in A'. This may result in a quite huge state set: f_S: (S, L) -> S'.
- A transition T' with an observable event e' between two states s_1', s_2' in A' exists iff there exist a proper transition T and state expression X in A: T' = (s_1', e', s_2') exists iff (i, o, e) = f_E^{-1}(e'), (s_1, l_1) = f_S^{-1}(s_1'), (s_2, l_2) = f_S^{-1}(s_2'), (i, l_1, o, l_2) in delta_A(s_1), and T_i = (s_1, IF, s_2) in T(A) with (e, i, l_2) in IF.

Finally, the resulting state machines A' for the different behavior modes have to be merged into a single one by introducing transitions which use failure events and which specify the differences between the individual machines. An analysis and comparison of the efficiency would be interesting. However, our main goal here was to find a formalism allowing the same relational approach for software as we already use for the test generation of physical devices.

Figure 9: Trajectory of the FSM for the two tests of Figure 7.

8 Discussion

The problem which is central to our approach is finding appropriate fault models representing realistic and relevant faults. On the one hand, they are difficult to obtain for software, and even more so when one starts from a functional specification, as we do. This may seem to be a disadvantage in comparison with other testing heuristics, like coverage criteria. However, it is not true that these do not involve fault models. In fact, they are based on assumptions about possible faults, but these are implicit. The fact that our approach makes them explicit is a major advantage and the basis for further progress. It also bears the potential to generate tests whose power and coverage grow together with the refinement of the specification during the development process.

We consider the results of this experiment encouraging and will continue this work in a project with Audi AG. It has raised a number of issues that need to be addressed in this project. A basic one concerns the question whether the current modeling formalism, a specific type of finite state machine, is appropriate. This has several aspects: first, it has to be checked whether it is expressive enough to capture the requirements on embedded software. Second, the impact of the representation on the complexity of the algorithm has to be analyzed (handling absolute time is an important issue, as stated below). These aspects have to be confronted with the most important guideline: appropriateness for current practice. Our project is not an academic exercise, but aims at tools that can be easily used in the actual work process.
Current requirement specifications, at the development stage that matters in our context, comprise mainly natural-language text together with a few formal or semi-formal elements, such as state charts (provided they are written at all!). Assuming the existence of formal, executable specifications is unrealistic. Any formal representation of the requirements, as we need it as an input to our tools, has to take into account whether it can be produced in the current process, by the staff, given their education and background, and with the limited effort that can be spent in a real project where meeting deadlines and reducing development time have top priority. Whenever the use of new tools and additional work is required, this needs a rigorous justification by a significant pay-off (in our case, in the time spent on testing and in the quality of its results).

On the technical side, an adequate handling of time is needed. In our example, time elapsing in a particular state (e.g. 8 s with no motion) seems to be local. However, the respective event has to be stated in a way that can be interpreted properly in other states as well, which may have been reached due to a fault. Introducing global absolute time tends to enforce using the smallest time increments required for some state and event, which appears prohibitive.

Acknowledgements

Thanks to Torsten Strobel, who implemented the algorithm, to Oskar Dressler for discussions and support of this work, to the Model-based Systems and Qualitative Modeling Group at the Technical University of Munich, and to the reviewers for their helpful comments. We also thank Audi AG, Ingolstadt, and, in particular, Reinhard Schieber for supporting this work.

References

[Esser 05] Esser, M.: Modellbasierte Generierung von Tests für eingebettete Systeme am Beispiel der Tankanzeige in einem Kraftwagen. Technical University of Munich, 2005.

[Beizer95] Beizer, B.: Black-Box Testing. John Wiley and Sons, New York, NY, 1995.

[McIlraith-Reiter 92] McIlraith, S., Reiter, R.: On Tests for Hypothetical Reasoning. In: W. Hamscher, J. de Kleer and L. Console (eds.), Readings in Model-based Diagnosis: Diagnosis of Designed Artifacts Based on Descriptions of their Structure and Function. Morgan Kaufmann, San Mateo, 1992.

[OCC'M 05]

[Sampath 96] Sampath, M., Sengupta, R., Lafortune, S., Sinnamohideen, K., Teneketzis, D.: Failure Diagnosis Using Discrete-Event Models. In: IEEE Transactions on Control Systems Technology, 4(2), 1996.

[Struss 94] Struss, P.: Testing Physical Systems. In: Proceedings of AAAI-94, Seattle, USA, 1994.

[Struss 94a] Struss, P.: Testing for Discrimination of Diagnoses. In: Working Papers of the 5th International Workshop on Principles of Diagnosis (DX-94), New Paltz, USA, 1994.

[Struss 05] Struss, P.: Automated Test Reduction. In: B. Rinner et al. (eds.), 19th International Workshop on Qualitative Reasoning (QR-05), May 18-20, 2005, Graz, Austria. Also in: Dearden, R. and Narasimhan, S. (eds.), 16th International Workshop on Principles of Diagnosis (DX-05), June 1-3, 2005, Monterey, California.

[Vatcheva-de Jong-Mars 02] Vatcheva, I., de Jong, H., Mars, N.: Selection of Perturbation Experiments for Model Discrimination. In: Proceedings of ECAI-02, 2002.

A Multi-Valued SAT-Based Algorithm for Faster Model-Based Diagnosis

Alexander Feldman, Jurryt Pietersma and Arjan van Gemund
Delft University of Technology
Faculty of Electrical Engineering, Mathematics and Computer Science
Mekelweg 4, 2628 CD, Delft, The Netherlands

Abstract

Finite integer domains offer an intuitive representation of fault diagnosis models of real-world systems. Approaches that encode multi-valued models into the Boolean domain suffer from combinatorial explosion. Prompted by recent advances in multi-valued SAT solving, in this paper we present a multi-valued diagnosis algorithm. This sound and complete algorithm is based on multi-valued SAT and A*, and does not require Boolean encoding. The resulting diagnostic engine is specifically designed to suit the characteristics of the diagnosis search and better exploits the locality present in the multi-valued variable domains of a wide range of model-based diagnosis problems. Results from experiments on both synthetic and real-world problems are in agreement with recently reported good performance of multi-valued DPLL consistency checkers. The models used for experimentation include NASA's X-34 propulsion system and ASML's wafer scanner subsystems. The empirical results show that, depending on the domain size and the number of variables, the multi-valued approach can deliver up to two orders of magnitude speedup over Boolean approaches.

1 Introduction

When considering a multi-valued model of the X-34 propulsion system [Sgarlata and Winters, 1997], one option is to model it in the Boolean domain and to use a SAT-based diagnosis algorithm or, alternatively, to create a model for which each variable is in a finite integer domain. The latter approach allows for intuitive modeling of components with multiple fault modes and for the discretization of continuous values in areas like qualitative reasoning. Within this modeling technique, we face the choice of directly using a multi-valued diagnostic algorithm or, as an alternative, encoding the model into the Boolean domain by using an appropriate mapping.

Boolean encoding is not without a price. Many diagnostic (and SAT) algorithms work on a normalized representation of a model (e.g., CNF). In the Boolean encoding phase, the model loses important locality information (treating the different values of a variable in connection with each other increases the reasoning speed). This problem of breaking apart the multi-valued variables is often aggravated later in the normalization, as the Boolean encodings of a single multi-valued variable can be spread throughout the model after it has been encoded. Directly using a model represented in multi-valued normal form preserves the locality, while a multi-valued diagnostic algorithm retains the simplicity of a Boolean diagnosis algorithm.

The main algorithm introduced in this paper is a multi-valued A* search which computes the diagnoses in best-first order, starting with a minimal diagnosis (note that the diagnoses produced subsequently are not necessarily minimal). Thus, not all minimal diagnoses need be computed (the number of all minimal diagnoses can still be exponential in the number of components [Vatan, 2002]).

The method proposed in this paper is not the only one which achieves speed-up by exploiting locality. A complementary approach is to exploit system structure and hierarchy [Feldman et al., 2005; Fattah and Dechter, 1995].
Reasoning in representations which are closer to the raw model reduces the number of pre-processing transformation steps and, in general, achieves faster diagnosis times. All these techniques can be combined to achieve real-time diagnosis for a wide class of real-world applications.

Surprisingly, there are few publications [Bandelj et al., 2002] concerning the application of non-Boolean search algorithms to qualitative reasoning and model-based diagnosis. In the satisfiability field, CAMA [Liu et al., 2003] is a novel extension of the classic DPLL algorithm. The CAMA algorithm uses unit-clause propagation and conflict learning to increase the performance of the satisfiability checking. Similarly, [Frisch and Peugniez, 2001] studies direct non-Boolean stochastic local search, where a well-known Boolean SAT engine (Walksat) is modified to handle non-Boolean problems. The authors of both CAMA and NB-Walksat find their approaches faster for multi-valued SAT instances with increasing domain size, compared to a Boolean DPLL run on the equivalent Boolean encodings.

Encoding multi-valued problems in the Boolean domain is a technique discussed in many papers. The study of the use of both a Boolean truth maintenance system and the finite-domain approach for solving a class of constraint satisfaction problems (CSP) dates back to [de Kleer, 1989].

The latter paper suggests sparse Boolean encoding for finite-domain integer solvers. Multi-valued reasoning is complementary to other locality-based diagnosis search techniques [Provan, 2001]. A study of the use of multi-valued propositional encodings for finite-domain variables, and of the possible combinatorial loss of performance due to the increase of the propositional theory size, however, is beyond the scope of that paper.

CSP-based algorithms for model-based diagnosis [Sachenbacher and Williams, 2004; Williams and Ragno, 2004] already consider multi-valued variable domains. An extensive study has been performed on CSP decomposition techniques [Stumptner and Wotawa, 2003], and some empirical results discussed by the same authors in [Stumptner and Wotawa, 2001] show good performance. The latter decomposition technique results in faster diagnosis due to the tractability achieved by transforming the original problem into an equivalent one with restricted structure. Our technique differs from the above by the fact that it is based on multi-valued propositional search, hence allowing more aggressive optimization by borrowing learning algorithms and heuristics from the satisfiability domain.

The method discussed in this paper can be generalized to almost any approach to propositional reasoning. Multiple-valued decision diagrams (MDD) [Srinivasan et al., 1990] are a natural extension of binary decision diagrams (BDD). To our knowledge, the state-of-the-art compilation technique of Darwiche [Darwiche, 2004] has not been generalized to multi-valued propositional logic; hence an approach similar to ours would be complementary to this method. Similarly, a propositional truth-maintenance system (TMS) (e.g., [Nayak and Williams, 1998]) and Boolean model-based reasoning systems [Frohlich and Nejdl, 1997] can benefit from being able to reason over non-Boolean literals.

The type of Boolean encoding influences the performance of the DPLL search; the topic is further studied in [Hoos, 1999]. This paper distinguishes between compact and sparse encodings (also referred to as unary and binary in other papers) and compares the performance of a state-of-the-art CSP solver and a Boolean SAT consistency checker. The results show increased performance of the multi-valued approach. A hybrid finite-domain constraint solver for circuits is suggested in [Parthasarathy et al., 2004]. This approach combines a Boolean DPLL checker and a finite-domain integer CSP solver to allow for a more compact problem representation, avoiding the house-keeping constraints imposed by unary or binary encodings (where the number of values is not a power of 2). This results in a faster solver with wide application, including consistency-based fault diagnosis.

The algorithm described in this paper provides a sound and complete method for computing diagnoses in best-first order. Heuristic functions working on multi-valued representations are provided with additional locality information in comparison to their counterparts working on encoded Boolean models. The extra search dimension which is added by the multi-valued domains of the model variables facilitates faster computation of the leading diagnoses. In particular, our method is compared to Boolean algorithms working on both sparse and dense encodings. Sparse encoding leads to combinatorial explosion even with small models, while dense encoding, still slower than the multi-valued approach, makes it difficult to construct efficient heuristic functions. All this motivates the use of the direct multi-valued reasoning described below.

The remainder of the paper is organized as follows. In Section 2 we introduce some basic terminology and show a multi-valued consistency checking algorithm, which will be used later in the diagnosis process. In Section 3 we describe the main diagnostic algorithm and illustrate its workings with a small example. Section 4 contains experimental performance results. Finally, conclusions, notes and future work are presented.

2 Multi-Valued Satisfiability

The technique presented in this paper searches for a diagnosis by checking the consistency of a possible health assignment, the system description, and the observation, while discarding the states which are inconsistent. The consistency check is done using a DPLL-based algorithm in the multi-valued domain. The main difference between the multi-valued DPLL and its well-known Boolean counterpart comes from the fact that a multi-valued consistency checking routine can branch on more than two values. Before we show this multi-valued variant of DPLL and discuss the opportunities for its optimization, we introduce the basic terminology necessary for multi-valued satisfiability.

Definition 1 (Multi-Valued Literal). A multi-valued variable v_i in V takes a value from a finite domain, which is an integer set D_i = {1, 2, ..., m}. A positive multi-valued literal l_j^+ is a Boolean function l_j^+ = (v_i = d_k), where v_i in V, d_k in D_i. A negative multi-valued literal l_j^- is a Boolean function l_j^- = (v_i != d_k), where v_i in V, d_k in D_i. If not specified, a literal l_j can be either positive or negative.

Note that a variable v_i can assume at most one value, or its complement, in D_i. The need for negative literals comes from the fact that, frequently in models, the process of converting to multi-valued CNF results in many negations. For a variable with a sufficiently large domain, using this notation instead of all the complementary values leads to significantly shorter formulae. Next we introduce a multi-valued conjunctive normal form (CNF), which will be our representation for the diagnostic problem (1).

Definition 2 (Multi-Valued CNF). A multi-valued conjunctive normal form is a conjunction of disjunctions of multi-valued literals, that is C = s_1 AND s_2 AND ... AND s_n and s_i = l_i1 OR l_i2 OR ... OR l_ik for i = 1 ... n.

Throughout this text we will also use an alternative notation for a formula in CNF: a clausal set. In this case the clausal set is a set of clauses and each clause is a disjunction of multi-valued literals.

Definition 3 (Multi-Valued Assignment). A multi-valued assignment phi, or a multi-valued term, is a conjunction of multi-valued literals, that is phi = l_1 AND l_2 AND ... AND l_p.

(1) Translation from multi-valued propositional logic to multi-valued CNF is well-studied and we will omit it for brevity.
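These definitions map naturally onto a small data structure. The following Python sketch (ours, not the authors' implementation) represents literals as named tuples and encodes the formula used in the example at the end of this section.

    from typing import NamedTuple

    class Literal(NamedTuple):
        var: str        # variable name v_i
        value: int      # value d_k from the finite domain D_i
        positive: bool  # True for (v_i = d_k), False for (v_i != d_k)

    def holds(lit, assignment):
        """Truth value of a literal under a partial assignment {var: value};
        None if the variable is still unassigned."""
        if lit.var not in assignment:
            return None
        eq = assignment[lit.var] == lit.value
        return eq if lit.positive else not eq

    # A clause is a list of literals, a CNF a list of clauses; e.g. the formula
    # C = ((x = 1) OR (y = 2)) AND ((x = 3) OR (y != 1)) used in the example below:
    C = [[Literal('x', 1, True), Literal('y', 2, True)],
         [Literal('x', 3, True), Literal('y', 1, False)]]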
All this motivates the use of the direct multi-valued reasoning described below.

The remainder of the paper is organized as follows. In Section 2 we introduce some basic terminology and show a multi-valued consistency checking algorithm, which is used later in the diagnosis process. In Section 3 we describe the main diagnostic algorithm and illustrate its workings with a small example. Section 4 contains experimental performance results. Finally, conclusions, notes, and future work are presented.

2 Multi-Valued Satisfiability

The technique presented in this paper searches for a diagnosis by checking the consistency of a possible health assignment, the system description, and the observation, while discarding the states which are inconsistent. The consistency check is done using a DPLL-based algorithm in the multi-valued domain. The main difference between the multi-valued DPLL and its well-known Boolean counterpart is that a multi-valued consistency checking routine can branch on more than two values. Before we show this multi-valued variant of DPLL and discuss the opportunities for its optimization, we introduce the basic terminology necessary for multi-valued satisfiability.

Definition 1 (Multi-Valued Literal). A multi-valued variable v_i ∈ V takes a value from a finite domain, which is an integer set D_i = {1, 2, …, m}. A positive multi-valued literal l_j⁺ is a Boolean function l_j⁺ ≡ (v_i = d_k), where v_i ∈ V and d_k ∈ D_i. A negative multi-valued literal l_j⁻ is a Boolean function l_j⁻ ≡ (v_i ≠ d_k), where v_i ∈ V and d_k ∈ D_i. If not specified, a literal l_j can be either positive or negative.

Note that a variable v_i can assume at most one value of D_i, or the complement of one value. The need for negative literals comes from the fact that, in models, the process of converting to multi-valued CNF frequently results in many negations. For a variable with a sufficiently large domain, using this notation instead of the disjunction of all complementary values leads to significantly shorter formulae.

Next we introduce the multi-valued conjunctive normal form (CNF), which will be our representation for the diagnostic problem¹.

Definition 2 (Multi-Valued CNF). A multi-valued conjunctive normal form is a conjunction of disjunctions of multi-valued literals, that is, C = σ_1 ∧ σ_2 ∧ … ∧ σ_n and σ_i = l_{i1} ∨ l_{i2} ∨ … ∨ l_{ik} for i = 1 … n.

Throughout this text we will also use an alternative notation for a formula in CNF: a clausal set. In this case the clausal set is a set of clauses, and each clause is a disjunction of multi-valued literals.

Definition 3 (Multi-Valued Assignment). A multi-valued assignment φ, or a multi-valued term, is a conjunction of multi-valued literals, that is, φ = l_1 ∧ l_2 ∧ … ∧ l_p.

¹ Translation from multi-valued propositional logic to multi-valued CNF is well-studied and we omit it for brevity.
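To make these definitions concrete, a multi-valued clausal set admits a very direct machine representation. The following minimal Python sketch is our own illustration (all names are ours, not part of any published implementation); it encodes a literal as a (variable, value, sign) triple, with sign False standing for ≠:

    # Sketch of a multi-valued clausal-set representation (illustrative only).
    # A literal (v, d, True) encodes (v = d); (v, d, False) encodes (v != d).
    domains = {"x": {1, 2, 3}, "y": {1, 2}}            # D_x and D_y

    # C = ((x=1) v (y=2)) ^ ((x=3) v (y!=1)), the running example used below.
    cnf = [[("x", 1, True), ("y", 2, True)],
           [("x", 3, True), ("y", 1, False)]]

    def satisfies(literal, assignment):
        """True iff the assignment makes the literal true."""
        var, value, positive = literal
        return var in assignment and (assignment[var] == value) == positive

    phi = {"x": 2, "y": 2}                             # a full assignment
    print(all(any(satisfies(l, phi) for l in c) for c in cnf))   # True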

Obviously, we can convert a multi-valued assignment to a clausal set in which each clause contains a single literal, and vice versa, by using De Morgan's law: ¬(l_1 ∧ l_2 ∧ … ∧ l_p) = ¬l_1 ∨ ¬l_2 ∨ … ∨ ¬l_p. The latter proves useful when, later in Algorithm 2, we have to add an observation represented as an assignment to the clausal set of the system description. If an assignment contains all the variables in a multi-valued CNF we call it full, otherwise it is partial.

Next we describe an algorithm which, given a multi-valued CNF C, returns True iff there is an assignment (a conjunction of literals) φ such that C ∧ φ ⊭ ⊥. The worst-case time complexity of Algorithm 1 is exponential in the number of variables. Formulae from model-based diagnosis, however, are highly structured and rarely expose this worst-case performance. Considering all the domain sets D_i ∈ D, let d_max = max_{D_i ∈ D} |D_i|, that is, d_max is the size of the largest variable domain. The space complexity of Algorithm 1 is then O(|V| · d_max), where |V| is the number of variables.

Algorithm 1 Multi-valued DPLL consistency checking.

    function MVSAT(C, V, D, φ)
      inputs: C, a clausal set
              V = {v_1, v_2, …, v_q}, a set of variables
              D = {D_1, D_2, …, D_q}, a set of domain sets
              φ, an assignment, initially empty
      if (φ ← φ ∪ UNITPROPAGATE(C, φ)) = ⊥ then
        return False
      end if
      if every clause σ ∈ C is satisfied by φ then
        return True
      end if
      for all {v_i ∈ V, k ∈ D_i : l ← (v_i = k), l ∉ φ} do
        if MVSAT(C, V, D, φ ∧ l) = True then
          return True
        end if
      end for
      return False
    end function

The workings of Algorithm 1 are very similar to the original Boolean algorithm, except that it branches on every possible value of the selected variable v_i. An important part of the algorithm is unit propagation, implemented in the UNITPROPAGATE routine. Note that, unlike in the Boolean case, unit propagation works only for clauses which are unit-open with a free positive literal (in the Boolean case we can assign a value via negative literals as well).

Algorithm 1 is subject to the same optimization techniques as the Boolean DPLL. An important source of speed-up is conflict learning². In addition to the classical Boolean speed-up methods, a further source of speed-up is the use of various heuristic techniques made possible by the extra search dimension introduced by the multiple values. When choosing a value for a given variable, for example, it is possible to select (either dynamically, from the remaining values and clauses, or statically) the one which satisfies the largest number of clauses.

² Actually, in the case of inconsistent input we need a conflict extraction mechanism, the result of which is used for pruning the diagnostic tree in Algorithm 2. Due to the limited scope of this paper we do not discuss mechanisms for conflict extraction.
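Algorithm 1 can be rendered directly in code. The following Python sketch is a simplified illustration of ours: it omits unit propagation and conflict learning (the optimizations discussed above) and simply branches on every domain value, as the multi-valued DPLL does:

    # Sketch of multi-valued DPLL (Algorithm 1), without UNITPROPAGATE or
    # conflict learning; literals are (var, value, positive) triples.
    def clause_status(clause, phi):
        """'sat', 'unsat', or 'open' under the partial assignment phi."""
        open_left = False
        for var, value, positive in clause:
            if var not in phi:
                open_left = True
            elif (phi[var] == value) == positive:
                return "sat"
        return "open" if open_left else "unsat"

    def mvsat(cnf, domains, phi=None):
        phi = dict(phi or {})
        status = [clause_status(c, phi) for c in cnf]
        if "unsat" in status:
            return False
        if all(s == "sat" for s in status):
            return True
        var = next(v for v in domains if v not in phi)
        # The multi-valued step: branch on every value of the chosen variable.
        return any(mvsat(cnf, domains, {**phi, var: k}) for k in domains[var])

    domains = {"x": [1, 2, 3], "y": [1, 2]}
    cnf = [[("x", 1, True), ("y", 2, True)],
           [("x", 3, True), ("y", 1, False)]]
    print(mvsat(cnf, domains))        # True, e.g. via (x = 2) ^ (y = 2)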
Algorithm 1 has the potential to determine satisfiability faster than a Boolean DPLL running on an encoding. Consider the example formula C = ((x = 1) ∨ (y = 2)) ∧ ((x = 3) ∨ (y ≠ 1)), with domains D_x = {1, 2, 3} and D_y = {1, 2} for the variables x and y, respectively. A sparse Boolean encoding would map x = 1 to x_1, x = 2 to x_2, x = 3 to x_3, y = 1 to y_1 and, finally, y = 2 to y_2. In addition it needs to impose the constraints ¬x_1 ∨ ¬x_2, ¬x_1 ∨ ¬x_3, ¬x_2 ∨ ¬x_3, and ¬y_1 ∨ ¬y_2. The sparsely encoded Boolean formula is:

C_s = (x_1 ∨ y_2) ∧ (x_3 ∨ ¬y_1) ∧ (x_1 ∨ x_2 ∨ x_3) ∧ (¬x_1 ∨ ¬x_2) ∧ (¬x_1 ∨ ¬x_3) ∧ (¬x_2 ∨ ¬x_3) ∧ (y_1 ∨ y_2) ∧ (¬y_1 ∨ ¬y_2)

While the multi-valued formula C has 2 variables, with 3 and 2 values respectively, and a total of 2 clauses, its Boolean encoding C_s has 5 variables and 8 clauses. In the multi-valued case we can determine the satisfiable assignment (x = 2) ∧ (y = 2) in 4 steps only (ignoring the effect of unit propagation), while in the Boolean encoding we have to perform 6 recursive calls.

3 Diagnosis of Multi-Valued Models

Modeling physical artifacts and representing them from first principles is a topic of its own; in this paper we assume that a correct model has been converted to a multi-valued CNF, and what remains is the computationally intensive task of finding all possible explanations of a given observation. In order to formulate such an algorithm, we first formalize the notions of a diagnostic problem and a diagnosis.

Definition 4 (Diagnostic Problem). A multi-valued diagnostic problem DP is defined as the ordered triple DP = ⟨SD, H, OBS⟩, where SD = ⟨V, D, C⟩ comprises the set of variables, their domains, and a system description represented as a multi-valued CNF; H ⊆ V is a set of health variables; and the observation OBS is a variable assignment over some of the variables in V \ H.

If a device is malfunctioning, then assigning nominal values to all the health variables of its model (let us name such a nominal assignment x) manifests an inconsistency with the observation, i.e., OBS ∧ SD ∧ x ⊨ ⊥. A sound and complete diagnostic algorithm has to find all assignments x which explain the observation OBS, that is, all x such that OBS ∧ SD ∧ x ⊭ ⊥.

Definition 5 (Multi-Valued Diagnosis). A diagnosis or partial diagnosis for the system DP = ⟨SD, H, OBS⟩, SD = ⟨V, D, C⟩, is an assignment x = l_1 ∧ l_2 ∧ … ∧ l_n over the health variables such that SD ∧ OBS ∧ x ⊭ ⊥.

The central problem of model-based diagnosis is that for n health variables we may have as many as 2^n possible diagnoses. In the multi-valued domain the complexity is even worse due to the multi-valued domains of the variables: it becomes O(d^n), where d is the cardinality of the largest domain set. Although the number of diagnoses depends on the model (e.g., on whether it is a weak or strong fault model) and on the size of the observation, we rarely need all diagnoses.

An informed search strategy such as A* is a suitable method for computing only those diagnoses which optimize an objective function g. Such an approach allows us to stop the diagnostic algorithm after we have found the first N diagnoses maximizing g.

As a heuristic function we typically employ a greedy estimator of the probability of an assignment φ. Let us assign a-priori probabilities to all possible values of all health variables. The a-priori probability function of a health variable h_i we define as p_i(h_i), where 0 ≤ p_i(x) ≤ 1 for x ∈ D_i and Σ_{x ∈ D_i} p_i(x) = 1. The probability estimator of a health assignment φ = l_1 ∧ l_2 ∧ … ∧ l_k is defined as g(φ) = p(h_1) · p(h_2) · … · p(h_n), where h_1, h_2, …, h_n are all health variables h_i ∈ H and:

    p(h_i) = p_i(d_k)               if (h_i = d_k) ∈ φ
             1 − p_i(d_k)           if (h_i ≠ d_k) ∈ φ
             max_{k ∈ D_i} p_i(k)   otherwise            (1)

Next we show the actual multi-valued A* algorithm for model-based diagnosis.

Algorithm 2 Multi-Valued A* Diagnosis.

    procedure MVA*(SD, OBS)
      inputs: SD = ⟨V, D, C⟩, a system model
              OBS, an observation term
      local variables: Q, a priority queue
                       x, a health assignment
      PUSH(Q, INITIALSTATE())
      while (x ← POP(Q)) ≠ ∅ do
        if MVSAT(SD ∧ OBS ∧ x, V, D, ∅) then
          if ISFULLDIAGNOSIS(x) then
            OUTPUTDIAGNOSIS(x)
            ENQUEUESIBLINGS(Q, SD ∧ OBS, x)
          else
            PUSH(Q, CHILDSTATE(SD ∧ OBS, x))
          end if
        else
          ENQUEUESIBLINGS(Q, SD ∧ OBS, x)
        end if
      end while
    end procedure

An implementation of (1) is used for the ordering in the priority queue Q. For the manipulation of the nodes in this queue (each node is an ordinary multi-valued health assignment) we use the standard functions PUSH and POP, the latter returning ∅ when the queue is empty. The queuing of the nodes is performed by the CHILDSTATE and ENQUEUESIBLINGS routines. The former extends a partial assignment with its most probable descendant, and the latter pushes onto the queue the most probable siblings of each ancestor of the current node. Note that, because the search is non-systematic, we have to keep a list of the visited nodes (this can be organized somewhat more efficiently in a trie).

As diagnostic models can often be over- or under-constrained (depending on the modeling technique), Algorithm 2 can be extended with a conflict-learning mechanism for pruning parts of the search tree [Williams and Ragno, 2004]. Finding the set of all minimal conflicts is a problem which is itself NP-hard, hence such a technique has no choice but to perform a limited amount of analysis (e.g., through resolution) for finding conflicts of good quality.

We illustrate the advantages of multi-valued diagnosis with a simple model of a valve. The health state of the valve we denote by the health variable h, the control variable c denotes the commanded position of the valve, and for the input and output we use i and o, respectively. The domains of the four variables h, c, i, and o are D_h = {1 (nominal), 2 (stuck open), 3 (stuck closed), 4 (unknown)}, D_i = D_o = {1 (low pressure), 2 (high pressure)}, and D_c = {1 (open), 2 (close)}, respectively. Additionally, we define the a-priori probability estimator of h as p(h = 1) = 0.9, p(h = 2) = p(h = 3) = 0.045, and p(h = 4) = 0.01.
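The estimator (1) is simple enough to state in a few lines of code. The sketch below is our own illustration (the dictionary-based representation of φ is an assumption); it evaluates g for the valve's priors:

    # Sketch of the probability estimator g from (1), on the valve's priors.
    priors = {"h": {1: 0.9, 2: 0.045, 3: 0.045, 4: 0.01}}

    def g(phi, priors):
        """phi maps a health variable to ('=', d) or ('!=', d); health
        variables absent from phi contribute their most probable value."""
        estimate = 1.0
        for var, p in priors.items():
            if var not in phi:
                estimate *= max(p.values())            # 'otherwise' case
            else:
                op, d = phi[var]
                estimate *= p[d] if op == "=" else 1.0 - p[d]
        return estimate

    print(g({"h": ("=", 2)}, priors))   # 0.045
    print(g({}, priors))                # 0.9, i.e. the nominal assignment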
The model is encoded as the multi-valued propositional formula:

    M = { ((h = 1) ∧ (c = 1)) ⇒ (o = i),
          ((h = 1) ∧ (c = 2)) ⇒ (o = 1),
          (h = 2) ⇒ (o = i),
          (h = 3) ⇒ (o = 1) }                          (2)

Next we convert M to a clausal set, with the following result:

    C = { (c = 2) ∨ (i = 2) ∨ (o = 1) ∨ (h ≠ 1),
          (c = 2) ∨ (i = 1) ∨ (o = 2) ∨ (h ≠ 1),
          (c = 1) ∨ (o = 1) ∨ (h ≠ 1),
          (i = 2) ∨ (o = 1) ∨ (h ≠ 2),
          (i = 1) ∨ (o = 2) ∨ (h ≠ 2),
          (o = 1) ∨ (h ≠ 3) }                          (3)

Let us assume an observation OBS = (c = 2) ∧ (i = 2) ∧ (o = 2), that is, the valve is stuck open. The first step of Algorithm 2 is to check the health assignment x_1 = (h = 1), which has the highest probability estimator, g(x_1) = 0.9. Algorithm 1 determines that SD ∧ OBS ∧ x_1 ⊨ ⊥, hence we have to try the second most probable health assignment x_2 = (h = 2), and in this case we have SD ∧ OBS ∧ x_2 ⊭ ⊥. Thus x_2 is a diagnosis and, due to the admissibility of the heuristic involved, we can pronounce x_2 the most likely diagnosis.
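The two consistency checks of this walkthrough can be reproduced mechanically. The brute-force sketch below is ours (it enumerates assignments instead of running the DPLL of Algorithm 1, which is adequate only at this toy size) and transcribes the clauses of (3):

    # Sketch reproducing the valve example's consistency checks by brute force.
    from itertools import product

    domains = {"h": [1, 2, 3, 4], "c": [1, 2], "i": [1, 2], "o": [1, 2]}
    model = [  # the clausal set (3); literals are (var, value, positive)
        [("c", 2, True), ("i", 2, True), ("o", 1, True), ("h", 1, False)],
        [("c", 2, True), ("i", 1, True), ("o", 2, True), ("h", 1, False)],
        [("c", 1, True), ("o", 1, True), ("h", 1, False)],
        [("i", 2, True), ("o", 1, True), ("h", 2, False)],
        [("i", 1, True), ("o", 2, True), ("h", 2, False)],
        [("o", 1, True), ("h", 3, False)],
    ]
    obs = {"c": 2, "i": 2, "o": 2}     # the stuck-open observation

    def consistent(health):
        """Is SD ^ OBS ^ (h = health) satisfiable?"""
        for vals in product(*(domains[v] for v in "hcio")):
            phi = dict(zip("hcio", vals))
            if phi["h"] != health or any(phi[v] != d for v, d in obs.items()):
                continue
            if all(any((phi[v] == d) == pos for v, d, pos in cl) for cl in model):
                return True
        return False

    print(consistent(1))   # False: the nominal assignment x_1 is inconsistent
    print(consistent(2))   # True:  x_2 = (h = 2) is the most likely diagnosis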

Next, we consider the sparse Boolean encoding of C in (3) (normally we would encode M as M_s and convert M_s to CNF, which is even worse than encoding C directly as C_s). To preserve the order in which we generate diagnoses, we assign a probability estimator p(h_n) to each Boolean health variable h_n encoding a state h = n. We define p(h_n) = p(h = n) and p(¬h_n) = 1 − p(h = n). It is easy to show that such an assignment of the probability estimators preserves the order in which diagnoses are generated.

    C_s = (c_2 ∨ i_2 ∨ o_1 ∨ ¬h_1) ∧ (c_2 ∨ i_1 ∨ o_2 ∨ ¬h_1) ∧ (c_1 ∨ o_1 ∨ ¬h_1) ∧
          (i_2 ∨ o_1 ∨ ¬h_2) ∧ (i_1 ∨ o_2 ∨ ¬h_2) ∧ (o_1 ∨ ¬h_3) ∧
          (c_1 ∨ c_2) ∧ (¬c_1 ∨ ¬c_2) ∧ (i_1 ∨ i_2) ∧ (¬i_1 ∨ ¬i_2) ∧ (o_1 ∨ o_2) ∧ (¬o_1 ∨ ¬o_2) ∧
          (¬h_1 ∨ ¬h_2) ∧ (¬h_1 ∨ ¬h_3) ∧ (¬h_1 ∨ ¬h_4) ∧ (¬h_2 ∨ ¬h_3) ∧ (¬h_2 ∨ ¬h_4) ∧ (¬h_3 ∨ ¬h_4)    (4)

A dense encoding of C results in a shorter representation, having the same number of clauses as the multi-valued CNF (here the health bits are such that h = 1, 2, 3, 4 correspond to ¬h_1¬h_2, ¬h_1 h_2, h_1¬h_2, and h_1 h_2, and the single bits c, i, o are true iff the respective variable equals 2):

    C_d = (h_1 ∨ h_2 ∨ c ∨ ¬i ∨ o) ∧ (h_1 ∨ h_2 ∨ c ∨ i ∨ ¬o) ∧ (h_1 ∨ h_2 ∨ ¬c ∨ ¬o) ∧
          (h_1 ∨ ¬h_2 ∨ ¬i ∨ o) ∧ (h_1 ∨ ¬h_2 ∨ i ∨ ¬o) ∧ (¬h_1 ∨ h_2 ∨ ¬o)    (5)

While in this example C_d is as compact as C, we have twice as many health variables, and we would need 8 instead of 4 consistency checks if we were to add an extra health state for h in (2). The dense encoding exposes another significant problem with representing the probability estimator, as the health variables h_1 and h_2 are not independent. A further disadvantage of the dense encoding is apparent when the number of states of a multi-valued variable v is not a power of 2 (i.e., k ≠ 2^n for n ∈ ℕ, where k is the number of states of v). In this case either additional constraints must be added, or 2^n − k states must use two Boolean encodings per state (such health encodings, however, would result in 2^n − k cases in which the same diagnosis is computed twice, necessitating the storing of the already generated diagnoses).

A Boolean A* diagnosis³ on the sparse-encoded model has to perform 2³ consistency checks on (4) for finding all diagnoses, while Algorithm 2 computes all of them with 3 checks only (in addition, multi-valued consistency checking is more efficient, as we have seen in Section 2). With the dense encoding we need 4 consistency checks, which is still less efficient than the multi-valued case. The performance difference shows even better when we are interested in the first (most likely) diagnosis only. In this case the multi-valued algorithm needs 2 consistency checks over a clausal set comprising 6 clauses; the sparse encoding allows computation of a leading diagnosis with 5 checks over 18 clauses; and the dense encoding does not facilitate an implementation of a heuristic function preserving the probability order. The number of clauses in C_s is 18 due to the extra inequality constraints, which additionally delays the Boolean reasoning. As we show in the next section, these differences translate to significant savings (in favor of the multi-valued approach) for bigger models.

4 Experimental Performance Evaluation

To demonstrate the improved performance of the multi-valued solver we compare it experimentally with the performance of a Boolean solver on sparse- and dense-encoded models. In these experiments we use the diagnosis time t as performance metric. This is the processing time required to generate N diagnoses for a given OBS. It is measured by the diagnosis engines as CPU time in milliseconds. All the experiments described in this paper were performed on a 3 GHz Pentium IV CPU.

The algorithms discussed in this section are implemented as part of the LYDIA model-based diagnosis toolkit⁴. The benchmark models and test vectors are available upon request. The multi-valued Polycell models, discussed below, were synthetically generated. Our experimentation could further benefit from the existence of a scalable multi-valued benchmark for model-based diagnosis.

For a multi-valued problem and its sparse and dense encodings, we denote t by t_m, t_s, and t_d, respectively. The speedups s_s and s_d of the multi-valued search over the Boolean sparse and dense encodings, respectively, are calculated as s_s = t_s/t_m and s_d = t_d/t_m. Let W denote the non-observable, non-health variables, W = V \ (OBS ∪ H).

³ Note that (4) can be considered as a multi-valued diagnosis problem with two values per variable.
⁴ The LYDIA package for model-based fault diagnosis can be downloaded from
We investigate t, s_s, and s_d in relation to the domain size |D_i| and the model complexity. As discussed in the valve example, |D_i| and the type of encoding determine the number of clauses and consistency checks, which affect t. Hence we investigate the relation between t and |D_i|. We use the well-known synthetic, integer-domain Polycell model [de Kleer and Williams, 1987], which allows for practical and meaningful variation of |D_i|. Let |D_i| = d_H for h_i ∈ H and |D_i| = d_W for w_i ∈ W. With the Polycell we perform experiments for d_H = 2, 3, …, 9 and d_W = 2, 3, …, 42, for nominal OBS, with N equal to the maximum number of consistent solutions K.

We perform similar performance experiments on real-world models to investigate whether the expected improved performance of the multi-valued approach also holds for practical applications. We use nine models of variations of ASML wafer-scanner subsystems and one of NASA's X-34 propulsion system. In contrast to Polycell, these models have varying values of |D_i| for H and W. Therefore we cannot investigate a one-to-one relationship between t and |D_i|. Instead, we investigate the dependencies between t and the model complexity. We use S, the number of all possible value assignments of H and W, as measure for this complexity. Let S_H = ∏_{h_i ∈ H} |D_i|, S_W = ∏_{w_i ∈ W} |D_i|, and S = S_H · S_W. Table 1 shows these numbers for all real-world models and variations.

    Model     S_H    S_W    S
    ASML1A    …      …      …E+05
    ASML1A    …      …      …E+08
    ASML1A    …      …      …E+12
    ASML1B    …      …      …E+07
    ASML1B    …      …      …E+14
    ASML1B    …      …      …E+21
    ASML2A    …      …      …E+14
    ASML2B    …      …      …E+25
    ASML2C    …      …      …E+36
    X-34      …      …      …E+24

Table 1: S_H, S_W, and S for 10 real-world models and variations.

We consider three scenarios, with OBS caused by nominal behavior, single faults, and double faults, for N = 1 and N = min(|H|, K), respectively. Because we do not currently have a probability heuristic in place for the dense encoding, its diagnoses are unsorted. Hence the comparison with the sorted multi-valued diagnosis for N < K is not always correct.
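Since S is a plain product of domain sizes, the magnitudes in Table 1 are easy to recompute. A short check with illustrative domain sizes (not the actual model data):

    # Sketch: the complexity measure S = S_H * S_W (domain sizes illustrative).
    from math import prod

    S_H = prod([4, 4, 4])          # |D_i| for h_i in H
    S_W = prod([2] * 20)           # |D_i| for w_i in W
    print(S_H, S_W, f"{S_H * S_W:.2E}")   # 64 1048576 6.71E+07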

For the Polycell model, Figure 1 shows a logarithmic plot of t against d_W with d_H = 2, for all encodings. For d_W ≤ 42 we observe s_s ≤ 6 and s_d ≈ 1, the latter indicating similar performance. The downward spikes in the dense-encoding plot are due to the efficiency of the dense encoding for log₂ d_W ∈ ℕ. Experiments performed with non-nominal or fewer observations show similar results.

Figure 1: Time (in ms, logarithmic scale) for computing all diagnoses for Polycell models with d_H = 2 and a nominal observation vs. variable d_W (t_s, t_d, t_m).

Similarly, Figure 2 shows the results for d_H = 3. For d_W ≤ 42, s_s ≤ 26 and s_d ≤ 3. Thus, except for the spikes at d_W = 16 and d_W = 32, the multi-valued approach now also clearly outperforms the dense encoding. This is due to the inefficiency of dense encodings for log₂ d_H ∉ ℕ.

Figure 2: Time (in ms, logarithmic scale) for computing all diagnoses for Polycell models with d_H = 3 and a nominal observation vs. variable d_W (t_s, t_d, t_m).

Figure 3 shows a double logarithmic plot of t against d_H. We omit the sparse encoding because t_s becomes prohibitively large for practical experiments when d_H > 3. The increase of t_d is exponential in ⌈log₂ d_H⌉, hence the staircase shape of this plot. Let γ_d(d_H) be the increase per staircase step:

    γ_d(d_H) = t_d(d_H : ⌈log₂ d_H⌉ = k) / t_d(d_H : ⌈log₂ d_H⌉ = k − 1),  k = 2, 3, 4    (6)

For k = 2, 3, 4 we measure γ_d(d_H) ≈ 17, 21, and 31, respectively. The straight line for the multi-valued encoding agrees with t ∝ d_H⁵. The plot shows that for d_H < 9 and log₂ d_H ∈ ℕ we have s_d < 1, i.e., better performance for the dense encoding. As t_m and t_d become prohibitively large for experiments when d_H > 9, we extrapolate their relations with d_H to compute the speedup:

    s_d = γ_d(d_H)^⌈log₂ d_H⌉ / d_H⁵    (7)

If, e.g., we take a conservative approach and assume γ_d(d_H) = 31, then 0.8 ≤ s_d ≤ 23.9 for d_H ≤ 256. If, as the experimental results suggest, γ_d(d_H) continues to increase, then for γ_d(d_H) > 2⁵ we have s_d > 1, and the speedup of multi-valued over dense encoding has no upper bound. Thus the multi-valued encoding also achieves unbounded speedup over dense encodings.

Figure 3: Double logarithmic plot of the time for computing all diagnoses for multi-valued and dense-encoded Polycell models with d_W = 42 and a nominal observation vs. variable d_H (t_d, t_m).
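The extrapolated bounds quoted above follow directly from (7); a quick numeric check under the stated assumption γ_d(d_H) = 31:

    # Sketch: evaluating the extrapolation (7) for gamma_d(d_H) = 31.
    from math import ceil, log2

    def s_d(d_h, gamma=31.0):
        return gamma ** ceil(log2(d_h)) / d_h ** 5

    print(round(s_d(256), 2))   # ~0.78, the lower end of the reported range
    print(round(s_d(129), 1))   # ~23.9, the upper end for d_H <= 256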
For the real-world models, Figure 4 shows the double logarithmic plot of t vs. S for N = 1 for the multi-valued solver, for nominal, single, and double faults. As expected, it shows an increasing trend of t with S and with increased fault cardinality. Figure 5 shows s_s and s_d. Note that data points above the thick line s = 1 indicate an actual speedup of the multi-valued approach. In the nominal case s_s < 1, which is caused by a larger overhead of the multi-valued approach for N = 1. The effect of this overhead disappears for single and double faults, where s_s > 10¹. Because of the omitted probability heuristic mentioned earlier, the analysis of s_d is less straightforward; we note that in roughly half the cases the speedup is similar to that over the sparse encoding.

Figures 6 and 7 show t and s for N = min(|H|, K). As N ≫ 1, the initial overhead of the multi-valued approach is amortized over the multiple diagnoses, resulting in 10¹ < s_s < 10⁴ for nominal, single, and double faults. For s_d, it is interesting to see that the effect of the omitted probability heuristic is most apparent for the double faults. This can be explained by the fact that in these cases the solver generates many-fault solutions of low probability faster than the multi-valued solver generates true double-fault solutions. The latter have higher t because of more constrained consistency checking.

Figure 4: Double logarithmic plot of diagnosis time (t_m, in ms) with N = 1 and observations consistent with different fault cardinalities vs. variable model complexity S.

Figure 5: Double logarithmic plot of speedup (s_s, s_d) with N = 1 and observations consistent with nominal health (F = 0), single faults (F = 1), and double faults (F = 2) vs. variable model complexity S.

Figure 6: Double logarithmic plot of diagnosis time (t_m, in ms) with N = min(|H|, K) and observations consistent with different fault cardinalities vs. variable model complexity S.

Figure 7: Double logarithmic plot of speedup (s_s, s_d) with N = min(|H|, K) and observations consistent with nominal health (F = 0), single faults (F = 1), and double faults (F = 2) vs. variable model complexity S.

In summary of the experiments, the speedup of the multi-valued approach over the sparse encoding is demonstrated clearly, both in relation to the domain size and to the number of states. Speedups of 10² are readily achieved. The same conclusion holds for dense encodings as far as the domain-size experiments are concerned; especially for larger domain sizes the speedup is considerable. For smaller domain sizes, closer proximity of the domain size to 2^i, i ∈ ℕ, means better performance for the dense encodings. Because of the lacking probability heuristic, the relation of s_d to S remains somewhat unclear, despite the fact that s_d > 5 in many cases.

5 Conclusion

This paper introduces a new algorithm for computing diagnoses which works directly on the multi-valued representation of a model. The sound and complete algorithm comprises a DPLL-based multi-valued consistency check and a multi-valued A* search. The two routines eliminate the need for model encodings, thus preventing the loss of locality information which often leads to performance degradation. In contrast to the dense Boolean encoding, the multi-valued A* algorithm allows an intuitive assignment and interpretation of a-priori probabilities, which, combined with the greedy A* search described in this paper, allows precise control over the termination criteria of the diagnostic computation. This permits generating only those leading diagnoses that contain significant probability mass (thus turning the search into an incomplete one). While the sparse Boolean encoding is more suitable than the dense one for heuristic search based on a-priori probability, the combinatorial explosion caused by the introduction of new variables makes it suitable for the smallest diagnosis problems only.

We have empirically compared the performance of the algorithm to sparse and dense Boolean encodings. These experimental results confirm our analysis and show that the multi-valued search outperforms both types of encoding, in particular being two orders of magnitude faster than the sparse Boolean

encoding. While the dense encoding is faster than the sparse one (but still slower than the multi-valued approach), it is less amenable to heuristics based on a-priori health probability estimation, which are widely used in model-based diagnosis. In future work we aim to address this estimator problem, allowing the three methods to be compared not only when generating all diagnoses but also when computing the first N leading diagnoses. In that case we would be able to analyze how the probability assignment influences the diagnostic search. We also note the need for representative multi-valued benchmarks, which would strengthen the experimental results of this paper.

Furthermore, we envision our algorithm being used in combination with other reasoning methods which improve performance by exploiting locality. Analysis [Ramesh et al., 1997] and experience have shown that the higher-level the reasoning engine (by higher-level we mean a diagnostic engine which uses a model representation closer to the original model), the faster the performance we get. In the future, we would be interested in combining, e.g., non-clausal search methods, hierarchical search [Feldman et al., 2005], and the multi-valued approach described here for more efficient model-based diagnosis and related diagnostic reasoning.

Acknowledgments

We extend our gratitude to the anonymous reviewers for their valuable feedback.

References

[Bandelj et al., 2002] A. Bandelj, I. Bratko, and D. Šuc. Qualitative simulation with CLP. In Proc. QR'02, 2002.
[Darwiche, 2004] Adnan Darwiche. New advances in compiling CNF into DNNF. In Proc. ECAI'04, 2004.
[de Kleer and Williams, 1987] J. de Kleer and B. Williams. Diagnosing multiple faults. JAI, 32(1):97–130, 1987.
[de Kleer, 1989] J. de Kleer. A comparison of ATMS and CSP techniques. In Proc. IJCAI'89, 1989.
[Fattah and Dechter, 1995] Yousri El Fattah and Rina Dechter. Diagnosing tree-decomposable circuits. In Proc. IJCAI'95, 1995.
[Feldman et al., 2005] A. Feldman, A.J.C. van Gemund, and A. Bos. A hybrid approach to hierarchical fault diagnosis. In Proc. DX'05, 2005.
[Frisch and Peugniez, 2001] Alan Frisch and T. Peugniez. Solving non-Boolean satisfiability problems with stochastic local search. In Proc. IJCAI'01, 2001.
[Frohlich and Nejdl, 1997] Peter Frohlich and Wolfgang Nejdl. A static model-based engine for model-based reasoning. In Proc. IJCAI'97, 1997.
[Hoos, 1999] H. Hoos. SAT-encodings, search space structure, and local search performance. In Proc. IJCAI'99, 1999.
[Liu et al., 2003] C. Liu, A. Kuehlmann, and M. Moskewicz. CAMA: A multi-valued satisfiability solver. In Proc. ICCAD'03, 2003.
[Nayak and Williams, 1998] P. Pandurang Nayak and Brian Williams. Fast context switching in real-time propositional reasoning. In Proc. AAAI'98, pages 50–56, 1998.
[Parthasarathy et al., 2004] G. Parthasarathy, M. Iyer, K.-T. Cheng, and L. Wang. An efficient finite-domain constraint solver for circuits. In Proc. DAC'04, 2004.
[Provan, 2001] G. Provan. Hierarchical model-based diagnosis. In Proc. DX'01, 2001.
[Ramesh et al., 1997] A. Ramesh, G. Becker, and N. Murray. CNF and DNF considered harmful for computing prime implicants/implicates. JAR, 18(3), 1997.
[Sachenbacher and Williams, 2004] Martin Sachenbacher and Brian Williams. Diagnosis as semiring-based constraint optimization. In Proc. ECAI'04, 2004.
[Sgarlata and Winters, 1997] P. Sgarlata and B. Winters. X-34 propulsion system design. In Proc.
AIAA Joint Propulsion Conference and Exhibit, 1997.
[Srinivasan et al., 1990] Arvind Srinivasan, Timothy Kam, Sharad Malik, and Robert Brayton. Algorithms for discrete function manipulation. In Proc. ICCAD'90, pages 92–95, 1990.
[Stumptner and Wotawa, 2001] Markus Stumptner and Franz Wotawa. Diagnosing tree-structured systems. JAI, 127(1):1–29, 2001.
[Stumptner and Wotawa, 2003] Markus Stumptner and Franz Wotawa. Coupling CSP decomposition methods and diagnosis algorithms for tree-structured systems. In Proc. IJCAI'03, 2003.
[Vatan, 2002] F. Vatan. The complexity of the diagnosis problem. Technical Report NPO-30315, Jet Propulsion Laboratory, California Institute of Technology, 2002.
[Williams and Ragno, 2004] Brian Williams and R. Ragno. Conflict-directed A* and its role in model-based embedded systems. JDAM, 2004.

A general method for diagnosing axioms

Gerhard Friedrich, Stefan Rass, and Kostyantyn Shchekotykhin
Universitaet Klagenfurt, Universitaetsstrasse 65, 9020 Klagenfurt, Austria, Europe

Abstract

Full support for debugging knowledge bases must not stop at the level of axioms. In this work we present a general theory for diagnosing faulty knowledge bases which not only allows the identification of faulty axioms, but also the pinpointing of those parts of axioms which must be changed. Based on our theory, we present methods for computing these diagnoses and show their feasibility by extensive test evaluations. The proposed approach is broadly applicable, since it is independent of a particular logical language (with monotonic semantics) and independent of a particular reasoning system.

1 Introduction

A broad adoption of knowledge-based systems requires effective methods to support the test-and-debug cycle of knowledge bases. In this cycle the knowledge engineer has to diagnose the knowledge base (KB) in order to identify those parts which must be changed such that the intended behavior is achieved. This task becomes challenging even in moderately sized knowledge bases with hundreds of axioms. Therefore, considerable research effort [Schlobach and Cornet, 2003; Parsia et al., 2005; Friedrich and Shchekotykhin, 2005; Haase et al., 2005; Wang et al., 2005] has been put into the improvement of debugging. All these approaches have in common that they either are not based on a well-founded diagnosis theory or consider an axiom as the smallest entity which could be faulty. Consequently, no general diagnosis approach exists that identifies those parts of axioms which must be changed such that all tests and requirements for a KB are fulfilled.

Diagnosing axioms becomes especially important in knowledge bases where axioms are of remarkable size; e.g., the Galen ontology¹ comprises axioms with more than 100 arguments in a single logical operator. In this case debugging remains difficult even if diagnosis provides a set of axioms that need further investigation by the knowledge engineer.

Consequently, an important research question is whether it is possible to develop a general theory and practically applicable algorithms for the diagnosis of knowledge bases which improve the resolution of diagnoses by identifying the faulty parts of axioms. Such an algorithm could be very supportive as an interactive debugging tool for knowledge engineers who want to inspect diagnoses and axioms at a more fine-grained resolution level.

We address this question by developing a general theory for the diagnosis of knowledge bases which are expressed in a declarative knowledge representation language. This theory allows the identification not only of faulty axioms, but also of the faulty parts of axioms. Furthermore, this theory is applicable to all declarative knowledge representation languages which are based on a variant of first-order logic (FOL), e.g., description logic (DL) or the OWL language family, which plays an important role in the implementation of the Semantic Web. We will show that this theory is a generalization of the diagnosis theory of [Friedrich and Shchekotykhin, 2005], which on the one hand provides a general theory for diagnosing knowledge bases, but on the other hand considers just axioms as the finest granularity of diagnoses.

¹ A test version is included in a benchmark suite for description logic, e.g. see RACER's version at
For the implementation, we employ a transformation of each axiom into a set of axioms such that the original diagnosis algorithms presented in [Friedrich and Shchekotykhin, 2005] can be applied. As a consequence, this approach provides a sound and complete method for the generation of diagnoses for axioms, independent of a particular reasoning system. Therefore we are not limited to special properties of the knowledge bases (e.g., acyclicity) or restrictions of the representation language. We present enhancements of this algorithm which lead to considerable improvements of the running time for the diagnosis of axioms. Finally, we show the feasibility of our methods by exploiting a standard test library.

In the following section, we present our basic idea for diagnosing axioms. In Section 3, we introduce a general theory for diagnosing axioms and relate it to the existing theory of diagnosing knowledge bases. The foundation for computing axiom diagnoses is given in Section 4, followed by an evaluation in Section 5. The paper closes with a discussion of related work and final conclusions.

2 Limitations of the current approach

For the introduction of our concepts, we consider the DL knowledge base² bike2.tkb, available through the benchmark suite for the RACER system. This KB comprises 154 axioms. Let us assume that in the knowledge acquisition process one of the axioms was stated incorrectly. In our exemplification, the axiom defining concept C13 is incorrectly specified, i.e., in the depicted Axiom 66 the correct expression (SOME R11 C75) has been replaced by the faulty expression (ALL R11 C75):

    [66:] (DEFINE-CONCEPT C13
            (AND (SOME R22 C63) (SOME R11 C74)
                 (ALL R11 C75) (AT-MOST 3 R11)
                 (AT-LEAST 2 R11) (SOME R14 *TOP*)
                 (SOME R30 *TOP*) (AT-LEAST 2 R19)
                 (SOME R4 *TOP*) (SOME R23 *TOP*)
                 (SOME R2 *TOP*)))

As a consequence, the KB bike2.tkb becomes incoherent because of Axiom 66 and the axioms

    [145:] (IMPLIES C74 (NOT C75))
    [146:] (IMPLIES C75 (NOT C74))

A knowledge base is incoherent iff there exists a concept or role which is incoherent. A concept or role is incoherent iff it has an empty extension in all models. In our example, C13 is incoherent in bike2.tkb. The axiom sets {66, 145} and {66, 146} are the minimal incoherent subsets of the example KB. Such sets are called minimal conflicts in the terminology of model-based diagnosis [Friedrich and Shchekotykhin, 2005]. As expected, the diagnosis engine of [Friedrich and Shchekotykhin, 2005] correctly returns Axiom 66 as the only single-fault diagnosis.

More formally, a KB-Diagnosis Problem is defined as follows.

Definition 1 (KB-Diagnosis Problem) A KB-Diagnosis Problem (Diagnosis Problem for a Knowledge Base, [Friedrich and Shchekotykhin, 2005]) is a tuple (KB, B, TC⁺, TC⁻), where KB is a knowledge base, B is a background theory, TC⁺ is a set of positive and TC⁻ a set of negative test cases, with which the KB has to be consistent or inconsistent, respectively. The test cases are given as sets of logical sentences. We assume that each test case and the background theory are consistent on their own.

The principal idea of the following definition of a diagnosis for a KB is to find a set of axioms that must be changed (respectively deleted) and, possibly, some axioms that must be added, such that all test cases are satisfied. The symbol ⊥ expresses a contradiction.

Definition 2 (KB-Diagnosis) [Friedrich and Shchekotykhin, 2005] A KB-Diagnosis for a KB-Diagnosis Problem (KB, B, TC⁺, TC⁻) is a set D ⊆ KB of sentences such that there exists an extension EX, where EX is a set of logical sentences added to the knowledge base, such that

1. ∀e⁺ ∈ TC⁺ : (KB − D) ∪ B ∪ EX ∪ e⁺ ⊭ ⊥
2. ∀e⁻ ∈ TC⁻ : (KB − D) ∪ B ∪ EX ∪ e⁻ ⊨ ⊥

A minimal diagnosis is a diagnosis such that no proper subset is a diagnosis. A minimum cardinality diagnosis is a diagnosis such that there exists no diagnosis with smaller cardinality.

According to [Friedrich and Shchekotykhin, 2005, Corollary 1], we may characterize EX by the conjunction of all negated negative test cases. In particular, D is a diagnosis for (KB, B, TC⁺, TC⁻) iff ∀e⁺ ∈ TC⁺ : (KB − D) ∪ B ∪ e⁺ ∪ {¬(⋀ e⁻) : e⁻ ∈ TC⁻} is consistent. In order to keep the example simple, we do not specify a background theory or test cases, but we require coherence and consistency.

² We assume the reader to be familiar with the basics of description logic; [Baader et al., 2003] provides an excellent introduction. In addition, we would like to stress that using DL knowledge bases as examples does not imply any limitation of our approach to DL. We use DL because of its importance for the Semantic Web.
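Definition 2, via the characterization of Corollary 1 above, reduces the diagnosis test to a series of consistency checks. The skeletal Python sketch below is our own illustration: the sentence representation is opaque, and is_consistent stands in for a reasoner such as RACER:

    # Skeletal sketch of the KB-Diagnosis check (Definition 2 / Corollary 1).
    def is_kb_diagnosis(kb, b, tc_pos, tc_neg, d, is_consistent):
        base = (set(kb) - set(d)) | set(b) | {("not", e) for e in tc_neg}
        checks = tc_pos if tc_pos else [set()]   # no positive test cases: one check
        return all(is_consistent(base | set(e_pos)) for e_pos in checks)

    def minimal(diagnoses):
        """Keep only the subset-minimal diagnoses."""
        return [d for d in diagnoses
                if not any(set(d2) < set(d) for d2 in diagnoses)]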
Presenting Axiom 66 as the only minimal single-fault diagnosis does not provide information about the parts of Axiom 66 which caused the incoherence. In particular, we would like to identify those parts of a faulty axiom which must be changed such that the requirements (e.g., coherence and compliance with test cases) are fulfilled. In order to achieve this, we base our principal idea on the observation that axioms are composed of structures according to a predefined grammar. E.g., Axiom 66 consists of an AND-structure with 11 arguments, and each argument is a structure itself. In general, such structures are defined by a grammar whose terminal symbols are literals.

Exploiting this observation, we recognize various possibilities for restoring coherence: either the AND-structure must be changed (e.g., by deleting the second argument of the AND-structure), or one of its arguments. Analyzing these arguments, we recognize that only the arguments (SOME R11 C74) and (ALL R11 C75) are relevant for producing an incoherent KB. Changes to arguments or operators of these structures will resolve the incoherence, e.g., replacing the ALL operator by a SOME operator, or changing the names of concepts or roles. As a consequence, we can pinpoint the parts of the axiom which must be changed, thus exonerating the greater part of Axiom 66. In the following, we generalize the ideas presented in this example.

3 Diagnosis of axioms

The refinement of a KB-Diagnosis is based on the structure of the axioms according to the underlying syntax of the knowledge representation language. For our methods, we assume that the syntax of the language is defined by a context-free grammar G. In particular, we assume the usual structuring of logic-based languages, expressed by production rules according to the following prototypical rule:

    V₀ → op(V₁, …, Vₙ {, Nⱼ}*)

where V₀ is a non-terminal, and the right-hand side is a logical structure defined by a logical operator op and arguments V₁, …, Vₙ, which are logical structures. {, Nⱼ}* denotes a possibly empty list of non-logical arguments, i.e., numbers.

Logical structures correspond to literals or can otherwise be recursively composed by applying operators to simpler structures. Consequently, a logical structure is either an operator application or a literal. Context-free grammars may also have a single non-terminal symbol on the right-hand side (V₀ → Vₘ). In this case the syntax tree [Linz, 1996] comprises intermediate nodes, which can be replaced by the successor of the node. Therefore, we assume the following structuring of axioms, where L is a logical structure and LI is a literal:

    L → op(L, …, L {, Nⱼ}*) | LI

Note that op depends on the language. In the case of FOL, op corresponds to the usual logical connectives and to quantifications of variables (e.g., ∀x). In DL, however, op corresponds to one of the logical operators defined in specific DL variants (e.g., ALL, SOME, AND, OR). Furthermore, a LISP-like notation may be chosen. A simple grammar used for DL within the RACER system can be found in [Patel-Schneider and Swartout, 1993], e.g.,

    C₀ → CN | (AND C₁ … Cₙ) | (ALL R C₀) | …

where CN is an atomic concept (i.e., a literal), the C_i are concepts, and R is a role. In addition to the logical arguments V_i of an operator, a language may exploit non-logical arguments N_j which modify the meaning of the operator; e.g., a DL could comprise the (AT-LEAST N R) operator, where N is a natural number and R is a role. Viewed as a first-order logic statement, a logical structure is introduced which depends on N.

Regarding the meaning, we make the usual assumption that the semantics of logical structures is given denotationally using an interpretation I = ⟨Δ^I, (·)^I⟩, where Δ^I is a domain (non-empty universe) of values and (·)^I is an interpretation function. Literals are interpreted as subsets of Δ^I or relations over Δ^I, and the semantics is defined inductively. Since we are interested in a general theory for diagnosing logical knowledge bases, we constrain the semantics as little as possible. In order to deal with variable symbols, we have to define a partial function µ which provides a substitution of some variables by elements of Δ^I. We assume that the semantics of op is defined by

    op(V₁, …, Vₙ {, Nⱼ}*)^{I,µ} := ôp(V₁^{I,µ}, …, Vₙ^{I,µ} {, Nⱼ}*)

where ôp is a partial function that maps V₁^{I,µ}, …, Vₙ^{I,µ} {, Nⱼ}* to a value, depending on the logic defined — e.g., to truth values in case op is a logical connective of FOL, or to subsets of Δ^I in case op is an operator of a DL. An axiom is satisfied by I, µ if it is true. Note that non-logical arguments are not interpreted by I; e.g., the semantics of (AT-LEAST N R)^I is defined as {a ∈ Δ^I : |{b : (a, b) ∈ R^I}| ≥ N}, where N is a non-logical argument. Finally, we assume a monotonic semantics of the logic.

As a consequence, every axiom can be represented by its syntax tree (cf. Figure 1). Every operator, literal, and non-logical argument of an axiom is represented by a node. There is a directed arc n₁ → n₂ from node n₁ to node n₂ iff n₂ is an argument of n₁; we then say that n₁ is the predecessor of n₂. More generally, we can view the whole KB as one tree (called the KB-tree), where the first level (root node) is the start symbol of the grammar, and the second level corresponds to the operators (logical structures) defining axioms (e.g., DEFINE-CONCEPT). A complete subtree rooted at a node in the KB-tree defines a logical expression. An axiom is a logical expression itself.

Figure 1: Syntax tree for our example axiom. Arcs represent relationships between operators and their corresponding arguments.
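The syntax tree and the markers used in the next paragraphs are straightforward to build for the LISP-like RACER syntax. The following Python sketch is our own illustration (the flat pre-order numbering mimics the marking shown below); it parses an axiom and enumerates marker/structure pairs:

    # Sketch: parse a LISP-style axiom and assign markers in pre-order.
    def tokenize(text):
        return text.replace("(", " ( ").replace(")", " ) ").split()

    def parse(tokens):
        tok = tokens.pop(0)
        if tok != "(":
            return tok                      # an operator name or literal
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)                       # drop ")"
        return node                         # [op, arg_1, arg_2, ...]

    def mark(node, axiom_id, counter=None):
        """Yield (marker, structure) pairs in pre-order."""
        counter = counter if counter is not None else iter(range(10**6))
        if isinstance(node, list):
            yield f"{axiom_id}.{next(counter)}", node[0]    # the operator
            for arg in node[1:]:
                yield from mark(arg, axiom_id, counter)
        else:
            yield f"{axiom_id}.{next(counter)}", node       # a literal

    ax = parse(tokenize("(DEFINE-CONCEPT C13 (AND (SOME R22 C63) (SOME R11 C74)))"))
    print(list(mark(ax, 66))[:5])
    # [('66.0', 'DEFINE-CONCEPT'), ('66.1', 'C13'), ('66.2', 'AND'), ...]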
Based on this representation, we can generalize the concept of a KB-Diagnosis in order to identify faulty logical structures. For this generalization, let every logical structure L (which is represented either by an operator or by a literal) of a KB and its associated logical expression E be uniquely identified by a marker M(L), respectively M(E). The following example shows a possible marking of Axiom 66:

    (DEFINE-CONCEPT_66.0 C13_66.1
      (AND_66.2 (SOME_66.3 R22_66.4 C63_66.5)
                (SOME_66.6 R11_66.7 C74_66.8) …))

Furthermore, we exploit a replacement operator KB[L/R], where the logical structure L is replaced by a logical expression R in the knowledge base KB. Replacements of logical structures are regarded as repairs for faulty descriptions; therefore we only consider syntactically valid replacements of a structure. In our example, some of these syntactically valid replacements for AND_66.2 would be the change to an OR operator, the addition of a NOT before the AND, or the introduction of a new operator (e.g., an OR) combining some of the arguments. KB[L₁/R₁, …, Lₙ/Rₙ] denotes the simultaneous replacement of L₁, …, Lₙ by R₁, …, Rₙ.

The principal idea of the following definition is to identify those logical structures which are the cause for not satisfying the requirements (e.g., failed test cases or an incoherent KB). A set of logical structures is a cause for not satisfying the requirements iff there exists a replacement of these structures (a possible repair) such that all requirements are satisfied. More formally:

Definition 3 (AX-Diagnosis) Let LS be the set of logical structures of the knowledge base KB. An AX-Diagnosis for a KB-Diagnosis Problem (KB, B, TC⁺, TC⁻) is a set D = {L₁, …, Lₙ} ⊆ LS such that there exist syntactically valid replacements R_i for each logical structure L_i ∈ D and an extension EX, where EX is a set of logical sentences added to the knowledge base, such that

1. ∀e⁺ ∈ TC⁺ : KB[L₁/R₁, …, Lₙ/Rₙ] ∪ B ∪ EX ∪ e⁺ ⊭ ⊥

2. ∀e⁻ ∈ TC⁻ : KB[L₁/R₁, …, Lₙ/Rₙ] ∪ B ∪ EX ∪ e⁻ ⊨ ⊥

We say that such a replacement KB[L₁/R₁, …, Lₙ/Rₙ] clears (or repairs) the fault. A minimal AX-Diagnosis D is defined as usual by requiring that no subset D′ ⊂ D is an AX-Diagnosis. Likewise, D is a minimum cardinality AX-Diagnosis if there exists no AX-Diagnosis with smaller cardinality than D.

In our running example, there exist eight minimum cardinality (single-fault) AX-Diagnoses, namely the marked structures DEFINE-CONCEPT_66.0, AND_66.2, SOME_66.6, R11_66.7, C74_66.8, ALL_66.9, R11_66.10, and C75_66.11:

    (DEFINE-CONCEPT_66.0 C13
      (AND_66.2 (SOME R22 C63)
                (SOME_66.6 R11_66.7 C74_66.8)
                (ALL_66.9 R11_66.10 C75_66.11)
                (AT-MOST 3 R11) (AT-LEAST 2 R11)
                (SOME R14 *TOP*) (SOME R30 *TOP*)
                (AT-LEAST 2 R19) (SOME R4 *TOP*)
                (SOME R23 *TOP*) (SOME R2 *TOP*)))

Note that each of these structures can in fact be replaced in order to restore the coherence of bike2: e.g., in AND_66.2 the second and third arguments could be replaced by a new logical structure combining them by an OR; the concepts and roles C74_66.8, C75_66.11, R11_66.7, and R11_66.10 could be replaced by other concepts and roles not mentioned in bike2; ALL_66.9 could be replaced by SOME; and SOME_66.6 could be preceded by a NOT. Finally, with respect to a replacement of DEFINE-CONCEPT_66.0, a complete deletion of Axiom 66 (the axiom may be outdated) is a possible repair. Replacements of logical structures in the example axiom which are not one of the eight single-fault diagnoses do not restore the coherence of bike2. More generally, logical structures that are not contained in minimal diagnoses need not be considered for replacement. Consequently, a large fraction of the example axiom does not require further investigation by the knowledge engineer.

Note that if we focus the diagnosis process on minimum cardinality AX-Diagnoses, then the exoneration of logical structures can be extended. Let N be the cardinality of the minimum cardinality diagnoses. Then any replacement of a logical structure not contained in a minimum cardinality diagnosis can only resolve the fault if at least N + 1 repairs are performed in total.

As is generally the case in diagnosis, additional information such as test cases and extensions to the background theory could be exploited to reduce the number of most likely diagnoses (e.g., the number of minimum cardinality diagnoses). In addition, one could imagine a more sophisticated estimation of the likelihood of AX-Diagnoses. However, since we focus on the foundations of diagnosing axioms, both tasks are outside the scope of this work.

In case the knowledge engineer wants to change a logical structure in addition to those contained in a minimal AX-Diagnosis, no additional changes are necessary, because of the following property:

Remark 1 Every superset of a minimal AX-Diagnosis for a KB-Diagnosis Problem is an AX-Diagnosis.

Since a replacement of a logical structure by a logical expression may also include changes in the arguments, the concept of AX-Diagnosis shares some similarities with hierarchical diagnosis:

Remark 2 Let D = {L₁, …, L_i, …, Lₙ} be an AX-Diagnosis for a KB-Diagnosis Problem (KB, B, TC⁺, TC⁻), and let the logical structure L_i′ be a predecessor of L_i in the KB-tree. Then {L₁, …, L_i′, …, Lₙ} (i.e., D with L_i replaced by L_i′) is an AX-Diagnosis.

Roughly speaking, if we regard logical arguments as subcomponents, then the previous remark says that a faulty subcomponent also implies a faulty supercomponent. However, the converse is not necessarily true. This converse is usually assumed in hierarchical diagnosis, i.e.,
if a supercomponent is faulty, then at least one of its subcomponents is faulty. An AX-Diagnosis {L₁, …, L_i′, …, Lₙ} might contain a logical structure L_i′ for which there does not exist an AX-Diagnosis {L₁, …, L_i, …, Lₙ} where L_i′ is replaced by a successor L_i w.r.t. the KB-tree; e.g., operators could be defined wrongly, leading to an inconsistency independently of their logical arguments. In the following section, we show the basic methods for the computation of AX-Diagnoses.

4 Computation of Axiom Diagnoses

The principal idea of diagnosing axioms is to translate an axiom into a set of axioms, which allows the application of the diagnosis methods introduced in [Friedrich and Shchekotykhin, 2005]. For the translation, we assume that the logic contains the equivalence operator ≡ (which may be simulated by exploiting implication and conjunction). More formally, we require (V₁ ≡ V₂)^{I,µ} := (V₁^{I,µ} = V₂^{I,µ}). We apply this operator in order to decompose logical expressions. Let E_i^{y_i} be logical expressions with free variables y_i, and let X_i(y_i) be a unique literal with the variables y_i as arguments. A logical expression op(E₁^{y₁}, …, Eₙ^{yₙ} {, Nⱼ}*) is replaced by op(X₁(y₁), …, Xₙ(yₙ) {, Nⱼ}*) and a set of additional axioms {X₁(y₁) ≡ E₁^{y₁}, …, Xₙ(yₙ) ≡ Eₙ^{yₙ}}. A complete decomposition of our sample Axiom 66 would be the following:

    [66.0:]  (DEFINE-CONCEPT X_66.1 X_66.2)
    [66.1:]  (EQUIVALENT X_66.1 C13)
    [66.2:]  (EQUIVALENT X_66.2 (AND X_66.3 X_66.6 …))
    [66.3:]  (EQUIVALENT X_66.3 (SOME X_66.4 X_66.5))
    [66.4:]  (ROLES-EQUIVALENT X_66.4 R22)
    [66.5:]  (EQUIVALENT X_66.5 C63)
    [66.6:]  (EQUIVALENT X_66.6 (SOME X_66.7 X_66.8))
    …
    [66.32:] (EQUIVALENT X_66.32 *TOP*)

This transformation preserves the interpretation of the original logical expression.

Remark 3 Let the logical expression op(E₁^{y₁}, …, Eₙ^{yₙ} {, Nⱼ}*) be decomposed as described above, let I = ⟨Δ^I, (·)^I⟩ be an interpretation, and let µ be a variable substitution for the free variables. Then op(X₁(y₁), …, Xₙ(yₙ) {, Nⱼ}*)^{I,µ} = op(E₁^{y₁}, …, Eₙ^{yₙ} {, Nⱼ}*)^{I,µ}.
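The decomposition itself is a short recursion over the syntax tree: every compound sub-expression is named by a fresh literal and defined by an EQUIVALENT axiom. The Python sketch below is a simplified illustration of ours (the hierarchical markers are our own convention, and the free-variable and concept/role bookkeeping of the full DECOMP function in Figure 2 is omitted):

    # Simplified sketch of the recursive decomposition on a parsed axiom tree.
    def decomp(node, marker, out):
        """Return the literal naming `node`; append defining axioms to `out`."""
        if not isinstance(node, list):                  # a literal
            name = f"X_{marker}"
            out.append((marker, ["EQUIVALENT", name, node]))
            return name
        op, args = node[0], node[1:]
        names = [decomp(a, f"{marker}.{i}", out) for i, a in enumerate(args, 1)]
        name = f"X_{marker}"
        out.append((marker, ["EQUIVALENT", name, [op] + names]))
        return name

    axioms = []
    decomp(["AND", ["SOME", "R22", "C63"], "C75"], "66.2", axioms)
    for m, a in axioms:
        print(m, a)     # defining axioms, innermost sub-expressions first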

    FUNCTION DECOMP(E): returns a set of axioms
    Input: a logical expression E
    if E is a literal with free variables y then
      if E is an axiom then
        return {[M(E):] E}
      else
        return {[M(E):] X_M(E)(y) ≡ E}
    else
      let E = op(E₁, …, Eₙ {, Nⱼ}*), where y_i are the free variables of E_i
        and y = ⋃_i y_i are the free variables of E
      if E is an axiom then
        return {[M(E):] op(X_M(E₁)(y₁), …, X_M(Eₙ)(yₙ) {, Nⱼ}*)} ∪ ⋃_{i=1,…,n} DECOMP(E_i)
      else
        return {[M(E):] X_M(E)(y) ≡ op(X_M(E₁)(y₁), …, X_M(Eₙ)(yₙ) {, Nⱼ}*)} ∪ ⋃_{i=1,…,n} DECOMP(E_i)

Figure 2: Decomposition of an axiom.

The complete decomposition of an axiom and all its sub-expressions can be performed by the recursive function depicted in Figure 2. We exploit the markers of the logical expressions E of a KB to mark the corresponding axioms by [M(E):]. Based on a decomposition of the axioms of a KB, we can show that there is a one-to-one correspondence between the KB-Diagnoses of the decomposed knowledge base and the AX-Diagnoses of the original one. For this correspondence, we make the reasonable assumption that for every logical structure there is a syntactically valid replacement by a unique literal. This assumption holds for FOL and DL. Following Definition 3, we use the newly introduced literals as replacements R_i for the logical sub-structures L_i of the original axioms in KB. Let every axiom Ax of the decomposition be identified by M(Ax), and let DECOMP(KB) = ⋃_{Ax ∈ KB} DECOMP(Ax) be the decomposition of the knowledge base KB.

Proposition 1 Provided the knowledge representation language allows for every logical structure a syntactically valid replacement by a literal, D_AX = {L₁, …, Lₙ} is an AX-Diagnosis for the KB-Diagnosis Problem (KB, B, TC⁺, TC⁻) iff D_KB = {M(L₁), …, M(Lₙ)} is a KB-Diagnosis for the KB-Diagnosis Problem (DECOMP(KB), B, TC⁺, TC⁻).

Note that the inverse transformation of the decomposition can be obtained by back-substituting the equivalent expressions for each newly introduced literal. For our example, the original axiom arises by substituting the expressions for X_66.1 (which is C13) and X_66.2 (which is (AND X_66.3 X_66.6 …)), and recursively substituting the equivalent expressions for all X-literals therein. By this back-substitution we are able to pinpoint the logical structures in the axioms which correspond to the elements of AX-Diagnoses. Furthermore, it is not necessary to decompose the whole knowledge base; the decomposition can be applied on demand, e.g., if the knowledge engineer wants to investigate an axiom more deeply because a single leading diagnosis was identified at the level of axioms.

As a consequence, we can exploit all the methods for computing KB-Diagnoses described in [Friedrich and Shchekotykhin, 2005] for the computation of axiom diagnoses. Based on Proposition 1, it follows easily that the complete and correct algorithm of [Friedrich and Shchekotykhin, 2005] for the generation of minimal KB-Diagnoses can be applied to soundly generate the set of all minimal AX-Diagnoses. There remains the question of practical applicability, which is addressed in the next section.

5 Evaluation

We implemented the diagnostic engine in Java as described in [Friedrich and Shchekotykhin, 2005], with extensions for calculating axiom diagnoses. Benchmarks were run on a PC (Pentium IV with 2 GHz and 512 MB RAM) with Windows XP SP2 as operating system. We conducted numerous tests using the files from the RACER test suite.
In order to be comparable, we applied the same test setting as described in [Friedrich and Shchekotykhin, 2005], i.e., we conducted 30 tests for each knowledge base, where each test randomly changed the knowledge base such that each change on its own leads to an incoherent KB. The diagnosis task is to find the minimum cardinality diagnoses in order to restore coherence. We employed QUICKXPLAIN [Junker, 2004] for finding minimal conflict sets, and RACER for coherence checks. Minimal conflict sets are computed on demand in order to label the hitting-set tree³ (HS-Tree) exploited for the computation of minimal diagnoses.

The evaluation was carried out in two steps, the first of which determined the minimum cardinality KB-Diagnoses for a test. In the second step, we emulated the behavior of a knowledge engineer who wants to investigate an axiom Ax of a KB-Diagnosis D more deeply, e.g., asking which parts of Ax must be changed provided that D is the preferred diagnosis. Therefore we selected an axiom Ax from a KB-Diagnosis found in the first step. Since we are only interested in AX-Diagnoses w.r.t. Ax, this question corresponds to computing AX-Diagnoses where the background theory is extended by KB − D, i.e., B_e := B ∪ (KB − D), and the knowledge base for which we are calculating AX-Diagnoses is DECOMP(Ax).

In case we want to compute the minimal AX-Diagnoses of an axiom contained in a KB-Diagnosis, the following heuristic allows a faster computation by reusing the minimal conflict sets found in the KB-Diagnosis step. Informally, a conflict set CS of a knowledge base KB is a minimal subset of KB which is inconsistent with at least one positive test case unified with the background theory, or just with the background theory in case there are no positive test cases. Such a set is called minimal if no proper subset of CS is a conflict set. Let D be a KB-Diagnosis and let CS₁, …, CS_k be the minimal conflict sets found in the KB-Diagnosis step which contain Ax. It follows that (CS_i − D) ∪ DECOMP(Ax) must contain a minimal conflict set. Therefore, when computing a label for the HS-Tree, in a first step we use B_e′ := B ∪ [⋃_{i=1}^{k} CS_i − D] as a reduced background theory. If we cannot find a minimal conflict w.r.t. this reduced background theory, in a second step we use the full background theory B_e to generate a conflict set. Note that if, for a node n in the HS-Tree, we cannot find a conflict w.r.t. the reduced background

³ We assume the reader to know the basics of model-based diagnosis [Reiter, 1987].

Table 1 reports, per knowledge base, a min, avg and max row over the columns #KBC, |KBC|, #AD, #AC, #IAD, DTRB, H-ADT, ADT and KB-DT; the min and max rows list the four time columns DTRB, H-ADT, ADT and KB-DT:

KB (size)        row   #KBC |KBC| #AD  #AC  #IAD DTRB  H-ADT  ADT     KB-DT
bike1 (81 ax)    min   ,66 2,15 8,56 6,17 | avg 4,07 4,27 5,43 1,57 0 1,92 3,71 34,02 27,58 | max ,48 8,18 87,25 59,77
bike2 (154 ax)   min   ,02 3,63 8,14 21, | avg 5,67 4,1 3,13 1,43 0,1 1,86 5,4 17,43 52,84 | max ,41 18,67 31,89 100,85
bike3 (109 ax)   min   ,77 4,01 7,31 24, | avg 3, ,86 4,66 53,16 33,02 | max ,89 6,66 62,13 53,65
bike4 (166 ax)   min   ,87 3,79 10,86 33, | avg 4,97 5,4 4,7 1,7 0,07 1,95 7,33 107,2 98,07 | max ,18 23,04 426,19 157,99
bike5 (184 ax)   min   ,25 6,61 36,03 30, | avg 3,6 3,97 6,2 1,33 0 1,83 12,85 101,84 114,05 | max ,58 20,5 420,2 160,15
bike6 (207 ax)   min   ,33 7,12 36,06 55, | avg 3,33 3,8 5,33 1,33 0,2 2,23 15,14 148,49 140,11 | max ,72 64,7 493,52 217,1
bike7 (162 ax)   min   ,49 3,64 30,25 24, | avg 2,63 3 5,27 1,23 0 1,76 11,89 59,56 63,05 | max ,47 27,18 97,3 113,6
bike8 (185 ax)   min   ,48 3,92 33,03 30, | avg 2,57 3 5,33 1,2 0 1,28 11,83 52,15 65,65 | max ,84 26,35 92,17 115,47
bike9 (215 ax)   min   ,37 8,94 25,2 36, | avg 3,43 3,97 4,4 1,47 0,33 3,64 14,92 181,02 154,7 | max ,22 53,94 536,98 234,47
galen (3963 ax)  min   ,98 52,27 124,78 142, | avg 3,47 2 5,4 1,2 2,07 11,48 146,98 252,71 244,52 | max ,17 507,33 523,55 525,66
galen2           min   ,95 39,2 98,58 90, | avg 3,03 2 5,2 1,3 2,5 12,7 94,61 146,27 141,83 | max ,94 237,21 236,79 287,45
bcs3 (432 ax)    min   ,61 3,16 5,86 5, | avg 3,33 25,63 2,03 1,67 0,87 1,68 24,15 154,45 126,73 | max ,72 99,64 550,39 528,

Table 1: Test results for diagnosing faulty axioms (times in seconds)

Figure 3: Comparing the performance with and without heuristic diagnosis according to the size of the KB. The bars in the back show the average times (in seconds) for finding minimum-cardinality AX-Diagnoses without heuristic improvements. The black bars in the middle display the average time required when the heuristic is exploited. The small bars in the front show the average time required for finding minimum-cardinality AX-Diagnoses with reduced background (i.e. before checking consistency w.r.t. B ∪ (KB − D)).

Table 1 depicts the results of using this heuristic for calculating diagnoses and compares them to the plain method without heuristic speed-up. Columns are to be interpreted as follows: H-ADT and ADT are the times for generating the minimum-cardinality AX-Diagnoses with heuristic speed-up (H-ADT) and without it (ADT); KB-DT is the time for running the initial knowledge-base diagnosis. All times are given in seconds and include the time for decompositions. For each KB, we characterized the fastest (min) and slowest (max) heuristic axiom diagnosis by the number of minimum-cardinality AX-Diagnoses (#AD) and the number of minimal conflicts (#AC) generated while diagnosing the axiom. As the heuristic requires the conflict sets within the full KB, we additionally provide the number of KB-conflicts (#KBC) and their maximum size (|KBC|) for the fastest and slowest case, as well as the number of invalid minimum-cardinality AX-Diagnoses (i.e. those AX-Diagnoses of the reduced background theory B′_e that are inconsistent with the full one B_e, shown in column #IAD) and the time for finding the minimum-cardinality AX-Diagnoses w.r.t. the reduced background B′_e (DTRB). The avg row shows these values averaged over all tests for a single KB. The unit "ax" denotes the number of axioms in a KB.

In addition, we improved the application of QUICKXPLAIN. QUICKXPLAIN takes two arguments (a KB and a background theory B) and computes a minimal subset of KB (called a minimal conflict set) which is inconsistent with B, provided that B itself is consistent. If such a set does not exist (i.e. KB ∪ B is consistent), it outputs "consistent". Experiments showed that QUICKXPLAIN performs better if B is small: the divide-and-conquer technique rapidly reduces KB (but not B), thereby reducing the cost of consistency checking, so a reduction of the size of KB results in a disproportionate speed-up. Note that this is also the reason why QUICKXPLAIN works well for large knowledge bases. Consequently, if we diagnose DECOMP(Ax) w.r.t. a possibly large B_e, a divide-and-conquer strategy will not provide significant acceleration. Hence, we adopt another strategy for generating minimal conflicts: we leave B untouched and just replace the axiom of D we want to investigate by DECOMP(Ax).

The effort for finding an AX-Diagnosis by running a KB-diagnosis on (KB − D) ∪ DECOMP(Ax) with background B and no heuristic speed-up is approximately the same as for calculating a diagnosis for the plain knowledge base (prior to the axiom diagnosis). The additional cost if the heuristic fails (no diagnosis found using the reduced background theory) is negligible compared to the time spent using the extended background theory B_e. The heuristic for generating minimum-cardinality AX-Diagnoses can be summarized as follows:
1. Let a KB-Diagnosis D be given for KB and background B (and possibly existing test cases, which we omit for brevity). Let Ax ∈ D be the axiom to be investigated.
2. Collect all conflicts CS_1, ..., CS_k from the HS-tree from which D was derived, and create B′_e by adding all CS_i − D to B.
3. Run a diagnosis on DECOMP(Ax) with background B′_e.
4. Check the resulting diagnoses for consistency w.r.t. B ∪ (KB − D).⁴ If none of them turns out to be consistent, run a diagnosis on (KB − D) ∪ DECOMP(Ax) with background B.

⁴ Prior to invoking RACER for checking coherence, one can search for known conflicts appearing in the set of axioms, and calculate a new conflict only if none is found. This conflict can be re-used in subsequent checks.
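The four steps can be sketched as follows in Python; find_diagnoses, consistent and decomposed_ax are assumed placeholders for the diagnosis engine, the consistency oracle and DECOMP(Ax), not a real API:

def axiom_diagnoses(kb, B, D, ax_conflicts, decomposed_ax,
                    find_diagnoses, consistent):
    """Steps 2-4 of the heuristic; all callables are assumptions."""
    # Step 2: reduced background from the conflicts CS_1..CS_k containing Ax.
    reduced_bg = B | (set().union(*ax_conflicts) - D)
    # Step 3: diagnose DECOMP(Ax) against the reduced background theory.
    candidates = find_diagnoses(decomposed_ax, reduced_bg)
    # Step 4: keep diagnoses consistent with the full background B u (KB - D).
    full_bg = B | (kb - D)
    valid = [d for d in candidates if consistent((decomposed_ax - d) | full_bg)]
    if valid:
        return valid
    # Heuristic failed: fall back to a plain diagnosis with the full background.
    return find_diagnoses((kb - D) | decomposed_ax, B)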
For 310 experiments, we compared the times for generating minimum-cardinality AX-Diagnoses with and without the heuristic (see Figure 3). As can be seen, the heuristic always achieved a considerable improvement. In particular, the time for diagnosing w.r.t. the reduced background theory is almost negligible compared to the diagnosis time for the full background theory. The expensive step is, as expected, checking whether the diagnoses of the reduced background theory are consistent with the full background theory. Therefore, the speed-up depends on the number of diagnoses which must be checked and on the cost of consistency checking. Note that in our experiments the minimum-cardinality diagnoses of the reduced background theory always contained those of the full background theory. The number of diagnoses deleted in the maximum average cases is rather small (around two, see Table 1), which means that the diagnoses of the reduced problem are already a very good approximation of the minimum-cardinality diagnoses of the full problem. The running time for finding an axiom diagnosis depends strongly on the complexity of the axiom, as well as on the size and structure of the knowledge base. Note that in our evaluation, RACER does not provide any information on which axioms are used for discovering an incoherence. If theorem provers are available which return the set of axioms applied during theorem proving (i.e. they return a conflict set that is not necessarily minimal, but smaller than the whole KB), then QUICKXPLAIN could start with such a reduced set. This would result in significant speed-ups.

6 Related work
Most closely related to our approach is the work of [Schlobach and Cornet, 2003]. In that paper, a method for debugging faulty knowledge bases as well as faulty axioms is proposed; the approach is called concept pinpointing. Concepts are diagnosed by successive generalization of an axiom until the most general form that is still incoherent is reached. Generalizations are based on a syntactic relation, which is assumed to exist and is left to the knowledge engineer. Moreover, the approach finds concepts only. In [Schlobach, 2005], three approaches for generating diagnoses of terminologies are compared, which differ in the size of the conflict sets returned by the theorem prover. The first approach always considers the complete knowledge base (which served as input to the theorem prover) as a conflict set. In the second approach, the theorem prover returns a minimized (not necessarily minimal) conflict set, and in the third approach, a special procedure for computing minimal conflicts in unfoldable (i.e. acyclic) ALC-TBoxes is employed.

Compared to these evaluations, we employ QUICKXPLAIN to generate minimal conflict sets in order to avoid the known problems that non-minimal conflicts cause for HS-tree generation. Since we are not restricted to a special theorem prover, our approach is not restricted to unfoldable ALC-TBoxes. However, a runtime comparison between generating minimal conflict sets by the method described in [Schlobach and Cornet, 2003] and the combination of QUICKXPLAIN with a highly optimized consistency checker remains open. Note that this comparison also depends strongly on the ability of the consistency checker to return the axioms involved in the generation of an inconsistency, as described above.

In [Mateis et al., 2000], model-based diagnosis of Java programs is discussed. Similar to our approach, the grammar of the Java language is the starting point of the transformation; but unlike our method, that approach is designed for imperative semantics, while we focus on declarative semantics. In the heuristic approaches of [Parsia et al., 2005] and [Wang et al., 2005], debugging cues and error patterns, respectively, are exploited. In contrast to these approaches, our goal was to provide a general, complete, and correct method for diagnosis. Nevertheless, these heuristic approaches may help us to discriminate between minimal diagnoses. Based on a connection relation, [Haase et al., 2005] present a method for computing a minimal subset of an ontology in which a concept is unsatisfiable. However, they do not identify the parts of the axioms which cause this unsatisfiability.

7 Conclusions
We presented a general theory of diagnosis for faulty knowledge bases which allows the identification of faulty parts of axioms. This approach subsumes current methods and is independent of FOL knowledge representation language variants or particular reasoning systems. Based on the roots of model-based diagnosis, we were able to develop correct and complete algorithms for the computation of axiom diagnoses. We have shown the feasibility of our approach by an extensive test evaluation, and provided an extension of current diagnosis methods such that a considerable speed-up for the diagnosis of axioms is achieved.

Acknowledgments
The research project is funded partly by grants from the Austrian Research Promotion Agency, Programme Line FIT-IT Semantic Systems (Project AllRight) and the European Union (Project WS-Diamond).

References
[Baader et al., 2003] F. Baader, D. Calvanese, D.L. McGuinness, D. Nardi, and P.F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, 2003.
[Friedrich and Shchekotykhin, 2005] G. Friedrich and K. Shchekotykhin. A general diagnosis method for ontologies. In 4th ISWC 2005, Springer LNCS, 2005.
[Haase et al., 2005] P. Haase, F. van Harmelen, Z. Huang, H. Stuckenschmidt, and Y. Sure. A framework for handling inconsistency in changing ontologies. In 4th ISWC 2005, Springer LNCS, 2005.
[Junker, 2004] U. Junker. QUICKXPLAIN: Preferred explanations and relaxations for over-constrained problems. In Proc. AAAI'04, San Jose, CA, USA, 2004.
[Linz, 1996] P. Linz. An Introduction to Formal Languages and Automata. Lexington Books, Mass., 2nd edition, 1996.
[Mateis et al., 2000] C. Mateis, M. Stumptner, and F. Wotawa. Modeling Java programs for diagnosis. In ECAI 2000, 2000.
[Parsia et al., 2005] B. Parsia, E. Sirin, and A. Kalyanpur. Debugging OWL ontologies. In WWW 2005, Chiba, Japan, May 2005. ACM.
[Patel-Schneider and Swartout, 1993] P.F. Patel-Schneider and B. Swartout. Description-logic knowledge representation system specification. Technical report, KRSS Group of the DARPA Knowledge Sharing Effort, November 1993.
[Reiter, 1987] R. Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57–95, 1987.
[Schlobach and Cornet, 2003] S. Schlobach and R. Cornet. Non-standard reasoning services for the debugging of description logic terminologies. In Proc. IJCAI'03, Acapulco, Mexico, 2003.
[Schlobach, 2005] S. Schlobach. Diagnosing terminologies. In Proc. AAAI'05, Pittsburgh, PA, USA, 2005.
[Wang et al., 2005] H. Wang, M. Horridge, A. Rector, N. Drummond, and J. Seidenberg. Debugging OWL-DL ontologies: A heuristic approach. In 4th ISWC 2005, Springer LNCS, 2005.

Robust Fault Detection with State Estimators and Interval Models using Zonotopes

Pedro Guerra, Vicenç Puig, Ari Ingimundarson
Automatic Control Department
Universitat Politècnica de Catalunya (UPC)
Rambla de Sant Nebridi, 11, Terrassa (Spain)

Abstract
In this paper, the problem of robust fault detection considering process/measurement noise and modeling uncertainty is addressed with two different state estimation strategies based on a zonotope representation of the state space. First, a consistency-based approach that propagates the uncertainty with zonotopes is proposed. Second, a worst-case state estimation strategy based on adaptive thresholds using zonotopes is presented. In both strategies, the modeling uncertainty is represented by bounding the model parameters in intervals. Process and measurement noise are also considered unknown but bounded. Finally, an example based on a linearised model of a flight control system is used to compare both approaches.

1 Introduction
Model-based fault detection techniques have been investigated and developed within the DX and FDI communities over the last few years. Model-based fault detection relies on mathematical models of the monitored system: the better the model represents the dynamic behavior of the system, the better the chance of improving reliability and performance in detecting faults. However, modeling errors and disturbances are inevitable in complex engineering systems, and hence there is a need for robust fault detection algorithms. The most common approach to the robustness problem in the FDI community is based on the decoupling principle, in which the residual is designed to be insensitive to unknown disturbances whilst remaining sensitive to faults, using the unknown input observer, eigenstructure assignment [Chen and Patton, 1999] or structured parity equations [Gertler, 1998]. With one of these approaches, robustness with respect to unknown disturbances is achieved. However, the robustness problem with respect to modeling errors is more difficult to solve: their distribution matrix is normally unknown and must be estimated, it is often time-varying, and there may be too many disturbances to decouple given the limited degrees of freedom. An alternative strategy is to consider modeling errors as disturbances and, instead of decoupling their effect, to propagate and bound their effect on the residual using, for example, interval methods [Puig et al., 2002b]. This is the approach followed in this paper to handle modeling uncertainty.

On the other hand, process and measurement noise are usually modeled stochastically, under restrictive assumptions concerning the distribution law (typically, zero-mean white noise). However, in many practical situations it is more natural to assume that only bounds on the noise signals are available [Milanese et al., 1996]. This is also the approach used to describe noise signals in this paper. Unfortunately, the set of states obtained by propagating parameter and noise uncertainty may become extremely complex, so several approximating sets to enclose the set of possible states have been proposed in the literature. In [Witczak et al., 2002], a state estimator based on enclosing the set of states by the smallest ellipsoid is proposed, following the algorithms of [Maksarov and Norton, 1996].
However, in that approach only additive uncertainty is considered, not the multiplicative uncertainty introduced by modeling uncertainty located in the parameters. Here, both types of uncertainty are considered, as in [Rinner and Weiss, 2004]; there, however, only the system trajectories obtained from the vertices of the uncertain parameter intervals are considered, under the assumption that the monotonicity property holds. In this paper, two state estimators based on enclosing the set of states by zonotopes are presented, without assuming any monotonicity property and considering the whole set of possible trajectories. The first is a consistency-based approach, which determines the set of states that are consistent with parameter and measurement uncertainty. The second is a worst-case state estimation approach, which bounds the set of possible states by computing the worst-case state estimate under parameter and measurement uncertainty.

The paper is organized as follows. In Section 2, consistency-based and worst-case state estimation principles are introduced. Section 3 presents the implementation of both approaches using zonotopes. In Section 4, these approaches are applied to fault detection. Finally, in Section 5, an application example based on a linearised model of a flight control system is presented to compare both strategies.

2 Consistency-based and worst-case state estimation principles

2.1 System set-up
Let us consider the following discrete-time linear system:

x_{k+1} = (A + ΔA_k) x_k + (B + ΔB_k) u_k + w_k
y_k = C x_k + v_k    (1)

where:
- x_k ∈ R^{nx}, u_k ∈ R^{nu} and y_k ∈ R^{ny} are the state, input and output vectors of dimension nx, nu and ny, respectively;
- v_k ∈ R^{nv} and w_k ∈ R^{nw} are the measurement and process noise of dimension nv and nw, respectively. They are considered unknown but bounded, i.e. v_k ∈ V_k and w_k ∈ W_k, where V_k and W_k are the interval boxes V_k = {v_k ∈ R^{nv} : v̲_k ≤ v_k ≤ v̄_k} and W_k = {w_k ∈ R^{nw} : w̲_k ≤ w_k ≤ w̄_k};
- A, B and C are the state space matrices, and ΔA_k and ΔB_k represent the associated modelling errors.

If the modelling errors ΔA_k and ΔB_k are located in the parameters, a vector θ_k of uncertain time-varying parameters of dimension p is introduced, with values bounded by a compact set Θ of box type, i.e. Θ = {θ ∈ R^p : θ̲ ≤ θ ≤ θ̄}. The uncertain parameters are considered time-varying. This type of model is known as an interval model. The system matrices, including their associated uncertain parameters, are then written A(θ_k) and B(θ_k), respectively.

2.2 The consistency-based estimation principle
A consistency-based state estimator assumes a priori bounds on the noise and the uncertain parameters, and constructs sets of estimated states that are consistent with the a priori bounds and the current measurements. Several researchers, such as [Chisci et al., 1996], [Maksarov and Norton, 1996], [Shamma, 1997], [Calafiore, 2001] and [Kieffer et al., 2002], among others, have addressed this issue.

Definition 1. Consider a system given by Eq. (1), an initial compact set X_0 and a sequence of measured inputs (u_j)_{j=0}^{k−1} and outputs (y_j)_{j=0}^{k}. The exact uncertain state set at time k using the set-membership approach is

X_k = {x_k : (x_j = A(θ_{j−1}) x_{j−1} + B(θ_{j−1}) u_{j−1} + w_{j−1})_{j=1}^{k}, (y_j = C x_j + v_j)_{j=1}^{k}}.

The uncertain state set of Definition 1 at time k can be computed approximately by accepting the rupture of the existing relations between the variables of consecutive time instants. This makes it possible to compute an approximation of this set from the approximate uncertain state set at time k − 1. In the linear case, considering only additive uncertainty, the set of uncertain states generally takes the form of (convex) polytopes, and efficient algorithms exist in the literature to deal with them [Chisci et al., 1996][Maksarov and Norton, 1996]. But in the non-linear case (or in the linear case with multiplicative uncertainty), an explicit construction of the set of possible states is essentially prevented by the generality of the shapes [Kieffer et al., 2002]. Using set computations, it is possible to define an algorithm for the non-linear case that constructs an approximation of the set of uncertain states X_k which are consistent with the current measurement trajectory and with the bounds on noise, disturbances and parameter uncertainty. Before introducing this algorithm, two additional definitions are needed.

Definition 2. Consider a system given by Eq. (1), the set of uncertain states at time k−1, X_{k−1}, and the input/output pair (u_{k−1}, y_{k−1}). The set of predicted states at time k, based on the measurements up to time k−1, is

X^e_k = {x_k : x_k = A(θ_{k−1}) x_{k−1} + B(θ_{k−1}) u_{k−1} + w_{k−1}, x_{k−1} ∈ X_{k−1}, θ_{k−1} ∈ Θ, w_{k−1} ∈ W_{k−1}}.

Definition 3. Consider a system given by Eq. (1) and a measured output y_k. The set of states consistent with this measurement at time k is

X^{y_k}_k = {x_k : y_k = C x_k + v_k, θ_k ∈ Θ, v_k ∈ V_k}.
Then, the set of consistent states at time k with such measurement is defined as X y k k = {x } k : y k = Cx k + v k, θ k Θ, v k V k Then, the following algorithm can be introduced to determine an approximation of set of uncertain states: Algorithm 1. Consistency-based State Estimator using Set-computations Considering a system given by Eq. (1), an initial compact set X 0 and a sequence of measured inputs (u j ) k 1 0 and outputs (y j ) k 0, at each sample time k: Step 1: Compute the set of predicted states, X e k Step 2: Compute the set of consistent states, X y k k Step 3: Compute the set of uncertain states as X k = X e k X y k k Except for very particular cases, it is not possible to evaluate exactly the three sets X e k, Xy k k and X k required in Algorithm 1. Instead guaranteed outer approximations of these sets, as accurate as possible, have been used in the literature. In the case of non-linear systems (or systems including multiplicative uncertainty), such outer approximations are based on subpavings [Kieffer et al., 2002], ellipsoids [ElGhaoui and Calafiore, 1997],[Polyak et al., 2004], zonotopes [Alamo et al., 2005], among others. 2.3 The worst-case state estimation principle Let the model for the state estimator of the system described by Eq. (1) be a Luenberger observer formulated as ˆx k+1 = A(θ)ˆx k + B(θ)u k + w k + K(y k ŷ k ) ŷ k = C(θ)ˆx k + v k (2) where K is the observer gain. When K = 0 the state observer becomes a simulator, while when K = A becomes a predictor. Definition 4. Consider the state estimator given by Eq. (2), an initial compact set X 0 and a sequence of measured inputs (u j ) k 1 0 and outputs (y j ) k 0. The exact uncertain state set at time k using the worst-case approach is expressed by ˆX k = {x k : (x j = A(θ j 1 )x j 1 + B(θ j 1 )u j 1 +w j + K(y j ŷ j ), y j = Cx j + v j ) k j=1} 110 DX'06 - Peñaranda de Duero, Burgos (Spain)

2.3 The worst-case state estimation principle
Let the model for the state estimator of the system described by Eq. (1) be a Luenberger observer formulated as

x̂_{k+1} = A(θ) x̂_k + B(θ) u_k + w_k + K (y_k − ŷ_k)
ŷ_k = C(θ) x̂_k + v_k    (2)

where K is the observer gain. When K = 0 the state observer becomes a simulator, while for K = A it becomes a predictor.

Definition 4. Consider the state estimator given by Eq. (2), an initial compact set X_0 and a sequence of measured inputs (u_j)_{j=0}^{k−1} and outputs (y_j)_{j=0}^{k}. The exact uncertain state set at time k using the worst-case approach is

X̂_k = {x_k : (x_j = A(θ_{j−1}) x_{j−1} + B(θ_{j−1}) u_{j−1} + w_j + K (y_j − ŷ_j), y_j = C x_j + v_j)_{j=1}^{k}}.

As in the case of the uncertain state set of Definition 1, the uncertain state set of Definition 4 at time k can be computed approximately by accepting the rupture of the existing relations between the variables of consecutive time instants; this makes it possible to compute an approximation of this set from the approximate uncertain state set at time k − 1. Because the exact set of estimated states would be difficult to compute, one straightforward way to bound it is to use a box (interval hull) [Puig et al., 2002a], a zonotope [Puig et al., 2001], or other geometric regions that are easy to compute [Puig et al., 2005]. In this paper, the set of estimated states is computed iteratively using zonotopes. From these zonotopes, a worst-case estimate of each state variable can be obtained by computing the interval hull of the zonotope. The sequence of interval hulls □X̂_k, with k ∈ [0, N], will be called the worst-case estimation of the system given by Eq. (1). The following algorithm determines an approximation of the set of uncertain states.

Algorithm 2. Worst-case state estimator using set computations. Consider the state estimator given by Eq. (2), an initial compact set X_0 and a sequence of measured inputs (u_j)_{j=0}^{k−1} and outputs (y_j)_{j=0}^{k}. At each sample time k:
Step 1: Compute the set of uncertain states, X̂_k.
Step 2: Compute the interval hull of the set of uncertain states: □X̂_k = [x̲(k), x̄(k)].    (3)

2.4 Comparison between both approaches
The main difference between the two state estimation approaches presented above is how measurements are taken into account. In the consistency-based approach, the effect of measurements is taken into account implicitly, by intersecting the set of states consistent with the measurements with the set of states predicted using the model. The degree of correction depends on the relative values of the process and measurement noise: if the level of measurement noise is very low, the set of states almost reduces to the set of states consistent with the measurements. In the worst-case approach, on the other hand, the effect of measurements on the correction of the state estimate is considered explicitly through the selection of the gain K. When the process and measurement noise are modeled stochastically (Kalman filter), the value of the observer gain K also depends on the relative values of the process and measurement noise, as in the consistency-based approach, which allows a comparison between the two approaches.

3 Implementation using zonotopes

3.1 Introduction
In this paper, zonotopes are used to bound the uncertain estimated sets. Let us first introduce zonotopes.

Definition 5. The Minkowski sum of two sets X and Y is defined by X ⊕ Y = {x + y : x ∈ X, y ∈ Y}.

Definition 6. Given a vector p ∈ R^n and a matrix H ∈ R^{n×m}, the Minkowski sum of the segments defined by the columns of H is called a zonotope of order m. This set is represented as X = p ⊕ H B^m = {p + Hz : z ∈ B^m}, where B^m is a unitary box composed of m unitary intervals.

A zonotope X of order m can thus be viewed as the Minkowski sum of m segments (Figure 1); the order m is a measure of the geometrical complexity of the zonotope.

Figure 1: Zonotope of order m = 14

Definition 7. The interval hull □X of a closed set X is the smallest interval box that contains X.

Given a zonotope X = p ⊕ H B^m, its interval hull can be easily computed by evaluating p ⊕ H B^m using interval arithmetic, since

□X = {x : |x_i − p_i| ≤ ‖H^i‖_1}    (4)

where H^i is the i-th row of H, and x_i and p_i are the i-th components of x and p, respectively.
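Eq. (4) is a one-line computation; the following sketch, using numpy, shows it on a small planar zonotope (the numeric values are only an illustration):

import numpy as np

def interval_hull(p, H):
    """Interval hull of the zonotope p (+) H B^m, following Eq. (4)."""
    radius = np.sum(np.abs(H), axis=1)   # ||H^i||_1 for each row i
    return p - radius, p + radius

# Example: a planar zonotope with three generators.
p = np.array([1.0, 0.0])
H = np.array([[1.0, 0.5, 0.0],
              [0.0, 0.5, 1.0]])
lo, hi = interval_hull(p, H)             # lo = [-0.5, -1.5], hi = [2.5, 1.5]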
3.2 Implementation of consistency-based estimators
According to Algorithm 1, consistency-based state estimation involves three bounding operations, applied to the set of predicted states X^e_k, the consistent state set X^{y_k}_k and their intersection X_k.

A. Implementation of the prediction step
The prediction step requires characterizing the set X^e_k. This set can be viewed as the direct image of f(x_k, θ_k, w_k) = A(θ_k) x_k + B(θ_k) u_k + w_k. There are different algorithms to bound such an image using subpavings [Kieffer et al., 2002], ellipsoids [Polyak et al., 2004] or zonotopes [Kühn, 1998]. To bound this image using zonotopes, the following result is used.

Theorem 2 (Zonotope inclusion [Alamo et al., 2005]). Consider a family of zonotopes represented by X = p ⊕ M B^m, where p ∈ R^n is a real vector and M ∈ I^{n×m} is an interval matrix. A zonotope inclusion ◊(X) is defined by

◊(X) = p ⊕ [mid(M) G] B^{m+n} = p ⊕ J B^{m+n}

where G ∈ R^{n×n} is a diagonal matrix that satisfies G_ii = Σ_{j=1}^{m} diam(M_ij)/2, i = 1, 2, ..., n, with mid denoting the center and diam the diameter of an interval [Moore, 1966]. Under this definition, X ⊆ ◊(X).
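A sketch of the zonotope inclusion operator, representing the interval matrix M by its elementwise bounds (Mlo, Mhi); this is an illustration of Theorem 2, not the authors' implementation:

import numpy as np

def zonotope_inclusion(p, Mlo, Mhi):
    """Return (center, generators) of a zonotope enclosing p (+) M B^m."""
    mid = (Mlo + Mhi) / 2.0
    # G_ii = sum_j diam(M_ij)/2, with diam(M_ij) = Mhi_ij - Mlo_ij
    G = np.diag(np.sum((Mhi - Mlo) / 2.0, axis=1))
    return p, np.hstack([mid, G])        # J = [mid(M) G], order m+n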

This prediction step aims at computing the zonotope X^e_{k+1} that bounds the trajectory of the system at instant k+1 from the previous approximating zonotope X_k at instant k, using the interval mean-value extension of Eq. (1) [Moore, 1966] and the zonotope inclusion operator, as a generalization of Kühn's method [Kühn, 1998]:

X^e_{k+1} = p_{k+1} ⊕ H_{k+1} B^r    (5)

where p_{k+1} = mid(A(θ_k)) p_k + mid(B(θ_k)) u_k and H_{k+1} = [J_1 J_2 J_3 W], with J_1 = ◊(A(θ_k) H_k), J_2 = ◊((A(θ_k) − mid(A(θ_k))) p_k), J_3 = u_k (diam(B(θ_k))/2), and W the segment matrix bounding the process noise. J_1 and J_2 are calculated using the zonotope inclusion operator.

It is important to notice that this method increases the number of segments generating the zonotope X^e_{k+1}. In order to control the domain complexity, a reduction step is therefore implemented. Here we use the method proposed in [Combastel, 2003] to reduce the zonotope complexity.

Property 1. Given the zonotope X = p ⊕ H B^r ⊆ R^n and the integer s (with n < s < r), denote by Ĥ the matrix resulting from reordering the columns of H in decreasing Euclidean norm. Then X ⊆ p ⊕ [Ĥ_T Q] B^s, where Ĥ_T is made of the first s − n columns of Ĥ, and Q ∈ R^{n×n} is a diagonal matrix that satisfies Q_ii = Σ_{j=s−n+1}^{r} |Ĥ_ij|, i = 1, ..., n.
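Property 1 translates directly to a few lines of numpy; this sketch keeps the s − n longest generators and boxes the remaining ones into a diagonal matrix:

import numpy as np

def reduce_zonotope(H, s):
    """Generator matrix with s columns whose zonotope encloses that of H
    (n < s < r, per Property 1)."""
    n, r = H.shape
    order = np.argsort(-np.linalg.norm(H, axis=0))       # decreasing norm
    Hs = H[:, order]
    HT = Hs[:, : s - n]                                  # kept generators
    Q = np.diag(np.sum(np.abs(Hs[:, s - n:]), axis=1))   # boxed tail
    return np.hstack([HT, Q])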
B. Implementation of the consistency step
The consistency step requires characterizing the set X^{y_k}_k, which takes into account the information provided by the measurement y_k. This set can be viewed as the inverse image of g(x_k, θ_k, v_k) = C x_k + v_k. There are algorithms to bound such an image using, for example, subpavings [Kieffer et al., 2002], ellipsoids [Polyak et al., 2004] or zonotopes [Alamo et al., 2005]. To bound the consistent state set using zonotopes, the intersection of p strips in the state space is calculated: given a measurement y_k ∈ R^p, the consistent state set X^{y_k}_k introduced in Definition 3 corresponds, componentwise, to a region between two hyperplanes. Define the sets X^{y_k}_k(i) = {x_k : ∃(v_k ∈ V_k, θ_k ∈ Θ) such that y_{k,i} = g_i(x_k, θ_k, v_k)}, where g_i denotes the i-th component of g. It is clear that X^{y_k}_k ⊆ ∩_{i=1}^{p} X^{y_k}_k(i).

C. Implementation of the intersection step
Finally, the intersection step requires characterizing the set X_k = X^e_k ∩ X^{y_k}_k, i.e. the intersection of the two previously bounded sets. This set can be approximated using, for example, ellipsoids [Polyak et al., 2004], zonotopes [Alamo et al., 2005] or subpavings [Kieffer et al., 2002]; in this paper, we again use zonotopes. Given the zonotope X^e_k = p ⊕ H B^r, the strip X^{y_k}_k = {x ∈ R^n : |c^T x − d| ≤ σ} and a vector λ ∈ R^n, if X^e_k ∩ X^{y_k}_k ≠ ∅ we have

X^e_k ∩ X^{y_k}_k ⊆ X̂_k(λ) = p̂(λ) ⊕ Ĥ(λ) B^{r+1}    (6)

where

p̂(λ) = p + λ (d − c^T p)    (7)
Ĥ(λ) = [(I − λ c^T) H   σλ]    (8)

The parameter vector λ can be chosen so as to minimize a size criterion for the obtained bound; here, we use the method based on the Frobenius norm proposed in [Alamo et al., 2005].

3.3 Implementation of worst-case estimators
To implement worst-case estimators using zonotopes, notice that the estimator model of Eq. (2) can be rearranged as a discrete-time system with an augmented input:

x̂_{k+1} = (A − KC) x̂_k + [B I K −K] [u_k; w_k; y_k; v_k]    (9)

or, equivalently,

x̂_{k+1} = A_o x̂_k + B_o u^o_k    (10)

where A_o = A − KC, B_o = [B I K −K] and u^o_k = [u_k w_k y_k v_k]^T. The problem of worst-case state observation can thus be formulated as a problem of worst-case simulation: it requires characterizing the set X̂_k, which can be viewed as the direct image of Eq. (9) and can be implemented using zonotopes as in the prediction step of the consistency-based state estimator.
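The rearrangement (9)-(10) amounts to assembling two constant matrices; a small sketch for the nominal case (the interval version replaces A and B by A(θ) and B(θ)):

import numpy as np

def worst_case_matrices(A, B, C, K):
    """A_o and B_o of Eq. (10), so that x^_{k+1} = A_o x^_k + B_o [u;w;y;v]."""
    n = A.shape[0]
    Ao = A - K @ C
    Bo = np.hstack([B, np.eye(n), K, -K])   # augmented input [u; w; y; v]
    return Ao, Bo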

4 Application to fault detection

4.1 Fault detection using consistency-based state estimators
The use of consistency-based state estimation in fault detection is straightforward: the existence of a fault is detected by the set-computation algorithm (Algorithm 1) in the intersection step. Assuming that the actual system satisfies (1) under non-faulty operating conditions and that the algorithm is correctly initialized with the initial condition X_0, then, for a given sequence of measured inputs (u_j)_{j=0}^{k−1} and outputs (y_j)_{j=0}^{k} of the actual system, a fault is said to have occurred if at some time instant k

X_k = X^e_k ∩ X^{y_k}_k = ∅.    (11)

When a fault occurs, a recovery strategy is needed. One possibility consists of resetting the set X^e_k to a size which is guaranteed to capture the true state of the system after the fault has been detected.

4.2 Fault detection using worst-case state estimators
In this case, fault detection consists in testing whether the measured outputs of the system lie within the behavior predicted by an observer of the faultless system. If the measurements are inconsistent with the predicted output of the observer, the existence of a fault is proved. The residual vector describes the consistency check between the predicted behaviour, ŷ(k), and the real behaviour, y(k):

r(k) = y(k) − ŷ(k)    (12)

Ideally, the residuals should only be affected by the faults. However, the presence of disturbances, noise and modeling errors causes the residuals to become nonzero even without faults, and thus interferes with the detection of faults. When a dynamic system is modeled by an interval model, the predicted output is described by a set that can be bounded at each iteration by an interval [ŷ(k)]. The fault detection test then propagates the parameter uncertainty to the residual and checks whether

0 ∈ [r(k)] = y(k) − [ŷ(k)].    (13)

If this holds, no fault is indicated; otherwise, a fault is indicated. This test is equivalent to checking whether the measured output belongs to the interval of predicted outputs, i.e. whether

y(k) ∈ [ŷ(k)].    (14)

If the noise is assumed bounded, the measurement can be considered to lie in the interval [y(k)], and the previous fault detection test can be restated as

[y(k)] ∩ [ŷ(k)] ≠ ∅.    (15)
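The interval test of Eq. (15) reduces to a box intersection check; a minimal sketch, with boxes given as (lower, upper) pairs of vectors:

import numpy as np

def fault_detected(y_box, yhat_box):
    """True when the measurement box misses the predicted output box (Eq. 15)."""
    y_lo, y_hi = y_box
    p_lo, p_hi = yhat_box
    intersects = np.all(np.maximum(y_lo, p_lo) <= np.minimum(y_hi, p_hi))
    return not intersects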
5 Case study

5.1 Description
A modified version of the benchmark problem proposed in [Chen and Patton, 1996] is considered. It consists of a linearized discrete-time model of a simplified longitudinal flight control system:

x_{k+1} = A_k x_k + B_k u_k + w_k
y_k = C_k x_k + v_k    (16)

where A_k = [a_ij ± Δa_ij] and B_k = [b_i ± Δb_i] are interval matrices and C_k = I_{3×3}. The state variables x = [η_y ω_z δ_z]^T represent the normal velocity, pitch rate and pitch angle, respectively. The control input is an elevator control signal. The system has been simulated using u_k = 10. The covariance matrices of the process and measurement noise sequences are Q_k = diag{0.1², 0.1², 0.01²} and R_k = 0.01² I_{3×3}. The aerodynamic coefficients are randomly perturbed by ±20%, i.e. Δa_ij = 0.2 a_ij, while the process noise w_k and the measurement noise v_k are normally distributed. The initial state vector used in the simulation was x_0 = [0 0 0]^T.

To implement the consistency-based and worst-case state estimators, the process/measurement noise is only assumed to be bounded (see Eq. (1)). In this example, since the statistical distribution of the noise is given, these bounds are obtained from the covariance matrices by taking 3 times the standard deviation: w_k ∈ [[−0.3, 0.3], [−0.3, 0.3], [−0.03, 0.03]] and v_k ∈ [[−0.03, 0.03], [−0.03, 0.03], [−0.03, 0.03]]. The parameter uncertainty is obtained by considering all entries of the A matrix as intervals centered at the nominal value with half-width Δa_ij = 0.2 a_ij. To make the results obtained with both state estimators comparable, the gain K of the worst-case state estimator is designed taking the statistical distribution of the noise into account: the observer gain K is determined from the covariance matrices Q and R of the process and measurement noise, making use of Kalman filter theory [Chen and Patton, 1996] in its steady-state approximation.

5.2 Fault detection
Two different types of fault were studied to compare the behavior of the fault detection approaches presented in this paper: an additive fault and a multiplicative fault. In all cases, the zonotope-based fault detection methods were initialized with the zonotope X_0 = p_0 ⊕ H_0 B^m, and the zonotope order m was limited to 27.

A. Additive fault
In this scenario, an additive fault of size 2 is introduced in the pitch angle output measurement, i.e. y_{k,3} = y_{k,3} + 2, from time instant k = 10.

Figure 2 shows the three components of the output measurements and their envelopes obtained with the consistency approach (dashed line) and with the worst-case method (+ marks). The first two components of the output measurement remain inside their envelopes, so no fault is indicated for them; the third component (the pitch angle) leaves its envelopes for several time instants from k = 10 onwards, so the fault is detected.

Figure 2: System output measurements and envelopes (additive fault)

Figure 3 presents the sets X^e_k and X^{y_k}_k at time k = 10 for the consistency test: their intersection is the empty set, X^e_k ∩ X^{y_k}_k = ∅, which is why the fault is detected with this approach. Figure 4 shows the result of the worst-case fault detection test (0 means no fault, 1 means fault), and Figure 5 presents the result of the consistency fault detection test; in the latter case the algorithm stops when the fault is detected, i.e. the fault is indicated only at time instant k = 10. From these two figures it can be observed that with the worst-case test the persistence of the fault indication is higher than with the consistency method; in the worst-case approach, the persistence of the fault indication depends on the observer gain K.

Figure 3: Fault detected at k = 10 using the consistency test
Figure 4: Worst-case fault detection for an additive fault
Figure 5: Consistency fault detection for an additive fault

B. Multiplicative fault
In this scenario, a multiplicative fault of size 1 is introduced from time instant k = 10 by modifying the parameter a_21 of the system matrix A_k, i.e.

A_k = [a_11 ± Δa_11   a_12 ± Δa_12   a_13 ± Δa_13;  ā_21   a_22 ± Δa_22   a_23 ± Δa_23;  a_31   a_32   a_33]

where ā_21 denotes the modified value of a_21. Figure 6 shows the evolution of the three output measurements and their envelopes obtained with both approaches; in this case, all measurements leave their envelopes at different time instants from k = 10 onwards, so the fault is indicated by both methods. Figure 7 shows the result of the worst-case fault detection test and Figure 8 the result of the consistency fault detection test. Figure 9 presents the sets X^e_k and X^{y_k}_k at time k = 10 for the consistency test: their intersection is again the empty set, X^e_k ∩ X^{y_k}_k = ∅, which is why the fault is detected with this approach. From these last two figures, it can again be observed that with the worst-case test the persistence of the fault indication is higher than with the consistency method, and that this persistence depends on the observer gain K.

Figure 6: System output measurements (multiplicative fault)
Figure 7: Worst-case fault detection for a multiplicative fault
Figure 8: Consistency fault detection for a multiplicative fault
Figure 9: Fault detected at k = 10 using the consistency test

6 Conclusions
In this paper, two methods for robust fault detection based on state estimation using zonotopes have been presented and compared. Both approaches use interval models to describe parameter uncertainty and assume a bounded description of the process and measurement noise. The first, consistency-based, approach computes a set of uncertain states that are consistent with the model uncertainty and the process/measurement noise. The second, worst-case, approach computes the worst-case estimate of each state variable considering the effect of parameter uncertainty and noise.

After applying both approaches to the application example based on a simplified longitudinal flight control system, it can be noticed that the worst-case approach offers a fault indication that is more persistent in time than the one provided by the consistency-based approach. This is because, in the consistency-based approach, the fault detection algorithm has to be stopped after the fault detection, since an inconsistency is detected and it is impossible to continue the state estimation. If these two approaches are to be used for fault isolation, the consistency-based approach would require a memory to keep the fault indication active after the fault is detected. So far, these two approaches to fault detection had always been presented separately in the literature but never compared; this comparison is the main contribution of this paper.

References
[Alamo et al., 2005] T. Alamo, J.M. Bravo, and E.F. Camacho. Guaranteed state estimation by zonotopes. Automatica, 41(6), 2005.

[Calafiore, 2001] G. Calafiore. A set-valued non-linear filter for robust localization. In European Control Conference (ECC'01), Porto, Portugal, 2001.
[Chen and Patton, 1996] J. Chen and R.J. Patton. Optimal filtering and robust fault diagnosis of stochastic systems with unknown disturbances. IEE Proceedings: Control Theory and Applications, 143(1):31–36, 1996.
[Chen and Patton, 1999] J. Chen and R.J. Patton. Robust Model-Based Fault Diagnosis for Dynamic Systems. Kluwer Academic Publishers, 1999.
[Chisci et al., 1996] L. Chisci, A. Garulli, and G. Zappa. Recursive state bounding by parallelotopes. Automatica, 32, 1996.
[Combastel, 2003] C. Combastel. A state bounding observer based on zonotopes. In European Control Conference (ECC'03), Cambridge, UK, 2003.
[ElGhaoui and Calafiore, 1997] L. El Ghaoui and G. Calafiore. Robust filtering for discrete-time systems with bounded noise and parametric uncertainty. IEEE Transactions on Automatic Control, 46(7).
[Gertler, 1998] J. Gertler. Fault Detection and Diagnosis in Engineering Systems. Marcel Dekker, 1998.
[Kieffer et al., 2002] M. Kieffer, L. Jaulin, and E. Walter. Guaranteed recursive non-linear state bounding using interval analysis. International Journal of Adaptive Control and Signal Processing, 16(3), 2002.
[Kühn, 1998] W. Kühn. Rigorously computed orbits of dynamical systems without the wrapping effect. Computing, 61(1), 1998.
[Maksarov and Norton, 1996] D. Maksarov and J.P. Norton. State bounding with ellipsoidal set description of the uncertainty. International Journal of Control, 65(5), 1996.
[Milanese et al., 1996] M. Milanese, J. Norton, H. Piet-Lahanier, and E. Walter, editors. Bounding Approaches to System Identification. Plenum Press, 1996.
[Moore, 1966] R.E. Moore. Interval Analysis. Prentice Hall, 1966.
[Polyak et al., 2004] B.T. Polyak, S.A. Nazin, C. Durieu, and E. Walter. Ellipsoidal parameter or state estimation under model uncertainty. Automatica, 40(7), 2004.
[Puig et al., 2001] V. Puig, P. Cugueró, and J. Quevedo. Worst-case estimation and simulation of uncertain discrete-time systems using zonotopes. In Proceedings of the European Control Conference, 2001.
[Puig et al., 2002a] V. Puig, P. Cugueró, and J. Quevedo. Time-invariant approach to set-membership simulation and state observation for discrete-time invariant systems with parametric uncertainty. In Proceedings of the 41st IEEE Conference on Decision and Control, 2002.
[Puig et al., 2002b] V. Puig, J. Quevedo, and T. Escobet. Robust fault detection approaches using interval models. In IFAC World Congress, Barcelona, Spain, 2002.
[Puig et al., 2005] V. Puig, A. Stancu, and J. Quevedo. Observers for interval systems using set and trajectory-based approaches. In Proceedings of the European Control Conference and IEEE Conference on Decision and Control, 2005.
[Rinner and Weiss, 2004] B. Rinner and U. Weiss. Online monitoring by dynamically refining imprecise models. 34, 2004.
[Shamma, 1997] J.S. Shamma. Approximate set-valued observers for nonlinear systems. IEEE Transactions on Automatic Control, 42(5), 1997.
[Witczak et al., 2002] M. Witczak, J. Korbicz, and R.J. Patton. A bounded-error approach to designing unknown input observers. In IFAC World Congress, Barcelona, Spain, 2002.

Supervision Patterns in Discrete Event Systems Diagnosis

Thierry Jéron, Hervé Marchand, Sophie Pinchinat, Marie-Odile Cordier
IRISA, Rennes, France

Abstract
In this paper, we are interested in the diagnosis of discrete event systems modeled by finite transition systems. We propose a model of supervision patterns general enough to capture past occurrences of particular trajectories of the system. Modeling the diagnosis objective by supervision patterns allows us to generalize the properties to be diagnosed and to render them independent of the description of the system. We first formally define the diagnosis problem in this context. We then derive techniques for the construction of a diagnoser and for the verification of diagnosability, based on standard operations on transition systems. We show that these techniques are general enough to express and solve in a unified way a broad class of diagnosis problems found in the literature, e.g. diagnosing permanent faults, multiple faults, fault sequences and some problems of intermittent faults.

1 Introduction
Diagnosing and monitoring dynamical systems is an increasingly active research domain, and model-based approaches have been proposed which differ according to the kind of models they use [13, 1, 11, 12, 3, 4]. The general diagnosis problem is to detect or identify patterns of particular events on a partially observable system. This paper focuses on discrete-event systems modeled as finite state machines. In this context, patterns usually describe the occurrence of a fault, multiple occurrences of a fault, or the repair of a system after the occurrence of a fault. The aim of diagnosis is to decide, by means of a diagnoser, whether or not such a pattern occurred in the system. Even if such a decision cannot be taken immediately after the occurrence of the pattern, one requires that this decision be taken within a bounded delay. This property is usually called diagnosability. It can be checked a priori from the system model, and depends on both its observability and the supervision pattern.

However, the approaches in the literature suffer from some deficiencies. One observes many different definitions of diagnosability and ad hoc algorithms for the construction of the diagnoser, as well as for the verification of diagnosability. As a consequence, all these results are difficult to reuse for new but similar diagnosis problems. We believe that the reason comes from the absence of a clear definition of the involved patterns, which would clarify the separation between the diagnosis objective and the specification of the system.

In this paper, we formally introduce the notion of supervision pattern as a means to define the diagnosis objectives: a supervision pattern is an automaton whose language is the set of trajectories one wants to diagnose. The proposal is general enough to cover in a unified way an important class of diagnosis objectives, including the detection of permanent faults, but also transient faults, multiple faults, repeating faults, as well as quite complex sequences of events. We then propose a formal definition of the Diagnosis Problem in this context. The essential point is a clear definition of the set of trajectories compatible with an observed trace. The Diagnosis Problem is then expressed as the problem of synthesizing a function over traces, the diagnoser, which decrees on the possible/certain occurrence of the pattern on the trajectories compatible with the trace. The diagnoser is required to fulfil two fundamental properties: correctness and bounded diagnosability. Correctness expresses that the diagnoser answers accurately, and bounded diagnosability guarantees that only a bounded number of observations is needed to eventually answer with certainty that the pattern has occurred. Bounded diagnosability is formally defined as the Ω-diagnosability of the system (where Ω is the supervision pattern), which compares to the standard diagnosability of [13]. Relying on the formal framework we have developed, we then propose algorithms for both the diagnoser's synthesis and the verification of Ω-diagnosability. We believe that these generic algorithms, as well as their correctness proofs, are a lot simpler than the ones proposed in the literature.

The paper is organized as follows. In Section 2, we recall standard definitions and notations on labeled transition systems, as well as the notion of trajectories compatible with an observable trace. Supervision patterns are introduced in Section 3; the diagnosis problem and Ω-diagnosability are then defined. Section 4 is dedicated to algorithms, and their associated proofs, for the construction of a correct diagnoser as well as the verification of Ω-diagnosability. Finally, Section 5 illustrates the approach with an example.

2 Labeled transition systems and compatible trajectories
We first recall useful standard notations. We assume given an alphabet Λ, that is a finite set. The set of finite sequences over Λ is denoted by Λ*, with ε for the empty sequence. For s and s′ in Λ*, s·s′ denotes the concatenation of s and s′, and |s| denotes the length of s. We now come to the models of systems.

Definition 1 An LTS over Λ is a tuple G = (Q, Λ, →, q0) where Q is a finite set of states with a distinguished element q0, called the initial state; Λ is the set of events of G; and → ⊆ Q × Λ × Q is the partial transition relation.

In the rest of the section, we assume given an LTS G = (Q, Λ, →, q0). We write q -a→ q′ for (q, a, q′) ∈ → and q -a→ for ∃q′ ∈ Q, q -a→ q′. We extend → to sequences by setting that q -ε→ q always holds, and q -s·a→ q′ whenever q -s→ q″ and q″ -a→ q′ for some q″ ∈ Q. The event set of a state q ∈ Q is Γ(q) = {a ∈ Λ : q -a→}. A state q is reachable if q0 -s→ q for some s ∈ Λ*. We set Δ(q, s) = {q′ ∈ Q : q -s→ q′}; in particular, Δ(q, ε) = {q}. By abuse of notation, for a language L ⊆ Λ*, Δ(q, L) = {q′ ∈ Q : ∃s ∈ L, q -s→ q′}, and for any Q′ ⊆ Q, Δ(Q′, L) = ⋃_{q ∈ Q′} Δ(q, L). A subset Q′ ⊆ Q is stable whenever Δ(Q′, Λ*) ⊆ Q′. G is alive if Γ(q) ≠ ∅ for every q ∈ Q; it is complete if Γ(q) = Λ for every q ∈ Q; and it is deterministic if, for any q ∈ Q and a ∈ Λ, q -a→ q′ and q -a→ q″ imply q′ = q″.

The language generated by the system G is L(G) = {s ∈ Λ* : q0 -s→}, whose elements are called the trajectories of G. Given a trajectory s ∈ L(G), we write L(G)/s = {t ∈ Λ* : s·t ∈ L(G)} for the set of sequences that extend s in G. Later in the paper, we will need to distinguish a subset Q′ ⊆ Q of final states; the notions above are extended to this setting by letting L_{Q′}(G) = {s ∈ L(G) : Δ(q0, s) ∩ Q′ ≠ ∅}.

A useful operation on LTSs is the synchronous product, which allows one to intersect the languages of two LTSs.

Definition 2 Let G_i = (Q_i, Λ, →_i, q0^i), i = 1, 2, be two LTSs. Their synchronous product is G_1 × G_2 = (Q_1 × Q_2, Λ, →, (q0^1, q0^2)), where (q_1, q_2) -a→ (q′_1, q′_2) whenever q_1 -a→_1 q′_1 and q_2 -a→_2 q′_2.

Clearly L(G_1 × G_2) = L(G_1) ∩ L(G_2), and for Q′_1 ⊆ Q_1 and Q′_2 ⊆ Q_2 we also have L_{Q′_1 × Q′_2}(G_1 × G_2) = L_{Q′_1}(G_1) ∩ L_{Q′_2}(G_2). Also, if two sets Q′_1 ⊆ Q_1 and Q′_2 ⊆ Q_2 are stable, Q′_1 × Q′_2 is stable in G_1 × G_2.

As we are interested in diagnosing systems (this will be formalized in the next section), partial observation plays a central role. In this regard, the set of events Λ is partitioned into Λ_o and Λ_uo (Λ = Λ_o ∪ Λ_uo with Λ_o ∩ Λ_uo = ∅), where Λ_o represents the set of observable events; elements of Λ_uo are then unobservable events. We say that G is Λ_o-alive if, from every state, some observable event can eventually occur, meaning that there is no terminal loop of unobservable events. Notice that when G has no loop of unobservable events, G is alive if and only if G is Λ_o-alive.

Let π : Λ* → Λ_o* be the natural projection of trajectories onto Λ_o*, defined by π(ε) = ε, π(s·a) = π(s)·a if a ∈ Λ_o, and π(s·a) = π(s) otherwise. The projection π simply erases the unobservable events from a trajectory; it extends to languages by defining, for L ⊆ Λ*, π(L) = {π(s) : s ∈ L}. The inverse projection of L′ ⊆ Λ_o* is defined by π⁻¹(L′) = {s ∈ Λ* : π(s) ∈ L′}. Now, the language of traces of G is Obs(G) = π(L(G)): it is the set of observable sequences of its trajectories. From the projection π, we derive an equivalence relation between trajectories of G, written ≈, called the delay-observation equivalence, in reference to delay-bisimulation.

Definition 3 Let ≈ ⊆ L(G) × L(G) be the binary relation defined by s ≈ s′ whenever π(s) = π(s′) and, moreover, s ends with an observable event if and only if s′ does. One can verify that ≈ is an equivalence relation, and we adopt the convention to write [s] for the equivalence class of s.

A trajectory s ∈ L(G) naturally maps onto a trace of G, namely π(s). Now, a non-empty trace μ of G does not uniquely determine an equivalence class: in general, the observation μ can be brought back into G in two different manners. μ can be associated with a class [s] with π(s) = μ and s ending with an observable event, or with a class [s′] with π(s′) = μ and s′ ending with an unobservable event; by Definition 3, [s] and [s′] are different. Henceforth, we take the convention that the equivalence class denoted by a trace μ is

[μ] = π⁻¹(μ) ∩ L(G) ∩ Λ*·Λ_o if μ ≠ ε, and [ε] otherwise.

We say that [μ] is the set of trajectories compatible with the trace μ. This notion of compatible trajectory will be central for diagnosis, as the aim will be to infer properties of the set of trajectories [μ] compatible with the observation of the trace μ. The reason for choosing this definition of [μ] is that, in the case of on-line diagnosis, it is natural to assume that the diagnoser reacts to an observable move of the system.

3 Supervision patterns and the diagnosis problem
In this section, we introduce the notion of supervision patterns, which are a means to define the languages we are interested in for diagnosis purposes. We then give some examples of such patterns. Finally, we introduce the diagnosis problem for such patterns. Supervision patterns are represented by particular LTSs:

Definition 4 A supervision pattern is an LTS Ω = (Q_Ω, Λ, →_Ω, q0^Ω), deterministic and complete, equipped with a distinguished stable set of final states F_Ω ⊆ Q_Ω.

As Ω is complete, we get L(Ω) = Λ*. Also notice that the assumption that F_Ω is stable means that its accepted language is extension-closed, i.e. satisfies L_{F_Ω}(Ω)·Λ* ⊆ L_{F_Ω}(Ω); otherwise said, L_{F_Ω}(Ω) is a language violating a safety property. This choice is natural, since we want to diagnose whether all trajectories compatible with an observed trace have a prefix recognized by the pattern. We now give examples of supervision patterns which rephrase standard properties of interest for diagnosis.

Occurrence of one fault. Let f ∈ Λ be a fault event, and consider that we are interested in diagnosing the occurrence of this fault. A trajectory is faulty if it belongs to Λ*·f·Λ*. The supervision pattern Ω_1 of Figure 1 exactly recognizes this language: L_{F_1}(Ω_1) = Λ*·f·Λ*.

Figure 1: Supervision pattern for one fault

Occurrence of several faults. Let f_1 and f_2 be two faults that may occur in the system. Diagnosing the occurrence of these two faults in a trajectory means deciding the membership of this trajectory in Λ*·f_1·Λ* ∩ Λ*·f_2·Λ* = L_{F_1}(Ω_{f_1}) ∩ L_{F_2}(Ω_{f_2}), where the Ω_{f_i}, i = 1, 2, are isomorphic to the supervision pattern Ω_1 of Figure 1. The supervision pattern is then the product Ω_{f_1} × Ω_{f_2}, whose accepted language in F_1 × F_2 is L_{F_1 × F_2}(Ω_{f_1} × Ω_{f_2}) = L_{F_1}(Ω_{f_1}) ∩ L_{F_2}(Ω_{f_2}). More generally, the supervision pattern for the occurrence of a set of faults f_1, ..., f_n is the product of the patterns Ω_{f_i}, considering the product of the F_i as the final state set.

Ordered occurrence of faults. If one is interested in diagnosing different faults in a precise order, for example an occurrence of f_2 after an occurrence of f_1, the supervision pattern should recognize the trajectories in Λ*·f_1·Λ*·f_2·Λ*. If f_1 corresponds to a fault event and f_2 to the repair of this fault in the system, then we actually diagnose the repair of the fault f_1; with this pattern, the aim is to match the diagnosability of repaired faults considered in the literature.

Multiple occurrences of a fault. Another interesting problem is to diagnose multiple occurrences of the same fault event f, say K times. This corresponds to a supervision pattern whose accepted language is (Λ*·f)^K·Λ*; the aim is then to match the K-diagnosability of the literature. This can easily be generalized to a pattern recognizing the occurrence of K patterns (identical or not). A supervision pattern can also describe the fact that a fault (occurrence of f) occurred twice without repair in between; this can be generalized to a pattern recognizing the occurrence of K faults (identical or not) without repair.

The diagnosis problem. In the remainder of the paper, we consider a system whose behavior is modeled by an LTS G = (Q, Λ, →, q0). The only assumption made on G is that G is Λ_o-alive; notice that G can be non-deterministic. We also consider a supervision pattern Ω = (Q_Ω, Λ, →_Ω, q0^Ω, F_Ω) denoting the language L_{F_Ω}(Ω) that we want to diagnose. We define the Diagnosis Problem as the problem of defining a function Diag over traces whose intention is to answer the question whether the trajectories corresponding to observed traces are recognized or not by the supervision pattern. We do require some properties for Diag: correctness and bounded diagnosability. Correctness means that Yes and No answers should be accurate, while bounded diagnosability means that trajectories in L_{F_Ω}(Ω) should be diagnosed with finitely many observations.

The Diagnosis Problem can be stated as follows: given an LTS G and a supervision pattern Ω, decide whether there exists (and compute, if any) a three-valued function Diag : Obs(G) → {Yes, No, ?} decreeing, for each trace μ of G, on the membership in L_{F_Ω}(Ω) of any trajectory in [μ]. Formally:

- (Diagnosis correctness) The function should verify: Diag(μ) = Yes if [μ] ⊆ L_{F_Ω}(Ω); Diag(μ) = No if [μ] ∩ L_{F_Ω}(Ω) = ∅; Diag(μ) = ? otherwise.
- (Bounded diagnosability) As G is only partially observed, we expect in general situations where Diag(μ) = ? (as neither [μ] ⊆ L_{F_Ω}(Ω) nor [μ] ∩ L_{F_Ω}(Ω) = ∅ holds). However, we require this undetermined situation not to last, in the following sense: there must exist n ∈ N, the bound, such that whenever a trajectory compatible with μ is recognized by the pattern, for every extension of μ with at least n observable events the diagnoser answers Yes.

Diagnosis correctness means that the diagnosis of a trace μ is No if no trajectory of its semantics [μ] lies in L_{F_Ω}(Ω), while it is Yes if all trajectories in [μ] lie in L_{F_Ω}(Ω). Bounded diagnosability means that when observing a trajectory in L_{F_Ω}(Ω), a Yes answer should be produced after finitely many observable events (see Figure 2 for an intuitive explanation of these notions).

Figure 2: The Ω-diagnosability

Now, if Diag provides a correct diagnosis, bounded diagnosability can be rephrased as a property of G with respect to Ω, which we call Ω-diagnosability.

Definition 5 An LTS G is (n, Ω)-diagnosable, where n ∈ N, whenever

∀s ∈ L_{F_Ω}(Ω) ∩ L(G) ∩ Λ*·Λ_o, ∀t ∈ L(G)/s ∩ Λ*·Λ_o: if |π(t)| ≥ n then [π(s·t)] ⊆ L_{F_Ω}(Ω).    (1)

We say that G is Ω-diagnosable if G is (n, Ω)-diagnosable for some n ∈ N. Ω-diagnosability says that when a trajectory ending with an observable event is recognized by the supervision pattern Ω, then, for any extension with enough observable events, any trajectory compatible with the observation π(s·t) is also recognized by Ω. The remark before Definition 5 is formalized by:

Proposition 1 If Diag is a correct diagnosis function, then G is Ω-diagnosable if and only if the bounded diagnosability property holds for Diag.

To show the unifying power of the framework based on supervision patterns, we here consider the very particular supervision pattern Ω_1 of Figure 1, originally considered by [13] with the associated notion of f-diagnosability. Let us first recall this notion. Let G be an LTS which is alive and has no loop of unobservable events. G is f-diagnosable whenever there exists n ∈ N such that

∀s·f ∈ L(G), ∀t ∈ L(G)/(s·f): if |π(t)| ≥ n then every s′ ∈ L(G) with π(s′) = π(s·f·t) contains f.    (2)

The following proposition relates f-diagnosability with Ω_1-diagnosability.

Proposition 2 Let G be an LTS, and assume G is alive and has no loop of internal events. Then G is f-diagnosable if and only if G is Ω_1-diagnosable.

We first make the following remarks: a trajectory contains f if and only if it belongs to L_{F_1}(Ω_1), and this language is extension-closed. Assume G is f-diagnosable, and that n fulfills (2). We prove that (n, Ω_1)-diagnosability holds: consider s ∈ L_{F_1}(Ω_1) ∩ L(G) ∩ Λ*·Λ_o and let t ∈ L(G)/s with |π(t)| ≥ n. It is easy to show that s decomposes into s = s_1·f·s_2, so that, by (2), for any s′ with π(s′) = π(s·t), we have f in s′.
nfk SRXXRNGDP H]IKLUGHGRD IJE EKLD OKHTLGYKH EFK SJTE EFJE J SJ]XE RTT]LLKDTK RS!¹ Q RTT]LLKO ENGTK NGEFR]E LKIJGL RTT]LLKDTK RS w¹[ $'()* # ) A & / $'() 0 A* ) & $ nfgh TJD YK PKDKLJXG`KO ER J IJEEKLD LKTRPDG`GDP EFK RT Q T]LLKDTK RS ; SJ]XEH GOKDEGTJX RL DRE¹ NGEFR]E LKIJGL[ BC DE F CD EFK LKWJGDOKL RS EFK IJIKLM NK TRDHGOKL J HVHEKW NFRHK YKFJUGRL GH WROKXKO YV JD In G ; < =J ¹[ nfk RDXV JHH]WIEGRD WJOK RD G GH EFJE G GH b REGTK EFJE Q JXGUK[ G TJD YK DRD Q OKEKLWGDGHEGT[ \K JXHR TRDHGO KL J H]IKLUGHGRD IJEEKLD Q ; < =R ; ¹ OK Q DREGDP EFK XJDP]JPK a Z ¹ EFJE NK NJDE ER OGJPDRHK[ \K OKZDK EFK rstuvwrw Hyv{ x JH EFK ILRYXKW RS OKZDGDP J S]DTEGRD IJKL RD ELJTKH NFRHK GDEKDEGRD GH $ DX'06 - Peñaranda de Duero, Burgos (Spain) 119

130 ER JDHNKL EFK b]khegrd NFKEFKL ELJ^KTERLGKH TRLLKHIRDO Q GDP ER RYHKLUKO ELJTKH JLK LKTRPDG`KO RL DRE YV EFK IKLUGHGRD IJEEKLD[ \K OR LKb]GLK HRWK ILRIKLEGKH H]Q SRL IJKL vyyxª}uxww JDO µv²uqxq rstuvws{r r}~[ vy³ yxª}uxww WKJDH EFJE KH JDO R JDHNKLH HFR]XO YK JTT]LJEKM NFGXK µv²uqxq rstuvws{r r}~ WKJDH EFJE ELJ ^KTERLGKH GD a Q Z ¹ HFR]XO YK OGJPDRHKO NGEF ZDGEKXV WJDV RYHKLUJEGRDH[ nfk rstuvwrw yv{ x TJD YK HEJEKO JH SRXXRNH PGU KD JD In G JDO Q PGUKD J H]IKLUGHRLV IJEEKLD M OKTGOK NFKEFKL EFKLK KaGHEH JDO TRWI]EK GS JDV¹ J EFLKK UJX Q ]KO S]DTEGRD IJKL v wxyz G¹ < z OKTLKKGDPM SRL KJTF ELJTK s RS GM RD EFK WKWYKLHFGI GD a Z ¹ RS JDV ELJ^KTERLV GD g s m[ RLWJXXVM Ž dgjpdrhgh lrllktedkhh¹ nfk S]DTEGRD HFR]XO UKLGSV z GS g s m F a Z ¹ IJKL s¹ GS g s m m a Z ¹ ^ REFKLNGHK[ Ž UR]DOKO dgjpdrhjygxgev¹ «H G GH RDXV IJLEGJXXV RYHKLUKOM NK KaIKTE GD PKDKLJX HGE]JEGRDH NFKLK IJKL s¹ JH DKGEFKL g s m F a Z ¹ DRL g s mm a Z ¹ ^ FRXO¹[ RNKUKLM NK LKb]GLK EFGH ]DOK EKLWGDKO HGE]JEGRD DRE ER XJHE GD EFK Q SRXXRNGDP HKDHKnFKLK W]HE KaGHE M EFK YR]DOM H]TF EFJE NFKD KUKL Q g s m m a Z ¹M SRL JXX a G¹e m b M GS t ¹ EFKD IJKL t ¹¹ z [ dgjpdrhgh lrllktedkhh WKJDH EFJE EFK OGJPDRHGH RS J ELJTK s GH R GS DR ELJ^KTERLV GD GEH HKWJDEGTH g s m XGKH GD a Z ¹ NFGXK GE GH KH GS JXX ELJ^KTERLGKH GD g s m XGK GD a Z ¹[ UR]DOKO dgjpdrhjygxgev WKJDH EFJE NFKD RY Q HKLUGDP J ELJ^KTERLV GD a Z ¹M J KH JDHNKL HFR]XO YK ILRO]TKO JSEKL ZDGEKXV WJDV RYHKLUJYXK KUKDEH KK GP]LK o SRL JD GDE]GEGUK KaIXJDJEGRD RS EFKHK DREGRDH¹[ Compatible trajectories f No No No No???? Yes f GP]LK o nfk QOGJPDRHJYGXGEV SRL " RNM GS IJKL ILRUGOKH J lrllkte dgj PDRHGHM UR]DO Q KO dgjpdrhjygxgev TJD YK LKIFLJHKOM YV LKIXJTGDP IJKL t ¹¹ z NGEF gt ¹m F a Z ¹[ \K RYEJGD NFJE NK TJXX EFK ³qrstuvws{r r}~[ REGTK EFJE EFGH GH DRN J ILRIKLEV RS G NGEF LKHIKTE ER [ $% &'( G -. ¹ Q OGJPDRHJYXKE?@+,+ E?@+%+ * +, \ a Z ¹ m a G¹ m b \ a G¹e m b -A t ¹ 6@+% gt ¹m F a Z ¹ h¹ @ 3 6 G -. QOGJPDRHJYXK -A G -. ¹ Q OGJPDRHJYXK A),.) C+ H f QOGJPDRHJYGXGEV HJVH EFJE NFKD J ELJ^KTERLV KDOGD P NGEF JD RYHKLUJYXK KUKDE GH LKTRPDG`KO YV EFK H]IKLUGHGRD IJEEKLD M SRL JDV KaEKDHGRD NGEF KDR]PF RYHKLUJYXK KUKDEHM JDV ELJ^KTERLV TRWIJEGYXK NGEF EFK RYHKLUJEGRD t ¹ GH JXHR LKTRPDG`KO YV [ nfk LKWJLf YKSRLK dkzdgegrd GH SRLWJXG`KO YV, A IJ KL D ) C ),,+D 6-3B% ) 5.-.E 6@+% G -. 5/- 3B% ) A 3% / ) %92 -A 6@+ )7 %/+/ - 3B% ) , ) 9/. A), IJKL H «H ER HFRN EFK ]DGSVGDP SLJWKNRLf YJHKO RD H]IKL UGHGRD IJEEKLDHM NK FKLK TRDHGOKL EFK Q UKLV IJLEGT]XJL IKLUGHGRD IJEEKLD H]Q " RS KTEGRD i[om RLGPGDJXXV TRDHGOKLKO YV ghoj him NGEF EFK JHHRTGJEKO DREGRD RS! Q OGJ PDRHJYGXGEV[ IKE ]H ZLHE LKTJXX EFGH DREGRD[ IKE G YK JD In NFGTF GH JXGUK JDO FJH DR XRRI RS ]DRYHKLUJYXK KUKDE[ G GH! ³qrstuvws{ x NFKDKUKL M \! \ a G¹e GS EFKD \ a G¹ t ¹ t ¹ ` o¹! nfk SRXXRNGDP ILRIRHGEGRD LKXJEKH! Q OGJ PDRHJYGXGEV NGEF " Q OGJ PDRHJYGXGEV, g &+6 G 1+ 3% &'( 3% / 3..7C+ 6@ 3 6 G * + 3% 3. % ) 9 )) 8 )A - %6+,%39 + * +%6.H '@+% G -.! 5 /- 3B% ) A 3% / ) %92 -A G -. " 5 /- 3B% ) H \K ZLHE WJfK EFK SRXXRNGDP LKWJLfH #! GH Kb]GUJXKDE ER a Z "¹j #! GWIXGKH a Z "¹j «HH]WK G GH! Q OGJ PDRHJYXKM JDO EFJE S]XZXXH o¹[ \K ILRUK EFJE " ¹ Q OGJPDRHJYGXGEV FRXOH TRDHGOKL a Z "¹ m b JDO XKE a G¹e m b NGEF t ¹ j DREK EFJE EFKLKSRLK [ CE GH KJHV ER HFRN EFJE OKTRWIRHKH GDER NFKLK!M NGEF JOOGEGRDJXXV a G¹e [ RNM GWIXGKH M NFGTFM YV o¹m KDEJGXH EFJE SRL JDV NGEF a G¹ t ¹ t ¹M NK FJUK! 
a Z "¹[ nfgh GWIXGKH GD IJLEGT]XJL EFJE gt ¹m F a Z "¹[ ºKTGILRTJXXVM JHH]WK G GH " ¹ Q OGJPDRHJYXKMSRL HRWK [ IKE YK EFK XKDPEF RS EFK XRDPKHE ]DRYHKLUJYXK ELJ Q ^KTERLV GD G NFGTF KaGHEH YV JHH]WIEGRD¹M JDO TRDHGOKL h¹ [ lrdhgokl! JDO a G¹e NGEF EF]H t ¹ h¹[ \K FJUK ER ILRUK EFJE a G¹ NGEF t ¹ t ¹M NK FJUK! [ IKE j NGEF r b bm j b M JDO r b [ IKE [ «H ; GH HEJYXK JDO a Z "¹M a Z "¹ m b [ \K FJUK j a G¹e m b NGEF t j ¹ [ UV " ¹ OGJPDRHJYGXGEVM SRL JXX Q a G¹ NGEF t ¹ t ¹M NK FJUK! a Z "¹[ 8 /+ (02 A ( " 22 'B+0 \K DRN ILRIRHK JXPRLGEFWH SRL EFK dgjpdrhgh LRYXKW YJHKO RD HEJDOJLO RIKLJEGRDH RD In H[ CD J ZLHE HEJPK NK YJHK EFK TRDHEL]TEGRD RS EFK IJKL S ]DTEGRD RD EFK HVDTFLRDR]H ILRO]TE RS G JDO JDO GEH OKEKLWGDGHJEGRDM JDO ILRUK EFJE EFK S]DTEGRD IJKL TRWI ]EKH J lrllkte 120 DX'06 - Peñaranda de Duero, Burgos (Spain)

131 dgjpdrhgh[ KaEM NK ILRIRHK JD JXPRLGEFW JXXRNGDP ER TFKTf SRL EFK QOGJPDRHJYGXGEV RS JD In M EF]H KDH]LGDP EFK UR]DOKO dgjpdrhgh LRIKLEV RS EFK S]DTEGRD IJKL [ KDTK JTFGKUGDP EFK OKTGHGRD RS EFK dgjpdrhgh LRYXKW[ E C \K ILRIRHK J TRWI]EJEGRD RS EFK S]DTEGRD IJKL PGUKD G JD In JDO J H]IKLUGHGRD IJEEKLD M NK ZLHE TRDHGOKL EFK HVDTFLRDR]H ILRO]TE G RS G JDO HKK dkzdg EGRD Q o¹[ KaE NK IKLSRLW RD G J HKTRDO RIKLJEGRD HKK dkzdgegrd ¹ NFGTF JHHRTGJEKH ER G J OKEKLWGDGHEGT In NLGEEKD z G ¹[ \K EFKD HFRN FRN z G ¹ ILRUGOKH J S]DTEGRD IJKL OKXGUKLGD P J lrllkte dgjpdrhgh[ IKE ]H ZLHE GDELRO]TK J OKEKLWGDGHJEGRD S]DTEGRD[ &+ 6 : ; < => ¹ 1+ 3% &'(?-6@ r b k b H '@ + OKEKLWGDGHJEGRD RS : -. 6@+ &'( z :¹ b < > ¹?@+,+ oz 6@ +.+6 )A )A ; D399+ / WJTLR Q HEJEKHE > => 3%/ < ST r b ¹¹ Q 3%/ b H REGTK EFJE SRL EFGH OKZDGEGRD EFK EJLPKE WJTLR Q HEJEK RS J ELJDHGEGRD K< GH RDXV TRWIRHKO RS HEJEKH = RS : NFGTF JLK EJLPKEH RS HKb]KDTKH RS ELJDHGEGRDH = L< 6 K = KDOGDP NGEF JD RYHKLUJYXK KUKDE [ nfk LKJ HRD SRL EFGH OKZDGEGRD GH EFK Q TRFKLKDTV NGEF g m[ CD SJT EM SLRW EFK OKZDGEGRD RS Q < GD z :¹M NK GDSKL EFJE S T J s¹ ST =J g s m¹m NFGTF WKJDH EFJE EFK WJTLR Q HEJEK LKJTFKO SLRW J YV s GD z :¹ GH TRW IRHKO RS EFK HKE RS HEJEKH EFJE JLK LKJTFKO SLRW Q =J YV ELJ^KTERLGKH RS g s m GD : [ GDJXXVM OKEKLWGDGHJEGRD ILKHKLUKH ELJTKHM HR NK FJUK a z :¹¹ ysªxw z :¹¹ ysªxw :¹[ \K DRN KaIXJGD EFK TRDHEL]TEGRD RS EFK OGJPDRHKL SLRW G JDO [ IKE ]H ZLHE TRDHGOKL EFK HVDTFLRDR]H ILRO]TE G G G HKK dkzdgegrd o¹[ \K EFKD PKE a G ¹ a G¹ma ¹ a G¹ JH GH TRWIXKEK EF]H a ¹ ¹[ \K JXHR PKE a G ; G ; ¹ a G¹ m a Z ¹ WKJDGDP EFJE EFK ELJ^KTERLGKH RS G JTTKIEKO YV JLK KaJTEXV EFK JTTKIEKO ELJ^KTERLGKH RS G [ GDJXXV DREK EFJE ; G; GH HEJYXK GD G JH YREF ; JDO ; JLK HEJYXK YV JHH]WIEGRD[ \K DRN JIIXV OKEKLWGDGHJEGRD ER G [ \K FJUK ysªxw z G ¹¹ ysªxw G ¹ v wxyz G¹ EF]H SRL JXX s v wxyz G¹M S J s¹ S =J g s m¹[ \K DRN KHEJYXGHF EFK SRXXRNGDP S]DOJWKDEJX LKH]XEH RD EFK TRDHEL]TEGRD z G ¹, ), 3%2 s v wxyz G¹ v wxyz G ¹E S > s¹ F ; G; ` g s m F a Z ¹ i¹ S > s¹ m ; G; ^ ` g s m m a Z ¹ l¹ ^ i¹ WKJDH EFJE JXX ELJ^KTERLGKH TRWIJEGYXK NGEF J ELJTK s JLK JTTKIEKO YV GS JDO RDXV s XKJOH ER J WJTLR Q HEJEK RDXV TRWIRHKO RS WJLfKO HEJEKH GD G [ l¹ WKJDH EFJE JXX ELJ^KTERLGKH TRWIJEGYXK NGEF s JLK DRE JTTKIEKO YV GS JDO RDXV GS s XKJOH ER J WJTLR Q HEJEK RDXV TRWIRHKO RS ]DWJLfKO HEJEKH GD G [ nfk ILRRS RS i¹ GH KHEJYXGHFKO YV EFK SRXXRNGDP HKb]KDTK RS Kb]GUJXKDTKH S J s¹ F ; G; ` S =J g s m F ; G; ` g s m F a ZoZ G ¹ ` g s m F a G¹ m a Z ¹ GWGXJLXVM SRL EFK ILRRS RS l¹ NK FJUK S J s¹ m ; G; ^ S ` =J g s m m ; G; ^ ` g s m m a ZoZ G ¹ ^ ` g s m m a G¹ m a Z ¹ ^ ` g s m m a Z ¹ ^ JH g s m F a G¹ \K FJUK DRN EFK WJEKLGJX ER OKZDK EFK S]DTEGRD IJKL JDO ER RYEJGD EFK lrllktedkhh dgjpdrhgh LRIKLEVM SRX Q XRNGDP OGLKTEXV SLRW LRIRHGEGRD i[!: + &+ 6 z G ¹ 1+ 6@+ &'( )* +E 3%/ 9+6 IJKL s¹ 1+ (! -A S R s¹ F ; G; "! -A S R s¹ m ; G; ^ #! )6@+,?-.+ IJKL D ) C ),,+D 6-3B% ).-.H $ +, % ), /+, 6) ,36+ 6@+ /- 3B% ).+, D ) %5.6,7D6-) % E D ) %.-/+, 6@+ &'( G )A -B7,+ % 9+ A6 3%/.-/+ H $..7 C+?+? 3% 6 6) /- 3B% ).+ 6@+ ) DD7,,+%D+ )A 6@ + A * +%6! H + 6@ @+.78+, *-.-) % 8366+,% " /+5. D, -1+/ -% -B7,+ & 3%/ 17-9/ 6@+ 8, )/7 D6 G G G " H % 6@-. D3.+ G -. -.) C D 6) G?-6@.+6 )A C3, +/ o i lh '@+ /- 3B% ).+, 3.? %.?+,. 
)163 - %+/ 1 2 /+6+,C-% ) % )A G ),+8,+.+%6+/ -% -B7,+ %,-B@6 3% /.-/+ H, * , ', (, ) - * /, *+(, *+(+) - ) GP]LK i G JDO GEH JHHRTGJEKO OGJPDRHKL N[L[E[ " 7 8E C 9 DE 8 «H NK FJUK KHEJYXGHFKO EFK lrllktedkhh RS IJKL M JT Q TRLOGDP ER LRIRHGEGRD hm EFK UR]DOKO dgjpdrhjygxgev LRIKLEV RS IJKL GH ILRUGOKO Y V EFK QOGJPDRHJYGXGEV RS G[ \K DRN ILRIRHK JD JXPRLGEFW SRL OKTGOGDP Q OGJPDRHJYGXGEV dkzdgegrd ¹[ nfgh JXPRLGEFW GH JOJIEKO SLRW g j hlm[ nfk GOKJ GH EFJE G GH DRE QOGJPDRHJYXK GS EFKLK KaGHEH JD JLYGELJL Q GXV XRDP ELJTK s M H]TF EFJE ENR ELJ^KTERLGKH TRWIJEGYXK NGEF s OGHJPLKK RD a Z ¹ WKWYKLHFGI HKK EFK JYRUK KaJWIXK¹[ \K ZLHE GDELRO]TK EFK dkxjvq YHKLUJEGRDJX Q lxrh]lk :; G ¹ EFJE ILKHKLUKH EFK GDSRLWJEGRD JYR]E a Z ¹ WKWYKLHFGI NFGXK JYHELJTEGDP JNJV ]DRYHKLU Q JYXK KUKDEH[ KaEM J HKXS Q ILRO]TE :; G ¹ G :; G ¹ JXXRNH ER KaELJTE SLRW J ELJTK s IJGLH RS ELJ^KTERLGKH RS G JDO ER TFKTf EFKGL a Z ¹ WKWYKLHFGI JPLKKWKDE[ ¹ DX'06 - Peñaranda de Duero, Burgos (Spain) 121

132 ), 3% & '( : ; < => ¹E 6@+ dkxjvq YHKLUJEGRDJX Q lxrh]lk RS : -. 6@+ &'( :¹ ; b <b => ¹?@+,+ = K<b =?@+%+ * +, = LK< = -% : A),.) C+ r b 3%/ b H UV OKZDGEGRDM SRL JXX s v wxyz :¹M =J :; G ¹ GS JDO RDXV M g s m H[E[ = L< = GD : [ <b = GD lrdhgokl DRN :; G ¹ ; b <b =J ¹ JDO XKE :; G ¹ G:; G ¹ YK EFK In ; G ; b < =J =J ¹¹[ UV OKZDGEGRD RS :; JDO HVDTFLRDR]H ILRO Q ]TEM GS s v wxyz G¹ JDO =J =J ¹ < = =¹ EFKLK KaGHEH g s m H[E[ =J L< = JDO =J L< = GD G [ dgukd OKZDKO JH JYRUK[ \K HJV EFJE = = ¹ ; G; GH ³qx}xy ruxq NFKDKUKL = ;G; ` = ;G; [ EFKLNGHKM EFKV JLK TJXXKO ²uqx}xy ruxq[ «IJEF GD GH TJXXKO JD ³²uqx}xy ruxq s} GS GE TRDEJGDH h TRDHKT]EGUK Q]DOKEKLWGDKO HEJEKH EF]H KUKDEH YKENKKD EFKW¹[ «IJEF GD GH JD ²uqx}xy ruxq ª~ª x GS GE GH J TVTXK NFGTF HEJEKH JLK JXX ]DOKEKLWGDKO[ \K DRN HFRN EFK LKXJEGRD YKENKKD ¹ Q OGJPDRHJYGXGEV JDO EFK KaGHEKDTK RS Q]DOKEKLWGDKO IJEFH[ ++ G -. ¹ 5 /- 3B% ) A 3% / ) %92 -A 6@+,+ -. % %/+6+,C-%+/ 836@ -% E ]IIRHK EFKLK GH DR LKJTFJYXK IJEF[ IKE Q]DOKEKLWGDKO a Z ¹ m b JDO a G¹e m b NGEF QQt ¹QQ [ \K HFR]XO ILRUK EFJE SRL JXX gt ¹mM a Z ¹[ IKE s t ¹[ «DV IJEF =J =J ¹ < = =¹ GD TJD YK OKTRWIRHKO GDER J IJEF =J =J ¹ < n w w < p = =¹ NGEF s t ¹ JDO sj t ¹[ \K FJUK QQ sj QQ QQt ¹QQ EF]H w w ¹ < p GH J IJEF NGEF = =¹ KUKDEH[ UV FVIREFKHGHM RDK RS EFKHK H EJEKHM Q HJV = = ¹ GH OKEKLWGDKOM JDO JH a Z ¹M = = ¹ GH H]LKXV GD ; G; ¹ j [ RN JH ; G; GH HEJYXKM = = ¹ ; G ; ¹ j [ LRW EFGH GE GH TXKJL EFJE SRL JXX gt ¹mM a Z ¹[ lrduklhkxvm H]IIRHK DRN EFJE EFKLK GH JD Q ]DOKEKLWGDKO IJEF w w ¹ < p = =¹ GD NGEF QQ sj QQ G[K[ JXX HEJEKH RD EFK IJEF JLK ]DOKEKL WGDKO¹[ CS EFGH IJEF GH LKJTFJYXK EFKLK GH JXHR J IJEF Q =J =J ¹ < n w w ¹[ «H w w ¹ GH ]DOKEKLWGDKOM EFKLK KaGHEH g s m NGEF a Z ¹ m b M JDO e a Z ¹[ nfklk JXHR KaGHEH a G¹e m b NGEF t ¹ sj [ «H JXX HEJEKH GD IJEF JLK ]DOKEKLWGDKOM EFKLK KaGHEH a G¹e m b NGEF t ¹ sj M Y]E e a Z ¹[ \K EF]H FJUK a Z ¹ m b JDO a G¹em b NGEF QQt ¹QQ M JDO gt ¹m NGEF e a Z ¹[ nfgh ILRUKH EFJE G GH DRE ¹ OGJPDRHJYXK[ Q!: + g G -. 5/- 3B% ) A 3% / ) %92 -A 6@+, D 6@ D ) % 63 - %. % %/+6+,C-%+/ 836@H UJHKO RD EFKRLKW o JDO RD EFK SJTE EFJE GH ZDGEK HEJEKM NK TRDTX]OK EFJE G -. %)6 5 /- 3B% ) A 3% / ) %92 -A D ) %63 - % %/+6+,C-%+/ D2D9+ H HGDP LRIRHGEGRD h JDO EFK TRDHEL]TEGRD RS M UKLGSVGDP OGJPDRHJYGXGEV JWR]DEH ER TFKTf EFK KaGHEKDTK RS LKJTF JYXK Q ]DOKEKLWGDKO TVTXKH GD M LKELGKUGDP EFK GOKJ RS EFK JXPRLGEFW RS ghlj m[ UV lrlrxxjl V h JDO IKWWJ hm g A G -. 5 /- 3B% ) %/ -. 6@+ 9+%B6@ )A 6@+ 9 ) %B+.6 7%/+6+,C-%+/ 836@ )A E 6@+% G -. h¹ 5 /- 3B% ) %/ % )6 ¹ 5 /- 3B% ) \K DRN H]WWJLG`K EFK ILRTKO]LK ER OKEKLWGDK NFKEFKL G GH QOGJPDRHJYXK \K IKLSRLW J OKIEF ZLHE HKJLTF RD NFGTF KGEFKL KaFGYGEH ]DOKEKLWGDKO TVTXK RL KDOH YV FJUGDP TRWI]EKO EFK XKDPEF RS EFK XRDPKHE ]DOKEKLWGDKO HKb]KDTK[ YUGR]HXVM EFGH FJH XGDKJL TRHE GD EFK HG`K RS [ $ +, g % ), /+, 6) ,36+ 6@+ D ) %.6,7D6-) % )A E D ) C+ 1 3D 6) 6@+ 3C89+ &H G "¹ -. B-* +% -% -B7, A6 3%/.-/+ H '@ +,+D63%B9+. D ),,+.8 ) %/ 6) 6@+ C3, +/.636+.H ")? G ¹ G G ¹ -. B-* +% -% -B7,+ 4,-B@6 3% /.-/+ H $!"# GP]LK l :; G ¹ JDO SRL EFK In RS GP]LK i '@ h i¹ i h¹ h l¹ l h¹ -% 3,+ 7%/+5 6+,C-%+/H ")? ).@)? 6@ 3 6 6@+,+ -. % ) 7 %/+5 6+,C-%+/ D2D9+ E?@- 3DD ), /-%B 6) ),, ) 93,2 & +%.7,+. 6@ 3 6 G -. " 5 /- 3B% ) H % /++/E 3..)) % 3.! -. 6,-B5 B+,+/E%-. )1.+, * +/ 3 A6+, 6@+ ) DD7,,+%D+ )A 3 0 %-6+ %7C5 1+, )A )1. +, * * +%6. 1)7 %/+/ 1 2 % H '@7. )1.+, *- %B %.7,+92 - %/- D @ 3 6! ) DD 7,+/ -% 6@+ 83 «.6H TRDELJLGRE D ) %.-/+, 6@+ &'( G -% -B7,+ H B-* +% -% 3. 7%/+6+,C-%+/ D2D9+. 
- % h i¹ 3% / i h¹e 6@7.E G -. % )6 " 5 /- 3B% ) H % A 3D 6E A), 3%2 E 6@+ 6,3' +D 6), -+. x 3%/! j x 3,+ 1)6@ D ) C ?-6@ s x E?@-9+ e a "¹E?@ +,+3. j a 0 1( "¹H 2, 1(,, *. ', (, ) * / -, *+(, *+(+) - ( GP]LK G JDO GEH JHHRTGJEKO OGJPDRHKL N[L[E[ " %D -/+%63992 E 6@-. +3C89+ 8, )* +. 6@ 3 6 5/- 3B% ) D3%% ) D +/ /-,+D692 ) % 6@+ /- 3B% ).+,H % A 3D 6 6@+ /- 3B% ).+,. A), G 3% / " -B7,+ E,-B@6 3% / A), G A), -B7,+ %E,-B@6 3,+ -.) C D E 3% / G -. " 5 /- 3B% ) ?@-9+ G -. % )6H 122 DX'06 - Peñaranda de Duero, Burgos (Spain)

133 GP]LK SRL EFK In G RS zajwixk h 6 %1& 2 "0&+ nfk KaJWIXK NK OGHT]HH FKLK JDO PGUKD GD GP]LK p GXX]H Q ELJEKH EFK JIILRJTF ILKHKDEKO JYRUK[ CD EFGH KaJWIXKM NK HGWIXV WROKX EFK WRUKWKDE RS J IKLHRD GD J Y]GXOGDP TRWIRHKO RS JD vªx C¹M J r{ysy~ U¹M J yxªx }rvu «¹ JDO J ªvxx³w v l¹[ nfk ORRLH SLRW RDK IJLE RS EFK Y]GXOGDP ER JDREFKL TJD YK EJfKD GD RDXV RDK OGLKTEGRD[ nljdhgegrdh h WROKX EFK TLRHHGDPH RS EFK ORRLH[ RWK ORRLH JLK HKT]LKO YV JTTKHH Q TJLOH IRHHGYXV JXXRNGDP EFK RYHKLUJEGRD¹[ \K JHH]WK EFJE EFKLK KaGHE JTTKHH Q TJLOH A I C B "!! " GP]LK p G JDO GEH H]IKLUGHGRD IJEEKLD SRL EFK ORRLH M j JDO #M WKJDGDP EFJE NFKD JTEGUJEKOM GE GH IRHHGYXK ER RYHKLUK EFK SJTE EFJE RDK IKLHRD TLRHHKH EFK ORRL[ \K TRDHGOKL EFK H]IKLUGHGRD IJEEKLD PGUKD GD GP]LK p¹ NFGTF KaILKHHKH EFK SJTE EFJE PRGDP ENGTK ER TReKK Q HFRI NGEFR]E PRGDP ER EFK XGYLJLV GH J YKFJUGR]L EFJE FJH ER YK H]IKLUGHKO[ RXXRNGDP EFK OGeKLKDE HEKIH OKHTLGYKO GD EFK ILKUGR]H HKTEGRDHM EFK ILRO]TE G G G GH ]HKO ER XJYKX EFK HEJEKH RS G NGEF LKHIKTE ER EFK H]IKLUGHGRD IJEEKLD [ nfk TRLLKHIRDOGDP In GH OKHTLGYKO GD GP]LK $[ % I,N % B,N % %( % %( %& %' I,N1 %) % A,N C,N1 A,N1 %) GP]LK $ G C,F %& B,F % %' %) IKE ]H ZLHE JHH]WK EFJE RDXV EFK JTTKHH Q TJLOHM TRL Q LKHIRDOGDP ER EFK KUKDEH M j JLK JTEGUJEKO JDO EF]H RYHKLUJYXK G[K[ b j [ REGTK EFJE EFK HVHEKW EFKD FJH GDEKLDJX KUKDEH XRRIH[ nfk RYHKLUJYXK HVHEKW G[K[ :; G ¹¹ GH PGUKD GD GP]LK ~[ A,F % % %( I,F %( % I,N % A,N %( %( %( I,N1 %( % A,N1 %( A,F %( %( GP]LK ~ :; G ¹ SRL b j CE GH KJHV ER TFKTf EFJE EFK In :; G ¹ G :; G ¹ DRE LKILKHKDEKO FKLK¹ FJH JD ]DOKEKLWGDKO LKJTFJYXK TVTXK *+, +*+, p < *+, p +*+, < *+, +*+- ṗ < *+, +*+- M EF]H G GH DRE NFKD EFK HKE RS RYHKLUJYXK KUKDEH GH QOGJPDRHJYXK b j [ RNKUKLM GS EFK JTTKHH Q TJLO # GH JTEGUJEKO G[K[ b j # ¹M EFK RYHKLUJYXK H VHEKW :; G ¹ GH PGUKD YV EFK In LKILKHKDEKO GD GP]LK hk[ % 0 % I,N % 0 % A,N %( % % %( I,N1 %( % A,N1 % %( A,F %( GP]LK hk :; G ¹ SRL b j # DK TJD TFKTf EFJE :; G ¹ GH OKEKLWGDGHEGT[ nf]h :; G ¹ G :; G ¹ GH GHRWRLIFGT ER :; G ¹M EF]H QOGJPDRHJYXK FJH DR ]DOKEKLWGDKO TVTXK[ lrdhkb]kdexvm G GH SRL b j # [ REK EFJE NK JXHR FJUK EFJE z G ¹ :; G ¹[ nf]h :; G ¹ JTE]JX Q XV TRLLKHIRDOH ER EFK OGJPDRHKL[ /,#+12 nfk ILKHKDE IJIKL JOURTJEKH EFK ]HK RS H]IKLUGHGRD IJE Q EKLDH SRL EFK OKHTLGIEGRD RS OGJPDRHGH RY^KTEGUKH[ «H]Q IKLUGHGRD IJEEKLD GH JD J]ERWJERDM XGfK EFK RDKH ]HKO GD WJDV OGeKLKDE ORWJGDH UKLGZTJEGRDM WROKX Q YJHKO EKHE Q GDPM IJEEKLD WJETFGDPM KET¹M GD RLOKL ER ]DJWYGP]R]HXV OKDREK J SRLWJX XJDP]JPK[ «H GXX]HELJEKO GD EFK IJIKLM EFK SJ]XE Q RTT]LLKDTK OGJPQ DRHGH GH J IJLEGT]XJL TJHK RS IJEEKLD OGJPDRHGHM Y]E IJE Q EKLDH JLK JXHR ]HKS]X ER OKHTLGYK WRLK PKDKLJX RY^KTEGUKHM JH HFRND GD H]YHKTEGRDH i[h JDO HKTEGRD [ nfk TRDTKIE RS H]IKLUGHGRD IJEEKLDH GH KUKD WRLK JEELJTEGUK GD EFK HKDHK EFJE IJEEKLDH TJD YK TRWIRHKO ]HGDP ]H]JX TRWYGDJERLH GDFKLGEKO SLRW XJDP]JPK EFKRLV ]DGRDM GDEKLHKTEGRDM TRD Q TJEKDJEGRDM KET[¹[ \K JLK GDEKLKHEKO GD OGJPDRHGDP EFK RTT]LLKDTK RS ELJ Q ^KTERLGKH UGRXJEGDP J HJSKEV ILRIKLEVM NFGTF YV OKZDGEGRD TJD YK UGRXJEKO RD J ZDGEK ILKZa[ CE GH EFKD DJE]LJX ER JHH]WK EFJE IJEEKLDH LKTRPDG`K KaEKDHGRD Q TXRHKO XJD Q P]JPKHM GD EFK HKDHK EFJE GS J ELJ^KTERLV RS EFK HVHEKW YKXRDPH ER EFK XJDP]JPKM HR ORKH JDV KaEKDHGRD RS EFGH ELJ^KTERLV[ nfgh GH EKTFDGTJXXV JTFGKUKO YV EFK w}s{r r}~ JHH]WIEGRD RD EFK J]ERWJERD [ «ORIEGDP EFK YKFJU Q GRLJX ILRIKLEGKH IRGDE RS UGKN RD EFK IJEEKLDH XKJOH ER EFK %( I,F I,F % DX'06 - Peñaranda de Duero, Burgos (Spain) 123

134 JEEKWIE ER OGJPDRHK JDV XGDKJL EGWK I]LK IJHE EKWIRLJX SRLW]XJH g$m[ CE GH TXKJL EFJE EFK ILRIKLEGKH NK TRDHGOKL OR DRE WKKE EFK InI OKZDJYXK ILRIKLEGKH FJDOXKO YV g m[ CD EFK NRLLV RS KaIRHGDP J SJGLXV PKDKLJX SLJWKNRLf SRL OGJPDRHGH GHH]KHM EFK dgjpdrhgh LRYXKW GH ILKHKDEKO GD J LJEFKL OKDREJEGRDJX HIGLGEM JH RIIRHKO ER EFK RIKLJEGRDJX HIGLGE NK ZDO GD EFK XGEKLJE]LK NK I]E EFK KWIFJHGH RD EFK OGJPDRHGH S]DTEGRD IJKL NGEF GEH TRLLKTEDKHH JDO YR]DOKODKHH OGJPDRHJYGXGEV ILRIKLEV[ lrllktedkhh GH JD KHHKDEGJX ILRIKLEV EFJE KDH]LKH EFK JTT]LJTV RS EFK DRHGH[ _RLKRUKLM OGJPQ UKLGSVGDP EFK OGJPDRHJYGXGEV ILRIKLEV RS EFK HVHEKW NGEF LKHIKTE ER EFK H]IKLUGHGRD IJEEKLD P]JL JDEKKH EFJE NFKD Q ]HGDP IJKL RDXGDKM JD RTT ]LLKDTK RS EFK IJEEKLD NGXX KUKDE]JXXV YK OGJPDRHKOM JDO EFJE EFGH KUKDE]JXGEV TJD YK b]jdegzko[ CE GH EFK HEJDOJLO DREGRD RS dgjpdrhjygxgevm Y]E HKKD FKLK JH J WKLK WKJD ER JTFGKUK J HJEGHSJTERLV OGJPDRHGH S]DTEGRDj NK JLK JNJLK EFJE EFGH IRGDE RS UGKN OGeKLH SLRW REFKL TXJHHGTJX JI ILRJTFKH[ nfk OKZDGEGRD RS Q OGJPDRHJYGXGEV JH ILRIRHKO FKLK GH J]ERWJEJ Q YJHKOM NGEF G JDO M Y]E TR]XO JH NKXX YK KaILKHHKO GD J XJDP]JPK Q YJHKO SLJWKNRLf[ \K DRN E]LD ER EKTFDGTJX JHIKTEH RS EFK JIILRJTF[ \K FJUK GDHGHEKO RD NFJE EFK HKWJDEGTH RS J ELJTK GH J ELJTK OKDREKH EFK HKE RS ELJ^KTERLGKH NFGTF ILR^KTE RDER EFGH ELJTK JDO EFJE DKTKHHJLGXV KDO ]I NGEF JD RYHKLUJYXK KUKDE[ lrdhkb]kdtkh RS EFGH TFRGTK JLK WJDGSRXO GD EFK OKZDGEGRDH RS QOGJPDRHJYGXGEVM z G ¹ JDO :; G¹[ \K TR]XO FJUK TFRHKD JDREFKL HKWJDEGTHM GWIJTEGDP RD EFK LKXJEKO OKZDGEGRDH JTTRLOGDPXV SRL KaJWIXK NK TR]XO FJUK TRDHGOKLKO EFK HKE RS ELJ^KTERLGKH NFGTF ILR^KTE RD ER EFGH ELJTK[ \FJE GH Q WRHEXV GWIRLEJDE GH EFK JTT]LJEK WJETF YKENKKD EFK HKWJDEGTH SRL ELJTKH JDO EFK REFKL OKZDGEGRDH FKDTK NK JURGO OGHIXKJHGDP OGHTLKIJDTGKH ER OKEKLWGDK ILKTGHKXV EFK dgjpdrhjygxgev UR ]DOM JDO KUKD YKEEKLM NK FJUK J TXKJL ILRRS SRL EFK TRLLKTEDKHH RS EFK HVDEFKHGH JXPRLGEFW[ RNKUKLM NK YKXGKUK R]L TFRGTK GH EFK WRHE DJE]LJX NFKD JOWGEEGDP EFJE EFK OGJPDRHGH S]DT EGRD GWIXKWKDEKO RDXGDK JH JD R]EI]E UKLOGTE GH LKJTEGUK Q ER JD RYHKLUJYXK WRUK RS EFK HVHEKW[ «WRLK HRIFGHEGTJEKO OGJPDRHGH EFJD EFK RDK KaIXJGDKO FKLK TJD YK OKLGUKO SLRW R]L TRDHEL]TEGRD Q EFGH GH SJGLXV HEJDOJLO SRL KaJWIXKM NK TJD EJfK JOUJDEJPK RS fdrn Q GDP EFJE EFK dgjpdrhjygxgev YR]DO GH KaJTEXV [ «HH]WK EFJEM JSEKL J ELJTK s M EFK S]DTEGRD IJKL ILRO ]TKH RD o TRDHKT]EGUK KUKDEHM EFKD DKTKHHJLGXV EFK ELJ^KTERLGKH TRWIJEGYXK NGEF s TJDDRE FJUK WKE EFK IJEEKLD[ \K EKLWGDJEK EFK OGHT]HHGRD NGEF S]E]LK NRLf IKL HIKTEGUKH Q JGWGDP ENR GDOKIKDOKDE RY^KTEGUKH[ nfk ZLHE RY^KTEGUK GH ER KaEKDO EFK JXPRLGEFWH ER WRLK KaILKHHGUK TXJHHKH RS HVHEKWHM H]TF JH GDZDGEK HVHEKWH NFKLK OJEJ GDSRLWJEGRDH GH KaIXRGEKOj EFGH NR]XO KDXJLPK HGPDGZTJDEXV EFK JIIXGTJYGXGEV RS EFK WKEFROH[ nfk HKTRDO RY^KTEGUK GH ER LKXJa EFK HEJYGXGEV JHH]WIEGRDM RL Kb]GUJXKDEXV ER E]LD ER XJDP]JPKH NFGTF JLK DRE KaEKDHGRD Q TXRHKOM GDEKDOGDP ER KDTRWIJHH SLJWKNRLfH XGfK gom SRL GDEKLWGEEKDE SJ]XEHM SRL NFGTF J IRHHGYXK H]IKLUGHGRD IJEEKLD NR]XO YK PGUKD YV EFK SRXXRNGDP In [ $'()* # % & & % $'(A* $'()* && $'(A*.A#2 ; ŒŽ Œ ŽŒ Ž Ž š ŽŒƒ ƒ Œ ˆ ƒžƒ ƒ Ž! "!! #"$ ƒ % & % '' % : (ŒŽ Ž Œ Ž Ž š Ž ) ƒ š ŽŒƒ ƒ Œ Ž Ž ƒ * + *,-., -/ 0,! % 1 & % % % 2 3 š ŠŒ Œ Ž Ž š Ž ) ƒ Ž Ž (ŒŒ Ÿ ) Œ Œ Œ ƒ Œ ŽŒƒ ƒ Œ ƒ Ÿ ˆ Ž ƒžƒ ƒ * + *,-., - / 0,! 4% 122 & % œ 5 Š ; Žˆ Ž ƒ ( 6 3 Ž š ƒ Š ƒ ŒŽƒ ŒŽ Œ ƒ ˆ Ž ƒžƒ ƒ Ž "! * #*$ ž Ž ž % 7 6 Ž 8 Ž 9 ( Ž Ž 3 : Œ žžœ Œ Œ ŽŒƒ Š ž Œ ƒ ˆ Ž ƒžƒ ƒ " 0 -! 
1 2 & 2% % 6 Ž Ž 3 : 5 ŽŒƒ ƒ Œ ƒ ˆ Ž ƒžƒ ƒ ; Ž Ÿ Œ Œ ƒ < ŒŽƒ Ÿ " 0 -! ' 1'2 & ' 7 % 6 Ž 3 : Ž 8 œ š ŽŒƒ ƒ Œ 4 Ž Ž ƒ Ž ƒ ˆ Ž ƒžƒ ƒ Ÿ " 0 => - ' % 12 & 2%2 % 2 5 Œ ƒƒ Ž Ž ŽŒ Š Ž ž Œ Œ Œ ƒ ; ƒ Ÿ 0! -. % 12 2 & 2% Š ''7 ' 3 Ž Œ ƒ ŒŽ Œ ŒŠƒ ˆ Š Ž Ÿ Š ˆ Œ Ž?!!@- 0! > -- # $ ˆŒ % Œ AB. ƒ %7 & 2 Ž Ÿ 9 ' C Ž Œ D Ž Ÿ : (Œ Œ ;Œ Œ Ž ƒ ŽŒƒ ƒ Œ ƒ ƒ ˆ Ž ƒžƒ ƒ Ž ƒ ŒŽ Œ Œ Ž ŒŽ ;Œ ƒ Ž Ÿ! "!! E! Ÿ % % 7 3Œ ) D Ž Ÿ : (Œ š ŽŒƒ Ž ƒ Ÿ ˆ Ž ƒžƒ ƒ 1 Ž F Ž Ž Œ Ž ŒŽ ƒ Ž ;Œ Ÿ Ž G "! HI * +., - ƒ 2 & 2 '' % 3 Ž Œ Ž : Ž Œ Ž Ž Ÿ š Ž ) ƒ š ŽŒƒ Š ž Œ ƒ ˆ Ž ƒžƒ ƒ " 0 -! ' & 7 7 ''7 2 3 Ž Œ Ž : Ž Œ Ž Ž š Ž ) ƒ 5 ŽŒƒ ƒ ƒ Ž ƒ Ÿ ˆ Ž Œ ƒ " 0!., - 0!, % 1 7 & % ƒ '' CŒŒ Ž Œ Ž Œ žžœ Ÿ ˆ < ŒŽ Œ ŽŒƒ Š ž Œ žÿ ŒŠƒ ˆ ƒ Ÿ ˆ Ž ƒžƒ Ÿ ƒ " 0J -! ' % % 124 DX'06 - Peñaranda de Duero, Burgos (Spain)

On-line diagnosis for Time Petri Nets

G. Jiroveanu†, B. De Schutter‡, R.K. Boel†
† EESA - SYSTeMS, University of Ghent, Belgium
‡ DCSC, Delft University of Technology, The Netherlands, b.deschutter@dcsc.tudelft.nl

Abstract
We derive in this paper on-line algorithms for fault diagnosis of Time Petri Net (TPN) models. The plant observation is given by a subset of transitions, while the faults are represented by unobservable transitions. The model-based diagnosis uses the TPN model to derive the legal traces that obey the received observation and then checks whether or not fault events occurred. To avoid the consideration of all the interleavings of the concurrent transitions, the plant analysis is based on partial orders (unfoldings). The legal plant behavior is obtained as a set of configurations. The set of legal traces in the TPN is obtained by solving a system of (max,+)-linear inequalities called the characteristic system of a configuration. We present two methods to derive the entire set of solutions of a characteristic system, one based on the Extended Linear Complementarity Problem and a second based on constraint propagation, which exploits the partial order relation between the events in the configuration.

1 Introduction
This paper deals with the diagnosis of TPNs. TPNs are extensions of untimed Petri Nets (PNs) where information about the execution delay of some operations is available in the model. In a TPN a transition can be fired within a given time interval after its enabling, and its execution takes no time to complete. A trace in the plant comprises the transitions that are executed in the TPN model (the untimed support) as well as the times of their occurrence. Since a transition can be executed at any time within an interval after it has become enabled, the state space of TPNs is in general infinite. Methods based on grouping states under a certain equivalence relation into so-called state classes were proposed in [2]. The state class graph was proved to be finite iff the net is bounded; thus infinite state spaces can be finitely represented and the analysis of TPN models is computable.

† Supported by a European Union Marie Curie Fellowship during his stay at Delft University of Technology (HPMT-CT ). Currently with TRANSELECTRICA SA, Craiova, Romania. george.jiroveanu@transelectrica.ro

We consider the plant observation given by a subset of transitions whose occurrence is always reported. Moreover, the time when an observed transition is executed is measured and reported according to a global clock. The unobservable events are silent, i.e. the execution of an unobservable transition is not acknowledged to the monitoring system. The faults are modeled by a subset of unobservable transitions. The model-based diagnosis for TPNs comprises two stages. First the set of traces that are legal and that obey the received observation is derived, and then the diagnosis result of the plant is obtained by checking whether some or all of the legal traces include fault transitions. The diagnosis of a TPN can be derived based on the computation of the state class graph as proposed in [5]. However, the analysis of TPNs is not tractable even for models of reasonable size because of the interleaving of (unobservable) concurrent transitions. Partial orders were shown to be an efficient method to cope with the state space explosion of untimed PNs because the interleaving of concurrent transitions is filtered out [4],[8].
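The blow-up caused by interleaving semantics, which motivates the partial-order approach taken here, is easy to make concrete: k pairwise concurrent transitions generate k! interleaved firing sequences but only one partial order. A minimal illustration in plain Python (not taken from the paper):

```python
from itertools import permutations

# k pairwise concurrent (independent) transitions: an interleaving-based
# analysis must explore every firing order, whereas an unfolding stores the
# single set of causally unordered events.
concurrent = ["t1", "t2", "t3", "t4"]

interleavings = list(permutations(concurrent))
print(len(interleavings))   # 24 = 4! distinct but equivalent orderings
# A partial-order representation keeps one configuration whose events are
# mutually unordered, so its size stays linear in k for this example.
```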
Partial-order methods were also found applicable for the analysis of PN models where time is considered as a quantifiable and continuous parameter [1],[3]. In this paper we extend the results presented in [6],[7] by presenting on-line algorithms for the diagnosis of TPNs based on partial orders. The plant analysis is based on time configurations (time-processes in [1]). A time configuration is an untimed configuration (a configuration in the net-unfolding of the untimed PN support of the TPN model) with a valuation of the execution times for its events. A time configuration is legal if there is a time trace in the original TPN that can be obtained from a linearization of the events of the configuration, where the occurrence times of the transitions in the trace are identical with the valuation of their images in the time configuration. A linearization of the events in a configuration is a trace that comprises all the events of the configuration, executed only once, such that the partial order between the events in the configuration is preserved in the order in which they appear in the trace.

The on-line diagnosis algorithm that we propose works as follows. When the process starts we derive time traces in the TPN model up to the first discarding time. A discarding time is the time when, in absence of any observation, one can discard untimed support traces, and it corresponds with the smallest value of the latest time when an observable event could be forced to happen.

The occurrence of an observable transition before the first discarding time is taken into account by eliminating traces that are not consistent with the received observation. Then the plant behaviour is derived up to a next discarding time.

The set of all legal time traces in the original TPN can be obtained by computing, for each configuration, the entire solution set of a system of (max,+)-linear inequalities called the characteristic system of the configuration. The calculations involve time interval configurations. A time interval configuration is an untimed configuration endowed with time intervals for the execution of the events within the configuration. A time interval configuration is legal if for every event and for every execution time of the event within its execution time interval there exists a legal time configuration that considers the event executed at that time. Thus, we need to derive for each configuration the entire solution set of its characteristic system. The naive approach of enumerating all the possible max-elements would imply interleaving the concurrent events, which is exactly what we wanted to avoid by using partial orders to represent the plant behaviour. To cope with this difficulty we present two methods that avoid the explicit consideration of all the cases for each max-term in the characteristic system. The first method uses the Extended Linear Complementarity Problem (ELCP) [10] for deriving the set of all solutions of the characteristic system of the configuration. The solution set can be represented as a union of faces of a polyhedron that satisfy a cross-complementarity condition. The second method is based on constraint propagation and exploits the partial order relation between the events within the configuration. We derive for each untimed configuration a set of hyperboxes of dimension equal to the number of events within the configuration such that the union of all the subsets of solutions that are circumscribed by the hyperboxes is a cover of the solution set.

The paper is organized as follows. In Section 2 we provide definitions and the notation used in the paper. In Section 3 we formalize the diagnosis problem for TPN models. The analysis of TPNs based on partial orders is described in Section 4. Section 5 and Section 6 present the two methods to derive the solution set of a characteristic system of a configuration, and then in Section 7 we present the on-line diagnosis algorithm that we propose. The paper is concluded in Section 8 with final remarks and future work.

2 Notation and definitions

2.1 Petri nets

A Petri Net is a structure N = (P, T, F) where P denotes the set of |P| places, T denotes the set of |T| transitions, and F = (Pre, Post) is the incidence function, where Pre: P × T → {0, 1} and Post: T × P → {0, 1} are the pre- and post-incidence functions that specify the arcs. We use the standard notations: •p, p• for the set of input, respectively output, transitions of a place p; similarly •t and t• denote the set of input places of t and the set of output places of t, respectively. A marking M of a PN is represented by a |P|-vector, M: P → ℕ, that assigns to each place of N a non-negative number of tokens. The set L(N, M₀) of all legal traces of a PN ⟨N, M₀⟩ with initial marking M₀ is defined as follows. A transition t is enabled at the marking M if M ≥ Pre(·, t).
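The enabling test just defined, and the marking update it licenses (the firing rule itself is spelled out in the next paragraph), can be rendered directly as the standard token game. The two-place, one-transition data below are hypothetical; this is an illustration, not code from the paper:

```python
# Places and incidence functions as in the definition above: Pre(p, t), Post(t, p).
Pre  = {("p1", "t1"): 1, ("p2", "t1"): 0}
Post = {("t1", "p1"): 0, ("t1", "p2"): 1}
places = ["p1", "p2"]

def enabled(M, t):
    # t is enabled at M iff M >= Pre(., t), componentwise
    return all(M[p] >= Pre.get((p, t), 0) for p in places)

def fire(M, t):
    # M' = M - Pre(., t) + Post(t, .)
    assert enabled(M, t)
    return {p: M[p] - Pre.get((p, t), 0) + Post.get((t, p), 0) for p in places}

M0 = {"p1": 1, "p2": 0}
print(enabled(M0, "t1"), fire(M0, "t1"))   # True {'p1': 0, 'p2': 1}
```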
Firing, an enabled transition Ø consumes È Ö Ô Øµ tokens in the input places Ô ¾ Ø and produces È Ó Ø Ø Ôµ tokens in the output places Ô ¾ Ø. The next marking is Å ¼ Å È Ó Ø Ø µ È Ö Øµ. A trace is defined as Ø Å ½ Ø ¼ Å ¾ Ø ½ Å, where for ½, Å ½ È Ö Ø µ. Å ¼ Å denotes that the sequence may fire at Å ¼ yielding Å. A PN Æ Å ¼ is 1-safe if for every place Ô ¾ È we have that Å Ôµ ½ for any marking Å that is reachable from Å ¼. 2.2 Occurrence nets Definition 1 Given a PN Æ È Ì µ the immediate dependence relation ½ È Ì µ Ì Èµ is defined as: µ ¾ È Ì µ Ì Èµ ½ if µ ¼ Define as the transitive closure of ½ ( ½ ). The immediate conflict relation ½ Ì Ì is defined as: Ø ½ Ø ¾ µ ¾ Ì Ì Ø ½ ½ Ø ¾ if Ø ½ Ø ¾ Define È Ì µ È Ì µ as µ ¾ È Ì µ È Ì µ: if Ø ½ Ø ¾ s.t. Ø ½ ½ Ø ¾ and Ø ½ and Ø ¾. The independence relation È Ì µ È Ì µ is defined as µ ¾ È Ì µ È Ì µ: µ µ µ µ Definition 2 Given two PNs Æ È Ì µ and Æ ¼ È ¼ Ì ¼ ¼ µ, is a homomorphism from Æ to Æ ¼, denoted Æ Æ ¼ where µ ȵ È ¼ and Ì µ Ì ¼ and µ Ø ¾ Ì, the restriction of to Ø respectively Ø is a bijection between Ø and ص respectively between Ø and ص. Definition 3 An occurrence net is a net Ç ½ µ such that: µ ¾ µ (acyclic); µ ¾ ½ (well-formed); µ ¾ ½ (no backward conflict). In the following is referred as the set of conditions while is the set of events. Definition 4 A configuration µ in the occurrence net Ç is defined as follows: i) is a proper sub-net of Ç ( Ç) ii) is conflict free, i.e. ¾ µ µ µ µ iii) is causally upward-closed, i.e. ¾ ¾ and ½ µ ¾ iv) Ñ Ò µ Ñ Ò Çµ Definition 5 Consider a PN Æ Å ¼ s.t. Ô ¾ È Å ¼ Ôµ ¾ ¼ ½. A branching process of a PN Æ Å ¼ is a pair Ç µ where Ç is an occurrence net and is a homomorphism Ç Æ s.t.: 1. the restriction of to Ñ Ò Çµ is a bijection between Ñ Ò Çµ and Å ¼ (the set of initially marked places) 126 DX'06 - Peñaranda de Duero, Burgos (Spain)

137 2. µ È and µ Ì 3. ¾ ( ) ( µ µ) µ For a configuration in Ç denote by ÍÌ µ the maximal (w.r.t. set inclusion) set of conditions in that have no successors in : ÍÌ µ ¾ µ Ñ Ò Çµµ Ò ¾ µ Definition 6 Given a PN Æ Å ¼ and two branching processes ¼ of PN Æ Å ¼ then ¼ if there exists an injective homomorphism ³ ¼ s.t. ³ Ñ Ò ¼ µµ Ñ Ò µ and Æ ³ ¼. There exists a unique (up to an isomorphism) maximum branching process (w.r.t. ) that is the unfolding of Æ Å ¼ and is denoted Í Æ Å ¼ µ [8]. Denote by the set of all the configurations of the occurrence net Í Æ Å ¼ µ. For a configuration ¾ denote by the set of strings that are linearizations of µ where a string ½ ¾ is a linearization of µ if and ¾ we have that: µ µ and µ for, if then. 2.3 Time Petri nets A Time Petri Net (TPN) Æ È Ì Á µ, consists of an (untimed) Petri Net Æ È Ì µ (called the untimed support of Æ ) and the static time interval function Á Ì Á É µ, Á Ø Ä Ø Í Ø, Ä Ø Í Ø ¾ É, representing the set of all possible time delays associated to transition Ø ¾ Ì. In a TPN Æ Å ¼ we say that a transition Ø becomes enabled at the time Ø Ò then the clock attached to Ø is started and the transition t can and must fire at some time Ø ¾ Ø Ò Ä Ø Ò Ø Í Ø, provided Ø did not become disabled because of the firing of another transition. Notice that Ø is forced to fire if it is still enabled at the time Ø Ò Í Ø. Definition 7 A state at the time (according to a global clock) of a TPN Æ Å ¼ is a pair Ë Å Áµ where Å is a marking and Á is a firing interval function associated with each enabled transition in Å ( Á Ì Á É )). If Ø is executed at the time Ø ¾ É we write Å Áµ Ø Ø Å ¼ Á ¼ µ or simply Ë Ø Ø Ë ¼ where: 1. Å È Ö Øµ Ø Ø Ò Ä Ø µ ؼ ¾ Ì s.t. Å È Ö Ø ¼ µ µ Ø Ø Ò Í ¼ Ø µ ¼ 2. Å ¼ Å È Ö Øµ È Ó Ø Ø µ 3. Ø ¼¼ ¾ Ì s.t. Å ¼ È Ö Ø ¼¼ µ we have: (a) if Ø ¼¼ Ø and Å È Ö Ø ¼¼ µ then Á Ø ¼¼ µ Ñ Ü Ø Ò Ä ¼¼ Ø ¼¼ ص Ø Ò ¼¼ (b) else Ò Ø ¼¼ Ø and Á Ø ¼¼ µ Ø Ò Ä ¼¼ Ø Í Ø ¼¼ ¼¼ Ò Í Ø ¼¼ A legal time trace in a TPN Æ satisfies: Ë ¼ Ø ½ ؽ ˽ Ø ¾ ؾ Ø Ø Ë where Ø ½ Ø ½ Ë Ë ½ for ¼ ½. In the following for a time trace we use the notation to denote its untimed support. For the initial state Ë ¼ we use also the notation Å. Denote ¼ Ä Æ Å ¼ µ the set of all legal time Ø ¼¼ traces that can be executed in Æ Å. We call ¼ Ä Æ Å ¼ µ the time language of the TPN Æ Å. ¼ Ä Æ Å ¼ µ is the untimed support language of the time language Ä Æ Å ¼ µ i.e. Ä Æ Å ¼ µ ¾ Ä Æ Å ¼ µ. 3 Diagnosis of TPNs We consider the following plant description: 1. the TPN model Æ Å ¼ is untimed 1-safe 2. Ì Ì Ó Ì ÙÓ where Ì Ó is the set of observable transitions and Ì ÙÓ is the set of unobservable transitions 3. Ð Ó is the observation labeling function Ð Ó Ì Å Ó where Å Ó is a set of labels and is the empty label. Ð Ó Øµ if Ø ¾ Ì ÙÓ and Ð Ó Øµ ¾ Å Ó if Ø ¾ Ì Ó 4. when an observable transition Ø Ó ¾ Ì Ó is executed in the plant the label Ð Ó Ø Ó µ is emitted together with the global time ÐÓ Ø Ó µ when this execution of Ø Ó took place 5. the execution of an unobservable transition does not emit anything (is silent) 6. the faults are modeled by a subset of unobservable events, Ì Ì ÙÓ ; Ð Ì ÙÓ Å is the fault labeling function (Å is a set of labels and is the empty label); Рص if Ø ¾ Ì ÙÓ ÒÌ and Рص ¾ Å if Ø ¾ Ì 7. the faults are unpredictable, i.e. Ø ¾ Ì, Ø ¼ ¾ Ì Ò Ì s.t. µ Ø ¼ Ø and µ Ä Ø ¼ Í Ø. The plant observation at the time the Ò Ø observed event is executed in the plant is denoted as ÇÒ Ó ½ Ó ½ Ó Ò Ó Ò, where Ó ½ Ó Ò ¾ Å Ó are the labels that are received and Ó ½ Ó ¾ Ó Ò are the occurrence times of the corresponding events. 
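The timing discipline of Definition 7 and the plant assumptions above can be summarised in a small executable sketch: an enabled transition t with enabling time t_en may fire anywhere in [t_en + L_t, t_en + U_t] and is forced to fire by the upper bound unless it is disabled first, and the transition set is partitioned into observable, unobservable and fault transitions with a labelling that is silent on unobservable firings. All names and numbers here are invented, and the interval-update bookkeeping of Definition 7 is deliberately omitted:

```python
# Static intervals I(t) = [L_t, U_t] and the partition T = T_o ∪ T_uo, T_f ⊆ T_uo.
L = {"t1": 2, "t2": 1}
U = {"t1": 4, "t2": 5}
T_obs, T_unobs, T_fault = {"t1"}, {"t2"}, {"t2"}

def firing_window(t, t_enabled):
    # t may fire at any time in [t_enabled + L_t, t_enabled + U_t] and is
    # forced to fire at t_enabled + U_t if it is still enabled then.
    return (t_enabled + L[t], t_enabled + U[t])

def label(t):
    # Observable firings report (label, global time); unobservable ones are silent.
    return t if t in T_obs else ""   # "" plays the role of the empty label

print(firing_window("t1", 3.0), label("t2") == "")   # (5.0, 7.0) True
```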
Denote by ÇÒ the plant observation at the time Ó Ò, i.e. ÇÒ includes also the information that no observable event occurred in the interval Ó Ò. Ä Æ Ç Ò µ is the set of all time traces that are feasible in Æ Å ¼ up to the time of the last observation Ó Ò and that obey the received observation ÇÒ where ¾ Ä Æ Ç Ò µ if: µ ¾ Ä Æ Å ¼ Ó Ò µ ( is legal); µ Ð Ó µ Ó ½ Ó Ò ( obeys the untimed observation), and µ for each observable transition Ø Ó ¾ Ì Ó, ½ Ò we have that Ð Ó Ø Ó µ Ó µ Ø Ó ( obeys the execution times of the observed transitions). Similarly Ä Æ Ç Ò µ is the set of all time traces that are feasible in Æ Å ¼ up to the time and that obey the received observation ÇÒ. The plant diagnosis Æ ÇÒ µ based on the received observation ÇÒ comprises the untimed strings obtained by projecting the untimed support traces contained in Ä Æ ÇÒ µ onto the set of fault transitions Ì : Æ Ç Ò µ Ò ¾ Ä Æ Ç Ò µ and Ð µ The diagnosis result Ê Æ ÇÒ µ indicates that a fault for sure happened if all the traces contain fault events, i.e. Ê Æ ÇÒ µ ¾ Æ Ç Ò µ Ó DX'06 - Peñaranda de Duero, Burgos (Spain) 127
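Once the set of consistent traces is available, the "fault for sure" verdict just defined amounts to projecting each untimed support onto the fault transitions and checking that no projection is empty; the normal and uncertain verdicts made precise just below follow the same pattern. A schematic rendering over made-up trace data:

```python
T_fault = {"t2", "t5"}

def project_on_faults(trace):
    # Projection onto T_f: keep only the fault transitions of an untimed support.
    return tuple(t for t in trace if t in T_fault)

def verdict(consistent_traces):
    projections = {project_on_faults(tr) for tr in consistent_traces}
    if all(p != () for p in projections):
        return "F"    # every consistent trace contains a fault: fault for sure
    if projections == {()}:
        return "N"    # no consistent trace contains a fault: normal
    return "UF"       # some do, some do not: uncertain

print(verdict([("t1", "t2"), ("t3", "t5")]))   # F
```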

138 If Æ ÇÒ µ contains only the empty string then the diagnosis result is normal, i.e. Ê Æ ÇÒ µ Æ Æ Ç Ò µ. Otherwise the diagnosis result is uncertain, i.e. a fault could have happened but did not necessarily happen [9]. 4 The analysis based on partial orders The partial order reduction techniques developed for untimed PN [8] are shown in [1],[3] to be applicable for TPN. Consider a configuration in the unfolding Í Æ Å ¼ µ of the untimed PN support of a TPN. Then consider a valuation of the execution times at which the events ¾ in the configuration are executed, that is for each ¾ consider a time value ¾ Ì ( Ì the time axis) at which occurs and is an -tuple representing the execution times of all the the events ¾. An untimed configuration with a valuation ¾ Ì of the execution time for its events is called a time configuration of the TPN. A time configuration is legal if there is a legal trace ¾ Ä Æ Å ¼ µ in the TPN Æ Å ¼ whose untimed support is a linearization of the partial order relation of the events in the configuration (i.e. µ and ¾ ) while the execution time Ø of every transition Ø considered in the trace is identical with the valuation of the event for which Ø is its image via. Consider an untimed configuration ¾. The TPN Ñ Ò Í Æ µ Á µ is obtained by attaching to each event ¾ the static interval Á Ø that corresponds in the original TPN to transition Ø s.t. µ Ø. Denote by à the following system of inequalities: Ã Ñ Ü ¼ µ Ä Ñ Ü ¼ µ Í ¾ ¼ ¾ ¼ ¾ (1) where in (1) implies Ñ Ü ¼ ¾ ¼ µ ¼. Proposition 1 ¾ Ä Æ Å ¼ µ we have that if µ and ¾, then is a solution of Ã, where ؽ µ Ø ½ µ with µ Ø, ½. Proof: The proof is straightforward. Denote by ËÓÐ Ã µ the set of all solutions of Ã. The -hyperbox Á that circumscribes ËÓÐ Ã µ is easily obtained as: ¾, Á µ Ä µ Í µ with Ä µ Ñ Ü ¼ ¾ Ä ¼ µµ Ä and Í µ Ñ Ü ¼ ¾ Í ¼ µµ Í where ¾ s.t., Ä µ Ä and Í µ Í. Example 1 Consider the TPN displayed in Fig. 1. Static intervals are attached to each transition. The observable transitions are Ø, Ø and Ø ½¼ and they emit the same label. Ø and Ø are faulty transitions. In Fig. 2 a part of the unfolding Í Æ Å ¼ µ is displayed where attached to each event ¾ is the interval Á µ. We cannot claim yet that for ¾ there exists at least a legal time configuration that corresponds with because for a general TPN the enabling of a transition does not guarantee that it eventually fires because some conflicting transition may be forced to fire beforehand. ¾ e 1 [10,26] e 2 [12,30] p 1 t 4 [1,8] t 2 t t [2,4] 5 p 1 3 [1,5] [3,9] t 3 [2,4] b 2 e1 [5,13] e 2 [7,17] p 2 e2 [2,4] b1 b2 p 4 p 5 t 7 [9,10] p 6 t 6 [1,2] Figure 1: e5 [1,5] b4 b5 p 7 t 8 [1,4] p 8 b7 e8 b8 [1,4] b11 p 10 t 10 [2,9] t 11 p [2,5] t 9 12 [2,9] t 9 [1,3] p 11 b10 e11 [2,5] e3 e6 e9 [4,9] [2,7] [3,8] b 1 b3 b6 b9 b 10 e4 e7 [5,17] [11,17] e10 [5,17] b 7 b 2 bb1 b 4 b 10 b 4 b7 e 3 b 3 b 1 [9,21] b 9 e 4 e 10 [10,29] [9,31] bb 1 b 4 Figure 2: e12 [4,14] e 11 [6,19] b 11 b 7 e 9 [7,22] b 10 e12 [8,28] b 10 e 11 [10,33] Denote by the set of conflicting events of a configuration ¾ where comprises the events that could have been executed but are not included in : ¾ Ò The characteristic system à of configuration ¾ is obtained by adding to à inequalities regarding the conflicting events : Ã Ñ Ü ¼ µ Ä Ñ Ü ¼ µ Í ¾ ¼ ¾ ¼ ¾ Ñ Ò ¼ µ Ñ Ü ¼ ½ ¼¼ ¾ ¼¼ µ Í ¾ Proposition 2 Given an arbitrary time we have that ¾ Ä Æ Å ¼ µ iff: µ µ, ¾ and ¾ ; µ is a solution of Ã, µ ¾ µ, and Úµ ¾ Æ Ä µ,. 
Proof: µ Since the PN is 1-safe we have that for any legal untimed trace there exists a unique configuration s.t. ¾. Condition ½ and are trivial and the proof that Ø ½ Ø Ò µ is a solution of à is simply by induction. The proof is trivial. ¾ The problem the we should answer next is: Up to what time to make the calculations for the on-line monitoring?. There are different solutions to answer this question, depending on the computational capability, the plant behaviour, and the requirements for the diagnosis result. Solution 1: Calculations in advance The first solution is appropriate for a plant known to have a cyclic operation, where each operation cycle is initiated by the plant operator. Having derived the plant behaviour up to the time that corresponds with the completion of an operation cycle, the plant is monitored on-line in the following way: b DX'06 - Peñaranda de Duero, Burgos (Spain)

139 1. the received observation is taken in to account adding (in)equality constraints to the characteristic system of a configuration. 2. or configurations are discarded when the current time exceeds the latest execution time of an observable event in a configuration. The main drawback of this method is that a large amount of calculations is performed in advance and then discarded because of the received observation. Solution 2: Calculations after each observation The second solution is to perform calculations each time an event is observed in the plant. E.g. when the first observable event is executed in the plant we derive the plant behaviour up to the time Ó ½ in the following way. Let the first observation be Ç ½ Ó ½ Ó ½. Consider the set of configurations Ç ½ µ s.t. ¾ Ç ½ µ if: 1. contains only one event Ó s.t. Ó µ ¾ Ì Ó and Ð Ó Ó µµ Ó ½, and Ó ½ ¾ Á Ó µ 2. ¾ ÍÌ µ Ä µ Ó ½ 3. ¾ Æ Ä µ Í µ Ó ½ where Æ Ä µ denotes the set of events that correspond to transitions that are enabled from ÍÌ µµ. The characteristic system Ã Ç ½ µ of configuration ¾ Ç ½ µ is obtained by adding to à inequalities regarding the conflicting events and the received observation. This method requires less computation but the price to be paid is that a fault may be detected with a delay. This is because no calculations are performed until a new observation is received, thus the fact that the current time of the plant exceeds the latest execution time of an observable event is not taken in to account. However this method is practically useful when the frequency of observations is high, i.e. the time interval in between two observations is short and control actions are inevitably taken with some latency. Moreover this method is also suitable when the plant observation is known to be uncertain, i.e. the observation of an event can be lost because of a sensor failure. This is because in between two observations the diagnosis result w.r.t. the detection of the faults that for sure happened does not change if the observation is uncertain. Solution 3: Calculations up to a discarding time A discarding time is the earliest time when in absence of any observation one can discard untimed support traces because it can be proved that they are not valid. E.g. the first discarding time is the smallest latest execution time of an observable transition in the plant. Definition 8 A configuration ¾ is derived up to the time if: µ Ñ Ü ¾ ÍÌ µ Ä µµ and µ Ñ Ò ¾ Æ Ä µ Í µµ. Given a configuration ¾ that is derived up to a time ¼, denote by µ the set of extensions of up to the time ¼ where ¾ µ if: µ ( is a continuation of ) and µ is derived up to the time. The first discarding time is calculated iteratively as follows. is initiated with a big value (say ½ for simplicity) and then starting from the initial configuration ½ µ we construct an initial part of the net unfolding by appending events as in the untimed case, the only difference being that among all the enabled events denoted by Æ Ä µ only the events with the smallest upper limit Í µ are appended, until the first observable event say Ó is encountered. The discarding time is set equal to Í Ó µ and then the configurations that contain Ó are extended up to the time Í Ó µ. Denote this set by Ó Ò Û. Then for each configuration ¾ Ó Ò Û we calculate ËÓÐ Ã µ and for those configurations that have a non-empty solution set we calculate Í ¼Ó µ, i.e. the smallest latest time when an observable event ¼Ó can be executed. Obviously Í ¼Ó µ Í Ó µ. 
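The iteration described so far for the first discarding time — initialise with a large value, grow configurations by appending only the enabled events with the smallest upper limit, and stop at the first observable event — can be sketched as follows. The recursive tightening over the extended configurations described next is elided, and `enabled_events`, `append`, `is_observable` and `U` are hypothetical callables standing in for the unfolding machinery:

```python
import math

def first_discarding_time(initial_config, enabled_events, append, is_observable, U):
    # Schematic: delta is initiated with a big value; among all enabled events
    # only those with the smallest upper limit U(e) are appended, until the
    # first observable event e_o is encountered, which bounds delta by U(e_o).
    delta = math.inf
    frontier = [initial_config]
    while frontier:
        config = frontier.pop()
        candidates = enabled_events(config)
        if not candidates:
            continue                          # no more events can be appended
        u_min = min(U(e) for e in candidates)
        for e in [e for e in candidates if U(e) == u_min]:
            if is_observable(e):
                delta = min(delta, U(e))      # first observable event: delta <- U(e_o)
            else:
                frontier.append(append(config, e))
    return delta
```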
The discarding time is set as the smallest latest time when an observable event can be forced to execute considering all ¾ Ó Ò Û. Notice that a configuration may contain some other observable events and after calculating ËÓÐ Ã µ some other observable event may have the smallest latest time for its execution. Then recursively all the configurations that contain only unobservable events are extended up to the new discarding time by appending event(s) selected among all the enabled with the smallest upper limit Í µ. Continue this operation until either a new observable event is encountered or no more events can be appended. Notice that because is calculated recursively some configurations (that contain at least one observable events) are derived up to times bigger than. However this does not affect the diagnosis result since the events that can be executed after the time are seen as a prognosis. The on-line diagnosis algorithm works as follows. When the process starts we derive the set of configurations up to the first discarding time and then we have two cases: Case 1 If no observation is received before the time then: 1. the configurations that contain observable events having the upper limit equal to are discarded 2. for all the other configurations that contain observable events inequalities of the form: Ó Ã Ó Ò Ó Ó ¾ and Ó µ ¾ Ì Ó are added to the characteristic systems à and we derive the entire solution set 3. for all the configurations ¾ ÙÒÓ that contain only unobservable events we check only if ËÓÐ Ã µ has an non-empty set of solutions. 4. denote by Ç ¼ µ the set of traces that are obtained as linearizations of the set of events of the configurations that are not discraded. 5. the diagnosis ÔÓ Ç Æ ¼ µ is obtained projecting Ç ¼ µ onto the set of fault transitions Ì Case 2 If the first observation Ó ½ Ó ½ is received before the time of the process becomes then: DX'06 - Peñaranda de Duero, Burgos (Spain) 129

140 1. the set of configurations ÙÒÙ that contain only unobservable events is discarded 2. for each configuration ¾ Ó that contains observable events an equality relation: ÃÓ ¼ ½ Ó Ó ½ Ð Ó Ó µ Ó ½ Ó ¾ and for observable events other than Ó inequalities of form: Ó ÃÓ ¼¼ ½ Ò ¼Ó ¼Ó ¾ and ¼Ó µ ¾ Ì Ó are added to the characteristic system à and then we derive the entire solution set 3. denote by Ç ½ µ the set of traces that are obtained as linearizations of the set of events of the configurations that are not discarded. 4. ÔÓ Ç Æ ½ µ is obtained projecting Ç ½ µ onto Ì Notice that the plant diagnosis is derived either at the time of the first observed event ÔÓ Ç Æ ½ µ or in absence of any observation at the first discarding time, ÔÓ Ç Æ µ. ¼ Theorem 1 Given a TPN model Æ Å ¼ we have that: 1. when the first observable event is executed: Ê Æ Ç ½ µ ÊÔÓ Ç Æ ½ µ 2. if no observation is received until the first discarding : Ê Æ Ç ¼ µ ÊÔÓ Ç Æ ¼ µ 3. and for any time, in absence of any observation, the diagnosis result is different from : Ê Æ Ç¼ µ Proof: ½µ and ¾µ have a similar proof. Based on Proposition 2 we calculate the set of legal traces up to given time. However some configurations include events that are executed after the time Ó ½ or. Since the faults are unpredictable the consideration of some events that can be executed after the time Ó ½ or does not change the diagnosis result w.r.t. the detection of faults that for sure happened. µ is proved straightforwardly by the assumption that the faults are unpredictable. ¾ Remark 1 Obviously by imposing the inequality that all the events in a configuration have execution times smaller than Ó ½ or allows one to derive exactly the diagnosis result by removing the events that can be executed after the time Ó ½ respectively. However this is not efficient for practical calculations especially when the frequency of observations is high. Notice also that calculations in advance are not fully developed, thus it may be that an event that is considered executed after Ó ½ might not be executed since an event that is successor of the observed event can pre-empt its execution. In what follows we present two methods to derive the solution set of the characteristic system of a configuration. The first method is based on the ELCP and derives the entire solution set as a union of faces of a polyhedron that satisfy the cross-complementarity condition [10]. The second method is based on constraint propagation and derives for a configuration a set of -hyperboxes s.t. the union of the subsets of solutions that are circumscribed by the -hyperboxes is a cover of the entire solution set. 5 The method based on ELCP The ELCP is defined as follows (see [10]). Given ¾ ÁÊ Û Þ, ¾ ÁÊ Õ Þ, ¾ ÁÊ Û, ¾ ÁÊ Õ, and Ñ index sets ½ Ñ ½ Û, find Ü ¾ ÁÊ Þ such that Ü Ü (2) Ñ ½ ¾ Ü µ ¼ (3) Condition (3) can be interpreted as follows. Since Ü, all É the terms in (3) are nonnegative. Hence, (3) is equivalent to ¾ Ü µ ¼ for ½ Ñ. So we could say that each set corresponds to a group of inequalities in Ü, and that in each group at least one inequality should hold with equality. In [10] we have developed an algorithm to find all solutions of an ELCP. This algorithm yields a description of the complete solution set of an ELCP by finite points, generators for extreme rays, and a basis for the linear subspace associated with the maximal affine subspace of the solution set of the ELCP. Let us now explain how Ñ Ü µ equations of the form Ñ Ü µ Ä Ñ Ü µ Í (4) ¾Â ¾Â can be recast as an ELCP. 
First we introduce a dummy variable Ñ Ü ¾Â. Then (4) reduces to Ä Í (5) which already fits the ELCP format. Let us now look at the equation Ñ Ü ¾Â. This can be recast as for all ¾  (6) where for at least one index ¾  equality should hold, i.e., ¾Â µ ¼ (7) Clearly, equations (5) (7) constitute an ELCP. Thus à can be treated as an ELCP. First we derive the polyhedron that provides the set of solution for the system of linear (in)eaqualities given by 2. The solution set of the ELCP is obtained as a union of faces of a polyhedron that satisfy the cross-complementarity condition [10]. 6 The method based on constraint propagation Before formally presenting the second algorithm we introduce first the definition of a time interval configuration. A time interval configuration Áµ is an untimed configuration ¾ endowed with time intervals for the execution of the events within the configuration. Á is a vector of dimension that comprises for each event ¾ the time interval Á µ in which the event is assumed executed. Definition 9 Given the observation Ç ½ and a configuration ¾ Ç ½ µ we have that the time interval configuration Áµ is legal if for any event ( ¾ ) and for any execution time of the event ( ¾ Á µ) there exist execution times for all the other events within the configuration ( ¾ Á µ for all ¾ Ò ) s.t. ½ µ is a solution of the characteristic system à ( ¾ ËÓÐ Ã µ). 130 DX'06 - Peñaranda de Duero, Burgos (Spain)
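Before the constraint-propagation algorithm is developed further, it is worth closing the loop on the ELCP recast above: for a single constraint of the form (4), the dummy-variable encoding (5)–(7) can be checked numerically against the direct max-formulation. The sketch below verifies, for a candidate valuation, that the two formulations agree; all data are made up for illustration:

```python
def satisfies_direct(xs, J, d, L, U, x_e):
    # (4): max_j(xs_j + d_j) + L <= x_e <= max_j(xs_j + d_j) + U
    w = max(xs[j] + d[j] for j in J)
    return w + L <= x_e <= w + U

def satisfies_elcp(xs, J, d, L, U, x_e, w):
    # (5): w + L <= x_e <= w + U       (linear, given the dummy variable w)
    # (6): w >= xs_j + d_j for all j in J
    # (7): prod_j (w - xs_j - d_j) = 0, i.e. equality for at least one j
    ineq = all(w >= xs[j] + d[j] for j in J)
    comp = any(w == xs[j] + d[j] for j in J)
    return ineq and comp and (w + L <= x_e <= w + U)

xs = {1: 3.0, 2: 5.0}                        # hypothetical event times
J, d, L, U = [1, 2], {1: 1.0, 2: 0.5}, 1.0, 4.0
w = max(xs[j] + d[j] for j in J)             # the dummy variable of the recast
print(satisfies_direct(xs, J, d, L, U, 7.0) == satisfies_elcp(xs, J, d, L, U, 7.0, w))  # True
```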

141 Given a hyperbox Á Á denote by Ä µ Í µ the execution time interval for the event. Then for a conflicting event denote by Ä µ Ñ Ü ¼ ¾ Ä ¼ µµ Í and Í µ Ñ Ü ¼ ¾ Í ¼ µµ Í the earliest respectively the latest time when is forced to fire. We have that. Proposition 3 Á µ is a legal time interval configuration if the following conditions hold true: 1. Á Á such that Ä µ Ñ Ü ¼ ¾ Ä ¼ µµ Í and Í µ Ñ Ü ¼ ¾ Í ¼ µµ Ä 2. ¾, ¾ s.t. ½ and Ä µ Ä µ and Í µ Í µ. 3. Ó ½ Ó for Ó ¾, Ó µ Ð Ó ½ µ 4. ¾ ÍÌ µ µ Í µ Ó ½ 5. ¾ Æ Ä µ µ Ñ Ü ¼ ¾ Ä ¼ µµ Í Ó ½. Proof: The proof is lengthy and is omitted. In the following we present an algorithm that derives a set of -hyperboxes, Á ¾ Î (Î the set of indexes) s.t. for each -hyperbox Á, Á µ is a legal time interval configuration and the union of the subsets ËÓÐ Ã µ ¾ Î that are circumscribed by Á is a cover of the entire solution set ËÓÐ Ã µ, i.e. Ë ¾Î ËÓÐ Ã µ ËÓÐ Ã µ, where ËÓÐ Ã µ ËÓÐ Ã µ Á. The idea behind developing the algorithm that we propose is as follows. First we calculate the hyperbox Á that circumscribes ËÓÐ Ã µ. Then we should impose the timing constraints imposed by the conditions ¾ in Proposition 3. We have three kinds of constraints. Denote by à ÓÒ, ÃÓ ¼, and ÃÓ ¼¼ the set of constraints imposed by the set of conflicting events (condition ¾µ), the equality constraint required by the observation of the label Ð Ó ½ (condition µ), and respectively the set of constraints that require that the time configuration is complete w.r.t. the time Ó ½ (none of the concurrent parts of the process are left behind in time). Consider a constraint on the time interval Á µ Ä µ Í µ of an event ¾ where: Ò Á ¼ µ Ä ¼ µ Í ¼ µ Ä ¼ µ Ä µ or Í ¼ µ Í Ó µ The set of solutions of à that satisfy, denoted ËÓÐ Ã µ, is obtained propagating the constraint forward to its successors and backwards to its predecessors: - forward propagation: for all ¾ : Ä ¼ µ Ñ Ü Ä µ Ä Ä µµ and Í ¼ µ Ñ Ò Í µ Í Í µµ - backward propagation: i) for all ¾ : Í ¼ µ Ñ Ò Í µ Ä Í µµ ii) for each ¾ s.t. Ä µ Í Í µ consider a different case ¾ Î ¼ : ii.1) Ä ¼ µ Ä µ Í ii.2) for all ¾, Ä ¼ µ Ä µ. ¾ e1 [11,13] e2 [2,4] b 1 b1 b2 b5 b4 b7 e5 e8 [1,5] [1,4] b8 e3 e6 e9 [4,9] [3,7] [3,8] b3 b6 b9 e4 e10 [5,17] [5,17] bb1 b 4 b 7 Figure 3: b 10 b10 e11 [2,5] b11 e 12 [11,14] b 10 The backward propagation of a constraint may require to split an -hyperbox considering different cases. Notice that the number of cases is not bigger than the number of concurrent predecessor events of the event to whom the constraint is applied. For each hyperbox Á ¼, ¼ ¾ Î ¼ the set of constraints is updated since in general it may be that new constraints appear while some of the previous constraints are satisfied. If a constraint cannot be imposed the case is aborted while if the set of constraints is empty the algorithm returns an hyperbox that circumscribes a subset of solutions of Ã. The constraint propagation algorithm works as follows: 1. first step is to impose the constraints of kind ÃÓ ¼ and (required by the received observation) à ¼¼ Ó 2. the second step is to impose for each -hyperbox that results after step 1, the set of constraints à ÓÒ. E.g. for Á consider that ¾ s.t. condition ¾ in Proposition 3 is not satisfied. Then for each ¾ s.t. ½ consider a different case and impose a constraint Ä ¼ ¼ µ Ä ¼ µ if Ä ¼ µ Ä ¼ µ or Í ¼ ¼ µ Í µ if Í ¼ µ Í ¼ µ. 3. an arbitrary constraint or is selected and then it is imposed backwards. 
If new constraints appear on the time intervals of the predecessor events of or then one of these constraints is selected and it is imposed further backwards until a decision is achieved. Then constraints are propagated forward for the - hyperboxes that are not aborted. The maximum number of different cases that result propagating recursively a constraint backwards is smaller than the size of maximum set of concurrent events in the configuration 4. a decision is achieved for each case in finite time since the corner points of each -hyperbox are rational numbers and each constraint that is applied either reduces one edge of the -hyperbox or returns success/abort. Example 2 Consider for the configuration displayed in Fig. 3 that the first observation is received at the time ½ and consider the case when is the event that was observed. Let ¼ ½. ¼ is propagated backwards and a new constraint ¼ appears where ¼¼ Á. ¼ is propagated backwards but no new constraints appears. Then ½¼ is required to be executed after ½, i.e. ¼¼ ½¼ ½¼ ¾ ½ ½. ¼¼ ½¼ is propagated backwards and DX'06 - Peñaranda de Duero, Burgos (Spain) 131

142 a constraint appears where ¼¼ Á. ¼¼ is propagated backwards and no new constraint appears. Then the timing constraints required by the conflicting events ½ and ½¾ are satisfied. What is left is the conflicting event. We have that and and Á µ, Á µ, and Á µ. We have two cases. First consider. We have ļ and Í ¼. is propagated backwards and we have two cases: either Á ½ µ ¾ and Á ½ µ ½ or Á ¾ µ ½ and Á ¾ µ ¾. Í ¼ does not produce new constraints. We obtain two hyperboxes and if we consider the case when we obtain in a similar way another two hyperboxes. 7 The on-line diagnosis In the previous sections we have presented the plant diagnosis up to the first observation or in absence of any observation up the the first discarding time. Then the on-line diagnosis is performed calculating the plant behaviour up to a new discarding time. Theorem 2 Given a TPN model Æ Å ¼ we have that: 1. when an observable event is executed: Ê Æ ÇÒ µ ÊÔÓ Ç Æ Ò µ 2. for the first discarding time after the time when the Ò Ø observed event is reported: Ê Æ Ç Ò µ ÊÔÓ Ç Æ Ò µ 3. and in absence of any observation, the diagnosis result w.r.t. the detection of the faults that for sure happened calculated any time in between the last observed event and the discarding time is constant, i.e. ¾ Ó Ò µ: Ê Æ ÇÒ µ ÊÔÓ Ç Æ Ò µ. Proof: The proof is similar to the proof of Theorem 1. 8 Final remarks and future work We have derived in this paper on-line algorithms for the diagnosis of TPN models. The plant behaviour is derived up to a discarding time, i.e. up to a time when in absence of any observation one can discard untimed support traces because they are not consistent with the plant behaviour. The analysis is based on partial orders and it requires to derive the solution set of systems of Ñ Ü µ-linear inequalities. We have presented two algorithms to derive the entire solution set, one based on the ELCP and the second one based on constraint propagation. Both algorithms are NP-hard problems. Beside the number of events, the number of conflicting events, and the maximum number of predecessors respectively successors of a node in a configuration, the computational complexity of both methods strongly depends on the structure of the system. However there are a few reasons that allow us to claim that the two methods are computationally more efficient than the ones ([1], [5]) presented in the literature. Comparing with the method based on the state class graph computation [5] our methods have the advantage that not all the interleaving of the concurrent events are considered. Moreover the ¾ computational complexity depends in our case on the size of the largest subnet that contains unobservable transitions whereas the computation complexity in [5] depends on the size of the entire net. The algorithm in [1] solves a system of Ñ Ü µ-inequalities enumerating all the cases for each Ñ Ü-term. This combinatorial approach is known in the literature to be computational less efficient than the ELCP. Finally notice that for the above example the ELCP provides subsets while constraint satisfaction only finds 4 subsets. The reason is that each face of a polyhedron that satisfies a cross-complementarity condition provides a legal time interval configuration but the converse is not true. The subset of solutions that is circumscribed by the hyperbox of a time interval configuration may be obtained as a union of faces of a polyhedron that satisfy a cross-complementarity condition. 
However, the set of hyperboxes obtained by running the algorithm based on constraint propagation does not allow one to calculate the minimum and maximum time separation between the execution of two events unless a further refinement of the calculations is performed. We plan to extend the methodology to a distributed setting in which the strong assumptions considered in [6] are relaxed.

References
[1] T. Aura and J. Lilius. Time processes of Time Petri Nets. ATPN'97, LNCS 1248, 1997.
[2] B. Berthomieu and M. Menasche. An enumerative approach for analyzing Time Petri Nets. IFIP Congress, Paris, 1983.
[3] T. Chatain and C. Jard. Time supervision of concurrent systems using symbolic unfoldings of Time Petri Nets. Int. Conf. on Formal Modeling and Analysis of Timed Systems, Uppsala, Sweden, 2005.
[4] E. Fabre, A. Benveniste, S. Haar, and C. Jard. Distributed monitoring of concurrent and asynchronous systems. Journal of Discrete Event Dynamic Systems, 15(1):33-84, March 2005.
[5] M. Ghazel, M. Bigand, and A. Toguyéni. A temporal-constraint based approach for monitoring of Discrete Event Systems under partial observation. IFAC Congress, Prague, 2005.
[6] G. Jiroveanu. Fault diagnosis for large Petri Nets. PhD thesis, Ghent University, Gent, Belgium, 2006.
[7] G. Jiroveanu, B. De Schutter, and R.K. Boel. Fault Diagnosis for Time Petri Nets. Workshop on Discrete Event Systems (WODES'06), Ann Arbor, USA, 2006.
[8] K. L. McMillan. Using unfoldings to avoid the state space explosion problem in verification of asynchronous circuits. 4th Int. Workshop on Computer Aided Verification (CAV), 1992.
[9] M. Sampath, R. Sengupta, S. Lafortune, S. Sinnamohideen, and D. Teneketzis. Diagnosability of Discrete Event Systems. IEEE Transactions on Automatic Control, 40(9), 1995.
[10] B. De Schutter and B. De Moor. The Extended Linear Complementarity Problem. Mathematical Programming, 71(3):289-325, December 1995.

Primary and Secondary Plan Diagnosis

Femke de Jonge and Nico Roos (Dept. of Computer Science, Universiteit Maastricht, P.O. Box 616, NL-6200 MD Maastricht) and Cees Witteveen (Faculty EEMCS, Delft University of Technology, P.O. Box 5031, NL-2600 GA Delft)

Abstract

Diagnosis of plan failures is an important subject in both single- and multi-agent planning. Plan diagnosis may provide information that can improve the way plan failures are dealt with in three ways: (i) it provides information necessary for the adjustment of the current plan or for the development of a new plan, (ii) it can be used to point out which equipment and/or agents should be repaired or adjusted so that they will not further harm the plan execution, and (iii) it can identify the agents responsible for plan-execution failures. We introduce two general types of plan diagnosis: primary plan diagnosis, identifying the incorrect or failed execution of actions, and secondary plan diagnosis, which identifies the underlying causes of the faulty actions. Furthermore, three special cases of secondary diagnosis are distinguished, namely equipment diagnosis, environment diagnosis and agent diagnosis.

1 Introduction

In multi-agent planning research there is a tendency to deal with plans that become larger, more detailed and more complex. Clearly, as complexity grows, the vulnerability of plans to failures grows correspondingly. Taking appropriate measures against such plan failures requires knowledge of their causes. So it is important both to detect the occurrence of failures and to determine their causes. Therefore, we consider diagnosis an integral part of the capabilities of planning agents in single- and multi-agent systems. In this paper we adapt and extend a classical Model-Based Diagnosis (MBD) approach to the diagnosis of plan execution. The system to be diagnosed consists not only of the plan and its execution, but also of the equipment needed for the execution, the environment in which the plan is executed and the executing agents themselves. Therefore, the agents, the equipment and the environment also need to be subjects of diagnosis. (This research is supported by the Technology Foundation STW, applied science division of NWO, and the technology programme of the Ministry of Economic Affairs (the Netherlands); project DIT5780: Distributed Model Based Diagnosis and Repair.)

To motivate the need for the different types of diagnosis we distinguish, consider a very simple example in which a pilot agent of an airplane participates in a larger multi-agent system for the Air Traffic Control of an airport. Suppose that the pilot agent is performing a landing procedure and that its plan prescribes the deployment of the landing gear. Unfortunately, the pilot was forced to make a belly landing. Clearly, the plan execution has failed and we wish to apply diagnosis to find out why. A first, but superficial, diagnosis will point out that the agent's action of deploying the landing gear has failed and that the fault mode of this action is 'landing gear not locked'. We will denote this type of diagnosis as primary plan diagnosis; this type of diagnosis focuses on the set of fault behaviors of actions that explain the differences between the expected and the observed plan execution. Often, however, it is more interesting to determine the causes behind such faulty action executions.
In our example, a faulty sensor may incorrectly indicate that the landing gear is already extended and locked, which led the pilot agent to believe that the action was successfully executed. We will denote the diagnosis of these underlying causes as secondary plan diagnosis. Secondary diagnosis can be viewed as a diagnosis of the primary diagnosis. It informs us about malfunctioning equipment, unexpected environment changes (such as the weather) and faulty agents. As a special type of secondary diagnosis, we are also able to determine the agents responsible for the failed execution of some actions. In our example, the pilot agent might be responsible, but so might be the airplane maintenance agent. In our opinion, diagnosis in general, and secondary diagnosis in particular, enables the agents involved to make specific adjustments to the system or the plan so as to manage current plan-execution failures and to avoid new ones. These adjustments can be categorized with regard to their benefits to the overall system. First of all, diagnosis provides information on how the plan behaves during execution, which might contribute to failure-free (re)planning. For example, we can imagine that the initial knowledge of how a dynamic environment may influence the plan execution is rather limited; diagnosis may provide information that expands this knowledge. Secondly, a secondary diagnosis can point out which equipment used for the plan execution was malfunctioning.

Broken equipment can then be fixed to improve future plan execution. Moreover, if the number of possible repairs is limited, diagnosis can indicate which repair has the most positive influence on future plan execution. In this respect, agents can be viewed in the same way as equipment: agents too can malfunction, either because of incorrect beliefs, or because the agent somehow died (crashed). Secondary diagnosis can also provide the information necessary to recover and adjust agents, thereby contributing to a better plan execution. Hence, it can contribute to solving the well-known qualification problem [McCarthy, 1977]. Finally, diagnosis can indicate the agents responsible (accountable) for the failures in the plan execution. This information is very useful when evaluating the system, and can also be used to divide the costs of repairs and/or changes in the plan amongst the agents.

To realize the benefits of plan-based diagnosis outlined above, we introduce an object-oriented view to describe plan execution. Based on this model, primary and secondary diagnosis will be defined. The primary plan diagnosis more or less corresponds to the main aspects of diagnosis of plan execution described by Witteveen and Roos [Witteveen et al., 2005; Roos and Witteveen, 2005]. To enable secondary plan diagnosis, we expand their model such that it is possible to analyze not only the plan-execution process, but also the role of the objects that influence the plan execution. The resulting model is specified in section 3 and consists of objects representing the plan and its execution, the equipment that is used for the plan execution, the environmental objects that are somehow involved in the plan, and the agents executing the plan. On this model we can apply techniques inspired by model-based diagnosis to find the primary diagnosis, as described in subsection 4.1. The secondary diagnosis is presented in subsection 4.2, while subsection 4.3 discusses the agents that are held responsible for the failed actions. But first of all, we will place our approach into perspective by discussing some approaches to plan diagnosis in the following section.

2 Related research

In this section we briefly discuss some other approaches to plan diagnosis. Birnbaum et al. [Birnbaum et al., 1990] apply MBD to planning agents, relating planning assumptions made by the agents to the outcomes of their planning activities. However, they do not consider faults caused by execution failures as a separate source of errors. De Jonge et al. [de Jonge and Roos, 2004; de Jonge et al., 2005] present an approach that directly applies model-based diagnosis to plan execution. Here, the authors focus on agents each having an individual plan, and on the conflicts that may arise between these plans (e.g., if they require the same resource). Diagnosis is applied to determine those factors that are accountable for future conflicts. The authors, however, do not take into account dependencies between health modes of actions and do not consider agents that collaborate to execute a common plan. Kalech and Kaminka [Kalech and Kaminka, 2003; 2004] apply social diagnosis in order to find the cause of an anomalous plan execution. They consider hierarchical plans consisting of so-called behaviors. Such plans do not prescribe a (partial) execution order on a set of actions. Instead, based on its observations and beliefs, each agent chooses the appropriate behavior to be executed.
Each behavior in turn may consist of primitive actions to be executed, or of a set of other behaviors to choose from. Social diagnosis then addresses the issue of determining what went wrong in the joint execution of such a plan by identifying the disagreeing agents and the causes for their selection of incompatible behaviors (e.g., belief disagreement, communication errors). Although we do not consider hierarchical plans of behaviors, social diagnosis is related to the agent diagnosis proposed here. Lesser et al. [Carver and Lesser, 2003; Horling et al., 2001] also apply diagnosis to (multi-agent) plans. Their research concentrates on the use of a causal model that can help an agent to refine its initial diagnosis of a failing component (called a task) of a plan. As a consequence of using such a causal model, the agent would be able to generate a new, situation-specific plan that is better suited to pursue its goal. While their approach in its ultimate intentions (establishing anomalies in order to find a suitable plan repair) comes close to ours, their approach to diagnosis concentrates on specifying the exact causes of the failing of one single component (task) of a plan. Diagnosis is based on observations of a component without taking into account the consequences of failures of such a component w.r.t. the remaining plan. In our approach, instead, we are interested in applying MBD-inspired methods to detect plan failures. Such failures are based on observations during plan execution and may concern individual components of the plan, but also agent properties. Furthermore, we do not only concentrate on failing components themselves, but also on the consequences of these failures for the future execution of plan elements. Witteveen et al. [Witteveen et al., 2005; Roos and Witteveen, 2005] show how classical MBD can be applied to plan execution. To illustrate the different aspects of diagnosis discussed in the introduction, below we present an adapted and extended version of their formalization of plan diagnosis. This formalization enables the handling of the approaches of de Jonge et al. [de Jonge and Roos, 2004; de Jonge et al., 2005], Kalech and Kaminka [Kalech and Kaminka, 2003; 2004], and Lesser et al. [Carver and Lesser, 2003; Horling et al., 2001]. The work of Birnbaum et al. [Birnbaum et al., 1990] is not covered by the proposed formalization since it focuses on the planning activity instead of on plan execution.

3 Preliminaries

Objects. In [Witteveen et al., 2005] it was shown that by using an object-oriented description of the world instead of a conventional state-based description, it becomes possible to apply classical MBD to plan execution. Here, we take this approach one step further by also introducing objects for the agents executing the plan and for the actions themselves. Hence, we assume a set of objects O that will be used to describe the plan, the agents, the equipment and the environment.

The objects O are partitioned into classes or types. We distinguish four general classes, namely: actions A, agents Ag, equipment E and environment objects N.

States and partial states. Each object o ∈ O is assumed to have a domain D_o of values. The state of the objects O = {o_1, ..., o_n} at some time point is described by a tuple σ ∈ D_o1 × ... × D_on of values. In particular, the states σ_A, σ_Ag, σ_E and σ_N are used to denote the state of the action objects A, the agent objects Ag, the equipment objects E and the environment objects N, respectively. The state σ_N of environment objects N describes the state of the agents' environment at some point in time; such state descriptions can be the location of an airplane or the availability of a gate. The states σ_A, σ_Ag and σ_E of action, agent and equipment objects describe the health modes of these objects for the purpose of diagnosis [Kleer and Williams, 1989; Struss and Dressler, 1989]. We assume that each of their corresponding domains contains at least (i) the value nor, to denote that the object behaves normally, and (ii) the general fault mode ab, to denote that the object behaves in an unknown and possibly abnormal way. Moreover, the domains may contain several more specific fault modes. For instance, the domain of a flight action may contain a fault mode indicating that the flight is 20 minutes delayed.¹

¹ Note that in a more elaborate approach the value of, for instance, an equipment object may also indicate the location of the equipment. In this paper we only represent the health mode of the equipment.

It will not always be possible to give a complete state description. Therefore, we introduce a partial state as an element π ∈ D_oi1 × D_oi2 × ... × D_oik, where 1 ≤ k ≤ n and 1 ≤ i_1 < ... < i_k ≤ |O|. We use O(π) to denote the set of objects {o_i1, o_i2, ..., o_ik} ⊆ O specified in such a state π. The value of an object o ∈ O(π) in π will be denoted by π(o). The value of an object o ∈ O not occurring in a partial state π is said to be unknown (or unpredictable) in π, denoted by ⊥. Including ⊥ in every value domain D_i allows us to consider every partial state π as an element of D_1 × D_2 × ... × D_|O|.

Partial states can be ordered with respect to their information content: given values d and d′, we say that d ≤ d′ holds iff d = ⊥ or d = d′. The containment relation between partial states is the point-wise extension of ≤: π is said to be contained in π′, denoted by π ⊑ π′, iff ∀o ∈ O [π(o) ≤ π′(o)]. Given a subset of objects S ⊆ O, two partial states π, π′ are said to be S-equivalent, denoted by π =_S π′, if for every o ∈ S, π(o) = π′(o). We define the partial state π restricted to a given set S, denoted by π↾S, as the state π′ ⊑ π such that O(π′) = S ∩ O(π).

An important notion for our notion of diagnosis is the compatibility relation between partial states. Intuitively, two states π and π′ are said to be compatible if they could refer to the same complete state. This means that they do not disagree on the values of objects defined in both states, i.e., for every o ∈ O either π(o) = π′(o) or at least one of the values π(o) and π′(o) is undefined. So we define π and π′ to be compatible, denoted by π ≈ π′, iff ∀o ∈ O [π(o) ≤ π′(o) or π′(o) ≤ π(o)]. As an easy consequence we have, using the notion of S-equivalent states, π ≈ π′ iff π =_{O(π)∩O(π′)} π′. Finally, if π and π′ are compatible states, they can be merged into the ≤-least state π ⊔ π′ containing them both: ∀o ∈ O [(π ⊔ π′)(o) = max_≤ {π(o), π′(o)}].
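As a concrete reading of these definitions, partial states can be modelled as dictionaries in which absent keys play the role of ⊥. The sketch below is our own illustration, not code from the paper; the function names are invented.

```python
# Partial states as dicts: an object absent from the dict has value "bottom".
# Own illustration of the compatibility, merge and restriction definitions.

def compatible(pi1, pi2):
    """pi1 ~ pi2: no object has two different defined values."""
    return all(pi1[o] == pi2[o] for o in pi1.keys() & pi2.keys())

def merge(pi1, pi2):
    """The least state containing both (only defined for compatible states)."""
    assert compatible(pi1, pi2)
    return {**pi1, **pi2}

def restrict(pi, S):
    """The partial state pi restricted to the objects in S."""
    return {o: v for o, v in pi.items() if o in S}

pi1 = {"truck": "nor", "goods": "Rotterdam"}
pi2 = {"goods": "Rotterdam", "driving": "nor"}
print(compatible(pi1, pi2))   # True: they agree on 'goods'
print(merge(pi1, pi2))        # defines truck, goods and driving
```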
Goals An (elementary) goal g of an agent specifies a set of states an agent wants to bring about using a plan. Here, we specify each such a goal g as a constraint, that is, a relation over some product D i1... D ik of domains. We say that a goal g is satisfied by a partial state π, denoted by π = g, if the relation g contains some tuple (partial state) (d i1, d i2,... d ik ) such that (d i1, d i2,... d ik ) π. We assume each agent a to have a set G a of such elementary goals g G a. We use π = G a to denote that all goals in G a hold in π, i.e. for all g G a, π = g. Actions and action execution The set A of action objects, also called plan steps is partitioned into subclasses α i called action types or plan operators. Through the execution of a specific action object a A, the state of environment objects N and possibly also of equipment E objects may change. We describe such changes of an action object a A by a (partial) function f α where α is the type of the action (plan operator) a is an instance of: f α : D a D ag D e1... D ei D n1... D nj D e 1... D e k D n 1... D n l where a α A is an action of type α, ag Ag is the execution agent, e 1,..., e i E are the required equipment objects, n 1,..., n i N are the required environment objects, and {e 1,..., e k, n 1,..., n l } {e 1,..., e i, n 1,..., n j } are equipment and environment objects that are changed by the action a. Note that since the values of equipment objects only indicate health modes of these objects we allow for equipment objects in the range of f α in order be able to describe repair and maintenance actions. To distinguish the different types of parameters in a more clear way, semicolons will be placed between them when they appear in the argument of a function, e.g.: f transport (driving : A; hal : Ag; truck : E; goods : N ). The objects whose value domains occur in dom(f α ) will be denoted by dom O (o a ) = {o a, o ag, o e1,..., o ei, o n1,..., o nj } and, analogously, ran O (o a ) = {o e 1,..., o e l, o n 1,..., o n j }. Moreover, we will use dom Ag O (o a), dom E O (o a) and dom N O (o a) to denote {o ag }, {o e1,..., o ei } and {o n1,..., o nj } respectively. Note that we use the action instance o a to denote the objects involved in the execution of o a according to the function f α with o a α. The result of an action may not always be known if, for instance, the action fails or if equipment is malfunctioning. DX'06 - Peñaranda de Duero, Burgos (Spain) 135

Figure 1: An action and its state transformation. (The transport action f_transport maps a state in which the goods are located in New York to one in which they are located in Rotterdam; the agent, truck and driving objects keep status normal.)

Therefore, we allow that the function associated with an action maps the value of an object o to ⊥, to denote that the effect of the action on o is unknown. In fact, we only require that the effect of an action is completely specified for all objects in the function's range if the action is executed under normal circumstances, that is, if the agent is capable of executing the planned action given the planned equipment. Figure 1 illustrates the state transformation resulting from the application of a drive action. Note that in this example only the state of the goods is changed as a result of the transport action. In this paper we assume that an object representing equipment only indicates the health state of the equipment. In a more elaborate approach, the health state would be only one of the attributes of the object; another attribute might indicate the location of the object, in which case the location of the truck in Figure 1 would also change as a result of the drive action.

Plans. A plan is a tuple P = ⟨A, <⟩ where A ⊆ A is a subset of plan steps (action objects) that need to be executed and < is a partial order defined on A × A, where a < a′ indicates that the plan step a must finish before the plan step a′ may start. Note that each plan step a ∈ A occurs exactly once in the plan P, while there may be several plan steps that belong to the same action type. We denote the transitive reduction of < by ≪, i.e., ≪ is the smallest sub-relation of < such that the transitive closure of ≪ equals <. We assume that if two action instances a and a′ in a plan P are independent, they may in principle be executed concurrently. This means that the precedence relation < at least should capture all resource dependencies that would prohibit concurrent execution of actions. Therefore, we assume < to satisfy the following concurrency requirement: if ran_O(a) ∩ dom_O(a′) ≠ ∅ then a < a′ or a′ < a. That is, for concurrent instances, domains and ranges do not overlap.²

² Note that since ran_O(a) ⊆ dom_O(a), this requirement excludes overlapping ranges of concurrent actions, but domains of concurrent actions are allowed to overlap as long as the values of the objects in the overlapping domains are not affected by the actions.

Figure 2: Plans and action instances. Each state π_0, ..., π_3 characterizes the values of four objects o_1, o_2, o_3 and o_4; states are changed by the application of the action instances a_1, ..., a_6.

Example. Figure 2 gives an illustration of a plan. Since an action object is applied only once in a plan, for clarity we replace the function describing the behavior of the action by the name of the action. The arrows relate an action to the objects it uses as inputs and the objects it modifies as outputs. In this plan, the dependency relation is specified as a_1 ≪ a_3, a_2 ≪ a_4, a_4 ≪ a_5, a_4 ≪ a_6 and a_1 ≪ a_5. Note that the last dependency has to be included because a_5 changes the value of o_2 needed by a_1. The action a_1 shows that not every object occurring in the domain of an action needs to be affected by the action. The actions a_5 and a_6 illustrate that concurrent actions may have overlapping domains.
Plan execution. For simplicity, we assume that every action in a plan P takes a unit of time to execute. We are allowed to observe the execution of a plan P at discrete times t = 0, 1, 2, ..., k where k is the depth of the plan, i.e., the length of the longest <-chain of actions occurring in P. Let depth_P(a) be the depth of action a in plan P = ⟨A, <⟩: depth_P(a) = 0 if {a′ | a′ ≪ a} = ∅, and depth_P(a) = 1 + max{depth_P(a′) | a′ ≪ a} otherwise. If the context is clear, we often omit the subscript P. We assume that the plan starts to be executed at time t = 0 and that concurrency is fully exploited, i.e., every action a with depth_P(a) = k′ is started at time k′ and is completed at time k′ + 1; in particular, all actions a with depth_P(a) = 0 are completed at time t = 1. Note that thanks to the above specified concurrency requirement, concurrent execution of actions having the same depth leads to a well-defined result.
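The depth function can be computed directly from the reduced precedence relation ≪. A small sketch under our own naming conventions (`preds` maps each plan step to its ≪-predecessors); it is not taken from the paper.

```python
# Own sketch: depth of each plan step from the reduced precedence relation.
from functools import lru_cache

def make_depth(preds):
    """preds: dict mapping each plan step to the set of its immediate
    (transitively reduced) predecessors."""
    @lru_cache(maxsize=None)
    def depth(a):
        if not preds.get(a):
            return 0
        return 1 + max(depth(p) for p in preds[a])
    return depth

# The plan of Figure 2: a1<<a3, a2<<a4, a4<<a5, a4<<a6, a1<<a5.
preds = {"a3": {"a1"}, "a4": {"a2"}, "a5": {"a4", "a1"}, "a6": {"a4"}}
depth = make_depth(preds)
print([(a, depth(a)) for a in ["a1", "a2", "a3", "a4", "a5", "a6"]])
# a1, a2 have depth 0; a3, a4 have depth 1; a5, a6 have depth 2.
```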

A timed state is a tuple (π, t) where π is a state and t ≥ 0 a time point. We would like to consider the predicted effect (timed state) (π′, t′) as the result of executing plan P on a given timed state (π, t). To define this relation precisely, we need the following concepts. First of all, let P_t denote the set of actions a with depth_P(a) = t, and let P_>t = ∪_{t′>t} P_t′, P_<t = ∪_{t′<t} P_t′ and P_[t,t′] = ∪_{k=t..t′} P_k. Secondly, we say that a plan step a is enabled in a state π if dom_O(a) ⊆ O(π). Now we can predict the timed state (π′, t + 1) using the timed state (π, t) and the set P_t of plan steps to be executed as follows:

1. whenever an object o does not occur in the range of an action a ∈ P_t, its value in state π′ is the same as its value in π, i.e., π′(o) = π(o);
2. if the object o occurs in the range of an action a ∈ P_t that is enabled in π, its value changes according to the function specification, i.e., π′(o) = f_α(π↾dom_O(a))(o).

Formally, we say that (π′, t + 1) is (directly) generated by execution of P from (π, t), abbreviated by (π, t) →_P (π′, t + 1), iff the following conditions hold:

1. π′(o) = f_α(π↾dom_O(a))(o) for each a ∈ P_t such that a ∈ α and for each o ∈ ran_O(a);
2. π′(o) = π(o) for each o ∉ ran_O(P_t); that is, the value of any object not occurring in the range of an action in P_t remains unchanged. Here, ran_O(P_t) is shorthand for the union of the sets ran_O(a) with a ∈ P_t.

For arbitrary values of t ≤ t′ we say that (π′, t′) is (directly or indirectly) generated by execution of P from (π, t), denoted by (π, t) →*_P (π′, t′), iff the following conditions hold:

1. if t′ = t then π′ = π;
2. if t′ = t + 1 then (π, t) →_P (π′, t′);
3. if t′ > t + 1 then there must exist some state (π″, t′ − 1) such that (π, t) →*_P (π″, t′ − 1) and (π″, t′ − 1) →_P (π′, t′).

3.1 Normality assumptions

In the above subsection, we defined the (expected) result of a plan execution given the known states of several objects. In general, we do not know the state of every object. More particularly, we do not know the health mode of the objects affected by an action unless we can directly verify the effect of the action execution. More generally, the results of plan execution are uncertain since we need not know the health modes of actions, agents and equipment. Therefore, to predict the effect of a plan execution, we must make assumptions about the (health) state of actions, agents and equipment. We simply assume that actions, agents and equipment are in the state nor, unless we have information stating otherwise. Hence, to a given partial state π we add a set of default assumptions δ specifying for actions, agents and equipment that they are executed or behaving normally. Equivalently, such a set of assumptions δ associated with π specifies a partial state π_δ such that: O(π_δ) ⊆ O − O(π), and for each o ∈ O(π_δ): π_δ(o) = nor. Note that normally, in the absence of actions that can sabotage equipment, the status of the equipment objects in O(π_δ) ∩ E will not change during plan execution. Using these assumptions δ, we can define the result of a normal execution of a plan P by extending the initial partial state π at time point t = 0 with the state π_δ and then considering the timed state (π′, t′) as the result of executing P on the timed state (π ⊔ π_δ, 0). That is, (π′, t′) is the result of normal plan execution on (π, 0) iff (π ⊔ π_δ, 0) →*_P (π′, t′).

4 Plan diagnosis

By making (partial) observations at different time points of the ongoing plan execution, we may establish that there are discrepancies between the expected and the observed plan execution.
These discrepancies indicate that the results of executing one or more actions differ from the way they were planned. Identifying these actions and, if possible, what went wrong in their execution will be called primary plan diagnosis. Actions may fail because of external factors such as changes in the environmental conditions (the weather), failing equipment or incorrect beliefs of agents. These external factors are underlying causes that are important for predicting how the remainder of a plan will be executed. The secondary plan diagnosis aims at establishing these underlying causes.

4.1 Primary plan diagnosis

In [Witteveen et al., 2005; Roos and Witteveen, 2005], Witteveen et al. describe how plan execution can be diagnosed by viewing action instances of a plan as components of a system and by viewing the input and output objects of an action as the inputs and outputs of a component. This makes it possible to apply classical MBD to plan execution. Here, we use a modified version of the plan diagnosis proposed by Witteveen et al. Since a plan P = ⟨A, <⟩ is a partial order, actions (plan steps) in A are executed only once. Therefore, we could define a primary diagnosis in which the execution of some actions may fail, using the set of default assumptions δ. However, for other types of diagnosis, such as diagnosis of equipment, such an approach does not suffice. One of the reasons is that equipment, for example, may start malfunctioning during the execution of some action instance and not as a result of it. In general, there may be quite a number of abnormalities that cannot be attributed to the malfunctioning of an action. So we define the more general notion of a qualification κ, consisting of triples (o_j, d, t), each specifying an object o_j, the value d of the object and the time point t at which the object o_j takes this value d. In the case of primary diagnosis, the qualification κ is used to change the value (the health mode) of plan steps. Hence, the triples have the form (a, d, depth(a)) with a ∈ A and d ∈ D_a. Note that the plan diagnosis defined in [Witteveen et al., 2005] is a special case of primary diagnosis where the qualification κ consists of triples (a, ab, depth(a)) and where for the general fault mode ab the behavior of the action is unknown.
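In code, a qualification is just a set of override triples consulted before each execution step; the relation is formalized in the definition below. The following is our own illustrative sketch (the function names and the `effects` interface are invented), showing how κ modifies the timed-state generation of section 3.

```python
# Own sketch: one kappa-modified execution step.
# kappa: set of (object, value, time) triples; BOTTOM stands for "unknown".
BOTTOM = None

def step(pi, t, actions_at_t, effects, kappa):
    """Apply the overrides of kappa that fire at time t, then execute the
    plan steps of depth t as in the normal generation relation."""
    aux = dict(pi)
    for (o, d, tk) in kappa:
        if tk == t:
            aux[o] = d                      # e.g. the triple (a3, "ab", 1)
    new = dict(aux)
    for a in actions_at_t:
        # effects(a, aux) returns the changed objects, with BOTTOM values
        # when the action's health mode makes its result unpredictable.
        new.update(effects(a, aux))
    return new, t + 1
```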

Figure 3: Plan execution with abnormal actions. (States π_0, ..., π_3 over objects o_1, ..., o_5 at times t = 0, ..., 3, changed by action instances a_1, ..., a_8.)

Using qualifications, we say that (π′, t + 1) is (directly) generated by execution of P from (π, t) given the qualification κ, abbreviated by (π, t) →_{κ;P} (π′, t + 1), iff the following conditions hold for some auxiliary state π″:

1. π″(o) = d for each o ∈ O if (o, d, t) ∈ κ, else π″(o) = π(o);
2. (π″, t) →_P (π′, t + 1).

For arbitrary values of t ≤ t′ we say that (π′, t′) is (directly or indirectly) generated by execution of P from (π, t) given the qualification κ, denoted by (π, t) →*_{κ;P} (π′, t′), iff the following conditions hold:

1. if t′ = t then π′ = π;
2. if t′ = t + 1 then (π, t) →_{κ;P} (π′, t′);
3. if t′ > t + 1 then there must exist some state (π″, t′ − 1) such that (π, t) →*_{κ;P} (π″, t′ − 1) and (π″, t′ − 1) →_{κ;P} (π′, t′).

Example. Figure 3 gives an illustration of an execution of a plan. Suppose action a_3 is abnormal and generates a result that is unpredictable (⊥). Given the qualification κ = {(a_3, ab, 1)} and the partially observed state π_0 at time point t = 0, we predict the partial states π_i as indicated in Figure 3, where (π_0, t_0) →*_{κ;P} (π_i, t_i) for i = 1, 2, 3. Note that since the values of o_1 and o_5 cannot be predicted at time t = 2, the results of actions a_6 and a_8 cannot be predicted, and π_3 contains only the value of o_3.

Suppose now that we have a (partial) observation obs(t) = (π, t) of the state of the world at time t and an observation obs(t′) = (π′, t′) at time t′ > t ≥ 0 during the execution of the plan P. We would like to use these observations to infer the health states of the actions occurring in P. Assuming a normal execution of P, we can (partially) predict the state of the world at time point t′ given the observation obs(t): if all actions behave normally, we predict a partial state π″ at time t′ such that (π ⊔ π_δ, t) →*_P (π″, t′). Since we do not require observations to be made systematically, O(π′) and O(π″) might only partially overlap. If this normality assumption holds, the values of the objects that occur in both the predicted state and the observed state at time t′ should match, i.e., we should have π′ ≈ π″. If this is not the case, the execution of some action instances must have gone wrong and we have to determine an action qualification κ such that the predicted state derived using κ agrees with π′. This is nothing else than a straightforward extension of the diagnosis concept in MBD [Reiter, 1987; Console and Torasso, 1991] to plan diagnosis:

Definition 1 Let P = ⟨A, <⟩ be a plan with observations obs(t) = (π, t) and obs(t′) = (π′, t′), where t < t′ ≤ depth(P), and let the action qualification κ be a set of triples (a, d, depth(a)) with a ∈ A and d ∈ D_a. Moreover, let (π ⊔ π_δ, t) →*_{κ;P} (π′_κ, t′) be a derivation assuming the action qualification κ. Then κ is said to be a primary plan diagnosis (action diagnosis) of ⟨P, obs(t), obs(t′)⟩ iff π′ ≈ π′_κ.

So in a primary plan diagnosis κ, the observed partial state π′ at time t′ and the predicted state π′_κ at time t′ assuming the action qualification κ agree upon the values of all objects O(π′) ∩ O(π′_κ) occurring in both states.

Example. Consider again Figure 3 and suppose that we did not know that action a_3 was abnormal and that we observed obs(0) = ((d_1, d_2, d_3, d_4), 0) and obs(3) = ((d′_1, d′_3, d′_5), 3). Using the normal plan derivation relation starting with obs(0), we predict a state π″ at time t = 3 where π″ = (d″_1, d″_2, d″_3, ⊥, ⊥).
If everything is ok (κ = ∅), the values of the objects predicted as well as observed at time t = 3 should correspond, i.e., we should have d′_j = d″_j for j = 1, 3. If, for example, only d′_1 differed from d″_1, then we could qualify a_6 as abnormal, since the predicted state at time t = 3 using κ = {(a_6, ab, 2)} would then be π″_κ = (⊥, ⊥, d″_3, ⊥, ⊥), and this partial state agrees with the observed state.

Note that for all objects in O(π′) ∩ O(π″_κ), the qualification κ provides an explanation for the observation π′ made at time point t′. Hence, for these objects the qualification provides an abductive diagnosis [Console and Torasso, 1990]. For all observed objects in O(π′) − O(π″_κ), no value can be predicted given the qualification κ. Hence, by declaring them to be unpredictable, possible conflicts with respect to these objects under the assumption of a normal execution of all actions are resolved. This corresponds with the idea of a consistency-based diagnosis [Reiter, 1987].

Diagnosing a sequence of observations. In the previous section we described how to diagnose the execution of a plan between two observations at different time points. Here, the observation at the earliest time point corresponds to the observed inputs of a system in classical Model-Based Diagnosis, while the observation at the latest time point corresponds to the observed outputs. During the execution of a plan, however, we may make observations at more than two time points. Unless we observe the complete state of the world at each of these time points, we cannot use successive pairs of observations to make the best possible diagnosis of the part of the plan executed between these time points. Hence, we must extend our definition of plan diagnosis to handle sequences of observations. The use of a sequence of partial observations implies that a diagnosis of the part of a plan executed between time points t_i and t_{i+1} may lead to predictions for the unobserved objects at t_{i+1} that are relevant for diagnosing the part of the plan executed between t_{i+1} and t_{i+2}. Hence, a qualification of the actions executed between two time points t_i and t_{i+1} depends on the qualification of the actions executed before t_i.

Definition 2 Let P = ⟨A, <⟩ be a plan with observations obs(t_1) = (π_1, t_1), ..., obs(t_k) = (π_k, t_k), where t_1 < t_2 < ... < t_k ≤ depth(P). Moreover, let κ be an action qualification. The action qualification κ is said to be a plan diagnosis of ⟨P, obs(t_1), ..., obs(t_k)⟩ iff (π_1 ⊔ π_δ, t_1) →*_{κ;P} (π′_2, t_2), (π_i ⊔ π′_i, t_i) →*_{κ;P} (π′_{i+1}, t_{i+1}) for 1 < i < k, and π_i ≈ π′_i for 1 < i ≤ k.

4.2 Secondary plan diagnosis

Actions may fail because of unforeseen (environmental) conditions such as being struck by lightning, malfunctioning equipment or incorrect beliefs of agents. Diagnosing these secondary causes is more difficult since weather, equipment and agents may play a role in the execution of several actions. Moreover, objects such as equipment and weather may go through several unforeseen mode changes. The qualification introduced above for primary diagnosis can also be used for secondary diagnosis. In fact, we did not use the default assumptions to model qualifications of actions precisely in order to have a uniform representation for both failing actions and underlying causes. A secondary qualification κ consists of triples (o_j, d, t) where o_j ∈ O − A is an object that changes to the value d ∈ D_j at time point t. Usually we choose for the time point t the depth depth(a) of the first action instance where the change manifests itself; so, for some action a, t = depth(a) and o_j ∈ dom_O(a).

An object such as an airplane may have several (fault) modes, and transitions between these modes are possible. For example, continuing to drive with an overheated engine will cause more severe damage, namely a completely ruined engine. Of course, not every transition between the (fault) modes is valid. For example, a truck with a broken engine cannot become a truck with only a flat tyre without first repairing the truck's engine. Hence, we use Discrete Event Systems [Cassandras and Lafortune, 1999] to represent equipment or objects such as the weather. The specification of the discrete event system consists of the values D_o of an object o, the events (o, d, t) ∈ κ that change the value of the object o, and a transition function describing for object o the set of valid transitions. Hence, we assume that for every object o_j ∈ O a transition function tr_j : D_j → 2^{D_j} has been specified. This transition function describes how the value of an object may change due to events unknown to the agent. One of the goals of diagnosis is to determine some of these unknown events. The values of some objects in the environment may only change due to the execution of actions. For these objects o_j, the transition function is the identity function, i.e., tr_j(d) = {d} for every d ∈ D_j; the identity function disallows any change in the object's value that is not the result of an action.
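A transition function of this kind is naturally encoded as a map from modes to sets of reachable modes; a secondary qualification event is admissible only if it respects this map. A small sketch of our own, with invented mode names:

```python
# Own sketch: mode-transition function for a truck engine and the
# admissibility check it imposes on secondary qualification events.
tr_engine = {
    "nor":        {"nor", "overheated", "flat_tyre"},
    "overheated": {"overheated", "ruined"},   # keeps degrading unless repaired
    "ruined":     {"ruined"},                 # only a repair action helps
    "flat_tyre":  {"flat_tyre"},
}

def valid_event(tr, current_mode, new_mode):
    """An event (o, d, t) is only allowed if d is reachable from the
    object's current mode."""
    return new_mode in tr[current_mode]

print(valid_event(tr_engine, "nor", "overheated"))    # True
print(valid_event(tr_engine, "ruined", "flat_tyre"))  # False: repair first
```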
Since the transition function places restrictions on the possible transitions of an object, we have to adapt the first item of the specification of (π, t) →_{κ;P} (π′, t + 1):

1. π″(o_j) = d if (o_j, d, t) ∈ κ and d ∈ tr_j(π(o_j)), else π″(o_j) = π(o_j);
2. (π″, t) →_P (π′, t + 1).

Definition 3 Let P = ⟨A, <⟩ be a plan with observations obs(t) = (π, t) and obs(t′) = (π′, t′), where t < t′ ≤ depth(P), and let the qualification κ be a set of triples (o, d, t) with o ∈ O − A and d ∈ D_o. Moreover, let (π ⊔ π_δ, t) →*_{κ;P} (π′_κ, t′) be a derivation assuming the qualification κ and the transition functions tr_j : D_j → 2^{D_j} for each object o_j ∈ O. Then the qualification κ is said to be a secondary plan diagnosis of ⟨P, obs(t), obs(t′)⟩ iff π′ ≈ π′_κ.

The secondary diagnosis can be divided into agent, equipment and environment diagnosis depending on whether the object o in a triple (o, d, t) ∈ κ belongs to Ag, E or N, respectively. An interesting special case of secondary diagnosis is agent diagnosis. Agents may incorrectly execute an action because of wrong internal beliefs about the agent's environment or about how actions should be executed. One possible cause of such wrong beliefs is incorrect observations made with malfunctioning equipment such as sensors. In principle, an agent's incorrect beliefs can be modeled using the agent's state. Hence, we need an agent qualification (o_j, d, t) with o_j ∈ Ag describing the incorrect beliefs of an agent that have led to the incorrect execution of actions. This can especially be the case if an action must be achieved by choosing appropriate behaviors. Note that agent diagnosis is closely related to the social diagnosis described by Kalech and Kaminka [Kalech and Kaminka, 2003; 2004].

4.3 Applications of diagnosis

As mentioned in the introduction, the information provided by primary and secondary diagnosis can be used to improve the way agents deal with plan failures. First, to adjust the planning after a plan failure, we need an analysis of the expected future execution of the plan and of whether the goals will still be reached. Secondary plan diagnosis enables us to determine which future actions may also be affected by the malfunctioning agents and equipment, and by unforeseen state changes in the environment.

Definition 4 Let t be the current time point and let κ be a secondary diagnosis of the plan executed so far. Then the set of future actions that will be affected given the current diagnosis κ is:

{a ∈ A | (o_j, d, t′) ∈ κ, o_j ∈ dom_O(a), d ≠ nor, depth(a) ≥ t}
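Definition 4 translates directly into a set comprehension. Again an own sketch with invented data structures (`dom_O` as a dict, κ as triples):

```python
# Own sketch of Definition 4: future actions touched by a non-nor event.
def affected_actions(A, dom_O, depth, kappa, t_now):
    """A: plan steps; dom_O[a]: objects an action depends on;
    kappa: secondary diagnosis as (object, value, time) triples."""
    return {a for a in A
              for (o, d, _tk) in kappa
              if o in dom_O[a] and d != "nor" and depth[a] >= t_now}

A = {"fly", "land"}
dom_O = {"fly": {"engine", "pilot"}, "land": {"gear", "pilot"}}
depth = {"fly": 1, "land": 2}
kappa = {("gear", "ab", 1)}
print(affected_actions(A, dom_O, depth, kappa, t_now=2))  # {'land'}
```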

Besides identifying the actions that will be affected by agent and equipment failures or by unexpected changes in the environment, we can also determine the goals that can still be reached.

Definition 5 Let t be the current time point, let π be the current partial state and let κ be a secondary diagnosis of the plan executed so far. Moreover, let (π, t) →*_{κ;P} (π′, depth(P)). Then the set of goals that can still be realized is given by: {g ∈ G | π′ ⊨ g}.

Second, based on the equipment diagnosis, the agents can point out which equipment should be repaired. Moreover, we can view repairs as events that change an equipment object from a failed state into a normal state. Then we can use Definitions 4 and 5 to verify the consequences a certain repair has. This way, agents can consider which repair to choose if repairs are limited (e.g., due to their costs). Third, it is also important to know the agents responsible for the failures. This information can contribute to negotiation about repairs of plan failures, to the division of the costs of failed plans or of plan repair, and to avoiding failures of future plans. As an illustration of the different agents that can be responsible for a plan-execution failure, reconsider the example in the introduction, where the agent responsible for the belly landing can be the pilot agent, the maintenance agent, or the airline agent that reduced the maintenance budget. Here we present a very simple model of responsibility. We introduce a responsibility function res : (O − N) → Ag specifying the agent that is responsible for each of the action, agent and equipment objects.

Definition 6 Let κ be any diagnosis of a plan execution and let res : (O − N) → Ag be a responsibility function. Then for each event (o, d, t) ∈ κ, the responsible agent is determined by res(o).

5 Conclusion

This paper describes a generalization of the model for plan diagnosis presented in [Witteveen et al., 2005; Roos and Witteveen, 2005]. New in the current approach are (i) the introduction of primary and secondary diagnosis, and (ii) the introduction of objects representing actions, agents and equipment. The primary diagnosis identifies failed actions and possibly the way in which they failed, while the secondary diagnosis addresses the causes of the action failures. The latter is an improvement over the plan diagnosis presented in [Witteveen et al., 2005; Roos and Witteveen, 2005], where only dependencies between action failures could be described using causal rules. An additional feature of the approach proposed here is that all objects can be modeled as discrete event systems. This enables the description of the unknown dynamic behavior of objects such as equipment over time. The secondary diagnosis then identifies the events behind the state changes of these objects. The results of primary and secondary diagnosis can be used to predict future action failures, to determine the goals that can still be reached and to identify the agents that can be held responsible for plan-execution failures.

References
[Birnbaum et al., 1990] L. Birnbaum, G. Collins, M. Freed, and B. Krulwich. Model-based diagnosis of planning failures. In AAAI-90, 1990.
[Carver and Lesser, 2003] N. Carver and V. R. Lesser. Domain monotonicity and the performance of local solution strategies for CDPS-based distributed sensor interpretation and distributed diagnosis. Autonomous Agents and Multi-Agent Systems, 6(1):35-76, 2003.
[Cassandras and Lafortune, 1999] C. G. Cassandras and S. Lafortune. Introduction to Discrete Event Systems.
Kluwer Academic Publishers, 1999.
[Console and Torasso, 1990] L. Console and P. Torasso. Hypothetical reasoning in causal models. International Journal of Intelligent Systems, 5:83-124, 1990.
[Console and Torasso, 1991] L. Console and P. Torasso. A spectrum of logical definitions of model-based diagnosis. Computational Intelligence, 7:133-141, 1991.
[de Jonge and Roos, 2004] F. de Jonge and N. Roos. Plan-execution health repair in a multi-agent system. In PlanSIG 2004, 2004.
[de Jonge et al., 2005] F. de Jonge, N. Roos, and H. J. van den Herik. Keeping plan execution healthy. In Multi-Agent Systems and Applications IV: CEEMAS 2005, LNCS 3690, 2005.
[Horling et al., 2001] B. Horling, B. Benyo, and V. Lesser. Using self-diagnosis to adapt organizational structures. In Proceedings of the 5th International Conference on Autonomous Agents. ACM Press, 2001.
[Kalech and Kaminka, 2003] M. Kalech and G. A. Kaminka. On the design of social diagnosis algorithms for multi-agent teams. In IJCAI-03, 2003.
[Kalech and Kaminka, 2004] M. Kalech and G. A. Kaminka. Diagnosing a team of agents: Scaling-up. In AAMAS 2004, 2004.
[Kleer and Williams, 1989] J. de Kleer and B. C. Williams. Diagnosing with behaviour modes. In IJCAI-89, pages 1324-1330, 1989.
[McCarthy, 1977] J. L. McCarthy. Epistemological problems of artificial intelligence. In IJCAI, pages 1038-1044, 1977.
[Reiter, 1987] R. Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32:57-95, 1987.
[Roos and Witteveen, 2005] N. Roos and C. Witteveen. Diagnosis of plans and agents. In Multi-Agent Systems and Applications IV: CEEMAS 2005, LNCS 3690, 2005.
[Struss and Dressler, 1989] P. Struss and O. Dressler. 'Physical negation': Integrating fault models into the general diagnostic engine. In IJCAI-89, pages 1318-1323, 1989.
[Witteveen et al., 2005] C. Witteveen, N. Roos, R. van der Krogt, and M. de Weerdt. Diagnosis of single and multi-agent plans. In AAMAS 2005, pages 805-812, 2005.

Getting the Probabilities Right for Measurement Selection

Johan de Kleer, Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA, USA

Abstract

The core objective of model-based diagnosis is to identify candidate diagnoses which explain the observed symptoms. Usually there are multiple such candidate diagnoses, and a model-based diagnostic engine proposes additional measurements to better isolate the actual diagnosis. An objective of such an algorithm is to identify this diagnosis at minimum average expected cost (e.g., the sum of the costs of the measurements). Minimizing this cost requires accurate probability estimates for the candidate diagnoses. Most diagnostic engines utilize sequential diagnosis combined with Bayes' rule to determine the posterior probability of a candidate diagnosis given a measurement outcome. Unfortunately, one of the terms of Bayes' rule, the conditional probability of a measurement outcome given a candidate diagnosis, must often be estimated (noted as ɛ in most formulations). This paper presents a reformulation of the sequential diagnosis process used in diagnostic engines and shows how different ɛ-policies lead to varying results.

1 Introduction

Model-based diagnosis has been applied to a wide range of applications including automobiles [Struss and Price, 2004], spacecraft [Williams and Nayak, 1996], mobile robots [Steinbauer and Wotawa, 2005] and software [Köb and Wotawa, 2004], to mention just a few. The core objective of model-based diagnosis is to identify candidate diagnoses which explain the observed symptoms. Usually there are multiple such diagnoses, and a model-based diagnostic engine proposes additional measurements to better isolate the actual diagnosis. An objective of such an algorithm is to identify this diagnosis at minimum average expected cost (e.g., the sum of the costs of the measurements). Minimizing this cost requires accurate probability estimates for the candidate diagnoses. Most diagnostic engines utilize greedy sequential diagnosis combined with Bayes' rule to determine the posterior probability of a candidate diagnosis given a measurement outcome. Unfortunately, one of the terms of Bayes' rule, the conditional probability of a measurement outcome given a candidate diagnosis, must often be estimated (noted as ɛ in most formulations). This paper presents a reformulation of the sequential diagnosis process used in most diagnostic engines and shows the results of various ɛ-policies. In order to minimize possible confounding of different domain models and to have easy access to many examples, we draw all our examples from the widely available ISCAS-85 combinational logic test suite [Brglez and Fujiwara, 1985].

In order to focus on the impact of varying ɛ-policies we make the following assumptions (all of which can be relaxed, but would confound the results): (1) all measurements have equal cost, (2) no intermittent faults, (3) no multi-step lookahead, (4) the inference engine used to derive the consequences of observations is complete, (5) all the system's inputs are known, (6) one symptomatic output is given, (7) time is not modeled, (8) the system has at most two faults, (9) the behavioral model for each component is completely described, (10) the system is well-formed (no unattached inputs or outputs, and no cycles).

2 GDE Probability Framework

This basic framework is described in [de Kleer and Williams, 1987; de Kleer et al., 1992].

Definition 1 A system is a triple (SD, COMPS, OBS) where:
1. SD, the system description, is a set of first-order sentences;
2. COMPS, the system components, is a finite set of constants;
3. OBS, a set of observations, is a set of first-order sentences.

Definition 2 Given two sets of components Cp and Cn, define D(Cp, Cn) to be the conjunction:

[⋀_{c ∈ Cp} AB(c)] ∧ [⋀_{c ∈ Cn} ¬AB(c)],

where AB(x) represents that the component x is ABnormal (faulted).

A diagnosis is a sentence describing one possible state of the system, where this state is an assignment of the status normal or abnormal to each system component.

Definition 3 Let Δ ⊆ COMPS. A diagnosis for (SD, COMPS, OBS) is D(Δ, COMPS − Δ) such that the following is satisfiable:

SD ∪ OBS ∪ {D(Δ, COMPS − Δ)}.

Components are assumed to fail independently. Therefore, the prior probability that a particular diagnosis D(Cp, Cn) is correct is:

p(D) = ∏_{c ∈ Cp} p(c) ∏_{c ∈ Cn} (1 − p(c)),   (1)

where p(c) is the prior probability that component c is faulted. The posterior probability of a diagnosis D after an observation that x has value v is given by Bayes' rule:

p(D | x = v) = p(x = v | D) p(D) / p(x = v).   (2)

p(D) is determined by the preceding measurements and the prior probabilities of failure. The denominator p(x = v) is a normalizing term that is identical for all p(D) and thus need not be computed directly. Thus the only term remaining to be evaluated is p(x = v | D):

p(x = v | D) = 1 if x = v follows from D ∪ SD;
p(x = v | D) = 0 if D, SD and x = v are inconsistent.

If neither holds,

p(x = v | D) = ɛ_ik,   (3)

where ɛ_ik = 1/m. This corresponds to the intuition that if x ranges over m possible values, then each possible value is equally likely. In digital circuits m = 2 and thus ɛ = 0.5.

Consider other possible values for ɛ_ik. As ɛ approaches 0, some diagnoses would be assigned far smaller posterior probabilities, which would lead to inaccurate conclusions and excessive measurement cost; for example, multiple faults would be assigned far smaller probability than is actually the case. As long as ɛ > 0 the GDE algorithm will identify the correct diagnosis after sufficient measurements (ɛ = 0 would assign probability 0 to correct diagnoses). As ɛ approaches 1, there would be little need to use Bayes' rule and the relative likelihoods of any two diagnoses would always be constant; this would force GDE to consider very unlikely candidate diagnoses. Looked at differently, as ɛ varies from 0 to 1 it approximates the spectrum from abductive-based to consistency-based diagnostic frameworks [Brusoni et al., 1998].
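The Bayes update of equation (2) with the ɛ rule of equation (3) can be written compactly. A minimal sketch, assuming a `predicts` oracle that returns True/False/None for entailed, contradicted and unconstrained outcomes respectively (the names are ours, not GDE's API):

```python
# Minimal sketch of the posterior update, eqs. (1)-(3).
def posterior(diagnoses, predicts, x, v, eps):
    """diagnoses: dict D -> current p(D).
    predicts(D, x, v): True if x=v follows from D and SD, False if it is
    inconsistent, None otherwise (then the epsilon rule applies)."""
    post = {}
    for D, pD in diagnoses.items():
        outcome = predicts(D, x, v)
        like = 1.0 if outcome is True else 0.0 if outcome is False else eps
        post[D] = like * pD
    Z = sum(post.values())                # p(x = v), the normalizer
    return {D: p / Z for D, p in post.items()}
```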
ɛ clearly must lie between 0 and 1, but should it be ɛ = 1/m? There are a number of reasons why a candidate diagnosis might fail to predict a value for a measured variable: incompleteness in the inference engine used (e.g., GDE's); incompleteness in the component models; the model may predict a disjunction of values (as can be the case in qualitative models); or lack of knowledge of the actual faulty behavior of a component. Although lack of inferential completeness is common in model-based diagnosis engines, in this paper we focus on the last source and use complete models and a complete inference procedure. Note that lack of knowledge of the faulty behavior of a component does not necessarily imply that a candidate diagnosis fails to predict some value.

A minimal diagnosis D(B, G) is one for which there is no other diagnosis D(B′, G′) where B′ is a proper subset of B. Minimal candidate diagnoses often predict every variable value; under the assumptions of this paper, minimal candidates always assign a value to every variable. Consider a minimal diagnosis D(B, G) with b ∈ B. As there are no fault models, there is no model to predict a value for the output(s) of b. As D(B, G) is minimal, we know that D(B − {b}, G ∪ {b}) cannot be consistent; therefore the inputs/outputs observed or inferred around b are inconsistent with the correct behavior of b. In the case of binary-valued variables, this means that the output of b must be the opposite of what b's behavioral model predicts (i.e., if the variable can't be 0, it must be 1 and vice versa). Hence, failure to assign a variable value only occurs with multiple faults (or with multi-valued quantities such as +, −, 0).

3 Using an ɛ-policy

In order to avoid excessive computational cost, many diagnostic algorithms utilize a greedy minimum-entropy approach to select the best measurement to make next (i.e., the one which, on average, minimizes the cost of identifying the correct diagnosis). [de Kleer and Williams, 1987] shows how expected entropy outcomes for hypothetical measurements can be determined without additional inference: the outcome can be calculated directly from the current probability distribution of measurement outcomes given ɛ_ik = 1/m. We now generalize this approach to allow an arbitrary ɛ-policy. Fortunately, the outcomes can be evaluated directly in the general case as well. Given a set of diagnoses DIAGNOSES, and assuming all measurements are of unit cost,

H = − Σ_{D ∈ DIAGNOSES} p(D) log p(D)   (4)

estimates the number of measurements needed to complete a diagnosis. We define

S_ik = {D ∈ DIAGNOSES | D ∪ SD ∪ OBS ⊨ x_i = v_ik},
U_i = {D ∈ DIAGNOSES | D ∉ S_ik for any k},
p(S_ik) = Σ_{D_j ∈ S_ik} p_j,  p(U_i) = Σ_{D_j ∈ U_i} p_j,
p(x_i = v_ik) = p(S_ik) + ɛ_ik p(U_i).

ɛ_ik is determined by the diagnostic policy, under the restriction that Σ_{k=1..m} ɛ_ik = 1 for all i. The expected entropy after measuring x_i is:

H_e(x_i) = Σ_{k=1..m} p(x_i = v_ik) H(x_i = v_ik).   (5)

Let p′ be the probability after making the measurement. Substituting equation (2) into equation (4) gives:

H(x_i = v_ik) = − Σ_{l ∈ S_ik ∪ U_i} p′_l log p′_l
= − Σ_{l ∈ S_ik} (p_l / p(x_i = v_ik)) log (p_l / p(x_i = v_ik)) − Σ_{l ∈ U_i} (ɛ_ik p_l / p(x_i = v_ik)) log (ɛ_ik p_l / p(x_i = v_ik)).

Substituting this into equation (5) and expanding the logarithms, the terms involving log p_l together give the current entropy H, which is necessarily constant; the terms involving log p(x_i = v_ik) give the negative entropy of the probability distribution of x_i; and the remaining term collects the ɛ contributions. The expected entropy H_e(x_i) to minimize thus has the following form:

H + Σ_{k=1..m} p(x_i = v_ik) log p(x_i = v_ik) − p(U_i) Σ_{k=1..m} ɛ_ik log ɛ_ik.

The best proposed measurement is therefore the one which maximizes the information gain:

− Σ_{k=1..m} p(x_i = v_ik) log p(x_i = v_ik) + p(U_i) Σ_{k=1..m} ɛ_ik log ɛ_ik.

Expected information gain is the expected reduction in the number of additional measurements needed to isolate the true diagnosis and always lies between 0 and 1. There is thus no need to utilize additional inferential machinery to hypothesize the results of possible measurement outcomes: given a policy for distributing p(U_i), the proposed measurement can be evaluated directly from known probabilities. Furthermore, if the ɛ-policy is fixed, then Σ_{k=1..m} ɛ_ik log ɛ_ik is constant throughout the diagnosis task.

Table 1: Expected costs and information gains for cascaded inverters after measurements (with p = 0.01). Columns give ɛ = 0.01, ɛ = 0.5 and ɛ = 1, first after observing a = 0 and then after observing a = 0, e = 0; rows correspond to the measurement points a, b, c, d, e.

4 Advantages of the GDE Framework

One of the fundamental advantages of the GDE framework is that it is unnecessary to enumerate all the possible fault modes beforehand. Thus a diagnostic algorithm can successfully diagnose a system having never-before-seen faults. Such fault modes are a challenge to more conventional diagnostic approaches, which require far more prior knowledge of all the system's fault modes. GDE's probabilistic framework allows it to identify the best measurement to make next to localize the system's fault. Consider the simple four-inverter circuit of Figure 1 with input a = 0.

Figure 1: Four sequential inverters.

To see the effects of different ɛ_ik, consider some simplistic policies. Table 1 lists the expected costs of measuring all the variables, first after a = 0, and then after e = 0, for three values of ɛ_ik. Note that ɛ = 1 is equivalent to using no probabilistic information at all, and as a consequence the resulting costs cannot be used to rank proposed measurements. As long as 0 < ɛ < 1, GDE can eventually identify good measurements to make next.
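The measurement-selection rule derived in section 3 reduces to arithmetic over the current candidate probabilities. A sketch under our own data layout (not the actual ɛGDE implementation described next):

```python
# Own sketch: expected information gain of measuring x_i, per section 3.
from math import log2

def information_gain(p_S, p_U, eps):
    """p_S[k]: total probability of diagnoses predicting value k at x_i;
    p_U: total probability of diagnoses predicting no value at x_i;
    eps[k]: the epsilon-policy weights, summing to 1."""
    p_out = [s + e * p_U for s, e in zip(p_S, eps)]
    gain = -sum(p * log2(p) for p in p_out if p > 0)
    gain += p_U * sum(e * log2(e) for e in eps if e > 0)
    return gain

# Binary variable with GDE's policy eps = [0.5, 0.5]:
print(information_gain([0.6, 0.3], 0.1, [0.5, 0.5]))  # about 0.83 bits
```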
The data in Figure 2 shows the average expected costs with ε_i0 increasing in .05 steps and ε_i1 = 1 − ε_i0. All components fail with equal probability p. Circuit c432 has 160 components.

The data was gathered with 248 randomly generated double faults, each with a fully populated input vector and one identified symptom. Figure 2 illustrates the diagnostic cost for each fixed ε-policy. For this device, diagnostic cost is minimized if ε is nearly 1 for 1 values, and nearly 0 for 0 values. This is very different from the ε = .5 estimate of GDE.

Figure 2: Average cost vs. ε for 1 for circuit c432 with .95 confidence interval. c432 has 160 gates and is a 27-channel interrupt controller from the ISCAS-85 test suite.

Figure 3 shows the results on circuit c499, which has 202 components. The diagnostic task is to isolate a double fault from among all possible double faults. For this task, there is a sharp notch around .5, which corresponds to GDE's estimate. Figure 4 shows the results on circuit c880, which has 383 components. This likewise has a notch at .5.

Figure 3: Average cost vs. ε for 1 for circuit c499. c499 has 202 gates and is a 32-bit single-error-correcting circuit from the ISCAS-85 test suite.

Figure 4: Average cost vs. ε for 1 for circuit c880. c880 has 383 gates and is an 8-bit arithmetic logic unit from the ISCAS-85 test suite.

6 Towards a Dynamic ε-Policy

Figures 2, 3 and 4 suggest that adaptive ε-policies can improve diagnostic costs. We would like to devise a dynamic ε-policy appropriate to each diagnostic task. It is also important that this policy be easy to compute; otherwise it competes with the alternative of expensive multi-step lookahead. Consider the oversimplistic example of a single inverter (say A in Figure 1). Assume that p(¬AB(A)) = α and we measured a = 0 and b = 1. The priors are p(AB(A)) = 1 − α, p(¬AB(A)) = α. Given this evidence, ε = 1/2 gives a posterior for p(¬AB(A)) of 2α/(α + 1).

Definition 4 [Raiman et al., 1991] A component behaves non-intermittently if its outputs are a function of its inputs.

Table 2 describes all possible binary functions of one input and one output. f_3 describes the correct behavior of an inverter, f_0 is the fault stuck-at-0, f_2 is the fault stuck-at-1, and f_1 an unexpected short of input to output.

Table 2: The 4 possible binary functions of one input and one output. i is the binary input, and each column f_j lists the output for the corresponding input.

    i | f_0  f_1  f_2  f_3
    0 |  0    0    1    1
    1 |  0    1    1    0

Given that each fault mode f_{i≠3} has equal probability ((1 − α)/3), the correct posterior should be 3α/(2α + 1). This corresponds to ε = 1/3. The difference is:

$$\frac{\alpha(\alpha - 1)}{(2\alpha + 1)(\alpha + 1)}.$$

Suppose that we had measured b = 0 instead. Table 2 shows that only one of the three possible faulty behaviors is eliminated: f_2 is eliminated, but f_0 and f_1 remain. Therefore, p(x = v|D) = 2/3. For simplicity, consider only the first three inverters and the double-fault diagnosis D({B, C}, {A}). The prior is (1 − α)²α. After measuring d = 0, GDE reduces its probability by 1/2. Given Table 2, we can compute the posterior probability exactly as follows. Inverter C is faulted with output 0, and thus it can only be behaving according to functions f_0 or f_1. The input to the faulty inverter B is 1, but that alone does not provide any evidence for changing its probability. For the f_1 mode of inverter C to produce d = 0, c must be 0. This is inconsistent with modes f_1 or f_2 of B. Therefore, there are only 4 consistent combinations of modes for B and C: {⟨f_0, f_0⟩, ⟨f_0, f_1⟩, ⟨f_1, f_0⟩, ⟨f_2, f_0⟩}. As only 4 out of the 9 combinations survive, the posterior probability is reduced by 4/9. Measuring c = 0 eliminates only ⟨f_2, f_0⟩, to a final reduction of 3/9. Tables 3 and 4 summarize these calculations. Consider a simple 2-input and gate.
Table 5 lists all possible behaviors for a 2-input/1-output gate. The correct behavior for the and gate is given by f_2. All remaining behaviors correspond to fault modes.
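Returning to the inverter analysis above, the mode-combination counting can be mechanized; the following sketch (ours, with invented names) enumerates the faulty behaviors of Table 2 for inverters B and C and reproduces the 4-out-of-9 figure:

# Table 2 behaviors, as functions from the input bit to the output bit
modes = {
    'f0': lambda i: 0,      # stuck-at-0
    'f1': lambda i: i,      # short of input to output
    'f2': lambda i: 1,      # stuck-at-1
    'f3': lambda i: 1 - i,  # correct inverter behavior
}
faulty = ['f0', 'f1', 'f2']            # B and C are both assumed faulted

b = 1                                   # input to B (a = 0 through healthy A)
consistent = [(mb, mc) for mb in faulty for mc in faulty
              if modes[mc](modes[mb](b)) == 0]        # observed d = 0
print(consistent)   # [('f0','f0'), ('f0','f1'), ('f1','f0'), ('f2','f0')]
print(len(consistent), "of", len(faulty) ** 2)        # 4 of 9 combinations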

This analysis and the results of Figures 2 and 3 suggest that an adaptive ε-policy might be exploited to further improve the diagnostic cost of isolating faulty components.

Table 3: GDE vs. correct probability changes for D({B, C}, {A}).

    OBS   | ε = 1/2 | correct
    d = 0 |   1/2   |   4/9
    c = 0 |   1/4   |   3/9

Table 4: GDE vs. correct probability changes for D({A, B, C}, {}).

    OBS   | ε = 1/2 | correct
    d =   |         |
    c =   |         |
    b =   |         |

Table 5: Possible functions of two inputs and one output.

    i_0 i_1 | f_0  f_1  f_2   f_3   f_4   f_5   f_6   f_7
    i_0 i_1 | f_8  f_9  f_10  f_11  f_12  f_13  f_14  f_15

7 Extension to Fault Modes

The probabilistic framework outlined in Sections 2 and 3 can be directly expanded to include component fault modes [de Kleer and Williams, 1989]. The AB/¬AB framework can be viewed as assigning each component either a G (good) mode or a U (unknown) faulty mode. The good model for an and gate is given by column f_2 of Table 5, and faulty behaviors correspond to all the remaining columns. Two common fault models for an and gate are SA1 (output stuck at 1) and SA0 (output stuck at 0). This model for the gate has 4 modes: G (f_2), SA1 (f_15), SA0 (f_0), and U (which corresponds to the remaining 13 columns). All the analyses of this paper directly extend to multiple fault modes, but p(U) will always be smaller because some of the fault modes are explicitly modeled with their own probabilities. As the extension to fault modes is direct, we do not formalize it in this paper.

8 Dynamic Epsilon Policies

One possible dynamic policy that has been suggested is a max-entropy policy, where each ε_ik is chosen to maximize the entropy of the value distribution of x_i. The maximum entropy distribution for a two-valued quantity is trivially computable. The motivation for this policy is that it injects the least amount of information into the measurement scoring. This policy does not yield significant improvement in overall diagnostic cost. Using the framework of the previous sections and two additional assumptions, we now define an ε-policy which computes ε exactly for each individual variable and candidate diagnosis. Intuitively, we apply the diagnostic framework laid out in Section 2 recursively for each possible candidate. We assume that each component model specifies an output value when all its inputs are known. This assumption holds for digital circuits, but may not apply for some qualitative modeling paradigms, where adding a qualitative + and a qualitative − yields no result. In addition, we assume that we are provided with the prior probabilities for each possible function of a component. For example, in the case of an and gate we are given the prior probability of each possible function f_i of Table 5. Let p(f_i(c)) be the prior probability that component c behaves according to function f_i. We know that:

$$\sum_{f_i} p(f_i(c)) = 1, \qquad \sum_{f_i \in F(c)} p(f_i(c)) = p(c),$$

where F(c) is the set of all faulty functions f_i and p(c) is the prior probability that component c is faulted, as defined in Equation 1. Consider a diagnosis D = D(B, G) which fails to predict some x = v. Restating Equation 3:

$$p(x = v \wedge D(B, G)) = \epsilon_{ik}\, p(D(B, G)).$$

Definition 5 A micro candidate diagnosis M for a candidate diagnosis D = D(B, G) is a conjunction

$$M = \bigwedge_{c \in B} f_i(c),$$

where f_i(c) is a propositional formula describing the behavior f_i of component c, and M is consistent with D ∧ SD ∧ OBS.

p(M) follows straightforwardly from Bayes' rule:

$$p(M) = \prod_{c \in B} \frac{p(f_i(c))}{p(c)}.$$

Thus,

$$p(x = v \wedge D) = \sum_{M \,\mathrm{s.t.}\, M \wedge D \wedge SD \wedge OBS \,\models\, x = v} p(M)\, p(D),$$

or,

$$\epsilon_{ik} = \sum_{M \,\mathrm{s.t.}\, M \wedge D \wedge SD \wedge OBS \,\models\, x = v} p(M),$$

where the M are the micro candidate diagnoses for D. If all the faulted p(f_i(c)) are equal for each component c, this new framework reduces to that of the previous section. This approach is most powerful when the individual p(f_i(c)) vary significantly. In these cases, the resulting diagnostic efficiency improvement can be significant. Using GDE with fault modes, identical results would be obtained if explicit fault modes were introduced for each possible faulty function of each component. Unfortunately, this approach is computationally intractable: for a digital circuit where the sum of the numbers of input terminals is l, the complexity would be $2^{2l}$.

9 Conclusions

This paper presents two advances. First, it presents a generalization of the information gain equation used in evaluating possible measurements. Second, it presents an algorithm which improves the average expected costs of diagnosis by exploiting more precise estimates of the ε's. Measures of improved diagnostic costs are found in the longer version of this paper.

10 Acknowledgments

Conversations with Olivier Raiman and Brian Williams helped clarify many of these concepts.

References

[Brglez and Fujiwara, 1985] F. Brglez and H. Fujiwara. A neutral netlist of 10 combinational benchmark circuits and a target translator in Fortran. In Proc. IEEE Int. Symposium on Circuits and Systems, June 1985.
[Brusoni et al., 1998] Vittorio Brusoni, Luca Console, Paolo Terenziani, and Daniele Theseider Dupré. A spectrum of definitions for temporal model-based diagnosis. Artificial Intelligence, 102(1):39-79, 1998.
[de Kleer and Williams, 1987] J. de Kleer and B. C. Williams. Diagnosing multiple faults. Artificial Intelligence, 32(1):97-130, April 1987. Also in: Readings in NonMonotonic Reasoning, edited by Matthew L. Ginsberg (Morgan Kaufmann, 1987).
[de Kleer and Williams, 1989] J. de Kleer and B. C. Williams. Diagnosis with behavioral modes. In Proc. 11th IJCAI, Detroit, 1989.
[de Kleer et al., 1992] J. de Kleer, A. Mackworth, and R. Reiter. Characterizing diagnoses and systems. Artificial Intelligence, 56(2-3), 1992.
[Köb and Wotawa, 2004] Daniel Köb and Franz Wotawa. Introducing alias information into model-based debugging. In 16th European Conference on Artificial Intelligence (ECAI), Valencia, Spain, August 2004.
[Raiman et al., 1991] O. Raiman, J. de Kleer, V. Saraswat, and M. H. Shirley. Characterizing non-intermittent faults. In Proc. 9th National Conf. on Artificial Intelligence, Anaheim, CA, July 1991.
[Steinbauer and Wotawa, 2005] Gerald Steinbauer and Franz Wotawa. Detecting and locating faults in the control software of autonomous mobile robots. In Proceedings of the 19th International Joint Conference on AI (IJCAI-05), Edinburgh, UK, 2005.
[Struss and Price, 2004] Peter Struss and Chris Price. Model-based systems in the automotive industry. AI Magazine, 24(4):17-34, 2004.
[Williams and Nayak, 1996] B. C. Williams and P. P. Nayak. A model-based approach to reactive self-configuring systems. In Proc. 14th National Conf. on Artificial Intelligence, 1996.

Incremental Indexing of Temporal Observations in Diagnosis of Active Systems

Gianfranco Lamperti    Marina Zanella
Dipartimento di Elettronica per l'Automazione, Via Branze 38, Brescia, Italy
lamperti@ing.unibs.it    zanella@ing.unibs.it

Abstract

Observations play a major role in diagnosis of discrete-event systems (DESs). At a high level of abstraction, as in the active system approach, this task takes as input the observable events generated by a DES and their emission order. However, uncertainty conditions, affecting the transmission from the DES to the observer and/or the capabilities of the observer itself, may obscure the (discrete) values of these events and/or their reciprocal order. Thus, in the general case, an uncertain observation can be represented as a directed acyclic graph. For efficiency, diagnostic processing requires generating a surrogate of such a graph, the index space. The scenario becomes more complicated when the observation is perceived as a list of fragments rather than in one shot, because the set of candidate diagnoses is supposed to be generated at the reception of each fragment. This translates to the need for computing a new index space every time. Since the computation from scratch is expensive, an incremental technique is proposed, which is capable of extending the previous index space to produce the new one at the occurrence of each observation fragment.

1 Introduction

Observations are the inputs to several tasks that can be carried out by exploiting model-based reasoning techniques [Brusoni et al., 1998; Baroni et al., 1999; Rozé and Cordier, 2002; Wotawa, 2002; Cordier and Pencolé, 2005; Lamperti and Zanella, 2006]. Temporal observations, being inherent to dynamical systems and processes, are endowed not only with a logical content, describing what has been emitted by the system, but also with a temporal content, describing when it has been emitted. Both (independent) aspects can be modeled either quantitatively or qualitatively. A general model for (qualitative, uncertain) temporal observations was proposed in [Lamperti and Zanella, 2002], and exploited for describing the input of an a posteriori diagnosis task. Such a model consists of a directed acyclic graph where each node contains an uncertain logical content, ranging over a set of qualitative values (labels), and each edge is a temporal precedence relationship, entailing a partial emission order. The observation graph implicitly represents all the possible sequences of labels consistent with the temporal observation received over a time interval, where each sequence is a sentence of a language. In the same contribution it is remarked that, although the observation graph is intuitive and easy to build from the point of view of the observer, for the sake of efficiency of any further processing it is better to represent a language in the standard way regular languages are represented [Hopcroft and Ullman, 1979], that is, by means of a deterministic automaton. This automaton is called the index space, and it is built as the transformation of a nondeterministic automaton drawn from the observation graph. The problem with this construction method arises when the nodes of the observation graph are received and processed one at a time, typically in monitoring-based diagnosis of dynamical systems. The need for producing appropriate diagnostic information at each occurring piece of observation [Lamperti and Zanella, 2004] translates to the need for generating a new index space at each new reception.
However, a naive approach, one that each time makes up the new index space from scratch, would be computationally inadequate. Therefore, this paper proposes a method for the incremental generation of the index space. The new algorithm is expected to benefit not only the active systems approach [Lamperti and Zanella, 2003], within which the notion of an index space was first proposed, but also any other approach dealing with discrete uncertain observations whose observable fragments are received and processed one at a time.

2 Temporal Observations

A history h of a system is a sequence of state transitions, h = ⟨T_1, ..., T_n⟩, that produces a temporal sequence ⟨l_1, ..., l_k⟩, where each l_i, i ∈ [1..k], k ≤ n, is the observable label generated by a visible transition in h. The DES evolution described by h is perceived outside the system as a temporal observation O = ⟨ϕ_1, ..., ϕ_n⟩, which is a sequence of temporal fragments, totally ordered according to the order in which such fragments were received by the observer. O brings some information about the temporal sequence generated by h.

However, such information is uncertain due to synchronization errors affecting the multiplicity of communication channels between the (possibly distributed) system and the observer, and to noise on such channels. Formally, let Λ be a domain of observable labels, including the null label ε (invisible to the observer). A temporal fragment ϕ is a pair (λ, τ), λ being the logical content, λ ⊆ Λ, λ ≠ ∅, λ ≠ {ε}, and τ the temporal content. The logical content represents what has been observed, while the temporal content identifies the set of fragments preceding the current one in the emission order. We assume that the current fragment can be preceded in the emission order only by fragments that have already been received. Therefore, the temporal content of a fragment ϕ_i is a (possibly empty) subset of the fragments preceding ϕ_i in O (i.e., the fragments that were received before ϕ_i), that is, ∀i ∈ [1..n], ϕ_i = (λ_i, τ_i), τ_i ⊆ {ϕ_1, ..., ϕ_{i−1}}. A fragment is uncertain in nature, both logically and temporally. Logical uncertainty means that λ includes the actual (possibly null) label generated by a system transition, but further spurious labels may be involved too. Temporal uncertainty means that only a partial emission ordering is known among fragments. A sub-observation O_[i] of O, i ∈ [0..n], is the (possibly empty) prefix of O up to the i-th fragment, O_[i] = ⟨ϕ_1, ..., ϕ_i⟩.

Example 1. Let Λ = {short, open, ε}, O = ⟨ϕ_1, ϕ_2, ϕ_3, ϕ_4⟩, where ϕ_1 = ({short, ε}, ∅), ϕ_2 = ({open, ε}, {ϕ_1}), ϕ_3 = ({short, open}, {ϕ_2}), ϕ_4 = ({open}, {ϕ_1}). ϕ_1 is logically uncertain (either short or nothing has been emitted by the system). ϕ_2 follows ϕ_1 in the emission order and is logically uncertain (open vs. nothing). ϕ_3 follows ϕ_2 and is logically uncertain (short vs. open). ϕ_4 follows ϕ_1 and is logically certain (open). No temporal relationship is defined between ϕ_4 and ϕ_2 or ϕ_3.

Based on Λ, a temporal observation O = ⟨ϕ_1, ..., ϕ_n⟩ can be represented by a DAG, called an observation graph, γ(O) = (Λ, Ω, E), where Ω = {ω_1, ..., ω_n} is the set of nodes isomorphic to the fragments in O, each node being marked by a nonempty subset of Λ, and E is the set of edges isomorphic to the temporal contents of fragments in O. An emission-order precedence relationship is defined between nodes of the graph; specifically, ω ≺ ω′ means that γ(O) includes a path from ω to ω′, while ω ⪯ ω′ means either ω ≺ ω′ or ω = ω′. We assume that the temporal content of each fragment is minimal, which translates to the canonicity of the observation graph. Specifically, the following condition holds: (ω_j → ω_i ∈ E) ⟹ (∄(ω_k → ω_i) ∈ E, ω_k ≺ ω_j).

Example 2. The observation graph γ(O), relevant to the observation defined in Example 1, is shown in Fig. 1.

Figure 1: Observation graph γ(O).

Note how γ(O) implicitly contains several candidate temporal sequences, each generated by picking up a label from each node of the graph without violating the partial emission-order relationships among nodes. Possible candidates are, among others, ⟨short, open, short, open⟩, ⟨short, open, open⟩, and ⟨short, open⟩.¹ However, we do not know which of the candidates is the actual temporal sequence generated by the system, the other ones being the spurious candidate sequences. Consequently, from the observer (and, therefore, from the diagnosis) viewpoint, all candidate sequences share the same ontological status.

¹ The length of a candidate temporal sequence may be shorter than the number of nodes in the observation graph owing to the immateriality of the null label ε, which is transparent. For instance, candidate ⟨ε, ε, short, open⟩ is in fact ⟨short, open⟩.

3 Indexing Temporal Observations

Both for computational and space reasons, the observation graph is inconvenient for carrying out a task that takes a temporal observation as input.
This claim applies to linear observations as well, each of which is merely a sequence O of observable labels. In this case, it is more appropriate to represent each sub-observation O′ of O as an integer index i corresponding to the length of O′. As such, i is a surrogate of O′. An analogous approach was proposed for graph-based temporal observations in [Lamperti and Zanella, 2000], where the notion of an index was extended so as to perform model-based reasoning on a surrogate of the temporal observation, called an index space. Let γ(O) = (Λ, Ω, E). A prefix P of O is a (possibly empty) subset of Ω where ∀ω ∈ P (∄ω′ ∈ P (ω ≺ ω′)). The formal definition of an index space is supported by two functions on P. The set of consumed nodes up to P is

Cons(P) = {ω | ω ∈ Ω, ω′ ∈ P, ω ⪯ ω′}.

The frontier of P is

Front(P) = {ω | ω ∈ (Ω − Cons(P)), ∀(ω′ → ω) ∈ E (ω′ ∈ Cons(P))}.

Example 3. Considering γ(O) in Fig. 1, with P = {ω_2, ω_4}, we have Cons(P) = {ω_1, ω_2, ω_4} and Front(P) = {ω_3}.

The prefix space of a temporal observation O is the nondeterministic automaton Psp(O) = (S_n, L_n, T_n, S_0^n, S_f^n), where S_n = {P | P is a prefix of O} is the set of states, L_n = {l | l ∈ λ, (λ, τ) ∈ Ω} is the set of labels, S_0^n = ∅ is the initial state, S_f^n = {P | P ∈ S_n, Cons(P) = Ω} is the set of final states, and T_n : S_n × L_n → 2^{S_n} is the transition function such that P →l P′ ∈ T_n iff, defining the operation ⊕ as

P ⊕ ω = (P ∪ {ω}) − {ω′ | ω′ ∈ P, ω′ ≺ ω},   (1)

we have ω ∈ Front(P), ω = (λ, τ), l ∈ λ, P′ = P ⊕ ω.
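For illustration, Cons and Front can be computed directly from the direct-predecessor relation of the observation graph; the following sketch (ours, with invented names) reproduces Example 3:

preds = {1: set(), 2: {1}, 3: {2}, 4: {1}}   # observation graph of Example 1

def ancestors(w):
    """All strict predecessors of node w (transitive closure of preds)."""
    out, stack = set(), list(preds[w])
    while stack:
        u = stack.pop()
        if u not in out:
            out.add(u)
            stack.extend(preds[u])
    return out

def cons(P):
    """Consumed nodes up to prefix P: every node preceding-or-equal to a member of P."""
    return {w for wp in P for w in ancestors(wp) | {wp}}

def front(P):
    """Unconsumed nodes all of whose direct predecessors are consumed."""
    c = cons(P)
    return {w for w in preds if w not in c and preds[w] <= c}

print(cons({2, 4}), front({2, 4}))   # {1, 2, 4} {3}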

Figure 2: Prefix space Psp(O) and index space Isp(O).

The index space of O is the deterministic automaton Isp(O) equivalent to Psp(O). Each state in Isp(O) is an index of O. Each path from the initial state of Isp(O) to a final state is a mode in which we may choose a label in each node of the observation graph γ(O) based on the partial ordering imposed by γ(O) [Lamperti and Zanella, 2002]; that is, each path in the index space is a candidate temporal sequence and, Isp(O) being deterministic, there is only one path for each candidate sequence.

Example 4. Consider γ(O) in Fig. 1. Shown in Fig. 2 are the prefix space Psp(O) (left) and the index space Isp(O) (shaded). Each prefix is written as a string of digits, e.g., 24 stands for P = {ω_2, ω_4}. Final states are double circled. According to the standard algorithm that transforms a nondeterministic automaton into a deterministic one [Hopcroft and Ullman, 1979], each node of Isp(O) is identified by a subset of the nodes of Psp(O). The nodes of Isp(O) have been named I_0 ... I_7. These are the indexes of O.

As for observations, we may define a restriction of the index space up to the i-th fragment as follows. Let Isp(O) = (S, L, T, S_0, S_f) be an index space, where γ(O) = (Λ, Ω, E), Ω = {ω_1, ..., ω_n}. Let S be a node in S. The sub-node S_[i] of S, i ∈ [0..n], is

S_[i] = {∅} if i = 0; otherwise S_[i] = {I | I ∈ S, ∀ω_j ∈ I (j ≤ i)}.   (2)

The sub-index space Isp_[i](O) of O, i ∈ [0..n], is an automaton Isp_[i](O) = (S′, L′, T′, S′_0, S′_f) where

S′ = {S′ | S ∈ S, S′ = S_[i], S′ ≠ ∅},
T′ = {T′ | T ∈ T, T = S_1 →l S_2, S′_1 = S_1[i], S′_1 ≠ ∅, S′_2 = S_2[i], S′_2 ≠ ∅, T′ = S′_1 →l S′_2},
L′ = {l | S′_1 →l S′_2 ∈ T′},
S′_f = {S′ | S′ ∈ S′, ∃I ∈ S′, Cons(I) = {ω_1, ..., ω_i}}.

The formal relationship between sub-observations and sub-index spaces is stated by Theorem 1.

Theorem 1. The sub-index space of an observation equals the index space of the sub-observation:

Isp_[i](O) = Isp(O_[i]).   (3)

Proof (sketch). The proof is supported by three lemmas. Lemma 1.1 is grounded on the definition of Psp(O) and, particularly, on Eq. (1). Lemma 1.2 derives from the definition of the sub-index space. Lemma 1.3 is based on the subset construction algorithm [Aho et al., 1986], which transforms a nondeterministic automaton A_n into a deterministic one. Clos(N_n) denotes the ε-closure of node N_n in A_n; this is the set made up of N_n and all the nodes that are reachable from N_n via ε-transitions in A_n.

Figure 3: Genesis of Isp_[i](O) and Isp(O_[i]).
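The determinization step used throughout ([Hopcroft and Ullman, 1979]; [Aho et al., 1986]) is the standard subset construction; a minimal sketch follows (our own rendering; representing the transition function as a dict keyed by (state, label), with an 'eps' marker for ε, is an implementation choice, not the paper's):

from itertools import chain

def eps_closure(states, delta):
    """Clos(N): N plus all nodes reachable via epsilon-transitions."""
    out, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, 'eps'), ()):
            if t not in out:
                out.add(t)
                stack.append(t)
    return frozenset(out)

def determinize(start, delta, labels):
    """Subset construction: derive a deterministic automaton (e.g., Isp from Psp)."""
    init = eps_closure({start}, delta)
    nodes, trans, todo = {init}, {}, [init]
    while todo:
        node = todo.pop()
        for l in labels:                 # the non-epsilon labels
            step = set(chain.from_iterable(delta.get((s, l), ()) for s in node))
            if not step:
                continue
            nxt = eps_closure(step, delta)
            trans[(node, l)] = nxt
            if nxt not in nodes:
                nodes.add(nxt)
                todo.append(nxt)
    return init, nodes, trans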

Figure 4: Psp(O_[3]), Isp(O_[3]), and Isp_[3](O).

Lemma 1.1. Let P →l P′ be a transition in Psp(O). Let Max(P) denote the most recent fragment of P in O, namely Max(P) = i such that ω_i ∈ P and ∀ω_j ∈ P (j ≤ i). Then, Max(P′) ≥ Max(P).

Lemma 1.2. Let I_i →l I′_i be a transition in Isp_[i](O). Then, I →l I′ is a transition in Isp(O), where I_i = I_[i] and I′_i = I′_[i].

Lemma 1.3. Let I →l I′ be a transition in Isp(O). Then,

∀P′ ∈ I′ (P′ ∈ Clos(P″), P″ ∈ I′, P →l P″ ∈ Psp(O), P ∈ I).   (4)

Theorem 1 can be proven by induction on the nodes of the two automata in Eq. (3). The basis states the equality of the initial states. Let P_0, P′_0, I_0, and I′_0 be the initial states of Psp(O), Psp(O_[i]), Isp_[i](O), and Isp(O_[i]), respectively. We have:

I_0 = {P | P ∈ Clos(P_0), ∀ω_j ∈ P (j ≤ i)}   (5)
I′_0 = {P | P ∈ Clos(P′_0)}.   (6)

Since P_0 = P′_0 = ∅, based on Lemma 1.1, the subset of the nodes within Clos(P_0) in Eq. (5) is in fact Clos(P′_0), thereby making I_0 = I′_0. The induction step is guided by Fig. 3, which shows how Isp_[i](O) and Isp(O_[i]) are generated starting from O. Assume a transition I_i →l I′_i ∈ Isp_[i](O), where I_i is also a node in Isp(O_[i]). We have to show that the same transition is in Isp(O_[i]) too. Consider a P′ ∈ I′_i. According to Lemma 1.2, I →l I′ ∈ Isp(O), where I_i = I_[i] and I′_i = I′_[i]. Based on Eq. (2), P′ ∈ I′. According to Lemma 1.3, P′ is reachable in Psp(O) from a prefix P ∈ I via a path whose first transition is marked by l, all the subsequent transitions being marked by ε. Lemma 1.1 assures that all the prefixes involved in such a path (the root included) are composed of ω_j such that j ≤ i. This means that the same path is also in Psp(O_[i]), as it corresponds to choosing labels from nodes in O that are also in O_[i]. Since, by assumption, I_i is also a node in Isp(O_[i]), the latter will include a transition I_i →l I″_i, where P′ ∈ I″_i. We have to show that I″_i = I′_i. To this end, assume a P″ ∈ I″_i. Owing to Lemma 1.3, P″ is reachable in Psp(O_[i]) from a prefix P ∈ I_i via a path whose first transition is marked by l, all the subsequent transitions being marked by ε. O being a monotonic extension of O_[i], the same path will also be in Psp(O). Since I_i is also a node in Isp_[i](O), and the latter is deterministic, the target node I′_i in the transition I_i →l I′_i will include P″. Thus, I″_i = I′_i, that is, I_i →l I′_i is also in Isp(O_[i]). In a similar way it is possible to prove that, assuming a transition I_i →l I′_i ∈ Isp(O_[i]), where I_i is also a node in Isp_[i](O), the same transition is in Isp_[i](O) too (the proof is left to the reader). This completes the induction step, which establishes the equality of the transition functions. To complete the proof of the theorem, we need to show the equality of the sets of final states. Let S′ and S″ be the sets of states of Isp_[i](O) and Isp(O_[i]), respectively, and S′_f and S″_f the corresponding sets of final states. According to the definition of the sub-index space, we have S′_f = {I | I ∈ S′, ∃P ∈ I, Cons(P) = {ω_1, ..., ω_i}}. Based on the subset construction algorithm, S′_f is in fact the set of final states of Isp(O_[i]) too, that is, S′_f = S″_f. This concludes the proof of Theorem 1.

Example 5. Consider the observation O displayed in Fig. 1 and the relevant index space in Fig. 2. We show that, in compliance with Theorem 1, Isp_[3](O) = Isp(O_[3]). To this end, shown on the left-hand side of Fig. 4 is the prefix space Psp(O_[3]), while the relevant index space Isp(O_[3]) is depicted in the center.
On the right-hand side of the figure is a transformation of the index space Isp(O) outlined in Fig. 2. Specifically, each node S in Isp(O) has been transformed into the sub-node S_[3] by removing some (possibly all) of the indexes, as established by Eq. (2).

For instance, in node I_5, three out of five indexes have been dropped, namely 34, 24, and 4 (which stand for {ω_3, ω_4}, {ω_2, ω_4}, and {ω_4}, respectively), thereby producing the sub-node marked by 2 and 3. Note how the sub-node of I_6 becomes empty after the removal of (the only) index 34. Based on the definition of the sub-index space, empty nodes are not part of the result. This is why I_6 and all entering edges are drawn in dotted lines. A further peculiarity is the occurrence of duplicated sub-nodes, as for example {I_3, I_4, I_7} and {I_1, I_5}. Each set of replicated nodes forms an equivalence class of sub-nodes, which results in fact in a single node in the sub-index space. Thus, {I_3, I_4, I_7} and {I_1, I_5} are collapsed into the nodes marked 3 and 2, 3, respectively. This aggregation causes edges entering and/or exiting nodes in each equivalence class to be redirected to the corresponding sub-node in the result. Performing such arrangements on the graph and removing the dotted part, we obtain in fact the same graph depicted in the center of Fig. 4, namely Isp(O_[3]).

Corollary 1.1. Let O = ⟨ϕ_1, ..., ϕ_n⟩ be a temporal observation. Then, ∀i ∈ [0..n], ∀k ∈ [0..i],

Isp_[i−k](O_[i]) = Isp(O_[i−k]).

4 Incremental Indexing

In case we need to compute the index space of each sub-observation of O = ⟨ϕ_1, ..., ϕ_n⟩, namely Isp(O_[i]), i ∈ [1..n], the point is that it is prohibitive to calculate each new index space from scratch at the occurrence of each fragment ϕ_i, as this implies the construction of the nondeterministic Psp(O_[i]) and its transformation into the deterministic Isp(O_[i]). A better approach is to generate the new index space incrementally, based on the previous index space and the new observation fragment, thus avoiding the generation and transformation of the nondeterministic automaton. This is performed by the algorithm Increment, which generates the new observation graph γ(O_[i]) and relevant index space Isp(O_[i]), based on the previous γ(O_[i−1]) and Isp(O_[i−1]) and the new fragment ϕ_i, as specified in Fig. 6 below. Corollary 1.1 provides the formal basis for stating that Isp(O_[i−1]) is a good starting point for building Isp(O_[i]). In fact, for k = 1, the corollary becomes Isp_[i−1](O_[i]) = Isp(O_[i−1]), which means that there exists an operation (sub-indexing) for obtaining Isp(O_[i−1]) given Isp(O_[i]). What we are looking for is the inverse operation. Our claim (which is not formally proven in the present paper) is that the inverse operation exists and that the Increment algorithm performs it. In so doing, Increment is supported by a data structure, the bud set, and a piece of knowledge, the rule set, denoted B and R, respectively. Each bud in B is a triple (N, P, ω), where N is a node of the index space, P a prefix in N, and ω a node of the observation graph belonging to the frontier of P. A bud indicates that, owing to the new observation fragment, N needs further processing. This means, for instance, that all the candidate sequences of labels up to N are followed by a label belonging to the logical content of the new fragment.

 1. Increment(γ(O_[i−1]), Isp(O_[i−1]), ϕ_i) ⇒ (γ(O_[i]), Isp(O_[i]))
 2. begin
 3.   Generate γ(O_[i]) by means of the new fragment ϕ_i = (λ_i, τ_i);
 4.   Initialize Isp(O_[i]) as a copy of Isp(O_[i−1]);
 5.   B := {(N, P, ω_i) | N ∈ Isp(O_[i]), P ∈ N, ω_i ∈ Front(P)};
 6.   loop
 7.     Pick up a bud B = (N, P, ω), ω = (λ, τ), from the bud set B;
 8.     P′ := P ⊕ ω;
 9.     for each l ∈ λ do
10.       Extend Isp(O_[i]) based on the rule set R defined in Table 1
11.     end for;
12.     Remove bud B from B
13.   while B ≠ ∅;
14.   Yield the final states of Isp(O_[i])
15. end.

Figure 6: Increment algorithm.
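Operationally, the loop of Fig. 6 is a worklist algorithm driven by the bud set. The sketch below is our own illustration of that control structure, not the authors' code: the prefix operation of Eq. (1) is made concrete, while the rule dispatch of line 10 is abstracted into a hypothetical apply_rules that returns whatever new buds its actions create; the graph, isp, and fragment interfaces are likewise invented.

def plus(P, omega, prec):
    """Eq. (1): P (+) omega = (P ∪ {omega}) − {omega' ∈ P | omega' ≺ omega}."""
    return (P | {omega}) - {w for w in P if prec(w, omega)}

def increment(isp, graph, fragment, apply_rules, prec):
    graph.add(fragment)                              # line 3
    new = fragment.node
    buds = [(N, P, new)                              # line 5
            for N in isp.nodes for P in N if new in graph.front(P)]
    while buds:                                      # lines 6-13
        N, P, omega = buds.pop()
        P1 = plus(P, omega, prec)                    # line 8
        for l in omega.labels:                       # line 9 (epsilon included)
            buds.extend(apply_rules(isp, N, P1, l))  # line 10: rules R1-R8
    return isp.final_states(graph)                   # line 14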
Therefore, N has to be extended, possibly by new edges leading to old nodes and/or by new edges leading to new nodes. Once processed, the bud is removed from B. However, processing a bud possibly causes the generation of new buds since, for instance, the candidate sequences of labels up to a newly created node N′, ending with a label of the new fragment, can be followed by labels inherent to fragments received before it. Therefore, N′ has to be extended as well. Each rule R_i in R, i ∈ [1..8], is an association condition-action (Table 1). The conditions are mutually exclusive. They involve the current topology of the index space, the bud B = (N, P, ω) picked up at the beginning of the body of the loop (line 7), the new prefix P′ (computed at line 8), and the label l ∈ λ (line 9), being ω = (λ, τ). If no condition holds, then no operation is performed. For instance, the action of R_1 merges nodes N and N′, as shown in Fig. 5. To do so, all edges entering/leaving N′ are redirected to/from N, while N′ is removed. After the merging, the bud set must be updated. The action of R_8, instead, redirects the edge N →l N′ towards the new node N″ = N′ ∪ {P′}, as shown in Fig. 7, and duplicates the edges leaving N′. This operation too requires updating the bud set. When the loop terminates, the new index space Isp(O_[i]) is topologically complete. Only the final states must be yielded (line 14): these are the nodes that contain a prefix P such that Front(P) = ∅.

Figure 5: Effect of merging N and N′.

Figure 7: Effect of redirecting N →l N′.

Example 6. Suppose that the sub-observation O_[3] of observation O of Example 1 has already been received by the observer, one fragment at a time, and that Increment has correctly generated γ(O_[3]) and Isp(O_[3]).

Figure 8: Tracing of the incremental computation of Isp(O_[4]).

Table 1: Rule set R: each rule R_i, i ∈ [1..8], is an association condition-action guiding the execution of the Increment algorithm.

    Rule | Condition | Action
    R_1 | l = ε; N′ = N ∪ {P′} already exists, N′ ≠ N. | N and N′ are merged; B is updated.
    R_2 | l = ε; N′ = N ∪ {P′} does not exist. | N is extended with P′; B is updated.
    R_3 | l ≠ ε; no edge leaving N is marked by l; N′ = {P′} already exists. | A new edge N →l {P′} is created.
    R_4 | l ≠ ε; no edge leaving N is marked by l; N′ = {P′} does not exist. | N′ = {P′} and N →l N′ are created; B is updated.
    R_5 | l ≠ ε; there exists N →l N′ with no other edge entering N′; N″ = N′ ∪ {P′} already exists, N″ ≠ N′. | N′ and N″ are merged; B is updated.
    R_6 | l ≠ ε; there exists N →l N′ with no other edge entering N′; N″ = N′ ∪ {P′} does not exist. | P′ is inserted into N′; B is updated.
    R_7 | l ≠ ε; there exists N →l N′ and another edge entering N′; N″ = N′ ∪ {P′} already exists, N″ ≠ N′. | N →l N′ is substituted by N →l N″.
    R_8 | l ≠ ε; there exists N →l N′ and another edge entering N′; N″ = N′ ∪ {P′} does not exist. | N →l N′ is redirected towards the new node N″ = N′ ∪ {P′}; B is updated.

Now the fourth and last fragment of O, ϕ_4, is received and Increment has to generate Isp(O_[4]). Shaded on the top-left of Fig. 8 is Isp(O_[4]) at the beginning of the loop (line 6), which equals Isp(O_[3]), depicted in the center of Fig. 4, with some extra information inherent to B drawn by processing ϕ_4. Specifically, each bud (N, P, ω_i) ∈ B is represented by P ↦ i in node N. For example, the bud (N_2, {ω_3}, ω_4) is written in N_2 as 3 ↦ 4. The subsequent graphs in Fig. 8 depict the computational state of Isp(O_[4]) at each new iteration of the loop. According to the initial (shaded) graph, at first B includes eight buds. The bud B chosen at each iteration (line 7) is shaded in the corresponding pictorial representation. The loop is iterated fourteen times:

(1) The bud picked up at the first iteration is (N_3, {ω_3}, ω_4). At line 8, P′ = {ω_3} ⊕ ω_4 = {ω_3, ω_4}. Since λ(ω_4) = {open}, the inner loop at line 9 is iterated only once, for l = open. This corresponds to rule R_4 in Table 1: the new node N_4 is created and linked from N_3 by an edge marked by open, as shown in graph Step 1 (no new bud is created).
(2) B = (N_2, {ω_3}, ω_4), λ = {open}, and P′ = {ω_3, ω_4}. This corresponds to rule R_8: node N_5 is generated (no new bud is created).
(3) B = (N_1, {ω_3}, ω_4), λ = {open}, P′ = {ω_3, ω_4}, rule R_8: node N_6 is generated; moreover, a new bud (N_6, {ω_2}, ω_4) is inserted into B.
(4) B = (N_2, {ω_2}, ω_4), λ = {open}, P′ = {ω_2, ω_4}, rule R_8: node N_7 is generated; moreover, a new bud (N_7, {ω_2, ω_4}, ω_3) is created.
(5) B = (N_7, {ω_2, ω_4}, ω_3), λ = {short, open}, and P′ = {ω_3, ω_4}. For l = short, this corresponds to rule R_3: the edge N_7 → N_4 marked by short is created. For l = open, no operation (no condition is met).
(6) B = (N_6, {ω_2}, ω_4), λ = {open}, P′ = {ω_2, ω_4}, rule R_5: nodes N_5 and N_7 are merged.
(7) B = (N_1, {ω_2}, ω_4), λ = {open}, P′ = {ω_2, ω_4}, rule R_6: node N_6 is extended with index P′, and a new bud (N_6, {ω_2, ω_4}, ω_3) is created.
(8) B = (N_6, {ω_2, ω_4}, ω_3), λ = {short, open}, P′ = {ω_3, ω_4}. For l = short, rule R_8: node N_8 is generated. For l = open, no operation.
(9) B = (N_0, {ω_2}, ω_4), λ = {open}, P′ = {ω_2, ω_4}, rule R_6: node N_2 is extended with index P′, and a new bud (N_2, {ω_2, ω_4}, ω_3) is created.
(10) B = (N_2, {ω_2, ω_4}, ω_3), λ = {short, open}, and P′ = {ω_3, ω_4}. For l = short, rule R_7: the edge N_2 → N_3 marked by short is redirected toward N_8. For l = open, no operation.
(11) B = (N_1, {ω_1}, ω_4), λ = {open}, P′ = {ω_4}, rule R_6: node N_6 is extended with P′, and a bud (N_6, {ω_4}, ω_2) is created.
(12) B = (N_6, {ω_4}, ω_2), λ = {open, ε}, P′ = {ω_2, ω_4}. For l = open, no operation. For l = ε, no operation.
(13) B = (N_0, {ω_1}, ω_4), λ = {open}, P′ = {ω_4}, rule R_6: node N_2 is extended with index P′, and a new bud (N_2, {ω_4}, ω_2) is created.
(14) B = (N_2, {ω_4}, ω_2), λ = {open, ε}, P′ = {ω_2, ω_4}. For l = open, no operation. For l = ε, no operation.

Since B is now empty, the loop terminates. The final states of Isp(O_[4]) are N_4, N_6, N_7, and N_8. Note how the last (shaded) graph in Fig. 8 represents the same automaton as Isp(O) in Fig. 2.

5 Experiments

The Increment algorithm was first coded in Prolog, and experiments based on this prototype were run so as to test the soundness and completeness of the algorithm before formally proving such properties. Further experiments on a successive implementation in C have shown that the algorithm achieves the goal of efficiency too, which is the reason why it has been proposed. The diagram in Fig. 9 represents the time (in seconds) taken to compute the index space of an uncertain temporal observation composed of (up to) 600 fragments. The curve on the top is relevant to the computation of each index space from scratch.

Figure 9: Experimental results: index-space computation time (y-axis) vs. number of observation fragments (x-axis).

The curve on the bottom corresponds to the incremental computation on the same platform.

6 Conclusion

Both the observation graph and the index space are modeling primitives for temporal observations. Whereas the former, which is a DAG, is the front-end representation, suitable for modeling an observation while it is being received over a time interval, the latter, which is a deterministic automaton, is a back-end representation, suitable for model-based problem solving. In fact, if the notion of an index space were not adopted for problem solving, it would be necessary to compute all the sentences of the language defined by the observation and then to perform model-based reasoning on all of them. Moreover, the notion of an index space brings the advantage of adopting for observations the same formalism traditionally exploited for component models of DESs, be they synchronous or asynchronous. Actually, in the literature, each reasoning step performed on the behavior of DESs translates to the composition of two or more automata. Now that the awareness has grown that DES observations can be represented as automata themselves (see, for instance, [Grastien et al., 2005]), the (only) operation that is needed for carrying out several model-based tasks is the synchronization between automata, where observations are handled exactly the same way as the other models. Finally, the index space, since it adheres to the standard formal representation of regular languages, could be adopted as an interchange format for uncertain observations among distinct application contexts. This paper has presented an algorithm for constructing the index space incrementally, while receiving observation fragments one at a time. The tests performed so far have shown that the proposed technique brings a significant reduction of the computation time whenever a (nonmonotonic) processing step has to be performed after each observation fragment is received, as is the case when the tasks of supervision and dynamic diagnosis (and state estimation in general) are considered. It is likely that other approaches to model-based reasoning on DESs can take advantage of this result, since the algorithm proposed in this paper relies on the model of a DES observation, which is, to a large extent, independent of the model adopted for the DES itself. The research still needs a computational analysis, to be compared with the experimental results.

References

[Aho et al., 1986] A. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA, 1986.
[Baroni et al., 1999] P. Baroni, G. Lamperti, P. Pogliano, and M. Zanella. Diagnosis of large active systems. Artificial Intelligence, 110(1), 1999.
[Brusoni et al., 1998] V. Brusoni, L. Console, P. Terenziani, and D. Theseider Dupré. A spectrum of definitions for temporal model-based diagnosis. Artificial Intelligence, 102(1):39-80, 1998.
[Cordier and Pencolé, 2005] M. O. Cordier and Y. Pencolé. A formal framework for the decentralized diagnosis of large scale discrete event systems and its application to telecommunication networks. Artificial Intelligence, 164, 2005.
[Grastien et al., 2005] A. Grastien, M. O. Cordier, and C. Largouët. Incremental diagnosis of discrete-event systems. In Sixteenth International Workshop on Principles of Diagnosis DX-05, Monterey, CA, 2005.
[Hopcroft and Ullman, 1979] J. E. Hopcroft and J. D. Ullman.
Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, MA, 1979.
[Lamperti and Zanella, 2000] G. Lamperti and M. Zanella. Uncertain temporal observations in diagnosis. In Fourteenth European Conference on Artificial Intelligence ECAI 2000, Berlin, 2000.
[Lamperti and Zanella, 2002] G. Lamperti and M. Zanella. Diagnosis of discrete-event systems from uncertain temporal observations. Artificial Intelligence, 137(1-2):91-163, 2002.
[Lamperti and Zanella, 2003] G. Lamperti and M. Zanella. Diagnosis of Active Systems: Principles and Techniques, volume 741 of The Kluwer International Series in Engineering and Computer Science. Kluwer Academic Publishers, Dordrecht, NL, 2003.
[Lamperti and Zanella, 2004] G. Lamperti and M. Zanella. A bridged diagnostic method for the monitoring of polymorphic discrete-event systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 34(5), 2004.
[Lamperti and Zanella, 2006] G. Lamperti and M. Zanella. Flexible diagnosis of discrete-event systems by similarity-based reasoning techniques. Artificial Intelligence, 170(3), 2006.
[Rozé and Cordier, 2002] L. Rozé and M. O. Cordier. Diagnosing discrete-event systems: extending the diagnoser approach to deal with telecommunication networks. Journal of Discrete Event Dynamic Systems: Theory and Application, 12:43-81, 2002.
[Wotawa, 2002] F. Wotawa. On the relationship between model-based debugging and program slicing. Artificial Intelligence, 135(1-2), 2002.

Introducing Data Reduction Techniques into Reason Maintenance

Rüdiger Lunde
University of Applied Sciences Ulm
Prittwitzstrasse 10, Ulm (Germany)

Abstract

Every problem which can be solved with a reason maintenance system can, in theory at least, be solved as well without it, using a simple generate-and-test algorithm. Therefore, its main purpose is to increase the efficiency of a reasoning system. In this paper, we analyze the impact of problem characteristics on this performance. For problems which include reasoning about physical models on a quantitative level, the complete dependency network maintained by reason maintenance systems is identified as a major resource consumer. Based on that analysis, a new component called the value manager is presented, which applies data reduction techniques to limit dependency management costs and is especially designed to support iterative solvers. As two examples of practical applications, experimental results from an automated FMEA generation run for an automotive system and a diagnosis of a fly-by-wire system are discussed.

1 Introduction

A reason maintenance system (RMS) is a book-keeping tool supporting a problem solver. It tracks dependencies between given and derived data during the inference process and contributes to the search for solutions in two ways. Firstly, it analyzes the causes of failures and thus helps to avoid useless search in subspaces without solutions. Secondly, it caches inferences and prevents the problem solver from redrawing the same inferences again and again. An RMS views data as propositional symbols and relationships between the data as propositional clauses. This view is independent of the actual meaning of the data for the problem solver, which can be first or higher order, for example. Given sets of assumptions, data, and clauses, the RMS derives the belief status of every datum based on the current belief status of the assumptions. An RMS works incrementally. After the belief in some of the assumptions changes or a new clause is added, the belief status of all affected data is updated without starting from scratch. A contradiction is discovered if the belief status of a special datum denoting falsity changes from disbelieved to believed. Contradictions are immediately signaled to the problem solver, and information about the responsible assumptions is attached to the signal. The use of reason maintenance systems has a long tradition in AI. Starting with the introduction of the non-monotonic JTMS [Doyle, 1979], a wide range of different reason maintenance systems has been developed (e.g., monotonic JTMS, ATMS, LTMS). They differ in the accepted types of clauses, their support w.r.t. contradiction resolution, and the way the belief status is represented in labels associated with data. The assumption-based truth maintenance system (ATMS) [de Kleer, 1986a; de Kleer, 1986b; de Kleer, 1986c], as part of the general diagnostic engine (GDE) [de Kleer and Williams, 1987], has attracted much attention in the field of model-based reasoning. It accepts Horn clauses and uses a special indexing scheme to store the sets of assumptions a datum ultimately depends on. It supports parallel search in different contexts and is especially suited for problems with many solutions, if all or several of them are of interest. This is usually the case in explanatory problems, where we want to know all or at least the most plausible causes for observed or assumed effects.
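For orientation, the core of the ATMS-style label update can be sketched in a few lines (our own illustrative reconstruction, not code from the paper; names are invented). A label is a set of minimal environments, i.e., assumption sets under which a datum holds; a Horn justification combines one environment from each antecedent:

from itertools import product

def minimize(envs):
    """Keep only the set-minimal environments."""
    return {e for e in envs if not any(o < e for o in envs)}

def update_label(antecedent_labels, target_label, nogoods):
    """One label update for a Horn justification a1, ..., ak -> n.

    Labels and nogoods are sets of frozensets of assumptions."""
    candidates = {frozenset().union(*combo)
                  for combo in product(*antecedent_labels)}
    # discard environments subsumed by a known inconsistent environment
    candidates = {e for e in candidates
                  if not any(ng <= e for ng in nogoods)}
    return minimize(target_label | candidates)

# assumptions A, B, C; a datum justified by two premises with these labels:
lab1 = {frozenset({'A'})}
lab2 = {frozenset({'B'}), frozenset({'C'})}
print(update_label([lab1, lab2], set(), {frozenset({'A', 'C'})}))
# -> {frozenset({'A', 'B'})}   ({A, C} is a nogood, so that union is dropped)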
It is well known that the worst-case complexity of an ATMS is exponential in the number of assumptions. Reasoning systems utilizing an ATMS spend a considerable amount of time on label updates. Several extensions have been developed to improve the efficiency of the reasoning system as a whole. Forbus and de Kleer [Forbus and de Kleer, 1988] proposed a consumer control mechanism to focus inference selection on interesting contexts. Additionally, Dressler and Farquhar [Dressler and Farquhar, 1991] showed how the label update can be restricted to interesting contexts. Although label completeness is lost, completeness with respect to the current focus can still be guaranteed. Lazy label evaluation [Kelleher and van der Gaag, 1993] delays the label update until the problem solver shows interest in it. By combining focusing and lazy label evaluation, some synergy effects can be obtained [Tatar, 1994]. Reason maintenance systems have proven to be efficient tools to support reasoning about qualitative abstractions of real systems. However, when choosing lower levels of abstraction, two negative impacts on the efficiency of the reasoning system can be observed:
- The dependency management costs grow dramatically due to the increasing absolute number of assumptions and inferences needed for behavior prediction.

- The benefit achieved by the services of a reason maintenance system decreases because of the decreasing degree of similarity between different contexts, and because of the gap between the true meaning of the derived data and the reason maintenance system's propositional view of it.

Both observations give reason to look for more efficient alternatives.

2 Reason Maintenance Systems seen from the Perspective of Machine Learning

From the point of view of machine learning, an RMS plays the role of a learning module. It tries to find out as much as possible about the currently solved problem by collecting and analyzing the inferences drawn so far. In all reason maintenance systems, memorizing plays a major role. All inferences drawn so far are stored together with their results in a dependency network. No attempt is made to generalize the given information or to remove unimportant nodes. The main problem with this learning method is that memory space grows monotonically, and the management overhead with it. So there is a tradeoff between the gains resulting from learned information and the overhead of maintaining the knowledge base. In large search problems, the efficiency of the overall performance of an RMS-based reasoning system typically increases first, then reaches a maximum, and decreases steadily afterwards. It is obvious that maintaining a cache for all inferences ever performed is inefficient for iterative solvers. To reduce memory consumption for intermediate results, the learning module should abstract from dependencies on the inference layer and focus on the relationship between the assumptions defining the interesting contexts and the final results obtained from them. The ATMS already incorporates an efficient method to compile dependency information given on the inference layer into an explicit representation of the relationship between assumptions and derived data. For each datum, it maintains a label which represents the set of all currently known consistent combinations of assumptions supporting the derivation of the datum. In contrast to the memorizing method used for low-level dependency tracking, the learning method used here incorporates a generalization step. The combinations of supporting assumptions are not enumerated directly but characterized by a lower bound, which is updated after each inference step for all affected data. The label representation can be viewed as a conceptual description of all consistent contexts to which the corresponding datum belongs, and the incremental label update as a special kind of inductive concept learning. Since the label update algorithms do not necessarily require the availability of a complete dependency network, a new class of dependency trackers can be defined which is based only on the second learning method. In the following, we discuss one special instance of this class.

3 The Value Manager

The value manager is a dependency tracking tool which is especially designed to support efficient reasoning about systems on a quantitative physical level. It features fast inference selection and label update during behavior prediction, and negligible management overhead for intermediate results. The value manager is based on Horn logic and also shares the label computation algorithms with an ATMS, but it does not maintain a dependency network. Instead, it incorporates some new functionality:
- It forgets. By applying data reduction methods, resource consumption with respect to space and computation time is significantly reduced.
- It focuses on one context at a time.
- It distinguishes between short-term and long-term memory management. Separate data buffers give fast access to context-specific data and use a common knowledge base to exchange learned inference results.

The main functional difference w.r.t. a focused ATMS is that the value manager actively controls resource consumption by forgetting unimportant data. The task of selecting data for removal from storage is seen here as a technical resource optimization task like garbage collection, rather than as a mission-critical strategic task. Therefore, the control of data reduction is assigned to the value manager and not to the problem solver. This design decision has significant consequences.

3.1 Consequences on the View of Data

A propositional view of data, which is common to all reason maintenance systems, is not sufficient for the value manager. If the value manager shall select data for removal, it must have the necessary knowledge to estimate the importance of data for further reasoning. This includes knowledge about the true meaning of data, as well as methods to detect and remove redundancy. Since this knowledge cannot be formalized without assumptions about the inner structure of a datum, confinement to a certain kind of data is the price we have to pay for the extended functionality. In our application focus, state equations are the atomic pieces of knowledge the problem solver reasons about. State equations are pairs ⟨v, d⟩ containing a model variable and a (possibly infinite) set of values which is a subset of the corresponding variable domain. They represent propositions about the possible values of a quantity of the physical system under analysis in a certain context. Every state equation with an empty value set denotes falsity. For effective data reduction, the value manager needs at least to be able to compare value sets for the same variable with respect to set inclusion. To reduce the communication overhead, the value manager should also be able to intersect those value sets. While the rigidity of information hiding between problem solver and value manager is not as strict as in the classical approach, there is still some abstraction in the value manager's view of the inference process. For instance, the true meaning of assumptions, which are used to define the contexts of interest for the problem solver, is completely hidden from the value manager. Within a diagnostic problem, different contexts may represent system states of different candidates, as well as state snapshots of dynamic systems at different points in time, or a combination of both.
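A state equation and the inclusion and intersection operations the value manager relies on can be pictured as follows (our own sketch using a single closed interval as value set; the class and method names are invented):

from dataclasses import dataclass

@dataclass(frozen=True)
class StateEquation:
    """A pair <v, d>: model variable v and a value set d (here one interval)."""
    var: str
    lo: float
    hi: float

    def is_false(self) -> bool:          # an empty value set denotes falsity
        return self.lo > self.hi

    def subsumes(self, other) -> bool:   # set inclusion: other's values ⊆ ours
        return self.var == other.var and self.lo <= other.lo and other.hi <= self.hi

    def intersect(self, other):          # most restricted combination of two data
        assert self.var == other.var
        return StateEquation(self.var, max(self.lo, other.lo), min(self.hi, other.hi))

e1 = StateEquation("y", 3.0, 10.0)       # y in [3, 10]
e2 = StateEquation("y", 5.0, 20.0)       # y in [5, 20]
print(e1.intersect(e2))                  # StateEquation(var='y', lo=5.0, hi=10.0)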

3.2 Consequences on Label Completeness

Whenever a dependency tracking unit decides to remove a datum, it is possible that later on another justification may be found by the problem solver for the same datum. Together with the removed datum, the knowledge about its successors in the dependency network is lost. Consequently, when a previously removed datum gets a new justification, label completeness of its successors is lost, at least as long as the consequences have not been redrawn. Different strategies can be imagined to handle gaps in a dependency network. Their efficiency will strongly depend on the average size and absolute number of gaps. Our data reduction strategies aim at keeping memory consumption constant during iterative pruning loops and will therefore in general lead to rather large gaps. In that case, the usefulness of the dependency network itself can be doubted. Consequently, the value manager does not maintain such a data structure at all. Instead, it merely maintains compiled dependency information with respect to context-defining assumptions in the labels associated with the data. This reduces the overhead to zero when removing redundant or unimportant data. The resources additionally required for redrawing inferences depend on the desired degree of completeness. Completeness with respect to the original problem is not a realistic goal when dealing with numeric constraint networks, as is illustrated in Example 1 below. If we are satisfied with at least one (possibly not minimal) justification per datum, no redrawing is necessary at all during the investigation of a certain context. A slightly more liberal strategy suppresses the redrawing of consequences only for data which is classified as unimportant by the value manager. Independent of the chosen strategy, completeness of labels cannot be guaranteed any more at any time, not even with respect to the current focus.

Example 1 Let CN = ⟨{x, y}, R², {c_1, c_2, c_3, c_4, c_5}⟩ be a constraint network which contains the following constraints:

    c_1 : y = x + 1        c_4 : y < 10
    c_2 : y = 2x           c_5 : y <
    c_3 : y > 3

Let further, for all i ∈ {1, ..., 5}, A_i denote the assumption that c_i holds. The first two constraints obviously have only one solution: x = 1 and y = 2. Therefore, the first three constraints cannot hold at the same time, and consequently {A_1, A_2, A_3} is a conflict. But this conflict cannot be found by local propagation, because the variable domains are not bounded. If we perform domain reduction based on the first three constraints, the lower bounds of the variable domains are pushed up steadily. After more than 2000 constraint evaluations, starting with c_3 and then iterating over c_1 and c_2, it becomes clear that no solution exists within the range of double-precision machine numbers. But even this result does not guarantee that there is no solution at all. A fixed point is never reached. Domain splitting does not help to identify the conflict either. In spite of the high number of potential propagation steps which can be performed to compute reduced domains using different parts of the constraint network, the datum ⟨y, {}⟩ denoting falsity can be obtained in less than 10 reduction steps by simply focusing the propagation on the smallest value sets known for each variable. Dependent on the order in which the constraints are evaluated, the corresponding label will be {A_1, A_2, A_3, A_4} or {A_1, A_2, A_3, A_4, A_5}, respectively.
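The contrast between the two propagation regimes of Example 1, and the label {A_1, A_2, A_3, A_4} obtained by the focused one, are easy to reproduce numerically (our own sketch, not code from the paper):

import sys

lo_y, steps = 3.0, 0                     # c3: y > 3 seeds the lower bound of y
while lo_y <= sys.float_info.max:        # unfocused: iterate c1 and c2 only
    lo_x = lo_y - 1                      # c1: x = y - 1
    lo_y = max(lo_y, 2 * lo_x)           # c2: y = 2x
    steps += 2
print("constraint evaluations before float overflow:", steps)   # > 2000

# focused on the smallest value sets: propagate upper bounds (c4) as well
x_lo, x_hi = -float("inf"), float("inf")
y_lo, y_hi = 3.0, 10.0                   # c3 and c4
for _ in range(10):                      # falsity is reached in well under 10 steps
    x_lo, x_hi = max(x_lo, y_lo - 1), min(x_hi, y_hi - 1)   # c1
    y_lo, y_hi = max(y_lo, 2 * x_lo), min(y_hi, 2 * x_hi)   # c2
    if y_lo >= y_hi:                     # empty value set <y, {}> (bounds strict)
        print("conflict found with label {A1, A2, A3, A4}")
        break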
Obviously, both are not minimal conflicts with respect to the original constraint problem.

3.3 Focusing and Data Reduction Strategies

For efficient data management, focusing and data reduction techniques have to cooperate in a productive way. While the first technique tries to avoid the generation of unnecessary or unimportant data, the second aims at getting rid of the ballast accumulating during the reasoning process. Both strategies must be based on the same criteria of importance with respect to the task at hand. To apply data reduction without focusing does not make sense: in general, it is more efficient to avoid the production of useless data in the first place than to remove it afterwards. So data reduction produces a real benefit only if applied to data which could not be avoided by focusing. As described in Section 1, the current state of the art in ATMS technologies includes focusing techniques which restrict inference drawing to consequences of interesting combinations of assumptions. Especially when reasoning about numeric values, focusing on contexts of interest is necessary for efficiency, but not sufficient. In Example 1 we have seen that focusing inferences within a single context is required as well. Publications in the area of model-based reasoning which address the subject of focusing within a single context are rare. However, an equivalent to the idea of focusing inferences on the most restricted values can be found in [Goldstone, 1992]. In this paper, a diagnostic system called Skordos is described. The concepts discussed address the difficulty of managing the tremendous number of possible predictions in quantitative value-set propagation algorithms. Intervals of continuous domains are represented by inequalities like x ≤ 7. A process called hibernation delays the propagation of some inequalities until they become important for diagnostic reasoning. The importance of data is measured by its usefulness for finding new conflicts within the focused contexts.¹ The proposed strategy starts with the computation of consequences only for those inequalities which are the mathematically strongest for a certain variable in at least one focused context. All other consequences are delayed until conflicts have been found. The more carefully the inference process is controlled within the focused contexts, the higher is the overhead for inference selection. For instance, to decide whether to compute consequences for a newly derived datum, in [Goldstone, 1992] an algorithm is used which determines for each focused context the corresponding mathematically strongest inequalities. The worst-case complexity of that algorithm is quadratic in the number of focused contexts. To focus inference drawing on those steps whose results are valid in at least one focused context, additional checks on the set of antecedent nodes are necessary.

¹ In fact, [Goldstone, 1992] defines hibernation directly on diagnostic candidates, but replacing the set of candidates by an arbitrary set of focused contexts is a straightforward generalization.

While determining whether a set of data holds in at least one focused context can be organized very efficiently (in [Tatar, 1994] a 2vATMS is described which checks whether a set of data holds within one of the given focused contexts in linear time with respect to set size), the costs of testing mathematical strongness depend on the kind of data used. If state equations with interval sets as values are used, the costs are in the same order of magnitude as the costs of the inference step itself.

To avoid these extra costs, the value manager restricts reasoning to a single focused context at a time. Of course, this increases the number of necessary focus adjustments. Instead of investigating a set of candidates simultaneously, a diagnostic problem solver using a value manager is forced to investigate one candidate after another, state by state. Between every two investigations, the focused context has to be changed. The main reason why this strategy is more efficient for the value manager is the fact that the value manager forgets. As can be seen in Section 4, in relevant applications the amount of data produced during the investigation of a context is by orders of magnitude larger than the amount of data the value manager remembers after the investigation. When changing the focused environment, a search for the mathematically strongest data for the new focus is still necessary. However, the amount of data to be checked is strongly reduced, since no comparisons are necessary between intermediate results. A second reason is the explosion of focused contexts, which is caused by special assumptions needed for disjunction encoding (see Section 3.5). Without further focusing within the set of all interesting contexts, the selection of useful inference steps becomes extremely expensive.

The strategy of focusing on the mathematically strongest data defines the importance of data with respect to the currently investigated contexts. For the development of efficient data reduction strategies, this criterion is helpful, but we have to take into account another aspect of importance: the relevance of a datum for further context investigations. The value manager provides a framework in which memory management is divided into short-term and long-term management. While short-term memory management is based on the former aspect of importance, long-term memory management is based on the latter. Since the value manager does not maintain a dependency network, it can effectively separate the knowledge about the currently investigated context from the knowledge about previously investigated ones. We call the memory for context-specific knowledge the context buffer and the memory for learned knowledge about other contexts the value database. Each memory maintains a set of conflicts, a set of value manager nodes (each comprising a datum, an ATMS-like label and possibly some other administrative information) and also means to look up nodes for a given variable efficiently (see Figure 1).

Figure 1: The value manager

Short-term memory management is performed during the investigation of a context. Whenever a new datum together with at least one justification is added to the context buffer, the data reduction strategy decides whether to store it, to forget it, to combine it with one of the currently maintained nodes for the same variable, or to replace some of these nodes by it. This step includes value set manipulations as well as ATMS-like label computations.
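A simplified Java rendering of this decision, under our own naming (the actual RODON classes are not public) and with intervals standing in for general value sets, might look as follows.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** One value manager node: a variable, its current value set (an interval here)
 *  and an ATMS-like label (a set of environments, i.e. sets of assumption names). */
class Node {
    final String var;
    double lo, hi;
    Set<Set<String>> label = new HashSet<>();
    Node(String var, double lo, double hi) { this.var = var; this.lo = lo; this.hi = hi; }
}

class ContextBuffer {
    private final Map<String, Node> byVar = new HashMap<>();

    /** Short-term data reduction: store, forget, or combine a freshly derived datum.
     *  Returns true if the knowledge about the variable actually changed. */
    boolean add(Node fresh) {
        Node old = byVar.get(fresh.var);
        if (old == null) {                               // store: first datum for this variable
            byVar.put(fresh.var, fresh);
            return true;
        }
        double lo = Math.max(old.lo, fresh.lo);
        double hi = Math.min(old.hi, fresh.hi);
        if (lo == old.lo && hi == old.hi) return false;  // forget: no real restriction
        old.lo = lo;                                     // combine: keep only the most
        old.hi = hi;                                     // restricted value set
        old.label.addAll(fresh.label);                   // label computation, much simplified
        return true;
    }
}
```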
Several strategies have been tested during the development of our reference implementation. For models with a high proportion of quantitative relations, by far the most efficient one turned out to be a strategy which computes intersections whenever possible and, for each variable, only keeps the node with the most restricted value set. (For debugging purposes, it is useful to maintain more than one node in case of conflicting data.) The main advantages of this strategy are the fast convergence and the limited memory consumption, which remains constant during propagation. While the value manager is open for strategies which also keep other than the most restricted value sets in memory, it expects all strategies to compute the most restricted value set and provides a special interface method to access the corresponding value manager node.

Long-term memory management is performed during context changes. Changing the focused environment of a context buffer includes knowledge transfer between the buffer and the value database in both directions. First, nodes which have been added to the context buffer after the last context change are selected for saving in the long-term memory. Adding clones of those nodes to the value database can include some reorganization, for example removing other nodes which are not necessarily needed any more and are unlikely to be useful in the future.

Then, the environments in the labels maintained by the context buffer are restricted to subsets of the new focused environment. Nodes with an empty label are removed from the buffer. Finally, nodes from the database which are valid in the new focused context are cloned and added to the buffer. This addition adjusts the labels of the nodes to subsets of the new focused environment and makes use of short-term memory management, which can include intersection computations.

The selection strategy for data to be stored in the value database, as well as the memory reorganization strategy, are not fixed within the value manager. For the experiments in Section 4, a very simple strategy was used. It adds all final context investigation results to the value database without any reduction on the data level.

3.4 Supporting Inference Selection
Besides the strategic conflict information to direct the problem solver's search, the value manager also provides tactical support on the inference selection level. After adding a newly derived state equation together with the corresponding Horn justification to a context buffer, the need for computing its consequences depends on whether the addition changed the available knowledge about that variable or not. Since the data reduction strategies of the value manager compare newly added data with existing data for the same variable, the value manager can provide information about relevant value set reductions without extra costs. For that purpose, each context buffer maintains a list called open nodes. When adding new data to a context buffer, all modified and all newly created value manager nodes are added to that list. The problem solver can access that list whenever convenient to select new tasks for the agenda. After each access, the list is automatically cleared.

The list of open nodes is also modified by the context buffer when changing the focused context. By tracking relevant changes during the update of the set of all maintained value manager nodes, all nodes which were affected by the context change can be identified. Local propagation benefits from that information, because the number of tasks initially put on the agenda can be reduced.

It should be emphasized that the open node list mechanism strongly differs from the consumer mechanism used in classical ATMS approaches (see [de Kleer, 1986c]). Both mechanisms have the same goal, namely to avoid unnecessary inferences, but the means are quite different. The consumer mechanism allows the problem solver to attach markers (so-called consumers) to RMS nodes which indicate the consequences that should be computed for the node (and also contain the code to perform the necessary inferences). Propagation is performed by selecting one of those markers, removing it from the corresponding RMS node and performing the corresponding inferences. Advantages of this mechanism are that no inference needs to be drawn twice, even after context changes, and that the responsibility for completeness of the inference control is completely assigned to one component (the RMS). On the other hand, considerable management overhead is generated for inferences with more than one antecedent node. The more the inference process is focused within a focused context, the more useless consumers are created but never removed. For dependency trackers which apply data reduction, the consumer mechanism is even less suited, since removing a node with an attached consumer may affect completeness.
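The contract of the open node list is small enough to sketch directly; the following Java fragment is our own illustration (names are ours, not RODON's): the buffer records changed nodes, the solver drains the list, and the list is cleared on each access.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

/** Sketch of the open-node mechanism described above. */
class OpenNodeList<N> {
    private final LinkedHashSet<N> open = new LinkedHashSet<>();

    /** Called by the buffer whenever a node is added, modified, or affected by a context change. */
    void markChanged(N node) { open.add(node); }

    /** Called by the problem solver whenever convenient; the list is cleared afterwards. */
    List<N> drain() {
        List<N> out = new ArrayList<>(open);
        open.clear();
        return out;
    }
}
```

Note the contrast to consumers: nothing is attached to the nodes themselves, so removing a node during data reduction costs nothing and cannot leave dangling inference obligations behind.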
3.5 Disjunction Handling
To solve cyclic dependencies in physical systems, inference methods which go beyond local propagation are needed. Current approaches combine different techniques such as local propagation, domain splitting and network decomposition. Since the results of domain splitting steps do not have Horn justifications, the value manager has to be extended to support a branch&prune solver as described in [Lunde, 2005].

We first focus on the logical problem of computing sound labels. Let L(eq) denote the label of the state equation eq. Labels are sets of sets of assumptions and have the same meaning as within an ATMS. The logical relationship between two split equations ⟨x, d1⟩ and ⟨x, d2⟩ resulting from splitting the value set of a third equation ⟨x, d⟩ can be expressed by means of two split assumptions A1 and A2. For both ⟨x, di⟩, we define

    L(⟨x, di⟩) = { e ∪ {Ai} | e ∈ L(⟨x, d⟩) }.

Since both split assumptions are related by disjunction, we can now define split assumption elimination based on hyperresolution. Let ⟨y, d1⟩ be a consequence depending only on the first split assumption A1 and ⟨y, d2⟩ a consequence depending only on A2. A sound, minimal, and consistent label for ⟨y, d1 ∪ d2⟩ is obtained by removing supersets of contained environments and known conflicts from the following label:

    L = { e | e ∈ L(⟨y, d1⟩) ∧ A1 ∉ e }
      ∪ { e | e ∈ L(⟨y, d2⟩) ∧ A2 ∉ e }
      ∪ { e | ∃ e1 ∈ L(⟨y, d1⟩), e2 ∈ L(⟨y, d2⟩) : e = (e1 \ {A1}) ∪ (e2 \ {A2}) }

Since falsity can be expressed by arbitrary state equations which contain an empty value set, this specification also covers conflict handling.
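The set operations behind this formula are straightforward; the following Java sketch is our own rendering of it, with the final minimisation step (removing supersets and known conflicts) omitted for brevity.

```java
import java.util.HashSet;
import java.util.Set;

/** Split-assumption elimination by hyperresolution over ATMS-style labels.
 *  A label is a set of environments; an environment is a set of assumption names. */
public class SplitElimination {
    static Set<Set<String>> eliminate(Set<Set<String>> label1, String a1,
                                      Set<Set<String>> label2, String a2) {
        Set<Set<String>> result = new HashSet<>();
        // Environments that do not depend on the respective split assumption survive as-is.
        for (Set<String> e : label1) if (!e.contains(a1)) result.add(e);
        for (Set<String> e : label2) if (!e.contains(a2)) result.add(e);
        // Hyperresolution: combine one environment from each branch, dropping A1 and A2.
        for (Set<String> e1 : label1)
            for (Set<String> e2 : label2) {
                Set<String> e = new HashSet<>(e1);
                e.remove(a1);
                Set<String> rest = new HashSet<>(e2);
                rest.remove(a2);
                e.addAll(rest);
                result.add(e);
            }
        return result;
    }

    public static void main(String[] args) {
        Set<Set<String>> l1 = Set.of(Set.of("A1", "B"));
        Set<Set<String>> l2 = Set.of(Set.of("A2", "C"));
        // {A1,B} and {A2,C} resolve to {B,C}: both split branches support the datum.
        System.out.println(eliminate(l1, "A1", l2, "A2"));
    }
}
```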

The next question is how to control hyperresolution within the value manager. The branch&prune algorithm investigates, in each recursion level, the consequences of the split equations ⟨x, d1⟩ and ⟨x, d2⟩ in sequence. Therefore, the problem solver could easily navigate the context buffer through both corresponding extended contexts (each defined by extending the original focused environment by one of the split assumptions). Following this idea, hyperresolution inferences could be realized as an extension of long-term memory management. Unfortunately, this usage of the value manager dramatically increases the number of context changes, and reduces the effectiveness of long-term memory management, since intermediate results now find their way into the value database. The resulting system will spend quite a large amount of time on context changes.

Therefore, the context buffer is extended instead. This extension supports domain splitting within the context buffer and eliminates the need to communicate with the value database until the original context is completely investigated or the problem solver loses interest in it. The chosen solution exploits the depth-first control strategy used by the branch&prune algorithm. Instead of maintaining just one set of value manager nodes and one set of nogoods, the extended context buffer maintains a tree called the context tree, which is composed of context tree nodes. This tree reflects the hierarchical structure of context extensions generated by domain splitting operations. Each context tree node comprises a focused environment, a set of value manager nodes, a set of nogoods, and optionally a split assumption and a split equation. The root node is initialized and marked as the current node when changing the context. Child nodes are added to the current tree node whenever split operations are performed. The problem solver gets means to navigate to certain tree nodes and to evaluate their children. Evaluation is based on hyperresolution as described above and includes modification of the content of the current tree node and removal of the evaluated children.

Compared to an extension for general disjunction handling as suggested in [de Kleer, 1986b], the expressive power of the sketched functionality is rather limited. Nevertheless, it is very efficient, because no search is necessary to apply hyperresolution, and because it supports removal of assumptions and data which are not needed any more, without additional costs.

4 Experimental Results
A Java implementation of the presented concepts has been integrated into the commercial model-based engineering tool RODON (see [Lunde et al., 2006]) and tested in various experiments. The results of three of them are summarized in the following. All measurements have been performed on a standard PC with 2.2 GHz and 512 MB RAM.

4.1 Automated FMEA Generation for an Automotive System
The analyzed system of the first experiment comprises the electrical equipment of the right door of a current car series. The most important components are the electronic control unit, the exterior mirror assembly, the door lock assembly, the window pane control motor, the switch assembly and some bulbs. The corresponding Rodelica model (Rodelica is a dialect of Modelica, a standardized object-oriented language to describe physical systems in a component-oriented and declarative way, see …; Rodelica differs from Modelica in some details, since it uses constraints instead of differential algebraic equations to describe component behavior) is currently used in a commercial project by a major German car manufacturer to generate decision trees for workshop diagnosis. It comprises 167 subsystems and 580 atomic components, and covers more than 60 fault codes within the electronic control unit. The constraint network is composed of 8863 variables and 7338 constraints.

In this experiment, we focus on fault effect prediction for the operational state "window pane manually up". This task includes 288 state investigations; in 265 states, domain splitting is activated, which leads to 1290 investigations of extended contexts. Table 1 summarizes the obtained reduction with respect to needed inference steps and computation time when using the value manager. The efficiency gains are significant even though no use is made of conflicts to direct a search. The comparatively small context navigation time confirms the decision to focus on one context at a time.

                           computed data    computation time
                           total #          ctx-nav [sec]    total [min:sec]
    Without value manager  …                …                …:55
    With value manager     …                …                …:22

Table 1: Impact of the value manager on simulation performance

In spite of the large number of intermediate results, the size of the value database is quite limited at the end of the analysis. Only … value manager nodes are maintained.
Justifications are not maintained at all, and the storage consumption of the environments, which directly depends on the maximal size of the assumption database, is also limited. All in all, 1611 assumptions are introduced during the analysis, but thanks to the removal of split assumptions when evaluating extended contexts, the size of the assumption database never exceeds 609. As a consequence, the memory space for label management is reduced by more than 50 percent and the performance of label computations is improved.

4.2 Model-based Diagnosis of a Fly-by-Wire System
The pitch elevator control system [Lunde, 2003] of the next experiments is a typical fly-by-wire system. It consists of an electronic control unit called the primary flight control unit (PFCU) which controls the angle of a pitch elevator surface by means of an electro-hydraulic servo valve and a hydraulic cylinder. The top-level layout of the system (see Figure 3) also includes a power supply unit (PSU), three redundant position sensors, and some electrical wires. Figure 2 shows the actuator part of the system in more detail. Here, electrical signals are converted into hydraulic flows and finally into mechanical movements. Besides the three main components, some redundant components have been added to the design, to keep the system in a safe state in case of faults.

Figure 2: The actuator part of the pitch elevator control system

To control the surface angle, the PFCU compares the actual surface position with the required angle and uses the deviation to adjust the position of the electro-hydraulic servo valve. These adjustments determine the movement of the piston in the cylinder, which finally changes the actual angle of the surface.

As an example of an interesting diagnostic case, we analyze the observed response of the system to a control command from the cockpit which requires the surface angle to change by 5 degrees. Starting with an initial angle of 0 degrees, a movement into the right direction is observed but, due to a defect, the movement does not stop at the angle of -5 degrees. (In reality, this wrong behavior is detected by some monitors, and fault compensation functions are activated. But this mechanism is out of scope here.) We want to know which faults can explain the observed behavior.

Our Rodelica model of the pitch elevator control system is again component-oriented and exactly matches the structure shown in Figures 2 and 3. The system behavior is defined by 621 variables and 501 constraints. The main difference with respect to the automotive system of the first experiment is that this model is dynamic to a great extent. To provide the PFCU with realistic feedback from the controlled components, difference equations are used in several parts of the system, e.g. in the cylinder. They compute Euler steps for the corresponding differential equations, which describe the behavior on the physical level. A second difference is that reliable predictions are achievable even without domain splitting. Therefore, the experiments were performed with local propagation only. Due to the fact that most dynamic state variables are continuous (e.g. the cylinder position), we cannot expect large efficiency gains from reusing results of previous state investigations. But here, the conflict computation contributes to our application, since it is basically a search problem.

In the second experiment, we simulate the response of the system to the cockpit command in nominal mode. To this purpose, a sequence of 30 states is computed in which the initial value ranges of the dynamic state variables of each state are determined by the corresponding predecessor states and the difference equations. After 25 states, the surface angle converges at -5 degrees. Table 2 shows that the value manager still improves the simulation performance, though the gains are not as impressive as in the automotive example. The high absolute number of inferences highlights the importance of data reduction.

                           computed data    computation time
                           total #          ctx-nav [msec]   total [msec]
    Without value manager  …                …                …
    With value manager     …                …                …

Table 2: Impact of the value manager on simulation performance

In the last experiment, we use the GDE-based diagnostic engine of our reference implementation to diagnose the described symptom. For this purpose, we restrict the range of the actual surface position variable in the thirtieth state to the interval [-…, -6] and start diagnosis on that data. The search space is defined by the 26 component fault mode variables and their possible values. The model contains 50 single faults. The number of double faults is approximately …. During the initial check of the candidate "system ok", a conflict occurs in state 30, because the actual surface position is predicted to be around -5 degrees, which is out of the specified range. The corresponding conflict is mapped back (see e.g. [Tatar, 1996]) to the initial state.

Figure 3: The pitch elevator control system
It turns out that nominal mode assumptions of 11 components are involved. This information reduces the search space, because it proves that 15 of the 26 suspicious components cannot explain the symptom, at least not by a single fault. During the diagnostic process, the consistency of 61 candidates is checked with respect to the specified symptom. Conflict back-mapping leads to 11 additional conflicts between the initial fault state assumptions. At the end, the following three minimal candidates remain, which are guaranteed to be the only explanations within the scope of single and double faults:

    LVDT Exc C H disconnected
    actuatorlvdtsensor disconnected
    LVDT Act V1 disconnected & LVDT Act V2 disconnected

Figure 3 shows the top-level view of the pitch elevator control system with the corresponding components highlighted. Our reference implementation needs 26.3 seconds for the computation. In this diagnosis, the conflict sets computed by the value manager lead to a reduction of the search space size from more than 2000 to just 61 candidates. This result emphasizes that dependency tracking is very useful for solving explanatory problems, even if the underlying model is characterized by a low abstraction level and includes continuous dynamic behavior. The level of label completeness provided by the value manager has shown to be adequate for this application.

5 Conclusion
Quantitative reasoning about real physical systems usually leads to a huge amount of intermediate results, which may cause severe complications if the reasoning process is supported by a classical reason maintenance system. As shown in this paper, effective data reduction is crucial for efficient dependency tracking.

The presented value manager is designed as a light-weight alternative to an RMS. It completely avoids inference caching and concentrates on ATMS-style label computation. Although developed for a special model-based analysis tool, the concept of the value manager is rather general. It can be utilized to support any problem solver which reasons about values of variables and provides Horn justifications for all inferred results. The suggested solution for disjunction handling requires the problem solver to evaluate disjunctions in a special depth-first order. It is especially efficient in combination with a solver which is based on domain splitting.

Two applications have been discussed, which confirm the importance of data reduction and demonstrate the efficiency of the value manager. The significant difference in their characteristics also indicates that scalability is necessary for widespread applicability in reliability analysis and diagnosis. The value manager is flexible regarding the actually used data reduction strategies, and thus well prepared for task-specific adaptations. Dependency tracking costs and the benefits obtained for the analysis task at hand can be balanced effectively.

The presentation in this paper focuses on performance with respect to single-processor computers, but special care has been taken to support parallel computing as well. The number of context buffers within a value manager is not limited. A problem solver can benefit from this feature by delegating candidate checking to different threads. Each thread can open its own context buffer to access data. Since the data within the buffers are physically separated from the data of the commonly used value database, synchronization is only needed when changing the context of one of the buffers.

References
[de Kleer and Williams, 1987] J. de Kleer and B. C. Williams. Diagnosing multiple faults. Artificial Intelligence, 32:97-130, 1987.
[de Kleer, 1986a] J. de Kleer. An assumption-based TMS. Artificial Intelligence, 28, 1986.
[de Kleer, 1986b] J. de Kleer. Extending the ATMS. Artificial Intelligence, 28(2), 1986.
[de Kleer, 1986c] J. de Kleer. Problem solving with the ATMS. Artificial Intelligence, 28(2), 1986.
[Doyle, 1979] J. Doyle. A truth maintenance system. Artificial Intelligence, 12, 1979.
[Dressler and Farquhar, 1991] Oskar Dressler and Adam Farquhar. Putting the problem solver back in the driver's seat: Contextual control of the ATMS. In João P. Martins and Michael Reinfrank, editors, Truth Maintenance Systems (ECAI-90 Workshop), volume 515 of Lecture Notes in Computer Science. Springer, 1991.
[Forbus and de Kleer, 1988] K. Forbus and J. de Kleer. Focusing the ATMS. In Proceedings of AAAI'88. MIT Press, 1988.
[Goldstone, 1992] David Jerald Goldstone. Controlling inequality reasoning in a TMS-based analog diagnosis system. In Readings in Model-Based Diagnosis. Morgan Kaufmann Publishers Inc., 1992.
[Kelleher and van der Gaag, 1993] Gerry Kelleher and Linda van der Gaag. The lazy RMS: Avoiding work in the ATMS. Computational Intelligence: An International Journal, 9(3), 1993.
[Lunde et al., 2006] K. Lunde, R. Lunde, and B. Münker. Model-based failure analysis with RODON. In Proceedings of ECAI'06, Italy, 2006 (to appear).
[Lunde, 2003] K. Lunde. Ensuring system safety is more efficient. Aircraft Engineering and Aerospace Technology: An International Journal, 75(5), 2003.
[Lunde, 2005] R. Lunde. Combining domain splitting with network decomposition for application in model-based engineering. In Armin Wolf, Thom Frühwirth, and Marc Meister, editors, 19th Workshop on (Constraint) Logic Programming W(C)LP 2005, Ulmer Informatik-Berichte. University of Ulm, Germany, 2005.
[Tatar, 1994] M. Tatar. Combining the lazy label evaluation with focusing techniques in an ATMS. In Proceedings of ECAI'94, Amsterdam, The Netherlands, 1994.
[Tatar, 1996] M. Tatar. Diagnosis with cascading defects. In Proceedings of ECAI'96, Budapest, Hungary, 1996.

A Supervision Architecture to Deal with Disruptive Events in UAV Missions

Rachid El Mafkouk (*), Jean-François Gabard, Catherine Tessier (**)
(**) Office National d'Études et de Recherches Aérospatiales (Onera)
Département Commande des Systèmes et Dynamique du Vol (DCSD)
2 avenue Édouard-Belin, Toulouse cedex 04, France
Jean-Francois.Gabard@onera.fr, Catherine.Tessier@onera.fr
(*) at Onera-DCSD for a training period, April-Sept.

Abstract
This paper presents a generic supervision architecture dedicated to autonomous response to disruptive events for a UAV. We consider a UAV whose mission may be disrupted by internal or external events (e.g. failures, weather situation, interfering aircraft…), and we make the assumption that the environment is such that the UAV cannot communicate with the ground segment to deal with these events. The same approach is used for modelling and monitoring the nominal mission and for the design of the reaction and replanning strategies. Special attention is paid to the management of multiple concurrent events, and a classification is proposed according to their impact on the mission, which allows event combining rules to be designed. Two main types of reconfiguration strategies are considered, depending on the seriousness of the disruptive event: those implying an immediate safety reaction before replanning, and those enabling a replanning process to be engaged without preliminary safety procedures. The architecture is implemented with ProCoSA, an asynchronous Petri net-based tool dedicated to mission monitoring and procedure execution in autonomous systems; formal verification tools offered by ProCoSA are used to validate the architecture, and scenarios are tested in a simplified simulation environment. Contrary to e.g. [Hamilton et al., 2001], who have designed and tested RECOVERY, a heterogeneous knowledge-based diagnosis method for AUV internal failures, the paper does not address diagnosis in itself, but rather how to use the results issued from the Detection and Isolation functions of the FDIR process for Reconfiguration.

1. Introduction
Onboard decision capabilities allow an uninhabited vehicle to reach mission objectives taking into account disruptive events. Decisional autonomy is necessary when the vehicle manoeuvres in a partially known, dynamic and hostile environment, such that communication with the operator may not be available at any time. Research on autonomy is done for ground robots, Uninhabited Aerial Vehicles (UAVs), Autonomous Underwater Vehicles (AUVs) and space vehicles. Autonomy is characterised by the level of interaction between the vehicle and the human operator: the higher the level of the operator's decisions, the more autonomous the vehicle is. Between teleoperation (no autonomy) and full autonomy (no operator intervention), there are several ways to allow a system to control its own behaviour during the mission [Clough, 2002]. One way to make the vehicle autonomous is to implement onboard decision capabilities to allow the vehicle to perform the mission even when the initial plan prepared offline is no longer valid. Decision capabilities must be implemented within the closed loop {perception, situation assessment, decision, action} and include autonomous response to disruptive events.
The DCI system [Schreckenghost et al., 2005] provides two data monitoring and event detection capabilities: the Event Detection Assistant (EDA) triggers simple conditional events, whereas the Complex Event Recognition Architecture (CERA) detects situations consisting of sets of events organised temporally and hierarchically. EDA monitors telemetry data and CERA compares incoming data to pre-defined event conditions such as logical relations on incoming data.

Events together with urgency information are then presented to the user. The same kind of event generation is used in [Barbier et al., 2006a], but it is implemented within a UAV onboard architecture including monitoring and replanning tasks, in order to avoid a systematic return to base and to proceed with the mission autonomously given the new constraints. This paper goes further in so far as the main focus is the processing of multiple concurrent events in the autonomous reconfiguration phase.

Let us consider a UAV whose mission may be disrupted by internal or external events (e.g. failures, weather situation, interfering aircraft, threats...). The environment is such that the UAV cannot communicate with the ground segment to deal with disruptive events. The nominal mission of the UAV is defined through a set of operational tasks (i.e. tasks involving payloads) and non-operational tasks (takeoff, waypoint rejoining...). Operational tasks are described as sets of legs that include waypoints to be rejoined. The nominal flight plan is described as a list of tasks, legs and waypoints requiring or not the use of one or several payload modes. The nominal mission is represented by a set of ProCoSA Petri nets (see Appendix A) allowing the current activities to be monitored on-board.

A classification of the possible disruptive events is given in the second section, as well as a way to deal with several concurrent events. The strategies enabling concurrent disruptive events to be dealt with are described in the third section, and the software environment used for the first simulation tests is presented in the fourth section. Appendix A gives the main features of ProCoSA, which is used for implementing the supervision architecture, and Appendix B is a short reminder about Petri nets.

2. Disruptive events
A disruptive event is a logical condition on the values of parameters coming from the telemetry frame. Example:

    (Frame_OK AND sensor_block OK AND RPM_parameter OK AND RPM_measure < N)
    OR (Frame_OK AND pilot_block OK AND EngineOff = true)

is the event corresponding to an engine failure, with RPM the rotation speed of the engine.

Many kinds of disruptive events may occur during a mission, and multiple events have to be considered. In order to avoid the combinatorial aspect of multiple events, individual events are classified according to their impact on the mission: absorbing events (e.g. engine failure), safety-related events (e.g. interfering aircraft), mission-related events (e.g. payload failure), communication-related events (e.g. telemetry failure). This allows event combining rules to be designed, e.g. a catastrophic event masks any other kind of event, the constraints of two (or more) safety-related events are considered together, etc.

2.1 Event classification
The following classification is proposed: absorbing events (E_A) lead to mission abortion. They cannot be recovered and the reaction amounts to making the UAV land as smoothly as possible. When such an event occurs, the processing of any other kind of events is aborted and no further incoming event can be processed. Example: total engine failure. Safety-related events (E_S) lead to modifying the flight profile or the flight plan (e.g. a route change for a while), which may induce delays or new constraints on the use of the payload. Examples: interfering aircraft, new forbidden area, turbulence. Mission-related events (E_M) only have consequences on the mission itself. Replanning amounts to adapting the mission to the new constraints, e.g. removing waypoints.
Examples: camera failure, violated temporal constraint, new mission goal. Communication-related events (E_C) are related to communication breakdowns between the UAV and the ground. Such events result in the UAV being fully autonomous, therefore it has to proceed with the mission as planned. Example: telemetry failure.

Remark: the UAV can detect events only if the relevant information is available, either from its own sensors or from communication. In case a sensor or communication breaks down, some information is no longer available and consequently some disruptive events may be missed. This would also be the case for an inhabited aircraft.

2.2 Event combining rules
The assumption is made that events occur asynchronously, therefore the occurrence of simultaneous events is not considered. The following rules are set to deal with two successive events, i.e. such that the second event occurs while the first one is dealt with: an absorbing event E_A has priority over any other type of event; a safety-related event E_S has priority over a mission-related event E_M; two successive safety-related events E_S1 and E_S2 are dealt with within a unique process: the flight plan is updated taking account of the constraints resulting both from E_S1 and E_S2 (should some constraints be incompatible, the most time-critical or safety-critical ones are dealt with first); two successive mission-related events E_M1 and E_M2 are dealt with within a unique process: the mission is updated taking account of the remaining available resources, resulting in a degraded mission plan;

communication-related events E_C do not interfere with the other types of events: indeed they are not dealt with explicitly and the UAV goes on with the processing of the other events it can be aware of.

Table 1 sums up the combining rules: considering the type of a first event and the type of a second event occurring while the first one is being dealt with, the result indicates the type of the event that will actually be dealt with. When two E_S or E_M events occur successively, the notations E_S² and E_M² mean that the on-going reconfiguration procedure will be rerun on a new set of constraints built from the constraints of both events.

    first \ second   E_A    E_S    E_M    E_C
    E_A              E_A    E_A    E_A    E_A
    E_S              E_A    E_S²   E_S    E_S
    E_M              E_A    E_S    E_M²   E_M
    E_C              E_A    E_S    E_M    E_C

Table 1: event combining rules

As a matter of fact, the rules can be applied recursively to n successive events, as E_S² events are E_S-type events and E_M² events are E_M-type events. This will be one of the main points featured by the reconfiguration strategy.

3. Dealing with disruptive events
The strategy that is designed to deal with disruptive events is a two-step process: (1) a pre-processing of incoming events (applying the combining rules defined previously should successive events occur) and (2) reconfiguration procedures.

3.1 Event pre-processing
A Valid Event Generator (VEG) is designed in order to filter the events coming from the onboard telemetry frame according to the current state of the mission (especially the events that are currently being dealt with) and to the combining rules (see section 2.2). Consequently, only valid events (noted VE) are actually dealt with. In order to deal with multiple successive events, and considering the fact that the combining rules can be applied recursively, the rank of a valid event is defined as follows: a valid rank-1 event (VE1) is issued by the VEG when the state of the mission is nominal (there is no on-going reconfiguration process); a rank-1 event will trigger a rank-1 decision. A valid rank-2 event (VE2) is issued by the VEG while another event is being dealt with; a rank-2 event will trigger a rank-2 decision.

Table 2 shows the event that is issued by the VEG according to the event being processed and the incoming event:

    incoming \ being processed   VE1_A   VE1_S   VE1_M   VE1_C
    E_A                          VE1_A   VE2_A   VE2_A   VE2_A
    E_S                          VE1_A   VE2_S²  VE2_S   VE2_S
    E_M                          VE1_A   VE1_S   VE2_M²  VE2_M
    E_C                          VE1_A   VE1_S   VE1_M   VE2_C²

Table 2: events issued by the VEG

Let us explain the first two columns (the explanations are similar for the last two): when an E_A event is being processed (VE1_A), no other event can be dealt with; when an E_S event is being processed (VE1_S): if the incoming event is an E_A event, it is dealt with immediately as a rank-2 event (VE2_A); if the incoming event is an E_S event, both events are dealt with together as a rank-2 event (VE2_S²); the other incoming events are not dealt with.

Remark: the events that are not dealt with because they are filtered by the VEG are not forgotten: depending on the state of the mission, they can be dealt with afterwards. Example: the E_M event "payload failure" can be dealt with if the processing of the E_S event "interfering aircraft" has led to a new plan that allows payload legs to be performed.
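Table 1 is small enough to implement directly. The following Java fragment is our own rendering (not the authors' code): the E_S² and E_M² cases come out as the same category, to be re-run on the merged constraint sets, which is what makes the recursive application to n successive events work.

```java
/** Event categories and the combining rules of Table 1 (our own rendering).
 *  Combining two E_S or two E_M events yields the same category, re-run on the
 *  union of both events' constraints (the E_S^2 / E_M^2 case). */
enum EventType {
    ABSORBING, SAFETY, MISSION, COMMUNICATION;

    /** Type actually dealt with when 'second' arrives while 'first' is processed. */
    static EventType combine(EventType first, EventType second) {
        if (first == ABSORBING || second == ABSORBING) return ABSORBING;
        if (first == COMMUNICATION) return second;        // E_C does not interfere
        if (second == COMMUNICATION) return first;
        if (first == SAFETY || second == SAFETY) return SAFETY;
        return MISSION;                                   // E_M combined with E_M
    }
}
```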

The following ProCoSA Petri net (figure 1) synthesises how multiple events are taken into account:

Figure 1: decision_levels Petri net

The state of the mission is one of the following: the nominal state; on-going reconfiguration in case of a single disruptive event (on-going rank-1 decision): if the process is interrupted by a VE2 event, the VE1 event is memorised for possible later processing; on-going reconfiguration in case of multiple disruptive events (on-going rank-2 decision): the loop on the place associated with this state represents the possible recursion on the event combining rules; if the VE1 and VE2 events belong to the same category, a new reconfiguration problem is solved, taking into account both sets of event parameters; if the VE2 event belongs to a higher priority category, the pending VE1 event may be processed later if the state reached by the aircraft after the processing of VE2 allows it.

Remark: considering two decision ranks (and therefore two event ranks) is relevant because: dealing with a disruptive event when the mission is nominal and when the situation is already degraded may lead to different decisions; and for a particular mission, the possibility is offered to cut the loop and only allow two successive events to be dealt with (a third one would trigger a Return To Base).

3.2 Reconfiguration procedures
Whatever the decision rank, two reconfiguration strategies are designed so as to cope with (1) disruptive events that call for an immediate reaction to secure the UAV and (2) disruptive events for which such an immediate reaction is not needed.

Reaction-Then-Replanning (RTR) is triggered for VE_A and VE_S events: such events affect flight safety and require an immediate reaction, before flight plan replanning. Examples: for the VE_S event "interfering aircraft", the reaction consists in immediately modifying the UAV route according to the rules of the air; the flight plan is replanned afterwards. For VE_A events (e.g. total engine failure), the reaction consists in trying to reach the closest landing field; most often the mission is aborted and there is no replanning.

Reaction-With-Replanning (RWR) is triggered for VE_M events: a smooth reaction may be performed (usually a holding pattern) during which a corrected plan is computed. Example: for the VE_M event "payload failure", the UAV may fly a holding pattern while the legs involving the failed payload are removed from the plan and a new mission plan is computed.

The reaction and replanning computations are based on generic procedures that are implemented as dedicated processes within ProCoSA. The on-line computation of a relevant reaction and a corrected plan amounts to selecting one or several procedures and instantiating them with the current mission and UAV parameters. The ProCoSA Petri net shown in figure 2 implements the way the current plan is modified by the reactions and replanning that are triggered by the reconfiguration strategies.

Figure 2: Planner Petri net

A VE_A event leads to an emergency landing procedure (the closest landing field is chosen among the possible emergency landing fields) and the mission is aborted: it is an RTR with no replanning phase; a VE_S event triggers an RTR; a VE_M event triggers an RWR, and the execution of the new plan has to be appended smoothly to the end of the holding pattern (place synchro_holding_pattern); a VE_C event is just memorised and the UAV goes on with its current plan.
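The two strategies can be summarised as a simple dispatch; this Java sketch is our own illustration (the actual reactions are ProCoSA procedures, and all method bodies here are hypothetical placeholders).

```java
/** Dispatch of the reconfiguration strategies (a sketch; method names are ours). */
class Reconfigurator {
    enum Ev { ABSORBING, SAFETY, MISSION, COMMUNICATION }

    void handle(Ev type) {
        switch (type) {
            case ABSORBING -> emergencyLanding();              // RTR with no replanning phase
            case SAFETY -> { reactImmediately(); replan(); }   // Reaction-Then-Replanning
            case MISSION -> { holdingPattern(); replan(); }    // Reaction-With-Replanning
            case COMMUNICATION -> { /* memorise; proceed with the current plan */ }
        }
    }

    void emergencyLanding() { /* reach the closest landing field, abort the mission */ }
    void reactImmediately() { /* e.g. route change according to the rules of the air */ }
    void holdingPattern()   { /* smooth reaction while a corrected plan is computed */ }
    void replan()           { /* instantiate generic procedures with mission and UAV state */ }
}
```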
4. Simulation environment and first results

4.1 Simulation environment
Besides the architecture we have developed for flight testing [Barbier et al., 2006a], which only deals with single disruptive events, a simplified software simulation environment has been built in order to validate the processing of multiple successive events. This environment includes two main components, which are developed as ProCoSA sub-system functions (see Appendix A): the Valid Event Generator (VEG) process and the simulation process.

The behaviour of the VEG is implemented as described in Table 2. A simplified user interface allows any kind of disruptive event to be entered at any time during the course of the simulated mission.

The VEG sends triggering events to the ProCoSA procedure Petri nets dedicated to disruptive event management, like the decision_levels Petri net (figure 1). The simulation process includes three main functions: mission data acquisition; nominal mission execution; replanning actions.

Two data files are used for the nominal mission: the first one is dedicated to the flight plan, which is described as a list of waypoints; each waypoint is a triplet (waypoint type, required fly-over time, payloads to be used); three payloads are considered. The second one contains the locations of the emergency landing fields that are available for the mission. As soon as the mission data are read, the mission is executed and monitored. The flight plan is executed sequentially, until it is interrupted by a disruptive event.

Several simplified reactions and replanning actions have been implemented for individual events, among which: E_A type events: the emergency landing field closest to the current waypoint is selected; interfering aircraft (E_S type event): the reaction procedure is an immediate trajectory change for collision avoidance; the replanning strategy aims at modifying (or not) the initial flight plan so that the waypoint fly-over times are satisfied; partial loss of engine power (E_S type event): the replanning function aims at maximising the number of operational tasks to be performed, taking into account the waypoint fly-over times, especially for the FEBA (Forward Edge of Battle Area) exit point; no immediate safety reaction is needed in this case; payload failure (E_M type event): the replanning function elaborates a new plan with the remaining payloads.

4.2 First results
Formal validation. The ProCoSA verification tool (see Appendix A) allows formal properties of the Petri nets to be checked. This tool has been used to check the consistency of the nominal and replanning procedure Petri nets.

Simulations. Simulation runs have enabled the correct behaviour of the replanning functions to be checked, especially in the case of multiple successive events. Indeed, no pre-planned procedure exists for multiple events, and what is checked is that the VEG filters the events correctly and that the computed reaction and replanning are relevant.

Example 1: two successive interfering aircraft (figure 3). The simulated UAV initiates a first 90° heading modification to the right to avoid the first aircraft and then increases it to avoid the second one. Then the UAV enters the replanning phase.

Figure 3: two interfering aircraft

Indeed, both events are E_S events and the associated constraints are aggregated. Therefore the second reaction is a global reaction to both events.

Example 2: payload failure then partial engine failure (figure 4). When the first event occurs, the simulated UAV replans its mission taking account of the failed payload (i.e. the waypoints involving this payload are cancelled). When the engine fails (second event), the UAV speed is reduced and the UAV goes on with the mission, cancelling some more waypoints so as to meet the FEBA time constraint.

Figure 4: payload + partial engine failure

The payload failure is an E_M event, therefore a new plan is elaborated. The partial engine failure is an E_S event, therefore it has priority over the E_M event: the VEG issues this event and the corresponding reaction is triggered.

5. Conclusion
This work has focused on a generic approach to deal with multiple successive disruptive events in UAV missions: a classification of events has been given, together with combining rules allowing a generic decision framework to be designed. A major advantage is that only the decision and replanning algorithms for single events are implemented: any event chain can be dealt with through the combining rules, thus avoiding the combinatorial aspects of multiple events.

The decision architecture is implemented with ProCoSA, which allows (1) nominal mission supervision and abnormal situation management to be dealt with within the same framework and (2) a generic supervision architecture to be designed: indeed, the architecture is not dedicated to a specific UAV for a specific mission, the classification of disruptive events allows generic reaction and replanning processes to be implemented, and they are coded within independent sub-system software functions. Moreover, the ProCoSA architecture can be implemented straight onboard. Simulation tests have highlighted the robustness of the decisions, which is due to Petri net modelling and the associated analysis techniques.

Ongoing work focuses on the following: tests with real telemetry data and real-time operating systems are conducted to get more realistic simulation conditions; more elaborate replanning strategies [Chanthery et al., 2005] are considered to cope with real-time requirements and to deal with complex constraints (e.g. new threats, fuel consumption); situation assessment procedures including a prediction function are considered to help anticipate the occurrence of disruptive events, especially in the case of multiple events.

Appendix A: ProCoSA
ProCoSA [Barbier et al., 2006b] is a software environment meant for controlling and monitoring highly autonomous systems. System autonomy is usually obtained by putting together various functions, among which: data analysis (sensor data, monitoring data, operator's inputs); nominal mission monitoring and control (vehicle and payload control actions); decision (management of disruptive events, replanning). These functions are often developed as separate subsystems and they have to co-operate in order to fulfil the autonomous system behaviour requirements for the specified missions. More precisely, the needs are the following: off-line tasks: specification of the co-operation procedures between subsystem software, subsystem coding for embedded operation; on-line tasks: procedure monitoring, event monitoring, and management of the dialog with the operator.
ProCoSA includes the following components: EdiPet, a graphical interface for Petri nets, which is used both by the developer for procedure design and by the operator for execution monitoring (figure 5); JdP, the Petri net player, which executes the procedures, fires the event-triggered transitions of the Petri nets and synchronises the activation of the associated sub-system functions; a socket-based communication protocol, which allows data to be exchanged with external sub-system software; and Tiny, a Lisp interpreter dedicated to distributed embedded applications.

The ProCoSA procedures are modelled with interpreted Petri nets (see Appendix B): triggering events are associated with transitions (a validated transition is fired if and only if the associated triggering event occurs); triggered actions are also associated with transitions; they consist in messages sent to JdP when the transition is fired, and the possible actions are: Petri net activation requests, sub-system software function activation requests, and event generation requests.

Figure 5: EdiPet graphical interface

Timers can be programmed with ProCoSA: a special activation request enables a timer variable to be instantiated, which allows actions with a limited duration to be modelled.

The ProCoSA procedures are used to model the desired behaviours of the autonomous system; the hierarchical modelling features offered by ProCoSA enable the whole application to be structured in a generic way: at the highest description level, generic behaviours can be described, regardless of the characteristics of a given vehicle; at the lowest level, they specify the sequences of elementary actions to be performed by the vehicle or the payloads. This modular approach enables a quick adaptation to system changes (e.g. taking into account a new payload).

An important feature of ProCoSA lies in the fact that there is no code translation step between the Petri net procedures and their execution: they are directly interpreted by the Petri net player, thus avoiding any supplementary error causes.

ProCoSA finally includes a verification tool, which makes use of Petri net analysis techniques to check that some good properties are satisfied by the procedures, both at the single procedure level and at the whole project level (that is to say, taking into account inter-net connections); the following properties are checked: place safety (not more than one token per Petri net place); detection of dead markings (deadlocks); detection of cyclic firing sequences (loops).

Appendix B: a Petri net reminder
A Petri net [Murata, 1989] ⟨P, T, F, B⟩ is a bipartite graph with two types of nodes: P is a finite set of places; T is a finite set of transitions. Arcs are directed and represent the forward incidence function F: P × T → ℕ and the backward incidence function B: P × T → ℕ, respectively. The marking of a Petri net is defined as a function from P to ℕ: tokens are associated with places. The evolution of tokens within the net follows transition firing rules. Petri nets allow sequencing, parallelism and synchronisation to be easily represented. An interpreted Petri net is such that conditions and events are associated with transitions.
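For readers who want the firing rule in executable form, here is a minimal Java sketch of our own, directly following the definitions above: a marked net with the standard enabling and firing rules.

```java
/** Marked Petri net with the firing rule of Appendix B.
 *  F[p][t] and B[p][t] encode the incidence functions F, B: P x T -> N. */
class PetriNet {
    int[][] F, B;      // F[p][t]: tokens consumed from place p; B[p][t]: tokens produced
    int[] marking;     // marking: P -> N

    /** A transition is enabled if every place holds at least F(p,t) tokens. */
    boolean enabled(int t) {
        for (int p = 0; p < marking.length; p++)
            if (marking[p] < F[p][t]) return false;
        return true;
    }

    /** Fire transition t: consume F(p,t) and produce B(p,t) tokens per place. */
    void fire(int t) {
        if (!enabled(t)) throw new IllegalStateException("transition not enabled");
        for (int p = 0; p < marking.length; p++)
            marking[p] += B[p][t] - F[p][t];
    }
}
```

In an interpreted net such as ProCoSA's, enabled(t) would additionally check the triggering event associated with t before firing.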
References
[Barbier et al., 2006a] M. Barbier, J.-F. Gabard, J.-H. Llareus, C. Tessier, J. Caron, H. Fortrye, L. Gadeau, G. Peiller. Implementation and flight testing of an onboard architecture for mission supervision. In UAVs 2006, 21st International Conference on Unmanned Air Vehicle Systems, Bristol, UK, April 2006.
[Barbier et al., 2006b] M. Barbier, J.-F. Gabard, D. Vizcaino, O. Bonnet-Torrès. ProCoSA: a software package for autonomous system supervision. In CAR'06, 1st Workshop on Control Architectures of Robots, Montpellier, France, April 2006.
[Chanthery et al., 2005] É. Chanthery, M. Barbier and J.-L. Farges. Planning algorithms for autonomous aerial vehicle. In 16th IFAC World Congress, Prague, Czech Republic, July 2005.
[Clough, 2002] B.T. Clough. Metrics, Schmetrics! How the heck do you determine a UAV's autonomy anyway? In Performance Metrics for Intelligent Systems Workshop, Gaithersburg, MD, USA, 2002.
[Hamilton et al., 2001] K. Hamilton, D. Lane, N. Taylor and K. Brown. Fault diagnosis on autonomous robotic vehicles with RECOVERY: an integrated heterogeneous-knowledge approach. In ICRA 2001, IEEE International Conference on Robotics and Automation, Seoul, Korea, 2001.
[Murata, 1989] Tadao Murata. Petri nets: properties, analysis and applications. Proceedings of the IEEE, 77(4):541-580, April 1989.
[Schreckenghost et al., 2005] D. Schreckenghost, C. Thronesbery and M.B. Hudson. Situation awareness of onboard system autonomy. In i-SAIRAS 2005, 8th International Symposium on Artificial Intelligence, Robotics and Automation in Space, Munich, Germany, September 2005.


Debugging Failures in Web Services Coordination

Wolfgang Mayer and Markus Stumptner
Advanced Computing Research Centre
University of South Australia

Abstract
The rise of Web Services over the past years offers a new development paradigm for distributed applications: high-level communication using exchange of structured XML data, with communication protocols orchestrated by workflow languages with complex control constructs. We study the use of model-based techniques that have been used for fault analysis in imperative (Java) and concurrent (VHDL) languages in a Web Service environment, with the goal of diagnosing Web service interactions specified in BPEL4WS, using an Abstract Interpretation approach.

1 Introduction
Web services are currently gaining ground as a new paradigm for distributed applications [Alonso et al., 2004], using Web protocols and XML-based data formats to replace the traditional middleware layer for communication between self-contained, externally invokable applications, called services. The XML encoding and the standardised interface definitions such as provided by the Web Services Description Language (WSDL) [Christensen et al., 2001] facilitate interoperability, making Web services a well-suited basis for EAI and cross-company application integration. This has led to the development of service-oriented architectures, where composite applications (business processes) are assumed to be composed from interacting (Web) services. The development of such service constellations requires specifying and programming the actual interaction patterns (referred to as choreography¹) between the different services, and dedicated languages have been developed for this purpose, with BPEL4WS [Curbera et al., 2003] and OWL-S [owl, 2004] currently the foremost representatives. The services coordinated in this fashion could themselves be written in these languages, or implemented as traditional applications with a WSDL interface. The fact that these languages provide high-level process description constructs (easily mappable to standard process design notations) while their XML-based code and data structures make them amenable to metadata descriptions such as the various Semantic Web service proposals invites speculation about automated composition and repair, e.g., in the work on self-healing services depicted in [Ardissono et al., 2005], which assumes the ability to diagnose individual services (which may be written in arbitrary languages) and to diagnose their choreography. Taking a slightly different approach, we look at a classical debugging scenario for the choreography itself, and consider the task of diagnosing BPEL choreographies.

¹ Defined as the ability to compose and describe the relationships between lower-level services. Although differing terminology is used in the industry, such as orchestration, collaboration, coordination, conversations, etc., the terms all share a common characteristic of describing linkages and usage patterns between Web services. (Web Services Choreography Working Group Charter)
2 A Crash Course in BPEL4WS
The definition of BPEL4WS [Curbera et al., 2003] (shortened to BPEL from here on) is based on the assumption that to realise the full potential of Web Services as an integration platform, applications and business processes will need a standard process integration model to interact in a principled fashion, and that this model will need to support business process executions: potentially long sequences of message exchanges within stateful interactions run by two or more parties. Thus, a need was perceived for a language to specify the patterns of message exchanges and the description of process states (since Web Services, as defined by WSDL, are essentially stateless).

BPEL has been defined to be used in two ways. Either it is used as a specification language for business protocols, which describes patterns of possible message exchanges while abstracting away from specific process details (in particular, aspects that individual companies may want to keep out of the process definition, referred to as opaque). In this case, BPEL definitions are nondeterministic (for example, the specific choice between multiple offered selections could not be identified ahead of time; what is specified is that a choice will be made). They could also be understood as constraints on an actual execution.
as a machine-readable specification that may be used as additional information in debugging the actual code that implements it (similar to [Stumptner, 2001]). (However, the question then arises why designers would not use other common techniques, such as the various UML diagrams, for specification purposes.) The second way, which we will focus on here, is to use BPEL as an executable language that describes the actual computations and message contents to be passed on at the coordination level.
The business processes that BPEL defines are supposed to coordinate Web Services that communicate according to the specifications of WSDL [Christensen et al., 2001]. WSDL services define named port types (corresponding to interfaces in conventional programming languages) that specify individual operations. An operation is a predefined exchange of messages, which can be one-way (receiving), request-response, solicit-response (two-way, depending on which side starts the exchange) or notification (just sending). A service and message name together with a specification of the actual protocol used for sending the message (e.g., SOAP) is called an endpoint. BPEL builds on this structure by establishing partner links that send messages between service endpoints. A process also possesses variables that can be used to store state data and process history resulting from message exchanges. (WSDL parameters, and therefore BPEL variable values, are XML structures; the expression language specified to access these structures or parts of them is XPath.)
Example 1 A standard example frequently mentioned in the literature is the Loan Approval process (Figure 1). The process involves four parties, each represented as an independent Web Service: the client (Cl), the Risk Assessor (RA), the Loan Approver (LA), and the Financial Institution (FI). Initially, the client Cl requests a loan of amount monetary units from FI. If the amount is above 10000, the application must be studied in detail and is forwarded to LA for this purpose. Otherwise, if the amount is low, a risk assessment is obtained through the RA service. If the risk is low, approval is automatic. High-risk cases are processed the same way as applications involving large amounts. Once the decision about a request has been made, the Cl is notified. Provided the application was approved, the client can then withdraw money from the account. In case the amount withdrawn is less than the approved amount, the client is notified, the credit balance is updated and the process ends. Otherwise, an exception is thrown, which subsequently triggers a reply to the Cl. In addition to that, the risk assessment needs to be discarded as the client is no longer trustworthy. In the rest of this paper, we are interested in the process description dealing with FI and consider the other processes as opaque processes.
Occurrences of individual actions in a BPEL process are referred to as activities. The basic message-passing activities are invoke, receive, and reply, of which the first refers to initiating a one-way or request-response message exchange. Variables are assigned values using the assign activity. Basic activities can be grouped by various control constructs (with the usual semantics): while, sequence, and switch.
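To make the branching logic of Example 1 concrete before looking at the process graph, the following Python sketch renders it sequentially (the function names and helpers are our own illustrations; in the actual process these steps are distributed BPEL activities, not local calls):

def loan_approval(amount, risk, ask_loan_approver):
    """Decision logic of Example 1. risk is RA's verdict ('low'/'high');
    ask_loan_approver stands in for the LA service."""
    if amount >= 10000:
        return ask_loan_approver(amount)    # detailed study by LA
    if risk == "low":
        return True                         # automatic approval
    return ask_loan_approver(amount)        # high risk handled like large amounts

def withdraw(approved_amount, wda):
    """Client withdrawal; exceeding the approved amount corresponds to the
    BPEL throw that triggers the fault handler."""
    if wda > approved_amount:
        raise RuntimeError("withdrawn amount exceeds approval")
    return "client notified, balance updated"

# The scenario used later in the paper: a small, low-risk loan is approved
# automatically, but withdrawing more than the approved amount fails.
assert loan_approval(100, "low", lambda a: False) is True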
[Figure 1: Loan Application Process. Flow graph of activities A-K of the FI process, exchanging messages with Cl, RA and LA; branches on amount < 10000 / amount >= 10000 and wda <= amount / wda > amount, with a fault handler replying to the client.]
The two most important constructs that express concurrency and nondeterminism are flow and pick. The first, unlike a sequence, allows one to express explicit synchronisation arcs between the activities included in the flow. (These activities can be basic or themselves be nested.) The pick activity waits for a set of incoming messages; once one has been received, it is chosen for processing and the others are ignored. BPEL also permits the specification of temporal events, e.g., for defining wait periods and timeouts.
2.1 Fault handling
Since BPEL processes represent long-running business activities, classic atomic transaction models were considered inapplicable. Instead, BPEL relies on explicit reporting of faults, either through a set of system exceptions or by using the throw activity, and on their handling by dedicated fault handling activities. The set of activities that are terminated by throwing a fault is defined by explicitly grouping activities in user-defined scopes. A scope is simply a group of activities that are treated as a unit for fault handling purposes. Scopes can be nested, and fault handlers can throw faults in enclosing scopes. (Conversely, a fault in a given scope A results in the termination of all activities in A and all scopes nested in A.) A key property of the fault handlers is that they may contain so-called compensation handlers. Transaction effects are not assumed to be automatically rolled back; instead it is the duty of the developers to program compensating actions that restore a correct overall system state, a classic concept from long-running transaction research. An implication of this choice is that compensation handlers are completely application dependent.
2.2 Executable vs Non-executable BPEL
BPEL extensions for executable processes include the ability to explicitly terminate the behaviour of a business process instance. It requires the use of input/output variables in
message-related activities (they can be abstracted out in non-executable processes). Also added are fault definitions for the cases where an XPath expression on the right side of an assignment selects no node or more than one node; where a variable or correlation is used before it is initialised; where multiple receive actions are enabled for the same partner, link, port type, operation and correlation sets; and, finally, for multiple outstanding synchronous requests.
Example 2 (cont'd) Modelling the process in Figure 1 in BPEL, the decision of whether to automatically assess an application or to undertake a more detailed analysis can be modelled as a flow, where the choice which transition to follow is made by the transition conditions amount < 10000 and amount >= 10000. The communication with RA and LA is done using synchronised invoke activities, blocking until the response has arrived. The decision whether to accept or reject the application is stored in a variable accept. The value of the variable is obtained from the response from LA, or, if the automatic assessment predicted a low-risk case, through an explicit assignment. The reply to the client is modelled using reply activities to make sure the client can associate the asynchronous reply with the initial request. In case the withdrawn amount is invalid, an error message is constructed in variable message and is subsequently forwarded to the client. Note that the BPEL description contains an error: if the withdrawal fails, the risk assessment is not undone. This is because the fault handler does not include the explicit compensate action to trigger the rollback. (While BPEL provides default fault and compensation handlers that would trigger compensation actions in case the exception escapes unprocessed, the explicitly specified fault handler disables this functionality.)
3 Modelling Web Services for Diagnosis
In this work we are primarily concerned with locating faults in business process models and Web service descriptions. In contrast to previous work [Ardissono et al., 2005], our goal is to locate errors in the description of a service coordination rather than to identify a faulty service in a group of interacting processes. While [Ardissono et al., 2005] employ local reasoners to monitor the Web service execution and eliminate impossible explanations, we assume a more centralised approach where a single coordination template is the focus of interest. Information obtained from a failing service execution, together with descriptions of the interaction protocols of the peer services involved in the execution, allows us to derive possible explanations for the misbehaviour in the specification of the local service.
In principle, the same techniques employed to debug computer programs [Stumptner and Wotawa, 2000; Mayer and Stumptner, 2002] may seem suitable for analysing BPEL descriptions. On deeper analysis, however, it becomes evident that there is a fundamental difference between computer programs and BPEL "programs": while the effects of arbitrary programs can be derived from the program's structure and the semantics of the language constructs, actions in a BPEL specification are usually described at a very high level that is not directly amenable to analysis or execution. Consequently, models based solely on the propagation of values between statements are not effective and result in many possible explanations.
To overcome this limitation, we propose to incorporate information about the messages passed between communicating processes to eliminate explanations that imply lost messages or blocked processes. In addition, concrete values obtained from successful and failing executions of the service are exploited to derive contradictions and focus the search on relevant paths through the process.
3.1 Modelling Constraints
Typically, models created for diagnosis purposes are built statically, considering only the structure of the system and the flow between components. This approach has proved to be successful and often allows for optimising the performance of the diagnosis engine by pre-compiling parts of the model description [Darwiche, 1999; Fröhlich and Nejdl, 1997]. This approach has also been applied to computer programs [Stumptner and Wotawa, 2000; Mayer and Stumptner, 2002], but is limited to deterministic execution paths. In particular, the presence of loops requires a meta-layer which dynamically modifies the model to accommodate additional iterations. More importantly, the models assume that the data flow between components can be determined statically, which makes them unsuitable for expressing concurrent executions. Much effort has been invested in modelling different aspects of concurrent systems as automata or transition systems, applying various forms of state space reduction to keep the models small [Corbett, 1998]. A prerequisite of many reduction techniques is that all possible transitions between locations in the system are known. Unfortunately, the externally triggered termination of processes, as is possible in BPEL, renders many standard techniques unusable for our purposes.
Prior to describing a model that can be used for debugging BPEL descriptions, we first take a look at the constraints the modelling process must adhere to:
Business processes and Web services are not purely computational services that can be invoked arbitrarily, but may also have effects on the real world. For example, a commercial service would charge a fee. Thus, the diagnostic process should not rely on the repeated execution of a service, but exploit the information obtained through a single (or a small set of) executions.
The assumption that a precise description of the (correct) operation of the service would be available is somewhat unrealistic. While description languages such as BPEL and OWL-S gain popularity, generally only a partial description of message sequences and preconditions is provided; a precise specification at a level that could be used for diagnosis is usually not available (yet). Thus, the diagnosis engine should not assume the presence of specifications beyond what is provided through the execution to be debugged and a specification of the expected results.
In contrast to many programming languages and diagnostic models, BPEL processes operate on complex messages containing structured but dynamically generated data, such as XML documents. Consequently,
building precise models that can not only check but also predict values becomes more difficult in the general case.
Concurrent execution and termination of processes are essential features of BPEL and must be supported to a certain degree.
3.2 Dynamic Model
To overcome the limitations discussed in the previous section, we propose a dynamic modelling approach where the model is not derived statically, but built by taking the information provided by observations and test cases into account. First, the process description to be diagnosed is modified to reflect the current fault assumptions of the diagnosis engine. Then, an abstract execution engine tries to find a path that is (a) feasible given the observations and (b) such that the modified process specification leads to a state where all observations are satisfied. The behaviour implied by the process specification needs to be modelled only to the extent where it is consistent with the constraints given by the test case. A conflict is derived when a consistent model cannot be found. Using (incremental) dependency tracking, a conflict set can be derived while the infeasible trace is constructed.
Definition 1 (Debugging Problem) A BPEL debugging problem DP is a tuple ⟨BP, TC, COMP⟩ where BP denotes the BPEL specification to be debugged, TC denotes a set of test case specifications, and COMP represents the set of diagnosis components in terms of elements of BP.
BP can be seen as a template describing all possible executions, where the elements that appear in COMP may be substituted with holes representing arbitrary behaviour, corresponding to fault assumptions made by the diagnosis engine. Typically, elements of COMP would be activities, transition conditions, or even entire scopes. For our purpose, arbitrary behaviour is defined as assigning unspecified values to visible variables (each variable is visible in the scope where it is defined and in all nested scopes), continuing normally, sending or receiving messages, or throwing an exception. TC describes the state of the environment and the inputs provided to BP, together with the expected result and any other constraints that every valid execution must satisfy. In the following we limit our attention to test specifications which fail at runtime. The integration of passing tests in this framework can be done by altering the likelihood of certain faults, given the observed correct runs [Jones et al., 2002].
3.3 Activation Model
To simplify modelling, we abstract from the concrete syntax of BPEL and represent the process specification as nested graphs. The core of BPEL is built around the notion of execution scopes. A scope contains a process specification, together with a fault handler and a compensation handler. In the simplest form, a scope contains a single activity. Dependencies between scopes are modelled through links, which define a partial execution order. The execution of a scope can be interrupted either internally, through an uncaught exception in one of the nested scopes, or externally, when a nested scope is terminated due to an exception in an enclosing scope. When a scope S is terminated, all scopes enclosed within S are terminated immediately, possibly executing fault and compensation handlers. Fault handlers are also represented as activities in a scope, which become active as soon as an activity in the scope raises an exception. Compensation handlers are embedded within the enclosing scope's fault handler.
Definition 2 (Scope) A scope S is a tuple ⟨N, O, V, F, C⟩ where N denotes the set of actions and nested scopes directly contained in S, O ⊆ N × N denotes the partial execution order between elements in N, V denotes the set of variables visible in S and all contained elements, F the fault handler, and C the compensation handler. For all nested scopes N' ∈ N, S is the parent of N'.
For simplicity, we assume that there is only one fault handler in each scope. Note that our notion of scope does not totally agree with the standard BPEL definition. To keep our model elements simple, we model join conditions of activities and transition conditions as separate actions j without side-effects, apart from determining the activation of the component for which j is specified. In the following, we assume that this transformation has been applied and omit conditions from actions and links. O specifies a partial execution order in the sense that s may be considered for execution only if, for all links ⟨s_i, s⟩ ∈ O, the status of all s_i has been determined (see below).
For each scope S, we generate a number of artificial variables in the enclosing scope that represent the current status of S (+ denotes "yes", − denotes "no"):
S_active = + if the scope is ready for execution, − otherwise.
S_done = + if S has finished executing, − otherwise.
S_abort = + if the scope containing S forces S to abort (due to a failure in a sibling or parent of S), − otherwise.
S_exc = + if S has thrown an exception, − otherwise.
S_active and S_abort must be defined before S is executed, while S_done and S_exc are defined only after S has completed execution. The reason for having two separate variables S_active and S_abort is that the execution engine must be able to distinguish the case where S was interrupted while executing from the case where S was never executed because of earlier termination of the enclosing scope. Simply setting S_active = − would not work, because that would lead to a spurious contradiction where S had already been active.
A snapshot E of the current state of execution of BP at any time can be obtained by capturing the values of all variables in all active scopes. Whenever a scope S is scheduled for execution, the variables defined in S are instantiated in E (with initial value ⊥, "uninitialised") and removed again after S has completed. Activation variables are handled specially in that S_active = S'_active, where S' is the parent scope of S. The activation variables associated with each scope identify the current progress of execution. This corresponds roughly to a variable environment found in traditional programming languages. The difference is that here the program counter found in sequential execution is encoded in the status variables.
As mentioned previously, a conflict has been derived if no feasible execution satisfying all constraints in BP ∪ TC can be found:

Definition 3 (Conflict) A set {c_1, ..., c_k} ⊆ COMP is a conflict set for DP if AB_{c_1} ∘ ... ∘ AB_{c_k}(BP) = ⊥, where ⊥ denotes impossible behaviour, the AB_{c_i} denote mutation functions which modify BP to reflect the abnormality of the source expressions in BP corresponding to c_i, and ∘ denotes function composition.
In contrast to conventional execution, the presence of fault assumptions precludes us from assuming that every execution is deterministic. Instead, the execution engine must be able to follow multiple paths even if every concrete execution of BP was deterministic. For example, it may be necessary to follow multiple paths in a flow construct even in case the transition expressions are complementary, simply because, with some variable values unknown, the expressions do not evaluate to a unique value. Another example arises if the receive activity in our running example is assumed abnormal. Then, the value of amount is undefined and the abstract execution engine must analyse both paths.
The basic debugging algorithm is outlined as follows: the model is simulated using the test specifications in TC and conflicts are computed from failing test runs. From conflicts, possible explanations are derived, each corresponding to an altered version of the BPEL process, where the precise behaviour of the process is replaced with a loosely constrained one. Each of the candidates is subsequently simulated again to eliminate impossible explanations.
3.4 Data Model
To be able to carry some information even in the case where precise values cannot be derived, the Abstract Interpretation [Cousot and Cousot, 1977] framework provides us with the means to predict values. Abstract Interpretation works by substituting the precise effects of each language element with an approximation thereof, which operates over a (finite) abstract domain AD. Thus, an approximation of the true behaviour can be computed in a finite amount of time. For our purposes, a simple abstraction which either predicts a concrete value or does not predict any value for a BPEL variable is sufficient. For the status variables, a power set (or interval) abstraction is used. This does not impede efficiency, as the domain of these variables is small.
The behaviour of a simple scope S containing only a normal action A reflects the effects as defined in BP and the BPEL language specification, operating on the abstract domain AD. For example, the expression amount < 10000 evaluates to true or false in a snapshot E if E(amount) is a concrete value. Otherwise, the result is undefined. In contrast to the simulation of programming languages like Java, the descriptions of the BPEL activities are not detailed enough to actually predict new values given the input values. Instead, we must rely on the values derived through the actual execution TE of the test case. However, those values are only guaranteed to be valid in executions where the involved variables V_i and peer processes P_i exhibit the same state as in TE. Otherwise, no values can be predicted. If this guarantee is available, the values of V_i can be directly compared with the value in E. For every peer P_i, it is necessary to track the messages sent and received to and from P_i to ensure the results are the same as the ones obtained in TE. This can be done by introducing an additional variable MS_i for each P_i which acts as an index into the message sequence MS_i^TE that was obtained from TE. The value of MS_i is updated whenever a send or receive action involving P_i is executed.
MS_i is incremented if the sent or received message matches the next one in MS_i^TE, and becomes undefined otherwise. While this value predictor is quite weak, it can still be effective for limiting the search space of feasible executions in case the fault assumptions are near the end of the execution, or if the messages passed between BP and P_i are independent from messages to another P_k.
3.5 Complex Behaviour
The forward behaviour of an individual action A is formalised as a transfer function, taking a process snapshot E as input and computing a set of new process snapshots E' as result, which reflect the possible effects of A in E. Typically, E' contains a single snapshot, except for actions which may trigger exceptions. In this case, E' contains two snapshots: one reflecting the normal path, while the other corresponds to BP following the exception path. In case A is assumed abnormal, a generic snapshot containing unspecified values for the set V_A of BPEL variables accessible at A is returned, as well as an exception snapshot. For example, consider the assign X := Y+1 activity, abbreviated A (for brevity, we abstract from the concrete XML and XPath representation of the actual BPEL description):
[[A]](E) =
  { E[X ↦ E(Y)+1, A_done ↦ +] }    if E ⊨ A_active = +, + ∉ A_abort, + ∉ A_done, and ¬AB(A)
  { E[V_A ↦ ⊥], E[S_exc ↦ +, S_abort ↦ +] }    if E ⊨ AB(A)
  { E }    otherwise
where E[X ↦ E(Y)+1] denotes the substitution of a new value for X in E, V_A denotes the set of all variables accessible at A, and ⊥ denotes the unknown value. The first clause specifies the normal behaviour if A is active and ready for execution; the second clause specifies the abnormal behaviour. The last clause catches all cases where either A is not ready for execution, or E is infeasible (i.e., that path can never be realised).
Similar to [[A]](E), the abstract effects of each activity in BP can be derived from the BPEL semantics and the structure of BP. Each scope S in BP must respect flow constraints between its activation variables: S can complete normally iff S was active and was not aborted (internally or from the outside):
S_active = − ∨ S_abort = + ∨ S_exc = + ⟹ S_done = −
S_active = + ∧ S_abort = − ∧ S_exc = − ⟹ S_done = +
An activity may be activated if the status of all preceding activities in the partial execution order O has been determined and at least one of the activities has completed normally:
S_active =
  S'_active    if S has no predecessor in O and S' is the parent of S
  +    if S''_done ≠ ⊥ for all ⟨S'', S⟩ ∈ O and S''_done = + for some ⟨S'', S⟩ ∈ O
  −    if S''_done = − for all ⟨S'', S⟩ ∈ O
  ⊥    otherwise
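As an illustration, a minimal Python sketch of the transfer function for the assign activity above, with snapshots represented as dictionaries; the representation and helper names are ours, and None plays the role of the unknown value ⊥:

def assign_transfer(E, abnormal, V_A):
    """Successor snapshots of 'X := Y + 1' (activity A) in snapshot E.
    V_A is the set of variable names accessible at A."""
    if abnormal:
        # Abnormal A: a generic snapshot with all accessible variables
        # unknown, plus an exception snapshot for the enclosing scope S.
        normal = {**E, **{v: None for v in V_A}}
        exception = {**E, "S_exc": "+", "S_abort": "+"}
        return [normal, exception]
    ready = (E.get("A_active") == "+" and E.get("A_abort") != "+"
             and E.get("A_done") != "+")
    if ready:
        y = E.get("Y")
        x = None if y is None else y + 1     # abstract arithmetic over AD
        return [{**E, "X": x, "A_done": "+"}]
    return [E]    # A not ready for execution, or E infeasible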

Some BPEL constructs may enforce additional constraints. For example, the pick action enforces that only one of the successor transitions may be active at any given time. In practice, the values of some of the status variables may not be known precisely, and over-approximation must be performed. (This is the reason for the rather unusual notation + ∉ A_abort above: A_abort may have a value v ⊆ {+, −, ⊥}.)
To compute the effects of a scope S consisting of multiple activities, a worklist algorithm is applied to compute the possible result process snapshots. Starting with the initial process snapshot E, all nested scopes S' ∈ S are chosen such that + ∈ E(S'_active), and new snapshots E' = [[S']](E) are computed for all S'. All E'' ∈ E' replace E in the worklist. For cyclic control flow, for example loops, a fixpoint algorithm is applied. This is guaranteed to terminate, as the abstract domain for each variable is finite. To account for external terminate events, all snapshots after each [[S']] is applied must be combined together to simulate termination of the scope at any point in the execution. This provides a safe, but coarse over-approximation of the true behaviour. It is refined later in case it is discovered subsequently that the scope cannot be interrupted externally. To improve precision slightly, grouping of snapshots is done according to the values of S_abort and S_exc, keeping the paths that terminate early separate from those that complete normally.
The purpose of the initial forward analysis is to determine if each scope may be terminated either internally or externally. The model is then refined in subsequent passes to take that information into account and eliminate spurious paths. If an internal terminate event T, such as a throw activity, is encountered, S_abort is set to + for all scopes S with S_done ≠ + that do not strictly precede T in the partial execution order O. Once the scope S has been analysed, S_done is set to + in the snapshot corresponding to the normal completion of S, and to − in the exception case. The snapshots corresponding to normal completion and exception exits are then propagated to the parent of S for further processing. In case the propagated snapshot has S_exc set, the fault handler becomes active. In case the exception is not re-thrown to an outer scope, S_exc and S_done are both set to −.
3.6 Observations
To incorporate observations about the state of processes in BP or TC, the values of observed variables in the snapshots corresponding to the observation location are intersected with the observed values, potentially leading to a conflict. Execution paths not satisfying these values are eliminated from the model by applying a procedure similar to the one described in Section 3.5 in the backward direction, intersecting the snapshots obtained through backward reasoning with the values derived from forward reasoning [Bourdoncle, 1993]. In case ⊥ is derived, the entire path is eliminated from the model and the analysis resumes with a different branch, until no more branches can be eliminated. A conflict has been derived if there is no feasible execution from the start node in the model to the final state specified in TC. The forward and backward analyses are repeated until a fixpoint (or a specified limit of iterations) has been reached.
After the backward phase, the subsequent forward analysis may transform the model to obtain a more precise abstraction. (With the exception of the concurrency-related improvements, the transformations are essentially those described in [Mayer and Stumptner, 2004].) In case S_abort = − for a scope S, it is known that no external termination event is received.
Therefore, the coarse over-approximation obtained by combining the execution snapshots after every activity is not necessary, leading to a better approximation. Similarly, if S_exc = −, the fault handlers of the scopes containing S need not be considered. For cyclic control structures, if the snapshot before the cycle is inconsistent with the snapshot after the cycle, the cycle must execute at least once to obtain a consistent path. This unrolling can be done to adaptively refine a model and either improve the approximation of the true behaviour, or eliminate the entire path. Values propagated from the observations may contradict certain concurrent execution schedules for interfering components. The corresponding join operations of snapshots can then be eliminated.
3.7 Communication Requirements
So far, only the values monitored during the execution of TC and the observations specified in TC have been considered for modelling and diagnosis. With the weak prediction scheme described in the previous section, this is not sufficient to obtain a sufficiently small number of diagnosis candidates. To eliminate spurious candidates, we make use of interactions recorded through the initial execution of TC and/or specified in TC. For our purposes, we assume that the messages passed between BP and a peer process P_i are known and that their sequence MS_i^TE is fixed and correct. While this assumption may seem strong for general processes, in the context of a single BP run it seems reasonable if it is assumed that the peer processes would reply with an error message in case their protocol is not followed properly. As mentioned before, we assume that a newly developed process description can rely on the other descriptions being correct, which is not unreasonable if one considers that only services that have proved to be reliable would be used to develop a new process.
In addition to tracking values, we introduce separate variables seen_i to keep track of the progress of sent and received messages for each process P_i. As described previously, the progress can be represented as an index (or as an interval in case the value is not known precisely) into the message sequence for pre-specified sequences in TC. For processes where message sequences are not specified, we assume that for every message that is sent there must be a matching receiving process, and vice versa. Thus, no messages should be lost, nor should any process be caught in a waiting state forever. (Extensions incorporating timeouts and timed alarms are left for future work.)
Keeping track of the messages allows us to exclude from our analysis paths and concurrent execution schedules that would not satisfy the observed message sequences, and to prune the search space early. This is possible in particular if the interaction protocols include different method types in
subsequent steps, as misaligned send or receive operations can be detected early rather than at the end of the analysis.
3.8 Diagnosing an Example
Continuing the Loan Assessment example, assume we have a test case that specifies the following sequence of operations for a hypothetical client:
1. A loan application for $100 is requested.
2. The answer is received that the application was approved.
3. The client requests to withdraw $5000.
4. The request is denied.
Assume further that the test case also expects the risk assessment to be cancelled because of the failed withdrawal attempt. For this purpose, RA provides an optional WSDL port that invokes RA's compensation handler. Further, we know that the interfaces of LA and RA differ, i.e., the WSDL port types are different. We also observe that the loan assessor is not used. The set COMP contains all activities A, ..., K in BP. In the following we do not follow the exact forward and backward reasoning procedure, but short-cut the computation as suits the illustration.
When the test case is run, we observe that the client receives the expected messages, but the credit assessment is not rolled back (RA::risk remains set). Therefore, we find that the execution does not satisfy the test specification and a contradiction has been found.
Starting the diagnosis process, assume the invoke activity B communicating with RA is abnormal. Therefore, after starting the analyser, the initial snapshot E_0 contains the variables of FI: E_0(message) = ⊥, E_0(accept) = ⊥, E_0(amount) = ⊥, E_0(A_active) = +, and E_0(A_abort) = −. Also, no messages have been sent or received to and from any peer: E_0(seen_i) = 0 for i ∈ {RA, LA, Cl}. A is assumed normal. The initial client request is received and, because the message history is the same as for the test execution, the values must be the same. Therefore, E_1 = E_0[amount ↦ 100]. The (normal) transition conditions are evaluated and it is determined that E_1(B_active) = +. As there is no other active activity, B is selected for execution. As B is assumed abnormal, neither the communication events nor the variable assignments need to occur as in the original execution. We obtain two possible successors, one for the normal completion/fail-stop case:
E_2 = E_1[X ↦ ⊥, B_done ↦ +, B_exc ↦ −, seen_i ↦ ⊥, seen_Cl ↦ [1, 4]],
and one for the exception case:
E_2' = E_1[X ↦ ⊥, B_done ↦ −, B_exc ↦ +, seen_i ↦ ⊥, seen_Cl ↦ 1],
with X = {message, accept, amount} and i ∈ {RA, LA}.
On analysing the exception case E_2', it is determined that, because the scope does not contain a fault handler, it is aborted. This case contradicts the expected behaviour and E_2' is eliminated. Therefore we continue with E_2. It is determined that, because the value of E_2(risk) is not known, both paths are feasible, implying C_active = ⊥ and D_active = ⊥. Because the Loan Assessor is not to be used, i.e., MS_LA is empty, activity C must find a different partner. As RA and LA offer incompatible interfaces, RA cannot be a target. Assume that a pair of client ports suits the interface description: seen_Cl = [1, 3] (the first message in the client sequence was consumed by the initial receive and the next message must be a send operation). Continuing the propagation, D is executed and assigns the value yes to accept. Because both C and D could possibly be active at the same time, the result must be combined into a single result E_3 and the preceding process repeated until a fixpoint is reached.
This leads to an environment where the values of risk and accept are unknown, the seen_i variables have value [1, 4], and activity E is activated. As E is not abnormal, the message matches either the second or the last client message, resulting in E_4(seen_Cl) = [2, 4] and the nested scope being active. As the only active operation at this time is F, we must find a message in the sequence that matches a receive operation. Here, only the third message is valid and we obtain seen_Cl = 3. After the forward analysis has completed, a backward pass is done and seen_Cl = 3 is propagated backwards to E, which at this point must match the second client message. Propagating this further up to C and D, it is derived that there is no valid message left to choose for C; thus the path is infeasible and is removed. Continuing propagation with D and subsequently B, we derive that the abnormal B must account for the two messages exchanged with RA. As the communication history for the two messages is the same as in the failing test run, the same result must be derived and we obtain that RA::risk = low. Again, this is a contradiction with the test specification and, therefore, the abnormal B cannot explain the behaviour of the test run. Continuing with the remaining components, we obtain four single-fault diagnoses: either I or J may be faulty in that they should issue a message to RA invoking the compensation handler, or F or H should abort in addition to performing their normal actions.
4 Related work
Our work complements the work in [Ardissono et al., 2005], which concentrates on the definition of decentralised interactions between individual diagnosers for each Web service in a cooperative orchestration, but does not examine the individual diagnosis engines or their communication strategies in detail. An earlier approach dealt with generic component-based software systems while considering the individual components as black boxes [Grosclaude, 2004]. We focus on the actual orchestration mechanisms, using BPEL4WS as an exemplary choreography language. Considerable effort has been spent on the use of semantic service descriptions for more or less automated composition of services, e.g., [Narayanan and McIlraith, 2002; Pistore et al., 2005], but so far, detailed error handling has only been addressed at the level of fault handler implementation [Chafle et al., 2005], or, at best, verification [Kazhamiakin and Pistore, 2005].
5 Conclusion
In this paper we consider the task of diagnosing cooperative business processes specified in BPEL4WS, building on and extending earlier work on imperative languages [Mayer and
Stumptner, 2002; 2004], while incorporating BPEL-specific concerns such as concurrency and nondeterminism. We focus on the development of the top-level process itself, assuming that pre-existing remote services accessed in the collaboration are less likely to be the source of the fault than their specific usage and the choreography itself (although the individual component services are amenable to being examined in a hierarchic diagnosis process). This work only scratches the surface, with many aspects not yet considered, such as timeouts and timed alarms, deeper analysis of data structures (XML trees that are accessed using XPath; this would be subject to earlier work on debugging in functional languages), and the vast space of possibilities opened up by the incorporation of Semantic Web service specifications in OWL-S or other formalisms. These could be incorporated in the diagnosis process in similar fashion to pre- and postconditions in Java debugging [Stumptner, 2005]. Ultimately, reconfiguration and planning could be incorporated to effect Web Service repairs.
References
[Alonso et al., 2004] Gustavo Alonso, Fabio Casati, Harumi Kuno, and Vijay Machiraju. Web Services. Springer-Verlag, 2004.
[Ardissono et al., 2005] L. Ardissono, L. Console, A. Goy, G. Petrone, C. Picardi, M. Segnan, and D. Theseider Dupre. Cooperative model-based diagnosis of web services. In Proceedings of the Sixteenth International Workshop on Principles of Diagnosis, Monterey, June 2005.
[Bourdoncle, 1993] François Bourdoncle. Abstract debugging of higher-order imperative languages. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, pages 46-55, 1993.
[Chafle et al., 2005] Girish Chafle, Sunil Chandra, Pankaj Kankar, and Vijay Mann. Handling faults in decentralized orchestration of composite web services. In Proc. ICSOC, Amsterdam, 2005. Springer-Verlag.
[Christensen et al., 2001] Erik Christensen, Francisco Curbera, Greg Meredith, and Sanjiva Weerawarana. Web Services Description Language (WSDL) 1.1. Technical report, World Wide Web Consortium, March 2001.
[Corbett, 1998] James C. Corbett. Using shape analysis to reduce finite-state models of concurrent Java programs. Technical report, Department of Information and Computer Science, University of Hawaii, 1998.
[Cousot and Cousot, 1977] Patrick Cousot and Radhia Cousot. Abstract interpretation: A unified lattice model for static analysis of programs by construction of approximation of fixpoints. In POPL'77, Los Angeles, 1977.
[Curbera et al., 2003] Francisco Curbera, Yaron Goland, Johannes Klein, Frank Leymann, Dieter Roller, Satish Thatte, and Sanjiva Weerawarana. Business Process Execution Language for Web Services 1.1. Technical report, IBM and Microsoft, May 2003.
[Darwiche, 1999] Adnan Darwiche. Compiling knowledge into decomposable negation normal form. In Proc. 16th IJCAI, 1999.
[Fröhlich and Nejdl, 1997] Peter Fröhlich and Wolfgang Nejdl. A Static Model-Based Engine for Model-Based Reasoning. In Proceedings 15th International Joint Conf. on Artificial Intelligence, Nagoya, Japan, August 1997.
[Grosclaude, 2004] Irene Grosclaude. Model-based monitoring of component-based software systems. In Proceedings of the Fifteenth International Workshop on Principles of Diagnosis, Carcassonne, June 2004.
[Jones et al., 2002] James A. Jones, Mary Jean Harrold, and John Stasko. Visualization of test information to assist fault localization.
In Proceedings of the 24th International Conference on Software Engineering, Zurich, Switzerland, September 2002.
[Kazhamiakin and Pistore, 2005] Raman Kazhamiakin and Marco Pistore. A parametric communication model for the verification of BPEL4WS compositions. In EPEW/WS-FM, LNCS, Versailles, 2005. Springer-Verlag.
[Mayer and Stumptner, 2002] Wolfgang Mayer and Markus Stumptner. Modeling programs with unstructured control flow for debugging. In Proc. 15th Australian Joint Conf. on AI, Canberra, December 2002. Springer-Verlag.
[Mayer and Stumptner, 2004] Wolfgang Mayer and Markus Stumptner. Debugging program loops using approximate modeling. In Proc. ECAI, Zaragoza, August 2004.
[Narayanan and McIlraith, 2002] Srini Narayanan and Sheila A. McIlraith. Simulation, verification and automated composition of web services. In Proceedings of the 11th International Conference on the World Wide Web. ACM Press, 2002.
[owl, 2004] OWL-S, 2004.
[Pistore et al., 2005] Marco Pistore, Paolo Traverso, Piergiorgio Bertoli, and A. Marconi. Automated synthesis of composite BPEL4WS web services. In ICWS, Orlando, 2005.
[Stumptner and Wotawa, 2000] Markus Stumptner and Franz Wotawa. Using Model-Based Reasoning for Locating Faults in VHDL Designs. Künstliche Intelligenz, 14(4):62-67, 2000.
[Stumptner, 2001] Markus Stumptner. Using design information to identify structural software faults. In Proc. 14th Australian Joint Conf. on AI, Springer LNAI 2256, Adelaide, December 2001.
[Stumptner, 2005] Markus Stumptner. Web service composition. In 4th International Conference on Information Systems Technology and its Applications (ISTA'05), Palmerston North, NZ, May 2005. GI - Gesellschaft für Informatik.

Observer Gain Effect in Linear Interval Observer-Based Fault Isolation
Jordi Meseguer, Vicenç Puig, Teresa Escobet, Joseba Quevedo
Automatic Control Department - Campus de Terrassa, Universidad Politécnica de Cataluña (UPC), Rambla Sant Nebridi, Terrassa (Spain)
vicenc.puig@upc.edu
Abstract
In fault diagnosis, the integration between the model-based fault detection and isolation modules plays a significant role. Model-based fault detection methods have inherent problems, such as the lack of fault indication persistence, noise sensitivity or model errors, which prevent fault detection from being as good as required. Consequently, the fault isolation module may be confused when the associated residual must be evaluated together with a set of other residuals. In the present work, interval observers are used in the fault detection module. Then, the effect of the observer gain on the fault indication persistence is recalled and extended to the fault isolation task. A case study based on real data taken from the limnimeters of Barcelona's urban sewer system is used to illustrate this effect.
Keywords: Fault detection, Fault isolation, Robustness, Intervals, Observers.
1 Introduction
In recent years, the integration between fault detection and fault isolation tasks in model-based fault diagnosis has been a very active research area (see, among others, [Combastel et al., 2003], [Pulido et al., 2005] or [Puig et al., 2005]). The typical binary interface between these two modules has been improved using additional information. One example is the algorithm presented in [Pulido et al., 2005], where the following aspects are taken into account: residual sign, fault residual sensitivity and fault order. The core of this fault isolation algorithm for the non-uncertain system case was proposed in [Puig et al., 2005]. However, the model-based fault detection task is still the Achilles' heel, since noise sensitivity, model errors and the lack of fault indication persistence may cause fault detection not to be as good as needed. Consequently, the fault isolation module may be confused, especially in case the associated residual set is not diagonal with respect to the faults, making it necessary that a subset of residuals is active at the same time instant to isolate a given fault.
When using observer-based fault detection methods, the observer gain plays an important role because it determines the time evolution of the residuals, their sensitivity to a fault and therefore the minimum detectable fault at any time instant [Chen and Patton, 1999]. Thus, the fault indication persistence depends on the observer gain [Meseguer et al., 2006], and according to this observer gain effect, a fault can be permanently detected, non-permanently detected or non-detected [Meseguer et al., 2006]. The above mentioned fault detection problems and their influence on the fault isolation module have already been noticed by [Combastel et al., 2003], who suggest registering the maximum residual value once reached. However, this strategy introduces an additional problem, since it is then not possible to know when the fault disappears. Another approach would be to use structured residuals in a diagonal form [Gertler, 1998], but this might be too complicated in this case because of the parameter uncertainty. When using an interval observer method, the effect of those fault detection problems might be partially avoided by properly designing the observer gain matrix, and therefore the fault isolation result might also be improved.
The goal of this work is to show how, for a given fault scenario, different fault isolation results are obtained depending on the observer gain, in spite of using an accurate fault isolation algorithm: right persistent fault isolation, right non-persistent fault isolation, wrong fault isolation, or lack of fault isolation. The interval observer-based fault diagnosis algorithm will be applied to the limnimeters of Barcelona's urban sewer system.
Regarding the structure of the remainder of the paper, passive robust fault detection using interval observers is presented in Section 2. In Section 3, the integration of robust interval observer methods with the fault isolation algorithm is discussed. In Section 4, for a given limnimeter fault scenario, several observer gain sets are used in order to show their influence on the resulting fault isolation. Finally, in Section 5, the main conclusions are presented.

2 Passive robust fault detection using interval observers
2.1 Input/output interval observer expression
Considering that the system to be monitored can be described by a MIMO linear uncertain dynamic model in discrete time, its input-output relationship, without faults, disturbances and noise, is
y(k) = M(q^{-1}, \theta) u(k)    (1)
where M(q^{-1}, \theta) is the transfer function matrix expressed using the shift operator q^{-1}, and \theta \in \Theta is a set of interval-bounded parameters representing the model uncertainty: \Theta = \{\theta \in R^p \mid \underline{\theta} \le \theta \le \overline{\theta}\}. This type of model is known as an interval model. Instead of using directly the model of the system given by (1) to detect faults, the following observer form will be used:
\hat{x}(k+1) = (A(\theta) - W C(\theta)) \hat{x}(k) + B(\theta) u(k) + W y(k)
\hat{y}(k) = C(\theta) \hat{x}(k)    (2)
where the matrices A(\theta), B(\theta) and C(\theta) are obtained from the system model (1) using its observer canonical state-space form, and W is the observer gain, designed to stabilise the matrix A_0 = A(\theta) - W C(\theta) and to guarantee a desired performance for all \theta \in \Theta regarding fault detection, while avoiding the wrapping effect. The effect of the uncertain parameters \theta on the observer temporal response will be bounded using an interval [\underline{\hat{y}}_i(k), \overline{\hat{y}}_i(k)], where
\underline{\hat{y}}_i(k) = \min_{\theta \in \Theta} \hat{y}_i(k, \theta)   and   \overline{\hat{y}}_i(k) = \max_{\theta \in \Theta} \hat{y}_i(k, \theta)
Such an interval can be computed using the algorithm presented in [Puig et al., 2003]. The observer given by (2) can also be described by the following input-output relationship for each output:
\hat{y}_i(k) = \sum_{v=1}^{n_{y_i}} a_{iv} \hat{y}_i(k-v) + \sum_{j=1}^{n_u} \sum_{v=1}^{n_{u_{ij}}} b_{ijv} u_j(k-v) + \sum_{v=1}^{n_{y_i}} w_{iv} (y_i(k-v) - \hat{y}_i(k-v))    (3)
with \theta = (a_{i1}, \ldots, a_{i n_{y_i}}, b_{ij1}, \ldots, b_{ij n_{u_{ij}}}) \in \Theta, where y_i(k) is a given measured system output and u_j(k) is a given measured system input. Moreover, n_{y_i} determines the model order associated with the system output i, n_u is the number of system inputs and n_{u_{ij}} is the needed number of past values of the input j regarding the output i. When there is no fault, each of the system outputs verifies:
y_i(k) \in [\underline{\hat{y}}_i(k), \overline{\hat{y}}_i(k)]    (4)
Equivalently, this observer can be expressed in transfer function form using the shift operator q^{-1} and assuming zero initial conditions as:
\hat{y}_i(k) = \sum_{j=1}^{n_u} \frac{G_{ij}(q^{-1})}{H_i(q^{-1}) + W_i(q^{-1})} u_j(k) + \frac{W_i(q^{-1})}{H_i(q^{-1}) + W_i(q^{-1})} y_i(k)    (5)
where G_{ij}(q^{-1}) = \sum_{v=1}^{n_{u_{ij}}} b_{ijv} q^{-v}, H_i(q^{-1}) = 1 - \sum_{v=1}^{n_{y_i}} a_{iv} q^{-v}, and W_i(q^{-1}) = \sum_{v=1}^{n_{y_i}} w_{iv} q^{-v}. In case all observer gains w_{iv} are zero, the observer is in fact a simulator [Chow et al., 1984], but if the condition w_{iv} = a_{iv} is fulfilled, the observer becomes a predictor [Chow et al., 1984].
2.2 Fault detection using an interval observer
Fault detection is based on generating a residual comparing the measurements of physical variables y_i(k) of the process with their estimation \hat{y}_i(k) provided by the associated system observer:
r_i(k) = y_i(k) - \hat{y}_i(k)    (6)
This residual can also be expressed in transfer function form as follows:
r_i(k) = \frac{H_i(q^{-1})}{H_i(q^{-1}) + W_i(q^{-1})} y_i(k) - \sum_{j=1}^{n_u} \frac{G_{ij}(q^{-1})}{H_i(q^{-1}) + W_i(q^{-1})} u_j(k)    (7)
leading to its computational form [Gertler, 1998]. In the used FDI interface, residual (6) is computed with respect to the nominal observer model \hat{y}_i^o(k) obtained using \theta = \theta^o \in \Theta:
r_i^o(k) = y_i(k) - \hat{y}_i^o(k)    (8)
Under ideal conditions, r_i^o(k) should be zero at each time instant k. However, when considering model uncertainty located in the parameters, the residual generated by (8) will not be zero even in a non-faulty scenario.
In this case, the possible values of this residual can be bounded using an interval [Puig et al., 2002]:
r_i^o(k) \in [\underline{r}_i^o(k), \overline{r}_i^o(k)]    (9)
where \underline{r}_i^o(k) = \underline{\hat{y}}_i(k) - \hat{y}_i^o(k) and \overline{r}_i^o(k) = \overline{\hat{y}}_i(k) - \hat{y}_i^o(k). The interval for the residual constitutes an adaptive threshold.
2.3 Degree of residual violation
As proposed in [Puig et al., 2005], the activation value for each residual is calculated as in the DMP approach [Petti et al., 1990] using the Kramer function:
\phi_i(k) = \frac{(r_i^o(k)/\overline{r}_i^o(k))^4}{1 + (r_i^o(k)/\overline{r}_i^o(k))^4}   if r_i^o(k) \ge 0
\phi_i(k) = -\frac{(r_i^o(k)/\underline{r}_i^o(k))^4}{1 + (r_i^o(k)/\underline{r}_i^o(k))^4}   if r_i^o(k) < 0    (10)
In this way, residuals are normalised to a metric between -1 and 1, \phi_i(k) \in [-1, 1], which indicates the degree to which each equation is satisfied: 0 for perfectly satisfied, 1 for severely violated high and -1 for severely violated low.
2.4 Fault residual sensitivity concept
The sensitivity of the residual to a fault [Gertler, 1998] is given by
S_{F_{ij}}(q^{-1}) = \frac{\partial r_i}{\partial f_j} = G_{F_{ij}}(q^{-1})    (11)
where G_{F_{ij}}(q^{-1}) is the transfer function that describes the effect on the residual r_i of a given fault f_j. In the following, Eq. (11) is particularised to an additive output/input sensor fault using a fault sensitivity analysis. This Section focuses on the effect of the observer gain on the time evolution of the fault residual sensitivity, since the rest of the FD observer properties depend on it [Meseguer et al., 2006]. The expression of the residual sensitivity to an additive output sensor fault obtained using Eq. (11) and Eq. (7) is
S_{F_{y_i}}(q^{-1}) = \frac{\partial r_i}{\partial f_{y_i}} = \frac{H_i(q^{-1})}{H_i(q^{-1}) + W_i(q^{-1})}    (12)
This expression shows that the residual sensitivity to an additive output sensor fault is a time function, and how its dynamics and steady-state gain are influenced by the observer gain. Its value at time instant k = 0, i.e., when the fault appears, is
s_{f_{y_i}}(0) = 1    (13)
independently of the observer gain. On the other hand, the steady-state value for an abrupt fault modelled as a unit-step function is given by [Gertler, 1998]
s_{f_{y_i}}(\infty) = \frac{h_i}{h_i + \sum_{v=1}^{n_{y_i}} w_{iv}}    (14)
where h_i = 1 - \sum_{v=1}^{n_{y_i}} a_{iv}. When all observer gains are zero (simulation case), the sensitivity value is 1, independently of the time instant k. However, in case the observer gains satisfy w_{iv} = a_{iv} (prediction case), the steady-state sensitivity is h_i. Besides, if the model is stable and isotonic [Puig et al., 2003], it satisfies
1 > \sum_{v=1}^{n_{y_i}} a_{iv} > 0    (15)
and then s_{f_{y_i}}(\infty) = h_i < 1. In general, Eq. (12) shows that the residual sensitivity steady-state value is inversely proportional to the observer gain. Regarding the sensitivity of the residual to an additive input sensor fault, the same analysis could be done, and similar results would be obtained regarding its observer gain dependence.
2.5 Observer gain influence on fault detection
The residual expression given by Eq. (7) shows the influence of the observer gain on its dynamics and on its steady-state value. On the other hand, Eq. (9) sets the condition a nominal residual must violate in order to indicate a fault, and consequently this condition is fully affected by the used observer gain. Moreover, according to Section 2.4, the fault residual sensitivity is deeply affected by the observer gain. At the fault apparition time instant, simulators, observers and predictors have the same fault sensitivity, but from that time instant on, while simulators keep this good fault detection property, the corresponding one for observers and predictors worsens, the predictor approach being the most deeply affected. That means observers and predictors are losing their aptitude for indicating faults once the fault has occurred. On the other hand, simulators have other serious problems, such as their high initial condition sensitivity or their high model error sensitivity, and consequently this approach is not very suitable for fault detection. Thus, although the fault might not be persistently indicated, the model-based fault detection approach must use observers or predictors, the former being the approach that offers more fault indication persistence, according to the above.
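To illustrate Eqs. (13)-(15) numerically, consider a minimal first-order sketch of recursion (3) in Python (the plant parameters, gains and fault size below are our own illustrative values, not taken from the case study):

# First-order instance of (3): yhat(k) = a*yhat(k-1) + b*u(k-1)
#                                      + w*(y(k-1) - yhat(k-1)).
# w = 0 gives a simulator, w = a a predictor, 0 < w < a an observer.
def run_observer(a, b, w, u, y, yhat0=0.0):
    yhat = [yhat0]
    for k in range(1, len(y)):
        yhat.append(a * yhat[k-1] + b * u[k-1] + w * (y[k-1] - yhat[k-1]))
    return yhat

a, b = 0.8, 1.0
u = [1.0] * 60
y = [0.0]
for k in range(1, 60):                       # fault-free plant output
    y.append(a * y[k-1] + b * u[k-1])
y = [yk + (2.0 if k >= 30 else 0.0) for k, yk in enumerate(y)]  # output fault

for w in (0.0, 0.4, 0.8):                    # simulator, observer, predictor
    r = [yk - yh for yk, yh in zip(y, run_observer(a, b, w, u, y))]
    print(f"w={w}: r(30)={r[30]:.2f}  r(59)={r[59]:.2f}")

All three gains see the full fault size 2.0 at the apparition instant, as in Eq. (13), while in steady state the residual shrinks to 2*h/(h + w) with h = 1 - a = 0.2, as in Eq. (14): 2.0 for the simulator, 0.67 for the observer and 0.40 for the predictor.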
3 Fault isolation algorithm
3.1 Objective
In this Section, the fault isolation algorithm integrated with the interval observer-based fault detection approach presented previously is introduced. This algorithm is based on the one proposed in [Puig et al., 2005], but in order to show clearly the observer gain effect on the fault isolation task, a modification is considered. In the original algorithm, the first component between the fault detection and fault isolation modules is an interface based on a memory that stores, along a time window of length T_w and for each residual, the time instant k_\phi at which the residual has been activated according to (10) (\phi_i(k_\phi) \ge 0.5) and the activation value \phi_{i,max} whose absolute value is maximum. Then, at the end of the considered time window, an isolation result is given based on the information stored in the memory and on a pattern comparison component that is introduced in the following Section. This fault isolation algorithm needs this time window in order to avoid the mentioned fault detection problems, although this opens a new problem: what should the length of this time window be? This approach is known as relative fault isolation.
In the algorithm version considered in this paper, there is no memory component and consequently a fault isolation result is given at every time instant: absolute fault isolation. As already mentioned, this choice is made in order to show the effect of the observer gain on the fault isolation result and how a proper choice of these gains might avoid the fault detection problems.

The efficiency of the absolute approach relies basically on a proper design of the observer gain matrix in order to avoid the mentioned fault detection problems. On the other hand, the efficiency of the relative approach based on predictors relies on determining an optimal time window T_w, since in this case residual activation lasts only a few time instants, as already mentioned. Once a residual has been activated because of the effect of a fault, the length of this time window T_w must be enough so that all residuals affected by this fault remain activated for at least one time instant; otherwise, a wrong isolation result could be obtained. In consequence, T_w depends on the fault residual sensitivity dynamics.
3.2 Pattern comparison component
While at least one of the residuals is activated according to (10) (\phi_i(k) \ge 0.5), the pattern comparison component compares at every time instant the residual activation values given by Eq. (10) with the stored fault patterns. Given a set of residuals r_i^o and the possible faults F = \{f_1, f_2, \ldots, f_m\}, each r_i^o is affected by a subset of the faults f_j \in F. The fault patterns are organised according to a theoretical fault signature matrix, named FSM. An element FSM_{ij} of the matrix contains the pattern if f_j is expected to affect r_i^o; otherwise it is equal to 0. Four different fault signature matrices are considered in the evaluation task: Boolean fault signal activation (FSM01), fault signal signs (FSMsign), fault residual sensitivity (FSMsensit) and, finally, fault signal occurrence order (FSMorder).
FSM01: Evaluation of fault signal appearance
The FSM01 table contains the theoretical binary patterns that faults produce in the residual equations. Those patterns can be codified using the value 0 for no influence and 1 otherwise. factor01_j is calculated for the j-th fault hypothesis in the following way:
factor01_j(k) = \frac{\sum_{i=1}^{n} boolean(\phi_i(k)) \cdot FSM01_{ij}}{\sum_{i=1}^{n} FSM01_{ij}} \cdot zvf_j    (16)
with
boolean(\phi_i(k)) = 0 if \phi_i(k) = 0, and 1 if \phi_i(k) \ne 0    (17)
and the zero-violation factor as
zvf_j = 0 if there is an i \in \{1, \ldots, n\} with FSM01_{ij} = 0 and \phi_i(k) \ne 0, and 1 otherwise    (18)
That leads to the following behaviour: expected fault signals support a fault hypothesis, while unexpected fault signals eliminate it through the zero-violation factor.
FSMsign: Evaluation of fault signal signs
The FSMsign table contains the theoretical sign patterns that faults produce in the residual equations. Those patterns can be codified using the values 0 for no influence and +1/-1 for positive/negative deviation for every FSMsign_{ij}. The factorsign_j is calculated comparing the theoretical signs to the signs of the newly activated residuals:
numsign_j(k) = \max(\sum_{i=1}^{n} checksign(\phi_i(k), FSMsign_{ij}), \sum_{i=1}^{n} checksign(\phi_i(k), -FSMsign_{ij}))    (19)
with
checksign(\phi_i(k), FSMsign_{ij}) = 0 if sign(\phi_i(k)) \ne sign(FSMsign_{ij}), and 1 if sign(\phi_i(k)) = sign(FSMsign_{ij})    (20)
Then:
factorsign_j(k) = \frac{numsign_j(k)}{\sum_{i=1}^{n} |FSMsign_{ij}|} \cdot zvfsign_j    (21)
where the factor zvfsign_j is defined in a similar way as in the case of factor01, excluding those fault hypotheses that have a zero in a position where the fault signal presents a sign.
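A minimal sketch of Eqs. (16)-(18) in Python (phi is the vector of activation values at time k and fsm01_col the j-th column of FSM01; the names are ours):

def factor01(phi, fsm01_col):
    """Eq. (16): support for fault hypothesis j given activation values."""
    # Eq. (18): an active residual that the hypothesis does not explain
    # (FSM01_ij = 0 but phi_i != 0) eliminates the hypothesis.
    zvf = 0 if any(f == 0 and p != 0 for p, f in zip(phi, fsm01_col)) else 1
    # Eq. (17): boolean(phi_i) is 1 iff residual i is activated.
    hits = sum((1 if p != 0 else 0) * f for p, f in zip(phi, fsm01_col))
    return zvf * hits / sum(fsm01_col)

# Hypothesis expected to affect residuals 1 and 2 out of three:
print(factor01([0.8, 0.6, 0.0], [1, 1, 0]))   # 1.0 -> fully supported
print(factor01([0.8, 0.0, 0.7], [1, 1, 0]))   # 0.0 -> eliminated via zvf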
FSMsensit: Evaluation of fault sensitivities
This evaluation component uses the residual activation values φ_i(k) and computes factorsensit using the sensitivity-based FSMsensit table to weight those activation values. This approach can also be found in the DMP-method [Petti et al., 1990]. The following equations describe how to calculate the entries FSMsensit_ij:

FSMsensit_ij = SF_ij if r_i^0(k) ≥ 0; FSMsensit_ij = −SF_ij if r_i^0(k) < 0   (22)

where SF_ij is the sensitivity associated to the nominal residual r_i^0(k) with respect to the fault hypothesis f_j, calculated using Eq. (11). Although the fault residual sensitivity depends on time in the case of a dynamic system, here the steady-state value after a fault occurrence is considered, as was also suggested in [Gertler, 1998]. The value of FSMsensit_ij describes how easily a fault will cause the i-th residual to violate its associated adaptive threshold: the larger the partial derivative of the residual with respect to the fault, the more sensitive that equation is to deviations of the assumption. Similarly, residuals with large detection thresholds are less sensitive, as they are more difficult to violate. Therefore FSMsensit_ij can be used to weight the activation values of the different fault signals:

factorsensit_j(k) = ( Σ_{i=1}^n φ_i(k) · FSMsensit_ij / Σ_{i=1}^n |FSMsensit_ij| ) · zvfsensit_j   (23)

where the factor zvfsensit_j is defined similarly to the case of factor01, excluding those fault hypotheses that have a zero theoretical sensitivity while the fault signal presents a non-zero value.

FSMorder: Evaluation of fault signal occurrence order
In dynamic systems, the symptoms of a fault f_j do not appear at the same time in all residuals. The FSMorder table contains the order of symptom appearance for each fault hypothesis; this order is codified using ordinal numbers, starting with 1. If two fault signals appear at the same time, or if they may explicitly commute their order, then they share the same ordinal number. Fault signals that must not appear get the code 0. factororder_j is calculated by comparing the appearance order of the fault signals to the theoretical order:

factororder_j(k) = ( Σ_{i=1}^n checkorder(φ_i(k), FSMorder_ij) / Σ_{i=1}^n boolean(FSMorder_ij) ) · zvforder_j   (24)

where

checkorder(φ_i(k), FSMorder_ij) = 0 if order(φ_i(k)) ≠ FSMorder_ij; 1 if order(φ_i(k)) = FSMorder_ij   (25)

and order(φ_i(k)) is the order in which the i-th fault signal φ_i(k) has been activated with respect to the first activated one.

Decision logic component
Once the pattern comparison component has evaluated the pattern factors explained above at a given time instant k, the decision logic component gives a fault isolation decision based on those evaluation factors. The decision logic takes into account the most probable fault for each evaluation factor, based on the number of coincidences between the observed pattern and the theoretical one stored in the corresponding FSM matrix. The result gives four measures of the confidence of each fault hypothesis.

3.3 Observer gain influence on fault isolation

In fault isolation, where in most cases a subset of residuals needs to be active at the same time in order to isolate the fault, the lack of fault indication persistence of a residual or its associated activation value (10), together with the fact that the residuals of this subset have different fault sensitivities, may confuse the isolation module. In consequence, different fault isolation results may appear: right persistent fault isolation, right non-persistent fault isolation, wrong fault isolation, or lack of fault isolation. According to Section 2, the fault indication persistence given by a residual or its associated activation value (10) is influenced by the observer gain. The fault isolation result of the algorithm presented in this Section is based on evaluating at every time instant several factors (factor01, factorsign, factorsensit, factororder) computed using the activation values (10) of the corresponding residual fault hypotheses. Consequently, since these activation values are affected by the observer gain, the fault isolation result is also affected by it.

4 Application example

4.1 Application description

The proposed fault diagnosis algorithm has been applied to 13 limnimeters of the Barcelona urban drainage system. In this Section, a given limnimeter fault scenario is analyzed in order to show how different fault isolation results can be obtained depending on the observer gains used. In particular, the following cases can appear: right persistent fault isolation, right non-persistent fault isolation, wrong fault isolation and lack of fault isolation. Limnimeters can be monitored using an on-line rainfall-runoff model of the sewerage network.
One possible methodology to derive a real-time model of this kind is through a simplified graph relating the main sewers and a set of virtual and real reservoirs [Cembrano et al., 2002]. A virtual reservoir is an aggregation of a catchment of the sewer network which approximates the hydraulics of rain, runoff and sewage water retention. Its hydraulics is given by:

dV(t)/dt = Q_up(t) − Q_down(t) + I(t)·S   (26)

where V is the volume of water accumulated in the catchment, Q_up and Q_down are the flows entering and exiting the catchment, I is the rain intensity falling on the catchment and S its surface. Input and output sewer levels are measured using limnimeters and can be related to flows using a linearised Manning relation: Q_up(t) = M_up·L_up(t) and Q_down(t) = M_down·L_down(t). Moreover, it is assumed that Q_down(t) = K_v·V(t). Then, substituting in Eq. (26) and discretising:

L_down(k+1) = a·L_down(k) + b·L_up(k) + c·I(k)   (27)

where a = 1 − K_v·Δt, b = M_up·K_v·Δt / M_down and c = S·K_v·Δt / M_down. Using this modelling methodology, a model of the selected part of Barcelona's sewer network is presented in Fig. 1. Its structure depends on the topology of the network, and its parameters must be estimated using real data from the sensors in the network.
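For illustration, Eq. (27) can be coded directly. The following minimal Python sketch performs one simulation step; all reservoir constants are passed explicitly and would have to be estimated from sensor data, as the text describes.

# One step of the discretised virtual-reservoir model, Eq. (27):
#   L_down(k+1) = a*L_down(k) + b*L_up(k) + c*I(k)

def reservoir_step(L_down, L_up, I, Kv, Mup, Mdown, S, dt):
    a = 1.0 - Kv * dt            # a = 1 - Kv*dt
    b = Mup * Kv * dt / Mdown    # b = Mup*Kv*dt/Mdown
    c = S * Kv * dt / Mdown      # c = S*Kv*dt/Mdown
    return a * L_down + b * L_up + c * I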

[Fig. 1. Virtual reservoir model of the Barcelona prototype network]

This model and the measurements provided by 5 rain gauges and 13 limnimeters allow deriving 12 residuals. Once the structure of the models for each limnimeter has been selected, the intervals for the parameters are determined. Such an interval model is calibrated in order to guarantee that the predicted behaviour interval includes all the non-modelled effects. An algorithm inspired by the one proposed in [Ploix et al., 1999] is used to provide the nominal estimation of the system parameters. Then, using optimization tools, the uncertain parameter intervals of the considered reduced observer are adjusted using a worst-case approach until all the measured data are covered by the prediction interval for the considered observer gain.

4.2 Fault scenarios

The proposed interval observer-based fault diagnosis approach has been tested using a fault scenario affecting sensor L_39, in which its output is zero-valued from time instant k = 150. According to the binary fault signature matrix FSM01, this fault scenario influences only the residuals associated to limnimeters L_39 and L_41, and thus both must be activated in order to isolate the fault. The reduced observer associated to L_39 is given by

L̂_39(k+1) = a_39(1 − w_39)·L̂_39(k) + c_39·I_16(k) + a_39·w_39·L_39(k)   (28)

where w_39 (w_39 = 0, simulation; w_39 = 1, prediction) is the associated observer gain, I_16 is the rain intensity measured by the rain gauge P_16, and a_39 = [0.496, 0.744], c_39 = [0.601, 0.901]. These interval parameter values are valid for the observer gains tested in this paper. The reduced observer associated to L_41 is given by

L̂_41(k+1) = a_41(1 − w_41)·L̂_41(k) + b_41·L_39(k) + c_41·I_16(k) + a_41·w_41·L_41(k)   (29)

where w_41 is the associated observer gain and a_41 = [0.869, 1.063], b_41 = [−0.296, ], c_41 = [0.734, 0.897]. These interval parameter values are likewise valid for the observer gains tested in this paper.

First, the observer gains are set so that a right persistent fault isolation result is obtained. Then, new observer gain values are given so that the fault is almost non-detected. Finally, new values are assigned so that the fault isolation result is non-persistent and partially wrong.
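To make Eq. (28) concrete, the following Python sketch propagates the quoted interval parameters with naive interval arithmetic. The paper's worst-case calibration and adaptive-threshold machinery are not reproduced, so this is only a qualitative illustration.

# Naive interval-arithmetic sketch of the reduced observer, Eq. (28).

def iv_mul(x, y):
    # product of two intervals represented as (lo, hi) tuples
    p = (x[0]*y[0], x[0]*y[1], x[1]*y[0], x[1]*y[1])
    return (min(p), max(p))

def iv_add(x, y):
    return (x[0] + y[0], x[1] + y[1])

def observe_L39(L39_hat, L39_meas, I16, w39,
                a39=(0.496, 0.744), c39=(0.601, 0.901)):
    pred = iv_mul(a39, iv_mul((1 - w39, 1 - w39), L39_hat))  # a39*(1-w39)*Lhat
    corr = iv_mul(a39, (w39 * L39_meas, w39 * L39_meas))     # a39*w39*L39(k)
    rain = iv_mul(c39, (I16, I16))                           # c39*I16(k)
    return iv_add(iv_add(pred, corr), rain)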
a) Persistent fault isolation case
When the observer gains are w_39 = 0.35 and w_41 = 0.4, the fault isolation algorithm indicates L_39 as the faulty sensor from the fault appearance time. The observer gains have been chosen so that no residual is activated in a non-faulty scenario, and their values are as low as possible so that the fault does not fully contaminate the observer model. Thus, once the fault has occurred, the residual violation factors (10) associated to L_39 and L_41 are activated while the fault lasts. In Figure 2, the time evolution of the fault detection test (9) associated to both limnimeters is drawn. This Figure shows how the limnimeter nominal residuals (8) are kept out of their associated adaptive thresholds from the fault appearance time (k = 150), and thus both of them indicate the fault persistently. In Figure 3, the absolute value of the residual violation factors (10) associated to both limnimeters is drawn. In order to avoid the undesired effect of noise, the diagnosis algorithm does not use the instantaneous value of factor (10) but an average of its last values over a given time window, and this is what is plotted in Figure 3. These factors indicate the fault while their absolute value is bigger than 0.5, which occurs a few time instants after the fault has appeared (k = 150), and they remain activated until the end of the scenario. Consequently, as derived from Figure 3, the fault is persistently detected by the interval observers associated to both limnimeters. In Figure 4, the time evolution of the fault isolation result is plotted: factor01 (16), factorsign (21), factorsensit (23) and factororder (24) associated to each fault hypothesis. In this fault scenario and for the used observer gains, only the L_39 indicators are activated, and they remain activated persistently from the fault appearance time until the end of the scenario. Consequently, the fault is clearly isolated by the interval observers for the considered observer gains.

[Fig. 2. Time evolution of the residuals and their adaptive thresholds]
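The noise-filtering average mentioned above can be sketched as follows; the window length is an assumed parameter, since the paper does not state it.

from collections import deque

# Moving average of |phi(k)| used instead of the instantaneous factor (10).

class AveragedActivation:
    def __init__(self, window=10):
        self.buf = deque(maxlen=window)

    def update(self, phi_k):
        self.buf.append(abs(phi_k))
        avg = sum(self.buf) / len(self.buf)
        return avg > 0.5        # fault indicated while the average exceeds 0.5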

[Fig. 3. Evolution of the residual violation degree absolute value]
[Fig. 4. The factor indicators time evolution]

b) Almost non-isolated fault case
In this case, using the same fault scenario, the isolation algorithm can isolate the fault only for very few time instants because the fault detection does not last longer. This is because the interval observers associated to both limnimeters use high observer gain values (w_39 = 0.85 and w_41 = 0.85) and consequently their behavior is quite close to that of predictors: the model predicted values are almost fully contaminated by the fault a few time instants after the fault has occurred. In Figure 5, the time evolution of the fault detection test (9) is plotted, showing that the nominal residuals (8) are kept inside the adaptive thresholds for most time instants once the fault has occurred. Consequently, their residual violation factors (10), whose absolute values are plotted in Figure 6, are activated only for very few time instants. As a result, the fault isolation indicators (factor01 (16), factorsign (21), factorsensit (23), factororder (24)) associated to the L_39 fault hypothesis are hardly activated and therefore the fault can not be isolated. The time evolution of these indicators is plotted in Figure 7.

[Fig. 5. Evolution of the residuals and their adaptive thresholds]
[Fig. 6. Evolution of the residual violation degree absolute value]
[Fig. 7. The factor indicators time evolution]

c) Non-persistent and partially wrong fault isolation case
In this case, using the same fault scenario, the fault is clearly isolated from its occurrence, but only for a time window, because the L_39 residual violation factor (10) does not indicate the fault permanently. Then, the isolation algorithm decreases the values associated to the L_39 isolation indicators and, on the other hand, activates the indicators associated to the L_41 fault hypothesis. This fact could lead to a wrong isolation result from that time instant on. The fault isolation behavior described above is obtained when the L_39 model uses an observer gain quite similar to the one used in case b) while the one corresponding to L_41 is quite similar to the one used in case a) (w_39 = 0.7 and w_41 = 0.5). In Figure 8, the time evolution of the fault detection test (9) is plotted for both limnimeters, while Figure 9 shows the time evolution of the absolute value of the corresponding residual violation factors (10). Both Figures are in line with the behavior described above. In Figure 10, the time evolution of the fault isolation indicators (factor01 (16), factorsign (21), factorsensit (23), factororder (24)) associated to L_39 and L_41 is plotted. This Figure shows how the fault isolation algorithm is confused between the L_39 and L_41 fault hypotheses once the L_39 observer model no longer indicates the fault. In spite of this, the factor factorsensit (23) associated to L_39 continues to have a bigger value than the one corresponding to L_41, and consequently this fault hypothesis might still be the best candidate.
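Continuing the interval-observer sketch given after Eq. (29), the gain effect behind cases a)–c) can be seen by iterating the observer on the stuck-at-zero reading: the higher w_39, the closer the predicted interval is dragged toward the faulty measurement, so the residual shrinks and the indication loses persistence. The constant rain intensity below is an illustrative value, not data from the paper.

# Effect of the gain on the post-fault predicted interval (illustrative).

def interval_after_fault(w39, steps=200):
    L_hat = (1.0, 1.0)                               # pre-fault level estimate
    for _ in range(steps):
        L_hat = observe_L39(L_hat, 0.0, 0.2, w39)    # sensor stuck at zero
    return L_hat

for w in (0.35, 0.70, 0.85):                         # the three gains above
    lo, hi = interval_after_fault(w)
    print(f"w39={w}: predicted interval -> [{lo:.3f}, {hi:.3f}]")
# The smaller the gap between the interval and the zero reading, the less
# persistent the residual violation.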

[Fig. 8. Evolution of the residuals and their adaptive thresholds]
[Fig. 9. Evolution of the residual violation degree absolute value]
[Fig. 10. The factor indicators time evolution]

5 Conclusions

In general, model-based fault detection methods have inherent problems which cause fault detection not to be as good as needed, and therefore the fault isolation module may be confused. In particular, this paper shows that fault isolation results may be very sensitive to the fault indication persistence provided by the fault detection module. It also shows that this lack of persistence can deteriorate the integration between the model-based fault detection and isolation modules. When using interval observers, the fault indication persistence might be improved by properly designing the observer gain matrix, and therefore the fault isolation results might also be improved. As further work, the observer gain influence on the fault isolation module should be studied more quantitatively using different fault types. Thus, the comparison between absolute and relative fault isolation will be discussed in more proper terms, analyzing the relation between T_w and the observer gain. Moreover, a proper observer gain matrix design should be studied in order to avoid fault detection problems, enhancing the fault isolation results.

Acknowledgments

The authors also wish to thank the support received from the Research Commission of the Generalitat de Catalunya (Grup SAC, ref. 2005SGR00537) and from CICYT (ref. DPI) of the Spanish Ministry of Education.

References

[Cembrano et al., 2002] Cembrano, G. et al. Global Control of the Barcelona Sewerage System for Environment Protection. In Proceedings of the IFAC World Congress, Barcelona, 2002.
[Chen and Patton, 1999] Chen, J. and R.J. Patton. Robust Model-Based Fault Diagnosis for Dynamic Systems. Kluwer Academic Publishers, 1999.
[Chow et al., 1984] Chow, E. and Willsky, A. Analytical redundancy and the design of robust failure detection systems. IEEE Transactions on Automatic Control, 29(7), July 1984.
[Combastel et al., 2003] Combastel, C., S. Gentil, and J.P. Rognon. Toward a better integration of residual generation and diagnostic decision. In Proceedings of IFAC Safeprocess'03, Washington, USA, 2003.
[Gertler, 1998] Gertler, J. Fault Detection and Diagnosis in Engineering Systems. M. Dekker, 1998.
[Pulido et al., 2005] Pulido, B., V. Puig, T. Escobet, and J. Quevedo. A new fault localization algorithm that improves the integration between fault detection and localization in dynamic systems. 16th International Workshop on Principles of Diagnosis (DX'05), Monterey, California, USA, June 1-3, 2005.
[Petti et al., 1990] Petti, T.F., J. Klein, and P.S. Dhurjati. Diagnostic model processor: Using deep knowledge for process fault diagnosis. AIChE Journal, vol. 36, 1990.
[Ploix et al., 1999] Ploix, S., Adrot, O. and J. Ragot. Parameter Uncertainty Computation in Static Linear Models. 38th IEEE Conference on Decision and Control, Phoenix, Arizona, USA, 1999.
[Meseguer et al., 2006] Meseguer, J., Puig, V., Escobet, T. Observer gain effect in linear observer-based fault detection. IFAC SAFEPROCESS'06, 2006.
[Puig et al., 2002] Puig, V., Quevedo, J., Escobet, T., De las Heras, S.
Robust Fault Detection Approaches using Interval Models. IFAC World Congress (b'02), Barcelona, Spain, 2002.
[Puig et al., 2003] Puig, V., Saludes, J., Quevedo, J. Worst-Case Simulation of Discrete Linear Time-Invariant Dynamic Systems. Reliable Computing, 9(4), August 2003.
[Puig et al., 2005] Puig, V., J. Quevedo, T. Escobet, and B. Pulido. A New Fault Diagnosis Algorithm that Improves the Integration of Fault Detection and Isolation. In ECC-CDC'05, Sevilla, Spain, 2005.

A Generalization of the GDE Minimal Hitting-Set Algorithm to Handle Behavioral Modes

Mattias Nyberg
Department of Electrical Engineering, Linköping University, Linköping, Sweden

Abstract

A generalization of the minimal hitting-set algorithm given by de Kleer and Williams is presented. The original algorithm handles only one faulty mode per component and only positive conflicts. In contrast, the new algorithm presented here handles more than two modes per component and also non-positive conflicts. The algorithm computes a logical formula that characterizes all diagnoses. Instead of minimal diagnoses, or kernel diagnoses, certain specific conjunctions in the logical formula are used to characterize the diagnoses. These conjunctions are a generalization of both minimal and kernel diagnoses. From the logical formulas, it is also easy to derive the set of preferred diagnoses.

1 Introduction

Within the field of fault diagnosis, it has often been assumed that each component has only two possible behavioral modes, e.g. see [Reiter, 1987; de Kleer and Williams, 1987]. For this case, and given a set of conflict sets, it is well known that a minimal hitting set corresponds to a minimal diagnosis [Reiter, 1987] (Reiter used the word diagnosis for what in this paper is called minimal diagnosis). Algorithms for computing all minimal hitting sets have been presented in [Reiter, 1987; de Kleer and Williams, 1987]. Improvements have later been given in, e.g., [Greiner et al., 1989; Wotawa, 2001]. In [Reiter, 1987; de Kleer and Williams, 1987] it is assumed that a conflict can only imply that some component is faulty. We call this a positive conflict [de Kleer et al., 1992]. If all conflicts are positive, it is also well known that the set of all minimal diagnoses characterizes all diagnoses [de Kleer and Williams, 1987]. This will, for example, be the case if the faulty modes of the components have no fault models. However, if there are fault models, it is possible to have non-positive conflicts implying that some component is fault-free. If there is a desire to compute something that characterizes all diagnoses when there are non-positive conflicts, the concept of minimal hitting sets and the algorithms in [Reiter, 1987; de Kleer and Williams, 1987] can not be used. To solve this, an alternative characterization based on so-called kernel diagnoses was proposed in [de Kleer et al., 1992], where also an algorithm to compute the kernel diagnoses was given. The kernel diagnoses characterize all diagnoses even in the case of non-positive conflicts. It has been noted in several papers that more than two possible behavioral modes are useful for improving the performance of the diagnostic system, see e.g. [Struss and Dressler, 1989; de Kleer and Williams, 1989]. For this case, neither minimal diagnoses nor kernel diagnoses can be used to characterize all diagnoses. Further, none of the algorithms in [Reiter, 1987; de Kleer and Williams, 1987; de Kleer et al., 1992] are applicable. To be able to handle both more than two behavioral modes and non-positive conflicts, the present paper proposes a new characterization of all diagnoses. Conflicts and diagnoses are represented by logical formulas, and instead of minimal diagnoses and kernel diagnoses, we use more general conjunctions on a specific form. In the special case of two behavioral modes per component, these conjunctions become equivalent to kernel diagnoses, and in the case of only positive conflicts, they become equivalent to minimal diagnoses.
Thus, the framework proposed here can be seen as a generalization of both minimal diagnoses and kernel diagnoses. Another contribution is that we show that the minimal hitting-set algorithm given in [de Kleer and Williams, 1987] can in fact be generalized to compute the proposed characterization. Note that, even though the papers [Struss and Dressler, 1989; de Kleer and Williams, 1989] consider more than two behavioral modes per component, they are, in contrast to the present paper, not concerned with the characterization or computation of all diagnoses. Under the assumption of only two behavioral modes per component, the minimal diagnoses can be argued to be the most desired diagnoses. This has been called the parsimony principle, see e.g. [Reiter, 1987]. In the generalized case of more than two behavioral modes, the minimal diagnoses are no longer necessarily the most desired diagnoses. Instead, the concept of preferred diagnoses has been defined in [Dressler and Struss, 1992]. We will in this paper show how to obtain these preferred diagnoses by means of the above-mentioned logical formulas.

The paper is organized as follows. In Section 2, the algorithm from [de Kleer and Williams, 1987] is restated as a reference. In Section 3, the logical framework is presented. Then the generalized version of the algorithm from [de Kleer and Williams, 1987] is given in Section 4. Sections 5 and 6 discuss the relation to minimal and kernel diagnoses. Finally, Section 7 describes how to compute the preferred diagnoses. All proofs of theorems have been placed in an appendix.

2 The Original Algorithm

This section presents the original algorithm and its associated framework as presented in [de Kleer and Williams, 1987]. However, since we have a different objective than the original paper, we will not always use the same notation and naming conventions. The system to be diagnosed is assumed to consist of a number of components represented by a set 𝒞. A conflict is represented as a set C ⊆ 𝒞. The meaning of a conflict C is that not all components in C can be in the normal fault-free mode. Thus only positive conflicts can be handled. A conflict C_1 is said to be minimal if there is no other conflict C_2 such that C_2 ⊂ C_1. A diagnosis δ is also represented as a set δ ⊆ 𝒞. The meaning of a diagnosis δ is that the components contained in δ are faulty and the components not contained in δ are fault-free. A diagnosis δ_1 is said to be minimal if there is no other diagnosis δ_2 such that δ_2 ⊂ δ_1. One fundamental relation between conflicts and diagnoses is that if 𝐂 is the set of all minimal conflicts, δ is a diagnosis if and only if for all conflicts C ∈ 𝐂 it holds that δ ∩ C ≠ ∅. Given a set Δ of diagnoses and a conflict C, the minimal hitting-set algorithm in [de Kleer and Williams, 1987] finds an updated set of minimal diagnoses. A version of the algorithm, as described in the text of [de Kleer and Williams, 1987], can be written as follows.

Algorithm 1
Input: a set Δ of minimal diagnoses, and a conflict set C
Output: the updated set of minimal diagnoses Θ

Δ_old := Δ
forall δ_i ∈ Δ do
  if δ_i ∩ C = ∅ then
    Remove δ_i from Δ_old
    forall c ∈ C do
      δ_new := δ_i ∪ {c}
      forall δ_k ∈ Δ, δ_k ≠ δ_i do
        if δ_k ⊆ δ_new then goto LABEL1
      end
      Δ_add := Δ_add ∪ {δ_new}
      LABEL1
    end
  end
end
Θ := Δ_old ∪ Δ_add

The algorithm has the property that if Δ is the set of all minimal diagnoses, the algorithm output Θ will contain all minimal diagnoses with respect to also the new conflict C. Further, it also holds that Θ will contain only minimal diagnoses. Note that this algorithm does not require the conflict C to be minimal, contrary to what has been stated in [Greiner et al., 1989]. It can also be noted that the loop over δ_k could be modified to δ_k ∈ Δ_old, which would be more efficient since Δ_old is smaller than Δ.

3 A Logical Framework

Each component is assumed to be in exactly one out of several behavioral modes. A behavioral mode can be, for example, no-fault (NF), gain-fault (G), bias (B), open circuit (OC), short circuit (SC), unknown fault (UF), or just faulty (F). For our purposes, each component is abstracted to a variable specifying the behavioral mode of that component. Let 𝒞 denote the set of such variables. For each component variable c, let R_c denote the domain of possible behavioral modes, i.e. c ∈ R_c. We will now define a set of formulas used to express that certain components are in certain behavioral modes. If c is a component variable in the set 𝒞 and M ⊆ R_c, the expression c ∈ M is a formula. For example, if p is a pressure sensor, the formula p ∈ {NF, G, UF} means that the pressure sensor is in mode NF, G, or UF. If M is a singleton, e.g.
M = {NF}, we will sometimes also write p = NF. Further, the constant ⊥, with value false, is a formula. If φ and γ are formulas then φ ∧ γ, φ ∨ γ, and ¬φ are formulas. In accordance with the theory of first-order logic, we say that a formula φ is a semantic consequence of another formula γ, and write γ ⊨ φ, if all assignments of the variables in 𝒞 that make γ true also make φ true. This can be generalized to sets of formulas, i.e. {γ_1, ..., γ_n} ⊨ {φ_1, ..., φ_m} if and only if γ_1 ∧ ... ∧ γ_n ⊨ φ_1 ∧ ... ∧ φ_m. If it holds that Γ ⊨ Φ and Φ ⊨ Γ, where Φ and Γ are formulas or sets of formulas, Φ and Γ are said to be equivalent and we write Γ ≡ Φ. We will devote special interest to conjunctions on the form

c_1 ∈ M_1 ∧ c_2 ∈ M_2 ∧ ... ∧ c_n ∈ M_n   (1)

where all components are unique, i.e. c_i ≠ c_j if i ≠ j, and each M_i is a nonempty proper subset of R_ci, i.e. ∅ ≠ M_i ⊂ R_ci. Let D_i denote a conjunction on the form (1). From a set of such conjunctions we can then form a disjunction

D_1 ∨ D_2 ∨ ... ∨ D_m   (2)

Note that the different conjunctions D_i can contain different numbers of components. We will say that a formula is in maximal normal form (MNF) if it is on the form (2) and has the additional property that no conjunction is a consequence of another conjunction, i.e. for each conjunction D_i, there is no conjunction D_j, j ≠ i, for which it holds that D_j ⊨ D_i. The purpose of using formulas in MNF is that they are relatively compact, in the sense that an MNF formula contains neither redundant conjunctions nor redundant assignments. As an example, consider the following two formulas containing pressure sensors p_1, p_2, and p_3, all with behavioral modes R_pi = {NF, G, B, UF}:

p_1 ∈ {UF} ∧ p_2 ∈ {B, UF} ∨ p_3 ∈ {UF}
p_1 ∈ {UF} ∧ p_2 ∈ {B, UF} ∨ p_1 ∈ {G, UF}

The first formula is in MNF but not the second, since p_1 ∈ {UF} ∧ p_2 ∈ {B, UF} ⊨ p_1 ∈ {G, UF}.
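A natural machine representation of such conjunctions, used here only as an illustrative Python sketch, maps each component to its allowed mode set; components not mentioned implicitly carry their full domain. Entailment between conjunctions then reduces to subset checks (this mirrors the reasoning made precise in Section 4.2):

# Illustrative encoding: a conjunction on the form (1) as a dict from
# component to allowed mode set; absent components carry the full domain.

R = {"p1": {"NF", "G", "B", "UF"},
     "p2": {"NF", "G", "B", "UF"},
     "p3": {"NF", "G", "B", "UF"}}

def entails(D1, D2):
    """D1 |= D2 for conjunctions: for each component of D2, the set that
    D1 allows (its full domain R[c] if absent) must be a subset of D2's."""
    return all(D1.get(c, R[c]) <= M for c, M in D2.items())

# The example above: the first conjunction entails p1 in {G, UF}.
D = {"p1": {"UF"}, "p2": {"B", "UF"}}
print(entails(D, {"p1": {"G", "UF"}}))   # True, so the formula is not in MNF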

3.1 Conflicts and Diagnoses

A conflict is assumed to be written using the logical language defined above. For example, if it has been found that the pressure sensor p_1 can not be in the mode NF at the same time as p_2 is in the mode B or NF, this gives the conflict

H = p_1 ∈ {NF} ∧ p_2 ∈ {B, NF}   (3)

To relate this definition of conflict to the one used in Section 2, consider the conflict C = {a, b, c}. With the logical language, we can write this conflict as a ∈ {NF} ∧ b ∈ {NF} ∧ c ∈ {NF}. Instead of conflicts, we will mostly use negated conflicts, so instead of H we consider ¬H. In particular, we will use negated conflicts written in MNF. For example, the negated conflict ¬H, where H is defined as in (3), can be written in MNF as p_1 ∈ {G, B, UF} ∨ p_2 ∈ {G, UF}. Without loss of generality, we will from now on assume that all negated conflicts are written on the form

c_1 ∈ M_1 ∨ c_2 ∈ M_2 ∨ ... ∨ c_n ∈ M_n   (4)

where c_j ≠ c_k if j ≠ k, and ∅ ≠ M_i ⊂ R_ci. This means that (4) is in MNF. A system behavioral mode is a conjunction containing a unique assignment of all components in 𝒞. For example, if 𝒞 = {p_1, p_2, p_3}, a system behavioral mode could be

p_1 = UF ∧ p_2 = B ∧ p_3 = NF

We use the term diagnosis to refer to a system behavioral mode consistent with all negated conflicts. More formally, if 𝒫 is the set of all negated conflicts, a system behavioral mode d is a diagnosis if {d} ∪ 𝒫 ⊭ ⊥ or, equivalently, d ⊨ 𝒫. To relate this definition of diagnosis to the one used in Section 2, assume that 𝒞 = {a, b, c, d} and consider the diagnosis δ = {a, b}. With the logical language, we can write this diagnosis as a = F ∧ b = F ∧ c = NF ∧ d = NF.

4 The Generalized Algorithm

With only small modifications, the original algorithm stated in Section 2 can be made to work with logical MNF formulas instead of sets. The result is an algorithm that handles more than two behavioral modes per component and also non-positive conflicts. With the modifications, the algorithm takes as inputs a formula D and a negated conflict P, both written in MNF. The purpose of the algorithm is then to derive a new formula Q in MNF such that Q ≡ D ∧ P. The modifications are the following:

- Instead of using a set Δ of minimal diagnoses as input, use a formula D in MNF. Note that D is not restricted to be a disjunction of system behavioral modes, but can be a disjunction of conjunctions on the form (1).
- Instead of using a conflict set C as input, use a negated conflict P on the form (4).
- Instead of checking the condition δ_i ∩ C = ∅, check the condition D_i ⊭ P.
- Instead of the assignment δ_new := δ_i ∪ {c}, find a conjunction D_new in MNF such that D_new ≡ D_i ∧ P_j.
- Instead of checking the condition δ_k ⊆ δ_new, check the condition D_new ⊨ D_k.

In the algorithm we will use the notation D_i ∈ D to denote the fact that D_i is a conjunction in D. The algorithm can now be stated as follows:

Algorithm 2
Input: a formula D in MNF, and a negated conflict P
Output: Q ≡ D ∧ P

D_old := D
forall D_i ∈ D do
  if D_i ⊭ P then
    Remove D_i from D_old
    forall P_j ∈ P do
      Let D_new be a conjunction in MNF such that D_new ≡ D_i ∧ P_j
      forall D_k ∈ D, D_k ≠ D_i do
        if D_new ⊨ D_k then goto LABEL1
      end
      D_add := D_add ∨ D_new
      LABEL1
    end
  end
end
Q := D_old ∨ D_add

To keep the algorithm description clean, some operations have been written in a simplified form. More details are discussed in Section 4.2 below. Note that an improvement corresponding to the change from Δ to Δ_old in Algorithm 1 is not possible for the generalized algorithm. The algorithm is assumed to be used in an iterative manner as follows.
First, when only one conflict P_1 is considered, the diagnoses are already described by P_1. Thus, the algorithm is not needed. When a second conflict P_2 is considered, the algorithm is fed with D = P_1 and P = P_2, and produces the output Q such that Q ≡ P_1 ∧ P_2. Then, for each additional conflict P_n that is considered, the input D is the old output Q. When the algorithm is used in this way, the following results can be guaranteed.

Theorem 1 Let 𝒫 be a set of negated conflicts that is not inconsistent, i.e. 𝒫 ⊭ ⊥, and let Q be the output from Algorithm 2 after processing all negated conflicts in 𝒫. Then it holds that Q ≡ 𝒫.

Theorem 2 The output Q from Algorithm 2 is in MNF.

The proofs of these results can be found in the appendix.

4.1 Example

To illustrate the algorithm, consider the following small example where 𝒞 = {p_1, p_2, p_3} and the domain of behavioral modes for each component is R_pi = {NF, G, B, UF}:

D = D_1 ∨ D_2 = p_1 ∈ {G, B, UF} ∨ p_3 ∈ {G, UF}
P = P_1 ∨ P_2 = p_2 ∈ {B, UF} ∨ p_3 ∈ {G, B, UF}

First, the condition D_1 ⊭ P is fulfilled, which means that D_1 is removed from D_old and the inner loop of the algorithm is entered. There a D_new is created such that D_new ≡ D_1 ∧

P_1 = p_1 ∈ {G, B, UF} ∧ p_2 ∈ {B, UF}. This D_new is then compared to D_2 in the condition D_new ⊨ D_2. The condition is not fulfilled, which means that D_new is added to D_add. Next a D_new is created such that D_new ≡ D_1 ∧ P_2 = p_1 ∈ {G, B, UF} ∧ p_3 ∈ {G, B, UF}. Also this time the condition D_new ⊨ D_2 is not fulfilled, implying that D_new is added to D_add. Next, the conjunction D_2 is investigated, but since D_2 ⊨ P holds, D_2 is not removed from D_old and the inner loop is not entered. The algorithm output is finally formed as

Q := D_old ∨ D_add = D_2 ∨ (D_1 ∧ P_1) ∨ (D_1 ∧ P_2)
   = p_3 ∈ {G, UF} ∨ p_1 ∈ {G, B, UF} ∧ p_2 ∈ {B, UF} ∨ p_1 ∈ {G, B, UF} ∧ p_3 ∈ {G, B, UF}

It can be verified that Q ≡ D ∧ P. Also, it can be seen that Q is in MNF.

4.2 Algorithm Details

To implement the algorithm, some more details need to be known. The first is how to check the condition D_i ⊭ P. To illustrate this, consider an example where D_i contains the components c_1, c_2, and c_3, and P the components c_2, c_3, and c_4. Since D is in MNF, and P on the form (4), D_i and P will have the form

D_i = c_1 ∈ M_1^D ∧ c_2 ∈ M_2^D ∧ c_3 ∈ M_3^D   (5)
P = c_2 ∈ M_2^P ∨ c_3 ∈ M_3^P ∨ c_4 ∈ M_4^P   (6)

We realize that the condition D_i ⊨ P holds if and only if M_2^D ⊆ M_2^P or M_3^D ⊆ M_3^P. Thus, this example shows that in general, D_i ⊨ P holds if and only if D_i and P contain at least one common component c_i for which M_i^D ⊆ M_i^P.

The second detail is how to find an expression D_new in MNF such that D_new ≡ D_i ∧ P_j. To illustrate this, consider an example where D_i contains the components c_1 and c_2, and P_j the component c_2. Since D is in MNF, and P on the form (4), D_i and P_j will have the form

D_i = c_1 ∈ M_1^D ∧ c_2 ∈ M_2^D   (7a)
P_j = c_2 ∈ M_2^P   (7b)

Then D_new will be formed as D_new = c_1 ∈ M_1^D ∧ c_2 ∈ (M_2^D ∩ M_2^P), which means that D_new ≡ D_i ∧ P_j. If it holds that M_2^D ∩ M_2^P ≠ ∅, D_new will be in MNF. Otherwise, let D_new = ⊥. The check D_new ⊨ D_k will then immediately make the algorithm jump to LABEL1, meaning that D_new will not be added to D_add.

The third detail is how to check the condition D_new ⊨ D_k. To illustrate this, consider an example where D_new contains the components c_1 and c_2, and D_k the components c_2 and c_3. Since D_new and D are both in MNF, D_new and D_k will have the form

D_new = c_1 ∈ M_1^n ∧ c_2 ∈ M_2^n   (8a)
D_k = c_2 ∈ M_2^D ∧ c_3 ∈ M_3^D   (8b)

Without changing their meanings, these expressions can be expanded so that they contain the same set of components:

D_new = c_1 ∈ M_1^n ∧ c_2 ∈ M_2^n ∧ c_3 ∈ R_c3   (9)
D_k = c_1 ∈ R_c1 ∧ c_2 ∈ M_2^D ∧ c_3 ∈ M_3^D   (10)

Now we see that the condition D_new ⊨ D_k holds if and only if M_1^n ⊆ R_c1, M_2^n ⊆ M_2^D, and R_c3 ⊆ M_3^D. The first of these three conditions is always fulfilled, and the third can never be fulfilled since, by the definition of MNF, M_3^D ⊂ R_c3. Thus, this example shows that D_new ⊨ D_k holds if and only if (1) D_k contains only components that are also contained in D_new, and (2) for all components c_i contained in both D_new and D_k it holds that M_i^n ⊆ M_i^D.

The fourth detail to consider is the expression D_add := D_add ∨ D_new. Since D_add is not assigned from the beginning, this expression is to be read as D_add := D_new when D_add is unassigned. Finally, note that D_old or D_add may be unassigned or empty at some places in the algorithm. In that case, e.g. in Q := D_old ∨ D_add, the missing term can simply be neglected.
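Putting the four details together, Algorithm 2 can be sketched compactly in Python, reusing the dictionary encoding introduced after Section 3 (a hypothetical representation: conjunctions as dicts, a negated conflict as a list of single assignments). The final print reproduces the output of the example in Section 4.1.

# Sketch of Algorithm 2 over the dictionary representation.

R = {"p1": {"NF", "G", "B", "UF"},
     "p2": {"NF", "G", "B", "UF"},
     "p3": {"NF", "G", "B", "UF"}}

def entails(D1, D2):
    # D_new |= D_k: absent components carry the full domain R[c]
    return all(D1.get(c, R[c]) <= M for c, M in D2.items())

def satisfies(Di, P):
    # D_i |= P iff D_i and P share a component c with M_c^D subset of M_c^P
    return any(c in Di and Di[c] <= M for c, M in P)

def conjoin(Di, Pj):
    # MNF conjunction equivalent to D_i AND P_j; None plays the role of 'false'
    c, M = Pj
    new = dict(Di)
    new[c] = Di.get(c, R[c]) & M
    return new if new[c] else None

def algorithm2(D, P):
    D_old, D_add = list(D), []
    for Di in D:
        if not satisfies(Di, P):
            D_old.remove(Di)
            for Pj in P:
                D_new = conjoin(Di, Pj)
                if D_new is None:
                    continue
                if not any(Dk is not Di and entails(D_new, Dk) for Dk in D):
                    D_add.append(D_new)
    return D_old + D_add

# The example of Section 4.1: yields D_2, D_1 AND P_1, D_1 AND P_2.
D = [{"p1": {"G", "B", "UF"}}, {"p3": {"G", "UF"}}]
P = [("p2", {"B", "UF"}), ("p3", {"G", "B", "UF"})]
print(algorithm2(D, P))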
5 Relation to Minimal Diagnoses

The concept of minimal diagnoses was originally proposed in [Reiter, 1987; de Kleer and Williams, 1987] for systems where each component has only two possible behavioral modes, i.e. the normal fault-free mode and a faulty mode. Minimal diagnoses have two attractive properties. Firstly, they represent the simplest diagnoses and are therefore often desired when prioritizing among diagnoses. Secondly, in case there are only positive conflicts, the minimal diagnoses characterize the set of all diagnoses. These two properties will now be investigated for the generalized case of more than two modes per component and non-positive conflicts.

5.1 Simplest Property

For the case of more than two modes per component, the concept of preferred diagnoses was defined in [Dressler and Struss, 1992] as a generalization of minimal diagnoses. The basic idea is that the behavioral modes of each component are ordered in a partial order defining that some behavioral modes are more preferred than others. For example, NF is usually preferred over any other mode, and a simple electrical fault, such as a short circuit or open circuit, may be preferred over other, more complex behavioral modes. Further, an unknown fault UF may be the least preferred mode. For a formal definition, let b_c^1 ≥_c b_c^2 denote the fact that for component c, the behavioral mode b_c^1 is equally or more preferred than b_c^2. For each component, this relation forms a partial order on the behavioral modes. Further, these relations induce a partial order on the system behavioral modes. Let d^1 and d^2 be two system behavioral modes d^i = ∧_{c ∈ 𝒞} (c = b_c^i). Then we write d^1 ≥ d^2 if for all c ∈ 𝒞 it holds that b_c^1 ≥_c b_c^2. A preferred diagnosis can then formally be defined as a diagnosis d such that there is no other diagnosis d′ where d′ > d. In Section 7 we will discuss how the preferred diagnoses can be obtained from an MNF formula representing all diagnoses. Note that in the case of only two modes, the preferred diagnoses are exactly the minimal diagnoses.

Remark: One may ask what preferred or simplest diagnoses mean. One possible formal justification is the following. Let P(d) denote the prior probability of the system behavioral mode d = ∧_{c ∈ 𝒞} (c = b_c). We assume that faults occur independently of each other, which means that P(d) =

∏_{c ∈ 𝒞} P(c = b_c), where P(c = b_c) is the prior probability that component c is in behavioral mode b_c. If Q is a formula such that Q ≡ 𝒫, it holds that P(d | 𝒫) = P(d ∧ Q)/P(Q). This means that P(d | 𝒫) = P(d)/P(Q) if d ⊨ 𝒫, i.e. if d is a diagnosis, and P(d | 𝒫) = 0 if d ⊭ 𝒫, i.e. if d is not a diagnosis. For a given set 𝒫, the term P(Q) is only a normalization constant, which means that to compare P(d | 𝒫) for different diagnoses it is enough to consider the priors P(d). To know the exact value of a prior P(c = b_c) may be very difficult or even impossible. Therefore, one may assume that for each component the priors are unknown but at least partially ordered. Under this assumption, and given the set of negated conflicts, the preferred diagnoses are then the most probable ones.

5.2 Characterizing Property

Now we investigate how the characterizing property of minimal diagnoses can be generalized to the case of more than two modes and the presence of non-positive conflicts. In some special cases, the preferred diagnoses characterize all diagnoses with the help of the partial order. That is, if d^1 is a diagnosis and d^2 < d^1, we know that also d^2 is a diagnosis. This is always true when there are only two modes per component and only positive conflicts, which in turn is guaranteed when there are no fault models. Note that it may also be true in a case with more than two modes, even in the presence of fault models. However, this does not hold generally.

In an MNF formula, the conjunctions have the property that they characterize all diagnoses. For example, consider the case where the components are 𝒞 = {a, b, c, d, e}, R = {NF, B, G, UF} for all components, and a ∈ {B, UF} ∧ b ∈ {G, UF} is one of the conjunctions in an MNF formula. By letting each diagnosis be represented as an ordered tuple corresponding to (a, b, c, d, e), this single conjunction characterizes the diagnoses

{B, UF} × {G, UF} × {NF, B, G, UF} × {NF, B, G, UF} × {NF, B, G, UF}

which is 256 diagnoses. For another example, assume that each of the components 𝒞 = {a, b, c, d} has only two modes, i.e. R = {NF, F}. A conjunction a ∈ {F} ∧ b ∈ {F} would then characterize all diagnoses {F} × {F} × {NF, F} × {NF, F}. In Section 2 this conjunction would be represented by {a, b}. If all conflicts are positive, all conjunctions would be on this form, and there is a one-to-one correspondence between the conjunctions in an MNF formula and the minimal diagnoses in the original framework described in Section 2. If there is a fault model for the mode F of a component a, the non-positive conflict a ∈ {F} may appear. Assume also that a conflict b ∈ {NF} appears. This has the consequence that a formula in MNF describing all diagnoses may, for example, contain a conjunction a ∈ {NF} ∧ b ∈ {F}. This conjunction characterizes all diagnoses {NF} × {F} × {NF, F} × {NF, F}, and this is a so-called kernel diagnosis (see the next section). Note that representing this conjunction is not possible using sets as described in Section 2. Note also that there is one minimal diagnosis in this example, namely a = NF ∧ b = F ∧ c = NF ∧ d = NF, and this minimal diagnosis does not characterize all diagnoses.
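The counting in the first example above can be checked mechanically; the following Python sketch enumerates the diagnoses characterized by a single conjunction by letting unmentioned components range over their full domain.

from itertools import product

# Diagnoses characterized by the conjunction a in {B,UF} AND b in {G,UF}.

R = {c: {"NF", "B", "G", "UF"} for c in "abcde"}
conj = {"a": {"B", "UF"}, "b": {"G", "UF"}}

domains = [sorted(conj.get(c, R[c])) for c in sorted(R)]
print(len(list(product(*domains))))   # 2 * 2 * 4 * 4 * 4 = 256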
6 Relation to Kernel Diagnoses

The paper [de Kleer et al., 1992] defines partial diagnosis and kernel diagnosis. This was done assuming only two modes per component. The purpose of kernel diagnoses is that the set of all kernel diagnoses characterizes all diagnoses even in the case when there are non-positive conflicts. As noted in [de Kleer et al., 1992], a subset of the kernel diagnoses is sometimes sufficient to characterize all diagnoses. In the context of this paper we can define a partial diagnosis as a conjunction d of mode assignments such that d ⊨ 𝒫. Then, a kernel diagnosis is a partial diagnosis d such that there is no other partial diagnosis d′ where d ⊨ d′. According to the following theorem, the output Q from Algorithm 2 is, in the two-mode case, a disjunction of kernel diagnoses.

Theorem 3 Let each component have only two possible behavioral modes, let 𝒫 be a set of negated conflicts, and let Q be the output from Algorithm 2 after processing all negated conflicts in 𝒫. Then it holds that each conjunction of Q is a kernel diagnosis.

Note that the MNF property alone does not guarantee that all conjunctions are kernel diagnoses. This can be seen in the following formula, which is in MNF:

c_1 = N ∧ c_2 = N ∨ c_1 = N ∧ c_2 = F   (11)

All diagnoses represented by (11) are characterized by the single kernel diagnosis c_1 = N. Therefore none of the conjunctions in (11) is a kernel diagnosis. Even though the paper [de Kleer et al., 1992] defines partial and kernel diagnoses for the case of only two modes per component, the definition of partial and kernel diagnoses given above is applicable also to the case of more than two modes per component. However, the conjunctions in the output Q from Algorithm 2 will in this case not be kernel diagnoses. Instead, each conjunction represents a set of partial diagnoses; e.g. the first conjunction of (12) represents the two partial diagnoses c_1 = E ∧ c_3 = B and c_1 = E ∧ c_3 = G. Since the second conjunction of (12) represents e.g. c_1 = E ∧ c_2 = E ∧ c_3 = B, it is also obvious that the partial diagnoses represented by each conjunction are not necessarily kernel diagnoses.

7 Extracting Preferred Diagnoses

In Section 5 it was concluded that the conjunctions in the output Q from Algorithm 2 characterize all diagnoses and that, in the special case of two modes per component and only positive conflicts, there is a one-to-one correspondence between MNF conjunctions and the minimal diagnoses. This special case also has the property that if we study each conjunction in an MNF formula Q separately, it will have only one preferred diagnosis. This preferred diagnosis is also a preferred diagnosis when considering the whole formula Q. The consequence is that it is straightforward to extract the preferred diagnoses from a formula Q. In the general case, there is no

such guarantee. For example, in the two-mode case, when some conflicts are non-positive, which means that the negated conflict will contain some assignment c = NF, there may be a conjunction not corresponding to a preferred diagnosis. For an example with more than two modes, consider two components c_1 and c_2 where R_ci = {NF, E, F} and NF >_ci E >_ci F, and a third component c_3 where R_c3 = {NF, B, G} with the only relations NF >_c3 B and NF >_c3 G. Then consider the MNF formula

Q = c_1 ∈ {E} ∧ c_3 ∈ {B, G} ∨ c_1 ∈ {E, F} ∧ c_2 ∈ {E, F} ∧ c_3 ∈ {B, G}   (12)

The preferred diagnoses consistent with the first conjunction are c_1 = E ∧ c_2 = NF ∧ c_3 = B and c_1 = E ∧ c_2 = NF ∧ c_3 = G. The preferred diagnoses consistent with the second are c_1 = E ∧ c_2 = E ∧ c_3 = B and c_1 = E ∧ c_2 = E ∧ c_3 = G. As seen, the two diagnoses c_1 = E ∧ c_2 = E ∧ c_3 = B and c_1 = E ∧ c_2 = E ∧ c_3 = G are not preferred diagnoses of the whole formula Q. The example shows that preferred diagnoses can not be extracted simply by considering one conjunction at a time. Instead, the following procedure can be used. For each conjunction in Q, find the preferred diagnoses consistent with that conjunction, and collect all diagnoses found in a set Ψ. The set Ψ may contain non-preferred diagnoses. These can be removed by a simple pairwise comparison. Note that the set Ψ need not be calculated for every new negated conflict that is processed. Instead, the set Ψ needs to be calculated only at the time the preferred diagnoses are really needed, for example before a service task is to be carried out.

One may ask how much extra time is needed for the computation of the preferred diagnoses, compared to the time needed to process all negated conflicts and compute Q. To give an indication of this, the following empirical experiment was set up. A number of 132 test cases were randomly generated. The test cases represent systems with between 4 and 7 components, where each component has 4 possible behavioral modes. The number of negated conflicts varies between 2 and 12.

[Figure 1: The total execution times for computing Q (dashed line) and preferred diagnoses (solid line).]

In Figure 1, the results for the 132 test cases are shown. The reference time on the x-axis is chosen to be the computation time needed to compute Q. As seen, the figure indicates that the extra time needed to compute the preferred diagnoses from the MNF formula Q is almost negligible compared to the time needed to compute the MNF formula itself.

8 Conclusions

In this paper, the minimal hitting-set algorithm from [de Kleer and Williams, 1987] has been generalized to handle more than two modes per component and also non-positive conflicts. This has been done by first establishing a framework where all conflicts and diagnoses are represented by special logical formulas. Then, the original minimal hitting-set algorithm needed only small modifications to obtain the desired results. It has been formally proven that Q ≡ 𝒫, i.e. the algorithm output is equivalent to the set of all diagnoses. Further, it was proven that the algorithm output Q is in the MNF form, which guarantees that Q does not contain redundant conjunctions. In a comparison with the original framework, where conflicts and diagnoses are represented by sets, it was concluded that the conjunctions in the output Q of the generalized algorithm are a true generalization of the minimal diagnoses obtained from the minimal hitting-set algorithm.
It has also been concluded that the conjunctions are a true generalization of kernel diagnoses. Since, for the case of more than two modes per component, minimal diagnoses do not necessarily correspond to the most desired diagnoses, it was instead shown how preferred diagnoses can be obtained from the conjunctions with a reasonable amount of effort.

References

[de Kleer and Williams, 1987] J. de Kleer and B.C. Williams. Diagnosing multiple faults. Artificial Intelligence, 32(1), 1987.
[de Kleer and Williams, 1989] J. de Kleer and B.C. Williams. Diagnosis with behavioral modes. IJCAI, 1989.
[de Kleer et al., 1992] J. de Kleer, A.K. Mackworth, and R. Reiter. Characterizing diagnoses and systems. Artificial Intelligence, 56(2-3), 1992.
[Dressler and Struss, 1992] O. Dressler and P. Struss. Back to defaults: Characterizing and computing diagnoses as coherent assumption sets. ECAI, 1992.
[Greiner et al., 1989] R. Greiner, B.A. Smith, and R.W. Wilkerson. A correction to the algorithm in Reiter's theory of diagnosis. Artificial Intelligence, 41(1):79-88, 1989.
[Reiter, 1987] R. Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57-95, April 1987.
[Struss and Dressler, 1989] P. Struss and O. Dressler. "Physical negation" - integrating fault models into the general diagnosis engine. IJCAI, 1989.
[Wotawa, 2001] F. Wotawa. A variant of Reiter's hitting-set algorithm. Information Processing Letters, 79(1):45-51, 2001.

Appendix

Lemma 1 The output Q from Algorithm 2 contains no two conjunctions Q_1 and Q_2 such that Q_2 ⊨ Q_1.

PROOF. Assume the contrary, that Q_1 and Q_2 are two conjunctions in Q and Q_2 ⊨ Q_1. There are three cases that need to be investigated: (1) Q_1 ∈ D_old, Q_2 ∈ D_add; (2) Q_2 ∈ D_old, Q_1 ∈ D_add; (3) Q_1 ∈ D_add, Q_2 ∈ D_add.

1) The fact Q_2 ∈ D_add means that D_new = Q_2 at some point. Since Q_1 ∈ D_old, D_new must then have been compared to Q_1. Since Q_2 has really been added, it cannot have been the case that Q_2 ⊨ Q_1.

2) Since Q_1 ∈ D_add, it holds that Q_1 ≡ D_i ∧ P_j for some D_i ∈ D. The fact Q_2 ⊨ Q_1 implies that Q_2 ⊨ D_i ∧ P_j ⊨ D_i. This is a contradiction since Q_2 ∈ D, and D is in MNF.

3) There are three cases: (a) Q_2 ≡ D_i ∧ P_j2 ⊨ D_i ∧ P_j1 ≡ Q_1; (b) Q_2 ≡ D_i2 ∧ P_j ⊨ D_i1 ∧ P_j ≡ Q_1; (c) Q_2 ≡ D_i2 ∧ P_j2 ⊨ D_i1 ∧ P_j1 ≡ Q_1, where in all cases P_j1 ≠ P_j2 and D_i1 ≠ D_i2.

a) We know that D_i and P are formulas on forms like D_i = a ∈ A ∧ b ∈ B ∧ c ∈ C and P = a ∈ A_p ∨ b ∈ B_p, respectively. This means that Q_1 = a ∈ (A ∩ A_p) ∧ b ∈ B ∧ c ∈ C and Q_2 = a ∈ A ∧ b ∈ (B ∩ B_p) ∧ c ∈ C. The fact Q_2 ⊨ Q_1 implies that A ⊆ A ∩ A_p, which further means that A ⊆ A_p. This implies D_i = a ∈ A ∧ b ∈ B ∧ c ∈ C ⊨ a ∈ A_p ⊨ P. Thus, Q_1 and Q_2 are never subject to be added to D_add.

b) We have that Q_2 ≡ D_i2 ∧ P_j ⊨ D_i1 ∧ P_j ⊨ D_i1, where D_i1 ∈ D. This means that Q_2 ≡ D_i2 ∧ P_j can not have been added to D_add.

c) We have that Q_2 ≡ D_i2 ∧ P_j2 ⊨ D_i1 ∧ P_j1 ⊨ D_i1, where D_i1 ∈ D. This means that Q_2 ≡ D_i2 ∧ P_j2 can not have been added to D_add.

All these investigations show that it is impossible that Q_2 ⊨ Q_1.

Theorem 2 The output Q from Algorithm 2 is in MNF.

PROOF. From Lemma 1 it follows that Q contains no two conjunctions such that Q_2 ⊨ Q_1. All conjunctions in D_old are trivially on the form specified by (1). All conjunctions in D_add are also on the form (1) because of the requirement on D_new. Thus Q is in MNF.

Lemma 2 Let Q be the output from Algorithm 2 after processing all negated conflicts in 𝒫. For any two conjunctions Q_1 and Q_2 in Q, there is no component c and conjunction D such that Q_1 ≡ D ∧ c ∈ A_1 and Q_2 ≡ D ∧ c ∈ A_2, where A_1 ⊂ R_c and A_2 ⊂ R_c.

PROOF. Assume that there is a component c and a conjunction D such that Q_1 ≡ D ∧ c ∈ A_1 and Q_2 ≡ D ∧ c ∈ A_2. We can write Q_1 as c ∈ A_φ1 ∧ D_1, where A_φ1 is the intersection of the sets M_j obtained from all P ∈ φ_1 ⊆ 𝒫, and D_1 is the conjunction of one P_j obtained from every P ∈ 𝒫 ∖ φ_1. Similarly, we write Q_2 as c ∈ A_φ2 ∧ D_2. We can find a D̃ such that D̃ ⊨ D_1 ∧ D_2, where D̃ is the conjunction of one P_j obtained from every P ∈ 𝒫 ∖ (φ_1 ∩ φ_2). Then let D̄ = c ∈ (A_φ1 ∩ A_φ2) ∧ D̃, which means that D̄ ⊨ c ∈ A_φ1 ∧ D_1 ≡ Q_1. Similarly, we can obtain the relation D̄ ⊨ c ∈ A_φ2 ∧ D_2 ≡ Q_2. By construction of D̄, it can be realized that D̄ ⊨ Q_k for some conjunction Q_k in Q. Because of this relation, both Q_1 and Q_2 can not be contained in Q, which is a contradiction. This means that there can not be a component c and conjunction D such that Q_1 ≡ D ∧ c ∈ A_1 and Q_2 ≡ D ∧ c ∈ A_2.

Lemma 3 Let Q = D_old ∨ D_add be the output from Algorithm 2 after processing all negated conflicts in 𝒫. If D_im is not contained in D_old, and the conjunction D_im ∧ P_j is not contained in D_add after running the algorithm, then there is a D_im+1 such that D_im ∧ P_j ⊨ D_im+1 and D_im+1 ∧ P_j ≢ D_im ∧ P_j.

PROOF. The fact that D_im is not contained in D_old means that the inner loop of the algorithm must have been entered when D_i = D_im. Then the fact that D_im ∧ P_j is not contained in D_add means that D_im ∧ P_j ⊨ D_k for some D_k, k ≠ i_m. By choosing i_m+1 = k, this gives D_im ∧ P_j ⊨ D_im+1. Next we prove that D_k ∧ P_j ≢ D_i ∧ P_j.
Let the single assignment in P_j be a ∈ A_p. We will divide the proof into four cases: (1) a ∉ comps D_i, a ∉ comps D_k; (2) a ∈ comps D_i, a ∉ comps D_k; (3) a ∉ comps D_i, a ∈ comps D_k; and (4) a ∈ comps D_i, a ∈ comps D_k.

1) The fact D_i ∧ P_j ⊨ D_k would imply D_i ⊨ D_k, which is impossible because D is in MNF.

2) This means that D_i can be written as D_i = D′ ∧ a ∈ A_i. The fact D_i ∧ P_j ⊨ D_k would then imply that D′ ⊨ D_k and consequently that D_i ⊨ D_k, which is impossible because D is in MNF.

3) First assume that D_i contains a component c ∉ comps D_k. Note that this component is not component a. This would imply that c is not contained in P_j. Thus the components of D_i ∧ P_j are not a subset of the components of D_k ∧ P_j, which implies D_k ∧ P_j ≢ D_i ∧ P_j. The case left to investigate is when the components of D_i are a subset of the components of D_k. Assume that D_k ∧ P_j ≡ D_i ∧ P_j. This relation can be written D′_k ∧ a ∈ (A_p ∩ A_k) ≡ D_i ∧ a ∈ A_p, where D′_k is a conjunction not containing component a. For this relation to hold, it must hold that D′_k ≡ D_i. This means that D_k = a ∈ A_k ∧ D′_k ⊨ D_i, which is impossible because D is in MNF.

4) Assume that D_k ∧ P_j ≡ D_i ∧ P_j. This relation can be written D′_k ∧ a ∈ (A_p ∩ A_k) ≡ D′_i ∧ a ∈ (A_p ∩ A_i), where D′_k and D′_i are conjunctions not containing component a. This relation would imply D′_k ≡ D′_i. Further on, the fact D_i ∧ P_j ⊨ D_k can be written a ∈ (A_p ∩ A_i) ∧ D′_i ⊨ a ∈ A_k ∧ D′_k, which implies that D′_i ⊨ D′_k. Thus we have D′_i ≡ D′_k, and the only possible difference between D_i and D_k is the assignment of component a. Lemma 2 says this is impossible.

With i = i_m and k = i_m+1, these four cases have shown that D_im+1 ∧ P_j ≢ D_im ∧ P_j.

Lemma 4 Let D be the output from Algorithm 2 after processing all negated conflicts in 𝒫_n−1, and Q the output given D and P as inputs. For each conjunction D_i in D and P_j in P, it holds that there is a conjunction Q_k in Q such that D_i ∧ P_j ⊨ Q_k.

PROOF. If, after running the algorithm, D_i is contained in D_old, then the lemma is trivially fulfilled. If instead D_i ∧ P_j is contained in D_add, then the lemma is also trivially fulfilled. Study now the case where D_i is not contained in D_old and D_i ∧ P_j is not contained in D_add. We can then apply Lemma 3 with 𝒫 = 𝒫_n−1 ∪ {P}. This gives us a D_im+1 such that D_im ∧ P_j ⊨ D_im+1 and D_im+1 ∧ P_j ≢ D_im ∧ P_j. If D_im+1 is contained in D_old, then the lemma is fulfilled. If instead D_im+1 ∧ P_j is contained in D_add, note that D_im ∧ P_j ⊨ D_im+1 implies D_im ∧ P_j ⊨ D_im+1 ∧ P_j. This means that the lemma is fulfilled. In this way we can repeatedly apply Lemma 3 as long as the new D_im+1 obtained is not contained in D_old and D_im+1 ∧ P_j is not contained in D_add. We will now prove that after a finite number of applications of Lemma 3 we obtain a D_im+1 such that D_im+1 is contained in D_old or D_im+1 ∧ P_j is contained in D_add. Note that each application of Lemma 3 guarantees that D_im ∧ P_j ⊨ D_im+1 ∧ P_j and D_im+1 ∧ P_j ≢ D_im ∧ P_j. This fact itself implies that there cannot be an infinite number of applications of Lemma 3.

Theorem 1 Let 𝒫 be a set of negated conflicts that is not inconsistent, i.e. 𝒫 ⊭ ⊥, and let Q be the output from Algorithm 2 after processing all negated conflicts in 𝒫. Then it holds that Q ≡ 𝒫.

PROOF. Let 𝒫_n−1 denote the set of all negated conflicts in 𝒫 except P. Then it holds that 𝒫 ≡ 𝒫_n−1 ∪ {P} ≡ D ∧ P. Lemma 4 implies that D ∧ P ⊨ Q. Left to prove is Q ⊨ D ∧ P. Take an arbitrary conjunction Q_k in the output Q. If Q_k is in D_old, then it must also be in D, i.e. Q_k ≡ D_i for some conjunction D_i in D. The fact that D_i is in D_old means also that D_i ⊨ P. Thus Q_k ≡ D_i ⊨ D ∧ P. If instead Q_k is in D_add, then Q_k ≡ D_i ∧ P_j for some conjunction D_i in D and some P_j in P, and thus again Q_k ⊨ D ∧ P.

Lemma 5 Let 𝒫_n−1 ∪ {P_n} be a set of negated conflicts, and let each component have only two possible behavioral modes. If D is the output from Algorithm 2 after processing all negated conflicts in 𝒫_n−1, then a new call to the algorithm with inputs D and P_n gives an output Q in which each conjunction is a kernel diagnosis.

PROOF. Take an arbitrary conjunction Q_k in Q. It holds that Q_k ≡ D_i ∧ P_j for some conjunction D_i in D and some assignment P_j in P_n, or Q_k ≡ D_i. If Q_k ≡ D_i, then Q_k is a kernel diagnosis since D_i is. Next we investigate the other case, Q_k ≢ D_i. Assume that Q_k is not a kernel diagnosis. The assignment P_j can be written as c_p = M_p. Thus, we can write Q_k as Q_k = D_i ∧ c_p = M_p. Since by assumption Q_k is not a kernel diagnosis, we can remove one assignment, either c_p = M_p or some assignment a = M_a in D_i, from Q_k and obtain a partial diagnosis. The partial diagnosis obtained is either D_i or D′ ∧ c_p = M_p, where D_i = D′ ∧ a = M_a. Study first the case where D_i is the partial diagnosis. By definition, this means that D_i ⊨ 𝒫_n−1 ∪ {P_n}, which implies D_i ⊨ P_n. This means that D_i would not be removed from D_old and would thus become one conjunction in Q. Since Q_k = D_i ∧ c_p = M_p ⊨ D_i, both Q_k and D_i cannot be conjunctions in Q, because Q is in MNF according to Theorem 2. This contradiction shows that D_i can not be a partial diagnosis. Next, study the case where D′ ∧ c_p = M_p is the partial diagnosis, and let M̄_a denote the complementary element to M_a. This means that both D′ ∧ c_p = M_p ∧ a = M_a and D′ ∧ c_p = M_p ∧ a = M̄_a are partial diagnoses.
This means, by definition, that D′ ∧ (c_p = M_p) ∧ (a = M̄_a) ⊨ ℙ_{n-1} ∪ {P_n} ≡ Q. Since Q_k = D′ ∧ (a = M_a) ∧ (c_p = M_p), and Q is in MNF, there must be another Q_m such that D′ ∧ (c_p = M_p) ∧ (a = M̄_a) ⊨ Q_m. According to Lemma 2, it cannot hold that Q_m = D′ ∧ (c_p = M_p) ∧ (a = M̄_a). Therefore we can remove one assignment from D′ ∧ (c_p = M_p) ∧ (a = M̄_a) and still obtain a conjunction d such that d ⊨ Q_m. Note then that it cannot hold that d = D′ ∧ (c_p = M_p), since this would imply that Q_k ⊨ Q_m. Now we investigate the case d = D′ ∧ (a = M̄_a). Let Ω denote the set of assignments contained in D′. The fact that Q_k = D′ ∧ (a = M_a) ∧ (c_p = M_p) means that each negated conflict P ∈ ℙ_{n-1} ∪ {P_n} contains an assignment in Ω ∪ {a = M_a} ∪ {c_p = M_p}. Next, D′ ∧ (a = M̄_a) ⊨ Q_m means that Q_m contains a subset of the assignments contained in D′ ∧ (a = M̄_a). This further means that each negated conflict P ∈ ℙ_{n-1} ∪ {P_n} contains an assignment from Ω_m ∪ {a = M̄_a}, where Ω_m ⊆ Ω denotes the assignments of Q_m other than a = M̄_a. This means that a P that does not contain any assignment from Ω_m must contain the assignment a = M̄_a. The consequence of this is that P cannot contain the assignment a = M_a. Since it was concluded above that each P contains an assignment in Ω ∪ {a = M_a} ∪ {c_p = M_p}, P must then contain the assignment c_p = M_p. Thus each negated conflict P ∈ ℙ_{n-1} ∪ {P_n} contains an assignment from Ω_m ∪ {c_p = M_p}. We can now select one assignment from each P ∈ ℙ_{n-1} ∪ {P_n}, but with the requirement that the selected assignment must be c_p = M_p or contained in Ω. By forming a conjunction Φ of these assignments, it will hold that D′ ∧ (c_p = M_p) ⊨ Φ. Therefore Q_k = D′ ∧ (a = M_a) ∧ (c_p = M_p) ⊨ Φ. If Φ is not one of the conjunctions in Q, there will be another Q_v such that Φ ⊨ Q_v. This means that Q_k ⊨ Q_v and Q_k cannot be contained in Q, which is a contradiction. Thus we have shown that it cannot hold that d = D′ ∧ (a = M̄_a), and therefore that D′ ∧ (c_p = M_p) cannot be a partial diagnosis. This further means that Q_k must be a kernel diagnosis.

Theorem 3 Let each component have only two possible behavioral modes, let ℙ be a set of negated conflicts, and let Q be the output from Algorithm 2 after processing all negated conflicts in ℙ. Then it holds that each conjunction of Q is a kernel diagnosis.

PROOF. It is not difficult to realize that, after processing the first two negated conflicts in ℙ, each conjunction of the output Q is a kernel diagnosis. For each further negated conflict that is processed, each conjunction of the new output will be a kernel diagnosis according to Lemma 5.

Runtime Fault Detection and Localization in Component-oriented Software Systems

Bernhard Peischl, Joerg Weber, and Franz Wotawa
Technische Universität Graz, Institute for Software Technology, 8010 Graz, Inffeldgasse 16b/2, Austria

Abstract
In this paper we introduce a novel technique for runtime fault detection and localization in component-oriented software systems. Our approach allows arbitrary properties to be defined via rules at the component level. By monitoring the software system at runtime we can detect violations of these properties and, most notably, also localize possible causes of specific property violation(s). Relying on the model-based diagnosis paradigm, our fault localization technique is able to deal with intermittent fault symptoms, and it allows for measurement selection. Finally, we discuss results obtained from our most recent case studies and relate our work to that of others.

(This research has been funded in part by the Austrian Science Fund (FWF) under grant P17963-N04. Authors are listed in alphabetical order.)

1 Introduction

Several research areas are engaged in the improvement of software reliability during the development phase, for example research on testing, debugging, or formal verification techniques like model checking. Unfortunately, although substantial progress has been made in these fields, we have to accept that faults in complex software systems are facts to be coped with, not problems to be solved [Patterson et al., 2002]. This perspective is supported by historical evidence and by numerous studies. Thus, it is highly desirable to augment complex software systems with autonomic fault localization capabilities, especially in systems which require high reliability. The goal of our work is to detect and locate faults at runtime without any human intervention. Existing techniques like runtime verification aim at the detection of faults. However, it is necessary to locate faults in order to be able to automatically perform repair at runtime. Possible repair actions are, for example, the restart of software components or switching to redundant components.

In this paper we propose a technique for runtime fault detection and localization in component-oriented software systems. We define components as independent computational modules which have no shared memory and which communicate with each other by means of events, which can contain arbitrary attributes. We suppose that the interactions are asynchronous. A component-oriented software system may be a single application which comprises loosely coupled processes (threads), or it may consist of multiple independent applications which communicate among themselves. The components may be distributed over a network. Typical implementations of the asynchronous event-based communication paradigm are, for example, CORBA, COM, JavaBeans, or even low-level communication methods like Unix message passing. Moreover, we suppose that a certain event produced by a component cannot be directly related to a specific incoming event, for example because incoming events may be internally queued. Furthermore, there may be connections which are not observable. Another assumption is that, as is often the case in practice, no formalized knowledge about the application domain exists. We require the runtime diagnosis to impose low runtime overhead in terms of computational power and memory consumption.
Ideally, augmenting the software with fault detection and localization functionality necessitates no change to the software itself. We require the monitoring process to have no noticeable influence on the overall behavior of the system. Moreover, to avoid damage which could be caused by a faulty system, we have to achieve quick detection, localization, and repair. Another difficulty is the fact that the fault symptoms are often intermittent. One reason is that in runtime diagnosis the inputs to the system cannot be kept constant while the diagnosis is performed, as the system continues operating. For example, a server application may receive new client requests during the fault localization process. In addition, software systems often operate in a physical environment which permanently changes, e.g. the control software of a mobile robot.

Our approach allows the introduction of user-defined properties. The target system is continuously monitored by rules, i.e., pieces of software which detect property violations. The fact that the modeler can implement and integrate arbitrary rules provides sufficient flexibility to cope with today's software complexity. In practice, properties and rules will often embody elementary insights into the software behavior rather than complete specifications. The reason is that, due to the

complexity of software systems, often no formalized knowledge of the behavior exists and the informal specifications are coarse and incomplete. In order to enable fault localization, complex dependences between properties can be defined. When a violation occurs, we locate the fault by employing the model-based diagnosis (MBD) paradigm [Reiter, 1987; de Kleer and Williams, 1987]. In terms of the classification introduced in [Brusoni et al., 1998], we propose a state-based diagnosis approach with temporal behavior abstraction. Furthermore, our model is able to deal with intermittent symptoms. We evaluated our approach using the control software of a mobile autonomous robot as target system. The concrete models which we created for this system mainly aimed at the diagnosis of severe faults like software crashes and deadlocks.

Among the novel contributions of this paper is the monitoring of user-defined properties at the component level by integrating arbitrary rules. In particular, we employ relationships between properties for the localization of faults. Furthermore, we provide a formalization of the architecture of a component-based software system and of the property dependences, and we outline an algorithm for computing the logical model. We formally describe the diagnosis system and we present a runtime fault detection and localization algorithm which allows for measurement selection. Moreover, we give examples related to the control software of autonomous robots and discuss the results of case studies. Finally, we relate our work to that of others.

2 Introduction to the Model Framework

[Figure 1: Architectural view on the software system of our example. The components Vision, Odometry, Kicker, WM (WorldModel), and Planner are linked by the connections OM (Object Measurement), MD (Motion Delta), HB (HasBall), WS (World State), and PS (Planner State).]

Figure 1 illustrates a fragment of a control system for an autonomous soccer robot as our running example. This architectural view comprises basically independent components which communicate by asynchronous events. The connections between the components depict data flows. The Vision component periodically produces events containing position measurements of perceived objects. The Odometry periodically sends odometry data to the WorldModel (WM). The WM uses probability-based methods for tracking object positions. For each event arriving at one of its inputs, it creates an updated world model containing estimated object positions. The Kicker component periodically creates events indicating whether or not the robot owns the ball. The Planner maintains a knowledge base (KB), which includes a qualitative representation of the world model, and chooses abstract actions based on this knowledge. The content of the KB is periodically sent to a monitoring application.

In [Steinbauer and Wotawa, 2005] an abstract behavior model of software components is proposed which is similar to the model in [Friedrich et al., 1999]. This model abstracts over concrete values in terms of functional dependences [Jackson, 1995]: if we assume a correctly working component and all inputs are correct, then the output(s) must be correct as well. In our example, the Planner component would be modelled as follows: ¬AB(Planner) ∧ ok(WS) ∧ ok(HB) → ok(PS), where AB(c) denotes abnormality of component c and ok(e) states that a connection e is correct during a certain period of time. That is, the model abstracts from both the temporal constraints and the possibly complex values (logical contents) of events.
Whether a connection is correct or not is determined by observers. An observer comprises rules, which are pieces of software which monitor certain parts of the software system. For example, the observers for ok(OM), ok(MD), and ok(PS) would contain rules which continuously check if events on these connections are produced periodically.

While the model in [Steinbauer and Wotawa, 2005] proved applicable in various settings, we argue that in many cases this model is too abstract to express software behavior. As a matter of fact, a component's complex behavior cannot be captured by simple dependences. First, a separation of temporal constraints and constraints related to the values of events is highly desirable. For example, in Figure 1, the Planner is supposed to produce events on the connection PS periodically, regardless of the inputs to this component. Thus this constraint does not depend on any input, and from its violation we can directly infer that the Planner has failed. However, the value of the PS events is directly influenced by the Planner's inputs. Second, as events may contain complex values, fine-grained dependences are necessary for capturing the real behavior. For example, some parts of the knowledge base, whose content is transmitted over the PS connection, depend on the world model, i.e. on the WS connection, while other parts depend on the HB connection.

Our new model addresses these issues by assigning a set of properties, i.e. constraints, to components and connections. In the logical model, these properties are represented by property constants. We use the proposition ok(x, pr, s), which states that the property pr holds for x during a certain period of time, where x is either a component or a connection. While the system is continuously monitored by the rules, the diagnosis itself is based on (multiple) discrete snapshots of the system. The snapshots are obtained by polling the states of the rules (violated or not violated) at discrete time points. Each observation belongs to a certain snapshot, and we use the variable s as a placeholder for a specific snapshot. The diagnosis accounts for the observations of all snapshots. This approach to MBD is called multiple-snapshot diagnosis or state-based diagnosis [Brusoni et al., 1998]. An example for a component-related property is pr_np, expressing that the number of processes (threads) of a correctly

working component c must exceed a certain threshold. In our running example, pr_pe denotes that events must occur periodically on a connection, and pr_cons_e is used to denote that the value of events on a certain connection must not contradict the events on connection e.

[Figure 2: The improved model allows properties to be defined for each connection: ok(OM, pr_pe); ok(MD, pr_pe); ok(HB, pr_pe); ok(WS, pr_pe) and ok(WS, pr_cons_OM); ok(PS, pr_pe), ok(PS, pr_cons_OM), and ok(PS, pr_cons_HB).]

The observer for ok(WS, pr_cons_OM, s) checks if the computed world models on connection WS correspond to the object position measurements on connection OM. Ideally, such an observer would embody a complete specification of the tracking algorithm used in the WM component. In practice, however, often only incomplete and coarse specifications of the complex software behavior are available. Therefore, the observers rely on simple insights which require little expert knowledge. The rules of the observer for ok(WS, pr_cons_OM, s) could check if all environment objects which are perceived by the Vision are also part of the computed world models, disregarding the actual positions of the objects (note that the set of perceived objects often changes in a dynamic environment, and those objects which are no longer perceived will be tracked by the WM for a while and finally discarded). Our experience has shown that such abstractions often suffice to detect and locate severe faults like software crashes or deadlocks.

Using such properties, the dependences between the inputs and outputs of components can be refined, as the logical model in Figure 3 shows. Figure 2 depicts the properties which we assign to the connections, and Figure 4 shows the dependences between properties on the input and output connections of the WM and the Planner. The model captures, for example, that the WM must generate events periodically, provided that the temporal constraints on the incoming connections hold. Furthermore, the value of the events on connection WS must be consistent with the OM connection, provided that the events on OM occur periodically.

¬AB(Vision) → ok(OM, pr_pe, s)
¬AB(Odometry) → ok(MD, pr_pe, s)
¬AB(WM) ∧ ok(OM, pr_pe, s) ∧ ok(MD, pr_pe, s) → ok(WS, pr_pe, s)
¬AB(WM) ∧ ok(OM, pr_pe, s) → ok(WS, pr_cons_OM, s)
¬AB(Kicker) → ok(HB, pr_pe, s)
¬AB(Planner) → ok(PS, pr_pe, s)
¬AB(Planner) ∧ ok(WS, pr_cons_OM, s) → ok(PS, pr_cons_OM, s)
¬AB(Planner) → ok(PS, pr_cons_HB, s)
for each component c: ¬AB(c) → ok(c, pr_np, s)

Figure 3: An improved model with refined dependences for our example.

To illustrate our basic approach we outline a simple scenario by locating the cause for observed malfunctioning. We assume a fault in the WM causing the world state WS and, as a consequence, the planner state PS to become inconsistent with the object position measurements OM. As a result, the observer for ok(PS, pr_cons_OM, s) detects a violation, i.e. ¬ok(PS, pr_cons_OM, s_0) is an observation for snapshot 0. All other observers are initially disabled, i.e. they do not provide any observations. Based on this observation, we can compute diagnosis candidates by employing the MBD approach [Reiter, 1987; de Kleer and Williams, 1987] for this observation snapshot. By computing all (subset-minimal) diagnoses, we obtain three single-fault diagnosis candidates, namely {AB(Vision)}, {AB(WM)}, and {AB(Planner)}.
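To make the first scenario concrete, the following small sketch (an illustration added here, not the authors' implementation) encodes the three clauses of Figure 3 that are relevant for the observation ¬ok(PS, pr_cons_OM, s_0) and enumerates the subset-minimal AB assignments consistent with it; the remaining clauses only derive further ok atoms and therefore do not constrain this particular observation:

```python
from itertools import chain, combinations

COMPS = ["Vision", "WM", "Planner"]

def consistent(ab: set) -> bool:
    # Forward-chain the clause chain that ends in ok(PS, pr_cons_OM, s0).
    ok_om = "Vision" not in ab             # ¬AB(Vision) → ok(OM, pr_pe, s0)
    ok_ws = "WM" not in ab and ok_om       # ¬AB(WM) ∧ ok(OM, pr_pe, s0) → ok(WS, pr_cons_OM, s0)
    ok_ps = "Planner" not in ab and ok_ws  # ¬AB(Planner) ∧ ok(WS, pr_cons_OM, s0) → ok(PS, pr_cons_OM, s0)
    return not ok_ps                       # observation: ¬ok(PS, pr_cons_OM, s0)

def minimal_diagnoses() -> list:
    subsets = chain.from_iterable(combinations(COMPS, n) for n in range(len(COMPS) + 1))
    diags = [set(s) for s in subsets if consistent(set(s))]
    # Keep only subset-minimal diagnoses.
    return [d for d in diags if not any(o < d for o in diags)]

print(minimal_diagnoses())  # [{'Vision'}, {'WM'}, {'Planner'}]
```

Adding the observations of the second snapshot (where ok(OM, pr_pe, s_1) holds while ok(WS, pr_cons_OM, s_1) is violated) would force AB(WM) and leave {AB(WM)} as the only minimal diagnosis, mirroring the discussion below.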
Note that, using the coarse-grained model in [Steinbauer and Wotawa, 2005], the Odometry and the Kicker would be candidates, too. After activating observers for the output connections of these candidates, we obtain the second observation snapshot ok(OM, pr_pe, s_1), ok(WS, pr_pe, s_1), ¬ok(WS, pr_cons_OM, s_1), ok(PS, pr_pe, s_1), ¬ok(PS, pr_cons_OM, s_1), and ok(PS, pr_cons_HB, s_1). This leads to the single diagnosis {AB(WM)}.

Let us consider a second scenario related to Figure 2. It demonstrates that our model framework allows refinements which may lead to the correct identification of multiple-fault diagnoses in situations in which a less fine-grained model would find solely single-fault diagnoses. We assume that monitoring the connections OM and MD is either impossible or unrealistic due to high costs. Suppose that no events on connection WS occur; thus ¬ok(WS, pr_pe, s_0) is observed. Given the model in Fig. 3, we obtain 3 single-fault diagnoses: {AB(Vision)}, {AB(Odometry)}, {AB(WM)}. However, the WM generates an output event for each event on one of its incoming connections. Therefore, if only one of the components Vision or Odometry were faulty, the number of events on the connection WS would still be larger than 0. As a consequence, we can conclude that either the WM or both the Vision and the Odometry have failed.

We gain a better result by refining the model as shown in Figure 5. The sentences in Figure 5 extend the model in Figure 3. The new property pr_eo holds only if at least one event occurs on a connection during a certain time period. A new sentence is added to the model of the WM component. It states that, if the WM works correctly and the property pr_eo holds for at least one input connection, then it must hold for the connection WS as well. Note that we use a kind of dependence for pr_eo that is different from what we have seen so far. We will call this a partial dependence in Section 3. Now the observers detect two property violations: ¬ok(WS, pr_pe, s_0) and ¬ok(WS, pr_eo, s_0). We obtain a single-fault diagnosis {AB(WM)} and a single dual-fault diagnosis {AB(Vision), AB(Odometry)}, which

obviously resembles the human way of reasoning.

[Figure 4: Graphical representation of the dependences in Fig. 3 for the two example components WM and Planner; the figure distinguishes properties that depend on the component only from properties that also depend on its inputs.]

¬AB(Vision) → ok(OM, pr_eo, s)
¬AB(Odometry) → ok(MD, pr_eo, s)
¬AB(WM) ∧ (ok(OM, pr_eo, s) ∨ ok(MD, pr_eo, s)) → ok(WS, pr_eo, s)

Figure 5: Extension of the model in Fig. 3.

3 Formalizing the Model Framework

In Definition 3.1 we introduce a model which captures the architecture of a component-oriented software system and the dependences between properties.

Definition 3.1 (SAM) A software architecture model (SAM) is a tuple (COMP, CONN, Φ, ϕ, out, in_p, in_t) with:
- a set of components COMP,
- a set of connections CONN,
- a (finite) set of properties Φ,
- a function ϕ : COMP ∪ CONN → 2^Φ, assigning properties to a given component or connection,
- a function out : COMP → 2^CONN, returning the output connections for a given component,
- the (partial) functions in_p and in_t : COMP × CONN × Φ → 2^(CONN × Φ), which express the functional dependences between the inputs and outputs of a given component c. For all output connections e ∈ out(c) and for each property pr ∈ ϕ(e), they return a set of tuples (e′, pr′), where e′ is an input connection of c and pr′ ∈ ϕ(e′) a property assigned to e′.

This definition allows the specification of a set of properties Φ for a specific software system. We introduce the function ϕ in order to assign properties to components and connections. The functions in_p and in_t formalize the functional dependences between properties of the inputs and of the outputs. For each property pr of an output connection, they return a set of input properties PR on which pr depends. Function in_t expresses total dependences: if a component is correct and all properties in PR hold, then pr must hold as well. By contrast, in_p defines partial dependences: if a component is correct and at least one property in PR holds, then pr must hold, too. In our example, the dependence of (WS, pr_eo) on (OM, pr_eo) and (MD, pr_eo) is partial (see Fig. 5). All other dependences in this example are total. Note that [Friedrich et al., 1999] and [Steinbauer and Wotawa, 2005] use only one kind of functional dependence, which is equivalent to what we call a total dependence herein. For example, the part of the SAM which relates to the WM component and its output connection WS is defined as follows (Fig. 3 and 5):

ϕ(WM) = {pr_np}, ϕ(WS) = {pr_pe, pr_cons_OM, pr_eo}
out(WM) = {WS}
in_t(WM, WS, pr_pe) = {(OM, {pr_pe}), (MD, {pr_pe})}
in_t(WM, WS, pr_cons_OM) = {(OM, {pr_pe})}
in_p(WM, WS, pr_eo) = {(OM, {pr_eo}), (MD, {pr_eo})}

The logical model is computed by Algorithm 1. Based on a SAM, it generates the logical system description SD. In line (3), we create those sentences which relate to component properties. In line (4), a logical representation of the dependences between properties is computed; it distinguishes between total and partial dependences. Note that the universal quantification implicitly applies to the variable s, which denotes a discrete snapshot of the system behavior. Each observation (¬)ok(x, pr, s_i) relates to a certain snapshot s_i, where i is the snapshot index. A diagnosis is a solution for all snapshots. The temporal ordering of the different snapshots is not taken into account.
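As an illustration of Definition 3.1, the SAM fragment for the WM component given above can be written down directly, e.g. in Python. This is a sketch added for illustration; the class and field names are ours, not the paper's (and, for brevity, the singleton property sets of the in_t/in_p entries are flattened to plain pairs):

```python
from dataclasses import dataclass

# (component, output connection, property) -> set of (input connection, property) pairs
Dep = dict[tuple[str, str, str], set[tuple[str, str]]]

@dataclass
class SAM:
    comps: set[str]            # COMP
    conns: set[str]            # CONN
    props: set[str]            # Φ
    phi: dict[str, set[str]]   # ϕ: component/connection -> assigned properties
    out: dict[str, set[str]]   # out: component -> output connections
    in_t: Dep                  # total dependences
    in_p: Dep                  # partial dependences

# The WM fragment of the running example (cf. Figures 3 and 5):
sam = SAM(
    comps={"WM"},
    conns={"OM", "MD", "WS"},
    props={"pr_np", "pr_pe", "pr_cons_OM", "pr_eo"},
    phi={"WM": {"pr_np"}, "WS": {"pr_pe", "pr_cons_OM", "pr_eo"}},
    out={"WM": {"WS"}},
    in_t={("WM", "WS", "pr_pe"): {("OM", "pr_pe"), ("MD", "pr_pe")},
          ("WM", "WS", "pr_cons_OM"): {("OM", "pr_pe")}},
    in_p={("WM", "WS", "pr_eo"): {("OM", "pr_eo"), ("MD", "pr_eo")}},
)
```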
It is also important that, provided that the number of snapshots is finite, the logical model which is computed by this algorithm can easily be transformed into propositional Horn clauses; the model is thus amenable to efficient logical reasoning.

4 Runtime Monitoring and Fault Localization

The runtime diagnosis system consists of two modules, the diagnosis module (DM) and the observation module (OM). These modules are executed concurrently. While the DM performs runtime fault detection and localization at the logical level, the OM continuously monitors the software system and provides the abstract observations which are used by the DM. Thus, the OM can be regarded as an abstraction layer between the architecture model, as presented in Section 3, and the running software.

Let us consider the OM first. It basically consists of observers. Each observer comprises a set of rules which specify the desired behavior of a certain part of the software system. A rule is a piece of software which continuously monitors that part. The execution of the rules is concurrent and unsynchronized, and we do not impose any restrictions on the implementation of a rule and its complexity. Furthermore, while

a property is assigned to a single component or connection, a rule may monitor multiple communication links in the target system in order to detect wrong sequences of events. In our example, the rules for ok(WS, pr_cons_OM, s) take the events on the connections WS and OM into account. When a rule detects a violation of its specification, it switches from the state not violated to the state violated.

Algorithm 1: The algorithm for computing the logical model.
Input: The SAM. Output: The system description SD.
COMPUTEMODEL(COMP, CONN, Φ, ϕ, out, in_p, in_t)
(1) SD := {}.
(2) For all c ∈ COMP:
(3)   For all pr ∈ ϕ(c): add ¬AB(c) → ok(c, pr, s) to SD.
(4)   For all e ∈ out(c), for all pr ∈ ϕ(e): add
        ¬AB(c) ∧ ⋀_{(e′,pr′) ∈ in_t(c,e,pr)} ok(e′, pr′, s) → ok(e, pr, s)  and
        ¬AB(c) ∧ ⋁_{(e′,pr′) ∈ in_p(c,e,pr)} ok(e′, pr′, s) → ok(e, pr, s)  to SD.
(5) Return SD.

To each observer a set of atomic sentences is assigned which represents the logical observations. Furthermore, an observer may be enabled or disabled. The rules of a disabled observer are inactive, and the observer does not provide any observations. Disabled observers may be enabled in the course of the fault localization. Note that it is often desired to initially disable those observers which otherwise would cause unnecessary runtime overhead.

Definition 4.1 (Observation Module OM) The OM is a tuple (OS, OS_e), where OS is the set of all available observers and OS_e ⊆ OS the set of those observers which are currently enabled.

Definition 4.2 (Observer) An observer os ∈ OS is a tuple (R, Ω) with:
1. a set of rules R. For a rule r ∈ R, the boolean function violated(r) returns whether a violation of its specification has been detected.
2. a set of atomic sentences Ω. Each atom ω ∈ Ω has the form ok(x, pr, s), where x ∈ COMP ∪ CONN, pr ∈ ϕ(x), and s is a variable denoting an observation snapshot (see Definition 3.1).

An observer detects a misbehavior if one or more of its rules are violated. Let υ(OS_e) denote the set of observers which have detected a misbehavior, i.e. υ(OS_e) = {(R, Ω) ∈ OS_e | ∃r ∈ R : violated(r) = true}. Then the total set of observations of a certain snapshot s_i is computed as shown in Algorithm 2.

Algorithm 2: The algorithm for computing the set of observations.
Input: The set of enabled observers and a constant denoting the current snapshot. Output: The set OBS, which comprises ground literals.
COMPUTEOBS(OS_e, s_i)
(1) OBS := {}.
(2) For all os ∈ OS_e, os = (R, Ω):
(3)   If os ∈ υ(OS_e): add ⋀_{ω ∈ Ω} ¬ω to OBS
(4)   else: add ⋀_{ω ∈ Ω} ω to OBS.
(5) For all atoms α ∈ OBS: substitute s_i for the variable s.
(6) Return OBS.

Algorithm 3 presents the algorithm which is executed by the diagnosis module DM. The inputs to the algorithm are the logical system description SD, which is returned by the computeModel algorithm (Alg. 1), and an observation module OM = (OS, OS_e). In contrast to the work in [Steinbauer and Wotawa, 2005], this algorithm is able to gather additional observations by integrating runtime measurement selection. The algorithm periodically determines whether a misbehavior is detected by an observer. In this case, it waits for a certain period of time (line 6). This gives the observers the opportunity to detect additional symptoms, as it may take some time until faults manifest themselves in the observed system behavior. Thereafter, the diagnoses are computed (line 10) using Reiter's Hitting Set algorithm [Reiter, 1987]. Note that the violated rules are reset to not violated after computing the logical observations (line 9).
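Under the same assumptions as the SAM sketch above, Algorithms 1 and 2 can be mirrored in a few lines of Python. Again, this is an added sketch, not the authors' code: undefined entries of the partial functions in_t/in_p are treated as absent, and an observer is modelled as a (rules, atoms) pair whose rules are callables returning whether they are violated:

```python
def compute_model(sam) -> list[str]:
    # Algorithm 1: a clause per component property (line 3) and clauses for the
    # total/partial dependences of every output-connection property (line 4).
    sd = []
    for c in sorted(sam.comps):
        for pr in sorted(sam.phi.get(c, ())):
            sd.append(f"¬AB({c}) → ok({c},{pr},s)")
        for e in sorted(sam.out.get(c, ())):
            for pr in sorted(sam.phi.get(e, ())):
                tot = sam.in_t.get((c, e, pr))
                if tot is not None:
                    body = " ∧ ".join(f"ok({e2},{p2},s)" for e2, p2 in sorted(tot))
                    head = f"ok({e},{pr},s)"
                    sd.append(f"¬AB({c}) ∧ {body} → {head}" if body else f"¬AB({c}) → {head}")
                par = sam.in_p.get((c, e, pr))
                if par:
                    body = " ∨ ".join(f"ok({e2},{p2},s)" for e2, p2 in sorted(par))
                    sd.append(f"¬AB({c}) ∧ ({body}) → ok({e},{pr},s)")
    return sd

def compute_obs(enabled: list, i: int) -> list[str]:
    # Algorithm 2: negate an observer's atoms iff one of its rules is violated.
    obs = []
    for rules, atoms in enabled:
        fired = any(rule() for rule in rules)
        for x, pr in atoms:
            lit = f"ok({x},{pr},s{i})"
            obs.append(f"¬{lit}" if fired else lit)
    return obs
```

Applied to the SAM fragment above, compute_model(sam) yields, among others, the clause ¬AB(WM) ∧ ok(OM,pr_pe,s) → ok(WS,pr_cons_OM,s) from Figure 3 and the partial-dependence clause for pr_eo from Figure 5.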
Because of this reset, an observer which detects a misbehavior in snapshot s_j may report a correct behavior in s_{j+1}. This is necessary for the localization of multiple faults in the presence of intermittent symptoms.

When we find several diagnoses (lines 11 and 12), it is desirable to enable additional observers in OS \ OS_e. We assume the function ms(SD, OBS, OS, OS_e) to perform a measurement selection, i.e. it returns a set of observers OS_s (OS_s ⊆ OS \ OS_e) whose observations could lead to a refinement of the diagnoses. We do not describe the function ms in this paper. In [de Kleer and Williams, 1987] a strategy based on Shannon entropy to determine the optimal next measurement is discussed. Note that the returned set may be empty, even if no unique diagnosis is derivable. The fault localization is finished when either a unique diagnosis is found or the diagnoses cannot be further refined by enabling additional observers (line 11).

5 Case Studies and Discussion

We implemented the proposed diagnosis system and conducted a series of experiments using the control software of a mobile autonomous soccer robot. We applied a propositional Horn clause theorem prover for consistency checks in the diagnosis engine [Minoux, 1988]. The implemented measurement selection process may enable multiple observers at the same time in order to reduce the time required for fault localization. The components of the control system are executed in

separate applications which interact with each other using CORBA communication mechanisms. The software runs on a Pentium 4 CPU with a clock rate of 2 GHz.

Algorithm 3: The runtime diagnosis algorithm.
Input: The logical system description and the observation module.
PERFORMRUNTIMEDIAGNOSIS(SD, OM)
(1) Do forever:
(2)   Query the observers, i.e. compute the set υ(OS_e).
(3)   If υ(OS_e) ≠ {}:
(4)     Set i := 0, OBS := {}, finished := false, where i is the snapshot index.
(5)     While not finished:
(6)       Wait for the symptom collection period δ_c.
(7)       Recompute υ(OS_e).
(8)       OBS := OBS ∪ OBS_i, where OBS_i := computeObs(OS_e, s_i).
(9)       Reset all rules to not violated.
(10)      Compute D := {Δ | Δ is a minimal diagnosis of (SD, COMP, OBS)}.
(11)      If |D| = 1 or the set OS_s := ms(SD, OBS, OS, OS_e) is empty: start repair, set finished := true.
(12)      Otherwise: set i := i + 1, enable the observers in OS_s, and set OS_e := OS_e ∪ OS_s.

The model of the software system comprises 13 components and 14 connections. We introduced 13 different properties. 7 different types of rules were implemented, and the observation module used 21 instances of these rule types. For the specification of the system behavior we used simple rules which embody elementary insights into the software behavior. For example, we specified the minimum number of processes spawned by certain applications. Furthermore, we identified patterns in the communication among components. A simple insight is the fact that components of a robot control system often produce new events either periodically or as a response to a received event. Other examples are rules which express that the output of a component must change when the input changes, or specifications capturing the observation that the values of certain events must vary continuously.

We simulated software failures by killing single processes in 10 different applications and by injecting deadlocks into these applications. We investigated whether the faults can be detected and located in case the outputs of these components are observed. In 19 out of 20 experiments, the fault was detected and located within less than 3 seconds. In only one case it was not possible to detect the fault because the specification of an important connection would have required information about the physical environment which was not available. Note that we set the symptom collection period δ_c to 1 second (see Alg. 3, line 6), and the fault localization incorporated no more than 2 observation snapshots. Due to the small number of components and connections, the computation of the diagnoses required only a few milliseconds. Furthermore, the overhead (in terms of CPU load and memory usage) caused by the runtime monitoring was negligible, in particular because calls to the diagnosis engine are only necessary after an observer has detected a misbehavior.

Furthermore, we conducted 6 non-trivial case studies in order to investigate more complex scenarios. We injected deadlocks in different applications. We assumed that significant connections are either unobservable or should be observed only on demand, i.e. in the course of the fault localization, because otherwise the runtime overhead would be unacceptable. In 4 scenarios we injected single faults, while in the other cases 2 faults occurred in different components almost at the same time. Moreover, in 2 scenarios the symptoms were intermittent and disappeared during the fault localization. In all of the 6 case studies, the faults could be correctly detected and located.
In two cases, the fault was immediately detected and then located within 2 seconds. In one case the fault was detected after about 5 seconds, and the localization took 2 more seconds. However, in three case studies the simple rules detected the faults only in certain situations, e.g. when the physical environment was in a certain state. For example, in one case study the fault could be detected only when it occurred while the soccer robot was dribbling the ball. The fault was not detected in situations in which the robot did not have the ball.

We gained several insights from our experiments. In general, state-based diagnosis appears to be an appropriate approach for fault localization in a robot control system as a particular example of component-oriented software. We were able to identify simple patterns in the interaction among the components, and by using rules which embody such patterns it was possible to create appropriate models which abstract from the dynamic software behavior. Furthermore, the approach proved to be feasible in practice since the overhead caused by the runtime monitoring is low.

An important issue is how to find the properties for a specific application. Our work aims at software systems comprising large and complex components. At present, for such systems it is rarely the case that formal specifications are available. Thus, it will often be necessary to manually derive the properties from informal (textual and graphical) specifications, which are often coarse and incomplete. It would be desirable to automatically extract the properties from the source code of the software system, for example by relying on assertions (Design by Contract, [Meyer, 1997]). Unfortunately, in general this is not possible for several reasons. First, only a part of the source code of a software system may be available, especially in complex systems which often integrate third-party frameworks and libraries. Second, we cannot expect that the automated extraction of properties is computationally feasible for complex systems. The granularity of properties at the component level is quite different from that of assertions at the source code level, as properties relate to the overall behavior of a component whereas assertions are assigned to functions and classes in the source code and thus define local conditions. Therefore, in order to derive a single property automatically it would, in general, be necessary to take the entire source code (including all assertions) into account, which is computationally infeasible for large systems.

A main problem is the fact that simple rules are often too

coarse to express the software behavior. Such rules may detect faults only in certain situations. Therefore, it may happen that faults are either not detected or that they are detected too late, which could cause damage due to the misbehavior of the software system. The usage of simple rules also has the effect that more connections must be permanently observed than would be the case if more complex rules were used. For example, in the control system we used in our experiments we had to observe more than half of the connections permanently in order to be able to detect severe faults like deadlocks in most of the components.

6 Related Research

There is little work which deals with model-based runtime diagnosis of software systems. In [Grosclaude, 2004] an approach for model-based monitoring of component-based software systems is described. The external behavior of components is expressed by Petri nets. In contrast to our work, the fault detection relies on the alarm-raising capabilities of the components themselves and on temporal constraints. In the area of fault localization in Web Services, the authors of [Ardissono et al., 2005] propose a modelling approach which is similar to ours. Both approaches use grey-box models of components, i.e. the dependences between the inputs and outputs of components are modelled. However, their work assumes that each message (event) on a component output can be directly related to a certain input event, i.e. each output is a response which can be related to a specific incoming request. As we cannot make this assumption, we abstract over a series of events within a certain period of time.

Another approach to model the behavior of software is presented in [Mikaelian and Williams, 2005]. In order to deal with the complexity of software, the authors propose to use probabilistic, hierarchical, constraint-based automata (PHCA). However, their work addresses software which is embedded in hardware systems, and they model the software in order to detect faults in the hardware. The authors of [Mikaelian and Williams, 2005] do not detect software bugs.

In the field of autonomic computing, there are model-based approaches which aim at the creation of self-healing and self-adaptive systems. The authors of [Garlan and Schmerl, 2002] propose to maintain architecture models at runtime for problem diagnosis and repair. Their architecture models comprise components and connectors. Their notion of a component resembles our definition. Similar to our work, they assign properties to components and connectors. The constraints over the properties are defined in a first-order language. However, their work does not employ fault localization mechanisms. Pinpoint [Chen et al., 2002] is a framework for root-cause analysis in large distributed component applications (e.g. e-commerce systems). Pinpoint monitors client requests, uses traffic sniffing and middleware instrumentation to detect failed requests, and then applies data mining techniques to determine which components are likely to be faulty. The advantage of their approach is that it does not rely on static dependency models. No knowledge of the application components is required.

The author of [Auguston, 1998] suggests an approach to assertion checking, debugging, and profiling by building a behavioral model in terms of a number of events (so-called event traces).
Moreover, the author proposes a language to describe computations over event traces and states that algorithmic debugging [Shapiro, 1983] can be considered as an example of a debugging strategy based on a specific assertion language (e.g. assertions about procedure call outcomes). In addition, the authors of [Console et al., 1993] discuss the relationship between algorithmic debugging and MBD. In contrast to the work presented herein, research in the area of model-based software debugging deals with verification and particularly fault localization [Mayer and Stumptner, 2003; Köb and Wotawa, 2004] at compile time. Since limitations on computational and memory resources are less stringent than in runtime diagnosis, most of this research deals with fault localization at the object, statement or expression level, whereas our model focuses on capturing component-level behavior.

Design by Contract [Meyer, 1997] is a lightweight formal technique for runtime detection of specification violations. The trace assertions approach [Brökens and Möller, 2002b; 2002a] extends the Design by Contract approach by specifying the desired behavior of a program in terms of CSP-like processes. This allows for specifying valid events in a systematic fashion, also incorporating abstraction techniques. Similar to our approach, these so-called trace assertions are checked at runtime. However, the work presented in [Brökens and Möller, 2002b; 2002a] focuses on runtime error detection in Java programs and on specification techniques for traces. Our work fits in the same context; however, we focus on fault detection as well as localization, in particular in autonomous software systems. The authors of [Steinbauer and Wotawa, 2005] discuss the repair of component-oriented software systems at runtime. The repair is basically done by restarting failed components.

7 Conclusion and Future Research

This paper presents a model-based diagnosis approach for the detection and localization of faults in component-oriented software systems at runtime. Our model allows arbitrary properties to be introduced and assigned to components and connections. The fault detection is performed by rules, i.e. pieces of software which continuously monitor the software system in order to detect property violations. The fault localization utilizes dependences between properties. We formalize the architecture of a component-oriented software system and the dependences between properties. We employ two different kinds of dependences, total dependences and partial dependences. Moreover, we provide algorithms for the generation of the logical model and for the runtime diagnosis. The runtime fault localization integrates measurement selection by enabling additional observers at runtime. Finally, we discuss case studies which demonstrate that our approach is frequently able to quickly detect and locate faults. We were able to create appropriate models which abstract from the dynamic behavior by relying on simple rules which embody elementary insights into the software system.

The main problem we identified is the fact that simple rules, in contrast to more complex specifications, often detect faults only in certain situations. As a consequence, it may happen that faults are either not detected or they are detected too late. We plan to evaluate our approach in other application domains as well. Another open issue is whether our approach can be adapted to a distributed diagnostic engine, which would be useful for software systems which are distributed over a network. Moreover, our future research will deal with autonomous repair of software systems at runtime.

References

[Ardissono et al., 2005] Liliana Ardissono, Luca Console, Anna Goy, Giovanna Petrone, Claudia Picardi, Marino Segnan, and Daniele Theseider Dupré. Cooperative Model-Based Diagnosis of Web Services. In Proceedings of the 16th International Workshop on Principles of Diagnosis, DX Workshop Series, June 2005.

[Auguston, 1998] Mikhail Auguston. Building program behavior models. In Proceedings of the European Conference on Artificial Intelligence (ECAI), Workshop on Spatial and Temporal Reasoning. IOS Press, 1998.

[Brökens and Möller, 2002a] Mark Brökens and Michael Möller. Dynamic event generation for runtime checking using the JDI. In Klaus Havelund and Grigore Rosu, editors, Proceedings of the Federal Logic Conference Satellite Workshops, Runtime Verification, volume 80 of Electronic Notes in Theoretical Computer Science. Elsevier, July 2002.

[Brökens and Möller, 2002b] Mark Brökens and Michael Möller. Jassda Trace Assertions: Runtime checking the dynamic of Java programs. In Ina Schieferdecker, Hartmut König, and Adam Wolisz, editors, Trends in Testing Communicating Systems, International Conference on Testing of Communicating Systems, pages 39–48, March 2002.

[Brusoni et al., 1998] Vittorio Brusoni, Luca Console, Paolo Terenziani, and Daniele Theseider Dupré. A spectrum of definitions for temporal model-based diagnosis. Artificial Intelligence, 102(1):39–79, 1998.

[Chen et al., 2002] M. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. Pinpoint: Problem determination in large, dynamic Internet services. In Proceedings of the International Symposium on Dependable Systems and Networks (DSN), June 2002.

[Console et al., 1993] Luca Console, Gerhard Friedrich, and Daniele Theseider Dupré. Model-based diagnosis meets error diagnosis in logic programs. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambéry, August 1993.

[de Kleer and Williams, 1987] Johan de Kleer and Brian C. Williams. Diagnosing multiple faults. Artificial Intelligence, 32(1):97–130, 1987.

[Friedrich et al., 1999] Gerhard Friedrich, Markus Stumptner, and Franz Wotawa. Model-based diagnosis of hardware designs. Artificial Intelligence, 111(2):3–39, July 1999.

[Garlan and Schmerl, 2002] David Garlan and Bradley Schmerl. Model-based adaptation for self-healing systems. In WOSS '02: Proceedings of the First Workshop on Self-Healing Systems, pages 27–32, New York, NY, USA, 2002. ACM Press.

[Grosclaude, 2004] Irène Grosclaude. Model-based monitoring of software components. In Proceedings of the 16th European Conference on Artificial Intelligence. IOS Press, June 2004. Poster.

[Jackson, 1995] Daniel Jackson. Aspect: Detecting Bugs with Abstract Dependences. ACM Transactions on Software Engineering and Methodology, 4(2), April 1995.

[Köb and Wotawa, 2004] Daniel Köb and Franz Wotawa. Introducing alias information into model-based debugging. In 16th European Conference on Artificial Intelligence (ECAI), Valencia, Spain, August 2004. IOS Press.
[Mayer and Stumptner, 2003] Wolfgang Mayer and Markus Stumptner. Extending diagnosis to debug programs with exceptions. In Proceedings of the 18th IEEE International Conference on Automated Software Engineering (ASE), Montreal, Quebec, Canada, 2003. IEEE.

[Meyer, 1997] B. Meyer. Object-Oriented Software Construction. Prentice Hall, 2nd edition, 1997.

[Mikaelian and Williams, 2005] Tsoline Mikaelian and Brian C. Williams. Diagnosing complex systems with software-extended behavior using constraint optimization. In Proceedings of the 16th International Workshop on Principles of Diagnosis, DX Workshop Series, 2005.

[Minoux, 1988] Michel Minoux. LTUR: A Simplified Linear-time Unit Resolution Algorithm for Horn Formulae and Computer Implementation. Information Processing Letters, 29:1–12, 1988.

[Patterson et al., 2002] David Patterson, Aaron Brown, Pete Broadwell, George Candea, Mike Chen, James Cutler, Patricia Enriquez, Armando Fox, Emre Kiciman, Matthew Merzbacher, David Oppenheimer, Naveen Sastry, William Tetzlaff, Jonathan Traupman, and Noah Treuhaft. Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies. Technical report, Berkeley, CA, USA, 2002.

[Reiter, 1987] Raymond Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57–95, 1987.

[Shapiro, 1983] Ehud Shapiro. Algorithmic Program Debugging. MIT Press, Cambridge, Massachusetts, 1983.

[Steinbauer and Wotawa, 2005] Gerald Steinbauer and Franz Wotawa. Detecting and locating faults in the control software of autonomous mobile robots. In Proceedings of the 19th International Joint Conference on AI (IJCAI-05), Edinburgh, UK, 2005.

Abstract Dependence Models in Software Debugging

Bernhard Peischl, Safeeullah Soomro, and Franz Wotawa
Technische Universität Graz, Institute for Software Technology, 8010 Graz, Inffeldgasse 16b/2, Austria

Abstract
In this article we introduce and formalize a novel model particularly tailored for detecting and localizing structural faults in procedural programs. Moreover, we discuss the relationship between this model and the well-known functional-dependence model, particularly in the presence of partial specification artifacts like assertions or pre- and postconditions. Furthermore, we present novel results obtained from our most recent case study. Notably, whenever our novel model detects a structural fault, it also appears to be capable of localizing the detected misbehavior's real cause.

(We listed authors in alphabetical order. The Austrian Science Fund (FWF) supports this work under project grant P17963-N04. The Higher Education Commission (HEC), Pakistan, supports this work under its scholarship program.)

1 Introduction

Abstract dependences [Jackson, 1995] are applied in software analysis in various ways (e.g. software maintenance, program understanding, program slicing, refactoring, and also software debugging). Fault localization employing abstract dependences has sound theoretical foundations [Friedrich et al., 1999], and the relationships to other techniques in software engineering, for example program slicing [Weiser, 1984], have been clarified [Wotawa, 2002]. Specifically in MBSD (model-based software debugging), we are aware of two different models relying on abstract dependences. Both models have individual strengths and weaknesses and, as outlined in this article, appear to complement each other in terms of their diagnostic capabilities.

The so-called functional-dependence model (FDM) represents program statements as components and completely abstracts from individual values, referring to variables merely as being correct or incorrect with regard to a given expected behavior. For example, this representation allows for debugging VHDL designs of up to 10 MB of source code [Friedrich et al., 1999]. The FDM captures the program's behavior by stating that whenever a statement is correct and its inputs are known to be correct, then the output must be correct, too. For a detailed treatment of the FDM we refer to [Friedrich et al., 1999].

The verification-based model (VBM) for debugging is an extension of the dependence model from Jackson's Aspect system [Jackson, 1995], which has been used for the verification of C programs. The Aspect system analyzes the dependences between variables of a given program and compares them with the specified dependences. In case of a mismatch the program is said to violate the specification. Otherwise, the program fulfills the specification. Unfortunately, the Aspect system does not allow one to locate the source of a mismatch. The VBM extends Jackson's idea towards not only detecting misbehavior but also localizing the malfunctioning's real cause.

In this article we (1) provide a formalization of the VBM, (2) outline a novel model extension allowing for debugging of procedural programs, (3) discuss the relationship between the FDM and VBM in the presence of partial specifications like assertions by exemplifying specific scenarios in software debugging, and (4) present novel results from our most recent case study.
2 The Verification-Based Model

In the following we explain the basic ideas using the small program below, which implements the computation of the circumference and area of a circle. The program contains one fault in line 2, where a multiplication by π is missing.

0. // pre: true
1. d = r * 2;
2. a = d;           // BUG! a = d * pi;
3. c = r * r * pi;
4. // post: c = r²·π ∧ a = 2·r·π

Informally, a program variable x depends on a variable y if a value for y potentially influences the value of x. In our small example program the variable d in the first line depends on the variable r. Hence, every statement of a program introduces new dependences. All defined variables of an assignment statement, i.e., variables occurring on the left side, depend on all variables of the right side of the assignment statement. Similar rules can be obtained for other statements, as explained later on. These dependences alone are given by a statement whenever we assume that the statement is correct (w.r.t. the dependences). If a statement is assumed to be incorrect, the dependences are not known. We express the latter fact by introducing a new type of variable, the so-called model variables. Model variables are variables that work as

placeholders for program variables. For example, if we assume statement 2 to be incorrect, we introduce a model that says that program variable a depends on model variable ξ_2 (where ξ_2 is unique). The idea behind our approach is to find assumptions about the correctness and incorrectness of statements which do not contradict a given specification. In our running example, the specification is given in terms of a post-condition. From this post-condition we derive that a has to depend on r and pi. However, when assuming statements 1 and 2 to be correct, we derive that a depends on d and d in turn depends on r, which leads to: a depends on r but not on pi. Hence, the computed dependence contradicts the specified one. To get rid of this inconsistency, we might assume line 2 to be faulty. Hence, we can compute that a depends on model variable ξ_2. When now comparing the specification with the computed dependence, we substitute ξ_2 by r and pi, and we can no longer derive an inconsistency.

In the rest of this section we formalize the basic idea. We start with the definition of dependences. The interpretation of a dependence (x, y) of a relation R ∈ D is that x depends on y. The dependence relations for every line of our small program are:

1. d = r * 2;        r_1 = {(d, r)}
2. a = d;            r_2 = {(a, d)}
3. c = r * r * pi;   r_3 = {(c, r), (c, pi)}

Our novel debugging model allows one to reason about functions over dependence relations under given assumptions. Therefore the notion of a dependence relation is fundamental to our approach:

Definition 2.1 (Dependence Relation) Given a program with variables V, and a set of model variables M = {ξ_1, ...}. A dependence relation is an element of the set D = 2^{V × (M ∪ V)}.

For combining the dependences of two consecutive statements we define the following composition operator for dependence relations.

Definition 2.2 (Composition) Given two dependence relations R_1, R_2 ∈ D on V and M. The composition of R_1 and R_2 is defined as follows:

R_1 ∘ R_2 = {(x, y) | (x, z) ∈ R_2 ∧ (z, y) ∈ R_1}
          ∪ {(x, y) | (x, y) ∈ R_1 ∧ ¬∃z : (x, z) ∈ R_2}
          ∪ {(x, y) | (x, y) ∈ R_2 ∧ ¬∃z : (y, z) ∈ R_1}

This definition ensures that no information is lost during the computation of the overall dependence relation for a procedure or method. The first line of the definition of composition handles the case where there is a transitive dependence. The second line states that all dependences that are not re-defined in R_2 are still valid. In the third line, all dependences that are defined in R_2 are in the new dependence set provided that there is no transitivity relation. Note that this composition is not a commutative operation and that {} is the identity element of composition. For example, the combined dependences for our running example are: r_1 ∘ r_2 = {(a, r), (d, r)} = r′ and r′ ∘ r_3 = {(a, r), (d, r), (c, r), (c, pi)} = r″.

In order to allow the direct comparison of specified dependences with the computed ones, we introduce a projection operator which deletes all dependences for variables that are not of interest, like the internal variable d.

Definition 2.3 (Projection) Given a dependence relation R ∈ D and a set of variables A ⊆ M ∪ V. The projection of R on A, written as Π_A(R), is defined as follows: Π_A(R) = {(x, y) | (x, y) ∈ R ∧ x ∈ A}.

For example, Π_{a,c}(r″) is {(a, r), (c, r), (c, pi)}, which is not equivalent to the specification. From here on we assume that the computed dependence relation is always projected onto the variables used within the specification before comparing it with the specification.

Definition 2.4 (Grounded dependence relation) A dependence relation is said to be grounded (or variable-free) if it contains no model variables.

We assume that all specifications are grounded dependence relations. Thus, we have to compare dependence relations containing model variables with grounded dependence relations. We propose a solution similar to that employed in the resolution calculus of first-order logic, namely substitution and finding the most general unifier. However, in contrast to variable substitution in first-order logic, we do not only replace one variable by one term but one model variable by a set of program variables.

Definition 2.5 (Substitution) A substitution σ is a function which maps model variables to a set of program variables, i.e., σ : M → 2^V. The result of the application of the substitution σ on a dependence relation R is a dependence relation where all model variables x in R have been replaced by σ(x).

In order to compare a computed dependence set with the specification we have to find a substitution that makes the computed dependence set equivalent to the specified one. If there is no such substitution, the sets are said to be inconsistent. It remains to show that the approach is feasible in practice. Hence, we have to show that (1) finding a substitution is decidable, and (2) that it can be done efficiently. Since both the set of model variables and the set of program variables (for a given program) are finite, checking all possible combinations for substitutions is possible. The set of model variables is finite because every model variable corresponds to a program statement and the number of statements of a program is finite. The set of program variables is finite because no new variables can be generated at runtime. However, checking all possible combinations is not feasible. Hence, we have to search for a more efficient procedure.

For the purpose of finding an efficient algorithm for computing a substitution that makes a dependence set equivalent to its specification, we first map the problem to an equivalent constraint satisfaction problem (CSP). A CSP [Dechter, 1992; 2003] comprises variables Vars, their domains Dom, and a set of constraints Cons that have to be fulfilled when assigning values to the variables. A value assignment that fulfills all constraints is said to be a solution of the CSP. Every solution to the corresponding CSP is a valid substitution. Hence, we can make use of standard CSP algorithms for computing substitutions.

Algorithm tocsp(R, S)
Input: A dependence set R and a grounded dependence set S.
Output: A corresponding CSP (Var, Dom, Cons).
Definition 2.4 (Grounded dependence relation) A dependence relation is said to be grounded (or variable-free) if it contains no model variables. We assume that all specification are grounded dependence relations. Thus, we have to compare dependence relations containing model variables with grounded dependence relations. We propose a similar solution to that employed in the resolution calculus of first-order logic, namely substitution and finding the most general unifier. However, in contrast to variable substitution in first-order logic, we do not only replace one variable by one term but one model variable by a set of program variables. Definition 2.5 (Substitution) A substitution σ is a function which maps model variables to a set of program variables, i.e., σ : M 2 V. The result of the application of the substitution σ on a dependence relation R is a dependence relation where all model variables x in R have been replaced by σ(x). In order to compare a computed dependence set with the specification we have to find a substitution that makes the computed dependence set equivalent to the specified one. If there is no such substitution the sets are said to be inconsistent. It remains to show that the approach is feasible in practice. Hence, we have to show that (1) finding a substitution is decidable, and (2) can be done efficiently. Since both the set of model variables and the set of program variables (for a given program) are finite, checking all possible combinations for substitutions is possible. The set of model variables is finite because every model variable corresponds to a program statement and the number of statements of a program is finite. The set of program variables is finite because no new variables can be generated at runtime. However, checking all possible combination is not feasible. Hence, we have to search for a more efficient procedure. For the purpose of finding an efficient algorithm for computing a substitution that makes a dependence set equivalent to its specification we first map the problem to an equivalent constraint satisfaction problem (CSP). A CSP [Dechter, 1992; 2003] comprises variables V ars, their domains Dom, and a set of constraints Cons that have to be fulfilled when assigning values to the variables. A value assignment that fulfills all constraints is said to be a solution of the CSP. Every solution to the corresponding CSP is a valid substitution. Hence, we can make use of standard CSP algorithms for computing substitutions. Algorithm tocsp(r,s) Input: A dependence set R and a grounded dependence set S. Output: A corresponding CSP (V ar, Dom, Cons). 204 DX'06 - Peñaranda de Duero, Burgos (Spain)

Figure 1: The associated hyper-graph's structure: unconnected clusters of constraints C_(x,y), one cluster per model variable.

1. Every model variable x of R has a corresponding constraint variable ν_x in Var.
2. The domain Dom is equivalent to the set of all program variables V (for all constraint variables).
3. For all program variables x of V we compute a function θ which maps x to a set of program variables. θ is defined as follows: θ(x) = {y | (x, y) ∈ S ∧ (x, y) ∉ R}.
4. For all elements (x, y) of Π_{x : (x,y) ∈ SPEC}(R) do: If y is a model variable, i.e., y ∈ M, then add a new constraint C_(x,y) to Cons. The scope of C_(x,y) is {ν_y} and the set of valid tuples comprises only one element, θ(x). Note that there is at least a single variable x for any model variable y.

With every CSP instance (Var, Dom, Cons) we associate a hyper-graph (V, H) where V = Var and H denotes the set of constraints. The structure of the associated hyper-graph is rather simple. It simplifies to a constraint graph comprising only unconnected clusters of constraints which belong to the same model variable. Figure 1 exemplifies the associated hyper-graph's structure. We can further improve the computation of solutions: we only have to check whether all constraints which correspond to a model variable have the same valid tuple. If this is the case, then the tuple represents a substitution for the model variable. Otherwise, there is no substitution.

Algorithm findSubstitution(R,S)
Input: A dependence set R and a grounded dependence set S.
Output: A valid substitution that makes R and S equivalent, or ⊥ if there is no such substitution.
1. Let (Vars, Dom, Cons) be tocsp(R,S).
2. For every model variable ξ do:
   (a) If all constraints C_(x,ξ) ∈ Cons have the same valid tuple M′, then σ(ξ) = M′.
   (b) Otherwise return ⊥.
3. Return σ.

Finally, we are now able to define the equivalence of a dependence set and its grounded specification.

Definition 2.6 (Equivalence) A dependence set R is equivalent to its grounded specification S iff there exists a σ = findSubstitution(R, S) and σ(R) = S.

The following example not only shows how dependences are computed but also serves as an example of spurious dependence relations. Consider the program fragment x=y+r; x=x-r. In this program x only depends on variable y and not on r. However, the computation leads to D(x = y + r) = {(x, y), (x, r)} and D(x = x - r) = {(x, x), (x, r)}, and finally we obtain D(x = y + r; x = x - r) = {(x, y), (x, r)}, where the rightmost entry represents a spurious dependence. Hence, it is not possible to compute a minimal set of dependences; only an approximation is possible. To reflect this fact, we employ a weaker criterion than logical equivalence. The following definition is used for checking consistency between the computed dependences and the specification.

Definition 2.7 (Contradiction, Fulfillment) Let R be the computed dependence relation for a program P, i.e., R = D(P) under a given set of assumptions A, and S a specification, i.e., a grounded set of dependences. We say that R fulfills S if there exists a substitution σ = findSubstitution(R, S) and σ(R) ⊇ S. Otherwise, R contradicts S.

Hence, in the above example the computed dependence set (when assuming lines 1 and 2 to be correct) fulfills the specification {(x, y)} although they are not equivalent. Finding a bug is now done by finding a set of assumptions that fulfills the given specification.
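The following Python sketch mirrors tocsp and findSubstitution under our own encoding: model variables are strings starting with "xi", and a grounded specification is a set of (target, source) pairs. The names to_csp and find_substitution are illustrative, not part of the model.

def to_csp(r, s):
    # One constraint per pair (x, xi) in R; its only valid tuple is theta(x),
    # where theta(x) = {y | (x, y) in S and (x, y) not in R}.
    constraints = []
    for (x, y) in r:
        if y.startswith("xi"):
            theta = frozenset(v for (t, v) in s if t == x and (x, v) not in r)
            constraints.append((y, theta))
    return constraints

def find_substitution(r, s):
    # All constraints belonging to one model variable must agree on the tuple.
    sigma = {}
    for y, theta in to_csp(r, s):
        if sigma.setdefault(y, theta) != theta:
            return None     # conflicting tuples: no substitution exists
    return sigma

# AB(1) in the running example: projected dependences and the specification.
r_ab1 = {("c", "r"), ("c", "pi"), ("a", "xi1")}
spec = {("c", "r"), ("c", "pi"), ("a", "r"), ("a", "pi")}
print(find_substitution(r_ab1, spec))    # {'xi1': frozenset({'pi', 'r'})}

Because the constraint clusters per model variable are unconnected, no backtracking search is needed; a single pass suffices.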
Model-based diagnosis algorithms, e.g., Reiter's hitting set algorithm [Reiter, 1987; Greiner et al., 1989], can be used for this purpose. Formally, it remains to introduce how to extract dependence information from the source code. Figure 2 shows the appropriate rules. In the figure, function D returns the dependences for a given statement and function M returns the variables employed within a given statement. Moreover, function vars returns a given expression's variables. For more details about how to extract dependences we refer the reader to [Jackson, 1995]. Note that the proposed model is different from the model employed in [Jackson, 1995]; in contrast to the model proposed there, we employ a different operator for statement composition. In dealing with programs relying on procedural abstraction we have to (1) extend our model with rules for mapping formal parameters to actual ones, (2) clarify how to handle return values, and (3) incorporate recursive invocations. Figure 2 outlines the rules for dealing with procedures. The first part of rule 6 states that we first compute the dependences of the procedure's body and afterwards substitute the formal parameters by the actual ones. After having obtained the procedure's dependences (including actual parameters) we have to identify those actuals influencing the variables appearing in the procedure's return statements. The second part states how to establish the relationship between the return variables and the target variable of the calling context. If we assume an invocation to be abnormal, we introduce a single model variable for every occurrence of a certain procedure. For recursive invocations (in all cases where we obtain a cyclic call graph) we have to perform a fixpoint analysis. In order to guarantee that the computed dependences increase monotonically w.r.t. the subset relation, we add the dependences for the procedural invocation to those of the calling context (see rule 5 for procedure invocations). Thus, at the cost of over-approximating dependences, we can safely assume that there is always a fixpoint.
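For loops (rule 3 in Figure 2) and, analogously, for recursive procedures, the fixpoint analysis can be sketched as follows in Python; compose is the composition operator of Definition 2.2, repeated so the sketch stays self-contained, and all other names are ours.

def compose(r1, r2):
    # Composition operator of Definition 2.2.
    t1, t2 = {x for (x, _) in r1}, {x for (x, _) in r2}
    return ({(x, y) for (x, z) in r2 for (z2, y) in r1 if z == z2}
            | {(x, y) for (x, y) in r1 if x not in t2}
            | {(x, y) for (x, y) in r2 if y not in t1})

def loop_dependences(d_s, m_s, cond_vars):
    # Accumulate the union of D(W_i) over all unfoldings of `while b do S`,
    # with D(W_i) = D(S) o D(W_{i-1})  U  (M(S) x vars(b)) and D(W_0) = {}.
    acc, w_prev = set(), set()
    while True:
        w_i = compose(d_s, w_prev) | {(x, b) for x in m_s for b in cond_vars}
        if w_i <= acc:
            return acc                   # fixpoint reached: no new dependences
        acc |= w_i
        w_prev = w_i

# Example: while i < n do { s = s + a; i = i + 1 } stabilizes after two
# unfoldings.
print(loop_dependences({("s", "s"), ("s", "a"), ("i", "i")}, {"s", "i"}, {"i", "n"}))

Because the accumulated set grows monotonically and is bounded by the finite set of variable pairs, the iteration always terminates.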

1. Assignments:
¬Ab(x = e) → D(x = e) = {(x, v) | v ∈ vars(e)}, where vars is assumed to return all variables which are used in expression e; M(x = e) = {x}.
Ab(x = e) → D(x = e) = {(x, ξ_ι)}.

2. Conditionals:
¬Ab(if e then S_1 else S_2) → D(if e then S_1 else S_2) = (D(S_1) ∪ (M(S_1) × vars(e)), D(S_2) ∪ (M(S_2) × vars(e)));
M(if e then S_1 else S_2) = M(S_1) ∪ M(S_2).
Ab(if e then S_1 else S_2) → D(if e then S_1 else S_2) = (D(S_1) ∪ (M(S_1) × {ξ_ι}), D(S_2) ∪ (M(S_2) × {ξ_ι})).

3. Loops: W_i = if b then { S; W_{i-1} }, with D(W_0) = {}.
D_¬AB(W_i) = D(S; W_{i-1}) ∪ (M(S; W_{i-1}) × vars(b)) = D(S) • D_¬AB(W_{i-1}) ∪ (M(S) × vars(b)),
D_AB(W_i) = D(S; W_{i-1}) ∪ (M(S; W_{i-1}) × {ξ_ι}) = D(S) • D_AB(W_{i-1}) ∪ (M(S) × {ξ_ι}),
¬AB(while b do S) → D(while b do S) = ∪_i D_¬AB(W_i),
AB(while b do S) → D(while b do S) = ∪_i D_AB(W_i).
Note that we introduce a single model variable ξ_ι for every unfolding W_i in terms of the abnormal behavior of the conditional. D_¬AB(W_i) denotes the dependences for the correct behavior of the conditional, and D_AB(W_i) denotes the dependences for the abnormal behavior.

4. No-operation (NOP): D(nop) = {}; M(nop) = {}.

5. Sequence of statements:
D(S_1; S_2) =
  D(S_1) ⊙ (D(S_then), D(S_else))   if S_2 = if e then S_then else S_else,
  D(S_1) • D(S_2) ∪ D(S_2)           if S_2 = t = proc(a_1, a_2, ..., a_n),
  D(S_1) • D(S_2)                    otherwise;
M(S_1; S_2) = M(S_1) ∪ M(S_2), where R_1 ⊙ (R_2, R_3) = (R_1 • R_2) ∪ (R_1 • R_3).

6. Procedures:
D(proc(a_1, ..., a_n)) = D(body(proc(f_1, ..., f_n))) • {(f_i, a_i) | i ∈ {1..n}}, where D(body(proc(f_1, ..., f_n))) denotes the dependences of the procedure's body including the formal parameters f_1, ..., f_n.
¬Ab(t = proc(a_1, a_2, ..., a_n)) → D(t = proc(a_1, a_2, ..., a_n)) = {t} × {v | (x, v) ∈ D(proc(a_1, ..., a_n)), x ∈ return(proc)},
Ab(t = proc(a_1, a_2, ..., a_n)) → D(t = proc(a_1, a_2, ..., a_n)) = {(t, ξ_ι)},
where t denotes the target variable and return(proc) is a function returning the return variables of the procedure proc.

Figure 2: The verification-based model.

We illustrate the basic definitions and the algorithms using our running example program, where the area and the circumference of a circle are computed. The case where we assume that all statements work correctly was captured previously. For example, we might assume that line 1 is faulty (AB(1)):

1. d = r * 2;        r_1 = {(d, ξ_1)}
2. a = d;            r_2 = {(a, d)}
3. c = r * r * pi;   r_3 = {(c, r), (c, pi)}

The summarized dependence R_1 (after projection on the relevant variables, that is, after applying Π_{c,a}) is {(c, r), (c, pi), (a, ξ_1)}. In projecting R_1 on the set A, A refers to the target variables contained in the specification. We now compare R_1 with the specification S = {(c, r), (c, pi), (a, r), (a, pi)} and see that they are equivalent when using the substitution σ(ξ_1) = {r, pi}. Hence, line 1 is a possible fault location. For computing diagnoses we solve the CSP given in Section 2. In practice, we solve this CSP for every statement assumed to be erroneous; thus for a specific CSP only a single model variable is present. However, as Figure 1 suggests, this procedure can easily be extended towards searching for multiple-fault diagnoses. In a similar fashion, we obtain AB(2) as a possible candidate. In contrast to this, AB(3) does not yield a valid substitution and thus cannot be responsible for the differences between the specified and computed dependences. All other assumptions are supersets of diagnoses already obtained. Hence, we stop searching for bug locations and return two single-fault diagnoses, i.e., {AB(1)} and {AB(2)}.
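Enumerating single-fault candidates thus amounts to re-running the fulfillment check of Definition 2.7 once per abnormality assumption. A self-contained Python sketch for the running example follows; the projected dependence relations per assumption are hard-coded, and all identifiers (fulfills, CANDIDATES, the "xi" encoding) are ours.

# Projected dependence relations of the running example, one per
# single-fault assumption AB(i); "xi1" etc. encode the model variables.
CANDIDATES = {
    "AB(1)": {("c", "r"), ("c", "pi"), ("a", "xi1")},
    "AB(2)": {("c", "r"), ("c", "pi"), ("a", "xi2")},
    "AB(3)": {("a", "r"), ("c", "xi3")},
}
SPEC = {("c", "r"), ("c", "pi"), ("a", "r"), ("a", "pi")}

def fulfills(r, s):
    # Definition 2.7: find a substitution sigma and check sigma(R) >= S.
    sigma = {}
    for (x, y) in r:
        if y.startswith("xi"):
            theta = frozenset(v for (t, v) in s if t == x and (x, v) not in r)
            if sigma.setdefault(y, theta) != theta:
                return False
    grounded = ({(x, y) for (x, y) in r if y not in sigma}
                | {(x, v) for (x, y) in r if y in sigma for v in sigma[y]})
    return grounded >= s

print([ab for ab, deps in CANDIDATES.items() if fulfills(deps, SPEC)])
# ['AB(1)', 'AB(2)'] -- no substitution for AB(3) covers (a, pi)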
3 Comparing Fault Localization Models

The model comparison we present in the following relies on a couple of (reasonable) assumptions. First, for the FDM we need to have a test case judging the correctness of specific variables. In general, finding an appropriate test case revealing misbehavior w.r.t. specific variables is a difficult task; however, the presence of such a single test case is a requirement for the applicability of the FDM. For the VBM, we assume an underlying assertion language and a mechanism for deducing dependence specifications from this language. Dependences are further oriented according to last-assigned variables and specified in terms of inputs or input parameters rather than intermediate variables. For simplicity, we further assume that there are no disjunctive post-conditions. In the following we illustrate the introduced models' strengths and weaknesses in terms of simple scenarios. In the figures, the left-hand side is a summary of the FDM, including the observations obtained from running the test case, and the right-hand side outlines the VBM. For both columns we summarize the obtained diagnosis candidates in terms of the set DIAG. Note that we only focus on single-fault diagnoses throughout the following discussion. Figure 3 outlines a code snippet together with the assertion checking a certain property, the FDM, and the specified and computed dependences.

1 proc (a,b) {...
2   x = a + b;
3   y = a / b;   // instead of y = a * b
4   assert (y == a * b)
5 } ...

FDM: ¬AB(2) ∧ ok(a) ∧ ok(b) → ok(x);  ¬AB(3) ∧ ok(a) ∧ ok(b) → ok(y)
observed: ok(a), ok(b), ¬ok(y)
DIAG = {{AB(3)}}

VBM: SPEC(proc) = {(y, a), (y, b)};  dep(proc) = {(y, a), (y, b)}
dep(proc) fulfills SPEC(proc)
DIAG = {{}}

Figure 3: Code snippet, FD model, and specified and computed dependences.

Obviously, the VBM is unable to detect and thus localize this specific (functional) fault. In contrast to this, the FDM is able to localize this specific fault. Due to the failed assertion we can conclude that there is something wrong with variable y; thus ¬ok(y) holds. We can also assume that inputs a and b are correct; thus the assumptions ok(a) and ok(b) directly deliver line 3 (AB(3)) as the sole single-fault diagnosis.

1 proc (a,b,x,y) {...
2   x = a + b;
3   x = x + 2;   // instead of y = x + 2
4   assert (y == x + 2, x == a + b)
5 } ...

FDM: ¬AB(2) ∧ ok(a) ∧ ok(b) → ok(x′);  ¬AB(3) ∧ ok(x′) → ok(x″)
observed: ¬ok(x″), ok(a), ok(b)
DIAG = {{AB(2)}, {AB(3)}}

VBM: SPEC = {(y, a), (y, b), (x, a), (x, b)};  dep(proc) = {(x, a), (x, b)}
dep(proc) does not fulfill SPEC(proc)
σ(ξ_2) = {}, σ(ξ_3) = {}
DIAG = {}

Figure 4: The misplaced left-hand side variable.

Moreover, as Figure 4 illustrates, although the VBM allows for detecting misplaced left-hand side variables, the VBM cannot localize this kind of fault. Assume that a = 1, b = 1, x = 2; thus y = 4. Our assertion suggests to assume the dependences {(y, a), (y, b), (x, a), (x, b)}. Obviously, both models allow for detecting the fault. When employing the FDM, from the raised assertion we know that ¬ok(x″) holds. In order to conclude that the outcome of statement 3 is correct, we need to know that x is correct prior to this statement's execution. Thus, to obtain the contradiction we have to assume that both statements are correct. By reverting the correctness assumption about statement 2 we obviously can remove the contradiction. Moreover, reverting the assumption about statement 3 also resolves the contradiction. Thus, we obtain two single-fault diagnoses, AB(2) and AB(3). In contrast to this, since y never appears as a target variable, we cannot obtain dependences for variable y, and thus the VBM cannot localize this kind of (structural) fault.

The next example points out that the VBM fails in case the fault introduces additional dependences.

1 proc (a,b,c,d) {...
2   x = a + b;
3   y = x + c + d;   // instead of y = x + c
4   assert (y == x + c)
5 } ...

FDM: ¬AB(2) ∧ ok(a) ∧ ok(b) → ok(x);  ¬AB(3) ∧ ok(x) ∧ ok(c) ∧ ok(d) → ok(y)
observed: ¬ok(y), ok(a), ok(b)
DIAG = {{AB(2)}, {AB(3)}}

VBM: SPEC(proc) = {(y, a), (y, b), (y, c)}
dep(proc) = {(y, a), (y, b), (y, c), (y, d), (x, a), (x, b)}
dep(proc) fulfills SPEC(proc)
DIAG = {{}}

Figure 5: A typical (structural) fault inducing additional dependences.

In Figure 5 we assign x + c + d instead of x + c to the variable y. Our assertion indicates that y depends upon x and c; thus SPEC(proc) = {(y, a), (y, b), (y, c)}. Computing the program's actual dependences dep(proc), however, yields {(y, a), (y, b), (y, c), (y, d)} ⊇ {(y, a), (y, b), (y, c)}, and thus the VBM can neither detect this specific malfunctioning nor locate the misbehavior's cause. By employing the FDM under the observation ¬ok(y) we obtain two single-fault diagnoses, AB(2) and AB(3). Stumptner [Stumptner, 2001] shows that localizing structural faults requires exploiting design information like assertions and pre- and post-conditions. Again, we outline this in terms of a few small examples.
Although the previous examples show that the VBM can neither detect nor localize certain types of faults, it may provide reasonable results in capturing structural faults. Figure 6 illustrates an example where the fault manifests itself in inducing fewer dependences than specified. Our specification is SPEC(proc) = {(y, a), (y, b), (y, c)}. Obviously, the computed dependences {(y, a), (y, b)} do not fulfill SPEC(proc). As the figure outlines, we obtain two single-fault diagnosis candidates, AB(2) and AB(3). In this case, the FDM is also capable of delivering the misbehavior's real cause; it returns the same two single-fault diagnosis candidates: AB(2) and AB(3).

1 proc (a,b,c) {...
2   x = a + b;
3   y = x;   // instead of y = x + c
4   assert (y == a + b + c)
5 } ...

FDM: ¬AB(2) ∧ ok(a) ∧ ok(b) → ok(x);  ¬AB(3) ∧ ok(x) → ok(y)
observed: ¬ok(y), ok(a), ok(b)
DIAG = {{AB(2)}, {AB(3)}}

VBM: SPEC(proc) = {(y, a), (y, b), (y, c)};  dep(proc) = {(y, a), (y, b)}
dep(proc) does not fulfill SPEC(proc)
σ(ξ_2) = {a, b, c}, σ(ξ_3) = {a, b, c}
DIAG = {{AB(2)}, {AB(3)}}

Figure 6: A typical (structural) fault inducing fewer dependences than specified.

Our final example, in Figure 8, illustrates that both approaches might deliver reasonable but different results. We assume a = 1, b = 1, e = 0; thus we expect z = 2 and d = 0. However, due to the introduced fault, we obtain z = 1 and d = 0. Since the value of z is incorrect, but d = 0, we conclude that ¬ok(z) and ok(d) hold. Thus, we obtain AB(2) and AB(4) as diagnosis candidates. Note that this result primarily roots in the coincidental correctness of variable d. Given the assertion in Figure 8 we are aware of the dependences {(d, a), (d, b), (d, e), (z, a), (z, b), (z, e)}. As the figure outlines, we obtain the two single-fault diagnoses AB(2) and AB(3). As is also indicated in the figure, when solely employing a single assertion requiring z == c + d, we obtain SPEC′(proc) = {(z, a), (z, b), (z, e)}, and dep′(proc) does not fulfill SPEC′(proc). Consequently, we obtain 3 diagnoses in this case. However, even when employing the FDM we cannot exclude a single statement; thus, in this specific case, both models deliver the same accuracy.

The examples outlined above should have made clear that a comparison of both models in terms of their diagnostic capabilities inherently depends on how we deduce observations from violated properties. Note that the FDM itself cannot detect any faults; rather, faults are detected by evaluating the assertions on the values obtained from a concrete test run. The VBM can reliably detect and localize faults that manifest in missing dependences on the right-hand side of an assignment statement. Due to the over-approximation of dependences and the definition of the fulfillment criterion (see Definition 2.7) we cannot locate faults manifesting in additional dependences, as it is impossible to distinguish whether (1) the specification is incomplete, (2) the model computes spurious dependences, or (3) an unwanted dependence is present due to a fault.

Table 1 summarizes the illustrated examples by listing the individual models' fault detection and localization capabilities. For those examples where both models deliver diagnosis candidates, we checked whether the diagnoses provided by the VBM are a subset of those provided by the FDM.

example   FDM det.  FDM loc.  VBM det.  VBM loc.  diags(VBM) ⊆ diags(FDM)
Fig. 3    yes       yes       no        no        -
Fig. 4    yes       yes       yes       no        -
Fig. 5    yes       yes       no        no        -
Fig. 6    yes       yes       yes       yes       yes
Fig. 8    yes       yes       yes       yes       no

Table 1: Summary on the outlined scenarios.

In order to compare different models of programs for fault detection and localization, we first introduce the debugging problem formally. Similar to Reiter's definition of a diagnosis problem [Reiter, 1987], a debugging problem is characterized by the given program and its expected behavior. In contrast to Reiter, we assume the existence of a specification that captures the whole expected behavior, and not only behavioral instances as given by the set of observations OBS in Reiter's original definition.

Definition 3.1 (Debugging problem) A debugging problem is characterized by a tuple (Π, SPEC), where Π is a program written in a certain programming language and SPEC is a (formal) specification of the program's intended behavior.

Figure 7: The (open) relationship between VBM and FDM: a lattice with top node DIAG = {x | x is a stmnt of p}, bottom node DIAG = {}, and the incomparable nodes DIAG = VBM(p) and DIAG = FDM(p) in between.

The debugging problem now can be separated into three parts:
1. Fault detection: answer the question "Does Π fulfill SPEC?". In case a program fulfills (does not fulfill) its specification we write Π ∧ SPEC ⊭ ⊥ (Π ∧ SPEC ⊨ ⊥, respectively).
2. Fault localization: find the root cause in Π which explains a behavior not given in SPEC.
3. Fault correction: change the program such that Π fulfills SPEC.

Note that SPEC is not required to be a formal specification. It might represent an oracle, i.e., a human, who is able to answer all questions regarding program Π. In this paper we focus on the first two tasks of the debugging problem. Because fault localization and correction can only be performed after identifying a faulty behavior, from here on we assume only situations where Π ∧ SPEC ⊨ ⊥. The question now is how such situations can be detected in practice. The availability of a specification that is able to answer all questions is an assumption which is hardly possible (not to say impossible) to fulfill. What we have in practice is a partial specification. Therefore, we are only able to detect a faulty behavior, not to prove correctness. Obviously, different kinds of specifications may lead to different results for the first task of the debugging problem, i.e., identifying a faulty behavior. In the context of this article the question about the satisfiability of Π ∧ SPEC is reduced to checking the satisfiability of two sentences, i.e., FDM(Π) ∧ SPEC_FDM ⊨ ⊥ and VBM(Π) ∧ SPEC_VBM ⊨ ⊥, where SPEC_FDM and SPEC_VBM are the partial specifications belonging to the FDM and the VBM, respectively.

In comparing both models, we start by contrasting the well-known artifacts in the area of MBSD. Table 2 summarizes the most notable differences in employing the VBM and FDM for fault localization. In both models we employ a partial specification (e.g., test case, assertion, invariant) for deducing a number of observations. Whereas the VBM encodes observations in terms of dependence relations, the FDM relies on a program's execution and the subsequent classification of the observed variables. Variables are merely classified as being correct or incorrect with respect to a given (partial) specification.

artifact            VBM                                   FDM
observations        dependence relations                  ok, ¬ok
system descr.       functions over dependence relations   Horn clauses
fault detect.       VBM(Π) ∧ SPEC_VBM ⊨ ⊥                 FDM(Π) ∧ SPEC_FDM ⊨ ⊥
fault localiz.      VBM(Π) ∧ SPEC_VBM ∧ assumptions       FDM(Π) ∧ SPEC_FDM ∧ assumptions
assumptions         variable substitution ξ = ...         AB
theorem prover      CSP solver                            Horn clause theorem prover
structural faults   detect., localiz.                     detect., localiz.
functional faults   no detect., no localiz.               detect., localiz.

Table 2: Comparing the most common artifacts.

Furthermore, whereas the VBM models the program in terms of functions over dependence relations, the FDM captures the program's behavior by a number of logical sentences; in particular, we employ a Horn clause theory. The VBM detects a fault by checking whether the system description fulfills the given specification according to the criterion given in Definition 2.7. In case this relationship does not hold, a fault has been detected. In contrast, we detect a fault with the FDM if the system description together with the specification yields a logical contradiction. The VBM locates possible causes for detected misbehavior by assuming that specific statements depend on model variables, and checking whether there is a valid substitution for fulfillment (see Definition 2.7). As outlined in Section 2, this process is efficiently done by solving a CSP. Instead, the FDM employs a Horn clause theorem prover under assumptions of statement abnormality in computing diagnosis candidates. Note that whereas the FDM does not assume any specific faulty behavior for statements, the VBM assumes specific dependences guided by the specification.

As indicated by the examples above, the VBM is tailored towards the detection and localization of structural faults, whereas the FDM may capture structural but particularly functional faults. Similar to static slicing, which captures control as well as data flow dependences, the FDM must comprise all statements responsible for the computation of an erroneous variable. Thus, the FDM always provides diagnosis candidates in the presence of an erroneous variable. The author of [Wotawa, 2002] points out that the FDM delivers at least the same results as static slicing. Moreover, we know that the misbehavior's real cause is always among the delivered diagnosis candidates when employing the FDM. This perspective is supported by theoretical foundations [Friedrich et al., 1999] as well as practical evidence in numerous case studies. Particularly, a comparison w.r.t. the accuracy and completeness of the obtained diagnoses is of interest. Figure 7 summarizes the relationship of the FDM and the VBM regarding their abilities of checking satisfiability. The lines between the nodes building up the lattice denote a subset relationship. As illustrated by the examples, there are debugging problems where the VBM allows for finding a discrepancy but the FDM does not, and vice versa.

1 proc (a,b,e) {...
2   c = a;        // should be c = a + b
3   d = c * e;
4   z = c + d;
5   assert (z == c + d, [d == c * e])
6 } ...

FDM: ¬AB(2) ∧ ok(a) → ok(c);  ¬AB(3) ∧ ok(c) ∧ ok(e) → ok(d);
     ¬AB(4) ∧ ok(c) ∧ ok(d) → ok(z)
observed: [ok(d)], ¬ok(z)
DIAG = {{AB(2)}, {AB(4)}}
without observing ok(d): DIAG = {{AB(2)}, {AB(3)}, {AB(4)}}

VBM: SPEC(proc) = {(z, a), (z, b), (z, e), (d, a), (d, b), (d, e)}
dep(proc) = {(z, a), (z, e), (d, a), (d, e)}
dep(proc) does not fulfill SPEC(proc)
σ(ξ_2) = {a, b}, σ(ξ_3) = {a, b, e}, σ(ξ_4) = {}
DIAG = {{AB(2)}, {AB(3)}}
SPEC′(proc) = {(z, a), (z, b), (z, e)};  dep′(proc) = {(z, a), (z, e)}
dep′(proc) does not fulfill SPEC′(proc)
σ(ξ_2) = {a, b}, σ(ξ_3) = {a, b, e}, σ(ξ_4) = {a, b, e}
DIAG = {{AB(2)}, {AB(3)}, {AB(4)}}

Figure 8: A degenerated example (error masking); diags(FDM) ≠ diags(VBM).
4 Case Studies

In [Peischl et al., 2006] we present first experimental results indicating our approach's applicability. The results presented there solely stem from programs without procedures. In the following we extend these results with results obtained from programs comprising procedures. In evaluating the model's fault localization capabilities in the presence of procedural abstraction, we decompose a program into several procedures in a step-by-step fashion. This procedure allows for a first evaluation of both the model for (1) parameter passing and (2) the handling of return values. Table 3 summarizes our most recent results. Specifically, the program eval evaluates an arithmetic expression over the variables r, h, c, d, e, and f and assigns the result to z. The specification says that the left-hand side z depends on the variables r, h, c, d, e, and f. We introduced a single structural fault and decomposed this program by adding procedures computing specific subexpressions in a step-by-step fashion. A specific subexpression is thus evaluated by a single procedure and replaced by the variable capturing this procedure's evaluation. We refer to the decomposed program comprising i methods as eval(i). In the remaining programs, which perform simple computations like taxes or evaluate simple arithmetic expressions, we also introduced a single structural fault. Removing certain dependences from the specification allows for evaluating our model's capabilities in localizing structural faults in the presence of partial knowledge of the dependences of the output variables. Thus, we observed a subset of the output dependences involving up to 5 variables and recorded the minimum and maximum number of diagnosis candidates.

method        LOC   total dep. no.   min, max no. of diagnosis candidates
eval(1)
eval(2)
eval(3)
eval(4)
eval(5)
sum
arithmetics
tax comp.
calculator

Table 3: Number of single-fault diagnosis candidates with decreasing number of specified output variables.

For example, regarding the program eval(3) we obtained 4 diagnosis candidates when observing all outputs. Afterwards we selected 2 output variables out of the 3 output variables, and for all possible combinations of selecting 2 out of 3 outputs, we recorded the number of diagnoses. The table specifies the minimal and maximal number of diagnosis candidates obtained in this way (in this specific case of considering 2 output variables we obtain at least 4 and at most 13 diagnosis candidates). We checked whether or not the introduced faults appear among the delivered diagnosis candidates. In all our experiments, we have been able to locate the misbehavior's real cause. Furthermore, the table lists the number of total dependences (column 3) and the program's size in terms of lines of code (column 2). Our experiments indicate an increase in the number of candidates with a decreasing number of outputs being considered. In the table, we did not take into account cases where the reduced output dependences are not capable of detecting the fault; in such cases our approach obviously returns {}. In summary, the obtained results confirm the findings in [Hamscher and Davis, 1984]: as our problem becomes under-constrained by removing certain output dependences, the number of diagnosis candidates may increase drastically. As our experiments indicate, this also appears to hold for the novel model introduced herein.

5 Conclusion and Future Research

In this article we extended and formalized the so-called verification-based model [Peischl et al., 2006], specifically tailored towards detecting and localizing structural faults. We discussed the relationship between this model and the well-known functional dependence model [Friedrich et al., 1999] by exemplifying the weaknesses and strengths of both models. Our examples show that there are debugging problems where the verification-based model delivers different diagnoses than the functional-dependence model, and vice versa. Furthermore, we presented case studies we conducted recently. Notably, whenever our novel model detects a structural fault, it also appears to be capable of localizing the misbehavior's real cause. A future research challenge is the empirical evaluation of the modeling approaches discussed herein. Most notably, this addresses issues such as the evaluation of the proposed operator for the compound statement as well as the criteria for relating the (conservatively approximated) program dependences to the specified ones.

References

[Dechter, 1992] Rina Dechter. Encyclopedia of Artificial Intelligence, chapter Constraint Networks. John Wiley & Sons, 2nd edition, 1992.

[Dechter, 2003] Rina Dechter. Constraint Processing. Morgan Kaufmann, 2003.

[Friedrich et al., 1999] Gerhard Friedrich, Markus Stumptner, and Franz Wotawa. Model-based diagnosis of hardware designs. Artificial Intelligence, 111(2):3–39, July 1999.

[Greiner et al., 1989] Russell Greiner, Barbara A. Smith, and Ralph W. Wilkerson. A correction to the algorithm in Reiter's theory of diagnosis. Artificial Intelligence, 41(1):79–88, 1989.

[Hamscher and Davis, 1984] Walter C. Hamscher and Randall Davis. Diagnosing circuits with state - an inherently underconstrained problem. In Proceedings of the National Conference on Artificial Intelligence (AAAI). Morgan Kaufmann, 1984.

[Jackson, 1995] Daniel Jackson. Aspect: Detecting Bugs with Abstract Dependences. ACM Transactions on Software Engineering and Methodology, 4(2), April 1995.

[Peischl et al., 2006] Bernhard Peischl, Safeeullah Soomro, and Franz Wotawa. Towards lightweight fault localization in procedural programs. In Proceedings of the 19th International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (IEA/AIE 2006), Lecture Notes in Artificial Intelligence (LNAI). Springer Verlag, 2006. To appear.

[Reiter, 1987] Raymond Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57–95, 1987.

[Stumptner, 2001] Markus Stumptner. Using design information to identify structural software faults. In AI'01: Proceedings of the 14th Australian Joint Conference on Artificial Intelligence, London, UK, 2001. Springer-Verlag.

[Weiser, 1984] Mark Weiser. Program slicing. IEEE Transactions on Software Engineering, 10(4), July 1984.

[Wotawa, 2002] Franz Wotawa. On the Relationship between Model-Based Debugging and Program Slicing. Artificial Intelligence, 135(1–2), 2002.

A Bayesian Approach to Fault Isolation with Application to Diesel Engine Diagnosis

Anna Pernestål and Mattias Nyberg, Scania CV AB, Sweden
Bo Wahlberg, KTH Signals, Sensors and Systems, Sweden

Abstract

This paper considers a Bayesian approach to fault isolation. Given a set of measurements from the system, and a set of possible faults, the task is to calculate the probability that the faults are present. This probability can then be used to rank the faults, or for decisions on fault accommodation. The method requires the conditional probability distribution describing how the measurements react to the faults. In particular, the structure of dependencies between the tests is important. Knowing the structure facilitates efficient computation methods and makes it possible to reduce the memory capacity needed. In this paper, the structure is estimated from training data using Bayesian methods. The method is applied to diagnosis of the gas flow in a diesel engine.

1 Introduction

Fault isolation concerns the problem of localizing faults in technical processes. This is a most important problem in all fields of industrial systems. Our motivating application is on-board fault isolation for diesel engines, where maintenance and repair procedures together with new emission regulations put challenging demands on the corresponding diagnosis systems. Other challenges are noise and model errors, which introduce uncertainty into the diagnosis process, and the limited storage capacity in the on-board control unit where the isolation system is to be implemented. Further, industrial systems are often large and complex, and it is impossible to build a complete model of the whole system that is executable in the on-board control unit.

The diagnosis system is structured as in Figure 1. This architecture is commonly used for diagnosis in the FDI community, for example when utilizing structured residuals, see [Gertler, 1998]. Further, it is one of the architectures used in industrial applications, and with this motivation we use it in the present work. The process to be diagnosed is assumed to consist of a set of components, which can be faulty or non-faulty. The components are monitored by precompiled diagnostic tests. An example of a diagnostic test is a thresholded residual. In the isolation system, the outputs from the tests are used to make inference about possible present faults.

Figure 1: An example of the relations between the components c_1, ..., c_n of the process, the tests Test 1, ..., Test m, and the isolation system producing the diagnoses.

In this work, the diagnostic tests are assumed to be given, and we focus on the isolation system in Figure 1. The isolation system computes diagnoses, i.e., the combinations of faults that can explain the outputs from the tests. With the test results as our observations, this is the same definition of diagnoses as in [de Kleer et al., 1992]. One problem is that already for small-sized processes there can be many diagnoses, and hence a main requirement on the isolation system is that the diagnoses should be ranked by how probable they are. A second requirement is set by the limited processor and memory storage capacity in the on-board control unit. In this work, a Bayesian approach is used for fault isolation. Given a set of test results, and a set of possible faults, the probabilities that different faults are present are computed.
These probabilities are called posterior probabilities and can be used to rank the faults, or for decision making about fault accommodation. In order to compute the posterior, the conditional probability describing how the tests react to the faults is needed. In particular, the parameters and the structure of dependencies in the conditional distribution are needed. A complex structure, allowing many dependencies, increases the storage capacity needed. On the other hand, a too simple structure will affect the performance of the isolation. There are two key contributions in this work. The first is that the structure of the conditional probability is used as a design variable in the construction of the isolation system. The second is that Bayesian methods are used for estimating the structure and the parameters of the conditional probability distribution from training data. Here, training data consists of the outputs from the diagnostic tests under different working conditions.

The main advantage of estimating the structure from training data is that no explicit knowledge about the process is needed. This is an advantage since in many industrial applications the system to be diagnosed is large and complex, and it is impossible to build a complete model of the whole system. When the structure of dependencies and the parameters of the conditional probability are known, a Bayesian network can be set up and computationally efficient methods for probabilistic inference can be used, see for example [Lerner, 2002], [Lu and Przytula, 2005] or [Jensen, 2001].

2 Related work

When diagnosing complex systems, model errors, noise, and disturbances introduce uncertainties into the diagnosis computation. Several methods that handle this uncertainty have been proposed in the literature. In [Colin N. Jones and Lawrence, 2002] the PGDE (Probabilistic General Diagnostic Engine) algorithm is presented. In the PGDE the logic reasoning used in [Reiter, 1992] is combined with a measure of the belief in different diagnoses. In [Touaf and Ploix, 2004] the problem of uncertainty is solved using fuzzy logic methods. In [Pulido et al., 2005] several isolation algorithms are combined, and the resulting algorithm is applied to uncertain models. In the present work, probabilistic reasoning is used to compute the probability of different diagnoses. Other probabilistic methods can be found in the literature. The Sherlock algorithm [de Kleer and Williams, 1992], as well as its precursor the GDE (General Diagnostic Engine) [de Kleer and Williams, 1987], contains a part that corresponds to our isolation system, where probabilities for the diagnoses are computed. Those algorithms are designed for systems without noise, and the conditional probability distributions are assigned constant values, depending only on whether the measurement from the system is consistent with a fault or not. In the present work, training data is used to estimate the conditional probability distributions. If there is no training data available for the estimation of the underlying probability distributions, our algorithm is basically the same as in Sherlock and GDE. In [Lerner et al., 2000], [Schwall and Gerdes, 2002] and [Lu and Przytula, 2005] probabilistic reasoning for isolation is successfully used on noisy systems, utilizing Bayesian networks. In these three works knowledge about the process to be diagnosed is required to set up the structure for the probabilistic reasoning. In [Lerner et al., 2000] the model of the system is translated into a Temporal Causal Graph and the structure of a Bayesian network is learned from it. In [Schwall and Gerdes, 2002] the structure of the model is given as input to the design of the isolation. In [Lu and Przytula, 2005] a known structure is used, and the focus is on effective methods for solving the inference.

3 Problem Formulation

This work addresses two separate problems. The first is how to estimate the structure of dependencies and relations between the tests and the components, given a set of training data. This part is referred to as the structure problem. The second problem is to utilize the structure to compute the diagnoses when test results arrive at the isolation system, and is called the isolation problem. The structure problem requires tedious computations, but can be performed once and off-line. The isolation problem is performed on-line, where the computational and storage capacity is limited.
The process to be diagnosed consists of a set of N_C components, which can be faulty or not faulty. Enumerate the components, and let c_i be a variable with domain {NF, F}, where F means that the i:th component is faulty and NF that it is not faulty. The variable c_i is called the behavioral mode of the i:th component [de Kleer et al., 1992]. Assume that there exist N_D diagnostic tests. The diagnostic tests can be discrete or continuous. Here we assume that the tests are discrete, because this is actually the case in our application, and because it simplifies the presentation; note, however, that this is not necessary for the methods in general. The diagnostic tests are assumed to be given, but we require no knowledge about their explicit construction. The only information needed is which faults can possibly affect each test. If a test is affected by a certain fault, it is said to be able to detect the fault, but due to model errors and noise it does not necessarily detect it. Enumerate the tests and let d_i denote the test result from test i. For example, for a binary test, the test result can be either 1, indicating that a fault is detected, or 0 if no fault is detected. The prior knowledge about the relations between tests and components can be represented as an isolation structure, where an X at position (i, j) means that test i can react to a fault in the j:th component, but does not necessarily react every time the fault is present. A 0 at position (i, j) means that test i and component j are not related. For example, with three components and three tests, the isolation structure can look like

       c_1  c_2  c_3
d_1    X    0    X
d_2    X    X    0          (1)
d_3    0    X    X

Let C = [c_1, ..., c_{N_C}] be an assignment of behavioral modes to all components in the system, and let D_t = [d_1, ..., d_{N_D}] be the test results at time t. We call C the system behavioral mode. The isolation problem is to compute the probability for different assignments of behavioral modes to all components in the system, given the test results at a certain time t,

P(C | D_t),          (2)

also referred to as the posterior probability. The probability (2), as well as all other probabilities in the following, should also be conditioned on the prior knowledge about the process, but for notational convenience we leave this out unless it is especially important. In the following we consider only the test results from a certain time, and we suppress the index t to simplify notation. To compute (2), use Bayes' rule,

P(C | D) = P(D | C) P(C) / P(D),          (3)

where P(C) is the prior probability for the system behavioral mode. These priors are assumed to be known; in real systems they represent the knowledge of the quality of the components. In (3) the denominator P(D) is a normalization factor, which can be computed using marginalization over all possible system behavioral modes,

P(D) = Σ_C P(D | C) P(C).          (4)
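As a concrete illustration of (3) and (4), the following Python sketch computes the posterior from a likelihood table and the mode priors; the toy numbers and all names are ours, not taken from the engine application.

import itertools

def posterior(d, likelihood, prior, modes):
    # Bayes' rule (3) with the normalization constant P(D) from (4).
    joint = {c: likelihood[(c, d)] * prior[c] for c in modes}
    p_d = sum(joint.values())
    return {c: joint[c] / p_d for c in modes}

# Toy setup: two components, one binary test that reacts with probability
# 0.8 if any component is faulty and with probability 0.05 otherwise.
modes = list(itertools.product((0, 1), repeat=2))        # 0 = NF, 1 = F
prior = {c: (0.9, 0.1)[c[0]] * (0.9, 0.1)[c[1]] for c in modes}
likelihood = {(c, d): ((0.8, 0.05)[not any(c)] if d == 1
                       else (0.2, 0.95)[not any(c)])
              for c in modes for d in (0, 1)}
print(posterior(1, likelihood, prior, modes))            # ranked mode beliefs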

The probability distribution P(D | C) is called the likelihood for C, and it will now be shown how it can be estimated from data. For the on-line isolation, the likelihood is stored as a table, and when test results arrive at the isolation system, the probabilities for different values of C given the data D are computed using (3). The table can be very large; for example, assuming binary tests, the number of elements needed for storage is 2^(N_C + N_D). Even for a small process containing only ten tests and ten components this table has more than a million elements, and storing it is infeasible. To reduce the storage capacity needed for the likelihood P(D | C), it can be factorized into mutually independent factors. One naive approach is to assume that all tests are independent given the behavioral modes. This assumption is generally not true. Examples of situations where tests are dependent are when several faults have the same root cause, when one test can cause another test to react, when there are errors in the underlying models for the diagnostic tests, and when the probability that a test reacts depends on the working point or the environment. Measurements of outputs from the tests in engine diagnosis have shown that some tests certainly are dependent, while others are independent. Thus, assuming that all tests are independent, the posterior probabilities for the system behavioral mode will be incorrect. Instead, partition the tests into M subsets, such that the tests in different subsets are mutually independent, or can be assumed to be mutually independent, while the tests in the same subset can be dependent. Let the maximum number of tests in a subset be L. Let I_i be an index vector containing the indices of the tests in subset i, and let D[I_i] be the tests with indices in I_i. Then the partition of the tests D into M subsets gives the factorization

P(D | C) = P(D[I_1] | C) P(D[I_2] | C) ... P(D[I_M] | C)          (5)

of the likelihood. Each of the distributions P(D[I_i] | C) can be represented by a table, whose maximum size is determined by L. To avoid too large tables, and decrease the storage capacity needed, a limit on L is used. For the case where the tests are binary, the maximum number of elements in each table is 2^(N_C + L). Although one table for each factor is needed, the total storage capacity required is reduced. In (5) the subsets D[I_i] are of different and unknown sizes. Also, the number of factors, M, in the factorization is unknown. Besides the factorization of the likelihood, the parameters of the distributions P(D[I_i] | C) must be estimated. For the estimation, assume that we have a set of training data D = [D_C, D_D], where D_C are the behavioral modes and D_D the corresponding test results. The structure problem can now be stated as estimating the factorization (5) and the parameters of the underlying structure, given the training data D. The structure problem can also be thought of in terms of a model selection problem: we assign a model from the class of two-layer Bayesian networks, and use training data to estimate the best structure.

4 The Structure Problem

The structure problem can be visualized graphically as going from the left graph to the right graph in Figure 2.
The left graph represents our prior knowledge about the relations between tests and components (solid lines), and the unknown relations between tests (dashed lines). The right graph represents the estimated structure, where tests that are dependent are grouped into the same node. In Figure 2 the left graph represents the structure given by (1), where an X at position (i, j) in (1) gives a solid line between test i and component j. The right graph is an example where tests one and three are grouped.

Figure 2: The structure problem can be represented as going from the left graph to the right.

To estimate the structure from data, a measure of how well a structure fits the training data is needed. In [Wolf, 1995] the χ²-test is compared with a Bayesian approach. The disadvantage of the χ²-test is that it is accurate only for large data sets. With the Bayesian approach, the probability that a certain structure is the underlying structure, given the training data, is computed. This is valid also for small sets of data, see [Jaynes, 2001] and [Wolf, 1995]. The Bayesian approach suits this problem, since the training data contains few examples for system behavioral modes which are unlikely to occur. In the Bayesian approach prior probabilities for the different structures must be given. In this work an uninformative prior, ranking all structures as equally likely, will be used [Wolf, 1995]. To keep notation simple, the method will be illustrated with a simple example, but it is straightforward to generalize all reasoning to larger problems. In the example, there are three components, represented by the variables c_1, c_2 and c_3, three tests, and the maximum factor size L is set to 2. The relations between tests and components are given by the isolation structure (1). First, the structure is estimated, and then, given the structure, the parameters of the distribution are estimated.

4.1 Structure estimation

We search for a factorization (5), or in other words the index sets I_i, i = 1, ..., M, that suits all different assignments of system behavioral modes C. To achieve this, we assume that the index sets in (5) are the same as in

P(D) = P(D[I_1]) P(D[I_2]) ... P(D[I_M]).          (6)
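The hypothesis space in (6) is the set of partitions of the test indices into subsets of size at most L. A small Python sketch enumerating it follows (names are ours); for three tests and L = 2 it yields exactly the four hypotheses H_0, ..., H_3 of the example below.

from itertools import combinations

def partitions(items, max_block):
    # All partitions of `items` into blocks of size <= max_block.
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for k in range(min(max_block - 1, len(rest)) + 1):
        for others in combinations(rest, k):
            block = [first, *others]
            remaining = [x for x in rest if x not in others]
            for tail in partitions(remaining, max_block):
                yield [block] + tail

for p in partitions([1, 2, 3], 2):
    print(p)   # [[1], [2], [3]], [[1], [2, 3]], [[1, 2], [3]], [[1, 3], [2]]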

Note that it is only the structure, i.e., the index sets I_i, i = 1, ..., M, and the number of factors M, that are assumed to be the same in (6) and (5), and not the probabilities themselves. This assumption is reasonable, since if two subsets D[I_j] and D[I_k] are independent, the knowledge of C will not make them dependent. On the other hand, if D[I_j] and D[I_k] are dependent, the knowledge of C can make them independent. As will be shown in Section 7.1, this will not affect the isolation performance, but only increase the storage capacity needed to perform the on-board isolation. To bias the factorization such that it is better suited for more important faults, more training data from those behavioral modes can be used. Here, all faults are assumed to be equally important and an equal amount of training data is used from each system behavioral mode.

Now, we introduce some notation. The distributions can be represented by multidimensional arrays, with one dimension for each test. Let p = P(D), and for the marginal distributions let p^12 = P(d_1, d_2) = Σ_{d_3} P(d_1, d_2, d_3), where we sum over all possible values of d_3, etc. Let l_r be the number of elements in p^r, r = 1, ..., 3. For the elements in the distributions, let p_ijk = P(d_1 = i, d_2 = j, d_3 = k), p^12_ij = P(d_1 = i, d_2 = j), and so on. Correspondingly for the data, let n_ijk be the number of observations with d_1 = i, d_2 = j, d_3 = k, and n^12_ij = Σ_{k=1}^{l_3} n_ijk, etc. The total amount of data is N = Σ_{i,j,k} n_ijk.

In our example p can be factorized in four different ways: such that all tests are independent, or such that two tests are dependent while the third is independent of the other two. In the example, use H_0 to denote the hypothesis "all three variables are independent" and H_q, q = 1, ..., 3, to denote "variable q is independent and the other two are dependent". Given a hypothesis H_q, q = 0, ..., 3, p can be factorized in a certain way. For example, H_1 means that p = p^1 p^23. We search for the probabilities of the different factorizations (6) given the training data D. Since we only want one structure, we use the maximum a posteriori (MAP) estimate H* = arg max_q P(H_q | D). Bayes' rule gives

P(H_q | D) = P(D | H_q) P(H_q) / P(D),          (7)

where P(D) is a normalization factor, which can be computed using marginalization,

P(D) = Σ_{q=0}^{3} P(D | H_q) P(H_q).          (8)

Here P(H_q) is the prior probability for the different factorizations. We apply a prior that is zero for all partitions containing subsets with more than L elements, and constant for all others. The distribution P(D | H_q) can be computed by marginalization over all possible distributions,

P(D | H_q) = ∫ P(D | p, H_q) f(p | H_q) dp,          (9)

where f(p | H_q) is the continuous distribution for p and P(D | p, H_q) is a multinomial distribution given by

P(D | p, H_0) = N! / (Π_{i,j,k} n_ijk!) Π_{i,j,k} (p^1_i p^2_j p^3_k)^{n_ijk},          (10a)
P(D | p, H_1) = N! / (Π_{i,j,k} n_ijk!) Π_{i,j,k} (p^1_i p^23_jk)^{n_ijk},          (10b)

and similarly for q = 2, 3. The elements of the distributions must lie between 0 and 1, and for each distribution they must sum to one. The first criterion is handled by the integration limits in (9). The latter criterion means that f is proportional to delta functions, as

f(p | H_0) = f(p^1 p^2 p^3 | H_0) ∝ δ(Σ_{i=1}^{l_1} p^1_i − 1) δ(Σ_{j=1}^{l_2} p^2_j − 1) δ(Σ_{k=1}^{l_3} p^3_k − 1),          (11a)
f(p | H_1) = f(p^1 p^23 | H_1) ∝ δ(Σ_{i=1}^{l_1} p^1_i − 1) δ(Σ_{j,k=1}^{l_2,l_3} p^23_jk − 1),          (11b)

and similarly for H_2 and H_3. The integral (9) can now be solved using convolution and Laplace transform techniques [Wolf, 1995]. The result is

∫ P(D | p, H_0) f(p | H_0) dp = N! / (Π n_ijk!) Γ(l_1) Γ(l_2) Γ(l_3) F_0,          (12a)
∫ P(D | p, H_1) f(p | H_1) dp = N! / (Π n_ijk!) Γ(l_1) Γ(l_2 l_3) F_1,          (12b)

where Γ(·) is the gamma function and

F_0 = Π_{q=1}^{3} Π_{i=1}^{l_q} Γ(n^q_i + 1) / (Γ(N + l_1) Γ(N + l_2) Γ(N + l_3)),          (13a)
F_1 = Π_{i=1}^{l_1} Γ(n^1_i + 1) Π_{j,k=1}^{l_2,l_3} Γ(n^23_jk + 1) / (Γ(N + l_1) Γ(N + l_2 l_3)).          (13b)

The expressions (12) and (13) are similar for H_2 and H_3. Now, we can compute P(H_q | D) for all q. With the MAP estimate H* = arg max_q P(H_q | D) for the structure, i.e., the index sets in (6) and hence also in (5), we can estimate the parameters of the factors in (5).
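Up to the factor N!/Π n_ijk!, which is the same for every hypothesis and therefore cancels when comparing them, the scores (12)-(13) can be computed in log space as in the following Python sketch for binary tests; the data layout and all names are our own.

from math import lgamma
from collections import Counter

def log_structure_score(data, partition):
    # log of (12) without the hypothesis-independent factor N!/prod(n!):
    # per block, Gamma(l) * prod_x Gamma(n_x + 1) / Gamma(N + l), where l
    # is the number of cells of the block's table.
    n = len(data)
    score = 0.0
    for block in partition:
        counts = Counter(tuple(d[i] for i in block) for d in data)
        l = 2 ** len(block)                       # binary tests
        score += lgamma(l) - lgamma(n + l)
        score += sum(lgamma(c + 1) for c in counts.values())
    return score

# Tests 0 and 1 always agree in this toy data while test 2 varies, so the
# partition grouping tests 0 and 1 receives the higher score.
data = [(0, 0, 0), (1, 1, 0), (1, 1, 1), (0, 0, 1)] * 5
print(log_structure_score(data, [[0, 1], [2]]) >
      log_structure_score(data, [[0], [1], [2]]))    # True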

4.2 Parameter estimation

For the parameter estimation we use the same notation as in Section 4.1, but with subindex C to denote that we condition on the system behavioral mode, i.e., p_C = P(D | C). We use the MAP estimate p*_C that maximizes f(p_C | D, H*). Again, apply Bayes' rule,

f(p_C | D, H*) = P(D | p_C, H*) f(p_C | H*) / f(D | H*),          (14)

where f(p_C | H*) is the prior for p_C given the partition H*. In this work a prior is applied that is uniform for all p_C which suit the partition H* and the structure (1), and zero for all other p_C. The denominator in (14) is a normalization factor, independent of p_C, and hence f(p_C | D, H*) ∝ P(D | p_C, H*) for all p_C suitable to H*. The MAP estimate is p*_C = arg max_{p_C} f(p_C | D, H*). From (5) and given H* we know that we can factorize p_C = Π_{m=1}^{M} p^m_C. The distribution P(D | p_C, H*) is maximized under the constraint that the elements in each factor sum to one, Σ_i p^m_{C,i} = 1, m = 1, ..., M, using Lagrange multipliers. The result is

p^m_{C,x} = n^m_x / N,          (15)

for x = 1, ..., l_m. With p^m_C = [p^m_{C,1}, ..., p^m_{C,l_m}] and p*_C = Π_{m=1}^{M} p^m_C we know, together with H*, both the structure and the parameters in (5), and the isolation problem can be solved using probabilistic inference.

No training data

If there is no training data available, other ways of assigning the structure and the probabilities are needed. Using the principle of indifference [Jaynes, 2001], we assign the probabilities

p_C = 0     if D is inconsistent with C,
p_C = 1     if D is surely consistent with C,          (16)
p_C = 1/K   otherwise.

Here K is the number of values that D can take given C. In this case, the structure will not affect the result, and the assumption that all data is independent can be used. For example, using the isolation structure (1) and assuming binary tests, this gives P(d_1 = 1 | C = [NF, NF, F]) = 1/2 and P(d_1 = 1 | C = [NF, NF, NF]) = 0. This is basically the same approach as used in [de Kleer and Williams, 1987] and [de Kleer and Williams, 1992].

5 The Isolation Problem

The on-line isolation is solved by computing the posterior probability for the system behavioral modes, given the test results at a certain time and using the structure H*, the estimated likelihoods p*_C, and the information about the prior P(C). Denoting the prior information by I, the posterior probability is

P(C | D, H*, p*_C, I).          (17)

To compute (17) efficiently, a Bayesian network can be set up, using the structure and the parameters learned from the structure problem. Standard algorithms for reasoning in Bayesian networks can be used, see [Jensen, 2001] or [Lerner, 2002] for examples. There are also algorithms for computing the k most likely explanations of the data, with even less complexity [Lerner, 2002].

6 Performance Measure

In the present paper, isolation systems that can be expressed by a two-layer Bayesian network are designed. By choosing different values of L, different isolation systems within this class are obtained. Further, there are the two extreme cases: assuming that all tests are independent, i.e., L = 1, and using no assumptions on independence. Let I denote an isolation system. Then the output from the Bayesian isolation system, the posterior, is P(C | D, I). In order to compare the performance of two isolation systems, a performance measure is needed. We suggest as an optimal isolation system a system that assigns posterior probability one to the true underlying system behavioral mode. For probabilistic isolation systems, we define the expected probability of correctness.

Definition 1 (Expected probability of correctness) Let D_C be data generated when the system behavioral mode C is present, and let I be a probabilistic isolation system. Then the expected probability of correctness is

µ(C, I) = E{P(C | D_C, I)},          (18)

where the expectation is over data.

The measure µ gives the expected probability assigned to the system behavioral mode that is really present. The optimal value of µ is one. This measure gives one number for each system behavioral mode, which is interesting since the behavioral modes can be differently difficult to isolate.
To summarize the expected probability of correctness into one number, use the average over all system behavioral modes,

µ(I) = (1/m) Σ_C µ(C, I).          (19)

Another measure that relates to the isolation system performance is the probability that a correct diagnosis is made if the system behavioral mode with the largest posterior probability is chosen as the diagnosis. In other words, given data D_C from the system behavioral mode C, what is the probability that P(C | D_C) is the largest posterior probability? We call this measure the expected probability of correct classification and write it

µ_cc(C, I) = E{P(Ĉ = C)},          (20)

where

Ĉ = arg max_{C′} P(C′ | D_C, I)          (21)

and the expectation is over data. The optimal value of µ_cc(C, I) is one. Also this measure gives one number for each behavioral mode. To summarize µ_cc, use the average,

µ_cc(I) = (1/m) Σ_C µ_cc(C, I).          (22)

Note that choosing the system behavioral mode with the largest probability as the diagnosis is only one of the interpretations of the output from the isolation system; there are more clever ways to interpret the results. How to interpret the results from a probabilistic isolation system is further discussed in Section 7.2.
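The measures (18)-(22) can be estimated by Monte Carlo simulation, as in the following Python sketch; sample_d (drawing test results for a given true mode) and isolate (returning the posterior over modes as a dict) are placeholders for whatever isolation system is being evaluated, and all names are ours.

def mu(c_true, sample_d, isolate, runs=1000):
    # Expected probability of correctness (18): average posterior mass
    # assigned to the mode that actually generated the data.
    return sum(isolate(sample_d(c_true))[c_true] for _ in range(runs)) / runs

def mu_cc(c_true, sample_d, isolate, runs=1000):
    # Expected probability of correct classification (20)-(21): fraction
    # of runs in which c_true receives the largest posterior.
    hits = 0
    for _ in range(runs):
        post = isolate(sample_d(c_true))
        hits += max(post, key=post.get) == c_true
    return hits / runs

def average_over_modes(measure, modes, sample_d, isolate, runs=1000):
    # The summarizing averages (19) and (22) over all m behavioral modes.
    return sum(measure(c, sample_d, isolate, runs) for c in modes) / len(modes)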

7 Diesel Engine Diagnosis

The Bayesian isolation approach is applied to the diagnosis of the gas flow of a diesel engine with EGR (Exhaust Gas Recirculation) and VGT (Variable Geometry Turbine). A schematic figure of the gas flow is given in Figure 3.

Figure 3: A schematic figure of the gas flow through the diesel engine with EGR and VGT.

In the system there are ten components to be diagnosed, listed in Table 1. In this example, all components considered are sensors, but other kinds of components, such as pipes, actuators etc., can be diagnosed with this method as well.

Table 1: The sensors in the engine system.
exhaust gas pressure          p_em
inlet manifold gas pressure   p_im
inlet manifold temperature    T_im
ambient pressure              p_amb
ambient temperature           T_amb
EGR valve position            u_EGR
VGT valve position            u_vgt
flow through the compressor   w_cmp
engine speed                  n_eng
turbine speed                 n_trb

To make the results easier to overview, only the three components p_em, p_im, and n_trb are diagnosed in this example, while the other seven are assumed to function correctly. An extension to diagnosis of all sensors is straightforward. All possible combinations of faults in the three components are considered. This gives m = 8 possible system behavioral modes. For the system behavioral modes we use a short notation: for example, to denote that components p_em and p_im are functioning correctly and component n_trb is faulty, we let C = [p_em, p_im, n_trb] = [NF, NF, F] be represented by C = [001]. There exists a complex model of the diesel engine process, from which about 60 residual generators can be found [Einarsson and Ahrrenius, 2004]. Due to limitations in the capacity of the on-board control unit, not all 60 residual generators can be executed. In this example, five of the 60 residuals are used. The residuals are thresholded, and the thresholded residuals are used as the diagnostic tests. Here, the tests are binary, but this is not a requirement of the method. The experiments are done on data collected from the engine in a real driving situation. Four different isolation systems are set up:

T: no assumption of independence;
H: the most probable structure H* for some given requirements;
N: the naive assumption that the tests are independent, L = 1;
D: no training data.

The diagnosis system T is of course infeasible when considering larger systems, but is used here because it uses all the information given by the training data, and in some sense gives the best possible structure. The system D, designed without training data, is implemented according to (16). It turned out to perform very poorly on the current example, and its result is only given in the summary of the experiments in Table 4. For the design of the isolation system H, the requirement L = 2 is used. This reduces the required storage capacity from 2^8 = 256 for the system T to, in the worst case (two subsets of two tests and one single test), 2 · 2^5 + 2^4 = 80. In practice, the storage capacity needed will be even smaller, since the tests in each partition will not be related to all components.
Table 2: The five most probable partitions normalized with the probability of the most probable partition.

The prior probabilities for all three faults are assumed to be equal, p(c_i) = 0.1, i = 1, 2, 3, and, although not necessary, we assume that they break independently. To compare the performance of the isolation systems, data sets from different system behavioral modes are applied to the systems, and the probabilities for different diagnoses are computed. In Figures 4, 5, and 6 the probability, and its variance, for three different system behavioral modes and the three isolation systems T, H, and N are shown. The true behavioral modes are C = [010], C = [110], and C = [011], respectively. For the first two system behavioral modes, the isolation systems T and H assign the largest probability to the correct system behavioral mode, while the system N does not.
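Once the per-group likelihood tables are trained, the posterior for each behavioral mode follows from Bayes' rule with the likelihood factored over the groups of the chosen partition. A minimal sketch under that between-group independence assumption; the table layout `group_lik[g][C][d_g]` is an assumed data structure, not the paper's:

```python
def partitioned_posterior(d, modes, prior, group_lik, groups):
    # P(C | D, I)  proportional to  p(C) * prod_g P(D_g | C, I)
    unnorm = {}
    for C in modes:
        lik = 1.0
        for g in groups:                   # e.g. groups = [(1, 4), (2, 3), (5,)]
            d_g = tuple(d[t] for t in g)   # this group's test outcomes
            lik *= group_lik[g][C][d_g]    # stored table P(D_g = d_g | C, I)
        unnorm[C] = prior[C] * lik
    z = sum(unnorm.values())               # normalizing constant P(D | I)
    return {C: v / z for C, v in unnorm.items()}
```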

For the third system behavioral mode, in Figure 6, all three isolation systems miss the underlying system behavioral mode.

Figure 4: The average probability assigned to the different system behavioral modes for the isolation systems T (top), H (middle), and N (bottom). The lines show the variance. The true behavioral mode is C = [010].

Figure 5: The average probability assigned to the different system behavioral modes for the isolation systems T (top), H (middle), and N (bottom). The lines show the variance. The true behavioral mode is C = [110].

The expected probability of correctness for the behavioral modes C = [010], C = [110], and C = [011] is given in Table 3. All the values of µ are far from 1, even for the system T, although no assumptions on independence are made in this system. The reason is that some system behavioral modes are difficult to isolate, for example C = [011], shown in Figure 6.

Table 3: The expected probability of correctness for three different system behavioral modes.

A numerical summary of all four isolation systems is given in Table 4. The values of µ and µ_cc for the isolation system T are the largest, followed by the system designed with our method, H. The values of µ_cc in Table 4 indicate that choosing the system behavioral mode with the largest probability as the diagnosis is not always a good way of interpreting the results. It is also interesting to note that the naive isolation system and our designed isolation system need the same amount of storage capacity for the likelihoods. This is not true in general, although L can often be chosen so that the storage capacity needed is significantly reduced compared to the system without restrictions.

Table 4: The probability of correct classification and the average probability of correctness for all systems.

The performance is very different for different system behavioral modes. In general, multiple faults are more difficult to detect than single faults. The reason is that the priors for multiple faults are very small compared to the priors for a single fault or no fault. One solution to overcome this problem is to consider data from several time steps. In this case the probability P(C | D_t, D_{t+1}, ..., D_{t+T}, I) for some T > 0 is used instead of the probability P(C | D_t, I) as in the case above. This will decrease the influence of the prior, and increase the influence of the likelihood, on the posterior. See for example [Jaynes, 2001].

7.2 Discussion

The experimental results show that the isolation system based on the partitioned structure performs better than the isolation system based on the naive structure, but it still performs worse than the structure assuming no independences. One question is, of course, how much performance can be gained using a larger L. The maximum L that can be used is given by restrictions on the memory capacity of the on-board control unit, but a smaller L could also perform sufficiently well. The accuracy needed depends on how the output from the isolation system is to be evaluated.
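The multi-time-step variant mentioned above amounts to repeated Bayes updates. The sketch below additionally assumes, for simplicity, that the data sets at different time steps are conditionally independent given the behavioral mode; `likelihood` is a stand-in for the trained tables:

```python
def multi_step_posterior(data_seq, modes, prior, likelihood):
    # P(C | D_t, ..., D_{t+T}, I): the prior enters once, the likelihood
    # once per time step, so longer windows weaken the prior's influence
    post = dict(prior)
    for d in data_seq:
        post = {C: post[C] * likelihood(d, C) for C in modes}
    z = sum(post.values())
    return {C: v / z for C, v in post.items()}
```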
One way to interpret the output from a probabilistic isolation system is to use a cost function and compute the expected cost of measures. The target is to minimize this expected cost.
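As a sketch of this decision-theoretic reading (the action set and cost function are illustrative assumptions, not from the paper), one would pick the measure with the smallest posterior-expected cost:

```python
def cheapest_action(posterior, actions, cost):
    # expected cost of action a:  sum_C P(C | D, I) * cost(a, C)
    expected = {a: sum(p * cost(a, C) for C, p in posterior.items())
                for a in actions}
    return min(expected, key=expected.get)
```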

So far we have focused on the storage needed to implement the isolation system, and we have seen that it is dependent on L. Also, the number of hypotheses H_q defined in Section 4 is interesting, since too many hypotheses can give numerical problems when solving the structure problem. The number of hypotheses increases with increasing L, and with an increasing number of diagnostic tests. To extend this work to large-scale problems, the increased search space for H_q must be handled. This extension is a challenge, but beyond the scope of this work.

Figure 6: The average probability assigned to the different system behavioral modes for the isolation systems T (top), H (middle), and N (bottom). The lines show the variance. The true behavioral mode is C = [011].

8 Conclusion

In this paper, Bayesian techniques for fault isolation are presented. The structures of the underlying conditional probabilities are used as a design variable, and they are estimated from training data. The Bayesian method was applied to diagnosis of the gas flow of a diesel engine. Four different Bayesian isolation systems, with different degrees of dependence assumptions, were compared. The experiments were run on data from real driving situations. The results show that if there is a dependence between tests, this dependence is important to take into account when designing the isolation system. The system designed with the new method performs best of the systems with the same order of complexity.

References

[Colin N. Jones and Lawrence, 2002] Colin N. Jones, Gregory W. Bond, and Peter D. Lawrence. Consistency-based fault isolation for uncertain systems with applications to quantitative dynamic models. In DX 2002, pages 36-42, 2002.
[de Kleer and Williams, 1987] Johan de Kleer and Brian C. Williams. Diagnosing multiple faults. Artif. Intell., 32:97-130, 1987.
[de Kleer and Williams, 1992] Johan de Kleer and Brian C. Williams. Diagnosis with behavioral modes. In Readings in Model-Based Diagnosis, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc.
[de Kleer et al., 1992] Johan de Kleer, Alan K. Mackworth, and Raymond Reiter. Characterizing diagnoses and systems. Artif. Intell., 56(2-3), 1992.
[Einarsson and Ahrrenius, 2004] Henrik Einarsson and Gustav Ahrrenius. Automatic design of diagnosis systems using consistency based residuals. Master's thesis, Uppsala University, 2004.
[Gertler, 1998] Janos J. Gertler. Fault Detection and Diagnosis in Engineering Systems. Marcel Dekker, New York, 1998.
[Jaynes, 2001] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, Cambridge, 2001.
[Jensen, 2001] F. V. Jensen. Bayesian Networks. Springer-Verlag, New York, 2001.
[Lerner et al., 2000] Uri Lerner, Ronald Parr, Daphne Koller, and Gautam Biswas. Bayesian fault detection and diagnosis in dynamic systems. In AAAI/IAAI, 2000.
[Lerner, 2002] Uri Lerner. Hybrid Bayesian Networks for Reasoning about Complex Systems. PhD thesis, Stanford University, October 2002.
[Lu and Przytula, 2005] Tsai-Ching Lu and K. Wojtek Przytula. Methodology and tools for rapid development of large Bayesian networks. In DX 2005, pages 89-94, 2005.
[Pulido et al., 2005] B. Pulido, V. Puig, T. Escobet, and J. Quevedo. A new fault localization algorithm that improves the integration between fault detection and localization in dynamic systems. In DX 2005, 2005.
[Reiter, 1992] Raymond Reiter. A theory of diagnosis from first principles.
In Readings in Model-Based Diagnosis, pages 29-48, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc.
[Schwall and Gerdes, 2002] Matthew Schwall and Christian Gerdes. A probabilistic approach to residual processing for vehicle fault detection. In Proceedings of the 2002 ACC, 2002.
[Touaf and Ploix, 2004] Samir Touaf and Stephane Ploix. Soundly managing uncertain decisions in diagnostic analysis. In DX 2004, 2004.
[Wolf, 1995] David Wolf. Mutual information as a Bayesian measure of independence, 1995.

Automatic Generation of Benchmark Diagnosis Models

Gregory Provan
Department of Computer Science, University College Cork, Cork, Ireland

Abstract

We describe an algorithm for automatically generating benchmark models that can be used for evaluating diagnosis algorithms. Our algorithm generates models based on a system structure specified by a small-world network, a graphical structure that is common to a wide variety of naturally-occurring systems, ranging from biological systems and the WWW to human-designed mechanical systems. To demonstrate this approach, we randomly generate a suite of digital circuit models with small-world network structure, and empirically show the computational complexity of diagnosing these models.

1 Diagnostic Inference for Complex Systems

The problem of model-based diagnosis (MBD) consists of determining whether an assignment of failure status to a set of mode-variables is consistent with a system description and an observation (e.g., of sensor values). This problem is known to be NP-complete. However, this is a worst-case result, and some NP-complete problems are known to be tractable for particular problem classes. For example, graph colouring, which is NP-complete, has empirically been shown to have run-times that depend on the graph structure [Cheeseman et al., 1991]. We are interested in the average-case complexity of MBD algorithms on problem instances with real-world structure. At present, it is not known whether MBD is computationally difficult for the average real-world system. There has been no systematic study of the complexity of diagnosing real-world problems, and few good benchmarks exist to test this.

We describe an algorithm for automatically generating diagnostic benchmark models that can be used to analyse the performance of diagnostic inference algorithms. This model generator can be applied to any domain, and can generate models that accurately capture the properties of complex systems, given as input a library of domain-dependent component models. To demonstrate our approach, we generate a suite of combinational circuit models, each of which possesses typical real-world properties, and empirically study the complexity of diagnostic inference within a model-based framework. Our experimental results show that problems with real-world structural properties are computationally more difficult than problems with regular or random structure, such as would be generated by a typical random-problem generator.

This article makes two main contributions. First, it describes a technique to generate diagnosis models with real-world structural properties. This approach circumvents the difficulty of assembling a large suite of test problems (benchmark models), given that most large diagnosis models tend to be proprietary. It also enables us to control model parameters (and hence analyse the effect of specific parameters). Second, we show empirically that diagnosing models with real-world structure is computationally hard. This provides the first clear experimental demonstration of this computational intractability.

We organize the remainder of the document as follows. Section 2 examines the topological structure that all real-world complex systems possess. Section 3 summarises the model-based diagnosis task that we solve. Section 4 reviews related work in the area of automated model generation. Section 5 describes the process we adopt for generating diagnostic models. Section 6 presents the experimental results.
Finally, Section 7 summarises our contributions and discusses the wider implications of our results.

2 The Structure of Real-World Problems

Several recent theoretical studies and extensive data analyses have shown that a variety of complex systems, including biological [Newman, 2003], social [Newman, 2003], and technological [Braha and Bar-Yam, 2004; i Cancho et al., 2001] systems, share a common underlying structure, which is characterised by a small-world graph. A small-world graph (SWG) is a complex network in which (a) the nodes form several loosely connected clusters, and (b) every node can be reached from every other by a small number of hops or steps. We can measure whether or not a network is a small world according to two graph parameters: the clustering coefficient and the characteristic (mean-shortest) path length [Newman, 2003]. The clustering coefficient, C, is a measure of how clustered, or locally structured, a graph is; this coefficient is an average of how interconnected each agent's neighbors are. The characteristic path length, L, is the average distance between any two nodes in the network, or more precisely, the average length of the shortest path connecting each pair of nodes.
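Both parameters are straightforward to measure. A minimal sketch using networkx (assumed available); the generated test graph and its size are illustrative only:

```python
import networkx as nx

# a small-world test graph; connected variant so L is well defined
G = nx.connected_watts_strogatz_graph(n=300, k=4, p=0.05)

C = nx.average_clustering(G)               # clustering coefficient
L = nx.average_shortest_path_length(G)     # characteristic path length

print(f"C = {C:.3f}, L = {L:.2f}")         # small worlds: high C, small L
```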

In the following, we will summarise the graph-theoretic notation that we adopt to study the inference complexity of MBD, and examine the data demonstrating the small-world properties of technological systems.

2.1 Small-World Graph Parameters

This section introduces our notation. We assume that we have a graph G(V, E), with V the set of vertices and E the set of edges. We say that V_1 is a parent of V_2 in G, denoted V_1 = π(V_2).

Definition 1 (Vertex Degree). The in-degree of a vertex in a digraph is the number of arcs coming into the vertex, and the out-degree is the number of arcs going out of the vertex. The degree k is the total number of incoming and outgoing arcs.

Definition 2 (Path). A path from a vertex x_0 to a vertex x_n in a digraph G = (V, E) is a sequence of vertices x_0, x_1, ..., x_n that satisfies the following: for each i, 0 ≤ i ≤ n−1, (x_i, x_{i+1}) ∈ E or (x_{i+1}, x_i) ∈ E; that is, between any pair of consecutive vertices there is an arc connecting them. x_0 is the initial vertex and x_n is the terminal vertex of the path.

SWG Characteristic 1: Mean Shortest Path Length

The characteristic path length L of a SWG is only a meaningful measure if a graph is fully connected, i.e., if there is a sequence of edges joining any two nodes. We adopt the convention that L is infinite if a graph is not connected. In general, to make comparisons more feasible, all graphs we deal with will be fully connected.

Definition 3 (Connected graph). A graph is said to be connected if there is a path between every pair of its vertices.

We define the distance between two vertices in a graph as follows.

Definition 4 (Graph Distance). Given a graph G(V, E), the distance L between two vertices is the number of edges in a shortest path connecting the two vertices.

As an example, for a random graph G_{n,p}, defined on n nodes where any pair of nodes is connected with probability p, the mean distance is L_rand ≈ ln n / ln k, with k the mean degree.

SWG Characteristic 2: Clustering Coefficient

The notion of clustering characterises the degree of cliquishness of a typical neighbourhood in a graph. We define the neighbourhood N_i for a vertex v_i as its immediately connected neighbours: N_i = {v_j : e_ij ∈ E}. The degree k_i of a vertex i is the number of vertices in its neighbourhood, i.e., k_i = |N_i|. The clustering coefficient C_i for a vertex v_i is the proportion of links between the vertices within its neighbourhood divided by the number of links that could possibly exist between them. For a directed graph, e_ij is distinct from e_ji, and therefore for each neighbourhood N_i there are k_i(k_i − 1) links that could exist among the vertices within the neighbourhood. Thus, the clustering coefficient is given as:

C_i = |{e_jk} ∪ {e_kj}| / (k_i(k_i − 1)),  for v_j, v_k ∈ N_i, e_jk ∈ E or e_kj ∈ E.   (1)

This measure is 1 if every neighbour connected to v_i is also connected to every other vertex within the neighbourhood, and 0 if no vertex that is connected to v_i connects to any other vertex that is connected to v_i. The clustering coefficient for the whole system is the average of the clustering coefficients of the vertices [Watts and Strogatz, 1998]: C = (1/n) Σ_{i=1}^{n} C_i. For a random graph G_{n,p}, the clustering coefficient is C_rand = k/n. We will use these properties in the following section to describe the technological systems that we want to diagnose.
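For illustration, a direct, unoptimized implementation of Eq. (1) and of the averages and random-graph baselines above, for a digraph given as adjacency sets:

```python
import math
from itertools import permutations

def clustering_coefficient(adj):
    """C = (1/n) * sum_i C_i for a digraph; adj[v] is the set of w with
    an arc v -> w. N_i is v_i's neighbourhood in either arc direction."""
    def C_i(v):
        nbrs = {w for w in adj if w != v and (w in adj[v] or v in adj[w])}
        k = len(nbrs)
        if k < 2:
            return 0.0
        # directed arcs actually present among the k*(k-1) possible ones
        present = sum(1 for u, w in permutations(nbrs, 2) if w in adj[u])
        return present / (k * (k - 1))
    return sum(C_i(v) for v in adj) / len(adj)

# analytic baselines for a random graph G_{n,p} with mean degree k
C_rand = lambda k, n: k / n
L_rand = lambda k, n: math.log(n) / math.log(k)
```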
2.2 Technological System Topology

Several recent studies of technological systems have shown that they all possess small-world topology [Braha and Bar-Yam, 2004; i Cancho et al., 2001]. In these studies, each technological system is described in graph-theoretic terms, and the underlying topology of the system graph G is studied. For example, for the electronic circuits studied in [i Cancho et al., 2001], the vertices of G correspond to electronic components (gates, resistors, capacitors, diodes, etc.), and the edges of G correspond to the connections (or wires) between the components. These circuits comprise both analog and ISCAS 89/ITC 89 benchmark circuits, and all display C and L parameters that are typical of SWG topologies.

Figure 1: Graph of analog TV circuit. Note the clustering of nodes, especially the dense central cluster.

Figure 1 [i Cancho et al., 2001] shows the topology graph of an analog TV circuit containing 329 components. This figure shows how this graph has clear clusters (especially the dense central cluster). In addition, there are short paths between any pair of components (nodes) in the network. The first two rows of Table 1 compare the clustering and distance parameters with the corresponding random-graph parameters, showing that C ≫ C_rand and L ≈ L_rand. The third row of Table 1 shows a large circuit taken from the ISCAS 89/ITC 89 benchmark, which displays small-world topology similar to the two smaller circuits [i Cancho et al., 2001].

Table 1: SWG data for circuits. N is the number of nodes; C and C_rand denote the clustering coefficient for the circuit and corresponding random-graph model; L and L_rand denote the mean distance for the circuit and corresponding random-graph model. (Rows: logic, analog, and ISCAS circuits.)

3 Model-Based Diagnosis

We can characterise an MBD problem using the triple ⟨COMPS, SD, OBS⟩ [Reiter, 1987], where:

- COMPS = {C_1, ..., C_m} describes the operating modes of the set of m components into which the system is decomposed.
- SD, or system description, describes the function of the system. This model specifies two types of knowledge, denoted SD = (S, B), where the system structure, S, denotes the connections between the components, and the system behaviour, B, denotes the behaviour of each component.
- OBS, the set of observations, denotes possible sensor measurements, which may be control inputs, outputs or intermediate variable-values.

We adopt a propositional logic framework for our diagnostic models. Component i has an associated mode-variable C_i; C_i can be functioning normally, denoted as [C_i = OK], or can take on a finite set of abnormal behaviours. MBD inference assumes initially that all components are functioning normally: [C_i = OK], i = 1, ..., m. Diagnosis is necessary when SD ∪ OBS ∪ {[C_i = OK] | C_i ∈ COMPS} is proved to be inconsistent. Hypothesizing that component i is faulty means switching from [C_i = OK] to [C_i ≠ OK]. A (minimal) diagnosis is thus a (minimal) subset C′ ⊆ COMPS such that

SD ∪ OBS ∪ {[C_i = OK] | C_i ∈ COMPS \ C′} ∪ {[C_i ≠ OK] | C_i ∈ C′}

is consistent. In this article, we adopt a multi-valued propositional logic using standard connectives (¬, ∧, ∨, ⇒). We denote variable A taking on value α using [A = α]. An example equation is [A = t] ∧ [B = f] ⇒ [C = t].
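To make this diagnosis definition concrete before turning to related work, here is a toy consistency check, not the paper's machinery: two inverters A and B in series under the weak fault model (a component that is not OK has an unconstrained output), with input t and observed output f. Subset-minimal diagnoses are found by enumeration.

```python
from itertools import combinations

def consistent(faulty, inp, obs):
    """SD + OBS + {[C_i = OK] for i not in faulty} is consistent iff some
    assignment of the faulty components' outputs reproduces obs."""
    outs_a = [not inp] if 'A' not in faulty else [True, False]
    for o_a in outs_a:
        outs_b = [not o_a] if 'B' not in faulty else [True, False]
        if obs in outs_b:
            return True
    return False

comps = ('A', 'B')
diagnoses = [set(s) for r in range(len(comps) + 1)
             for s in combinations(comps, r)
             if consistent(set(s), inp=True, obs=False)]
minimal = [d for d in diagnoses if not any(d2 < d for d2 in diagnoses)]
print(minimal)   # [{'A'}, {'B'}]: either inverter alone explains the output
```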
4 Automatic Benchmark Generation: Related Work

This article addresses the automated generation of benchmarks for model-based diagnosis. The literature does not contain any work, to our knowledge, that addresses this task for applications other than circuit diagnosis.

The most closely-related work in the literature is the work on diagnostic model generation for circuits [Vogels et al., 2004]. This work addresses the detailed simulation of circuit defects (such as metal spot defects or defects in circuit geometry), which is itself a big task. This methodology is important in that very few other researchers have addressed the need to have libraries of components with detailed physics-based failure-mode definitions. This approach, however, has focused on very small circuits, such as a 4-bit ALU, and does not use algorithms for generating arbitrary circuit topologies. Further, the defect simulation cannot be generalised beyond circuits.

A second group of related work addresses automatic benchmark circuit generation for improving the design of programmable logic architectures [Hutton et al., 2002; Christie and Stroobandt, 2000]. Benchmark circuit auto-generation originally was based on applying a circuit generation rule, called Rent's rule [Landman and Russo, 1971], but has since expanded to include other methods.(1)

We now describe the basic methodology of automatic discrete circuit generation, pointing out the similarities and differences to our approach for automating the generation of diagnostic models for circuits and other domains. Most automatic circuit generation methods are based on one of two methods, which we call equivalence-class and Rent-based methods. The equivalence-class methods [Ghosh and Brglez, 1999] are based on perturbing a seed circuit to generate a circuit with similar overall structure but different local connectivity. The Rent-based methods use a power-law methodology, called Rent's rule, to generate circuits [Christie and Stroobandt, 2000]. Both methods can generate combinational and sequential circuits, where we define a combinational circuit as one without any distinguished clock inputs (e.g., as provided by D-type flip-flop components), and a sequential circuit as a circuit with distinguished clock inputs.(2) Most auto-generation algorithms first create the combinational circuits, and then use a hierarchical approach to generate the sequential circuits for each level of delay [Hutton et al., 2002].

In the following, we examine the combinational circuit generation process, since this process has some properties that are potentially generalisable to any system model; sequential circuit generation addresses issues that are restricted to a specific class of temporal feedback systems with distinguished clock inputs, features that are not present in many other domains. Moreover, because of its greater generality, we focus in this article on the Rent-based combinational circuit methods.

Rent's rule [Landman and Russo, 1971] was originally derived empirically, but has since been given mathematical underpinnings. Rent's rule describes the relationship between the number of external signal connections to a logic block (called the number of pins) and the number of logic gates in the logic block. Rent's rule is given by T = t·n^ξ, where (a) T is the number of input/output pins,(3) (b) n is the number of gates, and (c) the (internal) Rent exponent 0 ≤ ξ ≤ 1 represents the level of placement optimization within a statistically homogeneous circuit, which is characterized by an interconnection topology with an average node degree t (or, in engineering terms, t terminals per gate). From an engineering perspective, ξ = 1 corresponds to no placement optimization, i.e., the circuit is interpreted as a random gate arrangement. In actual circuits, the parameter ξ is dependent on circuit topology: microprocessors, gate-arrays, and high-speed computers are characterized by Rent exponents of ξ = 0.45, 0.5, and 0.63, respectively [Christie and Stroobandt, 2000].

(1) See [Chang et al., 2003] for a survey.
(2) In this article, we focus on atemporal models, which translates to combinational circuits.
(3) In graph-theoretic terms, if we represent component i using a node in a topology graph, then the degree k_i of component i corresponds to the set of terminals of component i in the circuit-generation domain.
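A one-line check of the rule with the exponents just quoted; the gate count n and terminals per gate t are arbitrary illustrative values:

```python
def rent_pins(t, n, xi):
    return t * n ** xi          # Rent's rule: T = t * n^xi

for label, xi in (("microprocessor", 0.45),
                  ("gate-array", 0.50),
                  ("high-speed computer", 0.63)):
    print(label, round(rent_pins(t=4, n=10_000, xi=xi)))
```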

Several tools have been developed to generate benchmark circuits based on Rent's rule and other approaches. Examples of such tools are CIRC and GEN.(4) If one is interested in generating benchmark diagnostic circuits, then these tools can be integrated within the diagnostic model-generation framework described in this article.

Circuit generation algorithms have proven very useful for applications like FPGA design; however, they are restricted to a specific domain, and focus on topology optimisation rather than on the issues of fault isolation that are relevant to diagnosis benchmarks. As a consequence, we have developed a more general approach to benchmark generation that has some commonality with circuit generation algorithms, but also some key differences.

The key commonality between our approach and these circuit generation algorithms is that we first generate the underlying system topology, using a graph generation algorithm. Specifically, we use a graph generation algorithm that generates a graph with a power-law topology. This approach is a generalisation of the Rent-based topology algorithm, in that Rent's rule uses a power-law method that is almost identical to the power-law approaches developed within the random-graph community. Both the Rent-based and random-graph methods focus on defining a graphical structure G(V, E) in which the nodes V correspond to components and the edges E correspond to wires between the components.

Key differences between these areas include (1) the extension of the system topology to incorporate functionality, and (2) the tuning of the topology and functionality. With regard to (1), our diagnostic benchmark generator extends the system topology to incorporate a functional description that describes both normal and anomalous system behaviours. With regard to (2), the diagnostic benchmark generator methodology has parameters that can be tuned to generate models that approximate particular domains, but assumes that these parameters are domain-dependent and need to be supplied by domain experts. In the absence of good domain parameters, the generated models will approximate real models with good accuracy, and the quality can be improved with the use of precise parameters.

Random graph generators can effectively capture the gross topology of complex systems, but much work remains to capture more precisely the detailed structure of particular domains. For example, the actual structure of the WWW is known to differ from the predictions of random graph models [Donato et al., 2004].

(4) See jayar/software/software.html.
In contrast, the practical applications and validity of the circuit-synthesis methods are more heavily researched than the applications and validity of the random-graph generation approach; as a consequence, the models that a circuit-synthesis method generates are provably closer to their real-world targets (circuits) than the models generated by random-graph generators are to their real-world targets, such as the WWW [Donato et al., 2004]. However, many aspects of the circuit-generation algorithms are so particular to the precise architectures of circuits that they are not generalisable to other domains.

5 Benchmark Diagnostic Model Generation

This section describes our algorithm for generating benchmark diagnostic models. Figure 2 depicts the process of automatically generating diagnostic models and using them for evaluating diagnosis inference algorithms. Our approach is applicable to any domain, since (a) the underlying topological models have been shown to approximate virtually any complex system [Newman, 2003], and (b) functionality is incorporated into the system model using a component library, where components can be developed for any domain in which the system models are decomposable.

The topology-generation method we adopt was originally developed based on the theory of random graphs; see [Newman, 2003] for background in this area. However, this method focuses solely on the system structure (as captured by the graph), and ignores the system functionality. We extend this approach by adopting the system structure based on the random-graph generators, and then encoding system functionality using a component library.

Figure 2: The steps of automated model generation and analysis.

Generation Algorithm: We generate diagnostic (benchmark) models in a three-step process (sketched in code after this list):
1. generate the (topology) graph G underlying each model;
2. assign components to each node in G for system Ψ, to create a model-based diagnosis graph (MBD-graph) G′;
3. generate the system description (and fault probabilities) for Ψ.
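A skeleton of the three steps, with stubbed helpers whose names are for illustration only (the detailed choices behind each step are given in Sections 5.1-5.3; networkx is assumed for the topology step):

```python
import random
import networkx as nx

def generate_topology(n, k, p, seed=None):
    # step 1: small-world topology graph (Watts-Strogatz, Section 5.1)
    return nx.connected_watts_strogatz_graph(n, k, p, seed=seed)

def assign_components(G, library, rng):
    # step 2: one library gate per node, forming the MBD-graph (Section 5.2)
    return {v: rng.choice(library) for v in G.nodes}

def generate_system_description(assignment):
    # step 3: behavioural equations plus fault priors (Section 5.3)
    return {v: {"gate": g, "P_ok": 0.99} for v, g in assignment.items()}

rng = random.Random(0)
G = generate_topology(n=50, k=4, p=0.05, seed=0)
sd = generate_system_description(assign_components(G, ["AND", "OR", "NOT"], rng))
```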

We now describe this process using an example, and then describe each step of the process.

Example 1. To demonstrate this approach, we study a suite of auto-generated electronic combinational circuits, which are constructed from simple gates. The inputs to the generation process consist of: (a) a component library; (b) parameters defining the system properties, such as the number n of components; and (c) domain-dependent parameters, such as the Rent parameter ξ. As an example, Figure 3 shows several of the gates that we use in our component library, together with truth-tables for the gates (as one method of describing the functionality of each gate). We study networks with n = 50, 60, 70 and 80 components, and generate circuits using a domain-dependent Rent parameter of ξ = 0.5.

Figure 3: Partial component library for the combinatorial digital circuit domain. Each gate also has an associated truth-table defining the gate's functionality.

Example 2. Figure 4 shows the schematic of a simple circuit with components A, B, C, D and E. The circuit has two inputs, I_1 and I_2, and the output of component i is denoted by O_i.

Figure 4: Schematic of simple electronic circuit.

Figure 5 shows the process of transforming this schematic into an MBD-graph, which is the basis for constructing a diagnostic model. We first translate the schematic into a topology graph, which makes the graphical topology of the circuit explicit by denoting each component as a node, the inputs as nodes, and the wires linking inputs to components or components to components as directed edges.(5) Next we replace each component X in the topology graph with a pair (C_X, O_X), which denotes the mode and output of component X, respectively. We introduce the mode-variables for each component in order to diagnose the fault status of each component. Further, this new structure of the MBD-graph enables us to define (model-based) behavioural equations for each component.

(5) This is the graphical framework used for the small-world analyses of electronic systems in [i Cancho et al., 2001].

A key difference between generic circuit models and diagnosis models is that the diagnosis models explicitly encode failure modes and the functional effects of failure modes. As a consequence, the structure of a diagnostic model is slightly different from the structure of the corresponding electronic circuit, since a diagnostic model explicitly encodes the failure modes of components. We formally specify the topology graph G and MBD-graph G′ as follows.

Definition 5 (Topology graph). A topology graph G(V, E) for a system ⟨COMPS, SD, OBS⟩ is a directed graph corresponding to the system structure S. Hence in G(V, E): (a) the nodes V consist of a collection of nodes corresponding to system components (χ), system inputs (η) and outputs (ζ), i.e., V = χ ∪ η ∪ ζ; and (b) the edges correspond to connections between two component-nodes, between an input-node and a component-node, or between a component-node and an output-node, i.e., E = {(χ_i, χ_j)} ∪ {(η_i, χ_k)} ∪ {(χ_l, ζ_m)}, for χ_i, χ_j, χ_k, χ_l ∈ χ, η_i ∈ η, and ζ_m ∈ ζ.
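The node-splitting step of the MBD-graph definition that follows is mechanical. A sketch, assuming networkx and a plausible wiring for the circuit of Figures 4-5 (the exact edge list is an assumption for illustration):

```python
import networkx as nx

def to_mbd_graph(G, components):
    """Replace each component node X by the pair (C_X, O_X) with the edge
    C_X -> O_X; every wire (u, v) is redirected to the output nodes."""
    out = lambda v: f"O_{v}" if v in components else v   # inputs stay as-is
    Gp = nx.DiGraph()
    for X in components:
        Gp.add_edge(f"C_{X}", f"O_{X}")     # mode variable drives the output
    for u, v in G.edges:
        Gp.add_edge(out(u), out(v))
    return Gp

# assumed wiring for the circuit of Figures 4-5
topo = nx.DiGraph([("I1", "A"), ("A", "B"), ("A", "C"), ("I2", "C"),
                   ("B", "D"), ("C", "E")])
mbd = to_mbd_graph(topo, components={"A", "B", "C", "D", "E"})
```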
Definition 6 (MBD-graph). An MBD-graph G′(V′, E′) is a topology graph G(V, E) in which each component node χ_i ∈ V is replaced with a subgraph consisting of the node for the corresponding component-output O_i, the node corresponding to the component-mode C_i, and the directed edge (C_i, O_i).(6) Hence in G′(V′, E′): (a) the nodes V′ consist of a collection of nodes corresponding to system component-outputs (O), mode-variables (COMPS), system inputs (η) and outputs (ζ), i.e., V′ = O ∪ COMPS ∪ η ∪ ζ; and (b) the edges correspond to connections between two component-output-nodes, between an input-node and a component-node, or between a component-node and an output-node, i.e., E′ = {(O_i, O_j)} ∪ {(η_i, O_k)} ∪ {(O_l, ζ_m)} ∪ {(C_i, O_i)}, for O_i, O_j, O_k, O_l ∈ O, η_i ∈ η, C_i ∈ COMPS, and ζ_m ∈ ζ.

(6) Figure 5 shows this replacement process.

5.1 Generate Graph Structure for G

We generate a small-world graph (SWG) using the approach of Watts and Strogatz [1998], as this methodology has been shown to generate graphs with mean distance and clustering coefficient that closely match real-world systems [Newman, 2003]. The Watts and Strogatz approach generates a graph G with a pre-specified degree of randomness that is controlled by a probability p ∈ [0, 1]: p = 0 corresponds to a regular graph, and p = 1 corresponds to an Erdős-Rényi random graph; graphs with real-world structure (SWGs) occur

roughly in the range 0.01 ≤ p ≤ 0.8, as has been determined by empirically comparing the C and L parameters of generated graphs and actual networks [Newman, 2003].

Figure 5: Transforming the topology graph of a simple electronic circuit into a model-based diagnosis graph: (a) topology graph; (b) translation to MBD graph; (c) MBD graph.

As noted earlier, a node in G corresponds to a system component, and an edge linking nodes i and j in G corresponds to a wire between components i and j. We randomly assign a set O of nodes to be observable inputs and outputs, choosing O based on system size, using Rent's rule. More precisely, we use |O| = k·n^0.5, taking the Rent parameter ξ = 0.5 and k as the mean node degree.

We can summarise the graph generation process as follows. Figure 6 depicts this process, where we control the proportion of random edges using a rewiring probability p. We start with a regular graph (a ring lattice of n nodes), where each node is connected to its k nearest neighbors. We then introduce random edges, i.e., with probability p we randomly rewire an edge by moving one of its ends to a new position chosen at random from the rest of the lattice [Watts and Strogatz, 1998]. We characterise a SWG H that is generated using the representation H(n, k, p).

Figure 6: Generating a small-world graph from a regular ring lattice with rewiring probability p (p = 0: large L, large C; increasing randomness: small L, large C; p = 1: small L, small C).

5.2 Assign Components to graph G

Given a topology graph G, we associate to each node in G a component, based on the number of incoming arcs for the node. Given a SWG node with i inputs and o outputs, we assign a component, denoted Ψ_Z(i, o, τ, B, w), where τ denotes the type (e.g., AND-gate, OR-gate), B defines the behavioural equations of component Z, and w the weights assigned to the failure modes of Z.

Example 3. For our experiments, we use a set of digital comparator components, as shown in Figure 3. We have also extended the library with selector components, which we characterise as follows: a j-of-k gate will output t if at least j out of the k inputs are t. Given a node that has q possible components that are suitable, we randomly select a component with probability 1/q. For example, the single-input nodes correspond to single-input gates (NOT, buffer), and the dual-input nodes correspond to dual-input gates (AND, OR, NAND, NOR, XOR).

5.3 Generate the System Description

Given a selected component, we then generate its normal-mode equations (and potentially failure-mode equations). We randomly select the mode type (of the k possible failure modes) for any component-model with probability 1/k. We assign weights to failure-mode values by assuming that normal behaviour is highly likely, i.e., Pr{C_i = OK} ≈ 0.99, and faulty behaviour is unlikely, i.e., Pr{C_i ≠ OK} ≈ 0.01.
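Putting Sections 5.2-5.3 together for the circuit domain, a sketch of one component/fault-mode instantiation; the library contents, fault-mode names and probabilities follow the text and the Example 4 that follows, while everything else is illustrative:

```python
import random

LIBRARY = {1: ["NOT", "BUFFER"],                     # single-input gates
           2: ["AND", "OR", "NAND", "NOR", "XOR"]}   # dual-input gates
FAULT_MODES = ["SA0", "SA1", "INVERT"]

def instantiate(indegrees, rng):
    """For each node: a gate chosen with probability 1/q among the q gates
    matching its in-degree, one fault-mode type with probability 1/k, and
    the mode weights Pr{OK} ~= 0.99, Pr{not OK} ~= 0.01 from Section 5.3."""
    return {v: {"gate": rng.choice(LIBRARY[d]),
                "fault_mode": rng.choice(FAULT_MODES),
                "P_ok": 0.99, "P_fault": 0.01}
            for v, d in indegrees.items()}

# in-degrees as in the Figure 4 circuit (A, B, D, E single-input; C dual-input)
print(instantiate({"A": 1, "B": 1, "C": 2, "D": 1, "E": 1}, random.Random(0)))
```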
Given this information, we can generate a system description with equations corresponding to the component-types and fault-mode types as just described. For 224 DX'06 - Peñaranda de Duero, Burgos (Spain)

Comparing diagnosability in Continuous and Discrete-Event Systems

Comparing diagnosability in Continuous and Discrete-Event Systems Comparing diagnosability in Continuous and Discrete-Event Systems Marie-Odile Cordier IRISA, Université de Rennes 1 Rennes, France Louise Travé-Massuyès and Xavier Pucel LAAS-CNRS Toulouse, France Abstract

More information

Comparing diagnosability in Continuous and Discrete-Event Systems

Comparing diagnosability in Continuous and Discrete-Event Systems Comparing diagnosability in Continuous and Discrete-Event Systems Marie-Odile Cordier INRIA Rennes, France Louise Travé-Massuyès and Xavier Pucel LAAS-CNRS Toulouse, France Abstract This paper is concerned

More information

Hybrid automaton incremental construction for online diagnosis

Hybrid automaton incremental construction for online diagnosis Hybrid automaton incremental construction for online diagnosis Jorge Vento, Louise Travé-Massuyès 2,3, Ramon Sarrate and Vicenç Puig Advanced Control Systems (SAC), Universitat Politècnica de Catalunya

More information

Monitoring and Active Diagnosis for Discrete-Event Systems

Monitoring and Active Diagnosis for Discrete-Event Systems Monitoring and Active Diagnosis for Discrete-Event Systems Elodie Chanthery, Yannick Pencolé LAAS-CNRS, University of Toulouse, Toulouse, France (e-mail: [elodie.chanthery, yannick.pencole]@laas.fr) University

More information

Active Diagnosis of Hybrid Systems Guided by Diagnosability Properties

Active Diagnosis of Hybrid Systems Guided by Diagnosability Properties Active Diagnosis of Hybrid Systems Guided by Diagnosability Properties Application to autonomous satellites Louise Travé-Massuyès 5 February 29 Motivation Control and autonomy of complex dynamic systems

More information

Semi-asynchronous. Fault Diagnosis of Discrete Event Systems ALEJANDRO WHITE DR. ALI KARIMODDINI OCTOBER

Semi-asynchronous. Fault Diagnosis of Discrete Event Systems ALEJANDRO WHITE DR. ALI KARIMODDINI OCTOBER Semi-asynchronous Fault Diagnosis of Discrete Event Systems ALEJANDRO WHITE DR. ALI KARIMODDINI OCTOBER 2017 NC A&T State University http://www.ncat.edu/ Alejandro White Semi-asynchronous http://techlav.ncat.edu/

More information

An Efficient Algorithm for Finding Over-constrained Sub-systems for Construction of Diagnostic Tests

An Efficient Algorithm for Finding Over-constrained Sub-systems for Construction of Diagnostic Tests An Efficient Algorithm for Finding Over-constrained Sub-systems for Construction of Diagnostic Tests Mattias Krysander, Jan Åslund, Mattias Nyberg Linköpings universitet, Dept. of Electrical Engineering

More information

Hybrid system diagnosis: Test of the diagnoser HYDIAG on a benchmark of the international diagnostic competition DXC 2011

Hybrid system diagnosis: Test of the diagnoser HYDIAG on a benchmark of the international diagnostic competition DXC 2011 Hybrid system diagnosis: Test of the diagnoser HYDIAG on a benchmark of the international diagnostic competition DXC 2011 M. Maiga 1,2, E. Chanthery 1,3, L. Travé-Massuyès 1,2 1 CNRS, LAAS, 7 avenue du

More information

Structural Observability. Application to decompose a System with Possible Conflicts.

Structural Observability. Application to decompose a System with Possible Conflicts. Structural Observability. Application to decompose a System with Possible Conflicts. Noemi Moya, Gautam Biswas 2, Carlos J. Alonso-Gonzalez, and Xenofon Koutsoukos 2 Department of Computer Science, University

More information

Diagnosability Analysis of Discrete Event Systems with Autonomous Components

Diagnosability Analysis of Discrete Event Systems with Autonomous Components Diagnosability Analysis of Discrete Event Systems with Autonomous Components Lina Ye, Philippe Dague To cite this version: Lina Ye, Philippe Dague. Diagnosability Analysis of Discrete Event Systems with

More information

A Structural Model Decomposition Framework for Systems Health Management

A Structural Model Decomposition Framework for Systems Health Management A Structural Model Decomposition Framework for Systems Health Management Indranil Roychoudhury SGT Inc., NASA Ames Research Center Moffett Field, CA 9405, USA indranil.roychoudhury@nasa.gov Matthew Daigle

More information

CONTROL SYSTEMS, ROBOTICS AND AUTOMATION Vol. XVI - Qualitative Methods for Fault Diagnosis - Jan Lunze QUALITATIVE METHODS FOR FAULT DIAGNOSIS

CONTROL SYSTEMS, ROBOTICS AND AUTOMATION Vol. XVI - Qualitative Methods for Fault Diagnosis - Jan Lunze QUALITATIVE METHODS FOR FAULT DIAGNOSIS QUALITATIVE METHODS FOR FAULT DIAGNOSIS Jan Lunze Ruhr University Bochum,, Germany Keywords: Assumption-Based Truth Maintenance System, Consistency-based Diagnosis, Discrete Event System, General Diagnostic

More information

Monitoring-based Diagnosis of Discrete-Event Systems with Uncertain Observations

Monitoring-based Diagnosis of Discrete-Event Systems with Uncertain Observations Monitoring-based Diagnosis of Discrete-Event Systems with Uncertain Observations Gianfranco Lamperti Marina Zanella Dipartimento di Elettronica per l Automazione Università di Brescia, Italy lamperti@ing.unibs.it

More information

FAULT DETECTION AND FAULT TOLERANT APPROACHES WITH AIRCRAFT APPLICATION. Andrés Marcos

FAULT DETECTION AND FAULT TOLERANT APPROACHES WITH AIRCRAFT APPLICATION. Andrés Marcos FAULT DETECTION AND FAULT TOLERANT APPROACHES WITH AIRCRAFT APPLICATION 2003 Louisiana Workshop on System Safety Andrés Marcos Dept. Aerospace Engineering and Mechanics, University of Minnesota 28 Feb,

More information

Semi-asynchronous Fault Diagnosis of Discrete Event Systems

Semi-asynchronous Fault Diagnosis of Discrete Event Systems 1 Semi-asynchronous Fault Diagnosis of Discrete Event Systems Alejandro White, Student Member, IEEE, Ali Karimoddini, Senior Member, IEEE Abstract This paper proposes a diagnostics tool for a Discrete-

More information

A Model-Based Framework for Stochastic Diagnosability

A Model-Based Framework for Stochastic Diagnosability A Model-Based Framework for Stochastic Diagnosability Gregory Provan Computer Science Department, University College Cork, Cork, Ireland email: g.provan@cs.ucc.ie Abstract We propose a model-based framework

More information

Overview Model-based Debugging Basic Ideas and Relationships

Overview Model-based Debugging Basic Ideas and Relationships Overview Model-based Debugging Basic Ideas and Relationships Franz Wotawa TU Graz, Institute for SoftwareTechnology Inffeldgasse 16b/, A-8010 Graz, Austria wotawa@ist.tugraz.at Dagstuhl Seminar Understanding

More information

Intersection Based Decentralized Diagnosis: Implementation and Verification

Intersection Based Decentralized Diagnosis: Implementation and Verification Intersection Based Decentralized Diagnosis: Implementation and Verification Maria Panteli and Christoforos N. Hadjicostis Abstract We consider decentralized diagnosis in discrete event systems that are

More information

A Qualitative Approach to Multiple Fault Isolation in Continuous Systems

A Qualitative Approach to Multiple Fault Isolation in Continuous Systems A Qualitative Approach to Multiple Fault Isolation in Continuous Systems Matthew Daigle and Xenofon Koutsoukos and Gautam Biswas Institute for Software Integrated Systems (ISIS) Department of Electrical

More information

Design and Analysis of Diagnostic Systems Utilizing Structural Methods

Design and Analysis of Diagnostic Systems Utilizing Structural Methods Linköping Studies in Science and Technology Thesis No. 1038 Design and Analysis of Diagnostic Systems Utilizing Structural Methods Mattias Krysander Department of Electrical Engineering Linköpings universitet,

More information

IN THIS paper we investigate the diagnosability of stochastic

IN THIS paper we investigate the diagnosability of stochastic 476 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL 50, NO 4, APRIL 2005 Diagnosability of Stochastic Discrete-Event Systems David Thorsley and Demosthenis Teneketzis, Fellow, IEEE Abstract We investigate

More information

FAULT diagnosis is crucial for ensuring the safe operation

FAULT diagnosis is crucial for ensuring the safe operation A Qualitative Event-based Approach to Continuous Systems Diagnosis Matthew J. Daigle Member, IEEE, Xenofon D. Koutsoukos Senior Member, IEEE, and Gautam Biswas Senior Member, IEEE Abstract Fault diagnosis

More information

Diagnosis of Dense-Time Systems using Digital-Clocks

Diagnosis of Dense-Time Systems using Digital-Clocks Diagnosis of Dense-Time Systems using Digital-Clocks Shengbing Jiang GM R&D and Planning Mail Code 480-106-390 Warren, MI 48090-9055 Email: shengbing.jiang@gm.com Ratnesh Kumar Dept. of Elec. & Comp. Eng.

More information

Scalable Diagnosability Checking of Event-Driven Systems

Scalable Diagnosability Checking of Event-Driven Systems Scalable Diagnosability Checking of Event-Driven Systems Anika Schumann The Australian National University, National ICT Australia anika.schumann@anu.edu.au Yannick Pencolé National Center for Scientific

More information

Coloured Petri Nets Based Diagnosis on Causal Models

Coloured Petri Nets Based Diagnosis on Causal Models Coloured Petri Nets Based Diagnosis on Causal Models Soumia Mancer and Hammadi Bennoui Computer science department, LINFI Lab. University of Biskra, Algeria mancer.soumia@gmail.com, bennoui@gmail.com Abstract.

More information

Decentralized Diagnosis of Discrete Event Systems using Unconditional and Conditional Decisions

Decentralized Diagnosis of Discrete Event Systems using Unconditional and Conditional Decisions Decentralized Diagnosis of Discrete Event Systems using Unconditional and Conditional Decisions Yin Wang, Tae-Sic Yoo, and Stéphane Lafortune Abstract The past decade has witnessed the development of a

More information

STRUCTURAL DESIGN OF SYSTEMS WITH SAFE BEHAVIOR UNDER SINGLE AND MULTIPLE FAULTS. and Marcel Staroswiecki

STRUCTURAL DESIGN OF SYSTEMS WITH SAFE BEHAVIOR UNDER SINGLE AND MULTIPLE FAULTS. and Marcel Staroswiecki STRUCTURAL DESIGN OF SYSTEMS WITH SAFE BEHAVIOR UNDER SINGLE AND MULTIPLE FAULTS Mogens Blanke and Marcel Staroswiecki Automation at Ørsted DTU, Build. 326, Technical University of Denmark, Kgs. Lyngby,

More information

Efficient Simulation of Hybrid Systems: A Hybrid Bond Graph Approach

Efficient Simulation of Hybrid Systems: A Hybrid Bond Graph Approach Efficient Simulation of Hybrid Systems: A Hybrid Bond Graph Approach Indranil Roychoudhury, Matthew J. Daigle, Gautam Biswas, and Xenofon Koutsoukos SGT Inc., NASA Ames Research Center, Moffett Field,

More information

A Theory of Meta-Diagnosis: Reasoning about Diagnostic Systems

A Theory of Meta-Diagnosis: Reasoning about Diagnostic Systems Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence A Theory of Meta-Diagnosis: Reasoning about Diagnostic Systems Nuno Belard 1,2,3 nuno.belard@airbus.com Yannick

More information

Decentralized Failure Diagnosis of Discrete Event Systems

Decentralized Failure Diagnosis of Discrete Event Systems IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS PART A: SYSTEMS AND HUMANS, VOL., NO., 2005 1 Decentralized Failure Diagnosis of Discrete Event Systems Wenbin Qiu, Student Member, IEEE, and Ratnesh Kumar,

More information

An Integrated Framework for Distributed Diagnosis of Process and Sensor Faults

An Integrated Framework for Distributed Diagnosis of Process and Sensor Faults An Integrated Framework for Distributed Diagnosis of Process and Sensor Faults Anibal Bregon University of Valladolid Valladolid, 47011, Spain anibal@infor.uva.es Matthew Daigle NASA Ames Research Center

More information

Generalized Micro-Structures for Non-Binary CSP

Generalized Micro-Structures for Non-Binary CSP Generalized Micro-Structures for Non-Binary CSP Achref El Mouelhi LSIS - UMR CNRS 7296 Aix-Marseille Université Avenue Escadrille Normandie-Niemen 13397 Marseille Cedex 20 (France) achref.elmouelhi@lsis.org

More information

Diagnosis of Discrete-Event Systems Using Satisfiability Algorithms

Diagnosis of Discrete-Event Systems Using Satisfiability Algorithms Diagnosis of Discrete-Event Systems Using Satisfiability Algorithms Alban Grastien NICTA & Australian National University Canberra, Australia Anbulagan NICTA & Australian National University Canberra,

More information

Combination of analytical and statistical models for dynamic systems fault diagnosis

Combination of analytical and statistical models for dynamic systems fault diagnosis Annual Conference of the Prognostics and Health Management Society, 21 Combination of analytical and statistical models for dynamic systems fault diagnosis Anibal Bregon 1, Diego Garcia-Alvarez 2, Belarmino

More information

DISTINGUING NON-DETERMINISTIC TIMED FINITE STATE MACHINES

DISTINGUING NON-DETERMINISTIC TIMED FINITE STATE MACHINES DISTINGUING NON-DETERMINISTIC TIMED FINITE STATE MACHINES Maxim Gromov 1, Khaled El-Fakih 2, Natalia Shabaldina 1, Nina Yevtushenko 1 1 Tomsk State University, 36 Lenin Str.. Tomsk, 634050, Russia gromov@sibmail.com,

More information

A COMBINED QUALITATIVE/QUANTITATIVE APPROACH FOR FAULT ISOLATION IN CONTINUOUS DYNAMIC SYSTEMS

A COMBINED QUALITATIVE/QUANTITATIVE APPROACH FOR FAULT ISOLATION IN CONTINUOUS DYNAMIC SYSTEMS A COMBINED QUALITATIVE/QUANTITATIVE APPROACH FOR FAULT ISOLATION IN CONTINUOUS DYNAMIC SYSTEMS EricJ. Manders Sriram Narasimhan Gautam Biswas Pieter J. Mosterman Department of Electrical Engineering and

More information

A sequential test selection algorithm for fault isolation

A sequential test selection algorithm for fault isolation A sequential test selection algorithm for fault isolation Daniel Eriksson, Erik Frisk, and Mattias Krysander. Department of Electrical Engineering, Linköping University, Sweden {daner, frisk, matkr}@isy.liu.se.

More information

Sensor Placement for Fault Diagnosis Based On Causal Computations

Sensor Placement for Fault Diagnosis Based On Causal Computations Preprints of the 7th IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes Barcelona, Spain, June 30 - July 3, 2009 Sensor Placement for Fault Diagnosis Based On Causal Computations

More information

A SYSTEMATIC INCLUSION OF DIAGNOSIS PERFORMANCE IN FAULT TREE ANALYSIS

A SYSTEMATIC INCLUSION OF DIAGNOSIS PERFORMANCE IN FAULT TREE ANALYSIS SYSTEMTIC INCLUSION OF DIGNOSIS PERFORMNCE IN FULT TREE NLYSIS Jan Åslund, Jonas Biteus, Erik Frisk, Mattias Kryser, Lars Nielsen Department of Electrical Engineering, Linköpings universitet, 58 83 Linköping,

More information

On the Design of Adaptive Supervisors for Discrete Event Systems

On the Design of Adaptive Supervisors for Discrete Event Systems On the Design of Adaptive Supervisors for Discrete Event Systems Vigyan CHANDRA Department of Technology, Eastern Kentucky University Richmond, KY 40475, USA and Siddhartha BHATTACHARYYA Division of Computer

More information

AFAULT diagnosis procedure is typically divided into three

AFAULT diagnosis procedure is typically divided into three 576 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 47, NO. 4, APRIL 2002 A Robust Detection and Isolation Scheme for Abrupt and Incipient Faults in Nonlinear Systems Xiaodong Zhang, Marios M. Polycarpou,

More information

A Scalable Jointree Algorithm for Diagnosability

A Scalable Jointree Algorithm for Diagnosability A Scalable Jointree Algorithm for Diagnosability Anika Schumann Advanced Computing Research Centre University of South Australia Mawson Lakes, SA 5095, Australia anika.schumann@cs.unisa.edu.au Jinbo Huang

More information

Diagnosis (06) Diagnosis by Chronicles. Alban Grastien

Diagnosis (06) Diagnosis by Chronicles. Alban Grastien Diagnosis (06) Diagnosis by Chronicles Alban Grastien alban.grastien@rsise.anu.edu.au 1 Systems to Diagnose 2 Chronicle 3 Chronicle Recognition 4 Generation of Chronicles 1 Systems to Diagnose 2 Chronicle

More information

A Qualitative Model for Hybrid Control

A Qualitative Model for Hybrid Control A Qualitative Model for Hybrid Control Wolfgang Kleissl, Michael W. Hofbaur Graz University of Technology Institute of Automation and Control Inffeldgasse 6c/2, A-8 Graz, Austria kleissl@irt.tu-graz.ac.at,

More information

PRUNING PROCEDURE FOR INCOMPLETE-OBSERVATION DIAGNOSTIC MODEL SIMPLIFICATION

PRUNING PROCEDURE FOR INCOMPLETE-OBSERVATION DIAGNOSTIC MODEL SIMPLIFICATION PRUNING PROCEDURE FOR INCOMPLETE-OBSERVATION DIAGNOSTIC MODEL SIMPLIFICATION Ivan Havel Department of Cybernetics Faculty of Electrical Engineering, Czech Technical University Karlovo náměstí 13, 121 35

More information

FOURIER-MOTZKIN METHODS FOR FAULT DIAGNOSIS IN DISCRETE EVENT SYSTEMS

FOURIER-MOTZKIN METHODS FOR FAULT DIAGNOSIS IN DISCRETE EVENT SYSTEMS FOURIER-MOTZKIN METHODS FOR FAULT DIAGNOSIS IN DISCRETE EVENT SYSTEMS by AHMED KHELFA OBEID AL-AJELI A thesis submitted to The University of Birmingham for the degree of DOCTOR OF PHILOSOPHY School of

More information

DX th International Workshop on Principles of Diagnosis. May 29-31, 2007 Nashville, TN, USA. Editors:

DX th International Workshop on Principles of Diagnosis. May 29-31, 2007 Nashville, TN, USA. Editors: DX-07 18th International Workshop on Principles of Diagnosis May 29-31, 2007 Nashville, TN, USA Editors: Gautam Biswas Xenofon Koutsoukos Sherif Abdelwahed DX-07 18 th International Workshop on Principles

More information

Integrating induced knowledge in an expert fuzzy-based system for intelligent motion analysis on ground robots

Integrating induced knowledge in an expert fuzzy-based system for intelligent motion analysis on ground robots Integrating induced knowledge in an expert fuzzy-based system for intelligent motion analysis on ground robots José M. Alonso 1, Luis Magdalena 1, Serge Guillaume 2, Miguel A. Sotelo 3, Luis M. Bergasa

More information

A Comprehensive Diagnosis Methodology for Complex Hybrid Systems: A Case Study on Spacecraft Power Distribution Systems

A Comprehensive Diagnosis Methodology for Complex Hybrid Systems: A Case Study on Spacecraft Power Distribution Systems 1 A Comprehensive Diagnosis Methodology for Complex Hybrid Systems: A Case Study on Spacecraft Power Distribution Systems Matthew J. Daigle Member, IEEE, Indranil Roychoudhury Member, IEEE, Gautam Biswas

More information

Parameter iden+fica+on with hybrid systems in a bounded- error framework

Parameter iden+fica+on with hybrid systems in a bounded- error framework Parameter iden+fica+on with hybrid systems in a bounded- error framework Moussa MAIGA, Nacim RAMDANI, & Louise TRAVE- MASSUYES Université d Orléans, Bourges, and LAAS CNRS Toulouse, France.!! SWIM 2015,

More information

FAULT ISOLABILITY WITH DIFFERENT FORMS OF THE FAULTS SYMPTOMS RELATION

FAULT ISOLABILITY WITH DIFFERENT FORMS OF THE FAULTS SYMPTOMS RELATION Int. J. Appl. Math. Comput. Sci., 2016, Vol. 26, No. 4, 815 826 DOI: 10.1515/amcs-2016-0058 FAULT ISOLABILITY WITH DIFFERENT FORMS OF THE FAULTS SYMPTOMS RELATION JAN MACIEJ KOŚCIELNY a,, MICHAŁ SYFERT

More information

Analysis of Petri Net Based Control Algorithms. Georg Frey. Proceedings of SDPS, Fifth World Conference on Integrated Design and Process Technologies, IEEE International Conference on Systems Integration, Dallas.

Detection and Estimation of Multiple Fault Profiles Using Generalized Likelihood Ratio Tests: A Case Study. Preprints of the 16th IFAC Symposium on System Identification.

Fault Detection and Isolation of the Wind Turbine Benchmark: an Estimation-based Approach. Xiaodong Zhang, Qi Zhang, Songling Zhao, Riccardo Ferrari, Marios M. Polycarpou, and Thomas Parisini. Milano, Italy, August-September 2011.

Runtime Verification of Stochastic, Faulty Systems. Cristina M. Wilcox and Brian C. Williams (Massachusetts Institute of Technology, Cambridge, MA 02141, USA; cwilcox@alum.mit.edu, williams@csail.mit.edu).

Model Based Fault Detection and Diagnosis Using Structured Residual Approach in a Multi-Input Multi-Output System. A. Asokan. Serbian Journal of Electrical Engineering, Vol. 4, No. 2, November 2007, pp. 133-145.

Toward Computing Conflict-Based Diagnoses in Probabilistic Logic Programming. Arjen Hommersom (Open University of the Netherlands and Radboud University Nijmegen) and Marcos L.P. Bueno (Radboud University Nijmegen, The Netherlands).

Qualitative Dynamic Behavior of Physical System Models With Algebraic Loops. Pieter J. Mosterman. Presented at the Eleventh International Workshop on Principles of Diagnosis, Mexico, June 2000.

Introduction to Discrete Event Systems. Stéphane Lafortune. Lecture notes for EECS 661, Chapter 1, Department of Electrical Engineering and Computer Science, University of Michigan, August 2006.

Fault Diagnosis for an Unknown Plant. Mohammad Mahdi Karimi (PhD candidate, ECE; supervisor: Dr. Ali Karimoddini; mmkarimi@aggies.ncat.edu). North Carolina Agricultural and Technical State University, November 2015.

Safety Guided Design of Crew Return Vehicle in Concept Design Phase Using STAMP/STPA. Haruka Nakao (Japan Manned Space Systems Corporation), Masa Katahira, Yuko Miyamoto, and Nancy Leveson.

Multi-Decision Decentralized Prognosis of Failures in Discrete Event Systems. Ahmed Khoumsi and Hicham Chakib. 2009 American Control Conference, St. Louis, MO, USA, June 10-12, 2009, paper FrB11.4.

Optimized diagnosability of distributed discrete event systems through abstraction. Lina Ye.

A Bayesian Approach to Efficient Diagnosis of Incipient Faults. Indranil Roychoudhury, Gautam Biswas, and Xenofon Koutsoukos (Institute for Software Integrated Systems (ISIS), Vanderbilt University).

General Diagnostic Engine: a computational approach to CBD. Model-based diagnosis: the DX approach. 1. Introduction; 2. Diagnosis via constraint propagation and dependency recording.

Towards Co-Engineering Communicating Autonomous Cyber-Physical Systems. M.C. Bujorianu and M.L. Bujorianu. MIMS EPrint 2009.53, Manchester Institute for Mathematical Sciences, School of Mathematics, 2009.

Conflicts versus analytical redundancy relations: a comparative analysis of the model-based diagnosis approach from the Artificial Intelligence and Automatic Control perspectives. Manuscript SMCB-E-07152002-0280.

Integrated Fault Diagnosis Based on Petri Net Models. 16th IEEE International Conference on Control Applications, part of the IEEE Multi-conference on Systems and Control, Singapore, 1-3 October 2007, paper TuC05.3.

Plan diagnosis with agents. Nico Roos (IKAT, Universiteit Maastricht, P.O. Box 616, NL-6200 MD Maastricht; roos@cs.unimaas.nl) and Cees Witteveen (Faculty EEMCS, Delft University of Technology).

A Probabilistic Method for Certification of Analytically Redundant Systems. Bin Hu and Peter Seiler.

Causality in Concurrent Systems. F. Russo (Vrije Universiteit Brussel, Belgium) and S. Crafa (Università di Padova, Italy). HaPoC, 31 October 2013, Paris.

Supervisory Control of Hybrid Systems. X.D. Koutsoukos, P.J. Antsaklis, J.A. Stiver, and M.D. Lemmon. In Proceedings of the IEEE, Special Issue on Hybrid Systems: Theory and Applications.

Active Fault Diagnosis for Uncertain Systems. Davide M. Raimondo, Joseph K. Scott, Richard D. Braatz, Roberto Marseglia, Lalo Magni, and Rolf Findeisen.

A Quantum Computing Approach to the Verification and Validation of Complex Cyber-Physical Systems: Achieving Quality and Cost Control in the Development of Enormous Systems. Safe and Secure Systems and Software.

Deviation Models Revisited. Peter Struss (Computer Science Dept., Technische Universität München, Boltzmannstr. 3, D-85748 Garching, Germany, and OCC'M Software GmbH; struss@in.tum.de, struss@occm.de).

Satellite Platform Case Study with SLIM and COMPASS. Viet Yen Nguyen; joint work with Marie-Aude Esteve, Joost-Pieter Katoen, Bart Postma, and Yuri Yushtein.

Failure prognostics in a particle filtering framework: application to a PEMFC stack. Marine Jouin, Rafael Gouriveau, Daniel Hissel, Noureddine Zerhouni, and Marie-Cécile Péra (FEMTO-ST Institute, UMR CNRS 6174).

Model-Based Systems. Bernhard Peischl, Neal Snooke, Gerald Steinbauer, and Cees Witteveen. The Model-Based Systems (MBS) paradigm refers to a methodology that allows for the description of various kinds of systems for various tasks in a uniform way.

Distributed Data Fusion with Kalman Filters. Simon Julier (Computer Science Department, University College London; S.Julier@cs.ucl.ac.uk). Talk covering motivation, Kalman filters, and the double-counting problem.

Achieving Fault-tolerance and Safety of Discrete-event Systems through Learning. Jin Dai, Ali Karimoddini, et al. 2016 American Control Conference (ACC), Boston, MA, USA, July 6-8, 2016.

Designing and Evaluating Generic Ontologies. Michael Grüninger (Department of Industrial Engineering, University of Toronto; gruninger@ie.utoronto.ca). August 28, 2007.

Finally the Weakest Failure Detector for Non-Blocking Atomic Commit. Rachid Guerraoui and Petr Kouznetsov (Distributed Programming Laboratory, EPFL).

Diagnosis of Repeated/Intermittent Failures in Discrete Event Systems. Shengbing Jiang, Ratnesh Kumar, and Humberto E. Garcia.

Analysis of Temporal Dependencies of Perceptions and Influences for the Distributed Execution of Agent-Oriented Simulations. Nicolas Sébastien, Rémy Courdier, Didier Hoareau, and Marc-Philippe Huget (EA2525 LIM-IREMIA).

Resolution of Initial-State in Security Applications of DES. Christoforos N. Hadjicostis.

Fault Detection and Diagnosis for a Three-tank System Using Structured Residual Approach. A. Asokan and D. Sivakumar (Department of Instrumentation Engineering, Faculty of Engineering & Technology, Annamalai University).

Logic-Based Abduction. Reinhard Pichler (Institut für Informationssysteme, Arbeitsbereich DBAI, Technische Universität Wien). Complexity Theory, VU 181.142, SS 2018, Lecture 7.

Linear Time Logic Control of Discrete-Time Linear Systems. Paulo Tabuada. University of Pennsylvania ScholarlyCommons, Departmental Papers (ESE), December 2006.

A New Kind of Language for Complex Engineering Systems. Case study: NASA's Apollo Program. Benjamin Koo and Edward F. Crawley (Engineering Systems Division). NKS 2004.

Reliability Analysis of Electronic Systems Using Markov Models. István Matijevics (Polytechnical Engineering College, Subotica, Serbia and Montenegro; matistvan@yahoo.com) and Zoltán Jeges.

Set-membership estimation of hybrid dynamical systems: towards model-based FDI for hybrid systems. Nacim Ramdani (Université d'Orléans, Bourges, France; nacim.ramdani@univ-orleans.fr). ECC14.

Optimizing the system observability level for diagnosability. Laura Brandán Briones, Alexander Lazovik, and Philippe Dague (LRI, Univ. Paris-Sud, CNRS, Parc Club Orsay Université, 4 rue Jacques Monod).

A Note on Backward Dual Resolution and Its Application to Proving Completeness of Rule-Based Systems. Antoni Ligeza (Institute of Automatics, AGH, al. Mickiewicza 30, 30-059 Krakow, Poland).

Interactive ontology debugging: two query strategies for efficient fault localization. Kostyantyn Shchekotykhin, Gerhard Friedrich, Philipp Fleiss, and Patrick Rodler (Alpen-Adria Universität).

Safety Verification of Fault Tolerant Goal-based Control Programs with Estimation Uncertainty. Julia Braman. 2008 American Control Conference, Seattle, Washington, USA, June 11-13, 2008, paper WeAI01.6.

A Multiple-Observer Scheme for Fault Detection, Isolation and Recovery of Satellite Thrusters. Proceedings of EuroGNC 2013, 2nd CEAS Specialist Conference on Guidance, Navigation & Control, Delft University of Technology, Delft, The Netherlands, April 2013, paper FrBT.3.

Stochastic Petri Net. Yue (Cindy) Ben, 2013/05/08. A study of the formal model: definition and history; the family tree of branches and extensions; advantages and disadvantages of each; applications.

A New Evaluation of Forward Checking and its Consequences on Efficiency of Tools for Decomposition of CSPs. Philippe Jégou and Samba Ndojh Ndiaye. 2008 20th IEEE International Conference on Tools with Artificial Intelligence.

Model-based Test Generation for Embedded Software. M. Esser and P. Struss (Technische Universität München). In: DX'06, 17th International Workshop on Principles of Diagnosis, June 26-28, 2006, Peñaranda de Duero, Burgos, Spain.

Encoding formulas with partially constrained weights in a possibilistic-like many-sorted propositional logic. Salem Benferhat (CRIL-CNRS, Université d'Artois, rue Jean Souvraz, 62307 Lens Cedex, France; benferhat@cril.univ-artois.fr).