IRISA PUBLICATION INTERNE: VIRTUAL PRECEDENCE IN ASYNCHRONOUS SYSTEMS: CONCEPT AND APPLICATIONS


IRISA, INSTITUT DE RECHERCHE EN INFORMATIQUE ET SYSTÈMES ALÉATOIRES
Publication Interne No 079
VIRTUAL PRECEDENCE IN ASYNCHRONOUS SYSTEMS: CONCEPT AND APPLICATIONS
JEAN-MICHEL HÉLARY, ACHOUR MOSTEFAOUI, MICHEL RAYNAL
ISSN
IRISA, CAMPUS UNIVERSITAIRE DE BEAULIEU, RENNES CEDEX, FRANCE

CENTRE NATIONAL DE LA RECHERCHE SCIENTIFIQUE
Tél. : (33) Fax : (33)

Virtual Precedence in Asynchronous Systems: Concept and Applications

Jean-Michel Hélary, Achour Mostefaoui, Michel Raynal

Theme: Networks and Systems. Project Adp. Publication interne n079, February.

Abstract: This paper introduces the Virtual Precedence (VP) property. An interval-based abstraction of a computation satisfies the VP property if it is possible to timestamp its intervals in a consistent way (i.e., time does not decrease inside a process and increases after communication). A very general protocol $\mathcal{P}$ that builds abstractions satisfying the VP property is proposed. It is shown that the VP property encompasses logical clock systems and communication-induced checkpointing protocols. A new and efficient protocol which ensures that no local checkpoint is useless is derived from $\mathcal{P}$. This protocol compares very favorably with existing protocols that solve the same problem. This shows that, due to the generality of its approach, a theory (here, VP) can give efficient solutions to practical problems (here, the prevention of useless checkpoints).

Key-words: Causality, Partial Order, Computation Abstraction, Virtual Precedence, Timestamping, Logical Clocks, Checkpointing.

(French abstract: please turn over)

{helary, mostefaoui, raynal}@irisa.fr

Centre National de la Recherche Scientifique (URA 227), Université de Rennes, Insa de Rennes, Institut National de Recherche en Informatique et en Automatique (unité de recherche de Rennes)

Virtual precedence in asynchronous systems: theory and applications

Résumé (translated from French): This paper introduces the notion of virtual precedence (VP). An abstraction of a distributed computation based on intervals (sets of consecutive events) satisfies the virtual precedence property if a consistent timestamping of these intervals can be exhibited (timestamps grow within a process and along communications). A general protocol that builds abstractions satisfying VP is proposed and proved. It is shown that logical clock mechanisms are instantiations of it. Moreover, a generic protocol for defining checkpoints is derived from it. From this protocol, several previously proposed protocols are derived, as well as a new protocol that turns out to be better than the previous ones.

Keywords: Causality, partial order, computation abstraction, virtual precedence, timestamping, logical clocks, recovery points

1 Introduction

Context, Statement of the Problem and Results. Without entering into a philosophical debate, and roughly speaking, causality means first that the future can influence neither the past nor the present, and second that the present is determined by the past. Causality plays a fundamental role when designing or analyzing distributed computations [0]. Many causality-based protocols have been designed to solve specific problems encountered in distributed systems (resource allocation, detection of stable and unstable properties, checkpointing, etc.).

A distributed computation is usually modeled as a partially ordered set of (send, deliver and internal) events [3]. In this paper we consider a higher observation level in which each process of a distributed computation is perceived as a sequence of intervals, an interval being a set of consecutive events produced by a process. Intervals are defined by an abstraction of the computation. Moreover, an abstraction associates with each computation a graph, called the A-graph.

This paper addresses one of the many aspects of causality when considering an observation level defined by intervals; we call it Virtual Precedence. A main question is then: "Given an abstraction $A$ of a distributed computation, does $A$ allow a consistent timestamping of intervals?" (Intuitively, "consistent" means that the "time" must not decrease inside a process and must eventually increase after communication.)

The first two sections of the paper introduce the Virtual Precedence concept. Section 2 defines intervals and abstractions, and Section 3.1 provides a formal definition of Virtual Precedence. Section 3.2 gives a characterization of Virtual Precedence in terms of A-graphs. Then Section 3.3 presents a set of properties that are sufficient for a timestamping mechanism to ensure that interval timestamps satisfy Virtual Precedence. Finally, Section 3.4 defines a very general protocol $\mathcal{P}$ that, given a distributed computation, builds abstractions satisfying Virtual Precedence.

Where these Results are Useful. Virtual Precedence has many application domains (e.g., observation and monitoring of distributed applications). Section 4 investigates two of them.

Better understanding of the deep structure of existing protocols. Section 4.1 illustrates this point with timestamping protocols. When considering the abstraction $A_0$ that associates an interval with each event, the resulting A-graph associated with a computation reduces to Lamport's partial order, and protocol $\mathcal{P}$ can be instantiated to obtain classical timestamping protocols [2, 3, 5]. So, the protocol $\mathcal{P}$ provides a general framework that gives a deeper insight into a family of basic distributed protocols.

Design of new protocols. This point (and again the previous one) is illustrated in Section 4.2 with checkpointing protocols. Uncoordinated checkpointing protocols are prone to the domino effect. In order to prevent this undesirable effect, these protocols can be augmented with a communication-induced checkpointing protocol. Upon the occurrence of a communication event, and according to some condition, such a protocol may force a process to take an additional checkpoint so that no local checkpoint is useless¹. Checkpointing protocols belonging to this family have already been published in the literature [, 6, ]. Section 4.2 shows that they are particular instances of the general protocol $\mathcal{P}$. Moreover, a new communication-induced checkpointing protocol, more efficient than the other protocols of this family, is obtained as a particular instance of $\mathcal{P}$. This demonstrates that a theoretical approach can have a positive impact on solutions to practical problems.

¹ Useless local checkpoints are the cause of the domino effect.

2 Abstractions of a Distributed Computation

2.1 Distributed Computations

A distributed program consists of a collection of sequential processes, denoted $P_1, P_2, \ldots, P_n$ ($n > 1$), that can communicate only by exchanging messages on communication channels. Processes have access neither to a shared memory nor to a global clock. Communication delays are arbitrary. A process can execute internal, send² and delivery statements. An internal statement does not involve communication. When a process $P_i$ executes the statement "send($m$, $P_j$)", it puts the message $m$ into the channel from $P_i$ to $P_j$. When $P_i$ executes the statement "deliver($m$)", it is blocked until at least one message directed to $P_i$ has arrived; then a message is withdrawn from one of its input channels and delivered to $P_i$. Executions of internal, send and delivery statements are modeled by internal, send and delivery events.

[Figure 1: A Distributed Computation (space-time diagram of processes $P_1$, $P_2$, $P_3$)]

Execution of a process $P_i$ produces a sequence of events $h_i = e_{i,1} \ldots e_{i,z}\, e_{i,z+1} \ldots$. This sequence is called the history of $P_i$; it can be finite or infinite. Events of $h_i$ are enumerated according to the total order in which they are produced by $P_i$. Let $H$ be the set of all events produced by the set of processes. A distributed computation is modeled by the partially ordered set $\widehat{H} = (H, \xrightarrow{hb})$, where $\xrightarrow{hb}$ denotes the well-known Lamport happened-before relation [3]. As they are not relevant from the point of view of process interaction, internal events are no longer considered in the rest of the paper. Figure 1 depicts a distributed computation $\widehat{H}$ in the usual space-time diagram.

2.2 Abstraction of a Distributed Computation

Intervals. An interval of process $P_i$ is a set of consecutive events of $h_i$. An interval-based abstraction $A$ partitions each process history $h_i$ into a sequence of intervals $I_{i,1} \ldots I_{i,x}\, I_{i,x+1} \ldots$. More precisely:

$\forall i : \forall e_{i,z} \in h_i : \exists x : e_{i,z} \in I_{i,x}$
$\forall i : \forall x > 0 : |I_{i,x}| \geq 1$
$\forall i : e_{i,z} \in I_{i,x} \wedge e_{i,t} \in I_{i,y} \wedge x < y \Rightarrow z < t$

Figures 2.a.1, 2.a.2 and 2.a.3 depict three abstractions $A_1$, $A_2$ and $A_3$ of the computation $\widehat{H}$ described in Figure 1. A rectangular box represents an interval.

A-graph. An abstraction $A$ of a distributed computation $\widehat{H}$ associates a directed graph (called the A-graph) with this computation. This A-graph is defined in the following way:

Vertices: the set of all intervals $I_{i,x}$.
Edges: there is an edge $I_{j,y} \rightarrow I_{i,x}$ if:
- either $j = i$ and $y = x - 1$ (such an edge is called a local edge),

² We suppose that a process does not send messages to itself.

- or there is a message $m$ such that send($m$) $\in I_{j,y}$ and deliver($m$) $\in I_{i,x}$ (such an edge is called a communication edge).

[Figure 2: Three Abstractions of the Same Computation. Panels a.1, a.2 and a.3 show abstractions $A_1$, $A_2$ and $A_3$ over processes $P_1$, $P_2$, $P_3$, each with intervals $I_{i,1}$, $I_{i,2}$, $I_{i,3}$; panels b.1, b.2 and b.3 show the corresponding A-graphs.]

Figures 2.b.1, 2.b.2 and 2.b.3 give the A-graphs defined by abstractions $A_1$, $A_2$ and $A_3$ of computation $\widehat{H}$. It is important to note that an A-graph may have cycles; this depends on the corresponding abstraction. Moreover, abstractions of different computations can produce the same A-graph. Let us also note that the A-graph produced by the trivial abstraction $A_0$, in which every interval corresponds exactly to a single event (actually, there is no abstraction), is Lamport's partial order associated with the computation.

Notations. In the rest of the paper we use the following notations.

- $(S, \preceq)$ denotes a lattice with an infinite number of elements. $\bot$ and $\top$ denote the least element and the greatest element of $(S, \preceq)$, respectively.
- $a \prec b$ means $(a \preceq b) \wedge (a \neq b)$.
- $[a, b]$ denotes the sub-lattice of $(S, \preceq)$ whose least and greatest elements are $a$ and $b$, respectively. $[a, b[$ denotes the sub-lattice of $(S, \preceq)$ including all elements $x$ such that $a \preceq x \prec b$.
- Let $X$ be a subset of $(S, \preceq)$. $glb(X)$ denotes the greatest lower bound of $X$, and $lub(X)$ denotes the least upper bound of $X$.
- For any $a \neq \bot, \top$: $a + \varepsilon$ denotes a value such that $(a \prec a + \varepsilon) \wedge (a + \varepsilon \neq \top)$. By convention, $\bot + \varepsilon = \bot$ and $\top + \varepsilon = \top$.
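The A-graph definition of Section 2.2 can be illustrated by a short Python sketch. The representation is an assumption made only for this example: an interval $I_{i,x}$ is the pair `(i, x)`, and each message is given directly as the pair of intervals in which it is sent and delivered. The function name `build_a_graph` is hypothetical.

```python
def build_a_graph(num_intervals, messages):
    """Build the edge set of an A-graph (illustrative sketch).

    num_intervals: dict mapping a process id i to its number of intervals,
                   numbered 1..k (assumed representation).
    messages: iterable of ((j, y), (i, x)) pairs, meaning that some message
              is sent in I_{j,y} and delivered in I_{i,x}.
    Returns a set of labelled edges (source, destination, kind).
    """
    edges = set()
    # Local edges: I_{i,x-1} -> I_{i,x} for consecutive intervals of a process.
    for i, k in num_intervals.items():
        for x in range(2, k + 1):
            edges.add(((i, x - 1), (i, x), "local"))
    # Communication edges: one per (sending interval, delivering interval) pair.
    for src, dst in messages:
        if src[0] == dst[0]:
            raise ValueError("a process does not send messages to itself")
        edges.add((src, dst, "comm"))
    return edges


# Two processes with two intervals each; one message sent in I_{1,1}
# and delivered in I_{2,2}.
example_edges = build_a_graph({1: 2, 2: 2}, [((1, 1), (2, 2))])
```

Several messages between the same two intervals collapse into a single communication edge, which matches the definition: an edge exists as soon as at least one such message exists.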

3 Virtual Precedence

3.1 Underlying Idea

Consider an abstraction $A$ of a distributed computation $\widehat{H}$. Intuitively, $A$ satisfies the virtual precedence (VP) property if it is possible to associate with each interval a value belonging to $S$, called its timestamp, in such a way that the ordering on intervals induced by their timestamps is compatible with the A-graph relation. For example, when considering the trivial abstraction $A_0$, Lamport's scalar clocks [3] or Fidge-Mattern's vector clocks [2, 5] define a timestamping of events and messages that is consistent with the causality relation (i.e., with the $A_0$-graph relation).

For a more sophisticated abstraction $A$, the VP property ensures that there exists a mechanism that timestamps intervals and messages in such a way that, if we re-order all the communication events of a given interval $I_{i,x}$ according to the timestamps of their messages, then (1) all deliver events precede all send events and (2) the timestamp $f_{i,x}$ of $I_{i,x}$ is greater than or equal to the timestamps of delivered messages and lower than or equal to the timestamps of sent messages. This is illustrated in Figure 3 (where $S = \mathbb{N}$, the set of natural integers, an arrow denoting a message and the associated integer denoting its timestamp) and explains the name "Virtual Precedence". VP can be seen as a consistent re-ordering of communication events.

[Figure 3: What Is Virtual Precedence. Panel a: an interval $I_{i,x}$ of $P_i$; panel b: $I_{i,x}$ as seen by VP, with timestamp $f_{i,x}$.]

3.2 Definition and Characterization

Definition 3.1 An abstraction $A$ of a distributed computation $\widehat{H}$ satisfies the VP property if there exists a function $f$ from the vertices of the A-graph into a lattice $(S, \preceq)$ such that ($f_{i,x}$ denotes the value of $f(I_{i,x})$):

(F1) $I_{j,y} \rightarrow I_{i,x} \Rightarrow f_{j,y} \preceq f_{i,x}$
(F2) $j \neq i \wedge I_{j,y} \rightarrow I_{i,x} \wedge \exists I_{i,x+1} \Rightarrow f_{j,y} \prec f_{i,x+1}$

Such a function $f$ is called a timestamping function.

(F1) indicates that timestamps cannot decrease along a path of the A-graph, while (F2) indicates that they must increase after communication³. Although we limit our study of the VP property to the message-passing computational model, the reader can see that the previous definition and its characterization given below are not limited to this model: an edge from an interval $I_{j,y}$ to another interval $I_{i,x}$ could also represent a write-read relation, a request-response relation, etc.

The following theorem gives a characterization of the VP property in terms of the A-graph defined by an abstraction $A$ (this characterization will be used in Section 4).

Theorem 3.1 An abstraction $A$ of a distributed computation $\widehat{H}$ satisfies the VP property iff the corresponding A-graph has no cycle including a local edge.

³ (F1) can be seen as a safety requirement, while (F2) can be seen as a liveness requirement.
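The characterization of Theorem 3.1 suggests a direct test on a finite A-graph: a local edge $I_{i,x} \rightarrow I_{i,x+1}$ lies on a cycle exactly when $I_{i,x}$ is reachable from $I_{i,x+1}$. The following Python sketch applies this observation; the edge representation (pairs of hashable vertices) and the names `reachable` and `satisfies_vp` are assumptions made for illustration, not part of the paper.

```python
def reachable(adj, start, goal):
    """Depth-first search: is `goal` reachable from `start` in graph `adj`?"""
    stack, seen = [start], {start}
    while stack:
        v = stack.pop()
        if v == goal:
            return True
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return False


def satisfies_vp(local_edges, comm_edges):
    """Return True iff no cycle of the graph includes a local edge.

    local_edges, comm_edges: lists of (source, destination) pairs.
    """
    adj = {}
    for u, v in list(local_edges) + list(comm_edges):
        adj.setdefault(u, []).append(v)
    # A local edge u -> v is on a cycle iff u is reachable from v.
    return not any(reachable(adj, v, u) for u, v in local_edges)


# Cycle I_{1,1} -> I_{1,2} -> I_{2,1} -> I_{1,1} includes a local edge.
bad = satisfies_vp([((1, 1), (1, 2))],
                   [((1, 2), (2, 1)), ((2, 1), (1, 1))])
# A cycle made only of communication edges is allowed.
ok = satisfies_vp([((1, 1), (1, 2))],
                  [((1, 1), (2, 1)), ((2, 1), (1, 1))])
```

This check runs one search per local edge, which is enough for small examples; a strongly-connected-component pass would give the same answer in a single traversal.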

Proof

Sufficiency. Let $f$ be a timestamping function of the A-graph and suppose that the A-graph has a cycle including a local edge $I_{i,x} \rightarrow I_{i,x+1}$. So, there is a path from $I_{i,x+1}$ to $I_{i,x}$. Since there is no local edge from $I_{i,x+1}$ to $I_{i,s}$ with $s \leq x$, this path has communication edges; let $I_{j,y} \rightarrow I_{i,s}$, $s \leq x$, be the last of these communication edges. By (F1) along the path we have $f_{i,x+1} \preceq f_{j,y} \preceq f_{i,s}$, and by (F2) and (F1) we have $f_{j,y} \prec f_{i,s+1} \preceq f_{i,x+1}$. Hence $f_{i,x+1} \prec f_{i,x+1}$. A contradiction.

Necessity. Suppose the A-graph has no cycle including a local edge. Let $C(I_{i,x})$ denote the strongly connected component⁴ to which vertex $I_{i,x}$ belongs. Consider the reduced graph $A^R$, the vertices of which are the strongly connected components of the A-graph and where an edge $C \rightarrow C'$ exists if and only if $C \neq C'$ and $\exists I_{i,x} \in C, \exists I_{j,y} \in C' : I_{i,x} \rightarrow I_{j,y}$. As $A^R$ is acyclic, there exists a function $f^R$ with values in a lattice $(S, \preceq)$ such that $C \rightarrow C' \Rightarrow f^R(C) \prec f^R(C')$. Define the function $f$ from the vertices of the A-graph into $S$ in the following way: $\forall I_{i,x} : f_{i,x} = f^R(C(I_{i,x}))$.

$f$ satisfies (F1). Consider any edge $I_{j,y} \rightarrow I_{i,x}$.
- If $I_{j,y} \rightarrow I_{i,x}$ belongs to a cycle, then $C(I_{j,y}) = C(I_{i,x})$ and consequently $f_{j,y} = f_{i,x}$, from which (F1) follows.
- If $I_{j,y} \rightarrow I_{i,x}$ does not belong to a cycle, then $C(I_{j,y}) \rightarrow C(I_{i,x})$ and consequently $f_{j,y} \prec f_{i,x}$, from which (F1) follows.

$f$ satisfies (F2). As by assumption no local edge $I_{i,x} \rightarrow I_{i,x+1}$ belongs to a cycle, we have $f^R(C(I_{i,x})) \prec f^R(C(I_{i,x+1}))$ and consequently $f_{i,x} \prec f_{i,x+1}$. If, additionally, there is a communication edge $I_{j,y} \rightarrow I_{i,x}$, we have $f_{j,y} \preceq f_{i,x}$ (as (F1) is satisfied). So $f_{j,y} \prec f_{i,x+1}$, from which (F2) follows.

□ (Theorem 3.1)

3.3 How to Timestamp Messages to Ensure the VP Property

This section displays properties that are sufficient for a timestamping mechanism to implement abstractions satisfying the VP property. This mechanism associates a timestamp $m.t \in S$ with each message $m$. Let $min\_sent_{i,x}$ and $max\_rec_{i,x}$ be the two following values:

$min\_sent_{i,x} = glb(\{m.t \mid send(m) \in I_{i,x}\})$ if $\{m \mid send(m) \in I_{i,x}\} \neq \emptyset$, $\top$ otherwise.
$max\_rec_{i,x} = lub(\{m.t \mid deliver(m) \in I_{i,x}\})$ if $\{m \mid deliver(m) \in I_{i,x}\} \neq \emptyset$, $\bot$ otherwise.

Let us consider the four following predicates:

(P1) $max\_rec_{i,x} \preceq f_{i,x}$
(P2) $f_{i,x} \preceq min\_sent_{i,x}$
(P3) $f_{i,x} \preceq f_{i,x+1}$
(P4) $max\_rec_{i,x} \prec f_{i,x+1}$

Theorem 3.2 Let $A$ be an abstraction of $\widehat{H}$. $I_{i,x}$ denotes an interval defined by $A$, and $f_{i,x}$ denotes its timestamp.
i) Let $F$ be a protocol that associates a timestamp with each interval defined by $A$ and with each message. If $F$ satisfies predicates (P1) to (P4), then $A$ satisfies the VP property.
ii) If $A$ satisfies the VP property, then there exists a protocol $F$ that timestamps intervals and messages in such a way that (P1)-(P4) are satisfied.

Proof

i). If $i \neq j$ and $I_{j,y} \rightarrow I_{i,x}$, then there is a message $m$ sent in $I_{j,y}$ and delivered in $I_{i,x}$. It follows from the definitions of $min\_sent_{j,y}$ and $max\_rec_{i,x}$ that: $min\_sent_{j,y} \preceq m.t \preceq max\_rec_{i,x}$ (1).

⁴ Two vertices $x$ and $y$ belong to the same strongly connected component if and only if $x = y$ or $x$ and $y$ belong to the same cycle.

(F1) is satisfied. If $I_{j,y}$ is $I_{i,x-1}$, then (F1) follows from (P3). If $j \neq i$ then, by applying (P2) to $I_{j,y}$, then (1), and then (P1) to $I_{i,x}$, we get: $f_{j,y} \preceq min\_sent_{j,y} \preceq max\_rec_{i,x} \preceq f_{i,x}$.

(F2) is satisfied. If $j \neq i \wedge I_{j,y} \rightarrow I_{i,x} \wedge \exists I_{i,x+1}$, then (F2) follows from the previous line ($f_{j,y} \preceq max\_rec_{i,x}$) and (P4).

ii). Let $f$ be a function that satisfies (F1) and (F2). For every $I_{j,y}$ and any message $m$ sent in $I_{j,y}$, let $m.t = f_{j,y}$. (P3) follows from (F1) by taking $i = j$. (P2) follows from the definition of timestamp values. If $I_{j,y} \rightarrow I_{i,x}$ and $i \neq j$, there is a message $m$ sent in $I_{j,y}$ and delivered in $I_{i,x}$ and we get:
- from the definition of $m.t$ and (F1): $m.t = f_{j,y} \preceq f_{i,x}$, from which, by considering all messages $m$ received in $I_{i,x}$, (P1) follows;
- from the definition of $m.t$ and (F2): $m.t = f_{j,y} \prec f_{i,x+1}$, from which, by considering all messages $m$ received in $I_{i,x}$, (P4) follows.

□ (Theorem 3.2)

3.4 A General Protocol Ensuring the VP Property

The following protocol $\mathcal{P}$, executed by each process $P_i$, constructs an abstraction of a distributed computation $\widehat{H}$ that satisfies the VP property, by associating timestamps with messages and intervals. Each interval is timestamped when it terminates⁵. The protocol $\mathcal{P}$, described in Figure 4, is based on Theorem 3.2, i.e., it maintains the predicates (P1)-(P4) invariant. In order to be as general as possible, some assignments of variables and parameters state only a subset from which the value to assign must be extracted. A particular rule to select a value from the corresponding subset gives a particular instance of $\mathcal{P}$ (see Section 4).

When $P_i$ executes new_interval (line 16), it determines a timestamp $f_{i,x}$ for the terminating interval $I_{i,x}$ (lines 17-18). To satisfy (P1), (P3) and (P4), we must have $lub(\{f_{i,x-1}, max\_rec_{i,x-1} + \varepsilon, max\_rec_{i,x}\}) \preceq f_{i,x}$. Moreover, (P2) is $f_{i,x} \preceq min\_sent_{i,x}$. Thus, we must have:

(P) $lub(\{f_{i,x-1}, max\_rec_{i,x-1} + \varepsilon, max\_rec_{i,x}\}) \preceq min\_sent_{i,x}$

This is achieved by introducing three control variables, namely $MIN\_SENT_i$, $MAX\_REC_i$ and $T_i$, and by managing them appropriately.

Meaning of variables

$MIN\_SENT_i$ and $MAX\_REC_i$. These variables are used to compute $min\_sent_{i,x}$ (line 3) and $max\_rec_{i,x}$ (line 10), respectively. According to their definitions, they are initialized to $\top$ and $\bot$, respectively, each time a new interval starts (lines 20 and 21).

⁵ It is also possible to design a protocol $\mathcal{Q}$ that timestamps each interval when it starts. With such a protocol $\mathcal{Q}$, the set of timestamps that can be associated with a given interval is smaller than the one allowed by $\mathcal{P}$. This is due to the fact that $\mathcal{P}$ has more information when it determines the interval's timestamp and so is less conservative.

$T_i$. This variable measures the progression of $P_i$. It is initialized to an arbitrary value distinct from $\bot$ and $\top$.

In order for the previous predicate (P) to be satisfied, the protocol keeps the following relation invariant:

(R) $\forall i, \forall x$, at any point of interval $I_{i,x}$: $lub(\{f_{i,x-1}, max\_rec_{i,x-1} + \varepsilon, MAX\_REC_i\}) \preceq T_i \preceq MIN\_SENT_i$.

Using the notation $\phi_{i,x-1} = lub(\{f_{i,x-1}, max\_rec_{i,x-1} + \varepsilon\})$, it can be rewritten more concisely as follows (note that $\phi_{i,x-1}$ is a constant that $P_i$ can use when executing $I_{i,x}$):

(R) $\forall i, \forall x$, at any point of interval $I_{i,x}$: $lub(\{\phi_{i,x-1}, MAX\_REC_i\}) \preceq T_i \preceq MIN\_SENT_i$.

Note that (R) is initially true by assuming $f_{i,0}$ and $max\_rec_{i,0}$ equal to $\bot$ (i.e., $\phi_{i,0} = \bot$).

Management of these variables

When, after the termination of $I_{i,x}$, $P_i$ starts a new interval $I_{i,x+1}$ (lines 17-22), it sets $T_i$ to a value belonging to the interval $[lub(\{f_{i,x}, MAX\_REC_i + \varepsilon\}), \top[ \; = [\phi_{i,x}, \top[$ (line 19). So, when $I_{i,x+1}$ starts (i.e., after the reinitialization of $MIN\_SENT_i$ to $\top$ (line 20) and $MAX\_REC_i$ to $\bot$ (line 21)), we have $lub(\{\phi_{i,x}, MAX\_REC_i\}) = \phi_{i,x} \preceq T_i \preceq MIN\_SENT_i$, i.e., (R) is satisfied.

When $P_i$ sends a message $m$, this message is timestamped with a value $m.t$ belonging to $[T_i, \top[$ (line 2). This ensures that $T_i \preceq MIN\_SENT_i$ during interval $I_{i,x}$, and consequently $T_i \preceq min\_sent_{i,x}$ at the end of interval $I_{i,x}$.

The core of the protocol lies in message reception. When $m$, timestamped $m.t$, arrives at $P_i$, the protocol forces $P_i$ to terminate the current interval $I_{i,x}$ if the relation $max\_rec_{i,x} \preceq min\_sent_{i,x}$ (i.e., (P1) and (P2)) is about to be violated (lines 7-9). This relation (perceived as $\neg(m.t \preceq MIN\_SENT_i)$) is the limit beyond which VP is no longer satisfied. Let us note that the parameters $param_1$ and $param_2$ allow the protocol to force termination of intervals even if the previous relation is not about to be violated: the number of forced interval terminations depends on the values selected for $param_1$ and $param_2$. The least constraining case is obtained when the extreme values $param_1 = m.t$ and $param_2 = MIN\_SENT_i$ are chosen. On the contrary, the extreme values $param_1 = \top$ and $param_2 = \bot$ force a new interval to be started before each message delivery.

Finally, when $m$ is delivered, $MAX\_REC_i$ is updated (line 10); $T_i$ is also updated to the value $lub(\{T_i, MAX\_REC_i\})$ (line 11). This update keeps (R) invariant, as shown by the analysis of the two possible cases:

1. A new interval $I_{i,x+1}$ has been started at line 9. As shown above, before line 10 we have $\phi_{i,x} \preceq T_i$ and thus, before line 11, we have $lub(\{\phi_{i,x}, MAX\_REC_i\}) \preceq lub(\{T_i, MAX\_REC_i\})$. It follows that, after line 11, we have $lub(\{\phi_{i,x}, MAX\_REC_i\}) \preceq T_i \preceq MIN\_SENT_i = \top$.

2. A new interval $I_{i,x+1}$ is not started at line 9. Thus, before line 10 we have $param_1 \preceq param_2$, which implies that $m.t \preceq MIN\_SENT_i$. Also, before line 10, (R) holds, i.e., $lub(\{\phi_{i,x-1}, MAX\_REC_i\}) \preceq T_i \preceq MIN\_SENT_i$; in particular, $\phi_{i,x-1} \preceq T_i$. As $\phi_{i,x-1}$ is constant during $I_{i,x}$ and $T_i$ can only increase, it follows that $\phi_{i,x-1} \preceq T_i$ remains true after line 10. Moreover, after line 11 we have $MAX\_REC_i \preceq T_i$ and $lub(\{\phi_{i,x-1}, MAX\_REC_i\}) \preceq T_i$. Similarly, before line 10 we have $MAX\_REC_i \preceq MIN\_SENT_i$ and thus $lub(\{MAX\_REC_i, m.t\}) \preceq MIN\_SENT_i$, which becomes, after line 10, $MAX\_REC_i \preceq MIN\_SENT_i$. But, before line 11 we have $lub(\{MAX\_REC_i, T_i\}) \preceq MIN\_SENT_i$, and so, after line 11, we have $T_i \preceq MIN\_SENT_i$. Thus, (R) remains true after line 11.

To ensure that any interval $I_{i,x}$ ($x > 0$) includes at least one communication event, the boolean variable New_Interval_Enabled (initialized to false and updated at lines 5, 13 and 22) is used. If $P_i$ has produced events since the end of the previous interval $I_{i,x-1}$, it can terminate $I_{i,x}$ and start $I_{i,x+1}$

(line 16). $P_i$'s desire to start a new interval is not defined by $\mathcal{P}$; it is the overlying application that triggers line 14 according to the problem it solves.

(1)  When $P_i$ sends $m$ to $P_j$
     begin
(2)    let $m.t$ be a value $\in [T_i, \top[$;
(3)    $MIN\_SENT_i := glb(\{MIN\_SENT_i, m.t\})$;
(4)    send($m$, $m.t$) to $P_j$;
(5)    New_Interval_Enabled := true;
     end
(6)  When $P_i$ receives ($m$, $m.t$)
     begin
(7)    let $param_1$ be a value $\in [m.t, \top]$;
(8)    let $param_2$ be a value $\in [\bot, MIN\_SENT_i]$;
(9)    if $\neg(param_1 \preceq param_2)$ then new_interval endif;
(10)   $MAX\_REC_i := lub(\{MAX\_REC_i, m.t\})$;
(11)   $T_i := lub(\{T_i, MAX\_REC_i\})$;
(12)   deliver $m$;
(13)   New_Interval_Enabled := true;
     end
(14) When $P_i$ desires to start an interval
(15)   if New_Interval_Enabled then new_interval endif;
(16) Procedure new_interval is
     begin
(17)   let $f_{i,x}$ be a value $\in [T_i, \top[ \;\cap\; ]\bot, MIN\_SENT_i]$;
(18)   % $f_{i,x}$ is the timestamp of the terminated interval $I_{i,x}$. A new interval $I_{i,x+1}$ is started %
(19)   $T_i :=$ a value $\in [lub(\{f_{i,x}, MAX\_REC_i + \varepsilon\}), \top[$;
(20)   $MIN\_SENT_i := \top$;
(21)   $MAX\_REC_i := \bot$;
(22)   New_Interval_Enabled := false;
     end

Figure 4: A General Protocol $\mathcal{P}$ that Ensures the VP Property

4 Applications

By appropriately defining the lattice $(S, \preceq)$, the condition ruling $P_i$'s desire to start a new interval, and the rules used to select values for $m.t$, $param_1$, $param_2$ and $f_{i,x}$, we get particular instances of the protocol $\mathcal{P}$. As indicated in the Introduction, we consider two domains of application: the derivation of timestamping protocols from $\mathcal{P}$ and the design of communication-induced checkpointing protocols. In the following, we consider the two following particular lattices:

1. $\mathcal{N}$, such that $S = \mathbb{N} \cup \{\bot\} \cup \{\top\}$ with $\bot = -\infty$, $\top = +\infty$, and $\preceq$ is the usual order $\leq$.
2. $\mathcal{N}^n$, such that $S = \mathbb{N}^n \cup \{\bot\} \cup \{\top\}$ with $\bot = (-\infty, \ldots, -\infty)$, $\top = (+\infty, \ldots, +\infty)$, and $V \preceq V' \Leftrightarrow (\forall i : V[i] \leq V'[i])$. In this lattice, $1_i$ denotes the $n$-dimensional vector defined as $1_i[i] = 1$ and $\forall j \neq i : 1_i[j] = 0$.
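The vector lattice $\mathcal{N}^n$ just introduced can be sketched in a few lines of Python. This is only an illustration under stated assumptions: vectors are tuples, $\bot$ and $\top$ are encoded with `-math.inf` and `math.inf` in every component, and the helper names (`leq`, `lub`, `glb`, `unit`, `plus`) are hypothetical. The encoding also admits mixed vectors such as `(-inf, 3)` that are not elements of $S$; the sketch does not guard against them.

```python
import math


def bottom(n):
    """Least element of the vector lattice (assumed encoding)."""
    return (-math.inf,) * n


def top(n):
    """Greatest element of the vector lattice (assumed encoding)."""
    return (math.inf,) * n


def leq(v, w):
    """Component-wise order: V <= V' iff V[i] <= V'[i] for every i."""
    return all(a <= b for a, b in zip(v, w))


def lub(*vectors):
    """Least upper bound: component-wise maximum."""
    return tuple(max(c) for c in zip(*vectors))


def glb(*vectors):
    """Greatest lower bound: component-wise minimum."""
    return tuple(min(c) for c in zip(*vectors))


def unit(n, i):
    """The vector 1_i: 1 at index i (0-based here), 0 elsewhere."""
    return tuple(1 if k == i else 0 for k in range(n))


def plus(v, eps):
    """V + eps, component-wise; bottom and top are absorbing."""
    return tuple(a + b for a, b in zip(v, eps))
```

With this encoding the conventions $\bot + \varepsilon = \bot$ and $\top + \varepsilon = \top$ hold automatically, because adding a finite number to an infinite float leaves it unchanged. Note that two vectors such as `(1, 2)` and `(2, 1)` are incomparable, which is why the order is only partial.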

4.1 Logical Clocks

Let us consider the trivial abstraction $A_0$ (introduced in Section 3.1) in which each event constitutes a new interval⁶. As the A-graph associated by $A_0$ with any computation $\widehat{H}$ has no cycle, it follows that $A_0$ satisfies the VP property. This subsection shows that a family of logical clock protocols can be derived from a particular instance $\mathcal{P}'$ of the general protocol $\mathcal{P}$ described in Section 3.4.

Let the instance $\mathcal{P}'$ of $\mathcal{P}$ be defined by the following rules:

(S1) each $P_i$ associates an interval $I_{i,x}$ with each of its events $e$
(S2) at line 2, for $m.t \in [T_i, \top[$, select $m.t = T_i + \varepsilon$
(S3) at line 7, for $param_1 \in [m.t, \top]$, select $param_1 = m.t$
(S4) at line 8, for $param_2 \in [\bot, MIN\_SENT_i]$, select $param_2 = MIN\_SENT_i$
(S5) at line 17, for $f_{i,x} \in [T_i, \top[ \;\cap\; ]\bot, MIN\_SENT_i]$, select $f_{i,x} = T_i + \varepsilon$
(S6) at line 19, for $T_i \in [lub(\{f_{i,x}, MAX\_REC_i + \varepsilon\}), \top[$, do the assignment $T_i := lub(\{f_{i,x}, MAX\_REC_i + \varepsilon\})$

Due to (S1), it follows that $\mathcal{P}'$ can be rewritten from $\mathcal{P}$ by suppressing lines 14-15 and by replacing lines 5 and 13 by copies of lines 17-21. Consider now the two cases $e$ = send($m$) and $e$ = deliver($m$).

$I_{i,x} = \{deliver(m)\}$. As indicated, the behavior of $\mathcal{P}'$ is defined by lines 6-12 followed by a copy of lines 17-21. During this interval $I_{i,x}$, we have $MIN\_SENT_i = \top$. Moreover, due to (S3) and (S4), we have $param_1 = m.t$ and $param_2 = \top$, from which it follows that the test at line 9 is false at each message reception. Consequently, lines 7-9 can be suppressed from $\mathcal{P}'$. As $MAX\_REC_i$ takes a single value during $I_{i,x}$, namely $m.t$, this variable can also be suppressed and replaced by $m.t$. From (S5), it follows that at line 17 we get the following timestamp for $I_{i,x}$: $f_{i,x} = T_i + \varepsilon$, and, due to (S6), at line 19 we get $T_i := lub(\{f_{i,x}, m.t + \varepsilon\})$.

$I_{i,x} = \{send(m)\}$. In that case, the behavior of $\mathcal{P}'$ is defined by lines 1-4 followed by a copy of lines 17-21. During this interval $I_{i,x}$, we have $MAX\_REC_i = \bot$. As, due to (S2), $m.t = T_i + \varepsilon$, we get $MIN\_SENT_i = T_i + \varepsilon$ (line 3) and, due to (S5), at line 17 we select $f_{i,x} = T_i + \varepsilon$. It follows that $MIN\_SENT_i$ can be suppressed from $\mathcal{P}'$. Finally, at line 19, due to (S6) and to $MAX\_REC_i = \bot$, we get $T_i := m.t$ (i.e., $T_i + \varepsilon$).

The rules (S1)-(S6) make it possible to suppress the variables $MIN\_SENT_i$ and $MAX\_REC_i$. The resulting protocol $\mathcal{P}'$ is described in Figure 5. Then, according to the lattice $(S, \preceq)$ we select, a particular logical clock protocol is obtained. Choosing the lattice $\mathcal{N}$ with $\varepsilon = 1$ and $(\forall i)$ $T_i$ initialized to 0 results in the classical scalar clock protocol [3]. Choosing the lattice $\mathcal{N}^n$ with, for each process $P_i$, $\varepsilon = 1_i$ and $T_i$ initialized to $(0, \ldots, 0)$ results in the classical vector clock protocol [2, 5]. Other timestamping protocols [8] can be obtained in the same way.

4.2 Preventing Useless Checkpoints

A local checkpoint is a snapshot of a local state of a process, and a consistent global checkpoint is a set of local states, one from each process, such that no message sent by a process after its local checkpoint is received by another process before its local checkpoint. The computation of consistent global checkpoints is an important task when one is interested in designing or implementing systems that have to ensure dependability of the applications they run. Many protocols have been proposed to determine how to select local checkpoints in order to form consistent global checkpoints [].

⁶ As indicated in Section 2.1, we consider only communication events, namely send and deliver. The following instantiations of $\mathcal{P}$ can easily be adapted to include internal events.
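The logical clock family $\mathcal{P}'$ of Section 4.1 reduces to two rules: on a send, $m.t = T_i + \varepsilon$ and $T_i := m.t$; on a receive, $T_i := lub(\{T_i, m.t\}) + \varepsilon$. The following Python sketch is one possible rendering, parameterized by the lattice operations. The class name `ClockProcess` and the helpers `vec_lub` and `vec_plus` are hypothetical names chosen for this illustration.

```python
class ClockProcess:
    """One process of the clock family P' (illustrative sketch)."""

    def __init__(self, t0, eps, lub, plus):
        self.t = t0        # T_i
        self.eps = eps     # epsilon of this process
        self.lub = lub     # least upper bound of two lattice values
        self.plus = plus   # addition of epsilon to a lattice value

    def send(self):
        """Return the timestamp m.t piggybacked on the message."""
        ts = self.plus(self.t, self.eps)   # m.t = T_i + eps
        self.t = ts                        # T_i := m.t
        return ts

    def receive(self, ts):
        """Deliver a message timestamped ts and return the new T_i."""
        self.t = self.plus(self.lub(self.t, ts), self.eps)
        return self.t


def vec_lub(v, w):
    return tuple(max(a, b) for a, b in zip(v, w))


def vec_plus(v, e):
    return tuple(a + b for a, b in zip(v, e))


# Scalar clocks: lattice N with eps = 1 and T_i initialized to 0.
p, q = (ClockProcess(0, 1, max, lambda a, b: a + b) for _ in range(2))
s1 = p.send()
s2 = q.receive(s1)

# Vector clocks: lattice N^n with eps = 1_i and T_i = (0, ..., 0).
a = ClockProcess((0, 0), (1, 0), vec_lub, vec_plus)
b = ClockProcess((0, 0), (0, 1), vec_lub, vec_plus)
v1 = a.send()
v2 = b.receive(v1)
```

Passing the lattice operations as parameters mirrors the text: the protocol is the same, and only the choice of $(S, \preceq)$ and $\varepsilon$ distinguishes scalar clocks from vector clocks.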

When $P_i$ sends $m$ to $P_j$
begin
  let $m.t = T_i + \varepsilon$;
  send($m$, $m.t$) to $P_j$;
  $T_i := m.t$;
end

When $P_i$ receives ($m$, $m.t$)
begin
  deliver $m$;
  $T_i := lub(\{T_i, m.t\}) + \varepsilon$;
end

Figure 5: A Family $\mathcal{P}'$ of Logical Clock Protocols

Note that if local checkpoints are taken independently, there is a risk that no consistent global checkpoint can ever be formed from them (this is the well-known unbounded domino effect, which can occur during rollback-recovery [7]). A local checkpoint that does not belong to any consistent global checkpoint is called useless [6]. Useless checkpoints are the cause of the domino effect.

Let us consider an abstraction $A_{ckpt}$ that defines an interval as the set of events produced by a process between two successive local checkpoints. Let $C_{i,x}$ be the local checkpoint of $P_i$ that corresponds to the local state reached after the last event of $I_{i,x}$. With these correspondences, the A-graph defined by $A_{ckpt}$ reduces to the R-graph frequently encountered in the checkpointing literature [, ]. In such a context, the fundamental result on the occurrence of useless checkpoints, formalized and stated for the first time in [6], has been formulated in terms of the R-graph as:

A checkpoint $C_{i,x}$ is useless iff $C_{i,x+1} \rightarrow C_{i,x}$ [].

Since this situation exactly corresponds to the occurrence of a cycle including local edges in the R-graph, this can be restated in our approach as:

No local checkpoint is useless iff $A_{ckpt}$ satisfies the VP property.

To prevent useless checkpoints (and thus the domino effect), a kind of coordination in the determination of local checkpoints is required. In the approach called communication-induced checkpointing, processes select local checkpoints independently (basic checkpoints) and a protocol requires them to take additional local checkpoints (forced checkpoints) so that no checkpoint is useless; forced checkpoints are taken according to some predicate tested each time a message is received. Distinct definitions for this predicate give rise to distinct protocols (to our knowledge, [4] proposes the most efficient⁷ protocol of this class). We show that these protocols can be derived from particular instances of the general protocol $\mathcal{P}$ introduced in Section 3.4. Moreover, we derive from $\mathcal{P}$ a new communication-induced checkpointing protocol more efficient than all existing ones.

Instantiation $\mathcal{P}''$. The family $\mathcal{P}''$ of communication-induced checkpointing protocols is obtained from $\mathcal{P}$ by applying the following rules (the resulting instantiation is described in Figure 6):

(S1) each process $P_i$ triggers line 14 each time it takes a basic checkpoint
(S2) at line 2, for $m.t \in [T_i, \top[$, select $m.t = T_i$
(S3) at line 7, for $param_1 \in [m.t, \top]$, select $param_1 = m.t$
(S4) at line 8, for $param_2 \in [\bot, MIN\_SENT_i]$, select $param_2 = MIN\_SENT_i$
(S5) at line 17, for $f_{i,x} \in [T_i, \top[ \;\cap\; ]\bot, MIN\_SENT_i]$, select $f_{i,x} = T_i$
(S6) at line 19, for $T_i \in [lub(\{f_{i,x}, MAX\_REC_i + \varepsilon\}), \top[$, do the assignment $T_i := lub(\{f_{i,x}, MAX\_REC_i + \varepsilon\})$

The choice of a particular lattice $(S, \preceq)$ gives rise to a particular communication-induced checkpointing protocol. As before, we consider two choices.

⁷ The fewer forced local checkpoints are taken, the more efficient the protocol is.

When $P_i$ sends $m$ to $P_j$
begin
    $m.t := T_i$;
    $\mathit{MIN\_SENT}_i := \mathrm{glb}(\{\mathit{MIN\_SENT}_i, m.t\})$;
    send$(m, m.t)$ to $P_j$;
    $\mathit{New\_Checkpoint\_Enabled} := \mathit{true}$;
end

When $P_i$ receives $(m, m.t)$
begin
(9) if $\neg(m.t \leq \mathit{MIN\_SENT}_i)$ then new_checkpoint % forced checkpoint % endif;
    $\mathit{MAX\_REC}_i := \mathrm{lub}(\{\mathit{MAX\_REC}_i, m.t\})$;
    $T_i := \mathrm{lub}(\{T_i, \mathit{MAX\_REC}_i\})$;
    deliver $m$;
    $\mathit{New\_Checkpoint\_Enabled} := \mathit{true}$;
end

When $P_i$ desires to take a basic checkpoint
    if $\mathit{New\_Checkpoint\_Enabled}$ then take a new checkpoint $C_{i,x}$; new_checkpoint endif;

Procedure new_checkpoint is
begin
    $f_{i,x} := T_i$; % $f_{i,x}$ is the timestamp of $C_{i,x}$. A new interval $I_{i,x+1}$ is started %
    $T_i := \mathrm{lub}(\{f_{i,x}, \mathit{MAX\_REC}_i + \varepsilon\})$;
    $\mathit{MIN\_SENT}_i := \top$;
    $\mathit{MAX\_REC}_i := \bot$;
    $\mathit{New\_Checkpoint\_Enabled} := \mathit{false}$;
end

Figure 6: A Family $P''$ of Checkpointing Protocols

A new protocol (HMR). Taking the lattice $\mathbb{N}$ with $\varepsilon = 1$ and $(\forall i)$ $T_i$ initialized to 0 results in a new protocol which ensures that no checkpoint is useless. This protocol has a low cost, as it requires application messages to piggyback only one integer.

Wang's FDAS protocol. Taking the lattice $\mathbb{N}^n$ with $\varepsilon = 1_i$ and $(\forall i)$ $T_i$ initialized to $(0, \ldots, 0)$, we get the FDAS protocol introduced by Wang [11]. This protocol requires each message to carry a vector of integers of size $O(n)$. It is important to note that FDAS has been designed to ensure RD-Trackability [11], a property on local checkpoints stronger than the absence of useless checkpoints. If the aim is only that no local checkpoint be useless, then FDAS turns out to be less efficient than HMR. This is due to the following observation.

With $\mathbb{N}$ and $\varepsilon = 1$, the value of $m.t$ counts the number of local checkpoints belonging to the longest causal path ending at $send(m)$ [8]. With $\mathbb{N}^n$ and $\varepsilon = 1_i$ for $P_i$, $m.t[k]$ counts the number of local checkpoints taken by $P_k$ and belonging to the causal past of the event $send(m)$ [2, 5, 10].

Let us consider a message $m$ received by a process $P_i$.
When the lattice is $\mathbb{N}$ (and $\varepsilon = 1$), if $m.t > T_i$ (i.e., the test of line 9 is true and a forced checkpoint is taken), then we can conclude that at least one process $P_k$ has started an interval not previously known by $P_i$. In the same situation (that is, when at least one process $P_k$ has started an interval not previously known by $P_i$), when the lattice is $\mathbb{N}^n$ (and $\varepsilon = 1_i$ for $P_i$), we necessarily have $\exists k : m.t[k] > T_i[k]$, which implies that the test of line 9 is also true. It follows that each time the test of line 9 is true when taking $\mathbb{N}$ and $\varepsilon = 1$ (HMR protocol), it is also true when taking $\mathbb{N}^n$ and $\varepsilon = 1_i$ for each $P_i$

(FDAS protocol). The reader can observe that the converse is not necessarily true. It follows that HMR is more efficient.

Instantiation $P''_2$. Consider the following variant $P''_2$ of $P''$ in which (S3) is replaced by:

(S3') At line 7, for $\mathit{param}_1 \in [m.t, \top]$, select $\mathit{param}_1 = \top$.

In $P''_2$ the test of line 9, namely $\neg(\mathit{param}_1 \leq \mathit{param}_2)$, becomes $\neg(\top \leq \mathit{MIN\_SENT}_i)$. As $\mathit{MIN\_SENT}_i$ is initialized to $\top$ at the beginning of each interval, it follows that $P''_2$ ensures that in each interval no deliver event follows a send event.

Let us compare $P''$ and $P''_2$ for a given lattice $(S, \leq)$. Since $m.t \leq \top$ holds whenever a message $m$ is received, if the test at line 9 in $P''$, namely $\neg(m.t \leq \mathit{MIN\_SENT}_i)$, is satisfied$^{9}$, then the corresponding test in $P''_2$, namely $\neg(\top \leq \mathit{MIN\_SENT}_i)$, is also satisfied. So $P''$ is more efficient than $P''_2$. Note that $P''_2$ reduces to Russell's well-known checkpointing protocol [9].

Instantiation $P''_3$. Neither FDAS nor HMR forces a process $P_i$ to take an additional local checkpoint as long as $P_i$ has not sent a message since the beginning of the current interval. This is because $\mathit{MIN\_SENT}_i$ is initialized to $\top$ when an interval starts, so the test $\neg(m.t \leq \mathit{MIN\_SENT}_i)$ is false until a message is sent. More conservative protocols can be obtained by considering an instantiation $P''_3$ of $P$, similar to $P''$ but where (S4) is replaced by:

(S4') At line 8, for $\mathit{param}_2 \in [\bot, \mathit{MIN\_SENT}_i]$, select $\mathit{param}_2 = T_i$.

As $T_i \leq \mathit{MIN\_SENT}_i$ is always true, we have $\neg(m.t \leq \mathit{MIN\_SENT}_i) \Rightarrow \neg(m.t \leq T_i)$. It follows that, at line 9, (S4') induces more forced checkpoints than (S4). When choosing the lattice $\mathbb{N}$ with $\varepsilon = 1$ and $(\forall i)$ $T_i$ initialized to 0, we get the MS protocol proposed by Manivannan and Singhal [4]. As indicated previously, to our knowledge, this was the most efficient communication-induced checkpointing protocol avoiding useless checkpoints. As (S4') is more constraining than (S4), the new protocol HMR compares favorably with MS.
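To make the comparison of the three receive-time tests more tangible, the sketch below transcribes Figure 6 for the lattice $\mathbb{N}$ with $\varepsilon = 1$ and lets the test of line 9 be swapped. It is a minimal illustration under stated assumptions, not the authors' implementation: the class name `CicProcess`, the variant labels `"HMR"`, `"MS"` and `"RUSSELL"`, the `forced` counter, and the encoding of $\top$ as `math.inf` and $\bot$ as 0 are all hypothetical choices. FDAS is left out because it needs the vector lattice $\mathbb{N}^n$.

```python
import math

# Assumed encodings of top, bottom and epsilon on the lattice N.
TOP, BOTTOM, EPS = math.inf, 0, 1


class CicProcess:
    """One process running the Figure 6 protocol on N (illustrative sketch)."""

    def __init__(self, variant="HMR"):
        self.variant = variant
        self.t = 0                # T_i
        self.min_sent = TOP       # MIN_SENT_i
        self.max_rec = BOTTOM     # MAX_REC_i
        self.enabled = False      # New_Checkpoint_Enabled
        self.checkpoints = []     # timestamps f_{i,x} of the checkpoints taken
        self.forced = 0           # number of forced checkpoints (bookkeeping only)

    def send(self):
        """Return the timestamp m.t piggybacked on the outgoing message."""
        m_t = self.t
        self.min_sent = min(self.min_sent, m_t)
        self.enabled = True
        return m_t

    def _must_force(self, m_t):
        if self.variant == "HMR":      # (S3), (S4): not (m.t <= MIN_SENT_i)
            return not (m_t <= self.min_sent)
        if self.variant == "MS":       # (S4'): not (m.t <= T_i)
            return not (m_t <= self.t)
        if self.variant == "RUSSELL":  # (S3'): not (top <= MIN_SENT_i)
            return not (TOP <= self.min_sent)
        raise ValueError(f"unknown variant: {self.variant}")

    def receive(self, m_t):
        if self._must_force(m_t):      # test of line 9
            self.forced += 1
            self._new_checkpoint()
        self.max_rec = max(self.max_rec, m_t)
        self.t = max(self.t, self.max_rec)
        self.enabled = True

    def basic_checkpoint(self):
        if self.enabled:
            self._new_checkpoint()

    def _new_checkpoint(self):
        self.checkpoints.append(self.t)               # f_{i,x} := T_i
        self.t = max(self.t, self.max_rec + EPS)      # lub({f_{i,x}, MAX_REC_i + eps})
        self.min_sent = TOP
        self.max_rec = BOTTOM
        self.enabled = False
```

As a small usage scenario, let a fresh process receive a message stamped 3, then send a message, then receive a message stamped 2. With the `"HMR"` test no checkpoint is forced; with the `"MS"` test the first reception forces one (nothing has been sent yet, but $m.t > T_i$); with the `"RUSSELL"` test the second reception forces one (a delivery follows a send in the same interval).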
So the new HMR protocol, derived from a theoretical approach, turns out to be more efficient than the already known communication-induced checkpointing protocols whose aim is to prevent the occurrence of useless checkpoints.

5 Conclusion

This paper has introduced the Virtual Precedence (VP) property. An interval-based abstraction of a computation satisfies the VP property if it is possible to timestamp its intervals in a consistent way (i.e., time does not decrease inside a process and increases after communication). A very general protocol $P$ that builds abstractions satisfying the VP property has been proposed. It has been shown that the VP property encompasses logical clock systems and communication-induced checkpointing protocols. A new and efficient protocol which ensures that no local checkpoint is useless has been derived from $P$. This protocol compares very favorably with existing protocols that solve the same problem. This shows that, due to the generality of its approach, a theory (here, VP) can give efficient solutions to practical problems (here, the prevention of useless checkpoints).

Our current effort focuses on other applications of the VP theory, such as logically instantaneous communications, deadlock detection and the detection of unstable properties.

$^{9}$ And consequently a forced local checkpoint is taken.

References

[1] Elnozahy, E.N., Johnson, D.B. and Wang, Y.M. A Survey of Rollback-Recovery Protocols in Message-Passing Systems. Technical Report CMU-CS-96-181, Carnegie-Mellon University, 1996.
[2] Fidge, C.J. Logical Time in Distributed Computing Systems. IEEE Computer, 24(8), 1991.
[3] Lamport, L. Time, Clocks and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7), 1978.
[4] Manivannan, D. and Singhal, M. A Low Overhead Recovery Technique Using Quasi-Synchronous Checkpointing. Proc. of the 16th Int. Conf. on Distributed Computing Systems, Hong-Kong, May 1996.
[5] Mattern, F. Virtual Time and Global States of Distributed Systems. In Cosnard, Quinton, Raynal, and Robert, Eds, Proc. of the Int. Workshop on Parallel and Distributed Algorithms, France, 1988, pp. 215-226, Elsevier Science Publishers B.V., North Holland, 1989.
[6] Netzer, R.H.B. and Xu, J. Necessary and Sufficient Conditions for Consistent Global Snapshots. IEEE Transactions on Parallel and Distributed Systems, 6(2):165-169, 1995.
[7] Randell, B. System Structure for Software Fault-Tolerance. IEEE Transactions on Software Engineering, SE-1(2), 1975.
[8] Raynal, M. and Singhal, M. Logical Time: Capturing Causality in Distributed Systems. IEEE Computer, 29(2):49-56, February 1996.
[9] Russell, D.L. State Restoration in Systems of Communicating Processes. IEEE Transactions on Software Engineering, SE-6(2):183-194, 1980.
[10] Schwarz, R. and Mattern, F. Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail. Distributed Computing, 7:149-174, 1994.
[11] Wang, Y.M. Consistent Global Checkpoints That Contain a Given Set of Local Checkpoints. To appear in IEEE Transactions on Computers, 1996.
