
Research Report 33/2004, TU Wien, Institut für Technische Informatik, July 6, 2004

Time Free Self-Stabilizing Local Failure Detection

Martin Hutle and Josef Widder
Embedded Computing Systems Group 182/2, Technische Universität Wien, Treitlstraße 3/2, A-1040 Wien, Austria, EU

Abstract. It is widely acknowledged that failure detection is a useful building block for reliable distributed systems. Since many applications rely on it, implementations of failure detectors should be as reliable as possible. In this paper we present a failure detector implementation which tries to reconcile two approaches: self-stabilization and weak timing models. We introduce two time free self-stabilizing local failure detector implementations. The first handles an unbounded number of messages which may stem from the unstable period, but requires unbounded space. The second, more practical, implementation requires just bounded space while assuming a known upper bound on the number of messages, a reasonable assumption for many networks.

Keywords: Fault Tolerance, Self-Stabilization, Unreliable Failure Detectors, Timing Models

Supported by the Austrian bmvit FIT-IT project DCBA, project no.

1 Introduction

Unreliable failure detectors [4] are a well-known and practical way to overcome the impossibility of asynchronous consensus [7]. We focus on failure detectors in sparse networks [11, 12], i.e. networks where processes need not have a direct link to all other processes in the system. Such a low-level model of computation for failure detectors provides more efficiency and blends nicely with the fast failure detectors approach [10, 1]. Although it is possible to implement global failure detectors [11] in sparse networks, we focus on their local counterparts: processes monitor only their direct neighbors. In this paper we look at an eventually local failure detector. Such a failure detector satisfies local completeness and eventual local accuracy. A failure detector fulfills local completeness if every process eventually suspects every neighbor that has crashed; it fulfills eventual local accuracy if every process eventually stops suspecting every non-crashed neighbor. Note that in the special case of a fully connected network an eventually local failure detector becomes an eventually perfect failure detector [4].

Failure detector implementations should be as reliable as possible. We aim at two approaches here: (1) Weakening the system timing models as far as possible. Much recent work focuses on this topic; see [2] and [13, 14]. (2) Improving the reliability of the failure detector by referring to the self-stabilization paradigm [6]. In order to reconcile these approaches, algorithms have to be devised that stabilize even in the presence of permanent faults [3, 8, 5].

Regarding (1), our algorithms do not need to know any upper bound on the message end-to-end delay between processes. The only information we need is an upper bound on the ratio Θ between the upper and lower bound on the communication delay.

Regarding (2), the self-stabilization paradigm was initially proposed by Dijkstra [6]. It requires an algorithm to recover from any (invalid) state in finite time. In detail, we assume our system stabilizes at some unknown time t_GST, after which the timing assumptions hold, the number of permanent faults is bounded, and no further state corruption at correct processes may occur. On the other hand, at t_GST, processes may be in an arbitrary state and arbitrary messages may be in transit between the processes. Our analysis reveals that the number of these messages is an important parameter.

In detail, we devise two algorithms. The first one copes with an unbounded number of messages but requires unbounded space. Since we are mainly interested in practical solutions, the unbounded space requirement is not satisfactory. We hence devise a second algorithm which requires just bounded space. The required memory depends on M, the a priori known upper bound on the number of messages that may be in transit simultaneously. Since real networks are finite, we consider this upper bound as not very restrictive. For many networks, M can be derived analytically, since the capacity of links (determined by the memory allocated to queues at the various network layers) is bounded as well. Moreover, this bound has no influence on the run-time of our algorithm, and the space requirements are only logarithmic in M. So even if it is difficult to find a tight upper bound, one can still use an extremely over-dimensioned one. The second approach is therefore of greater practical relevance than the first one.
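As an aside, the following small Python snippet (an illustration added here, not part of the original report; the values of M are hypothetical) makes the logarithmic space claim concrete: the only state of the bounded-space algorithm of Section 5 that grows with M is a phase counter ranging over {0, ..., M+1}, so even a grossly over-dimensioned bound stays cheap to store.

    import math

    # The phase counter takes M + 2 distinct values, hence ceil(log2(M + 2)) bits suffice.
    for M in (100, 10**6, 10**9):
        bits = math.ceil(math.log2(M + 2))
        print(f"M = {M:>10}: phase counter fits in {bits} bits")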
Our failure detector implementation is straightforward if self-stabilization is not required: processes count round trips to neighbors. If no message from some neighbor q was received within a time interval in which Θ round trips to some other neighbor occurred, q must have crashed. In the first solution, which uses unbounded space, it is always possible to increase the local round number, and old messages can easily be recognized and dismissed, such that it can be shown that stabilization is reached in bounded time. When considering bounded memory, the major problem is that message values are reused and that, due to the desired time freedom, messages cannot be identified as old resp. faulty (since their values must be used again). We solve this problem by introducing M + 2 phases and show that within bounded time a phase must be reached during which all invalid messages are dismissed.

There is, however, the problem of deadlocks in purely time free self-stabilizing solutions. Such an algorithm requires some local event which is triggered from time to time in order to prevent that all processes wait on messages and do not send any [9]. Such an event needs to happen just eventually: it has to be triggered an infinite number of times, and the intervals between two such events have to be finite; their length has no influence on the correctness of the algorithm but only on its stabilization time. We assume the existence of such an event for our algorithms.

Self-stabilizing failure detector implementations were given in [3]. They use a property which requires that after a process has received m messages from one neighbor, it has received at least one message from every other neighbor. Although this seems to have some similarities to our timing assumption, their approach requires local clocks, whereas our algorithms are completely time and timer free.

The paper is organized as follows: In Section 2 we give the system model and in Section 3 we define the eventually local failure detector. The algorithm that uses unbounded space and handles an unbounded number of messages is presented in Section 4. In Section 5 we show an implementation that requires just bounded space, but needs an upper bound on the channel capacity. We conclude in Section 6.

2 System Model

Our distributed system comprises a finite set Π of processes, connected by a not necessarily fully connected network. We assume the existence of a global real-time clock with values from ℝ, which is used for analysis only and is not available to the processes. Two nodes connected by a direct link are called neighbors; the set of all neighbors of a process p is denoted by nb(p), and the size deg(p) = |nb(p)| of this set is called the degree of the node. We assume the communication graph satisfies ∀p ∈ Π : deg(p) > f, where f is the number of processes that may crash. We denote by v_p(t) the value of variable v at process p and time t. If p makes a step at t, v_p(t) denotes the value of v before the step, and v'_p(t) the value after the step.

2.1 Timing and Communication

We consider time free algorithms, i.e. processes have no access to hardware clocks or an external time base. Neighboring processes communicate by message passing. The time interval a message m is in transit consists of three parts: local message preparation (including queuing) at the sender, transmission over the link, and local receive computation (including queuing) at the receiver. We denote by t_s^m the instant the preparation of message m starts, and by t_r^m the instant the receive computation is finished. In our system model we say that message m is in transit during the real-time interval (t_s^m, t_r^m]. We denote by Q(p, q, t) the set of messages which are in transit from p to q or vice versa at time t, and by Q(p, t) = ∪_{q∈Π} Q(p, q, t) all messages in transit from or to p. Consequently, Q'(p, q, t) denotes the set of all messages from p to q or vice versa after a step at time t. Further, δ_m = t_r^m − t_s^m is the end-to-end (computational plus transmission) delay of a message m sent from one correct process to another.
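As a tiny illustration of the connectivity assumption ∀p ∈ Π : deg(p) > f stated above, the following Python sketch (the helper function and the example graph are hypothetical, not from the paper) checks it for a given neighbor map:

    # Sketch: verify deg(p) > f for every process, given a neighbor map nb.
    def satisfies_degree_assumption(nb: dict, f: int) -> bool:
        return all(len(neighbors) > f for neighbors in nb.values())

    # A triangle tolerates f = 1 crash: every node has degree 2 > 1.
    nb = {"p": {"q", "r"}, "q": {"p", "r"}, "r": {"p", "q"}}
    print(satisfies_degree_assumption(nb, f=1))   # True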
Our timing model stipulates an upper bound τ⁺ on the transmission delay as well as a lower bound τ⁻ such that 0 < τ⁻ ≤ δ_m ≤ τ⁺ < ∞, where τ⁻ and τ⁺ are not known in advance and need to hold only after some unknown global stabilization time t_GST. Since τ⁺ < ∞, every message sent from a correct process to another one after t_GST is eventually received. Links need not provide FIFO transmission. A measure for the timing uncertainty is the transmission delay ratio Θ = τ⁺/τ⁻. The presented failure detector implementations have a priori knowledge of some integer Ξ with Ξ > Θ. This is the only timing assumption required for our implementations; in particular, neither τ⁻ nor τ⁺ need to be known.

In order to prevent deadlocks, every process has some local mechanism that generates local deadlock prevention events from time to time. For the timing analysis we postulate that there exists an upper bound η on the duration between two such local events at every process; η is not known to the processes. Note that the actual value of η has no influence on the correctness of our algorithms. Practically, one would implement this e.g. with timers or clocks, although any local mechanism which gives some (even inaccurate) notion of elapsed time can be employed. Note, however, that our solution is time free in the sense that clocks are not used to time out messages, such that we share the advantages of time free designs.

2.2 Self-stabilization and failures

We assume that processes can fail by crashing. We denote by C the set of correct processes, i.e. processes that never crash; the set F comprises all processes that eventually crash. At some unknown time t_GST the system stabilizes. Before t_GST, the system may behave arbitrarily: arbitrary state corruptions may occur, no timing assumptions hold, and messages may be lost or spontaneously generated. At and after t_GST, no further state corruptions may occur (although the system may still be in an illegitimate state), and links are reliable and follow the timing assumptions. Moreover, every message that is in Q(p, t_GST) is delivered before t_GST + τ⁺. Process crashes may occur at any time, before and after t_GST.

3 Failure Detectors

Like most other failure detectors, our algorithm outputs a list of suspected processes. Formally, a failure detector history is a function H : Π × ℝ → 2^Π. If a process q is in H(p, t) at some time t, we say p suspects q at time t. We now define an eventually local perfect failure detector. Such a failure detector has to satisfy the same properties as an eventually perfect failure detector [4], but only for neighbors:

Local Completeness: Every process that crashes is eventually suspected by all correct neighbors. Formally,
∀p ∈ F ∀q ∈ nb(p) ∩ C : ∃t_0 ∀t ≥ t_0 : p ∈ H(q, t)

Eventual Local Accuracy: Eventually, no correct process is suspected by any correct neighbor. Formally,
∀p ∈ C ∀q ∈ nb(p) ∩ C : ∃t_0 ∀t ≥ t_0 : p ∉ H(q, t)

The class of eventually local perfect failure detectors is denoted by P_l.
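Before turning to the algorithms, here is a brief numeric illustration of the timing parameters of Section 2.1 (the delay values below are hypothetical, chosen only for the example; they are never known to the processes, only the integer Ξ is):

    # Hypothetical delay bounds in milliseconds.
    tau_minus = 2                         # lower bound on end-to-end delay
    tau_plus = 7                          # upper bound on end-to-end delay
    Theta = tau_plus / tau_minus          # 3.5
    Xi = int(Theta) + 1                   # any integer > Theta works; here Xi = 4
    print(f"Theta = {Theta}, choose Xi = {Xi}")

Overestimating Ξ is safe for correctness and only increases the detection time, so in practice a comfortable safety margin can be used.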

1  state variables
2    ∀q ∈ Π : lastmsg_p[q] ∈ ℕ
3
4  if received (p, k) from q
5    if k > lastmsg_p[q]
6      lastmsg_p[q] ← k
7      if k = max_{r∈Π} {lastmsg_p[r]} and ∄s ≠ q : lastmsg_p[s] = lastmsg_p[q]
8        suspect ← {r ∈ Π | k − lastmsg_p[r] ≥ Ξ}
9        send (p, k + 1) to all neighbors
10
11 if received (q, k) from q
12   send (q, k) to q
13
14 on deadlock-prevention-event do
15   send (p, max_{q∈Π} {lastmsg_p[q]} + 1) to all neighbors

Figure 1: Algorithm for process p with no bound on the number of messages on a link.

4 Unbounded Link Capacity

In this section we describe a novel implementation of P_l which handles an unbounded number of messages on the links; more precisely, no bound on |Q(p, t_GST)| is required for any process p. This algorithm, however, requires unbounded memory. We show that the algorithm stabilizes even if there are infinitely many messages in transit at t_GST. This result is hence also a solution for systems where links have known or unknown bounds on the maximum number of messages on the links.

The algorithm for a process p is given in Figure 1. With every neighbor of p, (p, k) messages are exchanged, where k is an integer. When a neighbor q receives such a message, q just returns it to p (lines 11-12) and no further processing is done. For every neighbor q, p holds a variable lastmsg_p[q], where it stores the highest integer k such that p received a (p, k) reply from q. We also write lastmsg_{p,q} for lastmsg_p[q]. The highest value among all lastmsg_{p,q} determines the round of process p. Thus we define round_p(t) = max_{q∈Π} {lastmsg_{p,q}(t)} and round'_p(t) = max_{q∈Π} {lastmsg'_{p,q}(t)}, respectively. Every time a new round is reached (by receiving a message (p, k) such that k > round_p), p sends a (p, round_p + 1) message to all neighbors. Note that the fastest neighbor determines the round progress, i.e. round round_p + 1 is started when the first neighbor returns the (p, round_p + 1) message to p. By our timing assumptions, this requires at least 2τ⁻ time. The reply of the slowest neighbor requires at most 2τ⁺ time; by then, round_p has advanced by at most 2τ⁺/(2τ⁻) < Ξ additional rounds. Thus, for every correct neighbor q the difference round_p − lastmsg_{p,q} is less than Ξ, whereas for every faulty neighbor p eventually stops updating lastmsg_{p,q}. So we set H(p, t) to the set of processes with round_p − lastmsg_{p,q} ≥ Ξ (line 8).

In lines 14-15, from time to time the last message is resent to every neighbor. As described in the introduction, this prevents the algorithm from deadlocking if messages are lost during the unstable period. Note that this has no influence on the operation of the algorithm: since all messages with k ≤ lastmsg_{p,q} are dropped, only the first message that is received for a round has an influence on the behavior of p. For the timing analysis we assume that lines 14-15 are activated at least once every η time units; this is not required by the algorithm itself. We start the analysis with some preliminary lemmas.
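Before the analysis, the handlers of Figure 1 can be rendered as the following minimal Python sketch (an illustration under the stated assumptions, not the authors' implementation; the class name and the send(q, msg) transport callback are hypothetical):

    class UnboundedFD:
        """Sketch of Figure 1 for one process; send(q, msg) is supplied by the environment."""
        def __init__(self, me, neighbors, Xi, send):
            self.me, self.neighbors, self.Xi, self.send = me, neighbors, Xi, send
            self.lastmsg = {q: 0 for q in neighbors}      # lastmsg_p[q]
            self.suspect = set()                          # failure detector output

        def round(self):
            return max(self.lastmsg.values())             # round_p

        def on_receive(self, msg, sender):
            owner, k = msg
            if owner != self.me:                          # lines 11-12: echo the probe
                self.send(sender, (owner, k))
                return
            if k > self.lastmsg[sender]:                  # line 5
                new_round = all(v != k for v in self.lastmsg.values())
                self.lastmsg[sender] = k                  # line 6
                if k == self.round() and new_round:       # line 7: first reply of round k
                    self.suspect = {r for r, v in self.lastmsg.items()
                                    if k - v >= self.Xi}  # line 8
                    for q in self.neighbors:              # line 9
                        self.send(q, (self.me, k + 1))

        def on_deadlock_prevention(self):                 # lines 14-15
            for q in self.neighbors:
                self.send(q, (self.me, self.round() + 1))

Since the algorithm is self-stabilizing, the initial values of lastmsg are irrelevant; the sketch sets them to 0 only for convenience.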

Lemma 1 (Monotonicity). After t_GST, round_p(t) is monotonically increasing with time t, i.e. t_GST ≤ t_1 ≤ t implies round_p(t_1) ≤ round_p(t) ≤ round'_p(t).

Proof. Obvious, since round_p is the maximum of all lastmsg_{p,q}, and by lines 5 and 6 lastmsg_{p,q} is monotonically increasing.

Lemma 2 (Progress). There is a time t, t_GST ≤ t < t_GST + max{2τ⁺, η}, such that p broadcasts (p, round'_p(t) + 1) at time t.

Proof. At time t_GST we have to distinguish two cases:

(1) There is at least one neighbor q of p such that at least one message (p, l) with l > round_p(t_GST) is in Q(p, q, t_GST). Let (p, k) be the first of these messages to be received by p, at some time t ≥ t_GST. Obviously, t < t_GST + 2τ⁺. Since by assumption this is the first message after t_GST that changes round_p, we have round_p(t) = round_p(t_GST). Thus, lastmsg_{p,q}(t) ≤ round_p(t) < k, so p executes lines 6 and 9 and therefore broadcasts a (p, round'_p(t) + 1) message, where round'_p(t) = k.

(2) No such message exists. Then by time t = t_GST + η, line 15 is executed, and again (p, round'_p(t) + 1) is broadcast.

Thus by time t_GST + max{2τ⁺, η} the required message is broadcast in both cases.

Lemma 3 (Stabilization). For every message (p, k) which is received by p at some time t ≥ t_GST + 2τ⁺, it holds that k ≤ round_p(t) + 1.

Proof. Since sending a message to a neighbor and receiving the answer takes at most 2τ⁺, there is a time t_1 ≥ t_GST at which p broadcast (p, k). However, since we are after t_GST, k must be equal to round'_p(t_1) + 1, since p can broadcast only (p, round'_p(t_1) + 1) messages (lines 9, 15). By Lemma 1, round_p is monotonically increasing with time. Therefore, from t ≥ t_1 it follows that round_p(t) + 1 ≥ round'_p(t_1) + 1 = k.

Let t_stable = t_GST + max{2τ⁺, η}; from this time on we have progress (Lemma 2) and a correct message pattern (Lemma 3).

Lemma 4 (Fastest Progress). Let correct process p broadcast (p, k) at some time t ≥ t_stable for the first time. Then p does not broadcast (p, k + l) before time t + 2lτ⁻.

Proof. By induction on l. For l = 1, assume by contradiction that p broadcasts (p, k + 1) before time t + 2τ⁻. If (p, k + 1) were sent by line 15, it must have been sent by line 9 before (since we are after t_stable and, by Lemma 2, at least one message has been sent by line 9), so this would not be the first time. Hence (p, k + 1) is sent by line 9, which means p received a (p, k) reply that was sent before time t + τ⁻ by some q (as a response, lines 11-12), and hence broadcast before time t by p. Contradiction. Now assume p does not broadcast (p, k + l − 1) before t + 2(l − 1)τ⁻ for the first time. With the same argumentation, p does not broadcast (p, k + l) before time t + 2lτ⁻.

Lemma 5 (Slowest Progress). Let correct process p broadcast (p, k) at some time t > t_stable for the first time. Then p broadcasts (p, k + l) by time t + 2lτ⁺.

Proof. By induction on l. Since p broadcasts (p, k) at time t, all neighbors of p receive this message by time t + τ⁺, and every correct neighbor (since deg(p) > f there exists at least one) returns it. These replies are received by p by time t + 2τ⁺. Consider the reception of the first of these replies. Because of Lemma 3 (note that k = round'_p(t) + 1), p receives no message (p, k') with k' > k by t + 2τ⁺, thus p broadcasts (p, k + 1). For l > 1, assume p broadcasts (p, k + l − 1) by time t + 2(l − 1)τ⁺. With the same argumentation, p broadcasts (p, k + l) by time t + 2lτ⁺.

Lemma 6. For every time t > t_stable + 2τ⁺Ξ and every correct neighbor q of correct process p, it holds that round_p(t) − lastmsg_{p,q}(t) < Ξ.

Proof. Since all variables are non-decreasing, the condition can only be violated by increasing round_p. So we consider only times t where round'_p(t) > round_p(t). By Lemma 3 we have round'_p(t) = round_p(t) + 1. By Lemmas 2, 3 and 4, a message (p, round'_p(t) − Ξ + 1) was broadcast by p at some time t_s ≤ t − 2τ⁻Ξ, and by Lemma 5, t_s > t_stable, so this message really exists. This message is answered by every correct neighbor q, and the reply is received by some time t_r ≤ t_s + 2τ⁺, thus lastmsg'_{p,q}(t_r) ≥ round'_p(t) − Ξ + 1 > round'_p(t) − Ξ. Because of Ξ > Θ we have t_r ≤ t_s + 2τ⁺ ≤ t − 2τ⁻Ξ + 2τ⁺ ≤ t, and so from Lemma 1 it follows that lastmsg_{p,q}(t) ≥ lastmsg'_{p,q}(t_r). Hence we get round'_p(t) − lastmsg_{p,q}(t) < Ξ.

Theorem 1 (Local Completeness). The algorithm in Figure 1 fulfills local completeness, i.e. eventually every non-correct neighbor of p is suspected by p.

Proof. Let t_crash be the time q crashes, and t = max{t_crash, t_GST}. Then no message from q to p is received after t + τ⁺; after this, lastmsg_{p,q} remains unchanged. By Lemma 5, round_p reaches round_p(t) + Ξ by time max{t_crash + τ⁺, t_stable} + 2Ξτ⁺. Since round_p is non-decreasing, q remains suspected from then on.

Theorem 2 (Eventual Local Accuracy). The algorithm in Figure 1 fulfills eventual local accuracy, i.e. eventually p stops suspecting every correct neighbor of p.

Proof. Follows directly from Lemma 6 and line 8 of the algorithm.

Corollary 1. The algorithm in Figure 1 implements a self-stabilizing eventually perfect local failure detector in a sparse network. After stabilization, a crashed process is suspected Ξ rounds after it crashed, i.e. the worst case failure detection time is 2Ξτ⁺.

5 Bounded Memory and Link Capacity

In this section we give a solution to failure detection which requires that the number of messages that can be in transit at the same time over one link is bounded, and that this bound is known in advance. In contrast to the algorithm of the previous section, this one requires just bounded memory. We believe that this result is of practical interest: real computers have bounded memory, which is not only used to store the variables of our algorithms, but also to store messages in various queues. Since queues are the significant parts of links, the assumption that the number of messages is bounded seems reasonable to us.

The algorithm is depicted in Figure 2. Conceptually, it works similarly to the one in the previous section. However, since our integers are bounded, eventually we need to wrap around the round number. We call one such cycle a phase. To avoid that messages of a previous phase interfere with the current one, we use phase numbers. Since the range of the phase numbers is also bounded due to the bounded memory assumption, we have to ensure that there are sufficiently many distinct phase numbers such that no interference is possible. We show in our analysis that if there are at most M messages in all links of a process, M + 2 phases are sufficient to ensure stabilization. The idea behind this is that there exists at least one phase which cannot be shortened by faulty messages from the unstable period. The second difference to the algorithm in Figure 1 is that this algorithm broadcasts only on a phase switch, whereas the previous one broadcasts every round. By our assumption, |Q(p, t_GST)| ≤ M < ∞ for all processes p.
For any phase number ph we further define next(ph) = (ph + 1) mod (M + 2) and prev(ph) = (ph + M + 1) mod (M + 2).
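For concreteness, the wrap-around arithmetic can be written as follows (a small illustration; M = 3 is an arbitrary example value):

    def next_phase(ph: int, M: int) -> int:
        return (ph + 1) % (M + 2)

    def prev_phase(ph: int, M: int) -> int:
        return (ph + M + 1) % (M + 2)

    # With M = 3 there are M + 2 = 5 phase numbers 0..4:
    print([next_phase(ph, 3) for ph in range(5)])   # [1, 2, 3, 4, 0]
    print([prev_phase(ph, 3) for ph in range(5)])   # [4, 0, 1, 2, 3]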

1  state variables
2    phase_p ∈ {0, ..., M + 1}
3    ∀q ∈ Π : lastmsg_p[q] ∈ {0, ..., Ξ}
4
5  if received (p, ph, k) from q
6    if ph = phase_p and k > lastmsg_p[q]
7      if k < Ξ
8        lastmsg_p[q] ← k
9        send (p, phase_p, k + 1) to q
10     else
11       suspect ← {r ∈ Π | lastmsg_p[r] = 0}
12       phase_p ← (phase_p + 1) mod (M + 2)
13       ∀r ∈ Π : lastmsg_p[r] ← 0
14       send (p, phase_p, 1) to all neighbors
15
16 if received (q, ph, k) from q
17   send (q, ph, k) to q
18
19 on deadlock-prevention-event do
20   ∀q ∈ Π : send (p, phase_p, lastmsg_p[q] + 1) to q

Figure 2: Algorithm for process p with a known upper bound on the number of messages on links.

Lemma 7. For every process p, in any execution of our algorithm, there exists at least one phase number ph_0 such that no message (p, ph_0, k) is in Q(p, t_GST) and ph_0 ≠ phase_p(t_GST).

Proof. Obviously, |Q(p, t_GST)| = x ≤ M. At time t_GST process p can be in one phase only. The number of phase numbers is M + 2 > x + 1, such that at least one phase number remains.

We now define two properties that characterize the legitimate states for process p. When the stability property is fulfilled, it is ensured that there are no malicious messages in transit anymore. The progress property ensures that the system is not deadlocked, i.e. there are sufficiently many messages in transit to keep the failure detector working.

Definition 1 (Stability). For a process p, the predicate PS(p, t) holds at time t iff there is no next(phase_p(t)) message in transit. Formally,
PS(p, t) ≡ ∄k : (p, next(phase_p(t)), k) ∈ Q(p, t)

Definition 2 (Progress). For a process p, the predicate PP(p, t) holds at time t iff there is at least one correct neighbor q of p from or to which a message with k > lastmsg_{p,q}(t) and the current phase is in transit. Formally,
PP(p, t) ≡ ∃q ∈ C ∩ nb(p) ∃k > lastmsg_{p,q}(t) : (p, phase_p(t), k) ∈ Q(p, q, t)

We start by showing closure of progress, i.e. if PP holds once after t_GST it holds forever.

Lemma 8. If there is a time t_0 ≥ t_GST at which PP(p, t_0) holds, then PP(p, t) holds for all times t > t_0. Formally,
∀t_0 ≥ t_GST : PP(p, t_0) ⇒ ∀t > t_0 : PP(p, t)

Proof. Assume by contradiction that there is a time t > t_0 where PP(p, t) does not hold for the first time. Since by assumption the predicate held before that, for some non-faulty q, either lastmsg'_{p,q}(t) ≠ lastmsg_{p,q}(t) or (p, phase_p(t), k) ∉ Q'(p, q, t) anymore. In both cases, p has received a (p, phase_p(t), k') message with k' > lastmsg_{p,q}(t) and thus either sends a (p, phase_p(t), lastmsg'_{p,q}(t) + 1) message to q (lines 8-9) or sends a (p, phase'_p(t), 1) message to all neighbors and sets lastmsg_{p,q} = 0 (lines 11-14). In both cases the property also holds at time t. Contradiction.
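Before continuing the analysis, note that the handlers of Figure 2 admit a compact Python rendering much like the sketch in Section 4 (again only an illustrative sketch; the class name and the send(q, msg) callback are hypothetical):

    class BoundedFD:
        """Sketch of Figure 2 for one process; send(q, msg) is supplied by the environment."""
        def __init__(self, me, neighbors, Xi, M, send):
            self.me, self.neighbors, self.Xi, self.M, self.send = me, neighbors, Xi, M, send
            self.phase = 0                                # phase_p in {0, ..., M+1}
            self.lastmsg = {q: 0 for q in neighbors}      # lastmsg_p[q] in {0, ..., Xi}
            self.suspect = set()

        def on_receive(self, msg, sender):
            owner, ph, k = msg
            if owner != self.me:                          # lines 16-17: echo the probe
                self.send(sender, (owner, ph, k))
                return
            if ph == self.phase and k > self.lastmsg[sender]:   # line 6
                if k < self.Xi:                           # lines 7-9
                    self.lastmsg[sender] = k
                    self.send(sender, (self.me, self.phase, k + 1))
                else:                                     # lines 10-14: end of phase
                    self.suspect = {r for r, v in self.lastmsg.items() if v == 0}
                    self.phase = (self.phase + 1) % (self.M + 2)
                    self.lastmsg = {r: 0 for r in self.neighbors}
                    for q in self.neighbors:
                        self.send(q, (self.me, self.phase, 1))

        def on_deadlock_prevention(self):                 # lines 19-20
            for q in self.neighbors:
                self.send(q, (self.me, self.phase, self.lastmsg[q] + 1))

Unlike the sketch for Figure 1, the suspect set is recomputed only when the phase wraps around (line 11).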

The following lemma ensures convergence to PP, i.e. within bounded time after t_GST it is guaranteed that PP holds.

Lemma 9. By time t_GST + η, PP(p, t) holds forever.

Proof. By time t_GST + η, p sends a (p, phase_p, lastmsg_{p,q} + 1) message to every neighbor q. Since f < deg(p), at least one of them is non-faulty and thus PP holds for p at this time. By Lemma 8, PP holds forever after that.

We have seen that our algorithm stabilizes such that PP always holds within bounded time after t_GST. We now turn our attention to the PS property and start with some preliminary lemmas.

Lemma 10 (Fastest Progress). Assume p starts phase ph = phase_p(t) by broadcasting (p, ph, 1) at time t, and PS(p, t) holds. Then p does not broadcast (p, next(ph), 1) before time t + 2τ⁻Ξ > t + 2τ⁺.

Proof. Note that p broadcasts (p, next(ph), 1) only if it receives a (p, ph, Ξ) message from one of its neighbors. We show by induction that this cannot happen before t + 2τ⁻Ξ. For the base case, note that sending (p, ph, 1) to some neighbor q and receiving the answer takes at least 2τ⁻, and since by PS(p, t) no other messages with phase ph are in transit, p cannot receive a (p, ph, 1) reply before t + 2τ⁻. Because p is still in phase ph, no messages with other phases were broadcast, thus PS still holds. Now assume p does not receive a (p, ph, Ξ − 1) message before t + 2τ⁻(Ξ − 1). By the same argumentation, the (p, ph, Ξ) message is not received before t + 2τ⁻Ξ and PS holds. Since Ξ > Θ (compare Section 2.1), t + 2τ⁻Ξ > t + 2τ⁺.

Lemma 11 (Slowest Progress). Assume p starts phase ph by broadcasting (p, ph, 1) at time t. Then p broadcasts (p, next(ph), 1) by time t + 2τ⁺Ξ.

Proof. Note that p broadcasts (p, next(ph), 1) if it receives a (p, ph, Ξ) message from one of its neighbors and is still in phase ph. If p is no longer in phase ph we are done, so we show by induction that p receives a (p, ph, Ξ) message by time t + 2τ⁺Ξ. Sending a message to a neighbor and back requires at most 2τ⁺, thus by time t + 2τ⁺, p receives (p, ph, 1). For the inductive step assume p receives (p, ph, Ξ − 1) by time t + 2τ⁺(Ξ − 1). Then by the same argumentation p receives (p, ph, Ξ) by time t + 2τ⁺Ξ.

Lemma 12. Assume phase_p(t) = ph. Then phase_p(t_1) = prev(ph) for some time t_1 > t > t_GST only if p was in all phases in the time interval [t, t_1].

Proof. By line 12 of the algorithm, p changes its phase only to next(phase_p(t)) and thus has to pass through all other phase numbers before reaching prev(phase_p(t)).

We now show that PS is reached shortly after t_GST.

Lemma 13. PS(p, t) holds at time t = t_GST + 2τ⁺.

Proof. We have to show that no message (p, l, k), for any k and l = next(phase_p(t)), is in transit at time t. Such a message cannot be in Q(p, t_GST), since all these messages are received by t_GST + τ⁺. Such a message cannot be a reply from one of p's neighbors to a faulty message which was in Q(p, t_GST) either, since all these responses must be received by p before t. Thus, a message (p, l, k) can only be in transit at time t if p was in phase l at some time t_1, t_GST ≤ t_1 ≤ t. It remains to show that this is not possible. Since p is in phase prev(l) at time t, it must, by Lemma 12, have been in all phases 0, ..., M + 1 between time t_1 and t. Thus there must be some time t_2, t_1 ≤ t_2 ≤ t, such that phase_p(t_2) = prev(ph_0), i.e. phase ph_0 was started then. It follows that PS(p, t_2) holds. By Lemma 10 this phase cannot be terminated before some time t_3 > t_2 + 2τ⁺ ≥ t, which is a contradiction to p being in phase prev(l) at time t.

It remains to show closure, i.e. if PS is reached once, it holds forever.

Definition 3. We define σ(p, t, ph) as the first time after t at which p reaches phase ph. Formally,
σ(p, t, ph) = min{t' > t | phase_p(t') = ph ∧ ∄t'' : t < t'' < t' ∧ phase_p(t'') = ph}

Lemma 14. From PS(p, t), where t ≥ t_GST, it follows that PS(p, t_1) holds for all times t_1 with t ≤ t_1 < σ(p, t, next(phase_p(t))).

Proof. Obvious, since phase_p remains unchanged, no spontaneous messages are generated after t_GST, and p sends phase_p(t) messages only.

Lemma 15. Let PS(p, t_1) hold at time t_1 = σ(p, t, ph), where t ≥ t_GST. Then PS(p, t_2) holds at time t_2 = σ(p, t, next(ph)).

Proof. By Lemma 10, p terminates phase ph only after σ(p, t, ph) + 2τ⁺. All messages which are in transit to p at time σ(p, t, ph) are received by time σ(p, t, ph) + τ⁺. All messages for phases other than ph are ignored by p (and hence no messages are sent in response). All messages for phases other than ph which are in transit from p to its neighbors are answered by them by line 17; these answers are received by p by time σ(p, t, ph) + 2τ⁺ and are ignored as well, since p is still in phase ph. Hence no messages for phases other than ph are in transit at time t_2. Since next(next(ph)) ≠ ph, the lemma holds.

Lemma 16. After time t_stable = t_GST + 2τ⁺, PS holds at all phase switch times.

Proof. By Lemma 13, PS(p, t) holds at time t = t_GST + 2τ⁺. By Lemma 14 it follows that PS(p, t_1) holds for all times t_1 with t ≤ t_1 < σ(p, t, next(phase_p(t))). From an inductive application of Lemma 15 it follows that PS holds at all phase switch times after that.

From these lemmas it follows that, after some time, all phases are sufficiently long to time out processes. We now show local completeness and local accuracy.

Theorem 3 (Local Completeness). The algorithm in Figure 2 fulfills local completeness, i.e. eventually every non-correct neighbor of p is suspected by p.

Proof. Assume neighbor q of p has crashed. By Lemma 9, PP holds by time t_GST + η. Note that every message (p, ph, k) with k > lastmsg_{p,q'} and ph = phase_p received from some neighbor q' causes either a (p, ph, k + 1) message (for k < Ξ) or a (p, next(ph), 1) message. Consequently, p eventually reaches k = Ξ and switches to the next phase (lines 11-14). When p reaches k = Ξ in the next phase, lastmsg_{p,q} = 0, since there was no message from q. According to line 11, p suspects q.

Theorem 4 (Eventual Local Accuracy). The algorithm in Figure 2 fulfills eventual local accuracy, i.e. eventually p stops suspecting every correct neighbor of p.

Proof. By Lemma 16 and Lemma 10, all phases that are started after t_GST + 2τ⁺ are longer than 2τ⁺. This is sufficiently long to ensure that the answers of all correct neighbors q to p's (p, ph, 1) message are received by p before p executes line 11 at some time t. It follows that lastmsg_{p,q}(t) > 0 for every correct neighbor q when p executes line 11, such that no correct process will ever be suspected by p.

Corollary 2. The algorithm in Figure 2 implements a self-stabilizing eventually perfect local failure detector in a sparse network. When a process crashes in a phase (after replying to at least one message), it is suspected at the end of the next phase. Thus, it is easy to see that the worst case failure detection time, once the failure detector has stabilized, is (4Ξ − 1)τ⁺.
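A quick numeric reading of the two detection-time bounds (Corollary 1 and Corollary 2), with hypothetical parameter values:

    # Hypothetical values: tau_plus = 10 ms, Xi = 5 (so Theta < 5 is assumed).
    tau_plus_ms, Xi = 10, 5
    print("Figure 1 (Corollary 1):", 2 * Xi * tau_plus_ms, "ms")        # 100 ms
    print("Figure 2 (Corollary 2):", (4 * Xi - 1) * tau_plus_ms, "ms")  # 190 ms

The bounded-space algorithm thus pays roughly a factor of two in detection latency for its bounded memory, reflecting that a crash is only detected at the end of the following phase.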

6 Conclusions

We presented two time free implementations of P_l which are self-stabilizing. The results described in this paper are stated in the context of sparse networks; obviously they apply to fully connected networks as well. We provided two algorithms for a self-stabilizing local failure detector in a sparse network, using only weak timing assumptions. The existence of a bound on the channel capacity can be identified as an additional design parameter. Whereas the solution with unbounded space is only of theoretical interest, it is the second approach that is relevant in practice. The assumption of bounded channel capacity is valid in most systems. Moreover, the chosen bound has no effect on the runtime of the algorithm, and the message size is only logarithmic in this bound.

It is an open question whether there is a solution for unbounded channel capacity and bounded space. Although unbounded channel capacity may not be of practical interest, the subject is at least of theoretical relevance. Furthermore, the question remains whether there is a solution, requiring just bounded space, when there is only an unknown bound on the channel capacity.

References

[1] Marcos Aguilera, Gérard Le Lann, and Sam Toueg. On the impact of fast failure detectors on real-time fault-tolerant systems. In Proceedings of the 16th International Symposium on Distributed Computing (DISC 02), volume 2508 of LNCS. Springer Verlag, October.

[2] Marcos K. Aguilera, Carole Delporte-Gallet, Hugues Fauconnier, and Sam Toueg. On implementing Omega with weak reliability and synchrony assumptions. In Proceedings of the 22nd Annual ACM Symposium on Principles of Distributed Computing (PODC 03).

[3] J. Beauquier and S. Kekkonen. Fault-tolerance and self-stabilization: Impossibility results and solutions using self-stabilizing failure detectors. International Journal of Systems Science, 28(11).

[4] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2), March.

[5] Ariel Daliot, Danny Dolev, and Hanna Parnas. Linear time byzantine self-stabilizing clock synchronization. In Proceedings of the 7th International Conference on Principles of Distributed Systems, December. To appear.

[6] Edsger W. Dijkstra. Self-stabilizing systems in spite of distributed control. Communications of the ACM, 17(11).

[7] Michael J. Fischer, Nancy A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2), April.

[8] Felix Gärtner. On crash failures and self-stabilization. Presentation at Journées Internationales sur l'auto-stabilisation, CIRM, Luminy, France, October 21-25, 2002.

[9] Mohamed G. Gouda and Nicholas J. Multari. Stabilizing communication protocols. IEEE Transactions on Computers, 40(4), April.

[10] J.-F. Hermant and Gérard Le Lann. Fast asynchronous uniform consensus in real-time distributed systems. IEEE Transactions on Computers, 51(8), August.

[11] Martin Hutle. An efficient failure detector for sparsely connected networks. In Proc. IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN 04), Innsbruck, Austria, February.

[12] Martin Hutle. On omega in sparse networks. In Proc. 10th International Symposium on Pacific Rim Dependable Computing (PRDC 04), Papeete, Tahiti, March.

[13] Gérard Le Lann and Ulrich Schmid. How to implement a timer-free perfect failure detector in partially synchronous systems. Technical Report 183/1-127, Department of Automation, Technische Universität Wien, January. (Submitted.)

[14] Josef Widder. Booting clock synchronization in partially synchronous systems. In Proceedings of the 17th International Symposium on Distributed Computing (DISC 03), volume 2848 of LNCS. Springer Verlag, October.


Logical Time. 1. Introduction 2. Clock and Events 3. Logical (Lamport) Clocks 4. Vector Clocks 5. Efficient Implementation

Logical Time. 1. Introduction 2. Clock and Events 3. Logical (Lamport) Clocks 4. Vector Clocks 5. Efficient Implementation Logical Time Nicola Dragoni Embedded Systems Engineering DTU Compute 1. Introduction 2. Clock and Events 3. Logical (Lamport) Clocks 4. Vector Clocks 5. Efficient Implementation 2013 ACM Turing Award:

More information

Causality and Time. The Happens-Before Relation

Causality and Time. The Happens-Before Relation Causality and Time The Happens-Before Relation Because executions are sequences of events, they induce a total order on all the events It is possible that two events by different processors do not influence

More information

Time. To do. q Physical clocks q Logical clocks

Time. To do. q Physical clocks q Logical clocks Time To do q Physical clocks q Logical clocks Events, process states and clocks A distributed system A collection P of N single-threaded processes (p i, i = 1,, N) without shared memory The processes in

More information

Genuine atomic multicast in asynchronous distributed systems

Genuine atomic multicast in asynchronous distributed systems Theoretical Computer Science 254 (2001) 297 316 www.elsevier.com/locate/tcs Genuine atomic multicast in asynchronous distributed systems Rachid Guerraoui, Andre Schiper Departement d Informatique, Ecole

More information

Do we have a quorum?

Do we have a quorum? Do we have a quorum? Quorum Systems Given a set U of servers, U = n: A quorum system is a set Q 2 U such that Q 1, Q 2 Q : Q 1 Q 2 Each Q in Q is a quorum How quorum systems work: A read/write shared register

More information

A Self-Stabilizing Minimal Dominating Set Algorithm with Safe Convergence

A Self-Stabilizing Minimal Dominating Set Algorithm with Safe Convergence A Self-Stabilizing Minimal Dominating Set Algorithm with Safe Convergence Hirotsugu Kakugawa and Toshimitsu Masuzawa Department of Computer Science Graduate School of Information Science and Technology

More information

Simultaneous Consensus Tasks: A Tighter Characterization of Set-Consensus

Simultaneous Consensus Tasks: A Tighter Characterization of Set-Consensus Simultaneous Consensus Tasks: A Tighter Characterization of Set-Consensus Yehuda Afek 1, Eli Gafni 2, Sergio Rajsbaum 3, Michel Raynal 4, and Corentin Travers 4 1 Computer Science Department, Tel-Aviv

More information

Asynchronous Leasing

Asynchronous Leasing Asynchronous Leasing Romain Boichat Partha Dutta Rachid Guerraoui Distributed Programming Laboratory Swiss Federal Institute of Technology in Lausanne Abstract Leasing is a very effective way to improve

More information

How to solve consensus in the smallest window of synchrony

How to solve consensus in the smallest window of synchrony How to solve consensus in the smallest window of synchrony Dan Alistarh 1, Seth Gilbert 1, Rachid Guerraoui 1, and Corentin Travers 2 1 EPFL LPD, Bat INR 310, Station 14, 1015 Lausanne, Switzerland 2 Universidad

More information

Benchmarking Model Checkers with Distributed Algorithms. Étienne Coulouma-Dupont

Benchmarking Model Checkers with Distributed Algorithms. Étienne Coulouma-Dupont Benchmarking Model Checkers with Distributed Algorithms Étienne Coulouma-Dupont November 24, 2011 Introduction The Consensus Problem Consensus : application Paxos LastVoting Hypothesis The Algorithm Analysis

More information

Agreement. Today. l Coordination and agreement in group communication. l Consensus

Agreement. Today. l Coordination and agreement in group communication. l Consensus Agreement Today l Coordination and agreement in group communication l Consensus Events and process states " A distributed system a collection P of N singlethreaded processes w/o shared memory Each process

More information

Byzantine agreement with homonyms

Byzantine agreement with homonyms Distrib. Comput. (013) 6:31 340 DOI 10.1007/s00446-013-0190-3 Byzantine agreement with homonyms Carole Delporte-Gallet Hugues Fauconnier Rachid Guerraoui Anne-Marie Kermarrec Eric Ruppert Hung Tran-The

More information

6.852: Distributed Algorithms Fall, Class 24

6.852: Distributed Algorithms Fall, Class 24 6.852: Distributed Algorithms Fall, 2009 Class 24 Today s plan Self-stabilization Self-stabilizing algorithms: Breadth-first spanning tree Mutual exclusion Composing self-stabilizing algorithms Making

More information

Verification of clock synchronization algorithm (Original Welch-Lynch algorithm and adaptation to TTA)

Verification of clock synchronization algorithm (Original Welch-Lynch algorithm and adaptation to TTA) Verification of clock synchronization algorithm (Original Welch-Lynch algorithm and adaptation to TTA) Christian Mueller November 25, 2005 1 Contents 1 Clock synchronization in general 3 1.1 Introduction............................

More information

Snap-Stabilizing PIF and Useless Computations

Snap-Stabilizing PIF and Useless Computations Snap-Stabilizing PIF and Useless Computations Alain Cournier Stéphane Devismes Vincent Villain LaRIA, CNRS FRE 2733 Université de Picardie Jules Verne, Amiens (France) Abstract A snap-stabilizing protocol,

More information

Can an Operation Both Update the State and Return a Meaningful Value in the Asynchronous PRAM Model?

Can an Operation Both Update the State and Return a Meaningful Value in the Asynchronous PRAM Model? Can an Operation Both Update the State and Return a Meaningful Value in the Asynchronous PRAM Model? Jaap-Henk Hoepman Department of Computer Science, University of Twente, the Netherlands hoepman@cs.utwente.nl

More information

EFFICIENT COUNTING WITH OPTIMAL RESILIENCE. Lenzen, Christoph.

EFFICIENT COUNTING WITH OPTIMAL RESILIENCE. Lenzen, Christoph. https://helda.helsinki.fi EFFICIENT COUNTING WITH OPTIMAL RESILIENCE Lenzen, Christoph 2017 Lenzen, C, Rybicki, J & Suomela, J 2017, ' EFFICIENT COUNTING WITH OPTIMAL RESILIENCE ' SIAM Journal on Computing,

More information

I R I S A P U B L I C A T I O N I N T E R N E THE NOTION OF VETO NUMBER FOR DISTRIBUTED AGREEMENT PROBLEMS

I R I S A P U B L I C A T I O N I N T E R N E THE NOTION OF VETO NUMBER FOR DISTRIBUTED AGREEMENT PROBLEMS I R I P U B L I C A T I O N I N T E R N E N o 1599 S INSTITUT DE RECHERCHE EN INFORMATIQUE ET SYSTÈMES ALÉATOIRES A THE NOTION OF VETO NUMBER FOR DISTRIBUTED AGREEMENT PROBLEMS ROY FRIEDMAN, ACHOUR MOSTEFAOUI,

More information

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms Distributed Systems Principles and Paradigms Chapter 6 (version April 7, 28) Maarten van Steen Vrije Universiteit Amsterdam, Faculty of Science Dept. Mathematics and Computer Science Room R4.2. Tel: (2)

More information

Byzantine Agreement. Gábor Mészáros. CEU Budapest, Hungary

Byzantine Agreement. Gábor Mészáros. CEU Budapest, Hungary CEU Budapest, Hungary 1453 AD, Byzantium Distibuted Systems Communication System Model Distibuted Systems Communication System Model G = (V, E) simple graph Distibuted Systems Communication System Model

More information

A Connectivity Model for Agreement in Dynamic Systems

A Connectivity Model for Agreement in Dynamic Systems A Connectivity Model for Agreement in Dynamic Systems Carlos Gómez-Calzado 1, Arnaud Casteigts 2, Alberto Lafuente 1, and Mikel Larrea 1 1 University of the Basque Country UPV/EHU, Spain {carlos.gomez,

More information

Round-by-Round Fault Detectors: Unifying Synchrony and Asynchrony. Eli Gafni. Computer Science Department U.S.A.

Round-by-Round Fault Detectors: Unifying Synchrony and Asynchrony. Eli Gafni. Computer Science Department U.S.A. Round-by-Round Fault Detectors: Unifying Synchrony and Asynchrony (Extended Abstract) Eli Gafni (eli@cs.ucla.edu) Computer Science Department University of California, Los Angeles Los Angeles, CA 90024

More information

Variations on Itai-Rodeh Leader Election for Anonymous Rings and their Analysis in PRISM

Variations on Itai-Rodeh Leader Election for Anonymous Rings and their Analysis in PRISM Variations on Itai-Rodeh Leader Election for Anonymous Rings and their Analysis in PRISM Wan Fokkink (Vrije Universiteit, Section Theoretical Computer Science CWI, Embedded Systems Group Amsterdam, The

More information

On the weakest failure detector ever

On the weakest failure detector ever Distrib. Comput. (2009) 21:353 366 DOI 10.1007/s00446-009-0079-3 On the weakest failure detector ever Rachid Guerraoui Maurice Herlihy Petr Kuznetsov Nancy Lynch Calvin Newport Received: 24 August 2007

More information

C 1. Recap: Finger Table. CSE 486/586 Distributed Systems Consensus. One Reason: Impossibility of Consensus. Let s Consider This

C 1. Recap: Finger Table. CSE 486/586 Distributed Systems Consensus. One Reason: Impossibility of Consensus. Let s Consider This Recap: Finger Table Finding a using fingers Distributed Systems onsensus Steve Ko omputer Sciences and Engineering University at Buffalo N102 86 + 2 4 N86 20 + 2 6 N20 2 Let s onsider This

More information

Decentralized Control of Discrete Event Systems with Bounded or Unbounded Delay Communication 1

Decentralized Control of Discrete Event Systems with Bounded or Unbounded Delay Communication 1 Decentralized Control of Discrete Event Systems with Bounded or Unbounded Delay Communication 1 Stavros Tripakis 2 VERIMAG Technical Report TR-2004-26 November 2004 Abstract We introduce problems of decentralized

More information

Distributed Systems Principles and Paradigms. Chapter 06: Synchronization

Distributed Systems Principles and Paradigms. Chapter 06: Synchronization Distributed Systems Principles and Paradigms Maarten van Steen VU Amsterdam, Dept. Computer Science Room R4.20, steen@cs.vu.nl Chapter 06: Synchronization Version: November 16, 2009 2 / 39 Contents Chapter

More information

Optimal clock synchronization revisited: Upper and lower bounds in real-time systems

Optimal clock synchronization revisited: Upper and lower bounds in real-time systems Research Report 71/006, Technische Universität Wien, Institut für Technische Informatik, 006 July 5, 006 Optimal clock synchronization revisited: Upper and lower bounds in real-time systems Heinrich Moser

More information