Resolving Message Complexity of Byzantine Agreement and Beyond


Zvi Galil*   Alain Mayer†   Moti Yung‡
(extended summary)

Abstract

Byzantine Agreement among processors is a basic primitive in distributed computing. It comes in a number of basic fault models: "Crash", "Omission" and "Malicious" adversarial behaviors. The message complexity of the primitive has been known for the strong failure models of Malicious and Omission adversaries since the early 80's, while the question for the more benign Crash failure model has remained open. In this paper we show how to solve agreement in the presence of crash failures using $O(n)$ messages, which is optimal, thus settling a thirteen-year-old open problem. Our solution has almost linear time, and our new algorithmic techniques have further implications:

- A family of "early stopping" agreement protocols with improved message complexity.
- A new solution to "Checkpoint", yielding a substantial improvement of the protocol for distributed work performance under adaptive parallelism in a network of workstations.

* Columbia University and Tel-Aviv University. galil@cs.columbia.edu
† Dept. of Computer Science, Columbia University, New York, NY 10027, USA. mayer@cs.columbia.edu. Part of this work was done while the author was visiting the IBM T. J. Watson Research Center. Partially supported by NSF grant CCR and CISE Institutional Infrastructure Grant CDA.
‡ IBM Research Division, T. J. Watson Research Center, Box 704, Yorktown Heights, NY. moti@watson.ibm.com

1 Introduction

The Byzantine Agreement (BA) problem is concerned with a network of $n$ processes consisting of a distinguished process, the sender, and $n-1$ receivers. The sender has an initial value which it wishes to broadcast to the receivers (we assume synchrony and message-passing communication). The complication is that some of the processes (possibly including the sender) may be faulty under some fault model (we essentially follow the notation of [HT93]).
The BA problem is to design a protocol, i.e., an algorithm for each process, which ensures the following three conditions in the presence of up to $t$ faulty processes:

Termination: Every correct process eventually chooses a decision value.
Agreement: No two correct processes choose different decision values.
Validity: If the sender is correct, then all correct processes choose the sender's initial value.

This paper is on crash failures, which are defined as follows: a faulty process stops prematurely; once it has stopped, it sends no more messages. Typically $t = n-1$ in the case of crash failures.

History: Algorithms for BA under crash failures have been studied as early as 1982 by Lamport and Fischer [LF82] (see also [F83]), who show a solution which requires $O(n^3)$ messages. In 1984, Bracha [B84] showed a nonconstructive way (namely, the existence of a certain partition) to achieve a message complexity of $O(n\sqrt{n})$. In 1992, Dwork, Halpern, and Waarts [DHW92] introduced a constructive solution with sub-quadratic $O(n \log n)$ messages, but with exponential running time (see Figure 1 for a historical overview). The message complexity of the problem has remained open. This is in contrast with stronger failure models, where upper bounds match the lower bounds. The latter were obtained in [DR85] by exploiting the power of the stronger (omission and malicious) adversaries. In the simple case of crash failures, only the trivial lower bound of $\Omega(n)$ messages has been known. It holds even in the failure-free case. Exact upper and lower bounds for the failure-free scenario were investigated in [AWH92, HH93].

A special variant of BA is early-stopping BA (EBA). It requires the running time to be $O(f+1)$, where $f$ is the actual number of failures in a run. (The time must be at least $f+1$, as was shown in [LF82, H83, DRS90, DM90].) Lamport and Fischer [LF82] showed how to solve EBA with $O((f+1)n^2)$ messages. In 1990, Chandra and Toueg [CT90] solved EBA with an improved message complexity of $O((f+1)n)$. Here we improve the message complexity of EBA. See Figure 2 for a historical overview of EBA.

Relevance and applications: The model of crash failures is assumed in numerous distributed systems (see for example [CM84, BJ87, MB+94]), and variations of (E)BA are used as general building blocks for reliable distributed computing; see for example [HT93] (footnote 1). Various distributed operating systems ([HT93], page 131: ISIS, Amoeba, Psynch, Delta-4, Transis) use atomic or FIFO broadcast, while relying, at some layer, on synchrony. Note that in asynchrony this problem is unsolvable [FLP85]. Another important application of our results, work performance in a network of workstations [ACP94, CFGK95, DMY94], is discussed further below. The problem of work performance in the presence of crash faults was first formulated by Dwork, Halpern and Waarts in [DHW92].
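The three BA conditions stated in the introduction (Termination, Agreement, Validity) can be checked mechanically on the outcome of a single finished run. The following is a minimal sketch, not part of the protocol itself; the function name `check_ba` and the representation of a run as a dictionary of decisions are assumptions made for illustration.

```python
from typing import Dict, Hashable, Optional, Set


def check_ba(decisions: Dict[int, Optional[Hashable]],
             correct: Set[int],
             sender: int,
             sender_value: Hashable) -> Dict[str, bool]:
    """Check the three BA conditions on the outcome of one finished run.

    decisions maps a process ID to its decision value (None = undecided).
    correct is the set of processes that never crashed in this run.
    All names here are hypothetical; the paper does not prescribe them.
    """
    decided = {p: decisions.get(p) for p in correct}

    # Termination: every correct process has chosen some value.
    termination = all(v is not None for v in decided.values())

    # Agreement: correct processes that decided chose at most one value.
    agreement = len({v for v in decided.values() if v is not None}) <= 1

    # Validity: only constrains runs in which the sender is correct.
    validity = (sender not in correct) or all(
        v == sender_value for v in decided.values())

    return {"termination": termination,
            "agreement": agreement,
            "validity": validity}
```

When the sender has crashed, Validity holds vacuously, so any common value (for example a default value) is an acceptable outcome as long as Termination and Agreement hold.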
Footnote 1: Hadzilacos and Toueg [HT93] use the name "Terminating Reliable Broadcast" for BA.

1.1 Our approach and techniques

Our basic building block is the Checkpoint problem (CP). De Prisco, Mayer and Yung [DMY94] formalized this problem as follows: Let START be the set of live processes at the beginning of the execution and let END be the set of live processes when the protocol terminates. Then correctness of the result-set $E$ is defined as follows:

Agreement: $E$ must be the same at each process.
Safety: $E \subseteq START$.
Progress: $END \subseteq E$.

In [DMY94] an early-stopping solution with $O((f+1)n)$ messages was presented.

All previous message-efficient solutions to BA or EBA are based on the rotating coordinator (or phase-king) approach: there is a predefined coordinator process in every round, and each message sent either originates at or is destined to the coordinator. To obtain our result we depart from this technique and develop ideas in the following directions:

(i) We employ a "rotating tree", which we call a diffusion-tree. That is, in every phase (a collection of a predefined number of rounds), there is a predefined tree which spans all live processes. In a phase, messages are sent only along the edges of the diffusion-tree.
(ii) We solve agreement via a reduction to checkpoint (typically, the reduction in the other direction was employed).
(iii) We develop a recursive checkpointing approach as follows: in order to achieve BA or CP via a diffusion-tree of height $h$ and degree $d$, we use the induction hypothesis of having available a CP for a diffusion-tree of height $h-1$ and degree $d$, i.e., for the internal nodes of the original diffusion-tree.
(iv) Our techniques have an early-stopping flavor, i.e., processes exit the protocol at different times, and hence we need to introduce a time-management subprotocol. This is somewhat counterintuitive, as early-stopping protocols typically employ flooding of messages.
In the previous rotating coordinator paradigm, time was easy to manage, as the round in which a process was required to become a coordinator was essentially determined by its unique ID. In our case, there are repeated calls at different recursive levels. Processes need to have consistent views of when to "time out" and become the new root of the diffusion-tree.
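The Checkpoint correctness conditions defined in this section (Agreement, Safety, Progress) translate directly into set comparisons. The sketch below checks them on the outcome of one run; the function name `check_checkpoint` and the input encoding are hypothetical and serve only as an illustration.

```python
from typing import Dict, FrozenSet, Set


def check_checkpoint(results: Dict[int, FrozenSet[int]],
                     start: Set[int],
                     end: Set[int]) -> Dict[str, bool]:
    """Check the CP conditions for one finished run.

    results maps every process that is alive at termination (the set END)
    to the result-set E it computed. start is the set START of processes
    alive at the beginning of the execution.
    """
    views = {results[p] for p in end}

    # Agreement: E is the same at each process.
    agreement = len(views) <= 1

    # Safety: E is a subset of START.
    safety = all(e <= start for e in views)

    # Progress: END is a subset of E.
    progress = all(end <= e for e in views)

    return {"agreement": agreement, "safety": safety, "progress": progress}
```

Note that a process which crashed during the run may or may not appear in $E$; both outcomes satisfy the conditions, which is what leaves the protocol room to stop early.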

            LF82          B84                DHW92            This paper
messages    $O(n^3)$      $O(n\sqrt{n})$     $O(n \log n)$    $O(n)$
time        $O(n)$        nonconstr.         $O(2^n)$         $O(n^{1+\varepsilon})$

Figure 1: Comparison of BA under crash failures

            LF82              CT90            This paper
messages    $O((f+1)n^2)$     $O((f+1)n)$     $O(n + fn^{\varepsilon})$

Figure 2: Comparison of EBA under crash failures

1.2 Our results on agreement

Let a diffusion-tree be a minimum-height tree spanning all $n$ processors with the sender as its root. The height $h$ and degree $d$ of the diffusion-tree satisfy $h = O(\log_d n)$ and $d = \Theta(n^{1/h})$. Our main result is the design of a family of BA protocols with the following characteristics:

Lemma 1. For any diffusion-tree with height $h$ and degree $d$ such that $d \ge 9$, our algorithm solves BA in the presence of $f$ crash failures in time $O((f+1)8^h)$ and with $O(fd + n)$ messages.

This is employed to give the result on agreement with optimal message complexity. By choosing a large constant $d$ in Lemma 1 we get the following theorem: with $d$ constant, $h = O(\log_d n)$ makes $8^h$ at most $n^{\varepsilon}$ once $d$ is large enough, and $O(fd + n) = O(n)$ because $f < n$.

Theorem 1. For any $\varepsilon > 0$, BA can be solved with $O(n)$ messages and time $O(n^{1+\varepsilon})$.

We also improve the known message complexity of EBA. By setting $d = n^{\varepsilon}$ and $h = O(1/\varepsilon)$ in Lemma 1 we get:

Theorem 2. EBA can be solved with running time $O((f+1)8^{1/\varepsilon})$ and with $O(n + fn^{\varepsilon})$ messages, for any $\varepsilon > 0$.

Our solutions use messages of size $O(n)$. Regarding bit complexity, a simple coding of a "long" message into time-slots (under exponential slowdown in time) shows that, at least theoretically, $O(n)$ bit complexity is also achievable, which is optimal as well. Another such transformation, which breaks the message into mini-messages of size $O(\log n)$, gives a polynomial-time protocol with subquadratic total bits.

1.3 Parallel Work on Networks of Workstations

Our main result also holds for performing a checkpoint:

Theorem 3. CP can be solved with running time $O((f+1)8^{1/\varepsilon})$ and with $O(n + fn^{\varepsilon})$ messages, for any $\varepsilon > 0$.

Performing parallel applications on a network of workstations (NOW) is an area of growing interest, e.g., [ACP94, CFGK95], especially with the advent of low-latency architectures; see, e.g., [EABB94].
In this context a checkpoint is a basic building block for performing $m$ independent work units in parallel on a network of $n$ unreliable workstations (WP). (This models adaptive parallelism, where stations are taken away from the global parallel application to be used by the local user.) This problem was first introduced in [DHW92]. Let $S$ denote the parallel processor step, a measure of parallel time for this problem, introduced by Kanellakis and Shvartsman [KS89]. Using Theorem 3 in the framework of [DMY94], we get:

Theorem 4. WP can be solved in time $S = O(m + (f+1)n)$ and number of messages $M = O(fn^{\varepsilon} + \min(f+1, \log n)\,n)$.

We note that $S = O(m + (f+1)n)$ is optimal for WP [DMY94]; thus we improve on their number of messages ($M = O((f+1)n)$) without losing time optimality.

Organization: The next two sections introduce our algorithms for BA and CP by solving the special case of a height-2 diffusion-tree, demonstrating some of the algorithmic techniques employed (rotating tree and basic time management). Section 4 then presents the general case with its further techniques and subtleties (involved time management and full usage of the recursive checkpointing).

2 BA via a diffusion-tree of height 2

In this section we show some of our ideas by presenting a solution to (E)BA which uses a diffusion-tree of height 2 and consequently of degree $\Theta(\sqrt{n})$. We show how to get a solution to (E)BA with running time $O(f+1)$ and message complexity $O(n + f\sqrt{n})$.

Initially, the diffusion-tree is of height 2. The process on level 0 is the sender. Processes on levels 0 and 1 are referred to as coordinators; see Figure 3. The tree is organized such that a breadth-first search yields processes with monotonically increasing IDs. If all coordinators were nonfaulty, the sender could simply send its value to the coordinators on level 1 which, in turn, could forward the value to the leaves on level 2. In the following we describe how we proceed in the general case, i.e., in the presence of failures.

[Figure 3: Diffusion-tree of height 2. Level 0: sender and coordinator; level 1: coordinators; level 2: leaves.]

Let $L$ denote the set of leaves which have not yet received any value.
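The layout of the height-2 diffusion-tree described above (sender at level 0, about $\sqrt{n}$ coordinators at level 1, leaves at level 2, IDs increasing in breadth-first order) can be sketched as follows. The function names and the exact choice of $\lfloor\sqrt{n}\rfloor$ level-1 coordinators are assumptions for illustration; the text only requires degree $\Theta(\sqrt{n})$.

```python
import math
from collections import deque
from typing import Dict, List


def build_height2_tree(ids: List[int]) -> Dict[int, List[int]]:
    """Return the children lists of a height-2 tree over the given IDs.

    The smallest ID is the root (sender), the next k IDs are the level-1
    coordinators, and the remaining IDs are leaves handed out in
    contiguous blocks, so that a breadth-first traversal visits the IDs
    in increasing order. k = floor(sqrt(n)) is an illustrative choice.
    """
    ids = sorted(ids)
    children: Dict[int, List[int]] = {p: [] for p in ids}
    if len(ids) <= 1:
        return children

    root, rest = ids[0], ids[1:]
    k = min(max(1, math.isqrt(len(ids))), len(rest))
    coordinators, leaves = rest[:k], rest[k:]
    children[root] = list(coordinators)

    per_coordinator = math.ceil(len(leaves) / k)
    for i, c in enumerate(coordinators):
        children[c] = leaves[i * per_coordinator:(i + 1) * per_coordinator]
    return children


def bfs_order(children: Dict[int, List[int]], root: int) -> List[int]:
    """Breadth-first traversal order starting at the root."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children[node])
    return order
```

For ten processes with IDs 0 to 9, this gives the root 0 with coordinators 1, 2, 3, each responsible for two leaves, and `bfs_order` returns the IDs in increasing order.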
We propose to iterate the following five phases until $L$ is empty:

Phase (1): The sender diffuses its value to the coordinators on level 1 which, in turn, forward the value to the leaves on level 2. At this point, each process whose path in the tree to the sender consists of live processes knows the value of the sender.

Phase (2): A first checkpoint among the coordinators, i.e., the level-0 and level-1 processes, is executed, using a generalized checkpoint of [DMY94]: during the run of the checkpoint, each live process (which is a member of the result-set $E_1$) supplies the value it received in Phase 1, or some default value if it did not receive any message. The checkpoint can then produce an additional output value $v$ besides the set $E_1$. The value $v$ is the sender's value if it was received by at least one live level-1 coordinator in Phase 1, and the default value otherwise.

Phase (3): Those coordinators on level 1 which have not received a message from the sender in the first phase (in the case of the sender failing) now diffuse the value $v$ to their descendants (leaves).

Phase (4): A second checkpoint among the coordinators is executed, which yields the result-set $E_2$. Knowing $E_1$ and $E_2$, each process can now compute which coordinators have failed and hence which leaves have not received any message so far. Let $L$ denote the set of these leaves.

Phase (5): If the size of $L$ is larger than $\sqrt{n}$, then the next iteration again uses a diffusion-tree of height 2. If $L$ is nonempty but smaller than $\sqrt{n}$, then a diffusion-tree of height 1 is sufficient, since its degree is bounded by $\sqrt{n}$. Note that using a diffusion-tree of height 1 is essentially equivalent to using the standard rotating coordinator technique of [CT90]. If $L$ is empty, then each process knows the value $v$ (which was computed in Phase 2) and hence the diffusion is complete.

This is the basic iteration; however, we need to solve a few crucial issues. The first is the issue of "synchronizing" the processors taking part in an eventual sub-protocol, and the second is the issue of "maintenance" of the tree structure and recruiting coordinators or the sender to their "right location" at the "right time".

The checkpoint sub-protocol is only eventual; thus processes potentially exit it at different times. After the first checkpoint (which we call the junior checkpoint), the nodes on level 1 can diffuse the value if they have not done so yet. This fringe diffusion (Phase 3) can be executed at different times. The start of the checkpoint in Phase 4 (the senior checkpoint) is triggered by the current sender, and is hence independent of the times at which processes left the previous junior checkpoint. Similarly, the next iteration of the agreement (after Phase 5) starts with a diffusion originating at the current root.

As is evident from the above discussion, processes need to become active senders if all processes with lower IDs have failed. However, the checkpoints used above are only eventual. This raises the issue of how to set these "time-outs" at the end of a checkpoint for the next checkpoint. We approach this issue by having processes compute a "virtual" start-time of the current checkpoint. Basically, if the current sender knows the start-time, it will broadcast it as well and processes will agree on it. If the sender does not know the start-time (because it became sender as a result of a time-out), then it will set the current time as the virtual start-time. In addition, a process knows its "distance" to the sender according to the previous checkpoint, i.e., the number of live processes having IDs which are in between the sender's ID and the process's ID. So whenever a process exits the current checkpoint, it computes its next time-out with respect to this start-time and the distance to the sender.

We diffuse three messages: the value-message, the "commit"-message, and finally the "goodbye"-message. After receiving a "commit"-message, a process can use the value. Each diffusion follows the strategy above. The reason for this approach is to avoid costly "wake-up"-messages to a potential sender: whenever a process times out, it assumes the role of the sender for the coming iterations. If it was a coordinator before, it remembers the current progress to determine which message (value, commit, or goodbye) to diffuse and how many leaves are still uninformed. If it is a leaf, then it will decide which value to diffuse based on the type of messages it has received so far. At the end of an iteration, it is the task of a sender to recruit leaves to "serve" as coordinators to complete the diffusion-tree after failures.

2.1 Correctness proof and complexity analysis

In the following we outline a proof that the algorithm (1) is a correct protocol for BA and (2) performs within the time (number of rounds) and number of messages claimed above. In the correctness part, we first need to show that the current structure of the diffusion-tree is well-defined, namely, that all coordinators have the same view of the tree. Once this is accomplished, agreement and validity are easily shown by carefully examining the state (knowledge) of each process after each iteration. The running time analysis follows from the fact that the checkpoint of [DMY94] is early-stopping. The crucial step for the message analysis is to show that if all coordinators (internal nodes) fail in one iteration (and hence all knowledge about which leaves have been informed is lost), the number of messages lost is still small (i.e., $O(\sqrt{n})$) when amortized per failure.
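As a concrete illustration of Phases (4) and (5) of the iteration in Section 2, the sketch below computes the set $L$ of uninformed leaves and the height of the next diffusion-tree. It rests on a simplifying assumption that is not in the text: a failed level-1 coordinator is treated as having crashed before forwarding anything, so all of its leaves count as uninformed. The function names are hypothetical, and the boundary case $|L| = \sqrt{n}$ is assigned to height 1, following the "at most $\sqrt{n}$" wording used for CP in Section 3.

```python
import math
from typing import Dict, List, Set


def uninformed_leaves(children: Dict[int, List[int]],
                      root: int,
                      dead_coordinators: Set[int]) -> Set[int]:
    """Leaves whose level-1 coordinator is dead (simplified model).

    Leaves below a live level-1 coordinator are informed either in
    Phase 1 or, if the sender failed, by the fringe diffusion of Phase 3.
    """
    lost: Set[int] = set()
    for coordinator in children.get(root, []):
        if coordinator in dead_coordinators:
            lost.update(children.get(coordinator, []))
    return lost


def next_tree_height(num_uninformed: int, n: int) -> int:
    """Phase (5): 0 means the diffusion is complete."""
    if num_uninformed == 0:
        return 0
    return 2 if num_uninformed > math.sqrt(n) else 1
```

With the tree `{0: [1, 2, 3], 1: [4, 5], 2: [6, 7], 3: [8, 9]}` and coordinator 2 dead, the uninformed leaves are 6 and 7, and since $2 \le \sqrt{10}$ the next iteration can use a tree of height 1.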
In the following, let an epoch denote the maximum over: (1) the time of a failure-free execution of a checkpoint, (2) the time by which a failure can extend a checkpoint, and (3) the time of a diffusion. Obviously, an epoch consists of a constant (independent of $n$) number of rounds. We start by showing that a failure of an arbitrary process at an arbitrary point in time of the algorithm can delay the algorithm by at most a constant number of epochs (denoted by $C_h$). The reason for this is that a failure of a coordinator must show up in the result of the next checkpoint; the proof is essentially a case analysis.

Lemma 2. Any failure can delay the algorithm for at most 8 epochs.

From the above lemma it follows that we should let $C_h = 8$ epochs. Next we show that there is at most one root at any point in time. The proof involves showing that if the current root has failed, at most one process is timing out, where time-outs are in units of $C_h$, the value obtained in Lemma 2.

Lemma 3. There is at most one live level-0 coordinator (root) at any time, and it is the lowest live ID.

Now we can show that the diffusion-tree is always well-defined, i.e., each coordinator has the same view of it. By Lemma 3 there is always a well-defined root, and hence the checkpoints run correctly. This gives an inductive proof of the following lemma:

Lemma 4. Every coordinator has a consistent view of the diffusion-tree at any time, and the following invariant is maintained: if $j$ is a coordinator at time $t$, then all processes $k$ with $k < j$ are either coordinators at the same or a higher level in the tree, or dead.

Given a consistent view of the diffusion-tree, the agreement, validity, and early-stopping properties are proven by induction over the iterations:

Lemma 5. The algorithm achieves agreement.

Lemma 6. The algorithm achieves validity.

Lemma 7. Each process receives a "commit"-message after $O(f+1)$ rounds.

Finally, we analyze the message performance. The main focus is on showing that if at some point in time all $O(\sqrt{n})$ coordinators fail, the number of messages which need to be resent is bounded by $O(\sqrt{n})$ amortized per failure:

Lemma 8. The total number of messages sent is $O(fn^{1/2} + n)$.

3 CP via a diffusion-tree of height 2

In this section we show how to extend the ideas of the previous section to obtain a protocol for CP with the same asymptotic performance. We start with the same diffusion-tree.
In the first phase, the root diffuses a "query"-message along the tree. This phase is mainly used to obtain a synchronization effect. In the second phase, an incast, starting at the leaves, is performed along the tree. The purpose of the incast is to collect information about the set of live processes. Then, in the third phase, a first checkpoint among the internal processes is performed. After this checkpoint all internal nodes agree on the set of live internal nodes. Those live internal nodes which did not receive the "query"-message, and hence did not participate in the incast, can now send a "query"-message to their descendants (leaves), and the leaves answer by the incast. In the fourth phase another checkpoint among the internal nodes is executed. Again, this checkpoint yields the set of live internal nodes. From this, (1) the set of leaves unheard of and (2) the set of leaves which have live internal nodes as ancestors, and hence whose status (live/dead) is known, are determined. Based on the size of set (1), the next iteration will either again use a diffusion-tree of height 2 or a tree of height 1. The latter is used whenever the number of unheard leaves is at most $\sqrt{n}$.

If there are no more unheard leaves left, all internal nodes agree on the set of live nodes (denoted $E$ in the introduction), and hence the algorithm can proceed to the diffusion phase, which is essentially the same as an agreement protocol. The only difference is that after the original value of $E$ is lost (through failures), the next root cannot simply use a default value, but needs to restart an incast iteration to obtain a new set $E$. All techniques and remarks of the last section apply here too.

4 BA and CP via a diffusion-tree of height h

In this section we show how to use a diffusion-tree of arbitrary height $h$ ($1 \le h \le \log n$) and consequently degree $\Theta(n^{1/h})$ for either BA or CP. We use a recursive approach in the sense that we assume having available BA- and CP-protocols for diffusion-trees of heights $h-p$ ($1 \le p < h$). We denote a checkpoint among nodes on levels $0, 1, \ldots, l$ by $CP_l$.

Let us first consider BA. The algorithm again consists of phases. In the first phase the root of the diffusion-tree diffuses its value along the tree. In the second phase, a first checkpoint $CP_{h-1}$ among the internal processes (using the induction hypothesis) is executed. After this checkpoint, internal nodes on level $h-1$ which might not have received a message from the root in the first phase (in case of a node failing on the path from the root to this node) can now diffuse this value to their descendants (which are leaves of the diffusion-tree). After that, in the third phase, a second internal checkpoint $CP_{h-1}$ is executed. As a result of this second checkpoint, the set of leaves which are yet to be informed (the outstanding work) can be computed by each internal process. Based on the size of this set, the height of the tree in the next iteration is determined: if the outstanding work is between $d^{h-(p+1)}$ and $d^{h-p}$ (for $0 \le p \le h-1$), then the next diffusion-tree has height $h-p$ and the algorithm continues to use a BA-protocol for that tree.

The checkpoint $CP_{h-1}$ is given by the induction hypothesis. As in the $h = 2$ case, this protocol is only eventual. The same issues about managing time arise as previously. The recruiting at the end of an iteration is done as follows: the decision on the height of the next stage is a result of $CP_{h-1}$. However, the information (i.e., the set $E_{h-1}$) provided by $CP_{h-1}$ might be outdated for the actual recruiting of new coordinators. The reason for this is that, as part of $CP_{h-1}$, other recursive instances $CP_{h-p}$ for $2 \le p \le h-1$ are also active, and their recruiting might not be reflected in $E_{h-1}$. But note that the recruiting can be done as part of the first diffusion of the next iteration.
This diffusion always starts at the current root and hence can be made consistent among all levels. This holds since the root at which the diffusion originates has also been part of each recursively called checkpoint $CP_{h-p}$ and hence holds the freshest knowledge; i.e., it can take into account all previous recruitings by those checkpoints. In Figure 4 we consider BA and in Figure 5 CP, both using a diffusion-tree of height $h$.

4.1 Correctness proof and analysis

The following subtleties are new to the proof compared to the previous sections:

(i) A process which fails on level $l$ can potentially be recruited by an instance of $CP_{l-2}$, then by $CP_{l-3}$, $CP_{l-4}$, and so forth. However, a careful analysis will reveal that the total number of times the algorithm recruits a dead process out of a set of $k$ processes is $O(k)$. This is important for the management of time-outs and the message analysis.
(ii) We need an analysis of the recursive structure to give a condition on $d$ which ensures that the message performance is $O(n)$ for a failure-free execution, regardless of the actual value of $h$. We will show that $d > 9$ is sufficient.
(iii) The running time is obviously a function of $h$ and needs to be analyzed.

In the following we sketch the correctness proof and the analysis for CP (which essentially implies BA). We first show that the total number of times the algorithm recruits a dead process out of a set of $n$ processes is $O(n)$. Hence, on average, a process can "negatively influence" the algorithm only a constant number of times. A process negatively influences the algorithm by not sending the messages it is supposed to send (since it is dead). We now consider an adversary which wishes to maximize the total number of times dead processes negatively influence the algorithm. Let $A_d(n, t)$ denote the total number of times our adversary succeeds in having up to $t$ dead processes negatively influence the algorithm running on a tree of degree $d$ with $n$ processes (and hence height $O(\log_d n)$).
We now show that $A_d(n, n) = O(n)$. First we show that a collection of smaller trees does not allow as good a strategy as one big tree $T$ with the same total number of nodes:

Lemma 9. $A_d(n, n) > \sum_i A_d(c_i n, c_i n)$, where $0 < c_i \le 1$, $\sum_i c_i = 1$ and $\exists i : c_i < 1/d$.

Corollary 1. $A_d(n, n) \ge \sum_i A_d(c_i n, c_i n)$, where $0 < c_i \le 1$ and $\sum_i c_i = 1$.

The next lemma shows that if the adversary can only fail a fraction of the processes, it might just as well fail the corresponding top of the tree:

Lemma 10. $A_d(n, \alpha n) = A_d(\alpha n, \alpha n)$, where $0 < \alpha \le 1$.

We are now ready to prove our intermediate goal. The way we achieve it is as follows: we define a greedy strategy recursively as applying it to the $n/d$ internal nodes and then recruiting another $n/d$ nodes from the leaves. This process can be repeated $d$ times and can be proven to yield $O(n)$ "negative influences". We then show that no strategy can do better than the greedy one, using the last two lemmas.

Lemma 11. The total number of times the algorithm relies on a dead process from a subset of $k$ processes is bounded by $O(k)$.

Remember that $C_h$ denotes the unit by which time-outs operate. In order to determine $C_h$ we also need to analyze the running time of $CP_h$ in the failure-free case. The running time can be described by the recurrence $T(h) = 5h + 8T(h-1) + \Theta(1)$, as the solution for $CP_h$ is made up of the "query"-diffusion, the incast, two instances of $CP_{h-1}$ to collect the set $E_h$, and then a value-, a "commit"-, and a "goodbye"-diffusion, each with two instances of $CP_{h-1}$. The recurrence solves to $T(h) = O(8^h)$:

Lemma 12. In algorithm $CP_h$, in the failure-free case, each process receives a "goodbye"-message after $C \cdot 8^h$ rounds, for some constant $C$.

Hence, by Lemmas 11 and 12, it is safe to set $C_h = C \cdot 8^h$. But note that this is a very loose analysis which is expected to be easily tightened. We now return to the sequence of lemmas as used in Section 2. Time-outs use the value $C_h$ obtained by Lemmas 11 and 12 as units, and hence we can prove the following lemma, which corresponds to Lemma 3, in essentially the same way (but using inductive hypotheses rather than our induction basis):

Lemma 13. There is at most one live level-0 coordinator (root) at any time, and it is the lowest live ID.
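The role of time-outs in Lemma 13 can be illustrated with a small schedule computation. Section 2 states only that a process computes its time-out from the virtual start-time and its distance to the sender, in units of $C_h$; the concrete rule used below, start-time plus (distance + 1) units, is an assumption made purely for illustration, as are the function names.

```python
from typing import Dict, Iterable, Optional


def timeout_schedule(live_ids: Iterable[int],
                     sender: int,
                     virtual_start: int,
                     unit: int) -> Dict[int, int]:
    """Map each live process above the sender to its time-out.

    distance = number of live processes with IDs strictly between the
    sender's ID and the process's ID. The rule
    virtual_start + (distance + 1) * unit is a hypothetical reading of
    the text, with unit standing for C_h.
    """
    successors = sorted(p for p in live_ids if p > sender)
    return {p: virtual_start + (distance + 1) * unit
            for distance, p in enumerate(successors)}


def next_root(schedule: Dict[int, int]) -> Optional[int]:
    """The process whose time-out fires first, if the sender stays silent."""
    return min(schedule, key=schedule.get) if schedule else None
```

Under this rule the time-outs of distinct processes are at least one unit apart and fire in ID order, so if the current root is silent, the lowest live ID is the first (and, within one unit, the only) process to time out and take over.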
Using the fact that each recruiting decision originates at the current root (done as part of the value- or "query"-diffusion) and that the difference in knowledge between a coordinator on level $l$ and one on level $l+1$ is only with respect to processes up to level $l$, we can prove the equivalent of Lemma 4:

Lemma 14. Every coordinator has a consistent view of the diffusion-tree at any time, and the following invariant is maintained: if $j$ is a coordinator at time $t$, then all processes $k$ with $k < j$ are either coordinators at the same or a higher level in the tree, or dead.

The lemmas corresponding to Lemmas 5 and 6 hold with essentially unchanged proofs, and the following is an easy consequence of Lemma 12:

Lemma 15. Each process receives a "commit"-message after $O((f+1)8^h)$ rounds.

Finally, we prove the message complexity. The proof of the following lemma consists of the following steps. First, we derive a condition for the degree $d$ such that the number of messages in the failure-free case remains linear. The recurrence describing the number of messages in this case is $M(n) = 5n + 8M(n - d^h)$, where the numbers 5 and 8 arise for the same reason as in the time analysis (see above). From this we derive the condition $d > 9$. Secondly, we need to show that if the tree remains at height $h$ for a next iteration, the number of failures among coordinators on level $h-1$ amortizes the cost of sending messages among internal nodes. Note that this simple analysis is possible since (1) by the structure of our protocol, failures on smaller levels will be amortized via an instance of a checkpoint on a smaller level, and (2) by Lemma 11, a process can influence (on average) only a constant number of levels. Finally, we show that the number of messages lost if all internal nodes fail is amortized over the number of simultaneous failures:

Lemma 16. The total number of messages sent is $O(fn^{1/h} + n)$.

From the above lemmas we get our main result (Lemma 1).
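The two failure-free recurrences used in this section, $T(h) = 5h + 8T(h-1) + \Theta(1)$ and $M(n) = 5n + 8M(n - d^h)$, can be checked numerically. The sketch below is only a sanity check under stated assumptions: the $\Theta(1)$ term and the base case of $T$ are taken to be 1, $M$ is evaluated on complete $d$-ary trees (so that $n - d^h$ is exactly the number of internal nodes), and $M$ of a single node is taken to be 0. The function names are hypothetical.

```python
def failure_free_time(h: int, c: int = 1) -> int:
    """T(h) = 5h + 8*T(h-1) + c, with T(0) = c (assumed base case)."""
    t = c
    for level in range(1, h + 1):
        t = 5 * level + 8 * t + c
    return t


def tree_size(d: int, h: int) -> int:
    """Number of nodes of a complete d-ary tree of height h (d >= 2)."""
    return (d ** (h + 1) - 1) // (d - 1)


def failure_free_messages(d: int, h: int) -> int:
    """M(n) = 5n + 8*M(n - d^h) on a complete d-ary tree of height h.

    Removing the d^h leaves leaves the complete d-ary tree of height
    h-1, so the recursion is on h. M of a single node is taken as 0.
    """
    if h == 0:
        return 0
    return 5 * tree_size(d, h) + 8 * failure_free_messages(d, h - 1)


if __name__ == "__main__":
    for h in range(1, 8):
        n8, n9 = tree_size(8, h), tree_size(9, h)
        print(h,
              failure_free_time(h) / 8 ** h,            # stays bounded
              failure_free_messages(8, h) / n8,         # keeps growing
              failure_free_messages(9, h) / n9)         # stays bounded
```

Under these assumptions $T(h)/8^h$ stays below 2, in line with $T(h) = O(8^h)$. The messages-per-node ratio stays bounded for $d = 9$ (the geometric series in $8/d$ converges) but grows roughly like $5h$ for $d = 8$, which shows why the degree has to exceed the branching factor 8 of the recursion for the failure-free message count to remain linear.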

References

[ACP94] T. Anderson, D. Culler, and D. A. Patterson, A Case for NOW (Networks of Workstations). IEEE Micro, 1994 (special issue).
[AWH92] S. Amdur, S. Weber, and V. Hadzilacos, On the Message Complexity of Binary Agreement Under Crash Failures. Distrib. Comput., 5:175-186.
[B84] G. Bracha, unpublished manuscript. Cornell University, July 1984.
[BJ87] K. Birman and T. A. Joseph, Reliable Communication in the Presence of Failures. ACM Transactions on Computer Systems, 5(1):47-76.
[CFGK95] N. Carriero, E. Freeman, D. Gelernter, and D. Kaminsky, Adaptive Parallelism with Piranha. IEEE Computer, Vol. 28, No. 1, Jan. 1995.
[CM84] J. Chang and N. Maxemchuk, Reliable Broadcast Protocols. ACM Transactions on Computer Systems, 2(3):251-273, August.
[CT90] T. D. Chandra and S. Toueg, Time and Message Efficient Reliable Broadcast. Proc. Int. Workshop on Distributed Algorithms 1990, LNCS, Springer-Verlag.
[DMY94] R. De Prisco, A. Mayer, and M. Yung, Time-Optimal Message-Efficient Work Performance in the Presence of Faults. Proc. 13th ACM Symp. on Principles of Distributed Computing (1994).
[DR85] D. Dolev and R. Reischuk, Bounds on Information Exchange in Byzantine Agreement. Journal of the ACM, 32(1):191-204.
[DRS90] D. Dolev, R. Reischuk, and H. R. Strong, Early Stopping in Byzantine Agreement. Journal of the ACM, 37(4):720-741.
[DHW92] C. Dwork, J. Halpern, and O. Waarts, Performing Work Efficiently in the Presence of Faults. Proc. 11th ACM Symp. on Principles of Distributed Computing (1992).
[DM90] C. Dwork and Y. Moses, Knowledge and Common Knowledge in a Byzantine Environment: Crash Failures. Information and Computation, -186 (1990).
[EABB94] T. von Eicken, V. Avula, A. Basu, and V. Buch, Low-Latency Communication over ATM Networks using Active Messages. Proc. Hot Interconnects II (1994).
[F83] M. J. Fischer, The Consensus Problem in Unreliable Distributed Systems. Symp. on Math. Found. of Computer Science, LNCS, Springer-Verlag, 127-139.
[FLP85] M. J. Fischer, N. Lynch, and M. Paterson, Impossibility of Distributed Consensus with One Faulty Processor. Journal of the ACM, 32(2):374-382.
[H83] V. Hadzilacos, A Lower Bound for Byzantine Agreement with Fail-Stop Processors. Harvard University Technical Report TR.
[H90] V. Hadzilacos, On the Relationships Between the Atomic Commitment and Consensus Problem. Fault Tolerance in Distributed Computing, LNCS 448, Springer-Verlag, 201-208.
[HH93] V. Hadzilacos and J. Halpern, Message-Optimal Protocols for Byzantine Agreement. Math. Systems Theory 26 (1993).
[HT93] V. Hadzilacos and S. Toueg, Fault-tolerant Broadcast and Related Problems. In Distributed Systems, 2nd ed., editor: S. Mullender, Addison-Wesley, 97-145 (1993).
[KS89] P. Kanellakis and A. Shvartsman, Efficient Parallel Algorithms Can Be Made Robust. Distrib. Comput., 5:201-219.
[LF82] L. Lamport and M. J. Fischer, Byzantine Generals and Transaction Commit Protocols. SRI Technical Report Op. 62.
[MB+94] D. Malki, K. Birman, A. Ricciardi, and A. Schiper, Uniform Actions in Asynchronous Distributed Systems. Proc. 13th ACM Symp. on Principles of Distributed Computing, 274-283.

The root initiates a complete diffusion on the diffusion-tree. The value used is (i) the original value if the root is the sender, (ii) the value received last if the root is not the sender and received at least one value-message in an earlier iteration, and (iii) a default-value if the root is not the sender and has not received any value-message before.

A checkpoint $CP_{h-1}$ on the $O(n^{(h-1)/h})$ internal nodes in the diffusion-tree is run. This checkpoint yields the following information:
1. Set 1: the level-$(h-1)$ processes which are alive but did not receive a message from their ancestor (and hence could not diffuse the value to their descendants, which are leaves in the tree).
2. Set 2: the internal processes which are alive.
3. Agreement on the value of the sender among internal processes.

The processes in set 1 diffuse the value agreed on above.

A checkpoint on the $O(n^{(h-1)/h})$ internal nodes in the diffusion-tree is run. This checkpoint yields the following information:
1. Set 3: the internal processes which are alive (denoted $E_{h-1}$).

From sets 2 and 3 the set of uninformed leaves can be derived. The size of this set constitutes the amount of work left to be done. If no work remains, the algorithm moves on to diffuse the "commit"-message. Otherwise, the height of the next diffusion-tree depends on the amount of work: if the outstanding work is between $d^{h-(p+1)}$ and $d^{h-p}$ (for $1 \le p \le h-1$), then the next diffusion-tree has height $h-p$ and degree $d = O(n^{1/h})$.

The root may need to recruit leaves to become internal nodes. This is done during the diffusion of the next iteration. The root uses the freshest knowledge to do that (recruiting is also done along the diffusion-tree). Recruiting is done such that a breadth-first traversal yields monotonically increasing IDs.

Figure 4: BA via tree of height h

The root initiates a "query"-diffusion. The leaves (after receiving a "query"-message) initiate an incast to collect information about live processes.

A junior checkpoint $CP_{h-1}$ on the $O(n^{(h-1)/h})$ internal nodes in the diffusion-tree is run. This checkpoint yields the following information:
1. Set 1: the live level-$(h-1)$ coordinators which did not receive the "query"-broadcast.
2. Set 2: the live coordinators.

Coordinators in set 1 do another round of "query" and incast (fringe incast).

A senior checkpoint $CP_{h-1}$ on the $O(n^{(h-1)/h})$ internal nodes in the diffusion-tree is run. This checkpoint yields the following information:
1. Set 3: the live coordinators.

From sets 2 and 3 we can compute the set of leaves which are unheard of, i.e., the remaining work. If there is no more work to do, the algorithm moves on to the diffusion part. Otherwise, the height of the tree in the next iteration depends on the amount of remaining work, in the same fashion as for BA. Leaves which become internal nodes are "recruited" by the current root, such that a breadth-first search yields processes with monotonically increasing IDs.

The diffusion of the value (the set E of live processes) to leaves is the same as in agreement, except that the current root cannot decide on a "default"-value, but has to restart the incast at that point in time.

Figure 5: CP via tree of height h
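The height rule stated in Figure 4 can be illustrated with a small sketch. This is not the authors' code: the function name `next_tree_height`, the treatment of the interval as half-open (more than $d^{h-(p+1)}$ and at most $d^{h-p}$), the return value 0 for "no work left", and keeping the full height $h$ when more than $d^{h-1}$ work is outstanding are all assumptions made here for illustration.

```python
def next_tree_height(work: int, d: int, h: int) -> int:
    """Height of the next diffusion tree for `work` uninformed leaves.

    Assumed reading of the rule in Figure 4: if
    d**(h-(p+1)) < work <= d**(h-p) for some 1 <= p <= h-1,
    the next tree has height h-p. Returns 0 when no work remains
    (hypothetical convention for "move on to the commit diffusion").
    """
    if work <= 0:
        return 0
    # Try the shallowest tree first: p = h-1 gives height 1.
    for p in range(h - 1, 0, -1):
        if work <= d ** (h - p):
            return h - p
    # Assumption: more than d**(h-1) outstanding keeps the full height h.
    return h


if __name__ == "__main__":
    # With degree d = 4 and current height h = 3:
    for w in (0, 4, 5, 16, 17):
        print(w, next_tree_height(w, d=4, h=3))
```

Under these assumptions, less outstanding work yields a shallower tree in the next iteration, which matches the figure's statement that the next height depends on the amount of work left.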

AGREEMENT PROBLEMS (1) Agreement problems arise in many practical applications:

AGREEMENT PROBLEMS (1) Agreement problems arise in many practical applications: AGREEMENT PROBLEMS (1) AGREEMENT PROBLEMS Agreement problems arise in many practical applications: agreement on whether to commit or abort the results of a distributed atomic action (e.g. database transaction)

More information

Simple Bivalency Proofs of the Lower Bounds in Synchronous Consensus Problems

Simple Bivalency Proofs of the Lower Bounds in Synchronous Consensus Problems Simple Bivalency Proofs of the Lower Bounds in Synchronous Consensus Problems Xianbing Wang, Yong-Meng Teo, and Jiannong Cao Singapore-MIT Alliance E4-04-10, 4 Engineering Drive 3, Singapore 117576 Abstract

More information

Early consensus in an asynchronous system with a weak failure detector*

Early consensus in an asynchronous system with a weak failure detector* Distrib. Comput. (1997) 10: 149 157 Early consensus in an asynchronous system with a weak failure detector* André Schiper Ecole Polytechnique Fe dérale, De partement d Informatique, CH-1015 Lausanne, Switzerland

More information

Degradable Agreement in the Presence of. Byzantine Faults. Nitin H. Vaidya. Technical Report #

Degradable Agreement in the Presence of. Byzantine Faults. Nitin H. Vaidya. Technical Report # Degradable Agreement in the Presence of Byzantine Faults Nitin H. Vaidya Technical Report # 92-020 Abstract Consider a system consisting of a sender that wants to send a value to certain receivers. Byzantine

More information

Finally the Weakest Failure Detector for Non-Blocking Atomic Commit

Finally the Weakest Failure Detector for Non-Blocking Atomic Commit Finally the Weakest Failure Detector for Non-Blocking Atomic Commit Rachid Guerraoui Petr Kouznetsov Distributed Programming Laboratory EPFL Abstract Recent papers [7, 9] define the weakest failure detector

More information

Lower Bounds for Achieving Synchronous Early Stopping Consensus with Orderly Crash Failures

Lower Bounds for Achieving Synchronous Early Stopping Consensus with Orderly Crash Failures Lower Bounds for Achieving Synchronous Early Stopping Consensus with Orderly Crash Failures Xianbing Wang 1, Yong-Meng Teo 1,2, and Jiannong Cao 3 1 Singapore-MIT Alliance, 2 Department of Computer Science,

More information

CS505: Distributed Systems

CS505: Distributed Systems Cristina Nita-Rotaru CS505: Distributed Systems. Required reading for this topic } Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson for "Impossibility of Distributed with One Faulty Process,

More information

Implementing Uniform Reliable Broadcast with Binary Consensus in Systems with Fair-Lossy Links

Implementing Uniform Reliable Broadcast with Binary Consensus in Systems with Fair-Lossy Links Implementing Uniform Reliable Broadcast with Binary Consensus in Systems with Fair-Lossy Links Jialin Zhang Tsinghua University zhanggl02@mails.tsinghua.edu.cn Wei Chen Microsoft Research Asia weic@microsoft.com

More information

Distributed Systems Byzantine Agreement

Distributed Systems Byzantine Agreement Distributed Systems Byzantine Agreement He Sun School of Informatics University of Edinburgh Outline Finish EIG algorithm for Byzantine agreement. Number-of-processors lower bound for Byzantine agreement.

More information

Agreement Protocols. CS60002: Distributed Systems. Pallab Dasgupta Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur

Agreement Protocols. CS60002: Distributed Systems. Pallab Dasgupta Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur Agreement Protocols CS60002: Distributed Systems Pallab Dasgupta Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur Classification of Faults Based on components that failed Program

More information

Optimal Resilience Asynchronous Approximate Agreement

Optimal Resilience Asynchronous Approximate Agreement Optimal Resilience Asynchronous Approximate Agreement Ittai Abraham, Yonatan Amit, and Danny Dolev School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel {ittaia, mitmit,

More information

Coordination. Failures and Consensus. Consensus. Consensus. Overview. Properties for Correct Consensus. Variant I: Consensus (C) P 1. v 1.

Coordination. Failures and Consensus. Consensus. Consensus. Overview. Properties for Correct Consensus. Variant I: Consensus (C) P 1. v 1. Coordination Failures and Consensus If the solution to availability and scalability is to decentralize and replicate functions and data, how do we coordinate the nodes? data consistency update propagation

More information

Distributed Consensus

Distributed Consensus Distributed Consensus Reaching agreement is a fundamental problem in distributed computing. Some examples are Leader election / Mutual Exclusion Commit or Abort in distributed transactions Reaching agreement

More information

Abstract. The paper considers the problem of implementing \Virtually. system. Virtually Synchronous Communication was rst introduced

Abstract. The paper considers the problem of implementing \Virtually. system. Virtually Synchronous Communication was rst introduced Primary Partition \Virtually-Synchronous Communication" harder than Consensus? Andre Schiper and Alain Sandoz Departement d'informatique Ecole Polytechnique Federale de Lausanne CH-1015 Lausanne (Switzerland)

More information

Eventually consistent failure detectors

Eventually consistent failure detectors J. Parallel Distrib. Comput. 65 (2005) 361 373 www.elsevier.com/locate/jpdc Eventually consistent failure detectors Mikel Larrea a,, Antonio Fernández b, Sergio Arévalo b a Departamento de Arquitectura

More information

Early-Deciding Consensus is Expensive

Early-Deciding Consensus is Expensive Early-Deciding Consensus is Expensive ABSTRACT Danny Dolev Hebrew University of Jerusalem Edmond Safra Campus 9904 Jerusalem, Israel dolev@cs.huji.ac.il In consensus, the n nodes of a distributed system

More information

Asynchronous Models For Consensus

Asynchronous Models For Consensus Distributed Systems 600.437 Asynchronous Models for Consensus Department of Computer Science The Johns Hopkins University 1 Asynchronous Models For Consensus Lecture 5 Further reading: Distributed Algorithms

More information

Network Algorithms and Complexity (NTUA-MPLA) Reliable Broadcast. Aris Pagourtzis, Giorgos Panagiotakos, Dimitris Sakavalas

Network Algorithms and Complexity (NTUA-MPLA) Reliable Broadcast. Aris Pagourtzis, Giorgos Panagiotakos, Dimitris Sakavalas Network Algorithms and Complexity (NTUA-MPLA) Reliable Broadcast Aris Pagourtzis, Giorgos Panagiotakos, Dimitris Sakavalas Slides are partially based on the joint work of Christos Litsas, Aris Pagourtzis,

More information

CS505: Distributed Systems

CS505: Distributed Systems Department of Computer Science CS505: Distributed Systems Lecture 10: Consensus Outline Consensus impossibility result Consensus with S Consensus with Ω Consensus Most famous problem in distributed computing

More information

A Realistic Look At Failure Detectors

A Realistic Look At Failure Detectors A Realistic Look At Failure Detectors C. Delporte-Gallet, H. Fauconnier, R. Guerraoui Laboratoire d Informatique Algorithmique: Fondements et Applications, Université Paris VII - Denis Diderot Distributed

More information

Combining Shared Coin Algorithms

Combining Shared Coin Algorithms Combining Shared Coin Algorithms James Aspnes Hagit Attiya Keren Censor Abstract This paper shows that shared coin algorithms can be combined to optimize several complexity measures, even in the presence

More information

Failure detectors Introduction CHAPTER

Failure detectors Introduction CHAPTER CHAPTER 15 Failure detectors 15.1 Introduction This chapter deals with the design of fault-tolerant distributed systems. It is widely known that the design and verification of fault-tolerent distributed

More information

Unreliable Failure Detectors for Reliable Distributed Systems

Unreliable Failure Detectors for Reliable Distributed Systems Unreliable Failure Detectors for Reliable Distributed Systems Tushar Deepak Chandra I.B.M Thomas J. Watson Research Center, Hawthorne, New York and Sam Toueg Cornell University, Ithaca, New York We introduce

More information

Consensus. Consensus problems

Consensus. Consensus problems Consensus problems 8 all correct computers controlling a spaceship should decide to proceed with landing, or all of them should decide to abort (after each has proposed one action or the other) 8 in an

More information

Early stopping: the idea. TRB for benign failures. Early Stopping: The Protocol. Termination

Early stopping: the idea. TRB for benign failures. Early Stopping: The Protocol. Termination TRB for benign failures Early stopping: the idea Sender in round : :! send m to all Process p in round! k, # k # f+!! :! if delivered m in round k- and p " sender then 2:!! send m to all 3:!! halt 4:!

More information

The Weighted Byzantine Agreement Problem

The Weighted Byzantine Agreement Problem The Weighted Byzantine Agreement Problem Vijay K. Garg and John Bridgman Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 78712-1084, USA garg@ece.utexas.edu,

More information

Fault-Tolerant Consensus

Fault-Tolerant Consensus Fault-Tolerant Consensus CS556 - Panagiota Fatourou 1 Assumptions Consensus Denote by f the maximum number of processes that may fail. We call the system f-resilient Description of the Problem Each process

More information

Genuine atomic multicast in asynchronous distributed systems

Genuine atomic multicast in asynchronous distributed systems Theoretical Computer Science 254 (2001) 297 316 www.elsevier.com/locate/tcs Genuine atomic multicast in asynchronous distributed systems Rachid Guerraoui, Andre Schiper Departement d Informatique, Ecole

More information

How to solve consensus in the smallest window of synchrony

How to solve consensus in the smallest window of synchrony How to solve consensus in the smallest window of synchrony Dan Alistarh 1, Seth Gilbert 1, Rachid Guerraoui 1, and Corentin Travers 2 1 EPFL LPD, Bat INR 310, Station 14, 1015 Lausanne, Switzerland 2 Universidad

More information

Concurrent Non-malleable Commitments from any One-way Function

Concurrent Non-malleable Commitments from any One-way Function Concurrent Non-malleable Commitments from any One-way Function Margarita Vald Tel-Aviv University 1 / 67 Outline Non-Malleable Commitments Problem Presentation Overview DDN - First NMC Protocol Concurrent

More information

Byzantine Agreement. Gábor Mészáros. CEU Budapest, Hungary

Byzantine Agreement. Gábor Mészáros. CEU Budapest, Hungary CEU Budapest, Hungary 1453 AD, Byzantium Distibuted Systems Communication System Model Distibuted Systems Communication System Model G = (V, E) simple graph Distibuted Systems Communication System Model

More information

Towards optimal synchronous counting

Towards optimal synchronous counting Towards optimal synchronous counting Christoph Lenzen Joel Rybicki Jukka Suomela MPI for Informatics MPI for Informatics Aalto University Aalto University PODC 5 July 3 Focus on fault-tolerance Fault-tolerant

More information

Approximation of δ-timeliness

Approximation of δ-timeliness Approximation of δ-timeliness Carole Delporte-Gallet 1, Stéphane Devismes 2, and Hugues Fauconnier 1 1 Université Paris Diderot, LIAFA {Carole.Delporte,Hugues.Fauconnier}@liafa.jussieu.fr 2 Université

More information

Upper and Lower Bounds on the Number of Faults. a System Can Withstand Without Repairs. Cambridge, MA 02139

Upper and Lower Bounds on the Number of Faults. a System Can Withstand Without Repairs. Cambridge, MA 02139 Upper and Lower Bounds on the Number of Faults a System Can Withstand Without Repairs Michel Goemans y Nancy Lynch z Isaac Saias x Laboratory for Computer Science Massachusetts Institute of Technology

More information

Synchrony Weakened by Message Adversaries vs Asynchrony Restricted by Failure Detectors

Synchrony Weakened by Message Adversaries vs Asynchrony Restricted by Failure Detectors Synchrony Weakened by Message Adversaries vs Asynchrony Restricted by Failure Detectors Michel RAYNAL, Julien STAINER Institut Universitaire de France IRISA, Université de Rennes, France Message adversaries

More information

Shared Memory vs Message Passing

Shared Memory vs Message Passing Shared Memory vs Message Passing Carole Delporte-Gallet Hugues Fauconnier Rachid Guerraoui Revised: 15 February 2004 Abstract This paper determines the computational strength of the shared memory abstraction

More information

Easy Consensus Algorithms for the Crash-Recovery Model

Easy Consensus Algorithms for the Crash-Recovery Model Reihe Informatik. TR-2008-002 Easy Consensus Algorithms for the Crash-Recovery Model Felix C. Freiling, Christian Lambertz, and Mila Majster-Cederbaum Department of Computer Science, University of Mannheim,

More information

Communication-Efficient Randomized Consensus

Communication-Efficient Randomized Consensus Communication-Efficient Randomized Consensus Dan Alistarh 1, James Aspnes 2, Valerie King 3, and Jared Saia 4 1 Microsoft Research, Cambridge, UK. Email: dan.alistarh@microsoft.com 2 Yale University, Department

More information

Uniform consensus is harder than consensus

Uniform consensus is harder than consensus R Available online at www.sciencedirect.com Journal of Algorithms 51 (2004) 15 37 www.elsevier.com/locate/jalgor Uniform consensus is harder than consensus Bernadette Charron-Bost a, and André Schiper

More information

Reliable Broadcast for Broadcast Busses

Reliable Broadcast for Broadcast Busses Reliable Broadcast for Broadcast Busses Ozalp Babaoglu and Rogerio Drummond. Streets of Byzantium: Network Architectures for Reliable Broadcast. IEEE Transactions on Software Engineering SE- 11(6):546-554,

More information

Failure Detection and Consensus in the Crash-Recovery Model

Failure Detection and Consensus in the Crash-Recovery Model Failure Detection and Consensus in the Crash-Recovery Model Marcos Kawazoe Aguilera Wei Chen Sam Toueg Department of Computer Science Upson Hall, Cornell University Ithaca, NY 14853-7501, USA. aguilera,weichen,sam@cs.cornell.edu

More information

Coin-flipping games immune against linear-sized coalitions (Extended abstract)

Coin-flipping games immune against linear-sized coalitions (Extended abstract) Coin-flipping games immune against linear-sized coalitions (Extended abstract) Abstract Noga Alon IBM Almaden Research Center, San Jose, CA 9510 and Sackler Faculty of Exact Sciences, Tel Aviv University,

More information

Byzantine behavior also includes collusion, i.e., all byzantine nodes are being controlled by the same adversary.

Byzantine behavior also includes collusion, i.e., all byzantine nodes are being controlled by the same adversary. Chapter 17 Byzantine Agreement In order to make flying safer, researchers studied possible failures of various sensors and machines used in airplanes. While trying to model the failures, they were confronted

More information

Byzantine Agreement. Chapter Validity 190 CHAPTER 17. BYZANTINE AGREEMENT

Byzantine Agreement. Chapter Validity 190 CHAPTER 17. BYZANTINE AGREEMENT 190 CHAPTER 17. BYZANTINE AGREEMENT 17.1 Validity Definition 17.3 (Any-Input Validity). The decision value must be the input value of any node. Chapter 17 Byzantine Agreement In order to make flying safer,

More information

Common Knowledge and Consistent Simultaneous Coordination

Common Knowledge and Consistent Simultaneous Coordination Common Knowledge and Consistent Simultaneous Coordination Gil Neiger College of Computing Georgia Institute of Technology Atlanta, Georgia 30332-0280 gil@cc.gatech.edu Mark R. Tuttle DEC Cambridge Research

More information

Byzantine Agreement in Polynomial Expected Time

Byzantine Agreement in Polynomial Expected Time Byzantine Agreement in Polynomial Expected Time [Extended Abstract] Valerie King Dept. of Computer Science, University of Victoria P.O. Box 3055 Victoria, BC, Canada V8W 3P6 val@cs.uvic.ca ABSTRACT In

More information

Faster Agreement via a Spectral Method for Detecting Malicious Behavior

Faster Agreement via a Spectral Method for Detecting Malicious Behavior Faster Agreement via a Spectral Method for Detecting Malicious Behavior Valerie King Jared Saia Abstract We address the problem of Byzantine agreement, to bring processors to agreement on a bit in the

More information

Time Free Self-Stabilizing Local Failure Detection

Time Free Self-Stabilizing Local Failure Detection Research Report 33/2004, TU Wien, Institut für Technische Informatik July 6, 2004 Time Free Self-Stabilizing Local Failure Detection Martin Hutle and Josef Widder Embedded Computing Systems Group 182/2

More information

C 1. Recap: Finger Table. CSE 486/586 Distributed Systems Consensus. One Reason: Impossibility of Consensus. Let s Consider This

C 1. Recap: Finger Table. CSE 486/586 Distributed Systems Consensus. One Reason: Impossibility of Consensus. Let s Consider This Recap: Finger Table Finding a using fingers Distributed Systems onsensus Steve Ko omputer Sciences and Engineering University at Buffalo N102 86 + 2 4 N86 20 + 2 6 N20 2 Let s onsider This

More information

Byzantine Vector Consensus in Complete Graphs

Byzantine Vector Consensus in Complete Graphs Byzantine Vector Consensus in Complete Graphs Nitin H. Vaidya University of Illinois at Urbana-Champaign nhv@illinois.edu Phone: +1 217-265-5414 Vijay K. Garg University of Texas at Austin garg@ece.utexas.edu

More information

Weakening Failure Detectors for k-set Agreement via the Partition Approach

Weakening Failure Detectors for k-set Agreement via the Partition Approach Weakening Failure Detectors for k-set Agreement via the Partition Approach Wei Chen 1, Jialin Zhang 2, Yu Chen 1, Xuezheng Liu 1 1 Microsoft Research Asia {weic, ychen, xueliu}@microsoft.com 2 Center for

More information

Byzantine Agreement in Expected Polynomial Time

Byzantine Agreement in Expected Polynomial Time 0 Byzantine Agreement in Expected Polynomial Time Valerie King, University of Victoria Jared Saia, University of New Mexico We address the problem of Byzantine agreement, to bring processors to agreement

More information

The Weakest Failure Detector to Solve Mutual Exclusion

The Weakest Failure Detector to Solve Mutual Exclusion The Weakest Failure Detector to Solve Mutual Exclusion Vibhor Bhatt Nicholas Christman Prasad Jayanti Dartmouth College, Hanover, NH Dartmouth Computer Science Technical Report TR2008-618 April 17, 2008

More information

On the Resilience and Uniqueness of CPA for Secure Broadcast

On the Resilience and Uniqueness of CPA for Secure Broadcast On the Resilience and Uniqueness of CPA for Secure Broadcast Chris Litsas, Aris Pagourtzis, Giorgos Panagiotakos and Dimitris Sakavalas School of Electrical and Computer Engineering National Technical

More information

Model Checking of Fault-Tolerant Distributed Algorithms

Model Checking of Fault-Tolerant Distributed Algorithms Model Checking of Fault-Tolerant Distributed Algorithms Part I: Fault-Tolerant Distributed Algorithms Annu Gmeiner Igor Konnov Ulrich Schmid Helmut Veith Josef Widder LOVE 2016 @ TU Wien Josef Widder (TU

More information

ROBUST & SPECULATIVE BYZANTINE RANDOMIZED CONSENSUS WITH CONSTANT TIME COMPLEXITY IN NORMAL CONDITIONS

ROBUST & SPECULATIVE BYZANTINE RANDOMIZED CONSENSUS WITH CONSTANT TIME COMPLEXITY IN NORMAL CONDITIONS ROBUST & SPECULATIVE BYZANTINE RANDOMIZED CONSENSUS WITH CONSTANT TIME COMPLEXITY IN NORMAL CONDITIONS Bruno Vavala University of Lisbon, Portugal Carnegie Mellon University, U.S. Nuno Neves University

More information

6.852: Distributed Algorithms Fall, Class 10

6.852: Distributed Algorithms Fall, Class 10 6.852: Distributed Algorithms Fall, 2009 Class 10 Today s plan Simulating synchronous algorithms in asynchronous networks Synchronizers Lower bound for global synchronization Reading: Chapter 16 Next:

More information

Uniform Actions in Asynchronous Distributed Systems. Extended Abstract. asynchronous distributed system that uses a dierent

Uniform Actions in Asynchronous Distributed Systems. Extended Abstract. asynchronous distributed system that uses a dierent Uniform Actions in Asynchronous Distributed Systems Extended Abstract Dalia Malki Ken Birman y Aleta Ricciardi z Andre Schiper x Abstract We develop necessary conditions for the development of asynchronous

More information

Round-by-Round Fault Detectors: Unifying Synchrony and Asynchrony. Eli Gafni. Computer Science Department U.S.A.

Round-by-Round Fault Detectors: Unifying Synchrony and Asynchrony. Eli Gafni. Computer Science Department U.S.A. Round-by-Round Fault Detectors: Unifying Synchrony and Asynchrony (Extended Abstract) Eli Gafni (eli@cs.ucla.edu) Computer Science Department University of California, Los Angeles Los Angeles, CA 90024

More information

Verification of clock synchronization algorithm (Original Welch-Lynch algorithm and adaptation to TTA)

Verification of clock synchronization algorithm (Original Welch-Lynch algorithm and adaptation to TTA) Verification of clock synchronization algorithm (Original Welch-Lynch algorithm and adaptation to TTA) Christian Mueller November 25, 2005 1 Contents 1 Clock synchronization in general 3 1.1 Introduction............................

More information

Optimal and Player-Replaceable Consensus with an Honest Majority Silvio Micali and Vinod Vaikuntanathan

Optimal and Player-Replaceable Consensus with an Honest Majority Silvio Micali and Vinod Vaikuntanathan Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2017-004 March 31, 2017 Optimal and Player-Replaceable Consensus with an Honest Majority Silvio Micali and Vinod Vaikuntanathan

More information

Valency Arguments CHAPTER7

Valency Arguments CHAPTER7 CHAPTER7 Valency Arguments In a valency argument, configurations are classified as either univalent or multivalent. Starting from a univalent configuration, all terminating executions (from some class)

More information

Eventual Leader Election with Weak Assumptions on Initial Knowledge, Communication Reliability, and Synchrony

Eventual Leader Election with Weak Assumptions on Initial Knowledge, Communication Reliability, and Synchrony Eventual Leader Election with Weak Assumptions on Initial Knowledge, Communication Reliability, and Synchrony Antonio FERNÁNDEZ Ernesto JIMÉNEZ Michel RAYNAL LADyR, GSyC, Universidad Rey Juan Carlos, 28933

More information

Crash-resilient Time-free Eventual Leadership

Crash-resilient Time-free Eventual Leadership Crash-resilient Time-free Eventual Leadership Achour MOSTEFAOUI Michel RAYNAL Corentin TRAVERS IRISA, Université de Rennes 1, Campus de Beaulieu, 35042 Rennes Cedex, France {achour raynal travers}@irisa.fr

More information

Section 6 Fault-Tolerant Consensus

Section 6 Fault-Tolerant Consensus Section 6 Fault-Tolerant Consensus CS586 - Panagiota Fatourou 1 Description of the Problem Consensus Each process starts with an individual input from a particular value set V. Processes may fail by crashing.

More information

On Equilibria of Distributed Message-Passing Games

On Equilibria of Distributed Message-Passing Games On Equilibria of Distributed Message-Passing Games Concetta Pilotto and K. Mani Chandy California Institute of Technology, Computer Science Department 1200 E. California Blvd. MC 256-80 Pasadena, US {pilotto,mani}@cs.caltech.edu

More information

On-line Bin-Stretching. Yossi Azar y Oded Regev z. Abstract. We are given a sequence of items that can be packed into m unit size bins.

On-line Bin-Stretching. Yossi Azar y Oded Regev z. Abstract. We are given a sequence of items that can be packed into m unit size bins. On-line Bin-Stretching Yossi Azar y Oded Regev z Abstract We are given a sequence of items that can be packed into m unit size bins. In the classical bin packing problem we x the size of the bins and try

More information

Integrating External and Internal Clock Synchronization. Christof Fetzer and Flaviu Cristian. Department of Computer Science & Engineering

Integrating External and Internal Clock Synchronization. Christof Fetzer and Flaviu Cristian. Department of Computer Science & Engineering Integrating External and Internal Clock Synchronization Christof Fetzer and Flaviu Cristian Department of Computer Science & Engineering University of California, San Diego La Jolla, CA 9093?0114 e-mail:

More information

Unreliable Failure Detectors for Reliable Distributed Systems

Unreliable Failure Detectors for Reliable Distributed Systems Unreliable Failure Detectors for Reliable Distributed Systems A different approach Augment the asynchronous model with an unreliable failure detector for crash failures Define failure detectors in terms

More information

The Heard-Of Model: Computing in Distributed Systems with Benign Failures

The Heard-Of Model: Computing in Distributed Systems with Benign Failures The Heard-Of Model: Computing in Distributed Systems with Benign Failures Bernadette Charron-Bost Ecole polytechnique, France André Schiper EPFL, Switzerland Abstract Problems in fault-tolerant distributed

More information

Atomic m-register operations

Atomic m-register operations Atomic m-register operations Michael Merritt Gadi Taubenfeld December 15, 1993 Abstract We investigate systems where it is possible to access several shared registers in one atomic step. We characterize

More information

Silence. Guy Goren Viterbi Faculty of Electrical Engineering, Technion

Silence. Guy Goren Viterbi Faculty of Electrical Engineering, Technion Silence Guy Goren Viterbi Faculty of Electrical Engineering, Technion sgoren@campus.technion.ac.il Yoram Moses Viterbi Faculty of Electrical Engineering, Technion moses@ee.technion.ac.il arxiv:1805.07954v1

More information

Generalized Consensus and Paxos

Generalized Consensus and Paxos Generalized Consensus and Paxos Leslie Lamport 3 March 2004 revised 15 March 2005 corrected 28 April 2005 Microsoft Research Technical Report MSR-TR-2005-33 Abstract Theoretician s Abstract Consensus has

More information

Tolerating Permanent and Transient Value Faults

Tolerating Permanent and Transient Value Faults Distributed Computing manuscript No. (will be inserted by the editor) Tolerating Permanent and Transient Value Faults Zarko Milosevic Martin Hutle André Schiper Abstract Transmission faults allow us to

More information

A lower bound for scheduling of unit jobs with immediate decision on parallel machines

A lower bound for scheduling of unit jobs with immediate decision on parallel machines A lower bound for scheduling of unit jobs with immediate decision on parallel machines Tomáš Ebenlendr Jiří Sgall Abstract Consider scheduling of unit jobs with release times and deadlines on m identical

More information

Round Complexity of Authenticated Broadcast with a Dishonest Majority

Round Complexity of Authenticated Broadcast with a Dishonest Majority Round Complexity of Authenticated Broadcast with a Dishonest Majority Juan A. Garay Jonathan Katz Chiu-Yuen Koo Rafail Ostrovsky Abstract Broadcast among n parties in the presence of t n/3 malicious parties

More information

On the weakest failure detector ever

On the weakest failure detector ever On the weakest failure detector ever The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher Guerraoui, Rachid

More information

Impossibility of Distributed Consensus with One Faulty Process

Impossibility of Distributed Consensus with One Faulty Process Impossibility of Distributed Consensus with One Faulty Process Journal of the ACM 32(2):374-382, April 1985. MJ Fischer, NA Lynch, MS Peterson. Won the 2002 Dijkstra Award (for influential paper in distributed

More information

Failure detection and consensus in the crash-recovery model

Failure detection and consensus in the crash-recovery model Distrib. Comput. (2000) 13: 99 125 c Springer-Verlag 2000 Failure detection and consensus in the crash-recovery model Marcos Kawazoe Aguilera 1, Wei Chen 2, Sam Toueg 1 1 Department of Computer Science,

More information

Byzantine agreement with homonyms

Byzantine agreement with homonyms Distrib. Comput. (013) 6:31 340 DOI 10.1007/s00446-013-0190-3 Byzantine agreement with homonyms Carole Delporte-Gallet Hugues Fauconnier Rachid Guerraoui Anne-Marie Kermarrec Eric Ruppert Hung Tran-The

More information

Replication predicates for dependent-failure algorithms

Replication predicates for dependent-failure algorithms Replication predicates for dependent-failure algorithms Flavio Junqueira and Keith Marzullo Department of Computer Science and Engineering University of California, San Diego La Jolla, CA USA {flavio,

A Short Introduction to Failure Detectors for Asynchronous Distributed Systems

ACM SIGACT News Distributed Computing Column 17. Sergio Rajsbaum. Abstract: The Distributed Computing Column covers the theory of systems that are composed of a number of interacting computing elements. These

On Consistent Checkpointing in Distributed Systems

Guohong Cao, Mukesh Singhal, Department of Computer and Information Science, The Ohio State University, Columbus, OH 43201. E-mail: {gcao, singhal}@cis.ohio-state.edu

Towards Optimal Synchronous Counting

Christoph Lenzen (clenzen@mpi-inf.mpg.de), Max Planck Institute for Informatics; Joel Rybicki (joel.rybicki@aalto.fi), Max Planck Institute for Informatics, Helsinki Institute

Distributed Algorithms

December 17, 2008. Gerard Tel, Introduction to Distributed Algorithms (2nd edition), Cambridge University Press, 2000. Set-Up of the Course: 13 lectures: Wan Fokkink, room U342, email:

Self-stabilizing Byzantine Agreement

Ariel Daliot, School of Engineering and Computer Science, The Hebrew University, Jerusalem, Israel, adaliot@cs.huji.ac.il; Danny Dolev, School of Engineering and Computer

Communication Predicates: A High-Level Abstraction for Coping with Transient and Dynamic Faults

Martin Hutle (martin.hutle@epfl.ch), André Schiper (andre.schiper@epfl.ch), École Polytechnique Fédérale de Lausanne

The Notion of Veto Number for Distributed Agreement Problems

IRISA Publication Interne No 1599, Institut de Recherche en Informatique et Systèmes Aléatoires. Roy Friedman, Achour Mostefaoui,

Can an Operation Both Update the State and Return a Meaningful Value in the Asynchronous PRAM Model?

Jaap-Henk Hoepman, Department of Computer Science, University of Twente, the Netherlands, hoepman@cs.utwente.nl

CS3110 Spring 2017 Lecture 21: Distributed Computing with Functional Processes

Robert Constable. PS6: out on April 24, due May 8 (day of last lecture). 1 Introduction: In the next two lectures,

Round-Efficient Perfectly Secure Message Transmission Scheme Against General Adversary

Kaoru Kurosawa, Department of Computer and Information Sciences, Ibaraki University, 4-12-1 Nakanarusawa, Hitachi,

Simultaneous Consensus Tasks: A Tighter Characterization of Set-Consensus

Yehuda Afek¹, Eli Gafni², Sergio Rajsbaum³, Michel Raynal⁴, and Corentin Travers⁴. ¹ Computer Science Department, Tel-Aviv

Asynchronous Leasing

Romain Boichat, Partha Dutta, Rachid Guerraoui, Distributed Programming Laboratory, Swiss Federal Institute of Technology in Lausanne. Abstract: Leasing is a very effective way to improve

Byzantine Agreement. Gábor Mészáros. Tatracrypt 2012, July 2–4, Smolenice, Slovakia. CEU Budapest, Hungary

Byzantium, 1453 AD. The Final Strike... Communication via Messengers. The Byzantine Generals Problem. Communication System Model Goal G

Randomized Protocols for Asynchronous Consensus

Alessandro Panconesi, DSI - La Sapienza, via Salaria 113, piano III, 00198 Roma, Italy. One of the central problems in the Theory of (feasible) Computation is

Protocol for Asynchronous, Reliable, Secure and Efficient Consensus (PARSEC)

Pierre Chevalier, Bartłomiej Kamiński, Fraser Hutchison, Qi Ma, Spandan Sharma. June 20, 2018. Abstract: In this paper we present

SYNCHRONOUS SET AGREEMENT: A CONCISE GUIDED TOUR (WITH OPEN PROBLEMS)

IRISA Publication Interne No 1791, Institut de Recherche en Informatique et Systèmes Aléatoires. Michel Raynal, Corentin

Clock Synchronization in the Presence of Omission and Performance Failures, and Processor Joins. Flaviu Cristian, Houtan Aghili and Ray Strong

IBM Research, Almaden Research Center. Abstract: This paper presents

Benchmarking Model Checkers with Distributed Algorithms. Étienne Coulouma-Dupont

November 24, 2011. Introduction. The Consensus Problem. Consensus: application. Paxos. LastVoting. Hypothesis. The Algorithm. Analysis

Failure Detectors. Seif Haridi. S. Haridi, KTHx ID2203.1x

haridi@kth.se. Modeling Timing Assumptions: Tedious to model eventual synchrony (partial synchrony). Timing assumptions mostly needed to detect failures. Heartbeats, timeouts,
