Resolving Message Complexity of Byzantine Agreement and Beyond

Zvi Galil*    Alain Mayer†    Moti Yung‡

(extended summary)

Abstract

Byzantine Agreement among processors is a basic primitive in distributed computing. It comes in a number of basic fault models: "Crash", "Omission" and "Malicious" adversarial behaviors. The message complexity of the primitive has been known for the strong failure models of Malicious and Omission adversaries since the early 80's, while the question for the more benign Crash failure model has been open. In this paper we show how to solve agreement in the presence of crash failures using $O(n)$ messages, which is optimal, thus settling a thirteen-year-old open problem. Our solution has almost linear time, and our new algorithmic techniques have further implications: a family of "early stopping" agreement protocols with improved message complexity, and a new solution to "Checkpoint" yielding a substantial improvement of the protocol for distributed work performance under adaptive parallelism in a network of workstations.

* Columbia University and Tel-Aviv University. galil@cs.columbia.edu
† Dept. of Computer Science, Columbia University, New York, NY 10027, USA. mayer@cs.columbia.edu. Part of this work was done while the author was visiting the IBM T. J. Watson Research Center. Partially supported by NSF grant CCR and CISE Institutional Infrastructure Grant CDA
‡ IBM Research Division, T. J. Watson Research Center, Box 704, Yorktown Heights, NY. moti@watson.ibm.com

1 Introduction

The Byzantine Agreement (BA) problem is concerned with a network of $n$ processes consisting of a distinguished process, the sender, and $n - 1$ receivers. The sender has an initial value which it wishes to broadcast to the receivers (we assume synchrony and message-passing communication). The complication is that some of the processes (possibly including the sender) may be faulty under some fault model (we essentially follow the notation of [HT93]).
The BA problem is to design a protocol, i.e., an algorithm for each process, which will ensure the following three conditions in the presence of up to $t$ faulty processes:

Termination: Every correct process eventually chooses a decision value.
Agreement: No two correct processes choose different decision values.
Validity: If the sender is correct then all correct processes choose the sender's initial value.

This paper is on crash-failures, which are defined as follows: a faulty process stops prematurely; once it has stopped, it sends no more messages. Typically $t = n - 1$ in the case of crash-failures.

History: Algorithms for BA under crash failures have been studied as early as 1982 by Lamport and Fischer [LF82] (see also [F83]), who show a solution which requires $O(n^3)$ messages. In 1984, Bracha [B84] showed a nonconstructive way (namely, the existence of a certain partition) to achieve a message complexity of $O(n\sqrt{n})$. In 1992, Dwork, Halpern, and Waarts [DHW92] introduced a constructive solution with sub-quadratic $O(n \log n)$ messages, but with exponential running time (see Figure 1 for a historical overview). The message complexity of the problem has remained open. This is in contrast with stronger failure models, where upper bounds match the lower bounds. The latter were obtained in [DR85] by exploiting the power of the stronger (omission and malicious) adversaries. In the simple case of crash failures only the trivial lower bound of $\Omega(n)$ messages has been known. It holds even in the failure-free case. Exact upper and lower bounds for the failure-free scenario were investigated in [AWH92, HH93].

A special variant of BA is early-stopping BA (EBA). It requires the running time to be $O(f + 1)$, where $f$ is the actual number of failures in a run. (The time must be at least $f + 1$, as was shown in [LF82, H83, DRS90, DM90].) Lamport and Fischer [LF82] showed how to solve EBA in $O((f + 1)n^2)$ messages. In 1990, Chandra and Toueg [CT90] solved EBA with an improved message complexity of $O((f + 1)n)$. Here we improve the message complexity of EBA. See Figure 2 for a historical overview of EBA.

Relevance and applications: The model of crash-failures is assumed in numerous distributed systems (see for example [CM84, BJ87, MB+94]) and variations of (E)BA are used as general building blocks for reliable distributed computing; see for example [HT93] (footnote 1). Various distributed operating systems ([HT93] page 131: ISIS, Amoeba, Psynch, Delta-4, Transis) use atomic or FIFO broadcast, while relying, at some layer, on synchrony. Note that in asynchrony this problem is unsolvable [FLP85]. Another important application of our results, work performance in a network of workstations [ACP94, CFGK95, DMY94], is discussed further below. The problem of work performance in the presence of crash faults was first formulated by Dwork, Halpern and Waarts in [DHW92].
Footnote 1: Hadzilacos and Toueg [HT93] use the name "Terminating Reliable Broadcast" for BA.

1.1 Our approach and techniques

Our basic building block is the Checkpoint problem (CP). De Prisco, Mayer and Yung [DMY94] formalized this problem as follows: Let $START$ be the set of live processes at the beginning of the execution and let $END$ be the set of live processes when the protocol terminates. Then correctness of the result-set $E$ is defined as follows:

Agreement: $E$ must be the same at each process.
Safety: $E \subseteq START$.
Progress: $END \subseteq E$.

In [DMY94] an early-stopping solution with $O((f + 1)n)$ messages was presented. All previous message-efficient solutions to BA or EBA are based on the rotating coordinator (or phase-king) approach: there is a predefined coordinator-process in every round and each message sent either originates at or is destined to the coordinator. To obtain our result we depart from this technique, and develop ideas in the following directions:

(i) We employ a "rotating tree", which we call a diffusion-tree. That is, in every phase (a collection of a predefined number of rounds), there is a predefined tree which spans all live processes. In a phase, messages are sent only along the edges of the diffusion-tree.
(ii) We solve agreement via a reduction to checkpoint (typically, the reduction in the other direction was employed).
(iii) We develop a recursive checkpointing approach as follows: in order to achieve BA or CP via a diffusion-tree of height $h$ and degree $d$, we use the induction hypothesis of having available a CP for a diffusion-tree of height $h - 1$ and degree $d$, i.e., for the internal nodes of the original diffusion-tree.
(iv) Our techniques have an early-stopping flavor, i.e., processes exit the protocol at different times, and hence we need to introduce a time-management subprotocol. This is somewhat counterintuitive, as early-stopping protocols typically employ flooding of messages.
In the previous rotating coordinator paradigm, time was easy to manage, as the round in which a process was required to become a coordinator was essentially determined by its unique ID. In our case, there are repeated calls at different recursive levels. Processes need to have consistent views of when to "time-out" and become the new root of the diffusion-tree.
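The three CP conditions of Section 1.1 can be made concrete as a predicate over the outcome of a single run. The sketch below is only an illustration and not part of the protocol: the function name `check_checkpoint` and the representation of a run (two sets of process IDs plus one reported result-set per surviving process) are assumptions introduced here.

```python
from typing import Dict, Set


def check_checkpoint(start: Set[int], end: Set[int],
                     results: Dict[int, Set[int]]) -> Dict[str, bool]:
    """Check Agreement, Safety and Progress for one finished CP run.

    start   -- processes alive when the execution began (START)
    end     -- processes alive when the protocol terminated (END)
    results -- result-set E reported by each process in END
    (This run representation is a hypothetical one chosen for the sketch.)
    """
    reported = [results[p] for p in end]
    # Agreement: E is the same at each process.
    agreement = all(e == reported[0] for e in reported)
    # Safety: E is a subset of START.
    safety = all(e <= start for e in reported)
    # Progress: END is a subset of E.
    progress = all(end <= e for e in reported)
    return {"agreement": agreement, "safety": safety, "progress": progress}


start = {0, 1, 2, 3, 4}
end = {0, 2, 4}  # processes 1 and 3 crashed during the run

# Process 3 crashed late; including it in E is still allowed.
good = {p: {0, 2, 3, 4} for p in end}
# Process 4 reports a different set that also omits itself.
bad = {0: {0, 2, 4}, 2: {0, 2, 4}, 4: {0, 2}}

print(check_checkpoint(start, end, good))
print(check_checkpoint(start, end, bad))
```

Note that Safety and Progress together leave room for processes that crash during the run: they may or may not appear in $E$, as long as every surviving process reports the same set.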

              LF82        B84              DHW92           This paper
messages      $O(n^3)$    $O(n\sqrt{n})$   $O(n \log n)$   $O(n)$
time          $O(n)$      nonconstr.       $O(2^n)$        $O(n^{1+\varepsilon})$

Figure 1: Comparison of BA under crash-failures

              LF82             CT90            This paper
messages      $O((f+1)n^2)$    $O((f+1)n)$     $O(n + f n^{\varepsilon})$

Figure 2: Comparison of EBA under crash-failures

1.2 Our results on agreement

Let a diffusion-tree be a minimum-height tree spanning all $n$ processors with the sender as its root. The height $h$ and degree $d$ of the diffusion-tree satisfy $h = O(\log_d n)$ and $d = \Theta(n^{1/h})$. Our main result is the design of a family of BA protocols with the following characteristics:

Lemma 1 For any diffusion-tree with height $h$ and degree $d$ such that $d \ge 9$, our algorithm solves BA in the presence of $f$ crash-failures in time $O((f + 1)8^h)$ and number of messages $O(fd + n)$.

This is employed to give the result on agreement with optimal message complexity. By choosing a large constant $d$ in Lemma 1 we get:

Theorem 1 For any $\varepsilon > 0$, BA can be solved with $O(n)$ messages and time $O(n^{1+\varepsilon})$.

We also improve the known message complexity of EBA. By setting $d = n^{\varepsilon}$ and $h = O(1/\varepsilon)$ in Lemma 1 we get:

Theorem 2 EBA can be solved with running time $O((f + 1)8^{1/\varepsilon})$ and with $O(n + f n^{\varepsilon})$ messages for any $\varepsilon > 0$.

Our solutions use messages of size $O(n)$. Regarding bit complexity, a simple coding of a "long" message into time-slots (under exponential slowdown in time) shows that, at least theoretically, $O(n)$ bit complexity is also achievable, which is optimal as well. Another such transformation, which breaks the message into mini-messages of size $O(\log n)$, gives a polynomial-time protocol with subquadratic total bits.

1.3 Parallel Work on Networks of Workstations

Our main result holds also for performing a checkpoint:

Theorem 3 CP can be solved with running time $O((f + 1)8^{1/\varepsilon})$ and with $O(n + f n^{\varepsilon})$ messages for any $\varepsilon > 0$.

Performing parallel applications on a network of workstations (NOW) is an area of growing interest, e.g., [ACP94, CFGK95], especially with the advent of low-latency architectures; see e.g. [EABB94].
In this context a checkpoint is a basic building block for performing $m$ independent work units in parallel on a network of $n$ unreliable workstations (WP). (This models adaptive parallelism, where stations are taken away from the global parallel application to be used by the local user.) This problem was first introduced in [DHW92]. Let $S$ denote the parallel processor step, a measure of parallel time for this problem, introduced by Kanellakis and Shvartsman [KS89]. Using Theorem 3 in the framework of [DMY94], we get:

Theorem 4 WP can be solved in time $S = O(m + (f + 1)n)$ and number of messages $M = O(f n^{\varepsilon} + \min(f + 1, \log n)\, n)$.

We note that $S = O(m + (f + 1)n)$ is optimal for WP [DMY94]; thus we improve on their number of messages ($M = O((f + 1)n)$) without losing time optimality.

Organization: The next two sections introduce our algorithms for BA and CP by solving the special case of a height-2 diffusion-tree, demonstrating some of the algorithmic techniques employed (rotating tree and basic time management). Section 4 then presents the general case with its further techniques and subtleties (involved time management and full usage of the recursive checkpointing).

2 BA via a diffusion-tree of height 2

In this section we show some of our ideas by presenting a solution to (E)BA which uses a diffusion-tree of height 2 and consequently of degree $\Theta(\sqrt{n})$. We show how to get a solution to (E)BA with running time $O(f + 1)$ and message complexity $O(n + f\sqrt{n})$.

Initially, the diffusion-tree is of height 2. The process on level 0 is the sender. Processes on levels 0 and 1 are referred to as coordinators; see Figure 3. The tree is organized such that a breadth-first search yields processes with monotonically increasing IDs.

Figure 3: Diffusion-tree of height 2. Level 0: sender and coordinator; level 1: coordinators; level 2: leaves.

If all coordinators were nonfaulty, the sender could simply send its value to the coordinators on level 1 which, in turn, could forward the value to the leaves on level 2. In the following we describe how we proceed in the general case, i.e., in the presence of failures. Let $L$ denote the set of leaves which have not yet received any value.
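The initial height-2 tree described above can be sketched as follows. This is a minimal illustration under assumptions that are not in the source: process IDs are taken to be 0..n-1 with the sender as ID 0, the degree is fixed to the ceiling of $\sqrt{n}$, and leaves are handed to coordinators in contiguous blocks. The helper names `build_height2_tree` and `bfs_order` are hypothetical.

```python
import math
from collections import deque
from typing import Dict, List


def build_height2_tree(n: int) -> Dict[int, List[int]]:
    """Return children lists for a height-2 tree over process IDs 0..n-1.

    ID 0 is the sender (level 0). IDs are assigned in breadth-first order:
    level-1 coordinators get IDs 1..d, leaves get the remaining IDs.
    """
    if n < 1:
        raise ValueError("need at least one process")
    d = math.isqrt(n - 1) + 1          # equals ceil(sqrt(n)) for n >= 1
    coordinators = list(range(1, min(d, n - 1) + 1))
    children: Dict[int, List[int]] = {0: coordinators}
    next_id = len(coordinators) + 1
    for c in coordinators:
        # Each coordinator takes the next block of at most d leaf IDs.
        block = list(range(next_id, min(next_id + d, n)))
        children[c] = block
        next_id += len(block)
    return children


def bfs_order(children: Dict[int, List[int]]) -> List[int]:
    """Breadth-first traversal starting at the sender (ID 0)."""
    order, queue = [], deque([0])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, []))
    return order


tree = build_height2_tree(10)
print(tree)             # children of the sender, then of each coordinator
print(bfs_order(tree))  # IDs appear in increasing order
```

The block assignment keeps the breadth-first property simple to see, at the price of an unbalanced tree (later coordinators may have no leaves); a balanced assignment would serve equally well as long as the blocks stay contiguous.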
We propose to iterate the following five phases until $L$ is empty:

Phase (1): The sender diffuses its value to the coordinators on level 1 which, in turn, forward the value to the leaves on level 2. At this point, each process whose path in the tree to the sender consists of live processes knows the value of the sender.

Phase (2): A first checkpoint among the coordinators, i.e., the level-0 and level-1 processes, is executed, using a generalized checkpoint of [DMY94]: during the running of the checkpoint each live process (which is a member of the result-set $E_1$) supplies the value it received in Phase 1, or some default-value if it did not receive any message. Now the checkpoint can produce an additional (besides the set $E_1$) output-value $v$. The value $v$ is the sender's value if it was received by at least one live level-1 coordinator in Phase 1, and the default-value otherwise.

Phase (3): Those coordinators on level 1 which have not received a message from the sender in the first phase (in the case of the sender failing) now diffuse the value $v$ to their descendants (leaves).

Phase (4): A second checkpoint among the coordinators is executed, which yields the result set $E_2$. Knowing $E_1$ and $E_2$, each process can now compute which coordinators have failed and hence which leaves have not received any message so far. Let $L$ denote the set of these leaves.

Phase (5): If the size of $L$ is larger than $\sqrt{n}$ then the next iteration again uses a diffusion-tree of height 2. If $L$ is nonempty, but smaller than $\sqrt{n}$, then a diffusion-tree of height 1 is sufficient
since its degree is bounded by $\sqrt{n}$. Note that using a diffusion-tree of height 1 is essentially equivalent to using the standard rotating coordinator technique of [CT90]. If $L$ is empty then each process knows the value $v$ (which was computed in Phase 2) and hence the diffusion is complete.

This is the basic iteration; however, we need to solve a few crucial issues. First is the issue of "synchronizing" the processors taking part in an eventual sub-protocol, and second is the issue of "maintenance" of the tree-structure and recruiting coordinators/sender to their "right location" at the "right time".

The checkpoint sub-protocol is only eventual, thus processes potentially exit it at different times. After the first checkpoint (which we call the junior checkpoint), the nodes on level 1 can diffuse the value if they have not done so yet. This fringe diffusion (Phase 3) can be executed at different times. The start of the checkpoint in Phase 4 (the senior checkpoint) is triggered by the current sender, and hence is independent of the times at which processes left the previous junior checkpoint. Similarly, the next iteration of the agreement (after Phase 5) starts with a diffusion originating at the current root.

As is evident from the above discussion, processes need to become active senders if all processes with lower ID have failed. However, the checkpoints used above are only eventual. This raises the issue of how to set these "time-outs" at the end of a checkpoint for the next checkpoint. We approach this issue by having processes compute a "virtual" start-time of the current checkpoint. Basically, if the current sender knows the start-time, it will broadcast it as well and processes will agree on it. If the sender does not know the start-time (because it turned sender as a result of a time-out), then it will set the current time as the virtual start-time. In addition, a process knows its "distance" to the sender according to the previous checkpoint, i.e.
the number of live processes having IDs which are in between the sender's ID and the process's ID. So whenever a process exits the current checkpoint, it computes its next time-out with respect to this start-time and the distance to the sender.

We diffuse three messages: the value-message, the "commit"-message, and finally the "goodbye"-message. After receiving a "commit"-message, a process can use the value. Each diffusion follows the strategy above. The reason for this approach is to avoid costly "wake-up"-messages to a potential sender: whenever a process times out it assumes the role of the sender for the coming iterations. If it was a coordinator before, it remembers the current progress to determine which message (value, commit, or goodbye) to diffuse and how many leaves are still uninformed. If it is a leaf, then it will decide which value to diffuse based on the type of messages it has received so far. At the end of an iteration, it is the task of a sender to recruit leaves to "serve" as coordinators to complete the diffusion-tree after failures.

2.1 Correctness proof and complexity analysis

In the following we outline a proof that the algorithm (1) is a correct protocol for BA and (2) performs within the time (number of rounds) and messages claimed above. In the correctness part, we first need to show that the current structure of the diffusion-tree is well-defined, namely, that all coordinators have the same view of the tree. Once this is accomplished, agreement and validity are easily shown by carefully examining the state (knowledge) of each process after each iteration. The running time analysis follows from the fact that the checkpoint of [DMY94] is early-stopping. The crucial step for the message analysis is to show that if all coordinators (internal nodes) fail in one iteration (and hence all knowledge about which leaves have been informed is lost), the number of messages lost is still small (i.e., $O(\sqrt{n})$) when amortized per failure.
In the following let an epoch denote the maximum over: (1) the time of a failure-free execution of a checkpoint, (2) the time by which a failure can extend a checkpoint, and (3) a diffusion. Obviously, an epoch consists of a constant (independent of $n$) number of rounds. We start by showing that a failure of an arbitrary process at an arbitrary point in time of the algorithm can delay the algorithm by at most a constant number of epochs (denoted by $C_h$). The reason for this is that a failure of a coordinator must show up in the result of the next checkpoint, and the proof is essentially done by a case analysis.

Lemma 2 Any failure can delay the algorithm for at most 8 epochs.

From the above lemma it follows that we should let $C_h = 8$ epochs. Next we show that there is at most one root at any point in time. The proof involves showing that if the current root has failed, at most one process is timing out, where time-outs are in units of $C_h$, the value obtained in Lemma 2.

Lemma 3 There is at most one live level-0 coordinator (root) at any time and it is the lowest live ID.

Now we can show that the diffusion-tree is always well-defined, i.e., each coordinator has the same view of it. By Lemma 3 there is always a well-defined root and hence the checkpoints run correctly. This gives an inductive proof of the following lemma:

Lemma 4 Every coordinator has a consistent view of the diffusion-tree at any time and the following invariant is maintained: if $j$ is a coordinator at time $t$ then all processes $k$ with $k < j$ are either coordinators at the same or a higher level in the tree, or dead.

Given a consistent view of the diffusion-tree, the agreement, validity, and early-stopping properties are proven by induction over the iterations:

Lemma 5 The algorithm achieves agreement.

Lemma 6 The algorithm achieves validity.

Lemma 7 Each process receives a "commit"-message after $O(f + 1)$ rounds.

Finally, we analyze the message performance. The main focus is on showing that if at some point in time all $O(\sqrt{n})$ coordinators fail, the number of messages which need to be resent is bounded by $O(\sqrt{n})$ amortized per failure:

Lemma 8 The total number of messages sent is $O(f n^{1/2} + n)$.

3 CP via a diffusion-tree of height 2

In this section we show how to extend the ideas of the previous section to obtain a protocol for CP with the same asymptotic performance. We start with the same diffusion-tree.
In the first phase, the root diffuses a "query"-message along the tree. This phase is mainly used to obtain a synchronization effect. In the second phase, an incast, starting at the leaves, is performed along the tree. The purpose of the incast is to collect information about the set of live processes. Then, in the third phase, a first checkpoint among the internal processes is performed. After this checkpoint all internal nodes agree on the set of live internal nodes. Those live internal nodes which did not receive the "query"-message and hence did not participate in the incast can now send a "query"-message to their descendants (leaves), and the leaves answer by the incast. In the fourth phase another checkpoint among the internal nodes is executed. Again, this checkpoint yields the set of live internal nodes. From this, (1) the set of leaves unheard of and (2) the set of leaves which have live internal nodes as ancestors, and hence whose status (live/dead) is known, are determined. Based on the size of set (1), the next iteration will either again use a diffusion-tree of height 2 or a tree of height 1. The latter is used whenever the number of unheard leaves is at most $\sqrt{n}$.

If there are no more unheard leaves left, all internal nodes agree on the set of live nodes (denoted $E$ in the introduction) and hence the algorithm can proceed to the diffusion-phase, which is essentially the same as an agreement-protocol. The only difference is that after the original value of $E$ is lost (through failures), the next root cannot simply use a default-value, but needs to restart an incast-iteration to obtain a new set $E$. All techniques and remarks of the last section apply here too.

4 BA and CP via a diffusion-tree of height h

In this section we show how to use a diffusion-tree of arbitrary height $h$ ($1 \le h \le \log n$) and consequently degree $\Theta(n^{1/h})$ for either BA or CP. We use a recursive approach in the sense that we assume having available BA- and CP-protocols for diffusion-trees of heights $h - p$ ($1 \le p < h$). We denote a checkpoint among nodes on levels $0, 1, \ldots, l$ by $CP_l$.

Let us first consider BA. The algorithm again consists of phases. In the first phase the root of the diffusion-tree diffuses its value along the tree. In the second phase, a first checkpoint $CP_{h-1}$ among the internal processes (using the induction hypothesis) is executed. After this checkpoint, internal nodes on level $h - 1$ which might not have received a message from the root in the first phase (in case of a node failing on the path from the root to this node) can now diffuse this value to their descendants (which are leaves of the diffusion-tree). After that, in the third phase, a second internal checkpoint $CP_{h-1}$ is executed. As a result of this second checkpoint the set of leaves which are yet to be informed (the outstanding work) can be computed by each internal process. Based on the size of this set, the height of the tree in the next iteration is determined: if the outstanding work is between $d^{h-(p+1)}$ and $d^{h-p}$ (for $0 \le p \le h - 1$) then the next diffusion-tree has height $h - p$ and the algorithm continues to use a BA-protocol for that tree.

The checkpoint $CP_{h-1}$ is given by the induction hypothesis. As in the $h = 2$ case, this protocol is only eventual. The same issues about managing time arise as previously. The recruiting at the end of an iteration is done as follows: the decision on the height of the next stage is a result of $CP_{h-1}$. However, the information (i.e., the set $E_{h-1}$) provided by $CP_{h-1}$ might be outdated for the actual recruiting of new coordinators. The reason for this is that, as part of $CP_{h-1}$, other recursive instances $CP_{h-p}$ for $2 \le p \le h - 1$ are also active and their recruiting might not be reflected in $E_{h-1}$. But note that the recruiting can be done as part of the first diffusion of the next iteration.
This diffusion always starts at the current root and hence can be made consistent among all levels. This holds since the root at which the diffusion originates has also been part of each recursively called checkpoint $CP_{h-p}$ and hence holds the freshest knowledge, i.e., it can take into account all previous recruitings by those checkpoints. In Figure 4 we consider BA and in Figure 5 CP, both using a diffusion-tree of height $h$.

4.1 Correctness proof and analysis

The following subtleties are new to the proof compared to the previous sections:

(i) A process which fails on level $l$ can potentially be recruited by an instance of $CP_{l-2}$, then by $CP_{l-3}$, $CP_{l-4}$, and so forth. However, a careful analysis will reveal that the total number of times the algorithm recruits a dead process out of a set of $k$ processes is $O(k)$. This is important for the management of time-outs and the message analysis.
(ii) We need an analysis of the recursive structure to give a condition on $d$ which ensures that the message performance is $O(n)$ for a failure-free execution, regardless of the actual value of $h$. We will show that $d > 9$ is sufficient.
(iii) The running time is obviously a function of $h$ and needs to be analyzed.

In the following we sketch the correctness proof and the analysis for CP (which essentially implies BA). We first show that the total number of times the algorithm recruits a dead process out of a set of $n$ processes is $O(n)$. Hence, on average, a process can "negatively influence" the algorithm only a constant number of times. A process negatively influences the algorithm by not sending the messages it is supposed to send (since it is dead). We now consider an adversary which wishes to maximize the total number of times dead processes negatively influence the algorithm. Let $A_d(n, t)$ denote the total number of times our adversary succeeds in having up to $t$ dead processes negatively influence the algorithm running on a tree of degree $d$ with $n$ processes (and hence height $O(\log_d n)$).
We show now that $A_d(n, n) = O(n)$. First we show that a collection of smaller trees does not allow as good a strategy as one big tree $T$ with the same total number of nodes:

Lemma 9 $A_d(n, n) > \sum_i A_d(c_i n, c_i n)$, where $0 < c_i \le 1$, $\sum_i c_i = 1$ and $\exists i : c_i < 1/d$.

Corollary 1 $A_d(n, n) \ge \sum_i A_d(c_i n, c_i n)$, where $0 < c_i \le 1$ and $\sum_i c_i = 1$.

The next lemma shows that if the adversary can only fail a fraction of the processes, it might just as well fail the corresponding top of the tree:

Lemma 10 $A_d(n, \alpha n) = A_d(\alpha n, \alpha n)$, where $0 < \alpha \le 1$.

We are now ready to prove our intermediate goal. The way we achieve it is as follows: we define a greedy strategy recursively as applying it to the $n/d$ internal nodes and then recruiting another $n/d$ nodes from the leaves. This process can be repeated $d$ times and can be proven to yield $O(n)$ "negative influences". We then show that no strategy can do better than the greedy one, using the last two lemmas.

Lemma 11 The total number of times the algorithm relies on a dead process from a subset of $k$ processes is bounded by $O(k)$.

Remember that $C_h$ denotes the unit by which time-outs operate. In order to determine $C_h$ we also need to analyze the running time of $CP_h$ in the failure-free case. The running time can be described by the following recurrence: $T(h) = 5h + 8T(h - 1) + \Theta(1)$, as the solution for $CP_h$ is made up of the "query"-diffusion, the incast, two instances of $CP_{h-1}$ to collect the set $E_h$, and then a value-, "commit"-, and "good-bye"-diffusion, each with two instances of $CP_{h-1}$. The recurrence solves to $T(h) = O(8^h)$:

Lemma 12 In algorithm $CP_h$, in the failure-free case, each process receives a "good-bye"-message after $C \cdot 8^h$ rounds for some constant $C$.

Hence, by Lemmas 11 and 12, it is safe to set $C_h = C \cdot 8^h$. But note that this is a very loose analysis which is expected to be easily tightened.

We now return to the sequence of lemmas as used in Section 2. Time-outs use the value $C_h$ obtained by Lemmas 11 and 12 as their unit, and hence we can prove the following lemma, which corresponds to Lemma 3, in essentially the same way (but using inductive hypotheses rather than our induction basis):

Lemma 13 There is at most one live level-0 coordinator (root) at any time and it is the lowest live ID.
Using the fact that each recruiting decision originates at the current root (done as part of the value- or "query"-diffusion) and that the difference in knowledge between a coordinator on level $l$ and one on level $l + 1$ is only with respect to processes up to level $l$, we can prove the equivalent of Lemma 4:

Lemma 14 Every coordinator has a consistent view of the diffusion-tree at any time and the following invariant is maintained: if $j$ is a coordinator at time $t$ then all processes $k$ with $k < j$ are either coordinators at the same or a higher level in the tree, or dead.

The lemmas corresponding to Lemmas 5 and 6 hold with essentially unchanged proofs, and the following is an easy consequence of Lemma 12:

Lemma 15 Each process receives a "commit"-message after $O((f + 1)8^h)$ rounds.

Finally, we prove the message complexity. The proof of the following lemma consists of the following steps. First, we derive a condition for the degree $d$ such that the number of messages in the failure-free case remains linear. The recurrence describing the number of messages in this case is $M(n) = 5n + 8M(n - d^h)$, where the numbers 5 and 8 arise for the same reason as in the time analysis (see above). From this we derive the condition $d > 9$. Secondly, we need to show that if the tree remains at height $h$ for a next iteration, the number of failures among coordinators on level $h - 1$ amortizes the cost for sending messages among internal nodes. Note that this simple analysis is possible since (1) by the structure of our protocol, failures on smaller levels will be amortized via an instance of a checkpoint on a smaller level, and (2) by Lemma 11 a process can influence (on average) only a constant number of levels. Finally, we show that the number of messages lost if all internal nodes fail is amortized over the number of simultaneous failures:

Lemma 16 The total number of messages sent is $O(f n^{1/h} + n)$.

From the above lemmas we get our main result (Lemma 1).
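For readers who want the arithmetic spelled out, the following worked derivation sketches how the failure-free message recurrence stays linear and how Theorems 1 and 2 follow from Lemma 1. It is a reconstruction under stated assumptions, not the source's proof: it assumes that the $n - d^h$ internal nodes number at most $n/d$ (true for a complete $d$-ary tree), it takes $h = \log_d n$ exactly for Theorem 1, and it only shows that the geometric series needs $d > 8$, whereas the source states the condition $d > 9$.

```latex
% Failure-free messages, assuming n - d^h <= n/d internal nodes:
\begin{align*}
M(n) &\le 5n + 8\,M(n/d)
      \le 5n \sum_{i \ge 0} \left(\frac{8}{d}\right)^{i}
      = \frac{5n}{1 - 8/d} = O(n)
      \qquad \text{for constant } d > 8 .
\end{align*}

% Theorem 2: substitute d = n^{\varepsilon}, h = O(1/\varepsilon) in Lemma 1.
\begin{align*}
\text{time} &= O\!\left((f+1)\,8^{h}\right) = O\!\left((f+1)\,8^{1/\varepsilon}\right), &
\text{messages} &= O(fd + n) = O\!\left(n + f n^{\varepsilon}\right).
\end{align*}

% Theorem 1: constant d, h = \log_d n, and f + 1 \le n.
\begin{align*}
8^{h} &= 8^{\log_d n} = n^{\log_d 8} = n^{3/\log_2 d}, \\
\text{time} &= O\!\left((f+1)\,8^{h}\right) = O\!\left(n^{1 + 3/\log_2 d}\right)
             = O\!\left(n^{1+\varepsilon}\right)
             \quad \text{once } d \ge 2^{3/\varepsilon}, \\
\text{messages} &= O(fd + n) = O\!\left((d+1)\,n\right) = O(n).
\end{align*}
```

The trade-off is visible in the exponent $3/\log_2 d$: a larger constant degree brings the running time closer to linear, while the hidden constant in the $O(n)$ message bound grows with $d$.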

References

[ACP94] T. Anderson, D. Culler, and D. A. Patterson, A case for NOW (Networks of Workstations). IEEE Micro 1994 (special issue).
[AWH92] S. Amdur, S. Weber, and V. Hadzilacos, On the Message Complexity of Binary Agreement Under Crash Failures. Distrib. Comput., 5:175-186.
[B84] G. Bracha, unpublished manuscript. Cornell University, July 1984.
[BJ87] K. Birman and T. A. Joseph, Reliable Communication in the Presence of Failures. ACM Transactions on Computer Systems, 5(1):47-76.
[CFGK95] N. Carriero, E. Freeman, D. Gelernter, and D. Kaminsky, Adaptive Parallelism with Piranha. IEEE Computer, Jan. 95, Vol. 28 (No. 1).
[CM84] J. Chang and N. Maxemchuk, Reliable Broadcast Protocols. ACM Transactions on Computer Systems, 2(3):251-273, August.
[CT90] T. D. Chandra and S. Toueg, Time and Message Efficient Reliable Broadcast. Proc. Int. Workshop on Distributed Algorithms 1990, LNCS, Springer-Verlag.
[DMY94] R. De Prisco, A. Mayer, and M. Yung, Time-Optimal Message-Efficient Work Performance in the Presence of Faults. Proc. 13th ACM Symp. on Principles of Distributed Computing (1994).
[DR85] D. Dolev and R. Reischuk, Bounds on Information Exchange in Byzantine Agreement. Journal of the ACM, 32(1):191-204.
[DRS90] D. Dolev, R. Reischuk, and H. R. Strong, Early Stopping in Byzantine Agreement. Journal of the ACM, 37(4):720-741.
[DHW92] C. Dwork, J. Halpern, and O. Waarts, Performing Work Efficiently in the Presence of Faults. Proc. 11th ACM Symp. on Principles of Distributed Computing (1992).
[DM90] C. Dwork and Y. Moses, Knowledge and Common Knowledge in a Byzantine Environment: Crash Failures. Information and Computation, -186 (1990).
[EABB94] T. von Eicken, V. Avula, A. Basu, and V. Buch, Low-Latency Communication over ATM Networks using Active Messages. Proc. Hot Interconnects II (1994).
[F83] M. J. Fischer, The consensus problem in unreliable distributed systems. Symp. on Math. Founda. of Computer Science, LNCS, Springer-Verlag, 127-139.
[FLP85] M. J. Fischer, N. Lynch, and M. Paterson, Impossibility of Distributed Consensus with One Faulty Processor. Journal of the ACM, 32(2):374-382.
[H83] V. Hadzilacos, A lower bound for byzantine agreement with fail-stop processors. Harvard University Technical Report TR
[H90] V. Hadzilacos, On the Relationships Between the Atomic Commitment and Consensus Problem. Fault Tolerance in Distributed Computing, LNCS 448, Springer-Verlag, 201-208.
[HH93] V. Hadzilacos and J. Halpern, Message-Optimal Protocols for Byzantine Agreement. Math. Systems Theory 26 (1993).
[HT93] V. Hadzilacos and S. Toueg, Fault-tolerant Broadcast and Related Problems. In Distributed Systems, 2nd ed., editor: S. Mullender, Addison-Wesley, 97-145 (1993).
[KS89] P. Kanellakis and A. Shvartsman, Efficient Parallel Algorithms Can Be Made Robust. Distrib. Comput., 5:201-219.
[LF82] L. Lamport and M. J. Fischer, Byzantine Generals and Transaction Commit Protocols. SRI Technical Report Op. 62.
[MB+94] D. Malki, K. Birman, A. Ricciardi, and A. Schiper, Uniform Actions in Asynchronous Distributed Systems. Proc. 13th ACM Symp. on Principles of Distributed Computing, 274-283.

The root initiates a complete diffusion on the diffusion-tree. The value used is (i) the original value if the root is the sender, (ii) the value received last if the root is not the sender and received at least one value-message in an earlier iteration, and (iii) a default-value if the root is not the sender and has not received any value-message before.

A checkpoint $CP_{h-1}$ on the $O(n^{(h-1)/h})$ internal nodes in the diffusion-tree is run. This checkpoint yields the following information:
1. Set 1: the level-$(h - 1)$ processes which are alive but did not receive a message from their ancestor (and hence could not diffuse the value to their descendants, which are leaves in the tree).
2. Set 2: the internal processes which are alive.
3. Agreement on the value of the sender among internal processes.

The processes in set 1 diffuse the value agreed on above.

A checkpoint on the $O(n^{(h-1)/h})$ internal nodes in the diffusion-tree is run. This checkpoint yields the following information:
1. Set 3: the internal processes which are alive (denoted $E_{h-1}$).

From sets 2 and 3 the set of uninformed leaves can be derived. The size of this set constitutes the amount of work left to be done.

If no work is left, the algorithm moves on to diffuse the "commit"-message. Otherwise, the height of the next diffusion-tree depends on the amount of work: if the outstanding work is between $d^{h-(p+1)}$ and $d^{h-p}$ (for $1 \le p \le h - 1$) then the next diffusion-tree has height $h - p$ and degree $d = O(n^{1/h})$.

The root may need to recruit leaves to become internal nodes. This is done during the diffusion of the next iteration. The root uses the freshest knowledge to do that (recruiting is also done along the diffusion-tree). Recruiting is done such that a breadth-first traversal yields monotonically increasing IDs.

Figure 4: BA via tree of height h

The root initiates a "query"-diffusion. The leaves (after receiving a "query"-message) initiate an incast to collect information about live processes.

A junior checkpoint $CP_{h-1}$ on the $O(n^{(h-1)/h})$ internal nodes in the diffusion-tree is run. This checkpoint yields the following information:
1. Set 1: the live level-$(h - 1)$ coordinators which did not receive the "query"-broadcast.
2. Set 2: the live coordinators.

Coordinators in set 1 do another round of "query" and incast (fringe incast).

A senior checkpoint $CP_{h-1}$ on the $O(n^{(h-1)/h})$ internal nodes in the diffusion-tree is run. This checkpoint yields the following information:
1. Set 3: the live coordinators.

From sets 2 and 3 we can compute the set of leaves which are unheard of, i.e., the remaining work.

If there is no more work to do then the algorithm moves on to the diffusion part. Otherwise, the height of the tree in the next iteration depends on the sets above in the same fashion as for BA. Leaves which become internal nodes are "recruited" by the current root, such that a breadth-first search yields processes with monotonically increasing IDs.

The diffusion of the value (the set $E$ of live processes) to the leaves is the same as in agreement, except for the fact that the current root cannot decide on a "default"-value, but has to restart the incast at that point in time.

Figure 5: CP via tree of height h
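The height-selection step shared by Figures 4 and 5 can be sketched as a small function. This is an illustration under assumptions, not the authors' code: the figures say only that the work lies "between" two powers of $d$, so the sketch assumes the interval $d^{k-1} < \text{work} \le d^{k}$ maps to height $k$, caps the result at the current height $h$, and returns 0 to signal that no work is left. The name `next_tree_height` is hypothetical.

```python
def next_tree_height(work: int, d: int, h: int) -> int:
    """Pick the height of the next diffusion-tree from the outstanding work.

    work -- number of leaves still uninformed (or unheard of, for CP)
    d    -- degree of the diffusion-tree
    h    -- height of the current diffusion-tree
    Returns 0 when no work is left (the protocol moves on to the
    "commit"-diffusion), otherwise the smallest height k in 1..h whose
    tree can absorb the work, i.e. work <= d**k (boundary handling is
    an assumption of this sketch).
    """
    if work < 0 or d < 2 or h < 1:
        raise ValueError("expected work >= 0, d >= 2, h >= 1")
    if work == 0:
        return 0
    capacity = d
    for height in range(1, h):
        if work <= capacity:
            return height
        capacity *= d
    return h


# Height-2 case with d = 10 (about sqrt(n) for n = 100):
print(next_tree_height(50, 10, 2))  # many leaves left: stay at height 2
print(next_tree_height(5, 10, 2))   # few leaves left: height 1 suffices
print(next_tree_height(0, 10, 2))   # nothing left: proceed to "commit"
```

For $h = 2$ and $d$ around $\sqrt{n}$ this reduces to Phase (5) of Section 2: more than $\sqrt{n}$ uninformed leaves keep the height-2 tree, fewer switch to a height-1 tree.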
