CONSENSUS IN THE CRASH-RECOVERY MODEL


Rheinisch-Westfälische Technische Hochschule Aachen
Department of Computer Science

CONSENSUS IN THE CRASH-RECOVERY MODEL
DAS CONSENSUS PROBLEM IN VERTEILTEN SYSTEMEN MIT ANHALTE-AUSFÄLLEN UND NEUSTARTS

Diploma Thesis of CHRISTIAN LAMBERTZ
Thesis advisor: Prof. Dr. Felix Freiling
Second examiner: Prof. Dr. Klaus Wehrle

I hereby declare that I have created this work completely on my own and used no other sources or tools than the ones listed, and that I have marked any citations accordingly.

Aachen, December 17th, 2007 (Christian Lambertz)

Contents

Abstract
Zusammenfassung
Acknowledgements

1 Introduction
    Motivation
    Problem Statement
    Result Overview
    Roadmap
    Conventions
        Definitions
        Algorithms

2 Model
    Environment
    Failures
    Stable Storage
    Algorithms
    Communication
        Fair Loss Links
        Stubborn Links
        Quiescence
    Failure Detection
    Synchronization
    The Consensus Protocol
        Consensus Definition
        Uniform Consensus
        Quiescence of Consensus
    Space-Time Diagrams
    Summary

3 Related Work
    Complexity of Consensus
    Solving Consensus in the Crash-Stop Model
        Perfect Failure Detector and One Correct Process
        Eventually Perfect Failure Detector and Correct Majority
        Weakest Failure Detector for Crash-Stop Consensus
    Crash-Recovery Consensus
        Oliveira, Guerraoui, and Schiper
        Hurfin, Mostéfaoui, and Raynal
        Aguilera, Chen, and Toueg
        Lamport's Paxos Algorithm
    Summary

4 New Consensus Algorithms
    Necessary Process State Assumption in the Absence of Stable Storage
    Consensus Algorithms with Synchronizers
        Crash-Stop Model and Synchronizers
        Crash-Recovery Model and Synchronizers
        Summary
    Emulation Technique
        Idea
        The CS-consensus Interface
    Solutions using the Emulation Technique without Stable Storage
        Perfect Failure Detector and One Always Up Process
        Eventually Perfect Failure Detector and Always Up Majority
        Eventually Perfect Failure Detector, Correct Majority, and One Always Up Process
        Eventually Perfect Failure Detector and More Always Up Than Incorrect Processes
        Summary
    Solutions using the Emulation Technique with Stable Storage
        Failure Detector Extension
        Eventually Perfect Failure Detectors still need a Correct Majority of Processes
        Eventually Perfect Failure Detector, Correct Majority, and Last Message Storage
        Weakening of the Process State Assumption
        Summary

5 Repeated Consensus
    Definition
    Adjusting the Emulation Technique
    Repeated Consensus Algorithm
    Combining the Stubborn Links Modules
    Summary

6 Conclusion
    New Results
    Further Ideas

A Stubborn Links
    A.1 Asynchronous Timer
    A.2 Implementation
    A.3 Comparison with Other Definitions

B Recurrent Completeness

C The More Always Up Than Incorrect Processes Assumption

Bibliography
Index

List of Algorithms

3.1 Crash-stop consensus algorithm that uses P
Crash-stop consensus algorithm that uses ◇P
Round based crash-stop consensus algorithm that uses Z
Tick based crash-stop consensus algorithm that uses Z
Consensus algorithm with Z and at least one always up assumption
Emulator with P and at least one always up assumption
Emulator with ◇P and always up majority assumption
Emulator with ◇P, stable storage, and correct majority assumption
Repeated emulator with ◇P, stable storage, and correct majority assumption
A.1 Stubborn Links Implementation
C.1 Consensus algorithm with ◇P and more always up than incorrect assumption

List of Definitions

2.1 Asynchronous System
Protocol
Failure Pattern
Crash-Stop Model
Crash-Recovery Model
Stable Storage
Algorithm
Interface
Links
Fair Loss Links
Interface of Fair Loss Links
Stubborn Links
Interface of Stubborn Links
Quiescence
Failure Detector
Interface of Failure Detectors
2.17 Perfect Failure Detector
Eventually Perfect Failure Detector
Synchronizer
Interface of Synchronizers
Consensus Protocol
Interface of Consensus
Uniform Consensus Protocol
Quiescent Consensus
Interface of CS-consensus
Repeated Consensus
Interface of Repeated Consensus
Interface of CR-consensus
A.1 Asynchronous Timer
A.2 Interface of Asynchronous Timers

List of Figures

1.1 Mobile Phone Example
Consensus Solvability Results without Stable Storage
Consensus Solvability Results with Stable Storage
The Computation Layers of Algorithm
Process Speed Assumption
Unreliability of Communication Channels
Process Failure Classification in the Crash-Stop Model
Process Failure Classification in the Crash-Recovery Model
Fair Loss Links Usage
Stubborn Sending of a Message
Meaning of Stubbornness
Legend of Space-Time Diagrams
The FLP Impossibility
Example Run 1 of Algorithm
Example Run 2 of Algorithm
Example Run of Algorithm
3.5 Comparison of Crash-Recovery Solutions
Result Overview of Section
Example Run of Algorithm
Example Run of Algorithm
Result Overview of Subsection
Example Run of Algorithm
The Consensus Interface
The CS-consensus Interface
Usage of the CS-consensus Interface
Result Overview of Subsection
Example Run 1 of Algorithm
Example Run 2 of Algorithm
Example Run 3 of Algorithm
Result Overview of Subsection
Result Overview of Subsection
Correct Majority and One Always Up Process Impossibility
Result Overview of Subsection
Summary of the Consensus Results without Stable Storage
Result Overview of Subsection
Eventually Perfect Failure Detector, Stable Storage, and One Always Up Process Impossibility
Result Overview of Subsection
4.21 All Messages of Algorithm
Example Run of Algorithm
Result Overview of Subsection
One Correct Process Impossibility
Usage of the CR-consensus Interface
Example Run 1 of Emulation Algorithm
Example Run 2 of Emulation Algorithm
B.1 Insufficiency of Original Failure Detectors in the Crash-Recovery Model

Abstract

This diploma thesis examines the solvability of the consensus problem in asynchronous distributed systems with a specific failure assumption called the crash-recovery model. Processes can crash and recover, but otherwise they behave benignly, and losses in the message exchange are fair. Roughly, to solve consensus, every process in the system proposes a value, and all processes that eventually stop crashing have to decide on a common value, which must be one of the proposed ones. The research on the consensus problem under a weaker failure assumption, the crash-stop model, in which processes stop participating after a crash, is already well advanced. Therefore, instead of developing completely new algorithms, this thesis investigates how already known crash-stop algorithms can be reused in the crash-recovery model with the help of emulation algorithms. Throughout this research, different assumptions are made about the number of faulty processes, the availability of stable storage, and the accessibility of means of synchrony such as failure detectors and synchronizers.


Zusammenfassung

This diploma thesis examines the solvability of the consensus problem in asynchronous distributed systems under the assumption of a particular failure model, called the crash-recovery model. In this model, processes can fail by crashing and restart later, but otherwise behave benignly, and message losses in the network are fair. To solve consensus, every process has to propose a value, and all processes that eventually stop crashing have to agree on a common value that is among the proposed ones. Research on the consensus problem under a weaker failure model, called the crash-stop model, in which processes do not restart after a failure, is well advanced. Instead of developing new algorithms for the crash-recovery model, this thesis examines the reuse of the already known crash-stop algorithms in the crash-recovery model. This reuse is made possible by so-called emulation algorithms. Throughout the examination, different assumptions are made about the number of faulty processes, the availability of stable storage, and the accessibility of means of synchronization such as failure detectors and synchronizers.

Acknowledgements

I would like to thank all the people who supported me with technical advice and personal encouragement while I was writing my diploma thesis. I thank my advisor Prof. Dr. Felix Freiling for the introduction to the topic of reliable distributed computing and for the time he spent reading and discussing preliminary versions of this work. Furthermore, thanks to Prof. Dr. Klaus Wehrle for evaluating my diploma thesis as second examiner.

Chapter 1 Introduction

Distributed systems can be found everywhere nowadays, from small personal networks up to big computing environments. The most famous and biggest one connects billions of single computers and coined the term Internet. As more and more important computations are executed on large groups of connected machines, reliable communication and agreement among the single systems become essential. Booking reservations, online payment, and banking are only three examples that rely on agreement of the participating parties. Such agreements often take place among distributed databases, which have to either commit or abort transactions that change important data, e.g., the withdrawal of money from bank accounts. In this thesis, the basic abstraction of all these forms of agreement problems is studied: the consensus problem. Before explicating the exact problem, an example of consensus in the world of mobile phones is presented to motivate the problem and to raise interest in studying it.

1.1 Motivation

Consider five persons equipped with mobile phones who want to agree on a meeting point and time. They communicate via SMS messages only (SMS abbreviates short message service in mobile phone networks), i.e., they do not call each other and talk directly. Everybody suggests his or her preferred meeting data in a message and is allowed to send any number of additional messages in order to achieve agreement on the meeting. Unfortunately, all of them bought the same mobile phone, which was very cheap and sometimes becomes extremely slow, i.e., the input of text can take a long time. Even worse, occasionally a whole phone freezes and a person must reset it. But this reset does not always help, i.e., a phone can freeze forever, and the owner can no longer attempt any communication with the others. Furthermore, their mobile provider is currently working heavily on the network. Thus, the network can be slow and occasionally fails, i.e., it loses a message without notifying the sender or receiver.

The intention of the persons is that everybody with a working phone eventually knows the same meeting point and time. A person whose phone froze forever is not required to arrive; he or she has a good excuse. But actually, all persons want to come to the meeting. Figure 1.1 illustrates the situation: the nodes depict the persons' mobile phones, and the lines between the nodes depict the mobile network.

[Figure 1.1: Communication situation of the five persons. Annotations: Phones can be slow. Sometimes they freeze. Resets are a remedy, but do not always help. The network can be slow. Messages can get lost.]

Interestingly, the persons are unable to solve the problem, i.e., to ensure that everybody with a working mobile phone arrives at the meeting, even if all persons have a working phone. But why are they unable to solve this simple task, and which additional assumptions are required to solve it? The research on these assumptions is the topic of this thesis. The mobile phones of the persons are abstracted as processes and the mobile network as communication channels between these processes. The possible failures of the processes are abstracted as failure models, which provide a precise definition of the faulty behavior.

1.2 Problem Statement

Processes in a distributed system have to agree on a common value to solve the consensus problem. To this end, each of the processes provides one input value, called the proposal. The agreed value, called the decision, has to be in the set of the input values. The processes are allowed to send messages over unreliable communication channels in order to reach a decision. Of course, not all messages are lost by the channels; some delivery guarantees, called fair loss links, are provided by the distributed system.

Two failure models abstract the faulty behavior of the processes. The first one, called the crash-stop model, allows a process to crash only once. After this crash, a process is not allowed to take any further step. The second one, called the crash-recovery model, allows several crashes with recoveries in between. A crashed process is allowed to recover, but it loses all pre-crash information that was not previously stored in a special location, called stable storage. The failure models group the processes into sets of correct processes and incorrect processes. For example, in the crash-recovery model, the correct processes either never crash (always up processes) or eventually stop crashing (eventually up), and the incorrect ones either are eventually crashed permanently (eventually down) or oscillate forever between crashes and recoveries (unstable).

Consensus is studied in distributed system abstractions that consist of a failure model, an assumption of how many processes are correct and incorrect, and a means of synchrony, which provides the processes with additional information. Note that without such a means, consensus is not solvable, as already hinted at in the mobile phone example. The available means of synchrony in this thesis are failure detectors and synchronizers. A failure detector provides the processes with the ability to suspect any process of belonging to the set of incorrect processes. Two classes of failure detectors are studied: the perfect failure detector, which makes no suspicion mistakes, and the eventually perfect failure detector, which is allowed to make finitely many suspicion mistakes.

A synchronizer provides the processes with synchronization points at which all previously sent messages are delivered. Of course, with the help of these synchronization points, the processes can also detect incorrect processes, e.g., if no more "I am alive" messages are received from a certain process, this process has presumably crashed.

The following section provides an overview of the results of this thesis. The results focus on the crash-recovery model only, because the research on the crash-stop model is already well advanced, and the ideas of the crash-stop research are reused in the new ideas of this thesis.

1.3 Result Overview

The following two figures 1.2 and 1.3 on pages 4 and 5 summarize the studied cases of consensus solvability in the crash-recovery model.

No stable storage | Eventually perfect failure detector | Perfect failure detector | Synchronizer
one correct (n crashes) | impossible | impossible | impossible
correct majority (n crashes) | impossible | impossible, see 4.1/p. 52 | impossible, see 4.1/p. 52
one always up (n − 1 crashes) | impossible | emulator, see 4.4.1/p. 65 | algorithm, see 4.2.2/p. 58
correct majority & one always up (n − 1 crashes) | impossible, see 4.4.3/p. 74 | solvable | solvable
more always up than incorrect (n_bad crashes) | algorithm, see app. C/p. 113 | solvable | solvable
always up majority ((n − 1)/2 crashes) | emulator, see 4.4.2/p. 71 | solvable | solvable

Figure 1.2: Consensus solvability results in the crash-recovery model without the availability of stable storage.

In this thesis, three parameters can be adjusted to study the consensus solvability. These are the availability of stable storage, a process state assumption, and a means of synchrony, i.e., failure detectors and synchronizers. Since stable storage is either available or not, two tables are used to present the results in a clearly arranged manner. The topmost row in the tables determines the means of synchrony used, and the leftmost column determines the assumed presence of correct processes. The number of processes that are allowed to crash simultaneously is enclosed in parentheses under the presence assumption. The red arrow that connects some cells depicts a logical implication, i.e., if the source cell of the arrow contains a particular result, the result in the destination cell of the arrow logically follows from the first one.

Stable storage | Eventually perfect failure detector | Perfect failure detector | Synchronizer
one correct (n crashes) | impossible | impossible, see 4.5.4/p. 89 | impossible, see 4.5.4/p. 89
correct majority (n crashes) | emulator, see 4.5.3/p. 82 | solvable | solvable
one always up (n − 1 crashes) | impossible, see 4.5.2/p. 80 | trivial, see w/o stable storage | trivial, see w/o stable storage
correct majority & one always up (n − 1 crashes) | emulator, see 4.5.3/p. 82 | solvable | solvable
more always up than incorrect (n_bad crashes) | trivial, see w/o stable storage | solvable | solvable
always up majority ((n − 1)/2 crashes) | trivial, see w/o stable storage | solvable | solvable

Figure 1.3: Consensus solvability results in the crash-recovery model with the availability of stable storage.

The single results can be divided into four groups. Impossibility results are simply denoted by impossible, and solutions that follow trivially from another one are denoted by solvable. The cases that are solvable with a special algorithm, which uses another algorithm from the crash-stop model, are denoted by emulator, and direct algorithms are simply denoted by algorithm. These emulation algorithms are the main idea of this thesis, and most of the work is spent on their study. The two figures 1.2 and 1.3 are later used in a condensed form to guide through this research. In all cases, the aim of the research is simplicity over efficiency, i.e., the solutions sacrifice performance rather than comprehensibility, although some performance improvements are obvious and could easily be added.

1.4 Roadmap

In chapter 2, the complete computational model and environment are precisely defined. The above-mentioned crash-stop and crash-recovery failure models, communication channels, failure detectors, and synchronizers are explained in more detail. Of course, the consensus problem is also defined. The chapter forms the foundation for the whole research in the succeeding chapters. Thus, all important definitions are presented both in a regular sentence and in a logical notation in order to emphasize their exact meaning.

The reason for the impossibility of consensus without a means of synchrony is explicated in chapter 3, and several solutions in the crash-stop and crash-recovery model are discussed. Two crash-stop model based algorithms are presented in detail, because they are important for the mentioned emulation idea. The complexity of the described crash-recovery solutions motivates the idea of emulation in the next chapter.

In chapter 4, new approaches to consensus solvability are studied, based on the idea of emulating the crash-stop consensus solutions in the crash-recovery model. The whole research is oriented toward the two figures 1.2 and 1.3, and all cases are discussed step by step, e.g., first the processes are denied the use of stable storage, and later they are allowed to use it.

The emulation idea is further developed in chapter 5 in order to solve an extension of the consensus problem, called repeated consensus, in which the processes have to solve several, possibly simultaneous, instances of consensus. Note that chapters 4 and 5 contain the original contributions of this thesis.

The last chapter 6 concludes the thesis and discusses further ideas. Three appendices contain additional information that is taken out of the main part of the thesis. The reason for this can be found in the text when the corresponding appendix is mentioned for the first time.

An index at the end of the thesis provides quick access to the properties of the definitions of chapter 2, whose exact meaning becomes important in several proofs.

1.5 Conventions

Throughout this thesis, the following conventions are used for the definitions and for the notation of algorithms in pseudo code. Pseudo code is used as a high-level description of algorithms because it allows unnecessary details of real programming languages, such as special variable declarations and obscure syntax, to be neglected.

1.5.1 Definitions

Definitions are presented in a regular explanation and a logical notation. This logical notation is similar to first-order logic, but with the addition of several available domains for the variables. The two available quantifiers bind variables to these domains with the element symbol, i.e., with terms of the form $\forall n \in \mathbb{N}$ and $\exists n \in \mathbb{N}$, respectively. All quantifiers appear first in any logical formula in order to avoid mixing the important part with the quantification of variables. Note that the end of a definition is indicated by more vertical space between the last line of the definition and the next paragraph, but the end also becomes clear from the context. A list of all definitions of this thesis is provided on page xiii after the table of contents.

1.5.2 Algorithms

Algorithms are written in pseudo code similar to the notation used in [Cormen et al., 2001]. The following pseudo code conventions are used:

1. Indentation indicates a block structure in order to increase the readability of the code and to reduce clutter. In real programming languages, this is typically done by bracketing or special begin and end statements; such marks are completely omitted here.
2. The symbol ▷ indicates that the remainder of the line is a comment.
3. If a line number is missing in a code listing, the unnumbered line belongs to the previous numbered one.
4. Variables are uncapitalized, e.g., message m. A set of variables is capitalized, e.g., set of messages M. Tuples of variables, e.g., (a, b), are also possible. Assignments are depicted by the symbol ←, e.g., m ← 0 and M ← {0}.
5. The symbol ⊥ is the default value of any variable.
6. The number of elements in a set M can be determined by |M|. The empty set is denoted by ∅ with |∅| = 0. Sets are duplicate free, i.e., |{0, 1, 1}| = |{0, 1}| = 2.
7. Text-strings are written in small capitals, e.g., TEXT.
8. Arrays are denoted by a[], and their elements can be accessed by a[i], with $i \in \mathbb{N}$. If an element has not been set before, the value ⊥ is returned. An array can be reset by the statement a[] ← ⊥.
9. A multiple assignment of the form a ← b ← c assigns to both variables a and b the value of expression c.
10. Because this thesis studies distributed systems and therefore message exchanges, a special syntax for messages is used. The content of a message m can be accessed by $m = \langle v_1, v_2, \ldots, v_i \rangle$, where $v_1, v_2, \ldots, v_i$ are variables that reference the values of the corresponding slots in message m.
11. The boolean operators and and or are short-circuiting. If the expression a and b is evaluated, a is evaluated first and b is only evaluated if a is true, because if a is false, the whole expression can only be false. If the expression c or d is evaluated, c is evaluated first and d is only evaluated if c is false, because if c is true, the whole expression is true regardless of the evaluation of d.

The listing of algorithms is similar to the notation of [Guerraoui and Rodrigues, 2006]. Algorithm 1.1 shows the general structure of algorithms in this thesis. Algorithms always implement protocols, which offer request and indication events. A caller of a protocol, denoted by the term application, can ask for information with request events, and the protocol answers with corresponding indication events. Figure 1.4 depicts the layers of algorithm 1.1. Note that the terms algorithm, protocol, and event are all defined precisely in the next chapter.

Algorithm 1.1 Example Algorithm
    Implements: Protocol X (start, stop)
    Uses: Message Exchange (send, receive)
    Assumption: at least one correct process.
    The same algorithm runs on every process p ∈ {1, ..., n}.

    1: upon start do
    2:     send message EXAMPLE to all
    3: upon receive message EXAMPLE do
    4:     stop

The header of the pseudo code listing contains information about the protocol that is implemented by the algorithm, a list of other used protocols and their events, and possible assumptions that must hold in order to satisfy all properties of the implemented protocol.

The event model of the algorithms works as follows. At every computational step and each time the input changes, the listing is executed top down, i.e., the first upon statement that matches the new input is executed. Here, input refers to any new information, e.g., received messages, calls from higher application layers, and notifications from lower protocols.

[Figure 1.4: Illustration of the computation layers of algorithm 1.1. Layers from top to bottom: Application (caller), Protocol X, Message Exchange. Requests: start, send. Indications: stop, receive. The red arrows illustrate the events occurring in algorithm 1.1.]

Note that a list of all algorithms of this thesis is provided on page xi and a list of all figures on page xv.
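The event model and the structure of algorithm 1.1 can be sketched in Python. This is only an illustration under stated assumptions, not the notation of the thesis: the classes `Network` and `ExampleProcess`, the method names, and the tuple encoding of events are hypothetical, and the message exchange layer is assumed to be loss free to keep the sketch short.

```python
from collections import deque


class Network:
    """Hypothetical message exchange layer; assumed loss free in this sketch."""

    def __init__(self):
        self.processes = []
        self.pending = deque()  # queued (receiver, event) pairs

    def send_to_all(self, message):
        # "send message ... to all" includes the sender itself.
        for process in self.processes:
            self.pending.append((process, ("receive", message)))

    def run(self):
        # Deliver queued messages until none are left.
        while self.pending:
            process, event = self.pending.popleft()
            process.on_input(event)


class ExampleProcess:
    """Sketch of algorithm 1.1 running on one process."""

    def __init__(self, network):
        self.network = network
        self.stopped = False  # set when the stop indication is triggered
        network.processes.append(self)
        # Handlers are kept in listing order, mirroring the upon statements.
        self.handlers = [
            (("request", "start"), self.upon_start),
            (("receive", "EXAMPLE"), self.upon_receive_example),
        ]

    def upon_start(self):
        self.network.send_to_all("EXAMPLE")

    def upon_receive_example(self):
        self.stopped = True

    def on_input(self, event):
        # Execute the listing top down: the first matching handler runs.
        for pattern, handler in self.handlers:
            if pattern == event:
                handler()
                return True
        return False


network = Network()
processes = [ExampleProcess(network) for _ in range(3)]
processes[0].on_input(("request", "start"))
network.run()
```

After the run, every process has received EXAMPLE and triggered its stop indication. An input that matches no upon statement is simply ignored.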

Chapter 2 Model

In this chapter, the computational model is introduced. The model abstracts real distributed systems and failures in order to be able to study the solvability of consensus among certain failure models in the next chapters. All assumptions and properties are defined in a full sentence and by a logical term. This helps to clarify special cases and to emphasize the exact meaning of properties.

2.1 Environment

The first definition introduces the basic abstraction of this thesis, the asynchronous system. Henceforth, all definitions rely on this system.

Definition 2.1 (Asynchronous System): An asynchronous system is a distributed system that consists of a set of n processes, denoted by $\Pi = \{1, \ldots, n\}$. This set is static, i.e., no additional processes are present, and all processes are aware of each other. The computation speed of each single process is unknown; while one process takes a step, another process can take any finite number of steps. A discrete global clock, $T \subseteq \mathbb{N}$, exists in the system. Every step in the system corresponds to a tick, $t \in T$, of this clock. The processes do not have access to the clock; it just simplifies the representation of the system. Furthermore, all processes are connected to a network (so-called links between the processes) and communicate by message exchange, i.e., the system provides methods for sending and receiving messages over the network. The channels of the network are unreliable, i.e., messages can get lost and transmission delays are unknown.
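The unreliable channels of definition 2.1 can be illustrated with a small simulation sketch. The class `UnreliableChannel` and its parameters `loss_probability` and `max_delay` are hypothetical names introduced here. In particular, the bound `max_delay` is only a simulation convenience; in the model itself, the delay is unknown to the processes.

```python
import random


class UnreliableChannel:
    """Hypothetical lossy channel with random delivery delays (a sketch)."""

    def __init__(self, loss_probability, max_delay, rng=None):
        self.loss_probability = loss_probability
        self.max_delay = max_delay  # simulation bound, not part of the model
        self.rng = rng or random.Random()
        self.in_flight = []  # list of (delivery_tick, message)

    def send(self, message, now):
        # The message may be lost without notifying sender or receiver.
        if self.rng.random() < self.loss_probability:
            return
        delay = self.rng.randint(1, self.max_delay)
        self.in_flight.append((now + delay, message))

    def receive(self, now):
        # Return all messages whose delivery tick has been reached.
        due = [m for tick, m in self.in_flight if tick <= now]
        self.in_flight = [(tick, m) for tick, m in self.in_flight if tick > now]
        return due


channel = UnreliableChannel(loss_probability=0.3, max_delay=5, rng=random.Random(7))
channel.send("m1", now=0)
received = [m for tick in range(1, 6) for m in channel.receive(tick)]
```

A receiver that polls `receive` and sees nothing at some tick cannot tell whether "m1" was lost or is still in flight, which is exactly the uncertainty the model describes.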

Figure 2.1 illustrates the process speed assumption in asynchronous systems, which states that no process can take infinitely many steps while any other process takes only one.

[Figure 2.1: Process speed assumption in asynchronous systems. Annotation: p_1 takes at most a finite number of steps.]

Figure 2.2 depicts the unreliability of the communication channels in asynchronous systems. A process that waits for a message cannot distinguish between an unknown transmission delay and the loss of the message.

[Figure 2.2: Illustration of the unreliability of the assumed communication channels in asynchronous systems. The transmission delay is unknown, as in the case of message m_1, and messages can get lost, as in the case of message m_2. The problem is the uncertainty whether the message is lost or just needs more time until its delivery.]

The next task is the definition of the purpose of an asynchronous system. As mentioned in the introductory chapter, the processes in such a system should solve problems. This is accomplished by algorithms, which describe the action of each process. A precise definition of algorithms follows in section 2.4 of this chapter. The problems that processes in a distributed computing environment have to solve are called protocols in this thesis and are defined as follows.

Definition 2.2 (Protocol): A protocol is a task that a set of processes has to solve in an asynchronous system. Usually, the protocol defines properties that have to be satisfied in order to solve the task. Two classes of properties exist:

Liveness properties: Something good happens. Liveness properties typically describe anything that has to happen eventually during a computation, e.g., termination of an algorithm.

Safety properties: Nothing bad happens. Safety properties typically describe anything that must not happen during a computation, e.g., wrong output of an algorithm.

This classification of properties was formalized by Alpern and Schneider [1985]. They also proved that any problem specification, which is called protocol in this thesis, can be written as a conjunction of safety and liveness properties. An algorithm implements a protocol if it solves the task and meanwhile fulfills all properties.

If the processes are not caught in a deadlock situation, i.e., the algorithms are well designed in order to avoid deadlocks, and all processes work flawlessly, every basic protocol could be solved. Assuming message delivery guarantees, the processes could just wait long enough until all communication problems have died out, i.e., all messages are delivered. Since no failures occur inside the processes, all liveness properties are fulfilled eventually.

2.2 Failures

A flawless distributed system is unrealistic. Failures can happen at all times; therefore, the remaining system should react in some way to complete its tasks. In order to understand possible failures, the following definition abstracts a common failure, the crash of a process, which heavily influences the solvability of protocols, as studied later.

Definition 2.3 (Failure Pattern): A failure pattern F is a function that determines the set of processes that are not functioning (are down) at a particular time of the global clock. Thus, $F: T \to 2^{\Pi}$ and $F(t) = \{p \in \Pi \mid p \text{ is down at } t\}$. If process $p \notin F(t)$, p is up at time t. Furthermore, if process $p \notin F(t-1)$ and $p \in F(t)$, p crashes at time t, denoted by $p \in F_c(t)$. If process $p \in F(t-1)$ and $p \notin F(t)$, p recovers at time t, denoted by $p \in F_r(t)$.
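Definition 2.3 can be made concrete with a minimal Python sketch. The helper names `make_failure_pattern`, `crashes_at`, and `recovers_at`, as well as the encoding of down times as half-open tick intervals, are assumptions of this sketch and do not appear in the thesis.

```python
def make_failure_pattern(down_intervals):
    """Build F from a dict: process id -> list of half-open (start, end)
    tick intervals during which the process is down."""

    def F(t):
        return {
            p
            for p, intervals in down_intervals.items()
            if any(start <= t < end for start, end in intervals)
        }

    return F


def crashes_at(F, t):
    """F_c(t): processes that are up at t - 1 and down at t."""
    if t < 1:
        raise ValueError("t - 1 must be a valid tick")
    return F(t) - F(t - 1)


def recovers_at(F, t):
    """F_r(t): processes that are down at t - 1 and up at t."""
    if t < 1:
        raise ValueError("t - 1 must be a valid tick")
    return F(t - 1) - F(t)


# Process 2 is down during ticks 3 and 4; process 3 is down from tick 4 on.
F = make_failure_pattern({2: [(3, 5)], 3: [(4, float("inf"))]})
```

With this pattern, process 2 crashes at tick 3 and recovers at tick 5, while process 3 crashes at tick 4 and never recovers.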
This definition enables the description of failures occurring in a computation. But as mentioned before, if no failure occurs in the asynchronous system, the processes could solve protocols easily. Such algorithms have only to deal

38 14 2 Model with message losses and can wait for all processes to finish their computation, since no processes crashes. But this scenario is unrealistic. Therefore, the following two failure models are introduced, which extend the asynchronous system model with possible failures. The first failure model allows processes to crash once, but thenceforward they have to stop to participate in the computation of the system. Definition 2.4 (Crash-Stop Model): In the crash-stop model processes can fail by crashing and do not recover. According to a failure pattern F, this means if process p F(t c ), i.e., p crashes at time t c, then t T : t t c p F(t). Thus, the processes can be classified in two groups: Correct or good processes that do not crash. These processes are denoted by good(f) = {p Π t T : p / F(t)}. Incorrect or bad processes that crash. They are denoted by bad(f) = {p Π t T : p F(t)}. Figure 2.3 illustrates the possible failure behavior of the processes in the crash-stop model. p 1 up } correct p 2 down } incorrect Time t: t Figure 2.3: Classification of process failure behavior in the crash-stop model. The second failure model allows processes to crash and later recover. Definition 2.5 (Crash-Recovery Model): In the crash-recovery model processes can fail by crashing, but later recover. So once they recovered, they can crash again then recover and so on. Because of this behavior, four groups of processes can be distinguished (according to a failure pattern F): Always up processes that never crash. Eventually up processes that crash and recover finitely often, but eventually remain up. Eventually down processes that crash and recover finitely often, but eventually remain down.

Unstable processes that crash and recover infinitely often.

The always up and eventually up processes together are named correct or good processes and are denoted by:

$good(F) = \{p \in \Pi \mid \forall t \in T\colon p \notin F(t)\} \cup \{p \in \Pi \mid \exists s \in T\ \forall t \in T\colon t > s \Rightarrow p \notin F(t)\}$

Similarly, the eventually down and unstable processes are named incorrect or bad processes. They are denoted by:

$bad(F) = \{p \in \Pi \mid \exists s \in T\ \forall t \in T\colon t > s \Rightarrow p \in F(t)\} \cup \{p \in \Pi \mid \forall t \in T\ \exists t' \in T\colon t' > t \wedge ((p \in F(t) \wedge p \notin F(t')) \vee (p \notin F(t) \wedge p \in F(t')))\}$

Figure 2.4 illustrates the possible failure behavior of the processes in the crash-recovery model.

[Figure 2.4: Classification of process failure behavior in the crash-recovery model. Processes $p_1$ (always up) and $p_2$ (eventually up) are correct; processes $p_3$ (eventually down) and $p_4$ (unstable) are incorrect.]

The two definitions of failure models are similar; they even use the same terminology for sets of processes with the same faulty behavior. Thus, in the following it is always made clear which model is used, in order to avoid confusion. On the other hand, this similarity helps to differentiate between the models directly. Note that the two failure models differ considerably in the algorithmic complexity they demand. In the latter model, the algorithms have to deal with recoveries and, even worse, with unstable processes, i.e., there may be no point in time after which no more crashes occur and the system stabilizes.
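The four groups of Definition 2.5 refer to infinite behavior, so they cannot be computed from a finite observation. A minimal sketch is still possible under the assumption (made here only for illustration, not in the thesis) that the up/down behavior of a process is eventually periodic: a finite prefix followed by a cycle that repeats forever. The function names classify and is_good are hypothetical.

```python
from typing import Sequence


def classify(prefix: Sequence[bool], cycle: Sequence[bool]) -> str:
    """Classify one process in the crash-recovery model.

    Each entry is True if the process is down at that tick. The behavior is
    `prefix` followed by `cycle` repeated forever (an assumption of this sketch).
    """
    if not cycle:
        raise ValueError("cycle must contain at least one tick")
    if all(cycle):
        return "eventually down"   # down forever after the prefix
    if any(cycle):
        return "unstable"          # switches between up and down forever
    # From here on the process is up forever after the prefix.
    return "eventually up" if any(prefix) else "always up"


def is_good(prefix: Sequence[bool], cycle: Sequence[bool]) -> bool:
    """Good (correct) processes are the always up and eventually up ones."""
    return classify(prefix, cycle) in ("always up", "eventually up")
```

For example, classify([False, True, False], [False]) describes a process that crashes once, recovers, and then stays up, so it is eventually up and therefore good.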

2.3 Stable Storage

If an algorithm runs in the crash-recovery model and a process crashes and later recovers, this process loses all variables due to the crash, and therefore its whole state. However, some of these variables could be important for the whole run of the algorithm, so processes in the crash-recovery model have the ability to save variables on stable storage, which is preserved during a crash period.

Definition 2.6 (Stable Storage): The processes have access to stable storage in order to store values and to retrieve them after a recovery. For this purpose, special save and load functions, which are defined in the upcoming definition 2.8, are provided by the system.

Note that access to stable storage is typically expensive. Thus, the use of storage operations should be minimized, and stored values should be as small as possible.

2.4 Algorithms

In this section, the often mentioned term algorithm is defined precisely. This is important in order to understand the behavior of runs of these algorithms in asynchronous systems.

Definition 2.7 (Algorithm): An algorithm in a distributed system is a set of n automata, one for each process. These automata proceed in atomic steps, which change the state (a representation of the local variables) of each process. Every step corresponds to a tick $t \in T$ of the global clock. According to a failure pattern F, the automaton at a process p takes a step (a so-called normal step) at a time t only if $p \notin F(t)$. Otherwise, p is in a special state at time t, the crash state. In this state, all local variables not on stable storage are lost. If a process p crashes at time t ($p \notin F(t-1) \wedge p \in F(t)$), p takes a so-called crash step at time $t-1$. If p recovers at time t, it takes a special recovery step first and can then take a normal step again at time $t+1$. At one normal step a process can perform two of the following six actions:

Send/receive a message to/from the network.

Set/get an external output/input.

Save/load a set of variables on/from its stable storage.

These actions are denoted by the set A = {send, receive, set, get, save, load}. Furthermore, in addition to these actions, each automaton can perform any finite number of local operations that do not affect other processes at every step. Crashes are assumed to happen fairly, i.e., a process can always complete a step. The crash happens either before or after a step, but not in between. Of course, this is only valid if a process does not compute for too long at one step, i.e., although the number of local operations is arbitrary, a process can only take a reasonable number of them in order to avoid crashing in the middle of a step.

As mentioned before, an algorithm implements a protocol if it fulfills all its properties. The following definition explicates how an algorithm can achieve this, i.e., how it can access the input of a calling application and output the results of its computation.

Definition 2.8 (Interface): An interface of a protocol provides events to processes. The set of all events of a protocol is denoted by E. Furthermore, two types of events are distinguished:

Request event: The algorithm has to handle a request of its caller.

Indication event: The algorithm informs its caller about the event.

If an algorithm implements a protocol, it has to implement all request events and trigger indication events to its caller. Additionally, every protocol provides the following four events:

Request init: Used to initialize variables at the beginning of the computation.

Request recover: Used to handle recoveries of crashed processes.

Indication save variables $v_1, \ldots, v_i$ at register LOCATION: Used to save variables $v_1, \ldots, v_i$ on stable storage at register LOCATION.

Indication load variables $v_1, \ldots, v_j$ at register LOCATION: Used to load data into variables $v_1, \ldots, v_j$ from stable storage at register LOCATION.
Note that if an algorithm does not implement all request events and one of them occurs, nothing bad happens, but the algorithm misses the occurring event. If an algorithm does not use save or load events, the algorithm does not use stable storage at all.
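The interplay of the init, recover, save, and load events can be sketched as follows. This is only an illustration under assumptions: the classes StableStorage and CounterProcess, the register name "ROUND", and the dictionary-based storage are hypothetical and stand in for whatever the system actually provides.

```python
class StableStorage:
    """Toy stable storage: named registers that survive a crash (hypothetical)."""

    def __init__(self):
        self._registers = {}

    def save(self, location, **variables):
        # Overwrite the register with a copy of the given variables.
        self._registers[location] = dict(variables)

    def load(self, location):
        # Return a copy; an unused register yields no variables.
        return dict(self._registers.get(location, {}))


class CounterProcess:
    """A process whose round number must survive crashes."""

    def __init__(self, storage):
        self.storage = storage
        self.on_init()

    def on_init(self):
        # Request init: set up the volatile variables.
        self.round = 0
        self.scratch = []

    def step(self):
        # One normal step: local work plus a single save action.
        self.round += 1
        self.scratch.append(self.round)
        self.storage.save("ROUND", round=self.round)

    def crash(self):
        # Crash state: every variable not on stable storage is lost.
        self.round = None
        self.scratch = None

    def on_recover(self):
        # Request recover: re-initialize, then restore what was saved.
        self.on_init()
        self.round = self.storage.load("ROUND").get("round", 0)
```

Only the round number is written to storage, in line with the remark that stored values should be as small as possible; the scratch list is deliberately lost in a crash.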

2.5 Communication

As already mentioned in definition 2.1 of asynchronous systems, the processes are able to communicate through a message exchange network, which is referred to as the links between the processes. The following definition formalizes these links.

Definition 2.9 (Links): The processes in an asynchronous system are connected by links and communicate via message exchange, which is unreliable, i.e., messages can get lost and transmission delays are unknown. The set of all messages that are sent through the network is denoted by M. Every message $m \in M$ is labeled with a unique identifier, which is provided by the system automatically. The sending of a message by a process to another one at a time is denoted by the function $S\colon \Pi \times \Pi \times T \to M$ with:

$S(p, q, t) = \begin{cases} m, & \text{if process } p \text{ sends message } m \text{ to process } q \text{ at time } t \\ \bot, & \text{else} \end{cases}$

The receiving of a message by a process from another one at a time is denoted by the function $R\colon \Pi \times \Pi \times T \to M$ with:

$R(q, p, t) = \begin{cases} m, & \text{if process } q \text{ receives message } m \text{ from } p \text{ at time } t \\ \bot, & \text{else} \end{cases}$

Note that the receiving of two different messages at the same time is excluded by this definition, and no process has access to the set M. The basic links provide no delivery guarantees, and message loss fairness is therefore added in the next subsection.

Fair Loss Links

The unreliability of the links makes solutions of protocols impossible. With such links, all messages could get lost, and no two-sided communication would be possible. In order to solve this problem, some minimal delivery guarantees of the links are assumed. The message losses should be fair, i.e., not all messages should get lost. The following definition explicates such fair loss links, which were introduced by Lynch [1996] as the strong loss limitation property of communication channels. Henceforth, these links are assumed to be present in every asynchronous system.

Definition 2.10 (Fair Loss Links): The links of an asynchronous system are fair loss links if the following properties are satisfied:

Property 2.1 (Fair Loss): If a process p infinitely often sends a message m to a good process q, then q receives m an infinite number of times.
Formal: $\forall m \in M\ \forall p, q \in \Pi\ \forall t \in T\ \exists t_s, t_r \in T\colon q \in good(F) \wedge S(p, q, t) = m \wedge S(p, q, t_s) = m \wedge t_s > t \Rightarrow R(q, p, t_r) = m \wedge t_r > t$

Property 2.2 (Finite Duplication): If a process p only finitely often sends a message m to a process q, then q receives m only a finite number of times.
Formal: $\forall m \in M\ \forall p, q \in \Pi\ \exists s \in T\ \forall t, t' \in T\colon t' > t > s \wedge S(p, q, t) \neq m \Rightarrow R(q, p, t') \neq m$

Property 2.3 (No Creation): If a process q receives a message m from a process p, then m was previously sent to q by p.
Formal: $\forall m \in M\ \forall p, q \in \Pi\ \forall t_r \in T\ \exists t_s \in T\colon R(q, p, t_r) = m \Rightarrow S(p, q, t_s) = m \wedge t_s < t_r$

Algorithms can use fair loss links via the following interface.

Definition 2.11 (Interface of Fair Loss Links): If an algorithm wants to use fair loss links, it has to implement the following indication event and can use the following request event:

Request FL-send message m to process q: Used to send message m to process q.

Indication FL-receive message m from process q: Used to receive messages.

Figure 2.5 shows how fair loss links can be used in algorithms in order to guarantee message delivery. The sender has to transmit the message infinitely often.

[Figure 2.5: Usage of the fair loss property: process $p_1$ sends message m to process $p_2$ infinitely often.]
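The use of fair loss links described above can be sketched with a toy link whose losses are deterministic. The class FairLossLink and its drop rule (only every k-th transmission of a message gets through) are assumptions chosen for this sketch so that the behavior is reproducible; real links lose messages unpredictably.

```python
from collections import defaultdict


class FairLossLink:
    """Toy one-directional link from p to q with fair but heavy message loss.

    Assumption of this sketch: all but every k-th transmission of a given
    message are dropped. A message that is sent again and again therefore
    gets through again and again (fair loss), and nothing is delivered that
    was not sent (no creation).
    """

    def __init__(self, k):
        self.k = k
        self._sent = defaultdict(int)  # transmissions so far, per message
        self.received = []             # FL-receive indications at q, in order

    def fl_send(self, m):
        # Request FL-send: count the transmission and deliver every k-th one.
        self._sent[m] += 1
        if self._sent[m] % self.k == 0:
            self.received.append(m)


# p1 keeps retransmitting m, as in Figure 2.5.
link = FairLossLink(k=3)
for _ in range(10):
    link.fl_send("m")
```

A single fl_send on such a link may never be delivered, which is why the sender has to keep retransmitting.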

Stubborn Links

Definition 2.10 provides delivery guarantees to the processes. Thus, algorithms have some guarantee that communication is possible. However, if a process sends a message to another one and wants to be sure that the message arrives, it has to send the message an infinite number of times. Since this certainty is needed in many algorithms, Guerraoui et al. [1996] introduced stubborn links. These links build upon fair loss links and ensure reception by infinitely many retransmissions. The advantage of stubborn links is that they hide the retransmission process. Thus, an algorithm does not need to care about the fair loss of the communication network. Figure 2.6 illustrates the technique. The sender issues one stubborn send command, and the link abstraction takes care of the retransmission to avoid message losses.

[Figure 2.6: Illustration of stubborn links: $p_1$ stubbornly sends m to $p_2$.]

Originally, Guerraoui et al. [1996] defined the retransmission for the last sent message of each process only. Later, Oliveira et al. [1997] introduced k-stubborn links, which deal with the last k messages. Fortunately, 1-stubborn links are sufficient for the consensus problem, which is yet to be defined, as shown by the correctness proof of a consensus algorithm in the paper of Guerraoui et al. [1996]. Thus, the following definition, which is used in this thesis, is similar to the original one.

Definition 2.12 (Stubborn Links): The links of an asynchronous system are stubborn links if the following properties are satisfied:

Property 2.3 (No Creation), confer page 19.

Property 2.4 (Stubbornness): If a process p sends a message m to a good process q, and does not crash, and indefinitely delays sending any further message to q, then q eventually receives m.
Formal: $\forall m \in M\ \forall p, q \in \Pi\ \exists t_s, t_r \in T\ \forall t \in T\colon q \in good(F) \wedge S(p, q, t_s) = m \wedge (t_r > t > t_s \Rightarrow S(p, q, t) = \bot \wedge p \notin F(t)) \Rightarrow R(q, p, t_r) = m$

Remark: Note that the requirement of the indefinite delay in property 2.4 does not mean that a process p is not allowed to send any new message $m'$

after the sending of a message m in order to ensure the reception of m. In fact, no certain delay exists for which process p has to wait until it can send $m'$ without compromising the reception of m. This uncertainty is expressed by property 2.4 in the part $t_r > t > t_s \Rightarrow S(p, q, t) = \bot$ of the formal term, which means that p sends no message ($S(p, q, t) = \bot$) until q receives m ($t_r > t > t_s$), where t represents all times between the sending time $t_s$ and the receiving time $t_r$.

Figure 2.7 illustrates the meaning of the stubbornness property 2.4. The indefiniteness of the delay that a process has to wait before sending the next message can cause message loss, as in the case of $m_2$, where process $p_1$ did not wait long enough before sending $m_3$. Processes can overcome the indefiniteness if they acknowledge the reception of messages. This technique is also used in the algorithms of this thesis.

[Figure 2.7: Illustration of stubbornness: $p_1$ stubbornly sends $m_1$, $m_2$, and $m_3$ to $p_2$.]

Algorithms can use stubborn links via the following interface.

Definition 2.13 (Interface of Stubborn Links): If an algorithm wants to use stubborn links, it has to implement the following indication event and can use the following request events:

Request send message m to process q: Used to send message m to process q stubbornly.

Indication receive message m from process q: Used to receive messages.

Request single-send message m to process q: Used to send message m to process q once.

Request stop-retransmit: Stops the stubborn sending of any message.

As mentioned before, stubborn links can be built upon fair loss links. An algorithm that uses fair loss links and implements stubborn links is shown in appendix A, where the differences to other stubborn links definitions are also explicated. The reason for the last request event, which is called finalsend in the original definition by Guerraoui et al. [1996], is presented in the following subsection.
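The following sketch shows one way a 1-stubborn link could sit on top of a lossy channel, and reproduces the loss of $m_2$ from Figure 2.7. It is not the algorithm of appendix A: the classes LossyChannel and StubbornLink, the tick method that drives retransmission, and the deterministic drop rule are all assumptions made for illustration. The single-send request is left out for brevity.

```python
class LossyChannel:
    """Toy lossy channel: only every k-th transmission overall is delivered."""

    def __init__(self, k):
        self.k = k
        self.count = 0        # transmissions attempted so far
        self.delivered = []   # messages that reached the receiver

    def transmit(self, m):
        self.count += 1
        if self.count % self.k == 0:
            self.delivered.append(m)


class StubbornLink:
    """1-stubborn link: only the last sent message is retransmitted."""

    def __init__(self, channel):
        self.channel = channel
        self.last = None

    def send(self, m):
        # Request send: remember m (replacing any older message) and transmit.
        self.last = m
        self.channel.transmit(m)

    def stop_retransmit(self):
        # Request stop-retransmit: forget the last message.
        self.last = None

    def tick(self):
        # Called once per clock tick (hypothetical driver): retransmit.
        if self.last is not None:
            self.channel.transmit(self.last)


channel = LossyChannel(k=3)
link = StubbornLink(channel)
link.send("m1")
link.tick()
link.tick()              # third transmission of m1 gets through
link.send("m2")
link.send("m3")          # sent too early: m2 is never retransmitted
link.tick()              # m3 gets through
link.stop_retransmit()
for _ in range(3):
    link.tick()          # nothing is transmitted any more
```

After this run only m1 and m3 have been delivered. An acknowledgement from the receiver before the next send would avoid the loss of m2.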

Quiescence

Quiescence in the context of distributed systems means that a point in time exists after which all communication is finished. Aguilera et al. [1997] were the first to study quiescence of failure detectors, but the idea is suited for all communication problems, because after a problem is solved, no more communication is desired.

Definition 2.14 (Quiescence): An algorithm is quiescent if all processes eventually stop sending messages.

If an algorithm uses stubborn links, it should be able to stop the retransmission of the last message. For this task, the stop-retransmit event of definition 2.13 can be used. Note that quiescence can never be satisfied if an algorithm that solves an agreement problem runs in the crash-recovery model and unstable processes are present. No point in time exists after which no more "I recovered" messages occur, because the freshly recovered processes typically also have to agree on some result, and the algorithm cannot know at the time of the recovery that a process is unstable. Nevertheless, all algorithms of this thesis satisfy quiescence in the absence of unstable processes. If unstable processes are present, the communication is reduced to a minimum to get close to quiescence.

2.6 Failure Detection

An algorithm never terminates if it waits for the delivery of messages from crashed processes. In order to prevent such situations, failure detectors are introduced to ensure the liveness of protocols. The following failure detector definition is valid for both failure models.

Definition 2.15 (Failure Detector): A failure detector D is a function $D\colon \Pi \times T \to 2^{\Pi}$. Each process $p \in \Pi$ is equipped with the same failure detector, but the output of these local failure detectors can differ. The failure detector at process p suspects process q of being down at time t if $q \in D(p, t)$.

Algorithms can use failure detectors via the following interface.
Definition 2.16 (Interface of Failure Detectors): If an algorithm wants to use a failure detector, it has to implement the following indication event:

Indication suspect set of processes Q: The failure detector suspects all processes $q \in Q$.
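Definition 2.15 only fixes the shape of the output of a failure detector, not how it is computed. As one possible illustration, the local module at a process p could derive its suspect set from heartbeats and a fixed timeout. Heartbeats, the timeout, and the class HeartbeatDetector are assumptions of this sketch and are not taken from the thesis; in an asynchronous system such a detector can suspect processes that are merely slow.

```python
class HeartbeatDetector:
    """Local failure detector module at one process p (hypothetical sketch).

    suspects(t) plays the role of D(p, t): the set of processes that p
    suspects of being down at time t.
    """

    def __init__(self, processes, timeout):
        self.timeout = timeout
        # Tick at which the last heartbeat from each process was received.
        self.last_heard = {q: 0 for q in processes}

    def on_heartbeat(self, q, t):
        # Record that a heartbeat from q arrived at tick t.
        self.last_heard[q] = t

    def suspects(self, t):
        # Suspect every process that has been silent for more than `timeout`.
        return {q for q, seen in self.last_heard.items()
                if t - seen > self.timeout}


detector = HeartbeatDetector(["q1", "q2"], timeout=3)
detector.on_heartbeat("q1", 5)
suspected_at_6 = detector.suspects(6)    # q2 has been silent since tick 0
detector.on_heartbeat("q2", 7)
suspected_at_8 = detector.suspects(8)    # both heard from recently
suspected_at_12 = detector.suspects(12)  # both silent for too long
```

A suspicion is withdrawn as soon as a new heartbeat arrives, so a process that recovers can leave the suspect set again.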
