Booting Clock Synchronization in Partially Synchronous Systems with Hybrid Process and Link Failures


Research Report 56/2007, Technische Universität Wien, Institut für Technische Informatik (revised version of Technical Report 183/1-126, Department of Automation, Technische Universität Wien, Jan. 2003). Originally published in Distributed Computing, © Springer Verlag.

Josef Widder · Ulrich Schmid

Booting Clock Synchronization in Partially Synchronous Systems with Hybrid Process and Link Failures

Abstract  This paper provides a description and analysis of a new clock synchronization algorithm for synchronous and partially synchronous systems with unknown upper and lower bounds on delays. It is purely message-driven, timer-free, and relies on a hybrid failure model incorporating both process and link failures, in both the time and the value domain. Unlike existing solutions, our algorithm works during both system start-up and normal operation: Whereas bounded precision (the mutual deviation of any two clocks) can always be guaranteed, accuracy (clocks being within a linear envelope of real-time) and hence progress is only ensured when sufficiently many correct processes are eventually up and running. By means of a detailed analysis, we provide formulas for resilience, precision and envelope bounds.

Keywords  Fault-Tolerant Distributed Algorithms · Initial Clock Synchronization · System Start-up · Hybrid Failure Models · Link Failures · Partially Synchronous Systems

1 Introduction

Clock synchronization is an important service in distributed systems [29]. It assumes that every process p owns a discrete clock C_p(t), which is periodically adjusted by a fault-tolerant clock synchronization algorithm. We will focus on deterministic clock synchronization here, which must guarantee the following properties:

(P) Precision: The simultaneous reading of any two correct clocks may deviate by at most some D_max.

This work has been supported by the Austrian START programme Y41-MAT, the BM:vit FIT-IT Embedded Systems project DCBA (proj.no ), and the FWF project Theta (proj.no. P17757-N04).
Josef Widder · Ulrich Schmid: Technische Universität Wien, Embedded Computing Systems Group (E182/2), Treitlstraße 3, 1040 Vienna, Austria, EU, {widder,s}@ecs.tuwien.ac.at
Josef Widder: Laboratoire d'Informatique LIX, École Polytechnique, Palaiseau Cedex, France, EU

(A) Accuracy: Any correct clock remains within a linear envelope of real-time.

Many different clock synchronization algorithms have been proposed in the literature [49,35,38,1]. Most of these assume systems with known bounds on the durations of computing steps and communication delays, and cannot handle system start-up. In real systems, however, each process may start independently at some unpredictable time. Thus, processes may not have completed booting when some earlier process starts sending messages. During start-up, even messages from correct processes may hence be lost, and failure assumptions like the one that less than a third of the processes are Byzantine faulty (which is necessary [13] for achieving (P) and (A) in the presence of Byzantine failures) do not hold. Moreover, many real networks (like the Internet) cannot be modeled properly as synchronous systems [53,11,22]. To implement clock synchronization¹ in such systems, a timer-free start-up mechanism is required.
In [55], we provided a solution to this problem that, unlike naïve startup algorithms, avoids increasing the required number of processes and/or adding a priori timing assumptions to the system model: By modifying the clock synchronization algorithm by Srikanth and Toueg [51], which is based on the well-known consistent broadcasting primitive, we derived a simple and efficient clock synchronization algorithm that requires just n ≥ 3f + 1 processes for coping with up to f Byzantine faulty processes and works both during normal operation and system startup: Whereas some precision D_max is guaranteed during the whole system lifetime, progress of the clocks, i.e., accuracy, is only guaranteed when sufficiently many correct processes are eventually up and running. In this paper, we will present and analyze a variant of the algorithm of [55] under a powerful hybrid failure model. Since less severe failures can be handled with fewer processes than more severe ones, a hybrid failure model leads to

¹ Although clock synchronization is traditionally studied in synchronous systems with hardware clocks, it is a useful service in partially synchronous systems with software clocks (counters) as well; see Sect. 5 for details.

a decreased system size in the presence of realistic failures [54,3]: Given the maximum number of failures of certain types, the required number of processes is smaller than that of a solution where all failures, even benign ones, are treated as Byzantine. As demonstrated by Powell [34], fewer processors in the system that could fail may actually increase the overall system's dependability. Apart from the refined treatment of process failures, our perception-based analysis also shows that the algorithm tolerates a large number of communication failures, i.e., moving link failures. It can therefore be applied even in typical wireless settings, where link failure rates up to 10^-2 are common [47,41].

Designed for synchronous or partially synchronous systems with unknown lower and upper bounds on delays, our round-based algorithm is completely message-driven, does not employ timers, and can hence be run on systems with different timing characteristics without recompilation. In fact, it is only the achieved precision and the envelope bounds, but not the algorithm itself, that depend on the underlying system's timing behavior: If the lower and upper bounds τ−, τ+ on delays used for computing D_max hold during an execution, the algorithm's actual precision in this execution will not exceed D_max. Note that, in contrast to classic clock synchronization algorithms, D_max does not depend on the delay uncertainty ε = τ+ − τ− but rather on the delay ratio Θ = τ+/τ−. Moreover, it is elastic with respect to timing assumptions: If the assumed bounds are (temporarily) violated, the actual precision might (but need not necessarily) exceed D_max; it eventually returns to within D_max, however, when the assumed bounds hold again.

The remainder of this paper is organized as follows: After a survey of existing work in Sect. 2, we provide a glossary of all our notation in Sect. 3 for further referencing. In Sect. 4, we introduce our system and failure model.
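To make the distinction between the delay uncertainty ε and the delay ratio Θ concrete, consider a small sketch (ours, not from the paper; the numeric bounds are made up and the D_max formula itself is not reproduced here): uniformly slowing a system down leaves Θ, and hence a Θ-based precision bound, unchanged, whereas ε grows proportionally.

```python
# Illustration (our own, made-up bounds): the delay uncertainty
# eps = tau_plus - tau_minus grows when all delays are scaled up uniformly,
# while the delay ratio Theta = tau_plus / tau_minus stays invariant.
def delay_uncertainty(tau_minus: float, tau_plus: float) -> float:
    assert 0 < tau_minus <= tau_plus
    return tau_plus - tau_minus

def delay_ratio(tau_minus: float, tau_plus: float) -> float:
    assert 0 < tau_minus <= tau_plus
    return tau_plus / tau_minus

# A slow system with the same relative timing behavior as a fast one:
fast = (1.0, 3.0)    # tau-, tau+ (arbitrary time units, made-up values)
slow = (10.0, 30.0)  # everything 10x slower

assert delay_ratio(*fast) == delay_ratio(*slow) == 3.0            # Theta unchanged
assert delay_uncertainty(*slow) == 10 * delay_uncertainty(*fast)  # eps grows
```

This is why a Θ-based bound can be stated without knowing the absolute delays: it only constrains their ratio.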
It is a refinement of the generic perception-based failure model of [42], extended to the system startup phase. In Sect. 5, we present our clock synchronization algorithm and its operation principles. Sect. 6 and Sect. 7 contain the analysis of our algorithm in early and degraded mode, respectively, when not sufficiently many processes have completed booting. In Sect. 8, we investigate our algorithm's performance in normal mode, when sufficiently many processes are eventually up and running. A discussion of certain impacts of our synchrony and failure assumptions in Sect. 9 and a summary of our accomplishments in Sect. 10 round off the paper.

2 Related Work

Although clock synchronization in synchronous systems [1,13,49,35,48,38,32], as well as in partially synchronous systems [15,33], is a very well-researched subject, there are only a few papers [51,31,26,32,52] that deal with initial synchronization. Rather than considering a full system startup, however, most of those papers are devoted to integrating a new process into an already running system. The only exceptions known to us are [52,10], which deal with solutions to the startup problem in very specific TDMA system architectures, and [26], which considers an approach for initialization in the MAFT architecture under stronger system assumptions. None of these solutions is message-driven and time(r)-free and works in partially synchronous systems. This is also true for the self-stabilizing Byzantine clock synchronization algorithms of [14,12]. Although self-stabilization obviously solves the booting problem as well, it is overkill here, in that booting nodes do not start from an arbitrary internal state. Consequently, in sharp contrast to our solution, none of the algorithms of [14,12] can provide constant booting (i.e., stabilization) time.

Clock synchronization in the presence of link failures has been studied in [36] and in some of our previous work [39,45,44,40,27]. None of these papers considered initial startup and, with the exception of [40,27], all of those papers consider synchronous systems only. In [27], we introduced a time(r)-free implementation of the perfect failure detector in partially synchronous systems based on consistent broadcasting. In [23], it was shown how such an algorithm can be implemented in an architecture based on broadcast bus networks. In both [27,23], however, we assumed that all processes are up right from the start. This assumption is relaxed in the implementation of the eventually perfect failure detector in [57], which employs the clock synchronization algorithm of [55] to cope with the booting problem.

3 Notational Conventions

Throughout the paper, we use the following notation: Processors are denoted by lowercase letters like p and q; round numbers by lowercase letters like k and l. Process names are upper case letters but usually represented by their processors. Processor subscripts denote the process where a quantity like V_q^{p,k} and t_q^{p,k} is locally available; processor superscripts denote the remote source of a quantity. Calligraphic variables like V_p^k denote sets or vectors; bold variables like τ denote intervals. Real-time values and variables are denoted by lower case letters like t, t′; logical time values are just round numbers k, k′. An overview of the most important terms and variables, including a reference to their definition, is provided in Table 1.

4 System Model

In this section, we will define our system model. Starting from the basic execution and timing model in Sect. 4.1 and Sect. 4.2, we introduce a hybrid failure model for round-based algorithms in Sect. 4.3, which distinguishes different classes of process and communication failures, both in the time and in the value domain. In Sect. 4.4, we define a convenient abstraction, originally proposed in [40,47,42], which considerably simplifies the analysis of round-based algorithms.
It is based on how a process locally perceives

the behavior of the other processes in a round, i.e., the presence and absence of this round's messages. In Sect. 4.5, we will extend this perception-based model to incorporate the system booting phase.

Table 1  Glossary of our Notation

∅                        empty perception value
be_p^k = V_p^k(t_p^k)    process p's round k broadcast event
k = C_p(t)               p's logical clock value (i.e., round number) at real-time t (Sect. 5)
C_max(t)                 maximum clock value in the system at real-time t
D_max                    maximum precision during whole operation (Theorem 5)
D_MCB                    maximum precision from degraded mode on (Theorem 4)
D_MCB                    maximum precision in normal mode
δ_q^{p,k}                end-to-end delay of round k message from p to q
init                     synchronization latency (Theorem 8)
E_p^k                    process p's perception vector for (echo,k)
(echo,k)                 algorithm's echo-message for clock value k
ε = τ+ − τ−              correct end-to-end delay uncertainty
f_a                      maximum number of arbitrary faulty processes (Definition 7)
f_s                      maximum number of symmetric faulty processes (Definition 7)
f_i                      maximum number of omission faulty processes (Definition 7)
f_c                      maximum number of clean crash faulty processes (Definition 7)
f_l^s                    maximum number of outbound link failures per round and process (Definition 10)
f_l^sa                   maximum number of malign outbound link failures per round and process (Definition 10)
f_l^r                    maximum number of inbound link failures per round and process (Definition 10)
f_l^ra                   maximum number of malign inbound link failures per round and process (Definition 10)
σ_first^{k−1}            real-time when the first process's clock reaches k
I_p^k                    process p's perception vector for (init,k)
(init,k)                 algorithm's init-message for clock value k
I_σ(t)                   indicator function for non-synchrony with real-time t
M_s                      set of messages created by obedient processes during a step
(N1)                     n ≥ f_l^s + 2 f_l^ra + 2 f_l^r + 3 f_a + 2 f_s + 2 f_i + f_c + 1
(N2)                     n ≥ 2 f_l^s + 2 f_l^ra + 2 f_l^r + 3 f_a + 3 f_s + 2 f_i + 2 f_c + 1
n_up(k)                  number of processes that see all round k messages
P_up(k)                  set of processes that see all round k messages
Π = {1, ..., n}          set of all processes
pe_q^{p,k} = V_q^{p,k}(t_q^{p,k})  process q's round k perception event for sender p
R^k                      matrix of round k messages actually received system-wide (Equation (3))
S^k                      matrix of round k messages sent system-wide (Equation (2))
S_s                      set of messages sent by obedient processes during a step
σ_p^k                    process p's round switching time from k to k+1
tick_k                   abbreviation for (init,k) and/or (echo,k) messages
t_p^k                    occurrence time of process p's round k broadcast event
t_q^{p,k}                occurrence time of process q's round k perception event for sender p
τ−, τ+                   lower and upper bounds on end-to-end delay
t_up                     real-time when a late starter gets up (Theorem 5)
t_sync                   real-time when a late starter gets synchronized (Theorem 6)
Θ = τ+/τ−                correct end-to-end delay ratio
V_p^k                    value sent by p in round k message
V_q^{p,k}                process q's perception of the round k value from sender p
V^k = V^k(t)             round k perception matrix of the system (calligraphic)
V_p^k = V_p^k(t)         round k perception vector of process p (calligraphic)

4.1 Execution Model

We consider a distributed system of n processors linked by a fully connected point-to-point network. Every processor is identified by a unique processor id p ∈ Π = {1, ..., n}. Processors are down initially and boot at unpredictable times, after which they are called up. When up, every processor executes one or more concurrent processes, appropriately scheduled on the processor's CPU. Processes are uniquely identified system-wide by the tuple ⟨p, N⟩, where the process name N is chosen from a suitable name space. Since a process will usually communicate with processes of the same name, we will distinguish processes primarily by their processor ids and suppress process names when they are clear from the context.
Actually, we will define our process failure model for a single process per processor only. If a distributed algorithm consists of several processes per processor, the failure model must be applied at the processor level, in conjunction with the assumption that all processes executed by a single processor may commit failures at most as severe as the failure mode of the processor allows.

Concerning the communication between processes, we assume that every pair of processes is connected via a pair of dedicated unidirectional links. Links are considered independent² of each other and need not be FIFO.

Executions are modeled as sequences of atomic computing steps. We assume a message-driven model [24], where all steps of processes that faithfully execute their algorithms, except the initial one (which is triggered when the processor gets up), are triggered by some message reception (possibly from itself). More specifically, a computing step s consists of the reception of exactly one message³, a state change (depending on the former local state of the process and the received message), and the generation and sending of a (possibly empty) set M_s of messages. We call a step s correct if the process both changes its state and creates M_s according to its algorithm and successfully sends M_s. A step s is called obedient if the process changes its state and creates M_s according to its algorithm, but succeeds only in sending a (possibly empty) subset of messages S_s ⊆ M_s. A step that is neither correct nor obedient will be called faulty. The following related terms will be used in the sequel:

Definition 1 (Correct process) A process p is correct up to step s in an execution if it performs a correct step on every message reception up to step s. A process p is correct in an execution if it performs a correct step on every message reception.

Note that we assume that the algorithm also contains code that handles (typically via a no-operation) the reception of any illegal message. An illegal message is a message that is detectably malformed w.r.t. the protocol, e.g.
bad checksum, obviously wrong message content and/or message format.

Definition 2 (Obedient process) A process p is obedient up to step s in an execution if it performs an obedient step on every message reception up to step s. A process p is obedient in an execution, if it performs an obedient step on every message reception.

Note that, according to this definition, a correct process is also obedient. In Sect. 4.3, we will use the notion of correct, obedient and faulty steps to define the exact semantics of process failures. According to Definition 6, benign faulty processes (crash and send omissions⁴) are obedient, whereas non-benign faulty processes (symmetric and arbitrary) may also take faulty steps.

² Hence, we do not model the fact that all links between processes residing on the same pair of processors share the single physical link connecting the two processors. Link failure budgets will also be assigned on a per-process basis, rather than on a per-processor basis.
³ Formerly, as in [17], such algorithms were denoted asynchronous.
⁴ Receive omissions, and hence general omissions, are incorporated via communication link failures, rather than via process failures. See Sect. 4.3 for details and Sect. 9.2 for some discussion.

4.2 Timing Assumptions

The execution model introduced above is based on atomic steps, which are executed in zero time. Since real processors and communication links have finite speed, computation and communication delays (resulting from processing, transmission and queuing) are modeled via some non-zero time interval between computing steps. Unlike the classic partially synchronous model by Dwork, Lynch, and Stockmeyer [15], which postulates the existence of (possibly unknown) bounds on processing speed ratio and communication delays, we employ a timing model that is solely based on the end-to-end delays of successfully received messages.
In the following definitions, we take a sender-centric view, in that we consider the transmission of some particular message m that is sent by some process p. While the receiver may get multiple messages with the same content as m (e.g., due to failures or due to some algorithm which does not use unique messages), we focus on the question of whether/when a particular message m sent by p is received.

Definition 3 (Successful reception) A message m ∈ S_s sent by process p to an obedient process q in some step s is successfully received by q if q eventually takes a step that is triggered by m.

Successfully received messages will be further classified according to their end-to-end delay, which is defined as follows:

Definition 4 (End-to-end delay) Let process p send message m ∈ S_s to process q in some step s at time t_s, and let m be successfully received by the obedient process q in some step r at time t_r. The end-to-end delay of the successfully received message m is defined as δ_q^{p,m} = t_r − t_s.

In the following sections, we will consider executions E = E(τ−, τ+), which are implicitly parametrized by two (unvalued) timing parameters τ− and τ+. Informally, τ− and τ+ represent the lower respectively upper bound on the end-to-end delay of successfully received messages in E, i.e., the values of these bounds may differ for different executions. They must satisfy 0 < τ− ≤ τ+ < ∞ and hold also for the self-reception delay δ_p^{p,m} (cf. Sect. 9.3 for how to remove this simplifying assumption). Let τ = [τ−, τ+] be the closed interval spanned by those bounds.

Definition 5 (Timely reception) A message m that is successfully received by obedient process q and sent by process p (including p = q) is timely received at q if δ_q^{p,m} ∈ τ, i.e., τ− ≤ δ_q^{p,m} ≤ τ+.

The resulting bounds on the delay uncertainty and delay ratio, which will play a central role in our analysis, are given by ε = τ+ − τ− resp. Θ = τ+/τ−. Obviously, ε and Θ solely depend on the τ− and τ+ of the execution under consideration.
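Definitions 4 and 5 amount to a simple classification of each successfully received message by its end-to-end delay. A minimal sketch (ours; recall that in the paper τ− and τ+ are analysis-only parameters, not known to the algorithm):

```python
# Sketch (our own illustration) of Definitions 4 and 5: compute the
# end-to-end delay of a successfully received message and test whether it
# was timely, i.e., whether it lies in the closed interval [tau-, tau+].
def end_to_end_delay(t_sent: float, t_received: float) -> float:
    return t_received - t_sent

def is_timely(delta: float, tau_minus: float, tau_plus: float) -> bool:
    return tau_minus <= delta <= tau_plus

delta = end_to_end_delay(t_sent=0.0, t_received=2.5)
assert is_timely(delta, tau_minus=1.0, tau_plus=3.0)    # within bounds
assert not is_timely(0.5, tau_minus=1.0, tau_plus=3.0)  # too early: untimely
assert not is_timely(3.5, tau_minus=1.0, tau_plus=3.0)  # too late: untimely
```

Note that both early and late messages count as untimely; lost messages never trigger a reception step and hence have no end-to-end delay at all.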
It is important to note that τ− and τ+ are not known to the algorithm analyzed in this paper. Rather, they are unvalued variables that are solely used for analysis purposes:

The formulas derived for quantities like precision or accuracy involve τ− and τ+. Obviously, those formulas can be used to actually compute those quantities when the algorithm is employed in a certain system: For example, all executions of a system S adhering to the synchronous model obey some a priori known delay bounds. Hence, plugging those known bounds in for τ− and τ+ in, e.g., our precision formula (D_max) provides the worst-case precision of our algorithm in any execution of S. Our formulas can also be used for computing those quantities in other system models, however, in particular for the partially synchronous system model with unknown delay bounds from [15]. See Sect. 9.1 for details.

4.3 Hybrid Failure Model for Round-based Algorithms

In this section, we will specify our comprehensive hybrid failure model for round-based algorithms. Such algorithms execute in a sequence of (non-lockstep) rounds, where every process broadcasts the same message⁵ to all processes. Following the detailed definition of our round-based algorithms, we introduce several classes of process failures and associated failure bounds. Next, we provide various classes of communication failures (link failures) and associated failure bounds. Link failures, which may be moving (transient) here, are defined on a per-round basis, and are orthogonal to process failures. To simplify the analysis of distributed algorithms under our detailed failure model, we proceed with a high-level abstraction based on how processes perceive each other in a round [47,42], i.e., based on the set of received messages. Finally, we extend the validity of the resulting perception-based failure model to the system booting phase.

In this paper, we consider asynchronous round-based algorithms only.
Every process executes a sequence of consecutive rounds k = 0, 1, ... here, which are asynchronous in the sense that different processes may be in different rounds at the same time. In round k, obedient process p receives round k messages, performs some computation based on the received messages and its local state, and broadcasts (i.e., sends to all processes in Π, including itself) the resulting round k+1 message. The round k message sent from obedient process p to q in such a broadcast step has the form (V_p^k, k)_{p,q}, where V_p^k represents an algorithm-specific value and k the current round number. The sender identifier p and the receiver identifier q are not explicitly included in the message, since they are implicitly determined according to our point-to-point network assumption (we assume that the network prevents masquerading). These subscripts are hence usually omitted for brevity.

In more detail, obedient process p's initial round k = −1 consists of a single step s_p^{−1}, triggered when the process gets up, where p's round 0 message (V_p^0, 0) is broadcast. Process p's round k ≥ 0 consists of some number ℓ_k ≥ 1 of round k steps, which are triggered by the successful reception of the round k messages from the processes. Steps 1, ..., ℓ_k − 1 are pure receiving steps, where no messages are sent by p (such that M_s = ∅). Process p's round k is terminated by the round k step ℓ_k, termed p's round k switching step s_p^k, in which p computes and broadcasts its round k+1 message (V_p^{k+1}, k+1). Hence, |M_{s_p^k}| = n here.

⁵ Only broadcast-based algorithms can benefit from our class of symmetric failures (cf. Definition 6), which allow a faulty process to broadcast an erroneous value. As long as such erroneous values are received consistently by all receivers, symmetric failures are less severe than Byzantine failures.
Typically (but depending on the particular algorithm), the round switching step occurs when sufficiently many round k messages (from distinct peers) have arrived. Let the round switching time σ_p^k be the real-time when the round switching step s_p^k occurs, that is, when p switches from round k to k+1. We consider consecutive rounds only, such that σ_p^{−1} ≤ σ_p^0 ≤ σ_p^1 ≤ ... for every obedient process p. Note that round k messages arriving after σ_p^k are discarded, i.e., we consider a communication-closed model.

From the above, it is apparent that both the local state reached by process p after its round k switching step s_p^k and the content V_p^{k+1} of the round k+1 message broadcast in step s_p^k are based on p's local state at the beginning of (the first step of) round k and all⁶ the round k messages received during round k. Nevertheless, since rounds are asynchronous, it may of course happen that some steps triggered by the reception of a round k′ message, with k′ > k, from some peer occur at p when it is actually in round k. Note that such early messages are of course not illegal ones; recall our comment following Definition 1.

In Sect. 4.1, we described the behavior of correct and obedient processes, both of which faithfully execute their algorithm. Here we add the behavior of non-benign faulty processes: A non-benign faulty process may take faulty steps, which are not in accordance with its algorithm and may thus send messages of arbitrary content to arbitrary receivers. We distinguish processes which exhibit arbitrary failures here, where there is no restriction on the set of messages sent in any step, and symmetric failures (similar to identical Byzantine failures from [2]), where, for any step s of process p in an execution, it holds that either no message is sent in s at all, or the same (possibly erroneous) message m is broadcast.

Definition 6 (Process Behaviors) The behavior of a process in an execution is classified as follows:

Correct: See Definition 1.
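The round structure just described can be sketched in a few lines. This is our own simplification, not the paper's algorithm: the switching rule "enough distinct senders heard from" stands in for the algorithm-specific condition, `broadcast` is an assumed network primitive, and early messages are simply ignored here (a real implementation would typically buffer them).

```python
# Sketch (ours) of the message-driven round structure of Sect. 4.3.
class RoundProcess:
    def __init__(self, pid, n, threshold, broadcast):
        self.pid, self.n = pid, n
        self.threshold = threshold  # round-k messages needed to switch rounds
        self.broadcast = broadcast  # assumed primitive: broadcast((value, round))
        self.round = -1             # initial round k = -1
        self.senders = set()        # distinct peers heard from in current round

    def on_up(self):
        # Initial round: a single step that broadcasts the round 0 message.
        self.round = 0
        self.broadcast((self.compute_value(), 0))

    def on_receive(self, sender, value, k):
        if k < self.round:   # late message: discarded (communication-closed)
            return
        if k > self.round:   # early message: legal; handling is algorithm-
            return           # specific (usually buffered in practice)
        self.senders.add(sender)
        if len(self.senders) >= self.threshold:
            # Round switching step: compute and broadcast round k+1 message.
            self.round += 1
            self.senders = set()
            self.broadcast((self.compute_value(), self.round))

    def compute_value(self):
        return 0  # placeholder for the algorithm-specific value V_p^k

sent = []
p = RoundProcess(pid=1, n=4, threshold=3, broadcast=sent.append)
p.on_up()
for peer in (1, 2, 3):
    p.on_receive(peer, 0, 0)
assert sent == [(0, 0), (0, 1)]  # round 0 broadcast, then round 1 broadcast
assert p.round == 1
```

Note that no timer appears anywhere: every state change after the initial one is triggered by a message reception, which is exactly what makes the model time(r)-free.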
Clean Crash: A process that is correct up to some step and then does not take any further steps.

Crash: A process that is obedient up to some step and then does not take any further steps.

Symmetric Omission: An obedient process where all steps s satisfy S_s = ∅ or S_s = M_s.

Asymmetric Omission: An obedient process as introduced in Definition 2.

⁶ If multiple round k messages arrive from the same process q, which can happen if q is faulty, for example, the handling of these messages may be determined by the algorithm.

Symmetric: A process p where, for all steps s in the execution, either S_s = ∅ or ∃(V,k) : ∀q ∈ Π : (V,k)_{p,q} ∈ S_s and |S_s| = |Π|.

Arbitrary: A process with no restriction on the recipients and content of the messages S_s sent in step s.

Processes exhibiting failures ranging from clean crashes up to asymmetric omissions are called benign and are obedient. Both symmetric and arbitrary faulty processes are non-benign and not obedient, such that no assumption is made on how steps are triggered and performed on such processes. Our hybrid process failure model rests on the following upper bounds on the number of faulty processes during an execution.

Definition 7 (Maximum number of process failures) We assume that, during every execution of an algorithm, at most f_c processes are either clean crash or symmetric omission faulty (obedient processes that either perform complete broadcasts or full omissions), f_i processes are either crash faulty or asymmetric omission faulty (obedient processes that may perform incomplete broadcasts), f_s processes are symmetric faulty, and f_a processes are arbitrary faulty.

In addition to process failures, our failure model also provides communication failures (i.e., link failures). In sharp contrast to process failures, link failures may also hit messages sent and received by correct processes, and are typically moving and transient [37], in the sense that they can affect different links in different rounds of an execution. The following Definition 8 classifies the communication failures provided by our failure model, by considering their effects on a single message. Note that we only have to consider obedient receivers in our specifications, since non-obedient receiver processes need not follow their algorithm anyway.
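The failure bounds of Definition 7 feed directly into the resilience conditions (N1) and (N2) listed in Table 1, which can be checked mechanically for concrete budgets. A small sketch (ours; parameter names are ad hoc):

```python
# Sketch (our own): evaluating the resilience conditions (N1) and (N2) from
# Table 1 for a given system size n and the failure bounds of Definition 7
# (fa, fs, fi, fc) and Definition 10 (fls, flsa, flr, flra).
def satisfies_n1(n, fls, flra, flr, fa, fs, fi, fc):
    return n >= fls + 2*flra + 2*flr + 3*fa + 2*fs + 2*fi + fc + 1

def satisfies_n2(n, fls, flra, flr, fa, fs, fi, fc):
    return n >= 2*fls + 2*flra + 2*flr + 3*fa + 3*fs + 2*fi + 2*fc + 1

# With only f = f_a arbitrary (Byzantine) processes and no link failures,
# both conditions reduce to the classic n >= 3f + 1 bound:
assert satisfies_n1(4, 0, 0, 0, 1, 0, 0, 0)      # n = 4, f_a = 1: holds
assert not satisfies_n1(3, 0, 0, 0, 1, 0, 0, 0)  # n = 3 is too small
```

This also makes the hybrid model's benefit visible: a clean crash costs only one term (f_c) in (N1), whereas treating the same failure as arbitrary would cost three (3 f_a).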
Definition 8 (Link behaviors) The communication behavior when sending a message (V_p^k, k)_{p,q} ∈ S_s ⊆ M_s in step s of process p to some obedient process q is classified as follows:

Correct: The end-to-end delay δ_q^{p,k} satisfies δ_q^{p,k} ∈ τ (see Definition 5) and the message is received unaltered.

Lost: During the whole execution, no step is triggered by the reception of message (V_p^k, k)_{p,q} at q.

Untimely: The end-to-end delay δ_q^{p,k} satisfies δ_q^{p,k} ∉ τ (i.e., early or late message) and the message is received unaltered.

Corrupted: The message content (value V_q^{p,k} and/or round number k′) received at q differs from the message content (value V_p^k and round number k) sent.

Spurious: A message that has not been sent by p is received by q.

Lost, untimely, corrupted or spurious communication behavior is termed a link failure; a link failure which is not a lost message is termed a malign link failure.

Since it is impossible to solve any representative distributed computing problem in the presence of unrestricted communication failures [20,37], the power of link failures must be restricted somehow. As in [40,47,42], we will restrict the admissible link failure patterns by requiring that, for every process, only a certain fraction of its outbound links and a certain fraction of its inbound links suffer from link failures. Note that a single link failure hitting the link from process p to q affects both the outbound link to receiver q of the sender p and the inbound link from sender p of the receiver q. The links actually hit by link failures can change from round to round, i.e., may be moving [37]. In [41,47], we have shown that this model has good coverage even in typical wireless settings, where link failure rates up to 10^-2 are common. A comprehensive collection of impossibility results and lower bounds for consensus in this model can be found in [46].
To formally define our link failure model, we start with the set S_p^k(q) of round k messages actually sent by process p to process q in all⁷ steps s of an execution E:

    S_p^k(q) = { m_{p,q} = (·, k)_{p,q} : m_{p,q} ∈ S_s, s ∈ E }    (1)

Clearly, S_p^k(q) = ∅ or S_p^k(q) = {(V_p^k, k)_{p,q}} for every obedient sender process p, depending on whether p suffered from a send omission in its round k−1 switching step s_p^{k−1} or not. For any non-benign faulty process p, however, S_p^k(q) can be arbitrary, since there are no restrictions on how many round k messages are sent to q in an execution. The sets of round k messages sent system-wide in an execution can hence be represented by the following matrix:

    S^k = [ S_1^k(1)  S_2^k(1)  ...  S_n^k(1)
            S_1^k(2)  S_2^k(2)  ...  S_n^k(2)
             ...       ...            ...
            S_1^k(n)  S_2^k(n)  ...  S_n^k(n) ]    (2)

Note that the p-th column in this matrix provides all the round k messages sent by p (to any receiver). Similarly, the q-th row contains all the round k messages sent to q (from any sender). Our link failures act on the above matrices (2), for any k, thereby leading to possibly modified matrices

    R^k = [ R_1^k(1)  R_2^k(1)  ...  R_n^k(1)
            R_1^k(2)  R_2^k(2)  ...  R_n^k(2)
             ...       ...            ...
            R_1^k(n)  R_2^k(n)  ...  R_n^k(n) ]    (3)

with entry R_p^k(q) denoting the set of round k messages actually received by q over the inbound link from p. Extending the single-message link semantics given in Definition 8 to

⁷ Since we are dealing with a communication-closed model, we could even restrict our attention to round k messages sent before the round k switching step of q, i.e., to all steps s ∈ pref_q^k(E) in the finite prefix pref_q^k(E) of E up to step s_q^k in (1).

the behavior relevant for round k messages, we arrive at the following Definition 9.

Definition 9 (Round k link behaviors) Given the system-wide matrices of round k messages (2) and (3), we distinguish the following round k link behavior between sender p and obedient receiver q:

Correct: R_p^k(q) = S_p^k(q), and all receptions are timely.

Lost: R_p^k(q) ⊊ S_p^k(q), and additionally all successful receptions are timely.

Untimely: At least one message m ∈ R_p^k(q) is untimely.

Corrupted: R_p^k(q) ≠ S_p^k(q), due to at least one message m ∈ R_p^k(q) that is corrupted.

Spurious: R_p^k(q) ≠ S_p^k(q), due to at least one message m ∈ R_p^k(q) that is spurious.

Lost, untimely, corrupted or spurious round k link behavior is termed a round k link failure. Note that a single malign link failure that changes a message (V_p^k, k) to (V_p^k, l) affects both R^k and R^l. Finally, our link failure model given in Definition 10 below just restricts the number of entries in the matrix (3) that are affected by link failures as introduced above.

Definition 10 (Maximum number of round k link failures) For any k ≥ 0, for any process p ∈ Π and any obedient process q ∈ Π, all entries in matrix R^k are correct, except that

(R) at most f_ℓ^r entries in row q may be affected by round k link failures, with at most f_ℓ^{ra} ≤ f_ℓ^r of those caused by malign link failures, and

(S) at most f_ℓ^s entries in column p may be affected by round k link failures, with at most f_ℓ^{sa} ≤ f_ℓ^s of those caused by malign link failures.

Obviously, (R) bounds the number of failures per round at the inbound links of the obedient process q, while (S) bounds the number of failures per round at the outbound links of process p. The particular links actually hit by link failures may be different in different rounds.
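The row and column bounds of Definition 10 can be expressed as a simple admissibility check over a grid of per-link behaviors as in Definition 9. The grid layout and the bound names (mirroring the paper's f_ℓ^r, f_ℓ^{ra}, f_ℓ^s, f_ℓ^{sa}) are our own illustrative assumptions.

```python
# Sketch of Definition 10: behavior[q][p] classifies the round-k link behavior
# from sender p to receiver q. A pattern is admissible if every row q respects
# the inbound bounds (R) and every column p respects the outbound bounds (S).

MALIGN = {"untimely", "corrupted", "spurious"}  # link failures that are not losses
FAILURES = MALIGN | {"lost"}

def pattern_admissible(behavior, f_lr, f_lra, f_ls, f_lsa):
    n = len(behavior)
    for q in range(n):                           # (R): inbound links of receiver q
        row = behavior[q]
        if sum(b in FAILURES for b in row) > f_lr:
            return False
        if sum(b in MALIGN for b in row) > f_lra:
            return False
    for p in range(n):                           # (S): outbound links of sender p
        col = [behavior[q][p] for q in range(n)]
        if sum(b in FAILURES for b in col) > f_ls:
            return False
        if sum(b in MALIGN for b in col) > f_lsa:
            return False
    return True
```

Note that the grid (and hence the set of links actually hit) may differ from round to round, matching the moving-failure semantics.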
In addition, since the effects of process failures and link failures are orthogonal, it can of course happen that a link failure hits a message from a faulty sender process. Note that this actually increases the power of the adversary: In case of a clean crash or a symmetric faulty sender process p, for example, malign link failures can create spurious messages or messages with erroneous content from p at some receivers. Further details and consequences of our failure assumptions will be provided in Sect. 4.4.

4.4 Perception-based Analysis

In Sect. 4.3, we defined our hybrid failure model by looking into the details of the execution model: how (faulty) processes may perform their steps and how messages may be affected by communication failures. We will now introduce a higher level of abstraction, which considerably simplifies the analysis of distributed algorithms in our model. This abstraction rests on how processes perceive their peers at the end of a round, i.e., whether and which round k messages they have received. In Lemma 1 below, we will derive some properties of those perceptions from our physical failure model.

The round-based model introduced in Sect. 4.1 can be viewed at a higher level of abstraction as follows: For every round number k, we assume that every (obedient) process q collects the values received in round k in a local array V_q^k = (V_q^{1,k}, ..., V_q^{n,k}) called the perception vector. The multiset V_q^{p,k} ∈ V_q^k (called a perception) is either the special value ∅ if no round k message from process p has been received yet, or it contains the values of all^8 round k messages received over the link from process p so far. Obviously, V_q^k = V_q^k(t) as well as its entries V_q^{p,k} = V_q^{p,k}(t) are time-dependent (we will usually suppress the observation real-time t in order not to overload our notation). Initially, all V_q^{p,k}(0) = ∅.
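The perception vector can be modeled directly as an array of multisets that only ever grows. The class name and method names below are our own; the paper defines the structure mathematically only.

```python
# Illustrative model of a perception vector V_q^k for one round k at process q:
# entry V[p] is the multiset of round-k values received from p so far (the
# empty list plays the role of the special value "empty perception").

class PerceptionVector:
    def __init__(self, n):
        self.V = [[] for _ in range(n)]   # initially all perceptions are empty

    def receive(self, p, value):
        # The first append corresponds to the perception event pe_q^{p,k};
        # later round-k messages from p (failures, retransmissions) accumulate.
        self.V[p].append(value)

    def nonempty(self):
        # Number of peers from which at least one round-k message arrived.
        return sum(1 for entry in self.V if entry)
```

Multiple entries per peer are possible precisely because failures and late-booting retransmissions may deliver several round k messages over the same link.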
In the first step where process q receives a round k message from process p, at some time t_q^{p,k} > 0 (called the perception event pe_q^{p,k} here), the received value is put into the perception V_q^{p,k}; thus V_q^{p,k}(t′) = ∅ if t′ < t_q^{p,k} and V_q^{p,k}(t′) ≠ ∅ otherwise. Note that t_q^{p,k} = ∞ in case of complete message loss.

Process p's round k terminates at the round switching time σ_p^k, which is the real-time when the round k switching step s_p^k occurs (where p's round k+1 messages are sent). This instant is called the broadcast event be_p^{k+1} here and occurs at real-time t_p^{k+1} = σ_p^k. In case of synchronous algorithms, round switching at time σ_p^k is enforced by some means external to the algorithm. In case of message-driven algorithms like the one of this paper, σ_p^k is determined by the algorithm itself, typically when sufficiently many messages have been received. The value V_p^{k+1} broadcast in p's round k+1 message is computed from the round k perceptions available^9 in V_p^k = V_p^k(σ_p^k) and p's local state at time σ_p^k. Process p's broadcast event be_p^k = V_p^k(t_p^k) and process q's perception event pe_q^{p,k} = V_q^{p,k}(t_q^{p,k}) are related via their values V_p^k, V_q^{p,k} and their occurrence times t_p^k and t_q^{p,k} = t_p^k + δ_q^{p,k}, with the end-to-end delay δ_q^{p,k}. (In case of no failure, V_q^{p,k} = {V_p^k} and t_q^{p,k} ≤ t_p^k + τ⁺.)

Our perception-based analysis rests on the n × n perception matrix V^k(t) of all round k perceptions observed at

^8 Multiple round k messages via the same link can occur in case of failures and/or late booting retransmissions.
^9 More specifically, in an algorithm-specific way, a single value may be chosen from V_p^k and used to compute the next state.

the same real-time t at all processes, typically the round switching time of some process:

V^k(t) = | V_1^{1,k}(t)  V_1^{2,k}(t)  ...  V_1^{n,k}(t) |
         | V_2^{1,k}(t)  V_2^{2,k}(t)  ...  V_2^{n,k}(t) |
         |     ...           ...       ...      ...      |    (4)
         | V_n^{1,k}(t)  V_n^{2,k}(t)  ...  V_n^{n,k}(t) |

Its row q is just process q's perception vector V_q^k(t). Note that V^k(t) and the (in time and space) global matrix R^k of round k messages defined in (3) are of course closely related: V_q^{p,k}(t) is the subset of the messages in R_p^k(q) with reception times before or at time t, or ∅ if there is no such message.

The above perception matrix V^k(t) is a quite flexible basis for the analysis of distributed algorithms, since it provides a system-global view of all the processes' local views (i.e., perception vectors) at an arbitrary observation time t. The primary way of using the perception matrix in the analysis of agreement-type algorithms like the clock synchronization algorithm of this paper is the following: Given the perception vector V_q^k(σ_q^k) of some specific obedient receiver process q at its round switching time σ_q^k, it allows us to determine how many non-empty perceptions will at least be present in any other obedient process r's perception vector V_r^k(σ_q^k + ε) shortly thereafter. The following Lemma 1, developed in [4,42], formalizes this fact.

Lemma 1 (Difference in Perceptions) At any time t, the perception vector V_q^k(t) of any obedient receiver q may contain at most f_ℓ^{ra} + f_a + f_s time- and/or value-faulty perceptions V_q^{p,k} ≠ ∅. Moreover, at most f = f_ℓ^{ra} + f_ℓ^r + f_a + f_i perceptions V_r^{p,k}, for any p where V_q^{p,k} ≠ ∅ in V_q^k(t), may be empty in any other obedient receiver's V_r^k(t + Δt) for any Δt ≥ ε.

Proof The first statement of our lemma is an obvious consequence of Definition 7 and Definition 10.
To prove the second statement, we note that at most f_ℓ^{ra} + f_a + f_i messages may have been available (partly too early) at q without being available yet at r, an additional f_ℓ^{ra} perceptions may be late at r, and f_ℓ^r − f_ℓ^{ra} ones could suffer from a loss at the inbound links to r. All messages from symmetric faulty processes present in V_q(t) must also be present in V_r(t + Δt), however. Summing up all the differences, the expression for f given in Lemma 1 follows.

The above perception-based model for a single process per processor is easily generalized to distributed algorithms consisting of multiple processes per processor: Every processor can host multiple perception vectors, one per process. Any process (but at most one per processor) may send messages for a specific perception vector, i.e., to some specific process. Only one process per processor may process the messages from a specific perception vector, however.

4.5 System Booting

In classic models, it is assumed that all processes boot simultaneously at time t = 0 and are hence able to receive each other's messages right from the start. In this section, we will show how to get rid of this assumption. At the very beginning, all processes are down. Every message that arrives at a process while it is down is lost, and no messages are sent by such a process. Note carefully that we do not even allow spurious messages from a down process here. Obedient processes boot at times that are not known a priori, while non-obedient processes are assumed to be up right from the beginning at t = 0. During start-up, an obedient process goes through the following sequence of operating modes:

Down: A process is down when it has not been created yet or has not completed booting. Messages which drop in while a process is down are lost, i.e., are not successfully received and do not generate a corresponding perception.

Up: During booting, process q's perception vectors are initialized and the receive processing is set up.
When q completes booting at some time σ_q^{-1}, we say that it gets up. Messages that are sent to process q by process p before time σ_q^{-1} − τ⁻ may be lost, since q need not be up at the time of reception. The relation between the round k perception matrix V^k(t) and the global matrix R^k if booting is considered is as follows: V_q^{p,k}(t) contains all the messages in R_p^k(q) received within the time interval [σ_q^{-1}, t], or ∅ if there is no such message. Note carefully that such late booting losses do not contribute to the failure bounds in Definition 7 and Definition 10.

Passive: Upon getting up, a process performs an initialization phase, during which it is called passive. The actions to be performed in passive mode are of course algorithm-specific. Typically, a process broadcasts a special join message as its round 0 message when it gets up. The first reception of a join message from some process p causes every already-up receiver to send an algorithm-dependent response message back to p (a point-to-point reply is sufficient here); subsequent join messages from the same sender are ignored. When p has received sufficiently many replies for constructing a sufficiently accurate view of the instantaneous system state, it terminates passive mode.^10

Active: A process that has completed its initialization phase and thus left passive mode is called active. It operates as described in the execution model in Sect. 4.1. In case of the clock synchronization algorithm given in Fig. 1, for example, a passive process broadcasts join, namely (init, 0), in order to get the last (init, ·) and (echo, ·) message of every peer, and participates in the algorithm as in active mode. It need not satisfy the clock synchronization conditions (P) and (A) while passive, however. The transition to active mode occurs when the process can be sure that it is within the synchronization precision D_max. This happens when sufficiently many messages with certain properties have been obtained.
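The down/passive/active mode sequence above can be sketched as a small state machine. The class, the reply-counting termination condition, and its threshold parameter are our own placeholder assumptions; the actual transition condition of the algorithm is message-property based and analyzed later in the paper.

```python
# Hedged sketch of the operating-mode sequence during start-up:
# down -> passive (gets up, broadcasts join) -> active.

class Process:
    def __init__(self, replies_needed):
        self.mode = "down"
        self.replies = 0
        self.replies_needed = replies_needed   # placeholder threshold

    def boot(self):
        if self.mode == "down":
            self.mode = "passive"   # gets up; would broadcast join = (init, 0)

    def on_reply(self):
        if self.mode == "passive":
            self.replies += 1
            if self.replies >= self.replies_needed:
                self.mode = "active"   # view of system state accurate enough
```

While passive, the process already participates in the algorithm but need not yet satisfy (P) and (A).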

The communication pattern for joining is slightly different from ordinary rounds: Whereas join is broadcast by a single joining process p as its round 0 message as usual, response messages are sent back immediately (maybe even point-to-point) upon reception of the first join message from p. Fortunately, response messages can also be accommodated easily in our framework: We just consider such a message as q's retransmission of some (recent) round k message, triggered by the reception of p's join. Since the joiner p must be provided with up-to-date information to construct an internal state that satisfies problem-specific invariants (such as the precision of all active clocks), this is a sensible approach.

In fact, all definitions in Sect. 4.3 remain valid when we allow processes to retransmit round k messages (with a minor caveat, see below): Process failures have been defined at the level of single computing steps; adding retransmission steps hence does not affect the process failure model in Definition 7. Communication failures are defined for single rounds; since all related definitions, in particular Equation (1), are based on sets of messages, they transparently extend to retransmissions of round k messages. Hence, the link failure model in Definition 10 applies without changes as well. Note carefully, however, that this binds together the original round k broadcast and all round k retransmissions w.r.t. link failure bounds, as they are all contained in the round k matrices S^k and R^k. A response message retransmitting the round k message of some process q upon p's join is considered as belonging to the original round k broadcast, with the actual transmission deferred until the joining receiver p is up.
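The join/response pattern described above can be sketched as follows: the first join from a given sender triggers a retransmission of the responder's most recently sent messages, and later joins from the same sender are ignored. The class and field names are our own illustrative assumptions.

```python
# Sketch of the response side of the join protocol: a point-to-point reply
# retransmits the most recent round-k broadcast(s), deferred until the joiner
# is up. State layout is hypothetical.

class Responder:
    def __init__(self):
        self.answered = set()   # senders whose join was already answered
        self.last_init = None   # most recently broadcast ("init", k), if any
        self.last_echo = None   # most recently broadcast ("echo", k'), if any

    def on_join(self, sender):
        if sender in self.answered:
            return []           # subsequent join messages are ignored
        self.answered.add(sender)
        # The reply counts as a retransmission of the original round-k
        # broadcast w.r.t. the link failure bounds of Definition 10.
        return [m for m in (self.last_init, self.last_echo) if m is not None]
```

Because the reply is bound to the original broadcast in S^k and R^k, it shares that round's link failure budget rather than consuming a fresh one.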
This deferred-retransmission peculiarity creates a minor problem with respect to timely messages, however, which must be considered by the algorithm: A retransmitted round k message is of course not sent at the same time as the original round k broadcast. Hence, the former cannot be considered timely w.r.t. the latter, but may suffer from a (late) timing failure. Consequently, Lemma 1 does not apply to (just joined) processes that have received at least one response message. This case will be treated extensively in our analysis.

5 Clock Synchronization

In the following, we define the clock synchronization problem and discuss its properties.

Definition 11 (Clock synchronization properties) Every obedient process p is equipped with an adjustable discrete clock C_p(t) that can be read at arbitrary real-times t. A correct clock synchronization algorithm guarantees the following properties:

(P) Uniform Precision: In each execution, there is some precision D_max > 0 that is independent of real-time t such that

|C_p(t) − C_q(t)| ≤ D_max    (5)

for any two obedient active processes p and q and any real-time t.

(A) Uniform Accuracy: In each execution, there exist some bounds R⁻, O⁻, R⁺, O⁺ > 0 that are independent of real-time t such that

O⁻(t_2 − t_1) − R⁻ ≤ C_p(t_2) − C_p(t_1) ≤ O⁺(t_2 − t_1) + R⁺    (6)

for any obedient active process p and any two real-times t_2 ≥ t_1.

The precision requirement (P) just states that the difference of any two correct clocks in the system must be bounded, whereas (A) guarantees some relation of the progress of clock time with respect to the progress of real-time; (A) is also called an envelope requirement in the literature. Note that (P) and (A) are uniform [21] here, i.e., they hold not only for correct but also for active benign faulty (obedient) processes. Any such process must hence satisfy the clock synchronization conditions until it (possibly) crashes.
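Conditions (5) and (6) can be checked mechanically on recorded clock traces. The following harness is our own illustration (function names and the trace representation are assumptions), not part of the paper's machinery.

```python
# Checkers for Definition 11 on clock traces. clocks maps a process name to a
# callable t -> C_p(t); times is a finite set of observation real-times.

def precision_holds(clocks, times, d_max):
    # (P): |C_p(t) - C_q(t)| <= D_max for all pairs and observation times.
    return all(abs(clocks[p](t) - clocks[q](t)) <= d_max
               for t in times for p in clocks for q in clocks)

def accuracy_holds(clock, t1, t2, o_minus, r_minus, o_plus, r_plus):
    # (A): the clock's progress over [t1, t2] stays within the linear envelope.
    progress = clock(t2) - clock(t1)
    return (o_minus * (t2 - t1) - r_minus
            <= progress
            <= o_plus * (t2 - t1) + r_plus)
```

For instance, a drift-free clock C_p(t) = 2t trivially satisfies (A) for any envelope with O⁻ ≤ 2 ≤ O⁺.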
Traditional research on clock synchronization (see [49,35,48,38] for an overview) considers synchronous systems equipped with hardware clocks with high time resolution and small drift ρ (in the range of µs/s). All executions in such systems must obey some (usually a priori known) lower and upper bounds τ⁻, τ⁺ on the end-to-end delay of successfully transmitted messages between correct processes. Consequently, in synchronous systems, the bounds D_max, R⁻, O⁻, R⁺ and O⁺ in Definition 11 are indeed constants that hold in all executions, i.e., are worst-case bounds. Optimal clock synchronization algorithms for synchronous systems like [51,16,45] guarantee O⁻ = 1 − ρ and O⁺ = 1 + ρ and very small R⁻, R⁺. The achievable precision depends primarily on the transmission delay uncertainty ε = τ⁺ − τ⁻ [30], which is in the ms-range for typical local area networks.

By contrast, in partially synchronous systems [15,33] like ours, different executions may obey different lower and upper bounds τ⁻, τ⁺ for the end-to-end delays. Every execution E(τ⁻, τ⁺) is hence parametrized by τ⁻, τ⁺, and D_max, R⁻, O⁻, R⁺ and O⁺ in Definition 11 depend on the τ⁻ and τ⁺ of the current execution. Clocks are usually implemented as simple software counters here. For example, in our paper, process p's clock will be the round number of the clock synchronization algorithm running on p: C_p(t) is incremented by 1 when process p switches to the next round. The time resolution of a clock is hence determined by the number of round switches within a given real-time interval, which is about once per round trip: Theorem 11 will reveal that our algorithm ensures O⁻ = 1/(2τ⁺) and O⁺ = 1/(2τ⁻) with reasonably small R⁻, R⁺, where the lower bound holds only if sufficiently many processes are eventually up and running.^11

Remarkably, it will turn out that the precision D_max as well as R⁻, R⁺ depend only on the delay ratio Θ = τ⁺/τ⁻, rather than on τ⁺ and τ⁻ themselves. Hence, our algorithm guarantees some constant precision even in systems with arbitrary delays, provided that Θ remains bounded by a constant in every execution. See Sect. 9.1 for further details.

5.1 The Algorithm

The clock synchronization algorithm considered in this paper is a hybrid variant of the algorithm of [55]. It is based on the well-known non-authenticated^12 clock synchronization algorithm of [51], which employs consistent broadcasting as a primitive for generating nearly simultaneous global resynchronization events in the system. Its pseudo-code is given in Fig. 1. Note carefully that our algorithm is completely time(r)-free in that it does not incorporate τ⁺ and τ⁻, and not even ε or Θ, and that it is completely message-driven, as every step is triggered by a message reception (and not by the progress of time, as in time-driven execution models).

The algorithm is given in an event-based style that consists of the 6 outermost if statements, numbered from the zeroth if (line 4) up to the fifth if (line 25). The variable k provides the clock value C_p(t) of the processor p that executes the algorithm. The second and the third if are just the instances of the fourth and fifth if for l = k, which have been incorporated explicitly for ease of explanation only and could entirely be omitted from the code. The presence of the third if in Fig. 1 implies, however, that the only part of the fifth if that is ever executed is setting mode to active when in passive mode (line 26). Note carefully that this also happens when l < k. The clock value k is never changed in the fifth if, since k is already set to l in the fourth if, which always triggers before the fifth if.

Comparison of the algorithm of Fig. 1 with hybrid variants [40,28] of the original consistent broadcasting primitive (which do not incorporate system start-up handling) shows that the first three if-clauses^13 (see lines 7 to 17) are the same: Each round k is started by sending an (init, k) message to all. When an obedient process achieves sufficient evidence that at least one obedient process has sent a round k message (line 7 or line 10), it sends (echo, k) to all. When a process can be sure that there are enough round k messages in the system, such that every obedient process will eventually reach sufficient evidence (line 13), it advances to round k+1 and hence sends (init, k+1) (line 16). This guarantees both (P) and (A) if sufficiently many correct processes are up and running right from the beginning.

In reality, however, processes complete booting at unpredictable times. Late starters could hence miss the (echo, k) and/or (init, k) message(s) of early starters. Consequently, three consecutive modes of system operation must be distinguished to properly handle system start-up:

Early mode, where the first few correct processes have completed booting and started exchanging messages. This mode terminates when the first obedient process advances its clock to 1.

Degraded mode, where sufficiently many correct processes are up such that some clocks may advance when assisted by faulty processes and transmissions.

Normal mode, where enough correct processes are up and synchronized to guarantee progress for all clocks.

^11 When implementing this algorithm, it is possible to stretch τ⁻ to obtain τ⁺ ≤ τ⁻(1 + ρ) + ε by using local bounded-drift interval timers at every sender process. This way, a situation comparable to the synchronous clock synchronization setting can be established using our message-driven algorithm (originally designed for partially synchronous systems). Note that this also reduces the message load imposed on the system; see [27] for details.

^12 We do not consider authenticated algorithms here. Besides the disadvantages of computational and communication overhead, it is never guaranteed that malicious processes cannot break the authentication scheme. Using the algorithm of Srikanth and Toueg [51], our correctness proofs cannot be invalidated by this event.

^13 We do not count the booting-related zeroth if (line 4) here.
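The init/echo/advance core of consistent broadcasting can be sketched as follows. This is our own reconstruction, not the paper's Fig. 1: the thresholds th_echo and th_advance are placeholders (classic Srikanth-Toueg consistent broadcasting uses f+1 and 2f+1 with distinct rules for init and echo evidence; the paper derives its own hybrid-failure bounds), and the catch-up and start-up handling are omitted here.

```python
# Skeleton of the first three if-clauses: (init, k) starts round k, sufficient
# evidence triggers (echo, k), and enough echoes advance the clock to k+1.

class ConsistentBroadcast:
    def __init__(self, th_echo, th_advance, send):
        self.k = 0                    # clock value = current round number
        self.echo_sent = False
        self.evidence = {}            # (tag, round) -> set of distinct senders
        self.send = send
        self.th_echo, self.th_advance = th_echo, th_advance

    def start(self):
        self.send(("init", self.k))   # each round k starts with (init, k)

    def on_message(self, sender, tag, l):
        if l != self.k:
            return                    # catch-up handling omitted in this sketch
        self.evidence.setdefault((tag, l), set()).add(sender)
        seen = lambda t: len(self.evidence.get((t, l), set()))
        if not self.echo_sent and seen("init") + seen("echo") >= self.th_echo:
            self.echo_sent = True
            self.send(("echo", self.k))   # evidence of an obedient round-k sender
        if seen("echo") >= self.th_advance:
            self.k += 1                   # advance the clock: next round begins
            self.echo_sent = False
            self.send(("init", self.k))
```

Counting distinct senders (sets, not message counts) is what makes the evidence thresholds robust against a bounded number of faulty peers.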
Note that the system remains in normal mode forever, i.e., does not return to degraded mode. Unfortunately, it is impossible for any process in the system to delimit the exact borders between those modes from local information.

In order to add the handling of system start-up to the original consistent broadcasting algorithm, join messages and two additional if-clauses are required. First, a newly booted process must tell all others that it is up now and wants to learn their current clock values. This is accomplished by means of join messages, as introduced in Sect. 4.5: Every process p sends join = (init, 0) as the very first message after having completed booting. Every process q that receives this message replies by retransmitting its previously sent (init, k) and (echo, k′) message (lines 3-6). The latter message is omitted if q did not yet send (echo, 0) in round k = 0; otherwise, k′ = k or k′ = k − 1, depending on whether (echo, k) has already been sent. This ensures that p will eventually get sufficiently many messages (which may have been lost while it was down) to trigger the catch-up rule described below.

The major problem in degraded mode is the impossibility to guarantee (P) solely via the third if-clause (line 13): There are not sufficiently many correct processes to guarantee that every obedient process will eventually advance its clock when the first one does so. Here is where the fourth if (line 18), our pivotal catch-up rule, comes into play: It allows an obedient process to advance its clock to round l when sufficiently many (echo, l) or (echo, l+1) messages have been received. Hence, eventually, a sufficiently large group of correct processes can be ensured to be within two rounds of each other. Note that the algorithm of Fig. 1 cannot guarantee that the clock of an obedient process takes on every integer value: It may leap forward due to catch-up.
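The catch-up rule just described can be isolated in a few lines. The function name, the data layout, and the threshold th_catchup are our own placeholder assumptions; the paper's exact bound is derived in its analysis.

```python
# Sketch of the catch-up rule (fourth if): jump the clock to round l when
# enough distinct processes have sent (echo, l) or (echo, l+1).

def catch_up(k, echoes, th_catchup):
    """echoes maps a round number l to the set of distinct processes from
    which (echo, l) was received; returns the (possibly advanced) clock."""
    for l in sorted(echoes, reverse=True):
        support = len(echoes.get(l, set()) | echoes.get(l + 1, set()))
        if l > k and support >= th_catchup:
            return l        # the clock leaps forward, skipping rounds
    return k
```

This is exactly why an obedient clock need not take on every integer value: a late starter jumps directly to the highest sufficiently supported round.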
The catch-up rule could cause two other problems: First, the second and third if-clauses (line 10 and line 13) must trigger when sufficiently many (echo) messages from different processes within 2 rounds have been received. Since messages from two consecutive rounds k−1, k could trigger round switching, which is not directly supported by the round model introduced in Sect. 4.3, the following convention is used: The reception of an (echo, k) message must cause perceptions for both round k and k−1. More specifically, since the reception of (echo, k) at process q implies


More information

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms Distributed Systems Principles and Paradigms Chapter 6 (version April 7, 28) Maarten van Steen Vrije Universiteit Amsterdam, Faculty of Science Dept. Mathematics and Computer Science Room R4.2. Tel: (2)

More information

Valency Arguments CHAPTER7

Valency Arguments CHAPTER7 CHAPTER7 Valency Arguments In a valency argument, configurations are classified as either univalent or multivalent. Starting from a univalent configuration, all terminating executions (from some class)

More information

CS505: Distributed Systems

CS505: Distributed Systems Cristina Nita-Rotaru CS505: Distributed Systems Ordering events. Lamport and vector clocks. Global states. Detecting failures. Required reading for this topic } Leslie Lamport,"Time, Clocks, and the Ordering

More information

AGREEMENT PROBLEMS (1) Agreement problems arise in many practical applications:

AGREEMENT PROBLEMS (1) Agreement problems arise in many practical applications: AGREEMENT PROBLEMS (1) AGREEMENT PROBLEMS Agreement problems arise in many practical applications: agreement on whether to commit or abort the results of a distributed atomic action (e.g. database transaction)

More information

Asynchronous Models For Consensus

Asynchronous Models For Consensus Distributed Systems 600.437 Asynchronous Models for Consensus Department of Computer Science The Johns Hopkins University 1 Asynchronous Models For Consensus Lecture 5 Further reading: Distributed Algorithms

More information

Reliable Broadcast for Broadcast Busses

Reliable Broadcast for Broadcast Busses Reliable Broadcast for Broadcast Busses Ozalp Babaoglu and Rogerio Drummond. Streets of Byzantium: Network Architectures for Reliable Broadcast. IEEE Transactions on Software Engineering SE- 11(6):546-554,

More information

Fault-Tolerant Consensus

Fault-Tolerant Consensus Fault-Tolerant Consensus CS556 - Panagiota Fatourou 1 Assumptions Consensus Denote by f the maximum number of processes that may fail. We call the system f-resilient Description of the Problem Each process

More information

Gradient Clock Synchronization

Gradient Clock Synchronization Noname manuscript No. (will be inserted by the editor) Rui Fan Nancy Lynch Gradient Clock Synchronization the date of receipt and acceptance should be inserted later Abstract We introduce the distributed

More information

Communication Predicates: A High-Level Abstraction for Coping with Transient and Dynamic Faults

Communication Predicates: A High-Level Abstraction for Coping with Transient and Dynamic Faults Communication Predicates: A High-Level Abstraction for Coping with Transient and Dynamic Faults Martin Hutle martin.hutle@epfl.ch André Schiper andre.schiper@epfl.ch École Polytechnique Fédérale de Lausanne

More information

Easy Consensus Algorithms for the Crash-Recovery Model

Easy Consensus Algorithms for the Crash-Recovery Model Reihe Informatik. TR-2008-002 Easy Consensus Algorithms for the Crash-Recovery Model Felix C. Freiling, Christian Lambertz, and Mila Majster-Cederbaum Department of Computer Science, University of Mannheim,

More information

Recap. CS514: Intermediate Course in Operating Systems. What time is it? This week. Reminder: Lamport s approach. But what does time mean?

Recap. CS514: Intermediate Course in Operating Systems. What time is it? This week. Reminder: Lamport s approach. But what does time mean? CS514: Intermediate Course in Operating Systems Professor Ken Birman Vivek Vishnumurthy: TA Recap We ve started a process of isolating questions that arise in big systems Tease out an abstract issue Treat

More information

Distributed Consensus

Distributed Consensus Distributed Consensus Reaching agreement is a fundamental problem in distributed computing. Some examples are Leader election / Mutual Exclusion Commit or Abort in distributed transactions Reaching agreement

More information

CS505: Distributed Systems

CS505: Distributed Systems Department of Computer Science CS505: Distributed Systems Lecture 10: Consensus Outline Consensus impossibility result Consensus with S Consensus with Ω Consensus Most famous problem in distributed computing

More information

Time. To do. q Physical clocks q Logical clocks

Time. To do. q Physical clocks q Logical clocks Time To do q Physical clocks q Logical clocks Events, process states and clocks A distributed system A collection P of N single-threaded processes (p i, i = 1,, N) without shared memory The processes in

More information

Distributed Computing. Synchronization. Dr. Yingwu Zhu

Distributed Computing. Synchronization. Dr. Yingwu Zhu Distributed Computing Synchronization Dr. Yingwu Zhu Topics to Discuss Physical Clocks Logical Clocks: Lamport Clocks Classic paper: Time, Clocks, and the Ordering of Events in a Distributed System Lamport

More information

Logical Time. 1. Introduction 2. Clock and Events 3. Logical (Lamport) Clocks 4. Vector Clocks 5. Efficient Implementation

Logical Time. 1. Introduction 2. Clock and Events 3. Logical (Lamport) Clocks 4. Vector Clocks 5. Efficient Implementation Logical Time Nicola Dragoni Embedded Systems Engineering DTU Compute 1. Introduction 2. Clock and Events 3. Logical (Lamport) Clocks 4. Vector Clocks 5. Efficient Implementation 2013 ACM Turing Award:

More information

Time Synchronization

Time Synchronization Massachusetts Institute of Technology Lecture 7 6.895: Advanced Distributed Algorithms March 6, 2006 Professor Nancy Lynch Time Synchronization Readings: Fan, Lynch. Gradient clock synchronization Attiya,

More information

Decentralized Control of Discrete Event Systems with Bounded or Unbounded Delay Communication

Decentralized Control of Discrete Event Systems with Bounded or Unbounded Delay Communication Decentralized Control of Discrete Event Systems with Bounded or Unbounded Delay Communication Stavros Tripakis Abstract We introduce problems of decentralized control with communication, where we explicitly

More information

Proving Safety Properties of the Steam Boiler Controller. Abstract

Proving Safety Properties of the Steam Boiler Controller. Abstract Formal Methods for Industrial Applications: A Case Study Gunter Leeb leeb@auto.tuwien.ac.at Vienna University of Technology Department for Automation Treitlstr. 3, A-1040 Vienna, Austria Abstract Nancy

More information

Chapter 11 Time and Global States

Chapter 11 Time and Global States CSD511 Distributed Systems 分散式系統 Chapter 11 Time and Global States 吳俊興 國立高雄大學資訊工程學系 Chapter 11 Time and Global States 11.1 Introduction 11.2 Clocks, events and process states 11.3 Synchronizing physical

More information

Degradable Agreement in the Presence of. Byzantine Faults. Nitin H. Vaidya. Technical Report #

Degradable Agreement in the Presence of. Byzantine Faults. Nitin H. Vaidya. Technical Report # Degradable Agreement in the Presence of Byzantine Faults Nitin H. Vaidya Technical Report # 92-020 Abstract Consider a system consisting of a sender that wants to send a value to certain receivers. Byzantine

More information

Our Problem. Model. Clock Synchronization. Global Predicate Detection and Event Ordering

Our Problem. Model. Clock Synchronization. Global Predicate Detection and Event Ordering Our Problem Global Predicate Detection and Event Ordering To compute predicates over the state of a distributed application Model Clock Synchronization Message passing No failures Two possible timing assumptions:

More information

Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms. CS 249 Project Fall 2005 Wing Wong

Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms. CS 249 Project Fall 2005 Wing Wong Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms CS 249 Project Fall 2005 Wing Wong Outline Introduction Asynchronous distributed systems, distributed computations,

More information

The Heard-Of Model: Computing in Distributed Systems with Benign Failures

The Heard-Of Model: Computing in Distributed Systems with Benign Failures The Heard-Of Model: Computing in Distributed Systems with Benign Failures Bernadette Charron-Bost Ecole polytechnique, France André Schiper EPFL, Switzerland Abstract Problems in fault-tolerant distributed

More information

Do we have a quorum?

Do we have a quorum? Do we have a quorum? Quorum Systems Given a set U of servers, U = n: A quorum system is a set Q 2 U such that Q 1, Q 2 Q : Q 1 Q 2 Each Q in Q is a quorum How quorum systems work: A read/write shared register

More information

Dynamic Group Communication

Dynamic Group Communication Dynamic Group Communication André Schiper Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne, Switzerland e-mail: andre.schiper@epfl.ch Abstract Group communication is the basic infrastructure

More information

Agreement. Today. l Coordination and agreement in group communication. l Consensus

Agreement. Today. l Coordination and agreement in group communication. l Consensus Agreement Today l Coordination and agreement in group communication l Consensus Events and process states " A distributed system a collection P of N singlethreaded processes w/o shared memory Each process

More information

Byzantine agreement with homonyms

Byzantine agreement with homonyms Distrib. Comput. (013) 6:31 340 DOI 10.1007/s00446-013-0190-3 Byzantine agreement with homonyms Carole Delporte-Gallet Hugues Fauconnier Rachid Guerraoui Anne-Marie Kermarrec Eric Ruppert Hung Tran-The

More information

Efficient Construction of Global Time in SoCs despite Arbitrary Faults

Efficient Construction of Global Time in SoCs despite Arbitrary Faults Efficient Construction of Global Time in SoCs despite Arbitrary Faults Christoph Lenzen Massachusetts Institute of Technology Cambridge, MA, USA clenzen@csail.mit.edu Matthias Függer, Markus Hofstätter,

More information

Automatic Synthesis of Distributed Protocols

Automatic Synthesis of Distributed Protocols Automatic Synthesis of Distributed Protocols Rajeev Alur Stavros Tripakis 1 Introduction Protocols for coordination among concurrent processes are an essential component of modern multiprocessor and distributed

More information

Section 6 Fault-Tolerant Consensus

Section 6 Fault-Tolerant Consensus Section 6 Fault-Tolerant Consensus CS586 - Panagiota Fatourou 1 Description of the Problem Consensus Each process starts with an individual input from a particular value set V. Processes may fail by crashing.

More information

Genuine atomic multicast in asynchronous distributed systems

Genuine atomic multicast in asynchronous distributed systems Theoretical Computer Science 254 (2001) 297 316 www.elsevier.com/locate/tcs Genuine atomic multicast in asynchronous distributed systems Rachid Guerraoui, Andre Schiper Departement d Informatique, Ecole

More information

Causality and Time. The Happens-Before Relation

Causality and Time. The Happens-Before Relation Causality and Time The Happens-Before Relation Because executions are sequences of events, they induce a total order on all the events It is possible that two events by different processors do not influence

More information

Gradient Clock Synchronization in Dynamic Networks Fabian Kuhn, Thomas Locher, and Rotem Oshman

Gradient Clock Synchronization in Dynamic Networks Fabian Kuhn, Thomas Locher, and Rotem Oshman Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2009-022 May 29, 2009 Gradient Clock Synchronization in Dynamic Networks Fabian Kuhn, Thomas Locher, and Rotem Oshman

More information

Consensus. Consensus problems

Consensus. Consensus problems Consensus problems 8 all correct computers controlling a spaceship should decide to proceed with landing, or all of them should decide to abort (after each has proposed one action or the other) 8 in an

More information

Failure Detectors. Seif Haridi. S. Haridi, KTHx ID2203.1x

Failure Detectors. Seif Haridi. S. Haridi, KTHx ID2203.1x Failure Detectors Seif Haridi haridi@kth.se 1 Modeling Timing Assumptions Tedious to model eventual synchrony (partial synchrony) Timing assumptions mostly needed to detect failures Heartbeats, timeouts,

More information

1 Introduction. 1.1 The Problem Domain. Self-Stablization UC Davis Earl Barr. Lecture 1 Introduction Winter 2007

1 Introduction. 1.1 The Problem Domain. Self-Stablization UC Davis Earl Barr. Lecture 1 Introduction Winter 2007 Lecture 1 Introduction 1 Introduction 1.1 The Problem Domain Today, we are going to ask whether a system can recover from perturbation. Consider a children s top: If it is perfectly vertically, you can

More information

Optimal Clock Synchronization

Optimal Clock Synchronization Optimal Clock Synchronization T. K. SRIKANTH AND SAM TOUEG Cornell University, Ithaca, New York Abstract. We present a simple, efficient, and unified solution to the problems of synchronizing, initializing,

More information

Time. Today. l Physical clocks l Logical clocks

Time. Today. l Physical clocks l Logical clocks Time Today l Physical clocks l Logical clocks Events, process states and clocks " A distributed system a collection P of N singlethreaded processes without shared memory Each process p i has a state s

More information

C 1. Recap: Finger Table. CSE 486/586 Distributed Systems Consensus. One Reason: Impossibility of Consensus. Let s Consider This

C 1. Recap: Finger Table. CSE 486/586 Distributed Systems Consensus. One Reason: Impossibility of Consensus. Let s Consider This Recap: Finger Table Finding a using fingers Distributed Systems onsensus Steve Ko omputer Sciences and Engineering University at Buffalo N102 86 + 2 4 N86 20 + 2 6 N20 2 Let s onsider This

More information

Slides for Chapter 14: Time and Global States

Slides for Chapter 14: Time and Global States Slides for Chapter 14: Time and Global States From Coulouris, Dollimore, Kindberg and Blair Distributed Systems: Concepts and Design Edition 5, Addison-Wesley 2012 Overview of Chapter Introduction Clocks,

More information

Lower Bounds for Achieving Synchronous Early Stopping Consensus with Orderly Crash Failures

Lower Bounds for Achieving Synchronous Early Stopping Consensus with Orderly Crash Failures Lower Bounds for Achieving Synchronous Early Stopping Consensus with Orderly Crash Failures Xianbing Wang 1, Yong-Meng Teo 1,2, and Jiannong Cao 3 1 Singapore-MIT Alliance, 2 Department of Computer Science,

More information

Abstract. The paper considers the problem of implementing \Virtually. system. Virtually Synchronous Communication was rst introduced

Abstract. The paper considers the problem of implementing \Virtually. system. Virtually Synchronous Communication was rst introduced Primary Partition \Virtually-Synchronous Communication" harder than Consensus? Andre Schiper and Alain Sandoz Departement d'informatique Ecole Polytechnique Federale de Lausanne CH-1015 Lausanne (Switzerland)

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 & Clocks, Clocks, and the Ordering of Events in a Distributed System. L. Lamport, Communications of the ACM, 1978 Notes 15: & Clocks CS 347 Notes

More information

Failure Detection and Consensus in the Crash-Recovery Model

Failure Detection and Consensus in the Crash-Recovery Model Failure Detection and Consensus in the Crash-Recovery Model Marcos Kawazoe Aguilera Wei Chen Sam Toueg Department of Computer Science Upson Hall, Cornell University Ithaca, NY 14853-7501, USA. aguilera,weichen,sam@cs.cornell.edu

More information

Decentralized Control of Discrete Event Systems with Bounded or Unbounded Delay Communication 1

Decentralized Control of Discrete Event Systems with Bounded or Unbounded Delay Communication 1 Decentralized Control of Discrete Event Systems with Bounded or Unbounded Delay Communication 1 Stavros Tripakis 2 VERIMAG Technical Report TR-2004-26 November 2004 Abstract We introduce problems of decentralized

More information

A Weakest Failure Detector for Dining Philosophers with Eventually Bounded Waiting and Failure Locality 1

A Weakest Failure Detector for Dining Philosophers with Eventually Bounded Waiting and Failure Locality 1 A Weakest Failure Detector for Dining Philosophers with Eventually Bounded Waiting and Failure Locality 1 Hyun Chul Chung, Jennifer L. Welch Department of Computer Science & Engineering Texas A&M University

More information

A Realistic Look At Failure Detectors

A Realistic Look At Failure Detectors A Realistic Look At Failure Detectors C. Delporte-Gallet, H. Fauconnier, R. Guerraoui Laboratoire d Informatique Algorithmique: Fondements et Applications, Université Paris VII - Denis Diderot Distributed

More information

CptS 464/564 Fall Prof. Dave Bakken. Cpt. S 464/564 Lecture January 26, 2014

CptS 464/564 Fall Prof. Dave Bakken. Cpt. S 464/564 Lecture January 26, 2014 Overview of Ordering and Logical Time Prof. Dave Bakken Cpt. S 464/564 Lecture January 26, 2014 Context This material is NOT in CDKB5 textbook Rather, from second text by Verissimo and Rodrigues, chapters

More information

Simple Bivalency Proofs of the Lower Bounds in Synchronous Consensus Problems

Simple Bivalency Proofs of the Lower Bounds in Synchronous Consensus Problems Simple Bivalency Proofs of the Lower Bounds in Synchronous Consensus Problems Xianbing Wang, Yong-Meng Teo, and Jiannong Cao Singapore-MIT Alliance E4-04-10, 4 Engineering Drive 3, Singapore 117576 Abstract

More information

Immediate Detection of Predicates in Pervasive Environments

Immediate Detection of Predicates in Pervasive Environments Immediate Detection of redicates in ervasive Environments Ajay Kshemkalyani University of Illinois at Chicago November 30, 2010 A. Kshemkalyani (U Illinois at Chicago) Immediate Detection of redicates......

More information

On the weakest failure detector ever

On the weakest failure detector ever On the weakest failure detector ever The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher Guerraoui, Rachid

More information

Towards optimal synchronous counting

Towards optimal synchronous counting Towards optimal synchronous counting Christoph Lenzen Joel Rybicki Jukka Suomela MPI for Informatics MPI for Informatics Aalto University Aalto University PODC 5 July 3 Focus on fault-tolerance Fault-tolerant

More information

Cuts. Cuts. Consistent cuts and consistent global states. Global states and cuts. A cut C is a subset of the global history of H

Cuts. Cuts. Consistent cuts and consistent global states. Global states and cuts. A cut C is a subset of the global history of H Cuts Cuts A cut C is a subset of the global history of H C = h c 1 1 hc 2 2...hc n n A cut C is a subset of the global history of H The frontier of C is the set of events e c 1 1,ec 2 2,...ec n n C = h

More information

The Weakest Failure Detector for Wait-Free Dining under Eventual Weak Exclusion

The Weakest Failure Detector for Wait-Free Dining under Eventual Weak Exclusion The Weakest Failure Detector for Wait-Free Dining under Eventual Weak Exclusion Srikanth Sastry Computer Science and Engr Texas A&M University College Station, TX, USA sastry@cse.tamu.edu Scott M. Pike

More information

DISTRIBUTED COMPUTER SYSTEMS

DISTRIBUTED COMPUTER SYSTEMS DISTRIBUTED COMPUTER SYSTEMS SYNCHRONIZATION Dr. Jack Lange Computer Science Department University of Pittsburgh Fall 2015 Topics Clock Synchronization Physical Clocks Clock Synchronization Algorithms

More information

Network Algorithms and Complexity (NTUA-MPLA) Reliable Broadcast. Aris Pagourtzis, Giorgos Panagiotakos, Dimitris Sakavalas

Network Algorithms and Complexity (NTUA-MPLA) Reliable Broadcast. Aris Pagourtzis, Giorgos Panagiotakos, Dimitris Sakavalas Network Algorithms and Complexity (NTUA-MPLA) Reliable Broadcast Aris Pagourtzis, Giorgos Panagiotakos, Dimitris Sakavalas Slides are partially based on the joint work of Christos Litsas, Aris Pagourtzis,

More information

The Weighted Byzantine Agreement Problem

The Weighted Byzantine Agreement Problem The Weighted Byzantine Agreement Problem Vijay K. Garg and John Bridgman Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 78712-1084, USA garg@ece.utexas.edu,

More information

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each

More information

Real-Time Course. Clock synchronization. June Peter van der TU/e Computer Science, System Architecture and Networking

Real-Time Course. Clock synchronization. June Peter van der TU/e Computer Science, System Architecture and Networking Real-Time Course Clock synchronization 1 Clocks Processor p has monotonically increasing clock function C p (t) Clock has drift rate For t1 and t2, with t2 > t1 (1-ρ)(t2-t1)

More information

Meeting the Deadline: On the Complexity of Fault-Tolerant Continuous Gossip

Meeting the Deadline: On the Complexity of Fault-Tolerant Continuous Gossip Meeting the Deadline: On the Complexity of Fault-Tolerant Continuous Gossip Chryssis Georgiou Seth Gilbert Dariusz R. Kowalski Abstract In this paper, we introduce the problem of Continuous Gossip in which

More information

CS 425 / ECE 428 Distributed Systems Fall Indranil Gupta (Indy) Oct. 5, 2017 Lecture 12: Time and Ordering All slides IG

CS 425 / ECE 428 Distributed Systems Fall Indranil Gupta (Indy) Oct. 5, 2017 Lecture 12: Time and Ordering All slides IG CS 425 / ECE 428 Distributed Systems Fall 2017 Indranil Gupta (Indy) Oct. 5, 2017 Lecture 12: Time and Ordering All slides IG Why Synchronization? You want to catch a bus at 6.05 pm, but your watch is

More information

Optimal and Player-Replaceable Consensus with an Honest Majority Silvio Micali and Vinod Vaikuntanathan

Optimal and Player-Replaceable Consensus with an Honest Majority Silvio Micali and Vinod Vaikuntanathan Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2017-004 March 31, 2017 Optimal and Player-Replaceable Consensus with an Honest Majority Silvio Micali and Vinod Vaikuntanathan

More information

Impossibility Results for Universal Composability in Public-Key Models and with Fixed Inputs

Impossibility Results for Universal Composability in Public-Key Models and with Fixed Inputs Impossibility Results for Universal Composability in Public-Key Models and with Fixed Inputs Dafna Kidron Yehuda Lindell June 6, 2010 Abstract Universal composability and concurrent general composition

More information

Time is an important issue in DS

Time is an important issue in DS Chapter 0: Time and Global States Introduction Clocks,events and process states Synchronizing physical clocks Logical time and logical clocks Global states Distributed debugging Summary Time is an important

More information

Failure detection and consensus in the crash-recovery model

Failure detection and consensus in the crash-recovery model Distrib. Comput. (2000) 13: 99 125 c Springer-Verlag 2000 Failure detection and consensus in the crash-recovery model Marcos Kawazoe Aguilera 1, Wei Chen 2, Sam Toueg 1 1 Department of Computer Science,

More information

Synchrony Weakened by Message Adversaries vs Asynchrony Restricted by Failure Detectors

Synchrony Weakened by Message Adversaries vs Asynchrony Restricted by Failure Detectors Synchrony Weakened by Message Adversaries vs Asynchrony Restricted by Failure Detectors Michel RAYNAL, Julien STAINER Institut Universitaire de France IRISA, Université de Rennes, France Message adversaries

More information

Concurrent Non-malleable Commitments from any One-way Function

Concurrent Non-malleable Commitments from any One-way Function Concurrent Non-malleable Commitments from any One-way Function Margarita Vald Tel-Aviv University 1 / 67 Outline Non-Malleable Commitments Problem Presentation Overview DDN - First NMC Protocol Concurrent

More information

Signature-Free Broadcast-Based Intrusion Tolerance: Never Decide a Byzantine Value

Signature-Free Broadcast-Based Intrusion Tolerance: Never Decide a Byzantine Value Signature-Free Broadcast-Based Intrusion Tolerance: Never Decide a Byzantine Value Achour Mostefaoui, Michel Raynal To cite this version: Achour Mostefaoui, Michel Raynal. Signature-Free Broadcast-Based

More information

Consensus when failstop doesn't hold

Consensus when failstop doesn't hold Consensus when failstop doesn't hold FLP shows that can't solve consensus in an asynchronous system with no other facility. It can be solved with a perfect failure detector. If p suspects q then q has

More information