Clock Synchronization in the Presence of Omission and Performance Failures, and Processor Joins

Flaviu Cristian, Houtan Aghili and Ray Strong
IBM Research, Almaden Research Center

Abstract

This paper presents a simple practical protocol for synchronizing clocks in a distributed system. Synchronization consists of maintaining logical clocks which run at roughly the speed of a correct hardware clock and are within some known bound of each other. Synchronization is achieved by periodically computing adjustments to the hardware clocks present in the system. The protocol is tolerant of any number of omission failures (e.g. processor crashes, link crashes, occasional message losses) and performance failures (e.g. overloaded processors, slow links) that do not partition the communications network, and handles any number of simultaneous processor joins.

An earlier version of this paper was presented at the 16th IEEE Int. Symp. on Fault-Tolerant Computing Systems, Vienna, July 1-4, 1986. Flaviu Cristian is now with the University of California, San Diego. Houtan Aghili is now with IBM Research, T. J. Watson Research Center.

1 Introduction

Consider a set of processors interconnected in a distributed system to perform certain distributed computations, where each processor is equipped with a hardware clock. If one wants to measure the time elapsed between the occurrences of two events of a computation local to a processor p, one can instantaneously read the values of the local hardware clock register when these events occur and compute their difference. A different method must be devised if the intention is to measure the time elapsed between two events of a distributed computation. For instance, if an event e_p occurs on a processor p and another event e_q occurs on a different processor q, it is practically impossible for q to instantaneously read its hardware clock when the remote event e_p occurs. Indeed, the sending of a message from p to q that notifies the occurrence of e_p entails a random transmission delay that makes it impossible for q to know exactly the value displayed by q's hardware clock at the instant when e_p occurred in p. Hence, q (and by a similar argument p) cannot compute exactly the time elapsed between e_p and e_q by relying only on their own clocks.

The problem of measuring the time elapsed between occurrences of distributed events would be easily solved if the processor clocks could be exactly synchronized, that is, if at any instant of a postulated Newtonian time referential, all clocks read the same value. It would then be sufficient for p to send the value of its local clock reading when e_p occurs to q, and for q to subtract this from the value of its clock reading when e_q occurs. Unfortunately, the uncertainty in message transmission delays inherent in distributed systems makes exact clock synchronization impossible.

This paper presents a protocol for maintaining approximately synchronized clocks in a loosely coupled distributed system. The protocol maintains on each correctly functioning processor p a logical clock C_p which measures the passage of real time with an accuracy comparable to that of a hardware clock and which, at any instant t, displays a clock time C_p(t) that is within some known bound DMAX of the clock times displayed by all other logical clocks C_q running on other correct system processors q:

    ∀ t : |C_p(t) − C_q(t)| < DMAX

Such logical clocks allow one to measure with an a priori known accuracy the time that elapses between events which occur on distinct processors.
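To make the use of such a bound concrete, the following small sketch (not part of the original protocol; the function name and the DMAX value are illustrative assumptions) shows how an application could estimate the real-time separation of two events time-stamped on different processors whose logical clocks are synchronized to within DMAX:

    # Minimal sketch (assumed names): estimating the time elapsed between two
    # events time-stamped on different processors with DMAX-synchronized clocks.

    DMAX = 0.020  # assumed bound on |C_p(t) - C_q(t)|, in seconds

    def elapsed_between(stamp_p: float, stamp_q: float) -> tuple[float, float]:
        """Return (estimate, uncertainty) for the time from event e_p to e_q.

        stamp_p is C_p read when e_p occurred; stamp_q is C_q read when e_q
        occurred. Because the two clocks may disagree by up to DMAX, the true
        elapsed real time lies within +/- DMAX of the difference of the
        readings, up to clock drift over the measured interval.
        """
        estimate = stamp_q - stamp_p
        return estimate, DMAX

    if __name__ == "__main__":
        est, err = elapsed_between(stamp_p=12.004, stamp_q=12.250)
        print(f"e_q occurred {est:.3f} s after e_p, to within +/- {err:.3f} s")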

A logical clock is maintained by periodically computing an adjustment function for a hardware clock. We refer the reader to discussions of various methods for maintaining smooth logical clocks by amortizing adjustments in [DHSS] and [CS].

One of our main objectives was to design a protocol that is practical. Such a protocol should work in the presence of events that are reasonably likely to occur (e.g. processor crashes and joins, message losses or delays in communication or processing), yet be simple enough to be understandable, correctly implementable, and maintainable.

2 Failure Classification

Let P denote the set of processors of a distributed system and L the set of physical links between the processors in P. Processors and links undergo certain state transitions in response to service requests. For example, a link (p,q) ∈ L that joins processor p ∈ P to processor q ∈ P delivers a message to q if p so requests. Similarly, a processor p computes a given function if its user so requests. System components, such as processors and communication links, are correct if, to any incoming service request, they respond in a manner consistent with their specification. Service specifications prescribe the state transition that a component should undergo in response to a service request and the real time interval within which the transition should occur. A component failure occurs when a component does not deliver the service requested in the manner specified.

We distinguish among three general failure classes [CASD]. If the component never responds to a service request it suffers an omission failure. If the component delivers a requested service either too early or too late (i.e. outside the real time interval specified) it suffers a timing failure. We call late timing failures performance failures. If a component delivers a service different from the one requested or delivers unrequested "services" it suffers an arbitrary or Byzantine failure. Typical examples of omission failures are processor crashes, hardware clocks that stop running, link breakdowns, processors that occasionally do not relay messages they should, and links that sometimes lose messages. Examples of performance failures are occasional message delays caused by overloaded relaying processors, and hardware clocks running at a speed lower than (1+ρ)^(-1), where ρ is the maximum drift rate specified by the clock manufacturer. An example of an early timing failure is a hardware clock that runs at a speed that exceeds 1+ρ. Examples of Byzantine failures are an undetectable message corruption on a link, due to electro-magnetic noise or human sabotage, or a processor that sends two messages "the time is 10:00am" and "the time is 11:00am" when the correct time is midnight.
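As a concrete illustration of this taxonomy (not taken from the paper; the function, its parameters, and the returned labels are assumptions), a monitor that knows the specified response window for a request could classify an observed outcome as follows:

    # Hypothetical classifier for the failure classes described above.
    # spec_earliest/spec_latest bound the real-time window in which a correct
    # component must deliver the requested service.

    from typing import Optional

    def classify(delivered: bool, correct_content: bool,
                 delivery_time: Optional[float],
                 spec_earliest: float, spec_latest: float) -> str:
        if not delivered:
            return "omission failure"               # never responded
        if not correct_content:
            return "arbitrary (Byzantine) failure"  # wrong or unrequested service
        if delivery_time < spec_earliest:
            return "early timing failure"
        if delivery_time > spec_latest:
            return "performance failure"            # late timing failure
        return "correct"

    # Example: a relay forwards the right message but 2 s after its deadline.
    print(classify(True, True, delivery_time=12.0,
                   spec_earliest=9.0, spec_latest=10.0))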

The system components that suffer failures are eventually taken off the system by maintenance personnel for repair or replacement. After maintenance, such components join the system of active components by explicitly executing a join protocol.

3 Assumptions

A hardware clock consists of an oscillator, which generates a cyclic waveform at a uniform rate, and a counting register, which records the number of cycles elapsed since the start of the clock. We assume that the hardware clocks of the processors that compose the system are driven by quartz crystal controlled oscillators that are highly stable and accurate. The use of this type of clock is very common in modern computer architectures (e.g. the IBM 4300, 3080 and 3090 series). Typically, a correct quartz clock may drift from real time at a rate of at most ρ, where ρ is on the order of 10^(-6) seconds per second.

Quartz clocks are not only highly stable, but are also extremely reliable. Current experience indicates that the average quartz clock used in medium to high-end digital computers has a mean time between failures (MTBF) measured in years, and that good clocks, like those used in military applications, can have MTBFs expressed in hundreds of years [MIL]. Many of the clock failures likely to occur in practice can be detected by the error detecting circuitry incorporated in clock chips. For example, if the counting register that composes a clock is self-checking, the occurrence of a physical failure within it will generate (with high probability) a clock-error exception. If a detectable physical failure affects a hardware clock, any attempt at reading its value terminates with a clock-error exception [IBM370]. Given the very significant MTBFs observed for current quartz based clock chips, and the extensive error detecting circuitry built into such chips, in this paper we will assume that the likelihood of undetectable clock failure occurrences is negligible compared to other sources of system failures (for a precise interpretation of what "negligible" means, we refer to [Cr]).

Let HC(t) denote the value displayed by a hardware clock HC at some real time t. (As in [LM,DHSS], we write the variables and constants that range over real time in lower case, and the variables and constants ranging over clock time in upper case.) We can formulate our assumption concerning the high reliability of a hardware clock by saying that a hardware clock is within a linear envelope of real time (which runs by definition with speed 1):

(A1) After it is powered on, the hardware clock HC of a processor measures the passage of time between any two successive real time instants t_1, t_2 correctly:

    (1+ρ)^(-1)(t_2 − t_1) − G < HC(t_2) − HC(t_1) < (1+ρ)(t_2 − t_1) + G.

A clock that fails signals a clock-error exception whenever an attempt at reading its value is made. G is a constant depending on the granularity of the hardware clock. For simplicity, in what follows we will assume G = 0.

Given that by hypothesis (A1) correct clocks drift from real time by at most ρ, one can infer that the amount by which two correct clocks can drift apart from each other during t real time units is at most (1+ρ)t − (1+ρ)^(-1)t = ((2ρ+ρ²)/(1+ρ))t. We denote by dr ≡ (2ρ+ρ²)/(1+ρ) (the relative clock drift rate) the factor which, when multiplied by a real time interval length, gives the net amount by which hardware clocks could drift apart in the worst case during that time interval.

The next assumption is not necessary for the correct functioning of our algorithm, but it simplifies computing performance estimates.

(A2) Let N be the maximum number of processors participating in the protocol. Then the rate of drift ρ is sufficiently small that 3N(2ρ + ρ²) < 1.

The next assumption concerns the normal speed at which messages can be sent over correct links between two processes running on adjacent correct processors:

(A3) A message sent from a correct processor p to a correct processor q over a correct link (p,q) arrives at q and is processed at q in less than ldel (link delay) real time units.

If a message sent from p to a neighbor q needs more than ldel real time units to arrive at q, or never arrives, then at least one of the processors p, q or the link (p,q) has experienced a failure.
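The following short sketch (illustrative only; the parameter values are assumptions, not figures from the paper) evaluates the relative drift rate dr and checks assumption (A2) for a given drift bound ρ and system size N:

    # Worst-case relative drift between two correct clocks, and the (A2) check.
    # rho and N below are illustrative values, not figures from the paper.

    rho = 1e-6          # maximum hardware clock drift rate (seconds per second)
    N = 16              # maximum number of processors in the protocol

    dr = (2 * rho + rho**2) / (1 + rho)   # relative clock drift rate
    assert 3 * N * (2 * rho + rho**2) < 1, "assumption (A2) violated"

    # Two correct clocks can drift apart by at most dr * t during t real
    # seconds, e.g. over a 60 s resynchronization period:
    print(f"dr = {dr:.3e}; worst-case divergence over 60 s = {dr * 60:.2e} s")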

The fourth assumption states that, during clock synchronization, any two correct active processors in P are linked by at least one chain of correct links and correct intermediate processors. That is,

(A4) No partition of the system of correct processors and links occurs during clock synchronization.

If sufficiently many redundant physical communication paths exist between any two processors in a network, the likelihood of this hypothesis being violated can be made negligible. Assumptions (A3,A4) allow us to conclude that, if a synchronization message is sent by a correct processor p to a correct processor q, then there is a chain of correct links and intermediate processors over which the message can require no more than ndel = (N−1)·ldel (network delay) real time units between p and q, where N is the maximum number of processors in the system. (For a better, but more complex, upper bound on the network delay which would guarantee a closer synchronization of clocks, the interested reader is referred to [CASD].)

Our protocol is designed to tolerate omission and performance failures that do not partition the network of correct processors. We acknowledge the possibility of other types of failures (e.g. very fast clocks, or sabotaged processors). Such failures can in principle occur, and there are several more complex protocols that have been designed to handle them [LM,LL,DHSS]. We have chosen improved simplicity and performance at the expense of a chance of loss of synchronization in the presence of these rare failure types. Thus, rather than aiming at generality and power, the goal was to favor simplicity and practicality and to aim for those applications where the likelihood of early timing or Byzantine clock failures causing major damage is negligible. Our intention is to develop a protocol that can handle the overwhelming majority of failures that are likely to occur in practice, yet is simple to understand, prove correct, implement, and maintain.
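As a worked illustration of the network-delay bound (the numbers are assumed, continuing the earlier sketch), ndel follows directly from the link delay and the maximum number of processors:

    # Upper bound on the time for a synchronization message to reach any
    # correct processor over a chain of correct links (illustrative values).

    ldel = 0.005            # link delay bound: transmission + processing (s)
    N = 16                  # maximum number of processors in the system

    ndel = (N - 1) * ldel   # network delay bound from assumptions (A3,A4)
    print(f"ndel = {ndel:.3f} s")   # 0.075 s for these values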

4 Objectives

The goal of the clock synchronization protocol is to ensure the following three properties:

(C1) For any two correct joined processors p, q ∈ P, the clocks C_p, C_q indicating the current logical time should be synchronized within some a priori known bound DMAX (for maximum deviation):

    ∃ DMAX : ∀ t : |C_p(t) − C_q(t)| < DMAX

(C2) The clocks of joined correct processors should read times within a linear envelope of real time. That is, there should exist constants such that for any clock C_p and any real time t:

    X + (1+ρ)^(-1)·t < C_p(t) < X' + (1+ρ)·t,

where X, X' are constants which depend on the initial conditions of the clock synchronization algorithm execution.

(C3) A correct processor that joins a set of synchronized processors should have its clock synchronized with those of the other correct processors within some a priori known real time delay jdel (join delay). Also, in the absence of processor joins, it is required that each periodic clock synchronization terminate within some known real time delay sdel (synchronization delay).

A protocol that achieves (C1,C2) is said to achieve linear envelope clock synchronization.

5 Informal Algorithm Presentation

Our algorithm is based on information diffusion [CASD], [DHSS]. It is simpler than [DHSS] because we limit the class of failures to be tolerated to omission and performance failures. It also uses a simpler method for handling processor joins.

The protocol to be presented is based on the following consequence of assumptions (A3,A4): if at real time t a correct processor p ∈ P diffuses a message containing its clock time T to all other processors, and each correct processor q ∈ P sets its clock to T upon receipt of a message from p, then the clocks of all correct processors will indicate times within ndel(1+ρ) of each other by real time t + ndel.

By message diffusion we mean the following process: processor p sends a new synch message on all outgoing links, and any processor q that receives a new synch message on some link relays it on all other outgoing links. We call such a synchronization message diffusion a synch wave.
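To illustrate the diffusion (flooding) pattern just described, here is a small stand-alone sketch over an explicit neighbor graph; it is illustrative only and deliberately ignores clocks, sequence numbers, and failures (the topology and names are assumptions):

    # Minimal synch-wave flooding sketch over a static topology.

    from collections import deque

    neighbors = {                      # assumed topology: who is linked to whom
        "p": ["q", "r"],
        "q": ["p", "s"],
        "r": ["p", "s"],
        "s": ["q", "r"],
    }

    def diffuse(initiator: str, message: str) -> set:
        """Deliver `message` to every processor reachable from `initiator`:
        each processor that receives it for the first time relays it on all
        links other than the one it arrived on."""
        reached = {initiator}
        queue = deque([(initiator, None)])     # (processor, link it arrived on)
        while queue:
            proc, came_from = queue.popleft()
            for nbr in neighbors[proc]:
                if nbr != came_from and nbr not in reached:
                    reached.add(nbr)           # nbr processes the message...
                    queue.append((nbr, proc))  # ...and relays it onward
        return reached

    print(diffuse("p", "synch"))   # every connected correct processor is reached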

An informal (not quite accurate) picture of how our protocol works is provided by assuming that one clock is sufficiently faster than all others, so that its time messages diffuse in a synch wave causing each other correct clock to set its time ahead to match the time of the wave. In this section we make this assumption. In the following two sections we give a more formal and more accurate description and proof of correctness of our protocol.

Although immediately after a synch wave propagation the clocks of all correct processors are within NDEL = ndel·(1+ρ) of each other, as time passes the clocks will naturally tend to drift apart. For instance, t real time units after the end of a synch wave propagation, the correct clocks might be as far apart as ndel(1+ρ) + dr·t. If the intention is to keep the processor clocks close at all times, one has to periodically re-synchronize the clocks. If PER is the clock synchronization period length (in clock time units), then in the interval between two successive synchronization waves numbered s, s+1, the clocks might drift as far apart as D = ndel(1+ρ) + dr·(PER·(1+ρ) + ndel). It is the role of the (s+1)th synchronization wave to bring the clocks back within ndel(1+ρ).

In the absence of processor crashes or joins, one could use a predefined synchronizer processor to generate synch waves. If processor crashes are likely, and they certainly are, the existence of a unique synchronizer becomes a single point of failure. As observed in [DHSS], it is better to distribute the role of synchronizer among all processors. The idea is that any processor should be able to initiate a synch wave if it discovers that PER clock time units have elapsed since the last synchronization occurred. If (as we assume for this section) one clock is sufficiently fast, then its synch wave will happen before any others and make the others unnecessary.

Synch waves also have to be generated when new processors join a cluster of already synchronized processors, in order to synchronize the clocks of the new processors with the clocks of the old processors. In such a case a joiner p sends a special "new" message to all its neighbors, forcing them to initiate synch waves. The neighbors of these neighbors either propagate these waves, if their clocks are slower than the clocks of the wave initiators, or generate new synch waves, if their clocks are faster. After at most ndel real time units, a 'winning' synch wave is generated in this way by some processor with a fastest clock. When this propagates to all the other processors, including the ones that are joining, they will all synchronize their clocks within ndel(1+ρ). Thus, within at most 2·ndel real time units from the moment a join demand is made by a processor p ∈ P, a winning synch wave is reflected back to p. At that moment, p is joined. That is, its clock is at most ndel(1+ρ) apart from the clocks of previously joined correct processors.
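The following worked computation (with the same assumed parameter values as the earlier sketches) evaluates the worst-case divergence D between two successive synch waves:

    # Worst-case clock divergence D between two successive synch waves,
    # using the formula above with illustrative parameter values.

    rho = 1e-6
    ldel = 0.005
    N = 16
    PER = 60.0                              # resynchronization period (clock s)

    ndel = (N - 1) * ldel
    dr = (2 * rho + rho**2) / (1 + rho)

    D = ndel * (1 + rho) + dr * (PER * (1 + rho) + ndel)
    print(f"just after a wave:              {ndel * (1 + rho):.6f} s")
    print(f"worst case before the next one: D = {D:.6f} s")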

In the next section we give the protocol, and in the following section we discuss how this informal discussion must be modified to provide for the case when there is no winning synch wave.

6 Detailed Algorithm Description

A detailed description of the clock synchronization protocol is given in Figure 1. This description is made in terms of two abstract data types: Logical-Clock and Timer. Instances C and TP of these data types can be declared as shown in line 2 of Figure 1. Users of an instance C of the Logical-Clock data type can perform the following operations on it. An invocation of a C.initialize operation initializes the time displayed by C to 0. The operation C.adjust(L,T:Time) adjusts the local time L currently displayed by C so that after PER time units C will show the same time as a logical clock which currently shows time T (assuming that the clocks run at roughly the same speed). Such an adjustment can be implemented either by bumping the local clock to T, or by slightly increasing the speed of the local clock so as to catch up with the remote clock [C], [CS]. The operation C.read reads the current value displayed by C. The operation C.duration(T:Time), used to measure time intervals, reads the number of time units elapsed between a previous time T and the present time.

The Timer data type has a unique operation "set(T:Time)". If TP is a Timer instance, the meaning of invoking the operation TP.set(T) is "ignore all previous TP.set calls and signal a Timeout condition T clock time units from now." Thus, if after invoking TP.set(100) at time 200, a new invocation is made at time 250, there is no Timeout condition at time 300, but there might be one at time 350. If no other invocation of TP.set is made between 250 and 349, then a Timeout condition occurs at time 350. For convenience of presentation, we use two independent timers TP and TJ (although one is in principle sufficient). The former is used to measure the maximum time interval which can elapse between periodic resynchronizations. The latter is used to time the join process.

The protocol uses the following communication primitives: receive(m,l), which receives a message m on some link and returns the identity l of that link; forward(m,l), which sends a message m on all outgoing links except l; and send-all(m), which sends m on all outgoing links. We do not assume that the forward and send operations are atomic with respect to failure occurrences, i.e. a processor can fail after sending a given message on certain links and before sending it on the remaining links.
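To make the two abstract data types concrete, here is a minimal single-process sketch in Python (an illustrative reading of the interface above, not the paper's implementation); it realizes adjust by bumping the clock forward rather than by amortizing the adjustment over the next PER time units as discussed in [CS]:

    import time

    class LogicalClock:
        """Hardware clock (time.monotonic here) plus an additive correction."""
        def initialize(self) -> None:
            self._base = time.monotonic()
            self._corr = 0.0
        def read(self) -> float:
            return time.monotonic() - self._base + self._corr
        def duration(self, t: float) -> float:
            return self.read() - t       # time units elapsed since earlier reading t
        def adjust(self, l: float, t: float) -> None:
            if t > l:                    # clocks are never set back
                self._corr += t - l      # bump forward to the remote time

    class Timer:
        """set(T) cancels any earlier setting and arms a timeout T units from now."""
        def __init__(self, clock: LogicalClock) -> None:
            self._clock, self._deadline = clock, None
        def set(self, t: float) -> None:
            self._deadline = self._clock.read() + t
        def expired(self) -> bool:       # polled stand-in for the Timeout condition
            return self._deadline is not None and self._clock.read() >= self._deadline

    c = LogicalClock(); c.initialize()
    tp = Timer(c); tp.set(100.0)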

1  task Time-Manager ::=
2    var L,T: Time; C: Logical-Clock; TP,TJ: Timer;
3        s,s': Natural-Number; joined: Boolean; l: Link;
4    s := 0; C.initialize; joined := false;
5    send-all("new"); TJ.set(NDEL + TDEL);
6    cycle
7      select
8        receive("new",l) → s := s + 1; send-all(s, C.read);
9        []
10       receive(s',T,l) → L := C.read;
11         if
12           (s' < s) ∨ (s' = s & T ≤ L)  → loop;
13           []
14           (s' = s) & (T > L)           → C.adjust(L,T); forward((s,T),l);
15           []
16           (s' > s) & (T < L)           → s := s'; send-all(s,L);
17           []
18           (s' > s) & (T ≥ L)           → s := s'; C.adjust(L,T); forward((s,T),l);
19         fi;
20       []
21       Timeout TJ → joined := true;
22       []
23       Timeout TP → s := s + 1; L := C.read; C.adjust(L,L); send-all(s,L);
24     endselect;
25     TP.set(PER);
26   endcycle;

Figure 1.

At processor start, the local synch wave sequence number s and the current local clock time are initialized to 0 (line 4). Then a join phase that lasts for NDEL + TDEL time units begins with the sending of a special "new" message on all outgoing links (line 5). As before, NDEL = (1+ρ)·ndel. The constant TDEL is slightly larger than NDEL.
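The guarded alternatives of lines 11-18 can be read as a pure decision rule. The sketch below (illustrative Python with assumed names and action labels, not part of the protocol text) returns what a processor does with an incoming synch message (s',T) given its local state (s,L):

    # Decision rules of Figure 1, lines 11-18, as a side-effect-free function.

    def on_synch_message(s: int, L: float, s_prime: int, T: float):
        """Return (action, new_s) for a received synch message (s_prime, T)
        when the local state is sequence number s and local clock time L."""
        if s_prime < s or (s_prime == s and T <= L):
            return ("ignore", s)                    # line 12: old or not newer wave
        if s_prime == s and T > L:
            return ("adjust_and_forward", s)        # line 14: catch up, relay wave
        if s_prime > s and T < L:
            return ("initiate_own_wave", s_prime)   # line 16: our clock is faster
        return ("adjust_and_forward", s_prime)      # line 18: new wave, catch up

    # Example: a processor with s = 3, L = 101.2 receives wave (4, 100.0);
    # its clock is ahead, so it adopts the sequence number and starts its own wave.
    print(on_synch_message(3, 101.2, 4, 100.0))     # ('initiate_own_wave', 4)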

A real time duration tdel is defined in the next section, and TDEL = (1+ρ)·tdel. Under assumption A2 we may choose tdel = 2·ndel. This gives the particularly simple join delay jdel = 3·ndel. During the join phase, the "joined" Boolean variable is false, and nothing can be said about how close the local clock is to other clocks of the cluster being joined. At the end of the join phase (line 21), "joined" becomes true and measurements of delays elapsed between distributed event occurrences can begin.

The Time-Manager can be awakened by three kinds of events: a "new" message that arrives from a neighbor that joins (line 8), a message belonging to a synch wave numbered s' that announces that the time is T (line 10), and a Timeout condition generated by the timers TJ or TP (lines 21, 23). The reception of a "new" message results in an attempt to generate a 'winning' wave with a new sequence number s' = s+1 and local time L = C.read. The Boolean tests executed by a processor when a message (s',T) belonging to such a wave is received ensure that either the processor forwards the wave (s',T) to all its neighbors (if T ≥ L holds, see lines 14, 18), or that it will itself attempt to initiate a wave (if T < L is true, see line 16). In this way, a "new" message issued by a non-isolated processor causes the new sequence number s' to diffuse to all correct processors within real time ndel.

The Timeout condition can become true either at the end of the join phase (line 21) or if a period of more than PER time units (as measured on the local clock) has elapsed since the last join or periodic synchronization without receiving any "new" or significant "(s',T)" message (line 23). In the joined state, this event triggers the generation of a synch wave with a new sequence number s' = s+1 and local logical time L = C.read. If the new wave is winning (i.e. in all the processors reached by it the condition T ≥ L is true), by the end of its propagation any two correct processors will have their clocks within ndel(1+ρ) time units of each other. In general we will show that, in spite of concurrent synch waves none of which diffuses throughout the network, the clocks of correct processors will be synchronized to within 3·ndel(1+ρ) time units. Then they will drift apart for at most (1+ρ)·PER real time units before they are resynchronized by the protocol.
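Continuing the earlier numeric sketches (assumed values, not figures from the paper), the clock-time constants used by the join phase follow directly from ndel and ρ:

    # Join-phase constants for the parameter values used in the earlier sketches.

    rho = 1e-6
    ldel = 0.005
    N = 16

    ndel = (N - 1) * ldel           # real-time network delay bound
    tdel = 2 * ndel                 # choice permitted by assumption (A2)
    NDEL = (1 + rho) * ndel         # clock-time equivalents used by the protocol
    TDEL = (1 + rho) * tdel
    jdel = 3 * ndel                 # a joiner is synchronized within this real time

    print(f"TJ.set(NDEL + TDEL) arms a timeout of {NDEL + TDEL:.4f} clock-time units")
    print(f"join delay jdel = {jdel:.4f} s")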

7 Algorithm Analysis

To prove the correctness of our protocol we use a technique similar to that of [DHSS]. Since the objective of our synchronization protocol is to provide tolerance of only omission and performance failures, our simpler protocol provides stronger properties than those of [DHSS] (in the sense that the accuracy of the logical clocks is not worse - as is the case in [DHSS] - than that of a correct hardware clock, and that the maximum adjustment by which a logical clock can be set forward is a constant smaller than that of [DHSS]).

Theorem 1. The algorithm achieves linear envelope synchronization among all joined clocks.

We say that an execution of a communication protocol is a diffusion for proposition X if, when a processor first knows (because of some change of state such as the receipt of a message or a new reading of a clock) the information contained in X, it forwards that information to its neighbors. It is easy to see that the maximum real time required for information to diffuse throughout a maximal connected component of a network is bounded above by the product of the maximum time required for message transmission and processing between neighbors and the diameter of the component. Our upper bound on this real time is ndel. Let M(t) be the maximum time on any clock C_p at real time t with p joined and correct. The theorem will be proved using the following lemmas.

Lemma 1. There is a constant ndel such that if t is the first real time any processor has s = i, then some processor r initiates a diffusion by executing send-all(i,T) at t (where T is C_r(t)), and all correct processors have s ≥ i by real time t + ndel.

Proof: The lemma follows from the observation that our protocol diffuses the information that s ≥ i. □

Lemma 2. There is a constant tdel such that if any processor executes send-all(i,T) at real time t, then all correct processors have s ≥ i and C ≥ T by real time t + tdel.

Proof: If each execution of send-all(i,T) constituted a diffusion of the proposition s ≥ i and C ≥ T, then we could set tdel = ndel and be done. Unfortunately, if processor p already has s ≥ i when it receives (i,T), and if C_p ≥ T at the time of receipt, then

p ignores and does not forward the information. Thus, while the information s ≥ i diffuses in time ndel, the information C ≥ T may require a longer time to reach all processors. However, a correct processor p fails to forward the message (i,T) only if it has already sent or forwarded a message (i,T') with T' < T.

We now prove (by induction on d) that if correct processor p executes send-all(i,T) at real time t, and if correct processor q is separated from p by a chain of correct processors and links with no more than d links, then q has C ≥ T by real time t + (d/(1 − dβ))(ldel + β·ndel), where β abbreviates 2ρ + ρ². The case d = 1 is trivial because q receives and processes the message from p by real time t + ldel. Now assume that we have proved the result for correct r at distance d from p and consider a neighbor q of r at distance d+1 from p. If r sends message (i,T) to q then we are done, so assume that r sent a message (i,T') to q and does not adjust its clock again until after it has C = T. It will be helpful to distinguish the following real times:

e_1 is the time at which r sends (i,T').
e_2 is the time at which q processes (i,T') from r.
e_3 is the time at which p sends (i,T).
e_4 is the time at which r has C ≥ T.
e_5 is the time at which q has C ≥ T.

By induction hypothesis we have

    u = e_4 − e_3 ≤ (d/(1 − dβ))(ldel + β·ndel).

By Lemma 1,

    v = e_3 − e_1 ≤ ndel.

The duration e_5 − e_2 represents the same clock time T − T' on q that the duration e_4 − e_1 represents on r, so one can be no larger than (1+ρ)² = 1+β times the other:

    w = e_5 − e_2 ≤ (1 + β)(u + v).

Also,

    x = e_5 − e_1 ≤ w + ldel.

Thus

    y = e_5 − e_3 = x − v ≤ (1 + β)u + ldel + β·v.

Straightforward algebraic substitution gives

    y ≤ ((d+1)/(1 − dβ))(ldel + β·ndel) ≤ ((d+1)/(1 − (d+1)β))(ldel + β·ndel).

This completes the inductive proof. By assumption A2 we can then take tdel = 2·ndel. □

Lemma 3. There is a constant Δ and a sequence of real times {t_i} with 0 < t_{i+1} − t_i < (1+ρ)·PER, such that if [t_i, t_i + Δ] is a subinterval of the time interval in which processors p and q are both joined and correct, then C_q(t_i + Δ) ≥ C_p(t_i).

Proof: Let Δ = tdel + ndel. Let t_i be the first real time at which some correct joined processor sets s = i. At t_i some processor initiates a diffusion propagating the information s ≥ i and C ≥ T, where (i,T) is the contents of the initial message. Within ndel this information will have reached every other correct joined processor, including p. Let t be the real time at which p first sets s ≥ i. Then t_i ≤ t ≤ t_i + ndel. Also, C_p(t) ≥ C_p(t_i). If T ≥ C_p(t) then C_q(t_i + ndel) ≥ T ≥ C_p(t_i). If T < C_p(t) then C_q(t + tdel) ≥ C_p(t_i). In either case C_q(t_i + ndel + tdel) ≥ C_p(t_i). □

Lemma 4. If [u,v] is a subinterval of the interval in which processor p is joined and correct, then

    (1+ρ)^(-1)(v − u) ≤ C_p(v) − C_p(u) ≤ (1+ρ)(v − u) + M(u) − C_p(u).

Proof: Since clocks are never set back, we need only show that C_p(v) ≤ (1+ρ)(v − u) + M(u). Because we are only considering omission and performance failures, and because a message from a processor that has not violated the corresponding relationship cannot cause the recipient to violate it, the relationship C_p(v) ≤ (1+ρ)(v − u) + M(u) holds for each correct processor p. □

Proof of Theorem 1: Assume that processor p is joined and correct during the interval [t_i, t_{i+1} + Δ]. By Lemma 3, C_p(t_i + Δ) ≥ M(t_i). By Lemma 4, M(t_i + Δ) ≤ M(t_i) + (1+ρ)Δ. Thus M(t_i + Δ) − C_p(t_i + Δ) ≤ (1+ρ)Δ. Consider t in [t_i + Δ, t_{i+1} + Δ]:

    M(t) − C_p(t) ≤ (1+ρ)Δ + (1+ρ)(t − (t_i + Δ)) − (1+ρ)^(-1)(t − (t_i + Δ)).

Let DMAX = (1+ρ)Δ + (2ρ + ρ²)·PER. Then M(t) − C_p(t) ≤ DMAX. Thus

    (1+ρ)^(-1)(v − u) ≤ C_p(v) − C_p(u) ≤ (1+ρ)(v − u) + DMAX

when [u,v] is a subinterval of the interval in which p is joined and correct. Moreover, the maximum difference between the readings of correct joined clocks is bounded by DMAX. □
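Putting the pieces together numerically (with the same assumed parameter values as the earlier sketches), the maximum deviation DMAX guaranteed by Theorem 1 can be evaluated as follows:

    # DMAX = (1 + rho) * Delta + (2*rho + rho**2) * PER, with Delta = tdel + ndel
    # and tdel = 2 * ndel (illustrative parameter values).

    rho = 1e-6
    ldel = 0.005
    N = 16
    PER = 60.0

    ndel = (N - 1) * ldel
    tdel = 2 * ndel
    delta = tdel + ndel                 # the constant of Lemma 3 (= 3 * ndel)

    DMAX = (1 + rho) * delta + (2 * rho + rho**2) * PER
    print(f"Delta = {delta:.3f} s, DMAX = {DMAX:.6f} s")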

8 Conclusion

This paper has presented a new, simple solution to the problem of synchronizing the clocks of a distributed system in the presence of likely failures such as omission and performance failures, and in the presence of processor joins. Because of the simpler failure model considered, the protocol presented is considerably simpler than those presented in [LM,DHSS,LL], especially in the handling of processor joins. The engineering approach adopted is similar in spirit to the one adopted in [KO], where protocol simplicity is achieved by limiting the total number of failures that can be tolerated during a synchronization.

Synchronized clocks are useful for a number of reasons. They can be used to totally order the events of a distributed system [L] (e.g. merging the data base logs generated on distinct computers into a common unique log), and they can be used to measure the time that elapses between events that occur on different processors (e.g. to do performance evaluation of distributed systems). Another important application is in ensuring consistency among the knowledge states of the computers of a distributed system. An earlier paper [CASD] has presented protocols for ensuring the consistency of replicated data that depend on synchronized clocks.

References

[C] F. Cristian: Probabilistic Clock Synchronization, Distributed Computing, Vol. 3.

[Cr] F. Cristian: Understanding Fault-Tolerant Distributed Systems, Communications of the ACM, Vol. 34, No. 2, Feb. 1991, and erratum in CACM Vol. 34, No. 4, April 1991.

[CS] F. Cristian and F. Schmuck: Continuous Clock Amortization Need Not Affect the Precision of a Clock Synchronization Algorithm, IBM Research Report RJ 7290.

[CASD] F. Cristian, H. Aghili, R. Strong, D. Dolev: Fault-Tolerant Atomic Broadcast Protocols, Proc. 15th Int. Conf. on Fault-Tolerant Computing, Ann Arbor, Michigan, June 1985.

[DHSS] D. Dolev, J. Halpern, B. Simons, R. Strong: Fault-Tolerant Clock Synchronization, IBM Research Report RJ 4094.

[IBM370] IBM System/370: Principles of Operation, IBM publication GA22-7000.

[KO] H. Kopetz, W. Ochsenreiter: Internal Clock Synchronization with a VLSI Synchronization Unit, TR 1985/7, Technical Univ. Vienna, 1985.

[L] L. Lamport: Time, Clocks, and the Ordering of Events in a Distributed System, Comm. of the ACM, Vol. 21, No. 7, July 1978, pp. 558-565.

[LM] L. Lamport, M. Melliar-Smith: Synchronizing Clocks in the Presence of Faults, Journal of the ACM, Vol. 32, No. 1, January 1985, pp. 52-78.

[LL] J. Lundelius, N. Lynch: A New Fault-Tolerant Algorithm for Clock Synchronization, Proc. of the 3rd ACM Symposium on Principles of Distributed Computing, 1984.

[MIL] MIL Handbook 217D, Notice 1, 13 June 1983.


C 1. Recap: Finger Table. CSE 486/586 Distributed Systems Consensus. One Reason: Impossibility of Consensus. Let s Consider This Recap: Finger Table Finding a using fingers Distributed Systems onsensus Steve Ko omputer Sciences and Engineering University at Buffalo N102 86 + 2 4 N86 20 + 2 6 N20 2 Let s onsider This

More information

CS505: Distributed Systems

CS505: Distributed Systems Cristina Nita-Rotaru CS505: Distributed Systems Ordering events. Lamport and vector clocks. Global states. Detecting failures. Required reading for this topic } Leslie Lamport,"Time, Clocks, and the Ordering

More information

Time is an important issue in DS

Time is an important issue in DS Chapter 0: Time and Global States Introduction Clocks,events and process states Synchronizing physical clocks Logical time and logical clocks Global states Distributed debugging Summary Time is an important

More information

Gradient Clock Synchronization in Dynamic Networks Fabian Kuhn, Thomas Locher, and Rotem Oshman

Gradient Clock Synchronization in Dynamic Networks Fabian Kuhn, Thomas Locher, and Rotem Oshman Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2009-022 May 29, 2009 Gradient Clock Synchronization in Dynamic Networks Fabian Kuhn, Thomas Locher, and Rotem Oshman

More information

Finally the Weakest Failure Detector for Non-Blocking Atomic Commit

Finally the Weakest Failure Detector for Non-Blocking Atomic Commit Finally the Weakest Failure Detector for Non-Blocking Atomic Commit Rachid Guerraoui Petr Kouznetsov Distributed Programming Laboratory EPFL Abstract Recent papers [7, 9] define the weakest failure detector

More information

A posteriori agreement. for clock synchronization. on broadcast networks. INESC Technical Report RT/ L. Rodrigues, P. Verssimo.

A posteriori agreement. for clock synchronization. on broadcast networks. INESC Technical Report RT/ L. Rodrigues, P. Verssimo. A posteriori agreement for clock synchronization on broadcast networks INESC Technical Report RT/62-92 L. Rodrigues, P. Verssimo January 1992 LIMITED DISTRIBUTION NOTICE A shorter version of this report

More information

Unreliable Failure Detectors for Reliable Distributed Systems

Unreliable Failure Detectors for Reliable Distributed Systems Unreliable Failure Detectors for Reliable Distributed Systems A different approach Augment the asynchronous model with an unreliable failure detector for crash failures Define failure detectors in terms

More information

Bridging the Gap: Byzantine Faults and Self-stabilization

Bridging the Gap: Byzantine Faults and Self-stabilization Bridging the Gap: Byzantine Faults and Self-stabilization Thesis submitted for the degree of DOCTOR of PHILOSOPHY by Ezra N. Hoch Submitted to the Senate of The Hebrew University of Jerusalem September

More information

Resolving Message Complexity of Byzantine. Agreement and Beyond. 1 Introduction

Resolving Message Complexity of Byzantine. Agreement and Beyond. 1 Introduction Resolving Message Complexity of Byzantine Agreement and Beyond Zvi Galil Alain Mayer y Moti Yung z (extended summary) Abstract Byzantine Agreement among processors is a basic primitive in distributed computing.

More information

Impossibility of Distributed Consensus with One Faulty Process

Impossibility of Distributed Consensus with One Faulty Process Impossibility of Distributed Consensus with One Faulty Process Journal of the ACM 32(2):374-382, April 1985. MJ Fischer, NA Lynch, MS Peterson. Won the 2002 Dijkstra Award (for influential paper in distributed

More information

Methods for the specification and verification of business processes MPB (6 cfu, 295AA)

Methods for the specification and verification of business processes MPB (6 cfu, 295AA) Methods for the specification and verification of business processes MPB (6 cfu, 295AA) Roberto Bruni http://www.di.unipi.it/~bruni 17 - Diagnosis for WF nets 1 Object We study suitable diagnosis techniques

More information

Dynamic Fault-Tolerant Clock Synchronization

Dynamic Fault-Tolerant Clock Synchronization Dynamic Fault-Tolerant Clock Synchronization DANNY DOLEV, JOSEPH Y. HALPERN, BARBARA SIMONS, AND RAY STRONG IBM Almaden Research Cente~ San Jose, California Abstract. This paper gives two simple efficient

More information

Chapter 7 HYPOTHESIS-BASED INVESTIGATION OF DIGITAL TIMESTAMPS. 1. Introduction. Svein Willassen

Chapter 7 HYPOTHESIS-BASED INVESTIGATION OF DIGITAL TIMESTAMPS. 1. Introduction. Svein Willassen Chapter 7 HYPOTHESIS-BASED INVESTIGATION OF DIGITAL TIMESTAMPS Svein Willassen Abstract Timestamps stored on digital media play an important role in digital investigations. However, the evidentiary value

More information

Easy Consensus Algorithms for the Crash-Recovery Model

Easy Consensus Algorithms for the Crash-Recovery Model Reihe Informatik. TR-2008-002 Easy Consensus Algorithms for the Crash-Recovery Model Felix C. Freiling, Christian Lambertz, and Mila Majster-Cederbaum Department of Computer Science, University of Mannheim,

More information

with the ability to perform a restricted set of operations on quantum registers. These operations consist of state preparation, some unitary operation

with the ability to perform a restricted set of operations on quantum registers. These operations consist of state preparation, some unitary operation Conventions for Quantum Pseudocode LANL report LAUR-96-2724 E. Knill knill@lanl.gov, Mail Stop B265 Los Alamos National Laboratory Los Alamos, NM 87545 June 1996 Abstract A few conventions for thinking

More information

On-line Bin-Stretching. Yossi Azar y Oded Regev z. Abstract. We are given a sequence of items that can be packed into m unit size bins.

On-line Bin-Stretching. Yossi Azar y Oded Regev z. Abstract. We are given a sequence of items that can be packed into m unit size bins. On-line Bin-Stretching Yossi Azar y Oded Regev z Abstract We are given a sequence of items that can be packed into m unit size bins. In the classical bin packing problem we x the size of the bins and try

More information

RANDOM DISTRIBUTED ALGORITHMS FOR CLOCK SYNCHRONIZATION

RANDOM DISTRIBUTED ALGORITHMS FOR CLOCK SYNCHRONIZATION Engineering Journal of Qatar University, Vol. 5, 1992, p. 145-162 RANDOM DISTRIBUTED ALGORITHMS FOR CLOCK SYNCHRONIZATION Shoichiro Nakai*, Nasser Marafih** Shingo Fukui* and Satoshi Hasegawa* *C&C Systems

More information

Byzantine agreement with homonyms

Byzantine agreement with homonyms Distrib. Comput. (013) 6:31 340 DOI 10.1007/s00446-013-0190-3 Byzantine agreement with homonyms Carole Delporte-Gallet Hugues Fauconnier Rachid Guerraoui Anne-Marie Kermarrec Eric Ruppert Hung Tran-The

More information

On Stabilizing Departures in Overlay Networks

On Stabilizing Departures in Overlay Networks On Stabilizing Departures in Overlay Networks Dianne Foreback 1, Andreas Koutsopoulos 2, Mikhail Nesterenko 1, Christian Scheideler 2, and Thim Strothmann 2 1 Kent State University 2 University of Paderborn

More information

Dynamic Group Communication

Dynamic Group Communication Dynamic Group Communication André Schiper Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne, Switzerland e-mail: andre.schiper@epfl.ch Abstract Group communication is the basic infrastructure

More information

Impossibility Results for Universal Composability in Public-Key Models and with Fixed Inputs

Impossibility Results for Universal Composability in Public-Key Models and with Fixed Inputs Impossibility Results for Universal Composability in Public-Key Models and with Fixed Inputs Dafna Kidron Yehuda Lindell June 6, 2010 Abstract Universal composability and concurrent general composition

More information

1 Introduction A one-dimensional burst error of length t is a set of errors that are conned to t consecutive locations [14]. In this paper, we general

1 Introduction A one-dimensional burst error of length t is a set of errors that are conned to t consecutive locations [14]. In this paper, we general Interleaving Schemes for Multidimensional Cluster Errors Mario Blaum IBM Research Division 650 Harry Road San Jose, CA 9510, USA blaum@almaden.ibm.com Jehoshua Bruck California Institute of Technology

More information

Model Checking of Fault-Tolerant Distributed Algorithms

Model Checking of Fault-Tolerant Distributed Algorithms Model Checking of Fault-Tolerant Distributed Algorithms Part I: Fault-Tolerant Distributed Algorithms Annu Gmeiner Igor Konnov Ulrich Schmid Helmut Veith Josef Widder LOVE 2016 @ TU Wien Josef Widder (TU

More information