IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 4, APRIL 1997

Isotach Networks

Paul F. Reynolds, Jr., Member, IEEE Computer Society, Craig Williams, and Raymond R. Wagner, Jr.

Abstract-We introduce a class of networks called isotach networks designed to reduce the cost of synchronization in parallel computations. Isotach networks maintain an invariant that allows each process to control the logical times at which its messages are received and consequently executed. This control allows processes to pipeline operations without sacrificing sequential consistency and to send isochrons, groups of operations that appear to be received and executed as an indivisible unit. Isochrons allow processes to execute atomic actions without locks. Other uses of isotach networks include ensuring causal message delivery and consistency among replicated data. Isotach networks are characterized by this invariant, not by their topology. They can be implemented in a wide variety of configurations, including NUMA (nonuniform memory access) multiprocessors. Empirical and analytic studies of isotach synchronization techniques show that they outperform conventional techniques, in some cases by an order of magnitude or more. Results presented here assume fault-free systems; we are exploring extensions to selected failure models.

Index Terms-Logical time, interprocess coordination, concurrency control, isochronicity, atomicity, sequential consistency, interconnection networks, multiprocessor systems.

1 INTRODUCTION

Isotach networks are a new class of networks designed to reduce synchronization costs in parallel computations by providing stronger guarantees about the order in which messages are delivered. In existing networks, a process can neither predict nor control the order in which its messages are delivered relative to concurrently issued messages. Few existing networks offer ordering guarantees any stronger than FIFO delivery order among messages with the same source and destination. As a result, a process must use delays and higher-level synchronization mechanisms such as locks to ensure that execution is consistent with the program's constraints. Isotach networks offer a guarantee called the isotach invariant that allows processes to predict and control the logical times at which their messages are received. This invariant serves as the basis for some efficient new synchronization techniques. The invariant is expressed in logical time because a physical time guarantee would be prohibitively expensive, whereas the logical time guarantee can be enforced cheaply, using purely local knowledge. Isotach networks are characterized by this invariant, not by any particular topology or programming model. They can be implemented in a wide variety of configurations, including NUMA (nonuniform memory access) multiprocessors.

Our performance studies show that isotach synchronization techniques are more efficient than conventional techniques. As contention for shared data increases, conventional techniques cease to perform acceptably, but isotach techniques continue to perform well.

P.F. Reynolds, Jr. and C. Williams are with the Department of Computer Science, Thornton Hall, University of Virginia, Charlottesville, VA. E-mail: {reynolds, craigw}@virginia.edu.

R.R. Wagner, Jr. is with The Vermont Software Company, Inc., Baldwin House, PO Box 731, Wells River, VT. E-mail: Ray.Wagner@VTSoftware.com.

Manuscript received Feb. 13, 1995.
For information on obtaining reprints of this article, please send e-mail to: transpds@computer.org, and reference IEEECS Log Number D.

Our focus in this paper is on shared memory model (SMM) computations on tightly-coupled multiprocessors. There are two reasons for this focus:

1) We assume a fault-free system. This assumption is standard in discussions of synchronization techniques for SMM computations on tightly coupled systems but not in distributed systems, where dealing with faulty communications is a major concern.

2) Unlike many other synchronization techniques, isotach techniques are suitable for the fine-grained synchronization required for SMM computations.

Isotach techniques are also applicable to message-based model computations, but their use in distributed computing requires finding a way to relax the assumption of a fault-free system. We address this issue in the discussion of ongoing work in Section 5.

The principal contributions of this paper are in 1) defining isotach logical time, the logical time system that incorporates the isotach invariant; 2) giving an efficient, distributed algorithm for implementing isotach logical time; and 3) showing how to use isotach logical time to enforce two basic synchronization properties, sequential consistency and isochronicity. An execution is sequentially consistent if operations specified by the program to be executed in a specific order appear to be executed in the order specified [26], and is isochronous if operations specified by the program to be executed at the same time appear to be executed at the same time. As explained in Section 3, isochronicity and atomicity are similar, though not identical, concepts.

Enforcing sequential consistency and isochronicity is trivial, given an assumption of fault-freedom, on systems in which processes communicate over a single shared broadcast medium such as a shared bus.

However, such systems do not scale. Point-to-point networks scale, but enforcing sequential consistency and isochronicity is hard on point-to-point networks, even with the assumption of fault-freedom, and existing algorithms are expensive. The isotach algorithms for enforcing isochronicity and sequential consistency are unique in that they are both scalable (they have low message complexity and no intrinsic bottleneck such as a shared bus) and light-weight (they are efficient and require only a single communication round).

Section 2 of this paper defines isotach logical time and shows how it can be implemented. Section 3 describes isotach synchronization techniques. Section 4 reports on our performance studies, and Section 5 summarizes the paper and describes ongoing work.

2 ISOTACH LOGICAL TIME

A logical time system is a set of constraints on the way in which events of interest are ordered, i.e., assigned logical times. Isotach logical time is an extension of the logical time system defined by Lamport in his classic paper on ordering events in distributed systems [25]. We begin by describing Lamport's system. We then define isotach logical time and give an algorithm for implementing isotach logical time using network switches and interfaces.

In Lamport's system, the events of interest are the sending and receiving of messages. Times assigned to these events are required to be consistent with the happened-before relation, a relation over send and receive events that captures the notion of potential causality: Event a happened before event b, denoted a → b, if 1) a and b occur at the same process and a occurs before b; 2) a is the event of sending message m and b is the event of receiving the same message m; or 3) there exists some event c such that a → c and c → b. In Lamport's system, a → b ⇒ t(a) < t(b). (Refer to Fig. 1 for notation relating to logical time.) Lamport gives a simple distributed algorithm for assigning times consistent with this constraint. Each process has its own logical clock, a variable that records the time assigned to the last local event. When it sends a message, a process increments its clock and timestamps the message with the new time. When it receives a message, a process sets its clock to one more than the maximum of the current time and the timestamp of the incoming message.
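To make these clock rules concrete, here is a minimal sketch in Python (ours, not the paper's; the class and method names are illustrative):

    # Lamport's clock rules: increment on send; on receive, jump past the
    # incoming timestamp. Guarantees a -> b implies t(a) < t(b).
    class LamportClock:
        def __init__(self):
            self.time = 0  # time assigned to the last local event

        def send(self, payload):
            self.time += 1               # sending is a new local event
            return (self.time, payload)  # timestamp the outgoing message

        def receive(self, message):
            timestamp, payload = message
            # One more than the max of the local clock and the timestamp.
            self.time = max(self.time, timestamp) + 1
            return payload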
Logical time systems have been proposed that extend Lamport's system by adding the constraint t(a) < t(b) ⇒ a → b (thus t(a) < t(b) if and only if a → b). For a recent survey, see [37]. ISIS [7] is a messaging layer that incorporates this extension to Lamport's logical time system. Isotach logical time also extends Lamport's time system, but in a different way. In isotach logical time, the additional constraint is that the times assigned be consistent with an invariant relating the logical distance a message travels to the logical time it takes to reach its destination.

2.1 Defining Isotach Logical Time

In an isotach logical time system, times are lexicographically ordered n-tuples of nonnegative integers. The first and most significant component is the pulse. The number and interpretation of the remaining components can vary. In this paper, logical times are three-tuples of the form (pulse, pid(m), rank(m)).

    t(e)      logical time assigned to event e
    t_s(m)    logical time assigned to the event of sending m
    t_r(m)    logical time assigned to the event of receiving m
    d(m)      logical distance traveled by m
    pid(m)    unique identifier of the process that issued m
    rank(m)   issue rank of m; rank(m) = i if m is the ith message issued by pid(m)
    →         Lamport's happened-before relation

Fig. 1. Logical time notation.

Isotach logical time extends Lamport's logical time by requiring that the times assigned be consistent with both the → relation and the isotach invariant: Each message m is received exactly d(m) pulses after it is sent.(1) Thus t_s(m) = (i, j, k) implies t_r(m) = (i + d(m), j, k).

(1) For any message m for which d(m) = 0, e.g., a cache hit or a message between collocated processes, the isotach invariant requires t_r(m) = t_s(m). For this reason, we relax Lamport's requirement that a → b ⇒ t(a) < t(b). In isotach logical time, a → b ⇒ t(a) ≤ t(b).

In this paper, we use the routing distance, i.e., the number of switches through which m is routed, as the measure of logical distance, but other measures can be used. Isotach logical time is so named because all messages travel at the same velocity in logical time: one unit of logical distance per pulse of logical time. As we will show in Section 3, the isotach invariant provides a powerful mechanism for coordinating access to shared data. Given the isotach invariant, a process that knows d(m) for a message m that it sends can determine t_r(m) through its choice of t_s(m). Thus processes can use the isotach invariant to control the logical times at which their messages are received.
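The invariant reduces to simple arithmetic on the pulse component. A small sketch, using our own tuple encoding of logical times (illustrative, not from the paper):

    # Isotach invariant: t_r(m) = (i + d(m), j, k) when t_s(m) = (i, j, k).
    def receive_time(send_time, d):
        pulse, pid, rank = send_time
        return (pulse + d, pid, rank)

    # Conversely, a sender that knows d(m) picks the send pulse that makes
    # the message arrive in a chosen target pulse.
    def send_pulse(target_pulse, d):
        return target_pulse - d

    assert receive_time((5, 2, 7), d=3) == (8, 2, 7)
    assert send_pulse(8, d=3) == 5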

2.2 Implementing Isotach Logical Time

An isotach network is a network that implements isotach logical time, i.e., that delivers messages in an order that allows send and receive events to be assigned logical times consistent with the isotach invariant and the → relation. Several networks have been proposed that can, in retrospect, be classified as isotach networks. The alpha-synchronizer network proposed by Awerbuch to execute SIMD graph algorithms on asynchronous networks [3], and a network proposed to support barrier synchronization [5], [17], can be viewed as isotach networks that maintain a logical time system in which each logical time consists of the pulse component only. A network proposed by Ranade [35], [36] as the basis for an efficient concurrent-read, concurrent-write (CRCW) PRAM emulation can be viewed as realizing the more general and powerful n-tuple isotach logical time system.

We give an algorithm, the isonet algorithm, for implementing isotach logical time on networks of arbitrary topology. The algorithm localizes the work of implementing isotach logical time in the network interfaces and switches. It is simple but impractical. Later in this section, we show how to make it practical.

Model

We consider a network with arbitrary topology consisting of a set of interconnected nodes in which each node is either a switch or a host. Adjacent nodes communicate over FIFO links. Each host has a network interface called a switch interface unit (SIU) that connects it to a single switch. A host communicates with its SIU over a FIFO link. All components and interconnecting links are failure-free. To postpone considering the issue of communication deadlock, we initially assume each switch has infinite buffers. A message is issued when the process passes it to the SIU, sent when the SIU passes it to the network, and received when the SIU accepts the message from the network. We assume that the destination SIU delivers messages to the destination process in the order in which it received them.

Isonet Algorithm

Each SIU and switch maintains a logical clock, a counter initialized to zero, that gives the pulse component of the local logical time. The clocks are kept loosely synchronized by the exchange of tokens. A token is a signal sent to mark the end of one pulse of logical time and the beginning of the next. Initially, each switch sends a token wave, i.e., it sends a token on each output, including the outputs to adjacent SIUs, if any. Thereafter, each switch sends token wave i + 1 upon receiving the ith token on each input, including the inputs from adjacent SIUs. Each time a switch sends a token wave, it increments its clock. When an SIU receives a token, the SIU increments its clock and returns the token to the switch. A message is sent (received/routed) in pulse i if it is sent (received/routed) while the local clock is i. An SIU may send no messages or several messages in a pulse.

The SIU tags each message m it sends with the lexicographically ordered pair (pid(m), rank(m)). The messages an SIU sends in a pulse must be sent in pid-rank order. Thus if m and m' are issued by the same process and rank(m) < rank(m'), the SIU must send m before m' if it sends m and m' in the same pulse. However, the rule does not prohibit the SIU from sending m' in an earlier pulse than m. During each pulse, an SIU continuously compares the tag of the next message to be sent with the tag of the next message to be received and handles (sends or receives) the message with the lowest tag. If its network input is empty, the SIU waits for it to be nonempty.

The switches in an isotach network route messages in pid-rank order in each pulse. The switch behaves the same as a switch in a conventional network except that it always selects for routing the message with the lowest tag among the messages at the head of its inputs. As in the case of an SIU, a switch waits if one of its inputs is empty. Pseudocode for the isonet algorithm is given in Fig. 2.

At each switch:
    clock = 0;
    repeat forever
        send a token on each output;
        increment clock;
        route all messages up to next token on each input, in pid-rank order;
        consume a token from each input;

At each SIU:
    clock = 0;
    repeat forever
        accept a token from adjacent switch;
        increment clock;
        return the token to the switch;
        send and receive messages in pid-rank order;

Fig. 2. Pseudocode for isonet algorithm.
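The following Python sketch (our simplification, not the paper's artifact) models one pulse at a single switch: each input is a queue of tagged messages terminated by a token, and the switch always routes the lowest pid-rank tag first. Waiting on empty inputs and routing onto multiple outputs are not modeled.

    TOKEN = ('token',)

    def route_one_pulse(inputs):
        """inputs: one list per input link, each a sequence of
        ('msg', (pid, rank)) entries ending with TOKEN. Returns this
        pulse's messages in the order the switch routes them."""
        heads = [0] * len(inputs)  # index of the next unrouted entry per input
        routed = []
        while True:
            # Inputs whose next entry is a real message, not this pulse's token.
            live = [i for i in range(len(inputs)) if inputs[i][heads[i]] != TOKEN]
            if not live:
                break  # a token has arrived on every input: the pulse is over
            # Route the head message with the lowest (pid, rank) tag.
            i = min(live, key=lambda j: inputs[j][heads[j]][1])
            routed.append(inputs[i][heads[i]])
            heads[i] += 1
        return routed

    in0 = [('msg', (1, 0)), ('msg', (1, 1)), TOKEN]
    in1 = [('msg', (0, 5)), TOKEN]
    assert [m[1] for m in route_one_pulse([in0, in1])] == [(0, 5), (1, 0), (1, 1)]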
To show that the isonet algorithm implements isotach time, we 1) give a rule for assigning logical times to send and receive events, and 2) prove that this assignment is consistent with the isotach invariant and the → relation. Let s_pulse(m) denote the pulse in which m is sent and r_pulse(m) the pulse in which m is received.

Time assignment rule: For each message m, t_s(m) = (s_pulse(m), pid(m), rank(m)) and t_r(m) = (r_pulse(m), pid(m), rank(m)).

THEOREM 1. The isonet algorithm maintains isotach logical time.

PROOF. Since a message routed by a switch in pulse i is routed at the next switch in pulse i + 1, for any message m, r_pulse(m) - s_pulse(m) = d(m). Since t_s(m) and t_r(m) differ only in the pulse component, the isotach invariant holds. Showing that the logical time assignment is consistent with the → relation requires showing 1) for each message m, t_r(m) ≥ t_s(m); and 2) for each SIU, the times assigned to local events are nondecreasing, i.e., local logical time doesn't move backwards. The first part follows directly from the isotach invariant: since for any message m, d(m) ≥ 0, t_r(m) ≥ t_s(m) by the isotach invariant. Since the messages each SIU sends in each pulse are sent in pid-rank order, and since each switch merges the streams of messages arriving on its inputs so that it maintains pid-rank order on each output, a simple inductive proof over the number of switches through which a message travels shows that messages received at an SIU within each pulse are received in pid-rank order. Since an SIU interleaves sends and receives so as to handle all the messages in each pulse in pid-rank order, the times assigned to all events at each SIU are nondecreasing. □

Discussion

An isotach network need not maintain logical clocks at the SIUs and switches or assign logical times to send and receive events. It need only ensure that messages are received in an order consistent with the isotach invariant and the → relation. The logical clocks are auxiliary variables that make the algorithm easy to understand but are not essential to the implementation.

Similarly, the rank component of the pid-rank tags can be eliminated or reduced to a few bits. The rank component is used in assigning logical times to send and receive events, but this role is auxiliary to the algorithm. The rank component is essential to the isonet algorithm only to ensure that messages sent by the same source to the same destination in the same pulse are received in the order in which they were issued and sent. If the network topology ensures FIFO delivery order among messages with the same source and destination, the rank component is unnecessary. If it does not, the size of the rank component can be reduced by limiting the number of messages a process can send to a single destination in a single pulse, or it can be eliminated by requiring that multiple messages sent to the same destination in the same pulse be sent as a single larger message. The bandwidth required for the pid component of tags is not attributable to maintaining isotach time, because messages are typically required to carry the identity of the issuing process for reasons unrelated to maintaining isotach time. Token traffic is the only other source of bandwidth overhead attributable to isotach time. Tokens can be piggybacked on messages at a cost in bandwidth of one bit per message.

Latency and Throughput

The principal cost of the isonet algorithm is the additional waiting implied by the requirement that messages be routed in pid-rank order. One way to reduce this cost is to use ghosts. When a switch sends an operation on one output, it sends a ghost [36] or null message [10] with the same pid-rank tag on each of its other output(s). A switch receiving a ghost knows all further operations it receives on the same input will have a larger tag than the ghost, and this knowledge may allow it to route the operations on its other input(s). Ghosts improve isotach network performance by allowing information about logical time to spread more quickly and are needed in some networks to avoid deadlock. Since a ghost can always be overwritten by a newer message, ghosts take up only unused bandwidth and buffers.

Another way to reduce waiting in the switches is to add internal buffers. In a switch with input queueing only, such as the switch shown in Fig. 3a, a blocked message at the head of an input queue blocks subsequent messages arriving on the same input, a phenomenon known as head-of-line (HOL) blocking. Buffers placed internally, as shown in Fig. 3b, reduce HOL blocking [21], [24]. Our simulation study of isotach networks shows that internal buffers are of significantly more benefit in isotach networks than in conventional networks. One reason for this difference is that a conventional switch of either design can route messages onto both outputs in a single switch cycle, but an isotach switch of the simpler design can route at most one operation per cycle, since we allow the router to make only one pid-rank tag comparison per cycle. An isotach switch with internal buffers can route messages onto both outputs, since it has two merge units operating in parallel. A more fundamental reason for the difference in benefit is that the switch with internal buffers offers increased throughput at the cost of increased delay. Isotach synchronization techniques, because they are more concurrent, e.g., they allow pipelining, allow processes to take better advantage of increases in potential throughput.

Fig. 3. Switch designs: (a) simple switch, (b) switch with internal buffers.
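A sketch of the ghost rule (our formulation, with illustrative names): the tag carried by a ghost is a lower bound on all future traffic on that input, so it can unblock the merge that an empty input would otherwise stall.

    # Head entries are ('msg', tag), ('ghost', tag), or None (input empty).
    def may_route(candidate_tag, other_head):
        """May the switch route a message with candidate_tag while the
        other input's head entry is other_head?"""
        if other_head is None:
            return False              # no lower bound: the switch must wait
        _, tag = other_head
        return candidate_tag < tag    # safe: nothing smaller can still arrive

    assert may_route((2, 4), ('ghost', (3, 0)))  # a ghost unblocks routing
    assert not may_route((2, 4), None)           # an empty input forces waiting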
Deadlock

The requirement that switches route messages in pid-rank order makes deadlock freedom a harder problem in isotach networks than in conventional networks. The isonet algorithm uses infinite buffers at the switches to prevent deadlock. One way to bound the number of buffers is to limit the number of messages sent per SIU per pulse. Although such bounds may be useful for other reasons, the number of buffers required to ensure deadlock freedom using this approach is very large. A better approach is to find routing schemes that ensure deadlock freedom. We have proven deadlock-free routing schemes for equidistant networks using one buffer per switch per physical channel, and for arbitrary topologies using O(d) buffers and virtual channels per switch per physical channel, where d is the network diameter [46].

The isonet algorithm, as modified, is scalable and efficient: It is completely distributed; the bandwidth attributable to maintaining isotach logical time is one bit per message; and the additional work required at each SIU and switch is one tag comparison per message. The main cost of the algorithm is the increased probability that a message will wait within the network. The cost of maintaining isotach logical time relative to the benefits in reduced synchronization overhead is the subject of Section 4.

3 USING ISOTACH LOGICAL TIME

We show how isotach logical time provides the basis for enforcing isochronicity and sequential consistency. Our focus is on SMM computations. Each message is an operation or a response to an operation, and each operation is a read or write access to a variable. Each host is a processing element (PE) or memory module (MM). The destination of each operation is an MM. Since synchronization is not an issue in relation to private variables, we use the term variable to refer to a shared variable.

3.1 Isochronicity and Sequential Consistency

An isochron is a group of operations that can be issued as a batch and that is to be executed indivisibly, i.e., without apparent interleaving with other operations. An execution is isochronous if every isochron appears to be executed indivisibly.

Isochrons and isochronicity are closely related to atomic actions and atomicity. An atomic action is a group of operations that is to be executed indivisibly [28]. An execution is atomic if the operations in each atomic action appear to be executed indivisibly. Isochrons and atomic actions differ in two ways:

1) An atomic action, in many contexts, is a unit of recovery from hardware failure. Where a fault-free system is assumed or fault-freedom is not a major concern, atomicity is synonymous with indivisible execution [27]. As stated previously, we assume a fault-free system.

2) An atomic action can have internal dependences preventing the operations in the atomic action from being issued as a batch. An atomic action with such dependences, e.g., A = B, is a structured atomic action. An atomic action with no such dependences is a flat atomic action.

Since we assume a fault-free system, flat atomic actions and isochrons are the same.

Isochrons are also similar to totally ordered multicasts (TOMs). The main distinctions are 1) TOM algorithms are usually fault-tolerant; and 2) the messages making up a TOM are typically identical, and in some implementations constrained to be identical, whereas the operations making up an isochron are typically different. (Of course, subcomponents of identical messages can be addressed to different recipients, at the cost of increased message size.) We compare implementations at the end of this section.

Isochrons are directly useful in obtaining consistent views, making consistent updates, and maintaining consistency among replicated copies. They are also useful as a building block for implementing structured atomic actions. We describe isotach techniques for executing structured atomic actions later in this section. SMM computations on tightly-coupled multiprocessors execute atomic actions using some form of locking. Drawbacks of locking include overhead for lock maintenance in space, time, and communication bandwidth; unnecessarily restricted access to variables; and the care that must be taken when using locks to avoid deadlock and livelock. In an isotach system, a process can execute flat atomic actions without synchronizing with other processes and can execute structured atomic actions without acquiring locks or otherwise obtaining exclusive access rights to the variables accessed.

The second synchronization property we consider is sequential consistency. An execution is sequentially consistent if operations appear to be executed in an order consistent with the order specified by each individual process's sequential program [26]. This property is so basic it is easily taken for granted, but it is expensive to enforce in non-bus-based systems because stochastic delays within the network can reorder operations. A violation of sequential consistency is shown in Fig. 4. Each circle in the figure represents an operation on a shared variable, A or B. Solid directed lines connect circles representing operations on the same variable and indicate the order in which the operations are executed at memory. Dashed directed lines connect circles representing operations issued by the same process and indicate the order specified by that process's sequential program. In the execution shown, process P1 reads the new value of A (the value written by P2) and subsequently reads the old value of B, in contradiction to P2's program, which specifies that B's value be changed first.
This violation could occur even if both processes sent the operations in the correct order, if one operation (in this case the write to B) is delayed in the network. The conventional solution is to prohibit pipelining of operations, i.e., each process delays issuing each operation until it receives an acknowledgement for its previous operation. However, pipelining is an important way to decrease effective memory latency, so this solution is expensive. The high cost of enforcing sequential consistency has led to extensive exploration of weaker memory consistency models, e.g., [16]. These weaker models are harder to reason about and still impose significant restrictions on pipelining, but make sense given the cost of maintaining sequential consistency in a conventional system. In an isotach system, processes can pipeline memory operations without violating sequential consistency.

Fig. 4. A violation of sequential consistency. P2's program: write(B); write(A). Dashed lines show program order at each process; solid lines show execution order at the MMs.

3.2 Enforcing Isochronicity and Sequential Consistency

We consider an SMM computation on an isotach system. Each process's program consists of a sequence of isochrons interspersed with local computation. Further assumptions:

1) Each process issues operations by sending them through a reliable FIFO channel to the SIU for the PE on which it is executing;
2) Each process marks the last operation in each isochron;
3) Each MM executes operations in the order in which they are received; and
4) Each SIU knows the logical distance to each MM with which it communicates.

Although some applications (e.g., cache coherence) require that isotach logical time be maintained for both operations and responses to operations, the techniques described here require that logical times be assigned only to the send and receive events for operations. Since each MM need only execute operations in the order in which they are received, and need not ensure that its responses are received at any given logical time, an MM requires only a conventional network interface. Thus we use SIU in this section to mean the SIU for a PE.

The key to enforcing isochronicity and sequential consistency in isotach systems is the isotach invariant. Given the isotach invariant and knowledge of the logical distance for each operation it sends, an SIU can control the logical times at which the operations it sends are received by controlling the logical times at which they are sent. In particular, the SIU can ensure that its operations appear to be received at the same time (isochronicity) or in some specified sequence (sequential consistency) by sending operations in accordance with the send order rules (a sketch of the scheduling computation follows the rules):

Isochronicity. Send operations from the same isochron so they are received in the same pulse.

Sequential Consistency. Send each operation so it is received in a pulse no earlier than that of the operation issued before it.
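Here is the sketch promised above (ours; the names are illustrative, not from the paper). It schedules one isochron in a possibly nonequidistant network: pick the earliest pulse every operation can reach, then send each operation d pulses early.

    def schedule_isochron(ops, distance, clock):
        """ops: list of (operation, destination MM).
        distance: logical distance from this SIU to each MM.
        clock: the SIU's current pulse.
        Returns (send_pulse, op) pairs; all ops arrive in one pulse."""
        # Earliest common receive pulse that every operation can still make.
        arrive = clock + 1 + max(distance[dest] for _, dest in ops)
        # Each operation departs d pulses before it must arrive.
        return [(arrive - distance[dest], (op, dest)) for op, dest in ops]

    # P1's isochron from the example below (Fig. 5): A is 2 hops away, B is 3,
    # so the read of B leaves one pulse before the read of A.
    plan = schedule_isochron([('read', 'A'), ('read', 'B')],
                             {'A': 2, 'B': 3}, clock=0)
    assert sorted(plan) == [(1, ('read', 'B')), (2, ('read', 'A'))]

Issuing successive isochrons with nondecreasing arrival pulses would then satisfy the second rule as well.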

Processes need not be aware of how their ordering constraints are enforced. These rules are implemented entirely by the SIU. In a system with an equidistant network, i.e., a network in which every message travels the same logical distance, the rules call for each SIU to send operations in the order in which they were issued and to send all operations from the same isochron in the same pulse. To avoid delaying logical time in the event of a context switch or process failure, the SIU should not send any operation from an isochron until the last operation in the isochron has been issued. In systems with nonequidistant topologies, the rules may require that an SIU send operations in an order that differs from the order in which they were issued.

EXAMPLE. Assume process P1 is required to read and P2 to write variables A and B, and that each process's accesses must be executed indivisibly. In a conventional system, P1 and P2 must obtain locks, either on individual variables or on the section of code through which the variables are accessed. In an isotach system, P1 and P2 can execute their accesses without locks. Consider an isotach system with the nonequidistant topology shown in Fig. 5a. Each circle in Fig. 5a represents a switch and each rectangle an MM or PE. Fig. 5b shows one of many possible correct executions. In accordance with the first send order rule, the SIU for each process sends its operations so that they both arrive in the same pulse, e.g., the SIU for P1 sends the operation on A one pulse after the operation on B, since the routing distance from P1 to A is one less than to B. If, as in the execution shown, all four operations happen to be received in the same pulse, operations on each variable will be received and executed in order by pid. As a result, each isochron will appear to be executed without interleaving with other operations.

The correctness proof for the send order rules is similar to serializability proofs of database schedulers; see, e.g., [33]. We show that for any execution on an isotach system that conforms to the send order rules, there is an equivalent serial execution that is isochronous and sequentially consistent. To establish that two executions of the same program are equivalent, it is sufficient to show that each variable is accessed by the same operations and that operations on the same variable are executed in the same order in both executions.

THEOREM 2. Any isotach system execution E that satisfies the send order rules is isochronous and sequentially consistent.

PROOF. Since no two operations issued by the same process have the same issue rank, operations in E are totally ordered by their logical receive times. Let E_s be the serial execution in which operations are executed in order of their logical receive times in E. We show 1) E_s and E are equivalent; 2) E_s is isochronous; and 3) E_s is sequentially consistent.
It follows that E is isochronous and sequentially consistent.

Consider any two operations op_i and op_j on the same variable V. Without loss of generality, assume op_i is received in E before op_j. Since logical receive times are unique and consistent with the → relation, t_r(op_i) < t_r(op_j). Thus op_i is executed before op_j in E_s. Since operations on V are executed in E in the order in which they are received, op_i is also executed before op_j in E.

Since, for any isochron A, all operations in A are received in the same pulse (first send order rule), issued by the same process, and have consecutive issue ranks, the logical receive time of any operation not in A is outside the interval of logical time delimited by the receive events for operations in A. Thus the operations in A are executed in E_s indivisibly, without interleaving with other operations.

For any pair of operations op_i and op_j issued by the same process where rank(op_i) < rank(op_j), op_i and op_j are either received in the same pulse or op_i is received in an earlier pulse than op_j (second send order rule). In either case, op_i is executed before op_j in E_s. □

    PE   operation   t_s            d   t_r
    1    read A      (i+1, 1, x)    2   (i+3, 1, x)
    1    read B      (i,   1, x+1)  3   (i+3, 1, x+1)
    2    write A     (i,   2, y)    3   (i+3, 2, y)
    2    write B     (i+1, 2, y+1)  2   (i+3, 2, y+1)

Fig. 5. Executing isochrons: (a) network topology, (b) operation send and receive times.

EXAMPLE. Consider again the execution shown in Fig. 5. The space-time diagram in Fig. 6a shows one of the many possible ways in which this execution can occur in physical time.

Fig. 6b shows the serial execution in which operations are executed in order by their logical receive times. Solid lines connecting circles indicate the associated operations are from the same isochron. Since each variable is accessed by the same operations in the same order in both executions, the executions are equivalent. In discussing a similar pair of diagrams, Lamport provides an alternative way to view the equivalence among the executions in the figure [25]: "Without introducing the concept of time into the system (which would require introducing physical clocks), there is no way to decide which of these pictures is the better representation." Intuitively, any vertical line in the diagram can be stretched or compressed in any way that preserves the relative order of the points on the line, and the result will be equivalent to the original execution. Without memory-mapped peripherals or other special instrumentation, the executions are indistinguishable. Since the second of the equivalent executions in Fig. 6 is isochronous, so is the first, even though it interleaves execution of isochrons in physical time.

Fig. 6. Space-time diagrams of equivalent executions: (a) the execution in physical time, (b) the equivalent serial execution.

Isochrons and Totally Ordered Multicasts

Algorithms for totally ordered multicasts (TOMs) ensure total ordering without the benefit of the assumption of fault-freedom. TOMs have been used to maintain consistency among replicated copies of objects in distributed systems, e.g., [4], and could theoretically be used to execute flat atomic actions in SMM computations on multiprocessors, and as a basis for techniques to execute structured atomic actions, but we are not aware of any proposals to use TOMs in this way. Since locks are the standard technique for executing atomic actions, we use them as the basis for comparison in the performance studies described in the next section.

Even if we leave out the fault-tolerant aspects, concentrating only on those aspects of TOM implementations designed to ensure that multicasts are received in a consistent order, we find existing TOM implementations are not scalable or are unsuitable for SMM computations for other reasons. Several messaging layers, e.g., [6], [7], [34], use a logical time system in obtaining an ordering property among multicasts called causal ordering, but the logical time systems these messaging layers use do not offer an efficient, scalable way to obtain total ordering. Existing TOM algorithms introduce some element to obtain the total ordering that makes them unsuitable for use in SMM computations. Most use a serialization point, either in the network itself or in a communication pattern superimposed on the computation, e.g., [1], [11], [20], to order multicasts. These algorithms scale poorly. Multicasts for hierarchically structured networks or for applications with a hierarchical communication pattern scale more successfully, but still have a potential bottleneck at the root. The few TOM algorithms that do not use a centralized serialization point [6], [13], [14], [18] are based on worst-case assumptions or require multiple message rounds. Many TOM implementations, e.g., [7], [15], [34], are group-based. They send each multicast to the full membership of a process group and use the fact that every process sees every message sent to its group in obtaining the total ordering.
Group-based protocols have been shown to work well in distributed message-based model computations, but the process group concept does not translate well to the SMM.

In contrast to existing TOM algorithms, the implementation of isochrons is scalable and light-weight. The underlying implementation of isotach logical time is distributed, and the total ordering among isochrons is achieved without introducing a centralized serialization point. Comparing message complexity is complicated by the question of what cost to assess for token traffic in the isotach network and by the fact that multiple processes can each send multiple isochrons in the same pulse. Since a token is an empty control signal sent between neighboring switches that can be piggybacked on other messages if there is any real traffic, we believe a cost of 0 for tokens is reasonable. Assuming a cost of 0 for tokens, an isochron with k destinations requires one round of k messages, the minimal cost on a point-to-point network for sending k unique messages to k destinations.

Other Applications

Other applications of isotach logical time include the following:

Atomicity. Given our assumption of a fault-free system, isotach techniques enforce atomicity for both flat and structured atomic actions. Structured atomic actions cannot be executed using isochrons consisting of ordinary reads and writes, because internal dependences prevent issuing all of the operations in a batch, but they can be executed using isochrons consisting of split operations [48]. A split operation, in brief, is half of a read or write. Split operations are used in executing reads and writes in two steps, where the first step schedules the access and the second transfers a value. Dividing accesses into two steps allows a process that has incomplete knowledge about an access, due to an unsatisfied dependence, to reserve the context for the access even though the value to be transferred has yet to be determined. A process executes a structured atomic action by first issuing an isochron that contains split operations scheduling all of the accesses required for the atomic action. The process executes the assignment steps for these accesses as it determines the values to be assigned. Execution is atomic because the initial isochron reserves a consistent time slice across the histories of the accessed variables.
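A heavily simplified sketch of the idea (ours; the SIU interface shown, schedule_read, schedule_write, await_value, and fill, is hypothetical, not an API from the paper). It executes the structured atomic action A = B in the two steps just described:

    # Hypothetical SIU interface; the stub exists only to make the sketch run.
    class _Handle:
        def __init__(self, var, memory): self.var, self.memory = var, memory
        def await_value(self): return self.memory[self.var]   # value at the slot
        def fill(self, value): self.memory[self.var] = value  # complete the write

    class StubSIU:
        def __init__(self, memory): self.memory = memory
        def schedule_read(self, var):  return _Handle(var, self.memory)   # split read
        def schedule_write(self, var): return _Handle(var, self.memory)   # split write

    def atomic_assign(siu, dst, src):
        """Execute dst = src atomically using split operations."""
        # Step 1: a single isochron schedules both halves, reserving a
        # consistent time slice across the variables' histories
        # (the reservation itself is not modeled by the stub).
        read_half = siu.schedule_read(src)
        write_half = siu.schedule_write(dst)
        # Step 2: complete the write once the dependence is satisfied.
        write_half.fill(read_half.await_value())

    mem = {'A': 0, 'B': 42}
    atomic_assign(StubSIU(mem), dst='A', src='B')
    assert mem['A'] == 42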

This technique works when the access set, the set of variables accessed by the atomic action, can be determined prior to execution of the atomic action. We have proposed variations on the technique for atomic actions with data-dependent access sets [48]. Isotach techniques for executing structured atomic actions are similar to techniques used in multiversion concurrency control [38], except that isochrons streamline these techniques, eliminating roll-back and permitting an efficient in-cache implementation of variable histories. The resulting techniques are sufficiently light-weight that they can be used in SMM computations.

Cache Coherence. Isotach-based techniques for enforcing atomicity and sequential consistency extend to systems with caches. The resulting cache coherence protocols [44], [47], [48] are more concurrent than existing protocols for non-bus-based systems in that they support multiple concurrent reads and writes of the same block, allow processes to pipeline memory accesses without sacrificing sequential consistency, and allow processes to read and write multiple shared cache blocks atomically without obtaining exclusive rights to the accessed blocks.

Combining. Combining is a technique for maintaining good performance in the presence of multiple concurrent accesses to the same variable [23]. An isotach network need not implement combining, but if it does, it can combine operations not combinable in other networks, resulting in improved concurrency in accessing shared memory [49].

Other applications include support for causal message delivery, causal and totally ordered multicasting, migration mechanisms, checkpointing, wait-free communication primitives, and highly concurrent access to linked data structures. Important special applications include parallel and distributed databases and production systems [41].

4 PERFORMANCE

We studied the performance of equidistant isotach networks using both simulation [40] and analytic modeling [46]. The following is a synopsis; the interested reader is referred to the noted references for more details.

4.1 Simulation Studies

Isotach networks do more work than conventional networks: they route messages and deliver them in an order consistent with the isotach invariant and the → relation. Thus isotach networks can be expected to have less raw power than comparable conventional networks. Raw power, as determined by network throughput and latency, measures the ability of a network to deliver a workload of generic messages, without atomicity or sequencing constraints, to their destinations. Our studies were designed to answer the following questions:

1) How do the raw power of an isotach network and a conventional network relate?
2) Under what conditions, if any, do isotach networks make up for an expected loss in raw power through more efficient support of synchronization?

We report on the results of three simulation studies: raw power, sequential consistency, and a warm-spot traffic model with sequential consistency and atomicity. These studies compared isotach systems to conventional systems that enforce sequential consistency by restricting pipelining and enforce atomicity by locking the variables accessed in each atomic action. Improving the performance of locking in shared memory systems has been the subject of several studies in recent years (e.g., [2], [29], [30]).
Although we did not implement caches, and thus could not implement cache-based locks, we implemented a highly efficient locking protocol, described below, that avoids spinning on locks in an alternative way.

We simulated two physical networks, N1 and N2, each composed of 2 x 2 switches, synchronously clocked, and arranged in a reverse baseline topology. The message transmission protocol was store-and-forward using send-acknowledge. N1 and N2 differed only in the design of the individual switches. N1 used the simple switch shown in Fig. 3a; N2 used the internal-buffer switch shown in Fig. 3b. We simulated isotach and conventional versions of both N1 and N2. Networks I1 and I2 are the isotach versions of N1 and N2, respectively. Similarly, C1 and C2 are the conventional versions of N1 and N2. Each isotach network used the same switch algorithm as the corresponding conventional network except that the isotach switches selected operations for routing in pid-rank order and sent tokens (piggybacked on operations) and ghosts.

Latency results are reported in cycle units, which represent physical time. Cycle-unit duration depends on low-level switch design and technology. We assume switches in C1 and I1 have the same best-case time (the time required for an operation to move through a switch assuming no waiting) of one cycle unit, and switches in C2 and I2 have best-case times of two cycle units. This assumption does not imply that average switch latency is the same in conventional and isotach networks of comparable design. Average switch latency tends to be longer in isotach switches because operations must be routed in pid-rank order. Throughout the studies, throughput is the average number of operations arriving at memory, per MM, per cycle.

We analyzed C1, C2, I1, and I2 under a variety of synthetic workloads, which ranged from one with no constraints on operation execution order to ones with combinations of atomicity, sequencing, and data dependence constraints.

Raw Power Study

We measured the raw power of each of the four networks, i.e., throughput and delay assuming no atomicity or sequencing constraints. The workload consisted of generic operations: reads and writes were not distinguished. We assumed memory cycle time was the same as switch cycle time; each MM could consume one operation per cycle. Although the workload had no sequencing constraints, there was no way to relax enforcement of sequential consistency in I1 and I2. Thus we actually compared isotach networks that enforce sequential consistency with conventional networks that did not.

Delay was measured from the time an operation was generated until it was delivered to the MM. As expected, C1 and C2 handled higher load with less delay than I1 and I2, respectively. Conventional networks have more raw power because isotach network switches are subject to the additional constraint that they must route operations in isotach time order. Also of significance: C2 and I2 offered higher throughput at the cost of higher latency than C1 and I1. In all of the remaining studies, C1 outperformed C2 at each data point. C2 performed poorly relative to C1 because its latency was higher than C1's. The increased potential throughput offered by C2 could not be exploited when locks and restrictions on pipelining are used for synchronization. Henceforth we report results only for C1, I1, and I2.

Sequential Consistency Study

We compared networks under a workload requiring sequential consistency but no atomicity. Here and in the warm-spot study, traffic in both the forward (PE → MM) and reverse (MM → PE) directions was simulated. The reverse networks were conventional networks: I2 used C2, and I1 and C1 used C1. In each cycle, each MM (PE) could receive one operation (response) from the network and issue one response (operation).

Fig. 7 shows the high cost of enforcing sequential consistency in a conventional network. In I1 and I2, operations could be pipelined without risk of violating sequential consistency. We assumed no data dependences constrained the issuing of operations. Each PE issued a new operation whenever its output buffer was empty. C1 enforced sequential consistency by limiting each PE to one outstanding operation at a time. Throughput for I1 and I2, shown in Fig. 7, was the same as in the raw power studies, since the isotach networks enforced sequential consistency in both cases, but was much lower for C1. I1 and I2 outperformed C1 in terms of throughput, but with longer delay, partly because pipelining caused I1 and I2 to be more heavily loaded.

We would expect data dependences among operations issued by the same process to diminish throughput in isotach networks because they prevent the networks from taking full advantage of pipelining. When we set the data dependence distance to one, meaning each process needs the result of its current operation before it can issue its next operation, the throughput of the isotach networks was low. As the data dependence distance increased, the throughput of I1 and I2 climbed sharply until network saturation. Delay also increased as the networks became more heavily loaded. Because its latency is longer and its saturation point higher, we observed that I2 requires a larger data dependence distance than I1 to take full advantage of its ability to pipeline. I2 performed less well than I1 at small dependence distances because I2 cannot take advantage of its higher throughput and is hurt by a higher response time due to higher latency. Since C1 cannot pipeline regardless of the data dependence distance, throughput and delay for C1 do not vary with the data dependence distance. We did observe that the throughputs of I1 and I2 began to exceed C1's at a data dependence distance of only two.

Warm-Spot Traffic Model Study

We combined a warm-spot traffic model with a workload with atomicity and sequencing constraints. A warm-spot model has several warm variables instead of a single hot variable and is more realistic than either the uniform or hot-spot traffic models.
The warm-traffic model we used was based on the standard 80/20 rule (see, e.g., [22]), modified to lessen contention for the warmest variables [40]. Atomic actions were assumed to be flat and independent, i.e., there were no data dependences among or within atomic actions. I1 and I2 enforced atomicity by issuing all operations from the same atomic action in the same pulse. C1 enforced atomicity using locks: A lock was associated with each variable, and a process acquired a lock for each variable accessed by the atomic action before releasing any of the atomic action's locks. To avoid deadlock, each process acquired the locks it needed for each atomic action in a predetermined linear order.

Our lock algorithms were generously efficient. Instead of spinning, lock requests queued at memory. Reads used nonexclusive locks and comprised 75% of the workload. Instead of sending a lock request for an operation, a process sent the operation itself. Each operation implicitly carried a request for a lock of the type indicated by the operation. When it received an operation, the MM enqueued it. If no conflicting operation was enqueued ahead of it, the operation was executed and a response returned to the source PE. A PE knew it had acquired a lock when it received the response.

Fig. 7. Effect of enforcing sequential consistency (throughput vs. number of stages, with and without enforcement). Curves: C1, I1, I2.

Eliminating explicit lock requesting and granting reduced traffic and eliminated roundtrip PE/MM message delays when a lock was granted. When the PE had acquired all the locks for the atomic action, it sent a lock release for each lock it held.

Execution of our warm-spot model was sequentially consistent. C1 enforced sequential consistency by limiting the number of outstanding atomic actions per PE to one. In I1 and I2, a PE could have any number of active atomic actions without sacrificing sequential consistency.

Fig. 8 shows the effect of varying the average size (AA-size) of atomic actions under a warm-spot model. Delay is delay per operation, i.e., the average number of cycles from the time an atomic action is generated until it is completed, divided by AA-size. For C1, the time required to release locks is not included in delay. A process in C1 generated a new atomic action as soon as it had received all the responses due in its current atomic action. In I1 and I2, a process generated a new atomic action whenever its output buffer was empty.

The graphs show C1 performed poorly for large atomic actions. As the atomic action size increased, C1's throughput dropped and its delay rose steeply. C1's performance was significantly worse than in a similar study we conducted that assumed a uniform access model, because when the probability of conflicting operations is higher, locks have more impact on performance. As atomic actions grow larger, not only does the number of operations contending for access grow, but also the length of time each lock is held. In I1 and I2, by contrast, delay per operation actually decreased as the size of atomic actions increased, since increasing size meant more concurrency. As can be seen in Fig. 8, when atomic actions are large and traffic is nonuniform, C1's performance is very poor relative to I1 and I2. For large atomic actions (AA-size = 16), throughput for I2 is about 78 times higher than C1's and its delay about 24 times lower. We observed in other studies that hot-spot traffic results show the same pattern.

The warm-spot study shows that isotach systems perform well in relation to conventional systems under a workload with atomicity and sequencing constraints. Conventional systems cannot take full advantage of the greater raw power of their networks because the rate at which operations can be issued is limited by locks and restrictions on pipelining. Isotach system performance, by contrast, is limited only by the raw power of the networks. Though the raw power of the isotach networks is somewhat lower than that of the conventional networks, the synchronization support they provide allows them to outperform the conventional systems by a wide margin.

Finally, we analyzed conventional and isotach systems for a workload of structured atomic actions in which data dependences existed both within and among atomic actions.
I1 and I2 continued to outperform C1, though by a smaller margin, since the data dependences prevented the isotach networks from taking advantage of their ability to pipeline operations.

Simulation Study Summary

Our study shows conventional networks have higher raw power than isotach networks, but with atomicity and sequencing constraints, isotach networks outperform conventional networks, in some cases by a factor of 10 or more. Conventional systems perform relatively poorly in spite of their greater raw power because they enforce atomicity and sequencing constraints by restricting throughput and imposing delays. Isotach networks perform best in relation to conventional networks when the distance between data-dependent operations within the same process's program is large enough to permit pipelining, atomic actions are large, and contention for shared variables is high. When the workload has two or more of these characteristics, an isotach network performs markedly better than a conventional network. These results indicate that isotach-based synchronization techniques warrant further study. We have extended our simulation to include isotach-based cache coherence protocols.

4.2 Analytic Model

Analytic models were developed for networks C1, C2, I1, and I2. That work led to the discovery of novel modeling methods that will be presented in a future paper. We outline the approach and results we obtained here. A detailed presentation appears in [46]. We extended a mean value analysis technique first presented by Jenq [19] for banyan-like MINs. Our extensions enabled the modeling of 1) routing dependencies, 2) a measure of bandwidth we have called information flow, and 3) message types.

Fig. 8. Varying the atomic action size under warm-spot traffic (throughput and delay vs. AA-size). Curves: C1, I1, I2.

Routing dependencies exist in isotach networks in the form of constraints requiring routing in pid-rank order (Jenq did not include this kind of dependence). Information flow is a measure of the bandwidth of a given stage of a MIN. Including information flow allowed us to introduce independence among network stages, gaining tractability with little sacrifice in accuracy. Finally, certain optimizations that would be made in an implementation of an isotach network, for example, piggy-backing of tokens on messages of intent and ghost messages, needed to be reflected in the analytic model. We succeeded in capturing the effects of these message types.

For networks I1 and I2, the analytic model predicted throughput with no worse than 12% deviation from the simulation results presented earlier. The model tended to be optimistic, which we expected because it didn't account for delays that variations in uniform distribution of messages would create in the network. Those delays were captured in the simulation. Prediction of delay was even better, with a maximum error of 5%. In both cases, the results were for networks ranging in depth from two to 10 stages.

5 CONCLUSIONS AND ONGOING WORK

We described a new approach to enforcing ordering constraints in a fault-free system in which processes communicate over a point-to-point network. This problem is difficult, even with the assumption of fault-freedom. The conventional solution uses locks and restrictions on pipelining. Solutions to the harder problem, where fault-freedom is not assumed, have been found useful in distributed computations, but these solutions are not suitable for use in SMM computations. We have described an approach that does not require locks or restrictions on pipelining and that is scalable and lightweight enough for use in SMM computations.

The approach is based on isotach logical time, a logical time system that extends Lamport's system by adding an invariant that relates communication time to communication distance. This invariant is a powerful coordinating mechanism that allows the sender to control the order in which its messages are received. We defined isotach logical time, gave an efficient distributed algorithm for implementing it, and showed how to use isotach logical time to enforce ordering constraints. The work of implementing isotach logical time and enforcing the ordering constraints can be localized in the network interfaces and switches. We defined a new class of networks called isotach networks that implement isotach logical time.

Implementing isotach logical time, of course, is not free, and we are well aware of the controversy ([8], [12], [39]) over implementing synchronization services at a low level in end-to-end systems. Isotach networks are justifiable on a cost/benefit basis: the guarantees they offer can be implemented cheaply, yet are sufficiently powerful to enforce fundamental synchronization properties. We reported the results of analytic and empirical studies that show that isotach synchronization techniques are far more efficient than conventional techniques.

Ongoing work. We are building an isotach network prototype and studying the use of isotach techniques in implementing distributed shared memory and rule-based systems. One of our goals in building the prototype is to demonstrate that isotach techniques can work in distributed systems and can tolerate some types of faults.
Implementing isotach logical time, of course, is not free, and we are well aware of the controversy ([8], [12], [39]) over implementing synchronization services at a low level in end-to-end systems. Isotach networks are justifiable on a cost/benefit basis: the guarantees they offer can be implemented cheaply, yet are sufficiently powerful to enforce fundamental synchronization properties. We reported the results of analytic and empirical studies showing that isotach synchronization techniques are far more efficient than conventional techniques.

Ongoing work. We are building an isotach network prototype and studying the use of isotach techniques in implementing distributed shared memory and rule-based systems. One of our goals in building the prototype is to demonstrate that isotach techniques can work in distributed systems and can tolerate some types of faults. Preliminary explorations show that isotach networks can function under a number of different failure models, including fail-stop models. Our approach to ensuring a reliable implementation of the isochron on the prototype system will be similar to the approach reported in [45], except that we intend to use a distributed protocol in place of a centralized credit manager for flow control (a generic credit scheme is sketched below). Other goals in building the prototype include showing that implementing isotach logical time does not significantly interfere with network traffic that does not use isotach logical time, and that isotach logical time can be implemented without blocking messages within the network. The former is important in addressing the end-to-end controversy referred to above. The latter is important in allowing the use of off-the-shelf networks, and can be accomplished by pushing most of the work that the switches do in the isonet algorithm onto the receiving network interface. Our target platform for the prototype is a nonequidistant cluster of personal computers (PCs) interconnected with Myrinet [9].
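The flow-control mechanism mentioned above can be illustrated with a generic credit scheme (ours, for illustration; this paper specifies neither the centralized credit manager of [45] nor the planned distributed protocol). The receiver grants one credit per free buffer; the sender spends a credit per message and stalls at zero; credits return as buffers drain.

```python
class CreditLink:
    # Generic credit-based flow control on one sender/receiver link.
    # Illustration only; not the prototype's protocol.
    def __init__(self, buffers):
        self.credits = buffers   # one credit per free receive buffer
        self.buffered = []       # messages awaiting the receiver

    def send(self, msg):
        # Sender side: spend a credit, or stall (return False).
        if self.credits == 0:
            return False
        self.credits -= 1
        self.buffered.append(msg)
        return True

    def consume(self):
        # Receiver side: drain one buffer and hand back its credit.
        msg = self.buffered.pop(0)
        self.credits += 1
        return msg

link = CreditLink(buffers=2)
assert link.send("m1") and link.send("m2")
assert not link.send("m3")   # sender stalls: no credits left
link.consume()               # receiver frees a buffer
assert link.send("m3")       # sender may proceed again
```

The prototype will distribute this bookkeeping among the participants rather than centralizing it in a single credit manager.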

ACKNOWLEDGMENTS

This work was supported in part by DARPA Grant DABT63-95-C-0081, U.S. National Science Foundation Grant CCR, and DUSA(OR) Simtech funding (through the National Ground Intelligence Center).

REFERENCES

[1] Y. Amir, L.E. Moser, P.M. Melliar-Smith, D.A. Agarwal, and P. Ciarfella, "The Totem Single-Ring Ordering and Membership Protocol," ACM Trans. Computer Systems, vol. 13, no. 4, Nov. 1995.
[2] T.E. Anderson, "The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors," IEEE Trans. Parallel and Distributed Systems, vol. 1, no. 1, pp. 6-16, Jan. 1990.
[3] B. Awerbuch, "Complexity of Network Synchronization," J. ACM, vol. 32, no. 4, Oct. 1985.
[4] H.E. Bal and A.S. Tanenbaum, "Distributed Programming with Shared Data," Proc. 1988 Int'l Conf. Computer Languages, Miami, Fla., Oct. 1988.
[5] Y. Birk, P.B. Gibbons, J.L.C. Sanz, and D. Soroker, "A Simple Mechanism for Efficient Barrier Synchronization in MIMD Machines," Technical Report RJ 7078, IBM, Oct. 1989.
[6] K.P. Birman and T.A. Joseph, "Reliable Communication in the Presence of Failures," ACM Trans. Computer Systems, vol. 5, no. 1, pp. 47-76, Feb. 1987.
[7] K.P. Birman, A. Schiper, and P. Stephenson, "Lightweight Causal and Atomic Group Multicast," ACM Trans. Computer Systems, Aug. 1991.
[8] K. Birman, "A Response to Cheriton and Skeen's Criticism of Causal and Totally Ordered Communication," Operating Systems Review, vol. 28, no. 1, Jan. 1994.
[9] N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, C.L. Seitz, J.N. Seizovic, and W. Su, "Myrinet: A Gigabit-per-Second Local-Area Network," IEEE Micro, vol. 15, no. 1, pp. 29-36, Feb. 1995.
[10] K. Chandy and J. Misra, "Distributed Simulation: A Case Study in Design and Verification of Distributed Programs," IEEE Trans. Software Eng., vol. 5, no. 5, Sept. 1979.
[11] J. Chang and N.F. Maxemchuk, "Reliable Broadcast Protocols," ACM Trans. Computer Systems, vol. 2, no. 3, Aug. 1984.
[12] D.R. Cheriton and D. Skeen, "Understanding the Limitations of Causally and Totally Ordered Communication," Proc. 14th ACM Symp. Operating Systems Principles, Dec. 1993.
[13] F. Cristian, "Synchronous Atomic Broadcast for Redundant Broadcast Channels," Technical Report RJ 7203 (67682), IBM, Dec. 1989.
[14] M. Dasser, "TOMP: A Total Ordering Multicast Protocol," Operating Systems Review, vol. 26, no. 1, Jan. 1992.
[15] H. Garcia-Molina and A. Spauster, "Ordered and Reliable Multicast Communication," ACM Trans. Computer Systems, vol. 9, no. 3, Aug. 1991.
[16] K. Gharachorloo et al., "Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors," Proc. Fourth ASPLOS, Apr. 1991.
[17] P.B. Gibbons, "The Asynchronous PRAM: A Semi-Synchronous Model for Shared Memory MIMD Machines," ICSI, Berkeley, Calif., Dec. 1989.
[18] K.J. Goldman, "Highly Concurrent Logically Synchronous Multicast," Distributed Computing, Springer-Verlag, Berlin/Heidelberg/New York.
[19] Y. Jenq, "Performance Analysis of a Packet Switch Based on Single-Buffered Banyan Network," IEEE J. Selected Areas Comm., vol. 1, no. 6, pp. 1,014-1,021, Dec. 1983.
[20] M.F. Kaashoek, A.S. Tanenbaum, S.F. Hummel, and H.E. Bal, "An Efficient Reliable Broadcast Protocol," Operating Systems Review, vol. 23, no. 4, pp. 5-19, Oct. 1989.
[21] M.J. Karol, M.G. Hluchyj, and S.P. Morgan, "Input Versus Output Queueing on a Space-Division Packet Switch," IEEE Trans. Comm., vol. 35, no. 12, pp. 1,347-1,356, Dec. 1987.
[22] D.E. Knuth, The Art of Computer Programming, vol. 3: Sorting and Searching. Addison-Wesley, 1973.
[23] C.P. Kruskal, L. Rudolph, and M. Snir, "Efficient Synchronization on Multiprocessors with Shared Memory," ACM Trans. Programming Languages and Systems, vol. 10, no. 4, Oct. 1988.
[24] M. Kumar and J.R. Jump, "Performance Enhancement of Buffered Delta Networks Using Crossbar Switches and Multiple Links," J. Parallel and Distributed Computing, vol. 1, 1984.
[25] L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System," Comm. ACM, vol. 21, no. 7, pp. 558-565, July 1978.
[26] L. Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Trans. Computers, vol. 28, no. 9, pp. 690-691, Sept. 1979.
[27] L. Lamport, "On Interprocess Communication," Distributed Computing, vol. 1, no. 2, Apr. 1986.
[28] D.B. Lomet, "Process Structuring, Synchronization, and Recovery Using Atomic Actions," SIGPLAN Notices, vol. 12, no. 3, pp. 128-137, Mar. 1977.
[29] J.M. Mellor-Crummey and M.L. Scott, "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors," ACM Trans. Computer Systems, vol. 9, no. 1, pp. 21-65, Feb. 1991.
[30] M.M. Michael and M.L. Scott, "Scalability of Atomic Primitives on Distributed Shared Memory Multiprocessors," Technical Report 528, Univ. of Rochester, Computer Science Dept., July 1994.
[31] S. Owicki and D. Gries, "An Axiomatic Proof Technique for Parallel Programs I," Acta Informatica, vol. 6, pp. 319-340, 1976.
[32] S. Owicki and L. Lamport, "Proving Liveness Properties of Concurrent Programs," ACM Trans. Programming Languages and Systems, vol. 4, no. 3, July 1982.
[33] C. Papadimitriou, Database Concurrency Control. Computer Science Press, 1986.
[34] L.L. Peterson, N.C. Bucholz, and R.D. Schlichting, "Preserving and Using Context Information in Interprocess Communication," ACM Trans. Computer Systems, vol. 7, no. 3, Aug. 1989.
[35] A.G. Ranade, S.N. Bhatt, and S.L. Johnsson, "The Fluent Abstract Machine," Technical Report 573, Yale Univ., Dept. of Computer Science, Jan. 1988.
[36] A.G. Ranade, "How to Emulate Shared Memory," Proc. IEEE Ann. Symp. Foundations of Computer Science, Los Angeles, 1987.
[37] M. Raynal and M. Singhal, "Logical Time: Capturing Causality in Distributed Systems," Computer, vol. 29, no. 2, pp. 49-56, Feb. 1996.
[38] D. Reed, "Implementing Atomic Actions on Decentralized Data," ACM Trans. Computer Systems, vol. 1, no. 1, pp. 3-23, Feb. 1983.
[39] R. van Renesse, "Why Bother With CATOCS?" Operating Systems Review, vol. 28, no. 1, Jan. 1994.
[40] P.F. Reynolds Jr., C. Williams, and R.R. Wagner Jr., "Empirical Analysis of Isotach Networks," Technical Report 92-19, Univ. of Virginia, Dept. of Computer Science, June 1992.
[41] P.F. Reynolds Jr. and C. Williams, "Asynchronous Rule-Based Systems in CGF," Proc. Fifth Conf. Computer Generated Forces and Behavioral Representation, Orlando, Fla., May 1995.
[42] J.H. Saltzer, D.P. Reed, and D.D. Clark, "End-To-End Arguments in System Design," ACM Trans. Computer Systems, vol. 2, no. 4, pp. 277-288, Nov. 1984.
[43] C. Scheurich and M. Dubois, "Correct Memory Operation of Cache-Based Multiprocessors," Proc. 14th ISCA, June 1987.
[44] B.R. de Supinski, C. Williams, and P.F. Reynolds Jr., "Performance Evaluation of the Late Delta Cache Coherence Protocol," Technical Report CS-96-05, Univ. of Virginia, Dept. of Computer Science, Mar. 1996.
[45] K. Verstoep, K. Langendoen, and H.E. Bal, "Efficient Reliable Multicast on Myrinet," Technical Report IR-399, Vrije Universiteit, Amsterdam, Jan. 1996.
[46] R.R. Wagner Jr., "On the Implementation of Local Synchrony," Technical Report CS-93-33, Univ. of Virginia, Dept. of Computer Science, June 1993.
[47] C. Williams and P.F. Reynolds Jr., "Delta-Cache Protocols: A New Class of Cache Coherence Protocols," Technical Report 90-34, Univ. of Virginia, Dept. of Computer Science, Dec. 1990.
[48] C.C. Williams, "Concurrency Control in Asynchronous Computations," PhD thesis, No. 9301, Univ. of Virginia, 1993.
[49] C.C. Williams and P.F. Reynolds Jr., "Combining Atomic Actions," J. Parallel and Distributed Computing, Feb. 1995.

Paul F. Reynolds, Jr. received the PhD degree from the University of Texas at Arlington in 1979. He is currently an associate professor of computer science at the University of Virginia. He has published widely in the area of parallel computation, specifically in parallel simulation and parallel language and algorithm design. He has served on a number of national committees and advisory groups as an expert on parallel computation and, more specifically, as an expert on parallel simulation. He has served as a consultant to numerous corporations and government agencies in the systems and simulation areas.

Craig Williams received her BA degree in economics from the College of William and Mary in 1973, a JD from Columbia Law School in 1975, and her master's degree in computer science. Her PhD was awarded by the University of Virginia. She is now a senior scientist in the Computer Science Department at the University of Virginia. Her research interests include parallel data structures and both the hardware and software aspects of parallel computation.

Raymond R. Wagner, Jr. received his BA in computer science from Dartmouth College in 1985, and his MS and PhD degrees in computer science from the University of Virginia. He is currently a senior associate and partner in The Vermont Software Company, Inc., in Wells River, Vermont.
