MAD. Models & Algorithms for Distributed systems -- 2/5 -- download slides at http://people.rennes.inria.fr/eric.fabre/

1 MAD: Models & Algorithms for Distributed systems -- 2/5 -- download slides at http://people.rennes.inria.fr/eric.fabre/

2 Today
Runs/executions of a distributed system are partial orders of events.
We introduce
- logical clocks (Lamport, Fidge-Mattern)
- event structures
- distributed algorithms to build them
Then explore applications to
- money counting in a distributed transactional system
- the construction of snapshots

3 Runs of distributed systems
Context
- We assume processes have UIDs $\{1, \dots, n\}$.
- So far, we had an undirected interaction graph of processes $G = (V, E)$, where $V = \{1, \dots, n\}$.
- Processes are asynchronous (no global clock) and don't fail; messages eventually reach their destination.
We now examine a run of such a distributed system, with local events in each process $P_i$, and message exchanges from $P_i$ to $P_j$ (where allowed).
[Figure: chronogram of a run, with events a b c d e f g on $P_1$ and events h i j k l m n o p on the other processes]

4 A chronogram view
[Figure: the chronogram of slide 3, drawn against a horizontal time axis t]
- e = local event at $P_1$
- a = sending of a message at $P_1$, i = reception of this message at $P_2$
- channels need not be FIFO: see $j \to g$ and $k \to f$
- in each process, events are totally ordered (local clock)
- the physical time can be seen as given by vertical slices
- no one knows this physical time (we only know it exists, up to relativity!)

5 A chronogram view (same bullet points as slide 4)
[Figure: the chronogram projected onto the unknown physical time axis, where the events appear in the order a m h b i n c d j e k o p f l g]

6 A chronogram view (same bullet points as slide 4)
[Figure: the same run with events shifted along their axes; on the unknown physical time axis they now appear in the order a m h b i c n d e j o k p f l g]

7 A chronogram view
[Figure: the chronogram and its projection a m h b i c n d e j o k p f l g onto the unknown physical time axis]
- events can slide on their axis, while preserving their ordering within each process and the emission/reception ordering
- this yields another possible (total) ordering of events in physical time, resulting in the same final global state of the system, but going through different intermediate global states
- this advocates the modeling of a run as a partial order of events

8 Questions to address
[Figure: the chronogram and its projection onto the unknown physical time axis]
- Q: how to (formally) define and handle a run as a partial order of events, rather than a sequence?
- Q: the physical time is lost, but can we track/compute this partial order?
- Q: can we compute one (all) possible total ordering(s)?
- Q: what are the possible intermediate (global) states along a run?

9 Event structures
Warning
Runs of distributed systems can be modeled in numerous (quite often uselessly complex) manners:
- one can start from communicating automata (Lynch)
- or more simply from processes with local actions, emissions and their matching receptions (Lamport, Fidge, Raynal)
- or even more simply from partially ordered events (Mattern, Winskel, MacMillan, Nielsen, Engelfriet)
- this goes with simple to more complex proofs for similar results!
(Simple notion of) event structure $\mathcal{E} = (E, \to)$
- it is a finite DAG (directed acyclic graph)
- events are partitioned into n subsets (processes): $E = E_1 \uplus \dots \uplus E_n$
- events in each $E_i$ are totally ordered (local clock)
- an event $e \in E_i$ has at most one direct successor/predecessor $e' \notin E_i$ (to model emission/reception of messages)

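10 [Figure: event structure with events a b c d on $P_1$, e f g h i on $P_2$ and k l m n on $P_3$]
- partial order on events: $e \prec e'$ iff there is a path $e \to \dots \to e'$ in the DAG, i.e. $\prec$ is the smallest partial order (= transitive + irreflexive) relation generated by $\to$
- example: $a \to b \to h \to i \to m \to n \Rightarrow a \prec n$
- past of an event e = predecessors of e for $\prec$
- future of an event e = successors of e for $\prec$
- concurrency: $e \perp e'$ iff $e \not\prec e'$ and $e' \not\prec e$
- examples: $a \perp k$, $h \perp c$, $b \perp l$, $b \not\perp m$, $c \perp m$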

11 [Figure: the event structure of slide 10, highlighting the past of h, the future of h, and the events concurrent with h]
The definitions of $\prec$, past, future and concurrency ($\perp$) are those of slide 10.
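
The definitions of precedence, past, future and concurrency can be checked mechanically on a small run. The sketch below is only an illustration: it assumes the three local orders of the example (a b c d, e f g h i, k l m n) and encodes just the two messages $b \to h$ and $i \to m$ that appear in the chain quoted on slide 10; the actual figure may contain more messages, and all function names are hypothetical.

```python
from itertools import product

# Assumed encoding of the example run (the real figure may have more messages).
PROCESSES = ["abcd", "efghi", "klmn"]    # local total orders
MESSAGES = [("b", "h"), ("i", "m")]      # emission -> reception

def direct_arcs(processes, messages):
    """Arcs of the DAG: consecutive local events plus message arcs."""
    arcs = set(messages)
    for events in processes:
        arcs.update(zip(events, events[1:]))
    return arcs

def precedence(processes, messages):
    """Transitive closure of the arcs, i.e. the strict partial order."""
    events = "".join(processes)
    prec = direct_arcs(processes, messages)
    # Warshall-style closure: the intermediate event k is the outermost loop.
    for k, i, j in product(events, repeat=3):
        if (i, k) in prec and (k, j) in prec:
            prec.add((i, j))
    return prec

PREC = precedence(PROCESSES, MESSAGES)

def concurrent(e1, e2):
    """Two distinct events are concurrent when neither precedes the other."""
    return e1 != e2 and (e1, e2) not in PREC and (e2, e1) not in PREC

def past(e):
    return {x for (x, y) in PREC if y == e}

def future(e):
    return {y for (x, y) in PREC if x == e}
```

Under these assumptions, `concurrent("a", "k")`, `concurrent("h", "c")`, `concurrent("b", "l")` and `concurrent("c", "m")` hold while `concurrent("b", "m")` does not, which agrees with the examples listed on slide 10.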

12 [Figure: the event structure of slide 10 with two curves drawn across the processes, one labeled "cut" and one labeled "not cut"]
- A cut in $\mathcal{E} = (E, \to)$ is a subset $E' \subseteq E$ closed for the precedence relation: $\forall e, e' \in E,\ e' \in E' \wedge e \prec e' \Rightarrow e \in E'$
- A cut can be seen as a line/curve, cutting all threads, thus defining a past ($E'$) and a future ($E \setminus E'$). The line represents a possible present.
- Interpretation: a cut identifies a possible global state of the distributed process, that could be characterized by the current state of each process, and the messages already sent but not yet received ("in flight" messages).
- Remark: it is generally not possible to have cuts with no pending messages, i.e. that do not separate emission from reception of a message.
- Exercise: build a counter-example
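
As a rough illustration of the cut definition, the sketch below tests whether a subset of events is closed under predecessors and lists the messages that cross the cut. The run is an assumption (same local orders as the example, with only the messages $b \to h$ and $i \to m$), and `is_cut` and `in_flight_messages` are hypothetical helper names.

```python
# Assumed encoding of the example run (the real figure may have more messages).
PROCESSES = ["abcd", "efghi", "klmn"]
MESSAGES = [("b", "h"), ("i", "m")]      # emission -> reception
ARCS = set(MESSAGES) | {
    (p[k], p[k + 1]) for p in PROCESSES for k in range(len(p) - 1)
}

def is_cut(subset, arcs):
    """True iff `subset` is closed under predecessors.

    Checking the direct arcs is enough: closure under single arcs
    implies closure under the whole precedence relation.
    """
    return all(src in subset for (src, dst) in arcs if dst in subset)

def in_flight_messages(subset, messages):
    """Messages emitted inside the cut and received outside of it."""
    return [(s, r) for (s, r) in messages if s in subset and r not in subset]

GOOD = set("abefgk")    # contains b but not h: message b -> h is pending
BAD = set("aefgh")      # contains h without its predecessor b
```

Here `GOOD` is a cut with one in-flight message, while `BAD` is not a cut because it contains a reception without the matching emission.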

13 [Figure: the event structure of slide 10]
- A linear extension of $\prec$ is a total order $<$ in E preserving $\prec$: $\forall e, e' \in E,\ e \prec e' \Rightarrow e < e'$
- Obtained by recursively adding arcs $e \to e'$ for some pair of concurrent events $e \perp e'$, then completing by transitivity, until $\prec$ becomes a total order
- Thm: any linear extension $<$ of $\prec$ is a possible execution order (in physical time) for the events present in the event structure $\mathcal{E} = (E, \to)$
- Proof: trivial, as message transit times are unknown [See also later]
- Visually: how to build all such orderings? Imagine events are pearls on a necklace, made of n threads, one per process; pearls are free to move along each thread, but edges $e \to e'$ can never point back
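
The "pearls on a necklace" picture amounts to enumerating the topological orders of the DAG. The following sketch does this by brute force on a deliberately tiny, hypothetical run (two processes with events a b and h i, and one message $a \to i$); it is meant for illustration only, since the number of linear extensions grows very quickly with the number of events.

```python
def linear_extensions(events, arcs):
    """Yield every total order of `events` compatible with `arcs`."""
    def extend(prefix, remaining):
        if not remaining:
            yield "".join(prefix)
            return
        for e in sorted(remaining):
            # e may come next once all its direct predecessors are placed
            if all(src in prefix for (src, dst) in arcs if dst == e):
                yield from extend(prefix + [e], remaining - {e})
    yield from extend([], set(events))

# Hypothetical run: P1 = a b, P2 = h i, one message a -> i.
SMALL_ARCS = {("a", "b"), ("h", "i"), ("a", "i")}
ORDERS = list(linear_extensions("abhi", SMALL_ARCS))
```

Of the six interleavings of the two local orders, only `hiab` is excluded, because it would place the reception i before the emission a.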

14 Remarks
- We will see later how to encode partial orders in convenient data structures, in order to compute with them.
- In modern computer science, event structures are studied per se. They are simply event sets E (possibly infinite) enriched with several relations like
- precedence, or causality
- conflict: different possible outcomes/futures
- alternative causes/predecessors of events
- asymmetric causality (e can appear concurrently or after e', but not before)
- etc.

15 Logical clock
Historically
- introduced by Lamport in 78
- was one of the contributions motivating the Turing award
- easy, pleasant to read, applications described, a little frustrating on formalization and proofs
- read it!
Objective
- build one possible total ordering of events, by attaching a logical time to them
- do this with a distributed asynchronous algorithm

16 Objective:
- tag every event e with a logical clock value C(e), taken in some totally ordered set
- these ticks should reflect one linear extension of $\prec$ in run $\mathcal{E} = (E, \to)$: $\forall e, e' \in E,\ e \prec e' \Rightarrow C(e) < C(e')$
- notice that it is sufficient to guarantee $\forall e, e' \in E,\ e \to e' \Rightarrow C(e) < C(e')$, and to make sure that C defines a total order
- we want to compute these tags with a distributed algorithm

17 Algorithm:
- if $e \in E_i$ is a new event in process $P_i$: if $\exists e' \in E_i,\ e' \to e$ then $C(e) = C(e') + 1$, otherwise $C(e) = 1$
- if $e \in E_i$ is the sending of some message m from $P_i$ to $P_j$: send $C(m) = C(e)$ with message m (piggybacking)
- if $e \in E_i$ is the reception of a message m tagged by $C(m)$: make correction $C(e) := \max(C(e), C(m) + 1)$
[Figure: the event structure of slide 10, with logical clock values attached to events and, in parentheses, to messages]

18 Algorithm: as on slide 17
Properties
- clearly ensures $\forall e, e' \in E,\ e \to e' \Rightarrow C(e) < C(e')$
- but events may not be totally ordered: concurrent events could have the same tag
- a total order is obtained by appending index i to C(e) for $e \in E_i$
- the total order is the lexicographic order on pairs $(C(e), i)$
- each process can order its received messages in a unique manner, consistent with what all other processes do, and consistent with the causality of events in the run
- however, this might not be the true order of message production in physical time, which anyway is lost forever!

19 Algorithm: as on slide 17
[Figure: the event structure of slide 10 with its logical clock values, and the resulting total order of events: a e k b f l c g d h i m n]
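
The logical clock rules and the lexicographic tie-break on $(C(e), i)$ can be replayed in a few lines. This is a centralized simulation, not the distributed algorithm itself, and it rests on an assumption: only the messages $b \to h$ and $i \to m$ are encoded, since the other messages of the figure are not recoverable here. Names such as `lamport_clocks` and `SENDER_OF` are hypothetical.

```python
# Assumed encoding of the example run (the real figure may have more messages).
PROCESSES = {1: "abcd", 2: "efghi", 3: "klmn"}
SENDER_OF = {"h": "b", "m": "i"}     # reception event -> emission event

def lamport_clocks(processes, sender_of, schedule):
    """Replay events in `schedule` (any linear extension) and tag them."""
    owner = {e: i for i, events in processes.items() for e in events}
    last = {i: 0 for i in processes}     # clock of the latest event of P_i
    clock = {}
    for e in schedule:
        i = owner[e]
        c = last[i] + 1                  # C(e) = C(e') + 1, or 1 if e is first
        if e in sender_of:               # reception: the message carries C(m)
            c = max(c, clock[sender_of[e]] + 1)
        clock[e] = c
        last[i] = c
    return clock, owner

# "abcdefghiklmn" respects local orders and both messages.
CLOCK, OWNER = lamport_clocks(PROCESSES, SENDER_OF, "abcdefghiklmn")

# Total order: lexicographic order on pairs (C(e), process index).
TOTAL_ORDER = "".join(sorted(CLOCK, key=lambda e: (CLOCK[e], OWNER[e])))
```

With these assumed messages, `TOTAL_ORDER` comes out as `aekbflcgdhimn`, the same sequence as the total order shown on the slide.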

20 Applications
- Shared objects/states
- Mutual exclusion
- Banking problem: determine the total amount of money circulating among a set of actors (banks); local state = their current balance, messages = transactions (money sent)
Principle
- tag events and messages with a logical time
- assume all messages arrive, and message flows never stop
- decide some logical time slice t at which counting takes place
- then all processes wait until they have an event greater than time t
- collect the balance of each bank after the last event preceding time t
- determine the amount of money in flight at time t between all pairs $P_i$ and $P_j$ (i.e. sent by $P_i$ to $P_j$, but not yet received by $P_j$)
- easy: $P_i$ knows how much it sent to $P_j$ before time t, and $P_j$ knows how much it received from $P_i$ before time t

21 [Figure: chronogram of banks exchanging money transfers, with dollar amounts attached to balances and messages]

22 [Figure: the same chronogram with logical times on events and a counting slice at time 9]
Before time 9:
- $P_i$ sent $5 + $2 = $7 to $P_j$
- $P_j$ received $5 from $P_i$
- $2 are in flight
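
The in-flight computation reduces to a difference of two local sums. The sketch below assumes each bank keeps a log of (logical time, amount) pairs per peer; the log format, the timestamps and the function name are hypothetical, and only the amounts ($5 and $2 sent, $5 received before time 9) come from the slide.

```python
def in_flight(sent_log, received_log, t):
    """Money sent by P_i to P_j before logical time t and not received before t.

    sent_log:     (logical time of emission, amount) pairs kept by P_i
    received_log: (logical time of reception, amount) pairs kept by P_j
    """
    sent = sum(amount for (time, amount) in sent_log if time < t)
    received = sum(amount for (time, amount) in received_log if time < t)
    return sent - received

# Hypothetical timestamps; a reception always has a larger logical time
# than the matching emission, so the difference is never negative.
SENT_BY_PI = [(3, 5), (7, 2)]        # $5 sent at time 3, $2 sent at time 7
RECEIVED_BY_PJ = [(5, 5), (11, 2)]   # $5 received at time 5, $2 at time 11

PENDING = in_flight(SENT_BY_PI, RECEIVED_BY_PJ, 9)
```

The total amount of money at time t is then the sum of the collected balances plus the in-flight amounts over all ordered pairs of banks.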

23 Applications
- Definition of a snapshot (checkpoint), i.e. capture of a consistent global state from where a (failing) distributed computation could restart
- General idea: at some logical time t, all processes store their current state, and the content of messages that have been sent and are not yet received
- similar to the banking problem, where "in flight" messages must also be identified and stored
- Specific case of FIFO channels: see the Chandy-Lamport algorithm (85), that uses a marker to separate past messages from new ones in a channel
- Worth reading: important algorithm + historical interest. Paper driven by examples, not a formal presentation

24 Vector clock
Historically
- introduced independently by Fidge (Aust) and Mattern (Germ) in 88
- Fidge uses a slightly different construction, and is less formalized
- Mattern is a bit more formalized, and uses the notion of event structure
- Read Mattern!

25 Vector clock
Objective
- recover all possible consistent total orderings of events in a distributed run
- track the causality relations among events of a distributed system, with a distributed algorithm

26 A drawback of Lamport's logical time
- not all total orderings of events are accessible
- logical time is totally ordered: how to capture only causality?
- Lamport's clock only gives: $\forall e, e' \in E,\ e \prec e' \Rightarrow C(e) < C(e')$
- one would like to have: $\forall e, e' \in E,\ e \prec e' \Leftrightarrow VC(e) < VC(e')$
Fidge-Mattern's idea
- one local clock $C_i$ per process $P_i$
- clock ticks are tuples $VC(e) = (C_1(e), \dots, C_n(e)) \in \mathbb{N}^n$
Algorithm:
- if $e \in E_i$ is a new event in process $P_i$: if $\exists e' \in E_i,\ e' \to e$ then $VC(e) = (C_1(e'), \dots, C_i(e') + 1, \dots, C_n(e'))$, otherwise $VC(e) = (0, \dots, 0, 1, 0, \dots, 0)$ with 1 on the i-th coordinate
- if $e \in E_i$ is the sending of some message m from $P_i$ to $P_j$: send $VC(m) = VC(e)$ with message m (piggybacking)
- if $e \in E_i$ is the reception of a message m tagged by $VC(m)$: make correction $\forall k,\ C_k(e) = \max(C_k(e), C_k(m))$ in $VC(e)$

27 [Figure: the example event structure, with events a b c d, e f g h i and k l m n]

28 Thm: The event structure $\mathcal{E} = (E, \to)$ and Mattern's vector clock values on it generate isomorphic lattices of event subsets [Mattern 88]:
$\forall e, e' \in E,\ e \prec e' \Leftrightarrow VC(e) < VC(e')$
Proof
- one defines $VC(e) \le VC(e')$ by $\forall i,\ C_i(e) \le C_i(e')$, and $VC(e) < VC(e')$ by $VC(e) \le VC(e') \wedge VC(e) \ne VC(e')$
- the theorem expresses that the partial order of the DAG and the one on vector clock values are identical
- proof of $\Rightarrow$ is the same as for the logical clock
- proof of $\Leftarrow$ is more involved (see hint below)
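
The equivalence between precedence and the componentwise order on vector clocks can be checked exhaustively on a small run. The sketch below is a centralized replay under an assumption: it uses the local orders of the example and only the two messages $b \to h$ and $i \to m$ (the real figure may contain more). All names are hypothetical.

```python
from itertools import product

# Assumed encoding of the example run (the real figure may have more messages).
PROCESSES = ["abcd", "efghi", "klmn"]
SENDER_OF = {"h": "b", "m": "i"}          # reception event -> emission event
EVENTS = "".join(PROCESSES)

def vector_clocks(schedule):
    """Replay events in a causally consistent order and compute VC(e)."""
    n = len(PROCESSES)
    owner = {e: i for i, evs in enumerate(PROCESSES) for e in evs}
    last = [[0] * n for _ in range(n)]    # latest clock value of each process
    vc = {}
    for e in schedule:
        i = owner[e]
        v = list(last[i])
        v[i] += 1                         # tick the local coordinate
        if e in SENDER_OF:                # reception: componentwise max with VC(m)
            v = [max(x, y) for x, y in zip(v, vc[SENDER_OF[e]])]
        vc[e] = tuple(v)
        last[i] = v
    return vc

def vc_less(u, v):
    """Strict componentwise order on vector clock values."""
    return u != v and all(x <= y for x, y in zip(u, v))

def precedence():
    """Transitive closure of local arcs and message arcs."""
    prec = {(s, r) for r, s in SENDER_OF.items()}
    for evs in PROCESSES:
        prec.update(zip(evs, evs[1:]))
    for k, i, j in product(EVENTS, repeat=3):   # k is the outermost loop
        if (i, k) in prec and (k, j) in prec:
            prec.add((i, j))
    return prec

VC = vector_clocks(EVENTS)                # "abcdefghiklmn" is a linear extension
PREC = precedence()
THEOREM_HOLDS = all(
    ((x, y) in PREC) == vc_less(VC[x], VC[y])
    for x, y in product(EVENTS, repeat=2)
)
```

On this assumed run the check succeeds, and concurrent events such as c and m get incomparable vectors, which a single Lamport clock value cannot express.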

29 [Figure: the example event structure, with events a b c d, e f g h i and k l m n]

30 [Figure: the example event structure, each event labeled with the events of its causal past grouped per process; for instance the last event n is labeled ab / efghi / klmn]
The text recoding of the vector clock captures exactly all events that are causal predecessors of some given event in the DAG.
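
One way to read this "text recoding": coordinate k of $VC(e)$ counts the events of $P_k$ in the causal past of e (e itself included), so a vector clock value designates a prefix of each local order. The sketch below decodes a value into these prefixes; the two vector values are the ones obtained when only the messages $b \to h$ and $i \to m$ are assumed, and `causal_past` is a hypothetical helper.

```python
PROCESSES = ["abcd", "efghi", "klmn"]    # local orders of the example run

def causal_past(vc_value, processes):
    """Decode a vector clock value into one prefix of events per process.

    The decoded sets include the event itself, since its own coordinate
    was incremented when the event occurred.
    """
    return [events[:count] for events, count in zip(processes, vc_value)]

# Vector clock values under the assumed messages b -> h and i -> m.
PAST_OF_N = causal_past((2, 5, 4), PROCESSES)
PAST_OF_H = causal_past((2, 4, 0), PROCESSES)
```

Under this assumption the decoding of n gives ab, efghi and klmn.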

31 Applications
- distributed debugging: to keep track of the causality of events
- versioning system: discover inconsistent versions (concurrent editing)
- snapshots (storage of consistent global states), when channels are not FIFO (Chandy-Lamport not applicable)

32 Take home messages
- runs of distributed systems are partial orders (causal relations) of events
- better encoded as event structures
- this partial order can be tracked by distributed algorithms
- this is the starting point of more elaborate functions (snapshots, mutual exclusion, detection of stable properties, ...)
Next time
- processes as automata
- models for distributed systems
- true concurrency semantics to capture causality/parallelism
