Around FLP in 80 Slides

1 How can one get around FLP?

- Weaken the problem
  - Weaken termination:
    - use randomization to terminate with arbitrarily high probability
    - guarantee termination only during periods of synchrony
  - Weaken agreement:
    - ε-agreement: real-valued inputs and outputs; agreement within a small positive real tolerance ε
    - k-set agreement:
      - Agreement: in any execution, there is a subset W of the set of input values, |W| = k, such that all decision values are in W
      - Validity: in any execution, any decision value for any process is the input value of some process
  - Constrain input values: characterize the set of input values for which agreement is possible
- Paxos
- Strengthen the system model: introduce failure detectors to distinguish between crashed processes and very slow processes

2 The Part-Time Parliament

Government 101
- Parliament determines laws by passing a sequence of numbered decrees
- Direct democracy: citizens/legislators leave and enter the Chamber at arbitrary times
- No centralized records: each legislator carries a ledger recording the approved decrees
- No two ledgers contain contradictory information
- If a majority of legislators are in the Chamber and no one enters or leaves the Chamber for a sufficiently long time, then
  - any decree proposed by a legislator is eventually passed
  - any passed decree appears on the ledger of every legislator

In a world... of funky equipment (messengers, ledgers, pens with indelible ink, hourglasses), political intrigue, and mysterious characters!

Back to the future
- A protocol for state machine replication in an asynchronous environment that admits crash failures (key ideas already present in earlier work on Viewstamped Replication by Oki and Liskov)
- Messages between correct endpoints are eventually received; they can be lost and duplicated, but not corrupted

3 The big picture

[Diagram: a client tags each operation op with a client-specific command identifier cid, sends ⟨κ, cid, op⟩ to the replica servers, and gets back ⟨cid, result⟩. The figure shows five replicas (Replica 1 ... Replica 5), of which f + 1 suffice to tolerate f crash failures.]

4 The big picture: replicas

[Diagram: Replica 1 ... Replica 5.]

Replicas
- receive client requests
- propose commands for the lowest unused slot to leaders
- upon decision, execute commands in slot order
- return results to clients
- are not necessarily identical at any time!

Each replica ρ maintains four variables:
- ρ.state: the application state
- ρ.slot_num: the next slot for which ρ does not know a decision
- ρ.proposals: a set of ⟨slot number, command⟩ pairs for past proposals
- ρ.decisions: a set of ⟨slot number, command⟩ pairs for decided slots

5 Replica

process Replica(leaders, initial_state)
  var state := initial_state, slot_num := 1, proposals := ∅, decisions := ∅
  switch receive()
    case ⟨request, p⟩ :
      propose(p)
    case ⟨decision, s, p⟩ :
      decisions := decisions ∪ {⟨s, p⟩}
      while ∃p′ : ⟨slot_num, p′⟩ ∈ decisions do
        if ∃p″ : ⟨slot_num, p″⟩ ∈ proposals ∧ p″ ≠ p′ then propose(p″)
        perform(p′)
      end while

function propose(p)
  if ∄s : ⟨s, p⟩ ∈ decisions then
    s′ := min{s ∈ N⁺ : ∄p′ : ⟨s, p′⟩ ∈ proposals ∪ decisions}
    proposals := proposals ∪ {⟨s′, p⟩}
    ∀λ ∈ leaders : send(λ, ⟨propose, s′, p⟩)

function perform(⟨κ, cid, op⟩)
  if ∃s : s < slot_num ∧ ⟨s, ⟨κ, cid, op⟩⟩ ∈ decisions then
    slot_num := slot_num + 1                {already performed in an earlier slot: skip}
  else
    ⟨next, result⟩ := op(state)
    atomic
      state := next; slot_num := slot_num + 1
    end atomic
    send(κ, ⟨response, cid, result⟩)

Invariants
R1. For any given slot, replicas decide the same command
R2. All commands up to slot_num are in the set of decisions
R3. For all replicas ρ, ρ.state is the result of applying to the initial state the operations ⟨s, ⟨κ, cid, op⟩⟩ ∈ ρ.decisions, for 1 ≤ s < ρ.slot_num, in increasing order of slot number
R4. For each replica ρ, the variable ρ.slot_num never decreases
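The replica pseudocode above can be sketched in plain Python. This is an illustrative single-process simulation, not the protocol's networking code: `outbox` stands in for send(), decisions are fed in by hand, and all names are hypothetical.

```python
class Replica:
    """Sketch of the Replica pseudocode: state, slot_num, proposals, decisions."""

    def __init__(self, initial_state, leaders):
        self.state = initial_state
        self.slot_num = 1          # next slot with no known decision
        self.proposals = {}        # slot -> command
        self.decisions = {}        # slot -> command
        self.leaders = list(leaders)
        self.outbox = []           # stands in for send(destination, message)

    def propose(self, p):
        # Only propose commands not already decided, for the lowest unused slot.
        if p in self.decisions.values():
            return
        s = 1
        while s in self.proposals or s in self.decisions:
            s += 1
        self.proposals[s] = p
        for leader in self.leaders:
            self.outbox.append((leader, ("propose", s, p)))

    def perform(self, op):
        # Skip a command already performed in an earlier slot (duplicate decision).
        if any(s < self.slot_num and c == op for s, c in self.decisions.items()):
            self.slot_num += 1
        else:
            self.state, result = op(self.state)   # op returns (next_state, result)
            self.slot_num += 1
            self.outbox.append(("client", ("response", result)))

    def on_decision(self, s, p):
        self.decisions[s] = p
        # Execute decided commands in slot order; re-propose any displaced command.
        while self.slot_num in self.decisions:
            decided = self.decisions[self.slot_num]
            mine = self.proposals.get(self.slot_num)
            if mine is not None and mine != decided:
                self.propose(mine)
            self.perform(decided)
```

Note how the duplicate check in `perform` is what lets two replicas converge even when the same command is decided in two slots: the second decision only bumps `slot_num`.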

6 Paxos: the basic idea

- Leaders compete to create a permanent mapping between slot numbers and proposals
- The mapping is recorded in Paxos memory, at a set of state machines called acceptors (note: though state machines, acceptors are NOT replicas of each other!)
- A leader never proposes a map that may conflict with what is stored in Paxos memory: before attempting to create a new map between a proposal and a slot number for which it knows no decision, a leader reads the Paxos memory to check whether such a map may already exist
- Once a leader learns that a new mapping has become permanent, it informs the replicas

Ballots
- Each leader has an infinite supply of ballots
- The sets of ballots of different leaders are disjoint
- Ballots are lexicographically ordered pairs ⟨seq_no, LId⟩

Acceptors
- send messages only when prompted
- can crash... but we assume no more than a minority will
- at least 2f + 1 acceptors are needed to tolerate f faults
- Each acceptor α maintains two variables:
  - α.ballot_num, initially ⊥
  - α.accepted, a set of pvalues, initially empty
- A pvalue e = ⟨b, s, p⟩ is a triple of a ballot number b, a slot number s, and a proposal p
- "α accepts e" means e ∈ α.accepted; "α adopts b" means α.ballot_num := b
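The ballot structure above maps directly onto Python tuples, whose comparison is lexicographic. A minimal sketch (leader ids are illustrative strings):

```python
# A ballot is a (seq_no, leader_id) pair; tuple comparison is lexicographic,
# so ballots are totally ordered, and including the leader id makes the
# ballot sets of different leaders disjoint.
b1 = (0, "leader_A")
b2 = (0, "leader_B")
b3 = (1, "leader_A")

assert b1 < b2 < b3        # seq_no compared first, then leader id
assert b1 != b2            # same seq_no, different leaders: distinct ballots

# A leader preempted by ballot (r, lid) retries with (r + 1, own_id),
# which is strictly larger than the preempting ballot.
preempting = (5, "leader_B")
retry = (preempting[0] + 1, "leader_A")
assert retry > preempting
```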

7 A mapping is forever...

- ...once it is accepted by a majority of acceptors, it is then chosen
- α accepts a pvalue only if it includes the ballot most recently adopted by α
- To make mapping ⟨s, p⟩ permanent, λ needs a majority of acceptors to adopt the ballot of the pvalue that contains ⟨s, p⟩

Invariants
A1. An acceptor can only adopt strictly increasing ballot numbers
A2. An acceptor α can only accept ⟨b, s, p⟩ if b = α.ballot_num
A3. An acceptor α cannot remove entries from α.accepted
A4. For a given b and s, at most one proposal can be under consideration by the acceptors: ⟨b, s, p⟩ ∈ α.accepted ∧ ⟨b, s, p′⟩ ∈ α′.accepted ⟹ p = p′
A5. Suppose a majority of acceptors has ⟨b, s, p⟩ ∈ α.accepted. If b′ > b and ⟨b′, s, p′⟩ ∈ α′.accepted, then p = p′

process Acceptor()
  var ballot_num := ⊥, accepted := ∅
  switch receive()
    case ⟨p1a, λ, b⟩ :
      if b > ballot_num then
        ballot_num := b                        {adopts b iff larger than ballot_num}
      send(λ, ⟨p1b, self(), ballot_num, accepted⟩)   {returns all accepted pvalues}
    case ⟨p2a, λ, ⟨b, s, p⟩⟩ :
      if b ≥ ballot_num then
        ballot_num := b                        {adopts b iff larger than ballot_num}
        accepted := accepted ∪ {⟨b, s, p⟩}     {accepts e iff b equals ballot_num}
      send(λ, ⟨p2b, self(), ballot_num⟩)       {returns the current ballot_num}

Commander
- A leader λ holding ballot_num = b and trying to map slot s to proposal p spawns a new commander thread for ⟨b, s, p⟩
- A commander's mission has two possible outcomes:
  - success: the leader learns that the proposed mapping has been permanently established
  - failure: the leader learns that b may no longer be acceptable to a majority of acceptors
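The acceptor pseudocode can be sketched as a small Python class. Again a simulation with hypothetical names: messages are delivered by direct method calls, ⊥ is modeled as `None`, and ballots are the (seq_no, leader_id) tuples from the earlier slides.

```python
class Acceptor:
    """Sketch of the Acceptor pseudocode: adopt on p1a, accept on p2a."""

    def __init__(self):
        self.ballot_num = None     # ⊥: no ballot adopted yet
        self.accepted = set()      # set of pvalues (b, s, p)

    def _greater(self, b):
        # Any ballot is greater than ⊥.
        return self.ballot_num is None or b > self.ballot_num

    def on_p1a(self, b):
        if self._greater(b):                 # A1: adopt only strictly larger ballots
            self.ballot_num = b
        # Return all accepted pvalues along with the current ballot.
        return ("p1b", self.ballot_num, frozenset(self.accepted))

    def on_p2a(self, pvalue):
        b, s, p = pvalue
        if self._greater(b) or b == self.ballot_num:   # b >= ballot_num
            self.ballot_num = b
            self.accepted.add(pvalue)        # A2/A3: accept, never remove
        return ("p2b", self.ballot_num)
```

The `accepted` set only ever grows (A3), and a stale p2a with a ballot below `ballot_num` is acknowledged but not accepted, which is exactly how a commander learns it was preempted.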

8 Commander

process Commander(λ, acceptors, replicas, ⟨b, s, p⟩)
  var waitfor := acceptors
  ∀α ∈ acceptors : send(α, ⟨p2a, self(), ⟨b, s, p⟩⟩)
  switch receive()
    case ⟨p2b, α, b′⟩ :
      if b′ = b then
        waitfor := waitfor − {α}
        if |waitfor| < |acceptors|/2 then   {a majority accepted ⟨b, s, p⟩: it is chosen}
          ∀ρ ∈ replicas : send(ρ, ⟨decision, s, p⟩)
          exit()
      else                                  {a higher ballot b′ is active: a majority of acceptors may no longer be willing to accept b}
        send(λ, ⟨preempted, b′⟩)            {notify the leader and exit}
        exit()

Commander invariants
C1. For any b and s, at most one commander is spawned (enforces A4: for a given b and s, at most one proposal can be under consideration by the acceptors)
C2. Suppose a majority of acceptors has ⟨b, s, p⟩ ∈ α.accepted; if a commander is spawned for ⟨b′, s, p′⟩ with b′ > b, then p′ = p (enforces A5 and hence R1: for any given slot, replicas decide the same command)
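A Python sketch of the commander logic, under the same simulation assumptions as the other sketches (an `outbox` list stands in for send(), and all names are illustrative):

```python
class Commander:
    """Sketch of the Commander: push one pvalue (b, s, p) to a majority."""

    def __init__(self, outbox, acceptors, replicas, pvalue):
        self.outbox = outbox
        self.acceptors = set(acceptors)
        self.replicas = list(replicas)
        self.b, self.s, self.p = pvalue
        self.waitfor = set(acceptors)
        self.done = False
        # Ask every acceptor to accept the pvalue (phase 2a).
        for a in self.acceptors:
            self.outbox.append((a, ("p2a", (self.b, self.s, self.p))))

    def on_p2b(self, acceptor, b_prime):
        if self.done:
            return
        if b_prime == self.b:
            self.waitfor.discard(acceptor)
            # A majority accepted: the mapping (s, p) is chosen forever.
            if len(self.waitfor) < len(self.acceptors) / 2:
                for r in self.replicas:
                    self.outbox.append((r, ("decision", self.s, self.p)))
                self.done = True
        else:
            # A higher ballot is active: notify the leader and exit.
            self.outbox.append(("leader", ("preempted", b_prime)))
            self.done = True
```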

9 Scout

Before spawning commanders for ballot b, a leader invokes a scout. Scouts read the Paxos memory to help leaders propose mappings that satisfy C2. A scout's mission has two possible outcomes:
- success: the leader learns that the proposed ballot has been adopted by a majority of acceptors, and receives all pvalues accepted by that majority
- failure: the leader learns that b may no longer be acceptable to a majority of acceptors

process Scout(λ, acceptors, b)
  var waitfor := acceptors, pvalues := ∅
  ∀α ∈ acceptors : send(α, ⟨p1a, self(), b⟩)
  switch receive()
    case ⟨p1b, α, b′, r⟩ :
      if b′ = b then                        {α has adopted b}
        pvalues := pvalues ∪ r              {collect all pvalues α accepted while adopting ballots no larger than b}
        waitfor := waitfor − {α}
        if |waitfor| < |acceptors|/2 then   {a majority of acceptors has adopted b}
          send(λ, ⟨adopted, b, pvalues⟩)
          exit()
      else                                  {a higher ballot b′ is active: a majority of acceptors may no longer be willing to accept b}
        send(λ, ⟨preempted, b′⟩)
        exit()
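The scout mirrors the commander, but runs phase 1 and accumulates pvalues. A sketch under the same simulation assumptions (hypothetical names, `outbox` in place of send()):

```python
class Scout:
    """Sketch of the Scout: get a majority to adopt b and collect their pvalues."""

    def __init__(self, outbox, acceptors, b):
        self.outbox = outbox
        self.acceptors = set(acceptors)
        self.b = b
        self.waitfor = set(acceptors)
        self.pvalues = set()
        self.done = False
        # Ask every acceptor to adopt ballot b (phase 1a).
        for a in self.acceptors:
            self.outbox.append((a, ("p1a", b)))

    def on_p1b(self, acceptor, b_prime, accepted):
        if self.done:
            return
        if b_prime == self.b:                   # this acceptor adopted b
            self.pvalues |= set(accepted)       # pvalues accepted under ballots <= b
            self.waitfor.discard(acceptor)
            if len(self.waitfor) < len(self.acceptors) / 2:
                # Majority adopted b: hand the leader everything they accepted.
                self.outbox.append(("leader", ("adopted", self.b,
                                               frozenset(self.pvalues))))
                self.done = True
        else:
            # A higher ballot is active: notify the leader and exit.
            self.outbox.append(("leader", ("preempted", b_prime)))
            self.done = True
```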

10 Leader

Each leader λ maintains three variables:
- λ.ballot_num, initially ⟨0, λ⟩
- λ.active, a boolean, initially false
- λ.proposals, a map from slot numbers to proposals, initially empty

A leader
- spawns a scout for its initial ballot number
- enters a loop waiting for one of three messages:
  - ⟨propose, s, p⟩ from a replica
  - ⟨adopted, ballot_num, pvals⟩ from a scout
  - ⟨preempted, ⟨r′, λ′⟩⟩ from a commander or a scout
- moves between passive and active mode:
  - in passive mode, it is waiting for ⟨adopted, ballot_num, pvals⟩
  - in active mode, it spawns commanders for each of the proposals it holds

How a leader enforces C2
Suppose λ learns that a majority of acceptors has adopted its ballot b (⟨adopted, b, pvals⟩):
- CASE 1: if for some slot s there is no pvalue in pvals, then it is impossible that a permanent mapping for a smaller ballot already exists or will ever exist for s: any proposal by λ will satisfy C2
- CASE 2: let ⟨b′, s, p⟩ be the pvalue with the maximum ballot number b′ for s:
  - by induction, no pvalue other than p could have been chosen for s when ⟨b′, s, p⟩ was proposed
  - since a majority of acceptors has adopted b, no pvalue between b′ and b can be chosen
  - by proposing p with ballot b, λ enforces C2

11 Leader

process Leader(acceptors, replicas)
  var ballot_num := ⟨0, self()⟩, active := false, proposals := ∅
  spawn(Scout(self(), acceptors, ballot_num))
  switch receive()
    case ⟨propose, s, p⟩ :
      if ∄p′ : ⟨s, p′⟩ ∈ proposals then
        proposals := proposals ∪ {⟨s, p⟩}
        if active then
          spawn(Commander(self(), acceptors, replicas, ⟨ballot_num, s, p⟩))
    case ⟨adopted, ballot_num, pvals⟩ :
      proposals := proposals ◁ pmax(pvals)
      ∀⟨s, p⟩ ∈ proposals : spawn(Commander(self(), acceptors, replicas, ⟨ballot_num, s, p⟩))
      active := true
    case ⟨preempted, ⟨r′, λ′⟩⟩ :
      if ⟨r′, λ′⟩ > ballot_num then
        active := false
        ballot_num := ⟨r′ + 1, self()⟩
        spawn(Scout(self(), acceptors, ballot_num))

where
  pmax(pvals) ≡ {⟨s, p⟩ | ∃b : ⟨b, s, p⟩ ∈ pvals ∧ ∀b′, p′ : ⟨b′, s, p′⟩ ∈ pvals ⟹ b′ ≤ b}
  x ◁ y ≡ {⟨s, p⟩ | ⟨s, p⟩ ∈ y ∨ (⟨s, p⟩ ∈ x ∧ ∄p′ : ⟨s, p′⟩ ∈ y)}

Paxos and FLP
- Paxos is always safe, despite asynchrony
- Once a single leader is elected, Paxos is live. Ciao ciao, FLP?
- Gotcha: to be live, Paxos requires a single leader, and leader election is impossible in an asynchronous system
- Given FLP, Paxos is the next best thing: always safe, and live during periods of synchrony

The big picture: the triumph of randomization
- Does randomization make for more powerful algorithms?
- Does randomization expand the class of problems solvable in polynomial time?
- Does randomization help compute problems fast in parallel in the PRAM model?
You tell me!
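The two operators the leader applies on ⟨adopted, b, pvals⟩ translate directly into Python; the helper names `pmax` and `update` are illustrative. Representing proposal sets as slot-keyed dicts makes the ◁ operator a one-line dict merge:

```python
def pmax(pvals):
    """For each slot, keep the proposal carried by the maximum ballot number.

    pvals is a set of pvalues (ballot, slot, proposal), ballots being
    comparable tuples; returns a dict slot -> proposal.
    """
    best = {}  # slot -> (ballot, proposal)
    for b, s, p in pvals:
        if s not in best or b > best[s][0]:
            best[s] = (b, p)
    return {s: p for s, (b, p) in best.items()}

def update(x, y):
    """x ◁ y: per slot, entries of y take precedence over entries of x."""
    return {**x, **y}
```

This is how the leader "inherits" possibly chosen proposals: anything some acceptor accepted wins over the leader's own pending proposal for the same slot, which is exactly CASE 2 of the C2 argument above.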

12 The Triumph of Randomization?

Well, at least for distributed computations!
- There is no deterministic 1-crash-resilient solution to Consensus...
- ...but there is an f-resilient randomized solution to Consensus (f < n/2) for crash failures
- A randomized solution for Consensus exists even for Byzantine failures!

M. Ben-Or. Another advantage of free choice: completely asynchronous agreement protocols (PODC 1983)
- exponential number of operations per process, BUT more practical protocols exist, down to O(n log² n) expected operations per process, (n−1)-resilient

Weakening termination
- Consensus requires correct processes to always decide
- Ben-Or's algorithm only requires termination with probability 1
- They are not the same thing!

The protocol's structure
- An infinite repetition of asynchronous rounds; in round r, p only handles messages with timestamp r
- Each round has two phases:
  - in the first, each p broadcasts an a-value, which is a function of the b-values collected in the previous round (the first a-value is the input bit)
  - in the second, each p broadcasts a b-value, which is a function of the collected a-values
- decide stutters

13 Ben-Or's Algorithm

1:  a_p := input bit; r := 1
2:  while true do
3:    {phase 1}
4:    send ⟨a_p, r⟩ to all
5:    let A be the multiset of the first n−f a-values received with timestamp r
6:    if (∃v ∈ {0, 1} : ∀a ∈ A : a = v) then b_p := v
7:    else b_p := ⊥
8:    {phase 2}
9:    send ⟨b_p, r⟩ to all
10:   let B be the multiset of the first n−f b-values received with timestamp r
11:   if (∃v ∈ {0, 1} : ∀b ∈ B : b = v) then decide(v); a_p := v
12:   else if (∃b ∈ B : b ≠ ⊥) then a_p := b
13:   else a_p := $   {$ is chosen uniformly at random from {0, 1}}
14:   r := r + 1

Validity
- All identical inputs (∀ inputs = i): each process sets its a-value to i and broadcasts it to all
- Since at most f are faulty, every correct process receives at least n−f identical a-values in round 1
- Every correct process sets its b-value to i and broadcasts it to all
- Again, every correct process receives at least n−f identical b-values i in round 1, and decides i

A useful observation
Lemma. For all r, either b_{p,r} ∈ {1, ⊥} for all p, or b_{p,r} ∈ {0, ⊥} for all p
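The two phases of a Ben-Or round, as seen from a single process, can be sketched as follows. The network is replaced by the multisets A and B handed in directly, ⊥ is modeled as `None`, and the function names are illustrative:

```python
import random

def phase1(a_values, n, f):
    """Compute b_p from the multiset A of the first n-f a-values (lines 5-7)."""
    A = a_values[:n - f]
    for v in (0, 1):
        if all(a == v for a in A):
            return v          # unanimous a-values: b_p := v
    return None               # otherwise b_p := ⊥

def phase2(b_values, n, f, rng=random):
    """Compute (decided, new_a_p) from the multiset B (lines 10-13)."""
    B = b_values[:n - f]
    for v in (0, 1):
        if all(b == v for b in B):
            return v, v       # line 11: decide(v); a_p := v
    for b in B:
        if b is not None:
            return None, b    # line 12: adopt some non-⊥ b-value
    return None, rng.choice((0, 1))   # line 13: flip a fair coin
```

By the Lemma, all non-⊥ values handed to line 12 in a given round are identical, which is why adopting "some" non-⊥ b-value is well defined.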

14 Agreement

Lemma. For all r, either b_{p,r} ∈ {1, ⊥} for all p, or b_{p,r} ∈ {0, ⊥} for all p
Proof. By contradiction. Suppose there exist p and q at round r such that b_{p,r} = 0 and b_{q,r} = 1. From lines 6 and 7, p received n−f distinct 0s and q received n−f distinct 1s. Then 2(n−f) ≤ n, implying n ≤ 2f. Contradiction.

Corollary. It is impossible that two processes p and q decide on different values at round r.

- Let r be the first round in which a decision is made, and let p be a process that decides i in r
- By the Corollary, no other process can decide on a different value in r
- To decide, p must have received n−f b-values i from distinct processes; hence every other correct process has received i from at least n−2f ≥ 1 of them
- By lines 11 and 12, every correct process sets its a-value for round r+1 to i
- By the same argument used to prove Validity, every correct process that has not decided i in round r will do so by the end of round r+1

Termination I
- Remember that, by Validity, if all correct processes propose the same value i in phase 1 of round r, then every correct process decides i in round r
- If a-values were fair coin flips, the probability of all processes proposing the same value (a landslide) would be Pr[landslide in round 1] ≥ 1/2ⁿ
- What can we say about the following rounds?

15 Termination II

- In rounds r > 1, the a-values are not necessarily chosen at random: by line 12, some process may set its a-value to a non-random value v
- By the Lemma, however, all non-random values are identical!
- Therefore, in every round r there is a positive probability (at least 1/2ⁿ) of a landslide
- Hence, for any round r: Pr[no landslide at round r] ≤ 1 − 1/2ⁿ
- Since coin flips are independent: Pr[no landslide in the first k rounds] ≤ (1 − 1/2ⁿ)ᵏ
- When k = 2ⁿ, this value is about 1/e; then, for k = c·2ⁿ:
  Pr[landslide within k rounds] ≥ 1 − (1 − 1/2ⁿ)ᵏ ≥ 1 − 1/eᶜ
  which converges quickly to 1 as c grows
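The bound above is easy to check numerically. A small sanity check, not part of the protocol (the helper name is illustrative):

```python
import math

def pr_landslide_within(k, n):
    """Lower bound: Pr[landslide within k rounds] >= 1 - (1 - 1/2**n)**k."""
    return 1 - (1 - 2.0 ** -n) ** k

n = 10
# After k = 2^n rounds, the no-landslide probability is about 1/e...
assert abs((1 - 2.0 ** -n) ** (2 ** n) - 1 / math.e) < 0.01
# ...and with k = c * 2^n it drops like e^{-c}.
for c in (1, 2, 5):
    assert pr_landslide_within(c * 2 ** n, n) >= 1 - math.exp(-c) - 0.01
```

The expected number of rounds is thus exponential in n, matching the slide's remark that Ben-Or's protocol takes an exponential number of operations per process.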


Valency Arguments CHAPTER7 CHAPTER7 Valency Arguments In a valency argument, configurations are classified as either univalent or multivalent. Starting from a univalent configuration, all terminating executions (from some class)

More information

Weakening Failure Detectors for k-set Agreement via the Partition Approach

Weakening Failure Detectors for k-set Agreement via the Partition Approach Weakening Failure Detectors for k-set Agreement via the Partition Approach Wei Chen 1, Jialin Zhang 2, Yu Chen 1, Xuezheng Liu 1 1 Microsoft Research Asia {weic, ychen, xueliu}@microsoft.com 2 Center for

More information

Byzantine Agreement in Polynomial Expected Time

Byzantine Agreement in Polynomial Expected Time Byzantine Agreement in Polynomial Expected Time [Extended Abstract] Valerie King Dept. of Computer Science, University of Victoria P.O. Box 3055 Victoria, BC, Canada V8W 3P6 val@cs.uvic.ca ABSTRACT In

More information

Protocol for Asynchronous, Reliable, Secure and Efficient Consensus (PARSEC)

Protocol for Asynchronous, Reliable, Secure and Efficient Consensus (PARSEC) Protocol for Asynchronous, Reliable, Secure and Efficient Consensus (PARSEC) Pierre Chevalier, Bart lomiej Kamiński, Fraser Hutchison, Qi Ma, Spandan Sharma June 20, 2018 Abstract In this paper we present

More information

Benchmarking Model Checkers with Distributed Algorithms. Étienne Coulouma-Dupont

Benchmarking Model Checkers with Distributed Algorithms. Étienne Coulouma-Dupont Benchmarking Model Checkers with Distributed Algorithms Étienne Coulouma-Dupont November 24, 2011 Introduction The Consensus Problem Consensus : application Paxos LastVoting Hypothesis The Algorithm Analysis

More information

Failure detectors Introduction CHAPTER

Failure detectors Introduction CHAPTER CHAPTER 15 Failure detectors 15.1 Introduction This chapter deals with the design of fault-tolerant distributed systems. It is widely known that the design and verification of fault-tolerent distributed

More information

Synchrony Weakened by Message Adversaries vs Asynchrony Restricted by Failure Detectors

Synchrony Weakened by Message Adversaries vs Asynchrony Restricted by Failure Detectors Synchrony Weakened by Message Adversaries vs Asynchrony Restricted by Failure Detectors Michel RAYNAL, Julien STAINER Institut Universitaire de France IRISA, Université de Rennes, France Message adversaries

More information

Genuine atomic multicast in asynchronous distributed systems

Genuine atomic multicast in asynchronous distributed systems Theoretical Computer Science 254 (2001) 297 316 www.elsevier.com/locate/tcs Genuine atomic multicast in asynchronous distributed systems Rachid Guerraoui, Andre Schiper Departement d Informatique, Ecole

More information

6.852: Distributed Algorithms Fall, Class 10

6.852: Distributed Algorithms Fall, Class 10 6.852: Distributed Algorithms Fall, 2009 Class 10 Today s plan Simulating synchronous algorithms in asynchronous networks Synchronizers Lower bound for global synchronization Reading: Chapter 16 Next:

More information

On the Number of Synchronous Rounds Required for Byzantine Agreement

On the Number of Synchronous Rounds Required for Byzantine Agreement On the Number of Synchronous Rounds Required for Byzantine Agreement Matthias Fitzi 1 and Jesper Buus Nielsen 2 1 ETH Zürich 2 University pf Aarhus Abstract. Byzantine agreement is typically considered

More information

Distributed Systems Byzantine Agreement

Distributed Systems Byzantine Agreement Distributed Systems Byzantine Agreement He Sun School of Informatics University of Edinburgh Outline Finish EIG algorithm for Byzantine agreement. Number-of-processors lower bound for Byzantine agreement.

More information

The Weakest Failure Detector to Solve Mutual Exclusion

The Weakest Failure Detector to Solve Mutual Exclusion The Weakest Failure Detector to Solve Mutual Exclusion Vibhor Bhatt Nicholas Christman Prasad Jayanti Dartmouth College, Hanover, NH Dartmouth Computer Science Technical Report TR2008-618 April 17, 2008

More information

Faster Agreement via a Spectral Method for Detecting Malicious Behavior

Faster Agreement via a Spectral Method for Detecting Malicious Behavior Faster Agreement via a Spectral Method for Detecting Malicious Behavior Valerie King Jared Saia Abstract We address the problem of Byzantine agreement, to bring processors to agreement on a bit in the

More information

Easy Consensus Algorithms for the Crash-Recovery Model

Easy Consensus Algorithms for the Crash-Recovery Model Reihe Informatik. TR-2008-002 Easy Consensus Algorithms for the Crash-Recovery Model Felix C. Freiling, Christian Lambertz, and Mila Majster-Cederbaum Department of Computer Science, University of Mannheim,

More information

The Weighted Byzantine Agreement Problem

The Weighted Byzantine Agreement Problem The Weighted Byzantine Agreement Problem Vijay K. Garg and John Bridgman Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 78712-1084, USA garg@ece.utexas.edu,

More information

6.852: Distributed Algorithms Fall, Class 24

6.852: Distributed Algorithms Fall, Class 24 6.852: Distributed Algorithms Fall, 2009 Class 24 Today s plan Self-stabilization Self-stabilizing algorithms: Breadth-first spanning tree Mutual exclusion Composing self-stabilizing algorithms Making

More information

DEXON Consensus Algorithm

DEXON Consensus Algorithm DEXON Consensus Algorithm Infinitely Scalable, Low-Latency Byzantine Fault Tolerant Blocklattice DEXON Foundation Aug 9 2018, v1.0 Abstract A blockchain system is a replicated state machine that must be

More information

Simple Bivalency Proofs of the Lower Bounds in Synchronous Consensus Problems

Simple Bivalency Proofs of the Lower Bounds in Synchronous Consensus Problems Simple Bivalency Proofs of the Lower Bounds in Synchronous Consensus Problems Xianbing Wang, Yong-Meng Teo, and Jiannong Cao Singapore-MIT Alliance E4-04-10, 4 Engineering Drive 3, Singapore 117576 Abstract

More information

The Weakest Failure Detector for Wait-Free Dining under Eventual Weak Exclusion

The Weakest Failure Detector for Wait-Free Dining under Eventual Weak Exclusion The Weakest Failure Detector for Wait-Free Dining under Eventual Weak Exclusion Srikanth Sastry Computer Science and Engr Texas A&M University College Station, TX, USA sastry@cse.tamu.edu Scott M. Pike

More information

Consensus. Consensus problems

Consensus. Consensus problems Consensus problems 8 all correct computers controlling a spaceship should decide to proceed with landing, or all of them should decide to abort (after each has proposed one action or the other) 8 in an

More information

CS3110 Spring 2017 Lecture 21: Distributed Computing with Functional Processes

CS3110 Spring 2017 Lecture 21: Distributed Computing with Functional Processes CS3110 Spring 2017 Lecture 21: Distributed Computing with Functional Processes Robert Constable Date for Due Date PS6 Out on April 24 May 8 (day of last lecture) 1 Introduction In the next two lectures,

More information

1 Introduction. 1.1 The Problem Domain. Self-Stablization UC Davis Earl Barr. Lecture 1 Introduction Winter 2007

1 Introduction. 1.1 The Problem Domain. Self-Stablization UC Davis Earl Barr. Lecture 1 Introduction Winter 2007 Lecture 1 Introduction 1 Introduction 1.1 The Problem Domain Today, we are going to ask whether a system can recover from perturbation. Consider a children s top: If it is perfectly vertically, you can

More information

Signature-Free Broadcast-Based Intrusion Tolerance: Never Decide a Byzantine Value

Signature-Free Broadcast-Based Intrusion Tolerance: Never Decide a Byzantine Value Signature-Free Broadcast-Based Intrusion Tolerance: Never Decide a Byzantine Value Achour Mostefaoui, Michel Raynal To cite this version: Achour Mostefaoui, Michel Raynal. Signature-Free Broadcast-Based

More information

Eventual Leader Election with Weak Assumptions on Initial Knowledge, Communication Reliability, and Synchrony

Eventual Leader Election with Weak Assumptions on Initial Knowledge, Communication Reliability, and Synchrony Eventual Leader Election with Weak Assumptions on Initial Knowledge, Communication Reliability, and Synchrony Antonio FERNÁNDEZ Ernesto JIMÉNEZ Michel RAYNAL LADyR, GSyC, Universidad Rey Juan Carlos, 28933

More information

The Extended BG-Simulation and the Characterization of t-resiliency

The Extended BG-Simulation and the Characterization of t-resiliency The Extended BG-Simulation and the Characterization of t-resiliency Eli Gafni November 17, 2008 Abstract A distributed task T on n processors is an input/output relation between a collection of processors

More information

Failure Detection and Consensus in the Crash-Recovery Model

Failure Detection and Consensus in the Crash-Recovery Model Failure Detection and Consensus in the Crash-Recovery Model Marcos Kawazoe Aguilera Wei Chen Sam Toueg Department of Computer Science Upson Hall, Cornell University Ithaca, NY 14853-7501, USA. aguilera,weichen,sam@cs.cornell.edu

More information

arxiv: v2 [cs.dc] 18 Feb 2015

arxiv: v2 [cs.dc] 18 Feb 2015 Consensus using Asynchronous Failure Detectors Nancy Lynch CSAIL, MIT Srikanth Sastry CSAIL, MIT arxiv:1502.02538v2 [cs.dc] 18 Feb 2015 Abstract The FLP result shows that crash-tolerant consensus is impossible

More information

Time. To do. q Physical clocks q Logical clocks

Time. To do. q Physical clocks q Logical clocks Time To do q Physical clocks q Logical clocks Events, process states and clocks A distributed system A collection P of N single-threaded processes (p i, i = 1,, N) without shared memory The processes in

More information

Time. Lakshmi Ganesh. (slides borrowed from Maya Haridasan, Michael George)

Time. Lakshmi Ganesh. (slides borrowed from Maya Haridasan, Michael George) Time Lakshmi Ganesh (slides borrowed from Maya Haridasan, Michael George) The Problem Given a collection of processes that can... only communicate with significant latency only measure time intervals approximately

More information

On Stabilizing Departures in Overlay Networks

On Stabilizing Departures in Overlay Networks On Stabilizing Departures in Overlay Networks Dianne Foreback 1, Andreas Koutsopoulos 2, Mikhail Nesterenko 1, Christian Scheideler 2, and Thim Strothmann 2 1 Kent State University 2 University of Paderborn

More information

Replication predicates for dependent-failure algorithms

Replication predicates for dependent-failure algorithms Replication predicates for dependent-failure algorithms Flavio Junqueira and Keith Marzullo Department of Computer Science and Engineering University of California, San Diego La Jolla, CA USA {flavio,

More information

The Heard-Of Model: Computing in Distributed Systems with Benign Failures

The Heard-Of Model: Computing in Distributed Systems with Benign Failures The Heard-Of Model: Computing in Distributed Systems with Benign Failures Bernadette Charron-Bost Ecole polytechnique, France André Schiper EPFL, Switzerland Abstract Problems in fault-tolerant distributed

More information

The Las-Vegas Processor Identity Problem (How and When to Be Unique)

The Las-Vegas Processor Identity Problem (How and When to Be Unique) The Las-Vegas Processor Identity Problem (How and When to Be Unique) Shay Kutten Department of Industrial Engineering The Technion kutten@ie.technion.ac.il Rafail Ostrovsky Bellcore rafail@bellcore.com

More information

Failure detection and consensus in the crash-recovery model

Failure detection and consensus in the crash-recovery model Distrib. Comput. (2000) 13: 99 125 c Springer-Verlag 2000 Failure detection and consensus in the crash-recovery model Marcos Kawazoe Aguilera 1, Wei Chen 2, Sam Toueg 1 1 Department of Computer Science,

More information

Combining Shared Coin Algorithms

Combining Shared Coin Algorithms Combining Shared Coin Algorithms James Aspnes Hagit Attiya Keren Censor Abstract This paper shows that shared coin algorithms can be combined to optimize several complexity measures, even in the presence

More information

On Equilibria of Distributed Message-Passing Games

On Equilibria of Distributed Message-Passing Games On Equilibria of Distributed Message-Passing Games Concetta Pilotto and K. Mani Chandy California Institute of Technology, Computer Science Department 1200 E. California Blvd. MC 256-80 Pasadena, US {pilotto,mani}@cs.caltech.edu

More information

CS505: Distributed Systems

CS505: Distributed Systems Cristina Nita-Rotaru CS505: Distributed Systems Ordering events. Lamport and vector clocks. Global states. Detecting failures. Required reading for this topic } Leslie Lamport,"Time, Clocks, and the Ordering

More information

Optimal and Player-Replaceable Consensus with an Honest Majority Silvio Micali and Vinod Vaikuntanathan

Optimal and Player-Replaceable Consensus with an Honest Majority Silvio Micali and Vinod Vaikuntanathan Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2017-004 March 31, 2017 Optimal and Player-Replaceable Consensus with an Honest Majority Silvio Micali and Vinod Vaikuntanathan

More information

On the Efficiency of Atomic Multi-Reader, Multi-Writer Distributed Memory

On the Efficiency of Atomic Multi-Reader, Multi-Writer Distributed Memory On the Efficiency of Atomic Multi-Reader, Multi-Writer Distributed Memory Burkhard Englert benglert@csulb.edu Nicolas Nicolaou nicolas@engr.uconn.edu (Regular Submission) Chryssis Georgiou chryssis@cs.ucy.ac.cy

More information

A Timing Assumption and two t-resilient Protocols for Implementing an Eventual Leader Service in Asynchronous Shared Memory Systems

A Timing Assumption and two t-resilient Protocols for Implementing an Eventual Leader Service in Asynchronous Shared Memory Systems A Timing Assumption and two t-resilient Protocols for Implementing an Eventual Leader Service in Asynchronous Shared Memory Systems Antonio Fernández, Ernesto Jiménez, Michel Raynal, Gilles Trédan To cite

More information

JASS 06 Report Summary. Circuit Complexity. Konstantin S. Ushakov. May 14, 2006

JASS 06 Report Summary. Circuit Complexity. Konstantin S. Ushakov. May 14, 2006 JASS 06 Report Summary Circuit Complexity Konstantin S. Ushakov May 14, 2006 Abstract Computer science deals with many computational models. In real life we have normal computers that are constructed using,

More information

On the Resilience and Uniqueness of CPA for Secure Broadcast

On the Resilience and Uniqueness of CPA for Secure Broadcast On the Resilience and Uniqueness of CPA for Secure Broadcast Chris Litsas, Aris Pagourtzis, Giorgos Panagiotakos and Dimitris Sakavalas School of Electrical and Computer Engineering National Technical

More information

THE chase for the weakest system model that allows

THE chase for the weakest system model that allows 1 Chasing the Weakest System Model for Implementing Ω and Consensus Martin Hutle, Dahlia Malkhi, Ulrich Schmid, Lidong Zhou Abstract Aguilera et al. and Malkhi et al. presented two system models, which

More information

Quantum Algorithms for Leader Election Problem in Distributed Systems

Quantum Algorithms for Leader Election Problem in Distributed Systems Quantum Algorithms for Leader Election Problem in Distributed Systems Pradeep Sarvepalli pradeep@cs.tamu.edu Department of Computer Science, Texas A&M University Quantum Algorithms for Leader Election

More information

THE WEAKEST FAILURE DETECTOR FOR SOLVING WAIT-FREE, EVENTUALLY BOUNDED-FAIR DINING PHILOSOPHERS. A Dissertation YANTAO SONG

THE WEAKEST FAILURE DETECTOR FOR SOLVING WAIT-FREE, EVENTUALLY BOUNDED-FAIR DINING PHILOSOPHERS. A Dissertation YANTAO SONG THE WEAKEST FAILURE DETECTOR FOR SOLVING WAIT-FREE, EVENTUALLY BOUNDED-FAIR DINING PHILOSOPHERS A Dissertation by YANTAO SONG Submitted to the Office of Graduate Studies of Texas A&M University in partial

More information

Symmetric Rendezvous in Graphs: Deterministic Approaches

Symmetric Rendezvous in Graphs: Deterministic Approaches Symmetric Rendezvous in Graphs: Deterministic Approaches Shantanu Das Technion, Haifa, Israel http://www.bitvalve.org/~sdas/pres/rendezvous_lorentz.pdf Coauthors: Jérémie Chalopin, Adrian Kosowski, Peter

More information

Byzantine behavior also includes collusion, i.e., all byzantine nodes are being controlled by the same adversary.

Byzantine behavior also includes collusion, i.e., all byzantine nodes are being controlled by the same adversary. Chapter 17 Byzantine Agreement In order to make flying safer, researchers studied possible failures of various sensors and machines used in airplanes. While trying to model the failures, they were confronted

More information

Byzantine Agreement. Chapter Validity 190 CHAPTER 17. BYZANTINE AGREEMENT

Byzantine Agreement. Chapter Validity 190 CHAPTER 17. BYZANTINE AGREEMENT 190 CHAPTER 17. BYZANTINE AGREEMENT 17.1 Validity Definition 17.3 (Any-Input Validity). The decision value must be the input value of any node. Chapter 17 Byzantine Agreement In order to make flying safer,

More information

Approximation of δ-timeliness

Approximation of δ-timeliness Approximation of δ-timeliness Carole Delporte-Gallet 1, Stéphane Devismes 2, and Hugues Fauconnier 1 1 Université Paris Diderot, LIAFA {Carole.Delporte,Hugues.Fauconnier}@liafa.jussieu.fr 2 Université

More information

Privacy and Fault-Tolerance in Distributed Optimization. Nitin Vaidya University of Illinois at Urbana-Champaign

Privacy and Fault-Tolerance in Distributed Optimization. Nitin Vaidya University of Illinois at Urbana-Champaign Privacy and Fault-Tolerance in Distributed Optimization Nitin Vaidya University of Illinois at Urbana-Champaign Acknowledgements Shripad Gade Lili Su argmin x2x SX i=1 i f i (x) Applications g f i (x)

More information

Atomic m-register operations

Atomic m-register operations Atomic m-register operations Michael Merritt Gadi Taubenfeld December 15, 1993 Abstract We investigate systems where it is possible to access several shared registers in one atomic step. We characterize

More information

Communication-Efficient Randomized Consensus

Communication-Efficient Randomized Consensus Communication-Efficient Randomized Consensus Dan Alistarh 1, James Aspnes 2, Valerie King 3, and Jared Saia 4 1 Microsoft Research, Cambridge, UK. Email: dan.alistarh@microsoft.com 2 Yale University, Department

More information

Simultaneous Consensus Tasks: A Tighter Characterization of Set-Consensus

Simultaneous Consensus Tasks: A Tighter Characterization of Set-Consensus Simultaneous Consensus Tasks: A Tighter Characterization of Set-Consensus Yehuda Afek 1, Eli Gafni 2, Sergio Rajsbaum 3, Michel Raynal 4, and Corentin Travers 4 1 Computer Science Department, Tel-Aviv

More information

Byzantine agreement with homonyms

Byzantine agreement with homonyms Distrib. Comput. (013) 6:31 340 DOI 10.1007/s00446-013-0190-3 Byzantine agreement with homonyms Carole Delporte-Gallet Hugues Fauconnier Rachid Guerraoui Anne-Marie Kermarrec Eric Ruppert Hung Tran-The

More information

arxiv:cs/ v1 [cs.dc] 29 Dec 2004

arxiv:cs/ v1 [cs.dc] 29 Dec 2004 Reductions in Distributed Computing Part I: Consensus and Atomic Commitment Tasks arxiv:cs/0412115v1 [cs.dc] 29 Dec 2004 Bernadette Charron-Bost Abstract We introduce several notions of reduction in distributed

More information

On Randomization versus Synchronization in Distributed Systems

On Randomization versus Synchronization in Distributed Systems On Randomization versus Synchronization in Distributed Systems Hagen Völzer Institut für Theoretische Informatik Universität zu Lübeck Germany July 14, 2004 ICALP 2004, Turku 0 Outline two new impossibility

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department 1 Reference Book: INTRODUCTION TO THE THEORY OF COMPUTATION, SECOND EDITION, by: MICHAEL SIPSER Computer Sciences Department 3 ADVANCED TOPICS IN C O M P U T A B I L I T Y

More information

Termination Detection in an Asynchronous Distributed System with Crash-Recovery Failures

Termination Detection in an Asynchronous Distributed System with Crash-Recovery Failures Termination Detection in an Asynchronous Distributed System with Crash-Recovery Failures Technical Report Department for Mathematics and Computer Science University of Mannheim TR-2006-008 Felix C. Freiling

More information

FAIRNESS FOR DISTRIBUTED ALGORITHMS

FAIRNESS FOR DISTRIBUTED ALGORITHMS FAIRNESS FOR DISTRIBUTED ALGORITHMS A Dissertation submitted to the Faculty of the Graduate School of Arts and Sciences of Georgetown University in partial fulfillment of the requirements for the degree

More information

Information-Theoretic Lower Bounds on the Storage Cost of Shared Memory Emulation

Information-Theoretic Lower Bounds on the Storage Cost of Shared Memory Emulation Information-Theoretic Lower Bounds on the Storage Cost of Shared Memory Emulation Viveck R. Cadambe EE Department, Pennsylvania State University, University Park, PA, USA viveck@engr.psu.edu Nancy Lynch

More information

(x 1 +x 2 )(x 1 x 2 )+(x 2 +x 3 )(x 2 x 3 )+(x 3 +x 1 )(x 3 x 1 ).

(x 1 +x 2 )(x 1 x 2 )+(x 2 +x 3 )(x 2 x 3 )+(x 3 +x 1 )(x 3 x 1 ). CMPSCI611: Verifying Polynomial Identities Lecture 13 Here is a problem that has a polynomial-time randomized solution, but so far no poly-time deterministic solution. Let F be any field and let Q(x 1,...,

More information