Seeking Fastness in Multi-Writer Multiple-Reader Atomic Register Implementations


Seeking Fastness in Multi-Writer Multiple-Reader Atomic Register Implementations (project /69/3)

Nicolas Nicolaou
University of Cyprus & University of Connecticut

Funded by the Cyprus Research Promotion Foundation under the DESMI 29- programme, and co-funded by the Republic of Cyprus and the European Regional Development Fund of the EU

What is a Distributed Storage System?

- Clients issue read() and write(v) operations against a distributed storage abstraction
- Data is replicated over servers/disks for survivability and availability
- Read/write operations must satisfy some consistency semantics

Definition: Operation Relations

Precedence relations for two operations π1, π2:
- π1 precedes π2 if the response of π1 happens before the invocation of π2
- π1 succeeds π2 if the invocation of π1 happens after the response of π2
- π1 is concurrent with π2 if π1 neither precedes nor succeeds π2
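The precedence relations above can be made concrete with a small sketch (my own toy model, not from the slides): each operation is an (invocation, response) time interval, and concurrency is the absence of precedence in either direction.

```python
# Toy model: an operation is a pair (invocation_time, response_time).

def precedes(op1, op2):
    """op1 precedes op2 if op1's response happens before op2's invocation."""
    return op1[1] < op2[0]

def concurrent(op1, op2):
    """op1 and op2 are concurrent if neither precedes the other."""
    return not precedes(op1, op2) and not precedes(op2, op1)

w = (0.0, 2.0)   # a write invoked at t=0, responding at t=2
r1 = (3.0, 4.0)  # a read invoked after the write's response
r2 = (1.0, 5.0)  # a read overlapping the write

assert precedes(w, r1)    # w precedes r1
assert concurrent(w, r2)  # w and r2 overlap in time
```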

Consistency Semantics [Lamport86]

Three executions of a write(8) with concurrent and succeeding reads illustrate the semantics:
- Safety: a read concurrent with the write may return an arbitrary value; a read that succeeds the write returns 8
- Regularity: a read concurrent with the write returns either the new value 8 or the old value; a read that succeeds the write returns 8
- Atomicity: once some read returns 8, every subsequent read also returns 8

How to order read/write operations?

- Based on the value each operation writes/returns: values are not unique
- Using the time at which each operation is invoked: requires clock synchronization
- Associate a sequence number with each value written:
  - SWMR: timestamps
  - MWMR: tags = <timestamp, wid>
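The tag scheme above has a natural one-line realization (my sketch): a tag is a <timestamp, writer-id> pair compared lexicographically, so writes by different writers with the same timestamp are still totally ordered by the writer id.

```python
# A tag is (timestamp, wid); lexicographic comparison on the pair
# orders first by timestamp, then breaks ties by writer id.

def newer(tag_a, tag_b):
    return tag_a > tag_b   # Python compares tuples lexicographically

t1 = (5, "w1")
t2 = (5, "w2")   # same timestamp, higher writer id wins the tie
t3 = (6, "w1")   # higher timestamp dominates regardless of writer id

assert newer(t2, t1)
assert newer(t3, t2)
```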

Challenges

Communication rounds: how many round-trips between a client process and the servers does each operation need?

Multiple Round-Trips

Consider the following example [Attiya et al. 96] over servers S1..S5: a write W(1) reaches only some of the servers, and a read R1() observes the new value and returns it after a single round; a later read R2() misses the updated servers and returns the old value. Atomicity is violated.

Complexity Measures

- Communication: delays (round-trips)
- Computation
- Operation latency

What was known

Traditional SWMR [Attiya et al. 95]:
- Single round writes
- Two round reads
  - Phase 1: obtain the latest value
  - Phase 2: propagate the latest value
- Folklore belief: "Reads must Write"

Traditional MWMR:
- Two round writes
  - Phase 1: discover the latest value
  - Phase 2: order the new value after the latest and propagate
- Belief: "Writes must Read"
- Two round reads

The Era of Fast Implementations

SWMR Fast:
- Single round (fast) writes and reads
- Bounded readers: |R| < (|S|/f) - 2, where |S| is the number of servers and f the number of failures tolerated
- Impossible in the MWMR model

SWMR Semifast:
- Fast writes; only a single complete 2-round (slow) read per write
- Unbounded readers
- Impossible in the MWMR model

SWMR Weak-Semifast:
- General quorum system
- Fast writes and multiple slow reads per write
- Allows fast reads concurrent with writes
- Unknown if applicable in the MWMR model

Nicolas Nicolaou -- CS Colloquium @ UCY

Model

- Asynchronous, message-passing model
- Process sets: writers W, readers R, servers S (replica hosts)
- Reliable communication channels
- Well-formedness
- Environments: SWMR: |W| = 1, |R| >= 1; MWMR: |W| >= 1, |R| >= 1
- Failures: crash failures
- Correctness: atomicity (safety), termination (liveness)

Communication Round

A process p performs a communication round during an operation π if:
- p sends a message m to a set of servers for π
- Any server that receives m replies to p
- Once p receives responses from a single quorum, it completes π or proceeds to the next communication round

Definition: Quorum Systems

- Q_i, Q_j, Q_z are quorums over the set of servers; the quorum system is the set {Q_i, Q_j, Q_z}
- Property: every pair of quorums intersects
- N-wise quorum systems: every N quorums intersect, for N > 1
- Every read/write operation communicates with a single quorum
- Faulty quorum: contains a faulty process
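A quick way to see the N-wise property is to check it directly (illustrative sketch, names are mine): majorities of five servers form a quorum system that is 2-wise (every pair intersects) but not 3-wise.

```python
from itertools import combinations

def is_nwise(quorums, n):
    """True if every choice of n quorums has a non-empty common intersection."""
    return all(set.intersection(*combo) for combo in combinations(quorums, n))

# All 3-of-5 majorities over servers {0..4}.
majorities = [set(c) for c in combinations(range(5), 3)]

assert is_nwise(majorities, 2)      # any two majorities of 5 must intersect
assert not is_nwise(majorities, 3)  # e.g. {0,1,2}, {2,3,4}, {0,3,4} share nothing
```

This is why the intersection degree N is a real parameter of the system: it does not come for free from the pairwise property.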

Algorithm: Simple

Write protocol (two rounds):
- P1: query a single quorum for the latest tag
- P2: increment the timestamp in the max tag, and send <newtag, v> to a quorum

Read protocol (two rounds):
- P1: query a single quorum for the latest tag
- P2: propagate <maxtag, v> to a single quorum

Server protocol (passive role):
- Receive requests, update the local tag if msg.tag > server.tag, and reply with <server.tag, v>
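The Simple protocol above can be sketched in a few lines (a single-machine toy model of my own: "servers" are objects in memory, a "round" is a loop over one majority quorum, and message passing is elided).

```python
class Server:
    def __init__(self):
        self.tag, self.val = (0, 0), None   # tag = (timestamp, wid)
    def update(self, tag, val):
        if tag > self.tag:                  # keep only the newest <tag, value>
            self.tag, self.val = tag, val

def query(quorum):
    return max(s.tag for s in quorum)       # round P1: latest tag in the quorum

def write(quorum, wid, v):
    ts, _ = query(quorum)                   # P1: discover latest timestamp
    newtag = (ts + 1, wid)
    for s in quorum:                        # P2: propagate <newtag, v>
        s.update(newtag, v)
    return newtag

def read(quorum):
    maxtag = query(quorum)                  # P1: discover latest tag
    val = next(s.val for s in quorum if s.tag == maxtag)
    for s in quorum:                        # P2: propagate before returning
        s.update(maxtag, val)
    return val

servers = [Server() for _ in range(5)]
write(servers[:3], wid=1, v="x")            # writer talks to one majority
assert read(servers[2:]) == "x"             # a different majority still sees "x"
```

The final assertion is exactly the quorum-intersection argument: any two majorities share a server, so the read's P1 is guaranteed to observe the written tag.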

Example: Simple (write operations)

Assume w_i > w_k:
- Both writers first query (read) a quorum for the latest tag
- w_i then propagates write(<1, w_i>, v) to its quorum; w_k propagates write(<1, w_k>, v) to its quorum
- The writer id in the tag orders the two concurrent writes

Belief: Writes must Read in MW environments

Example: Simple (read operation)

Assume w_i > w_k. Reader r_i queries a quorum for the latest tag, propagates write(<1, w_i>, v) to a quorum, and then returns v.

Operation ordering: w_k -> w_i -> r_i

Why does a read perform 2 rounds?

Consider two executions, Ex(a) and Ex(b), with single-round reads: in each, reader r_i contacts a single quorum, receives replies, and returns v without a propagation round.

Why does a read perform 2 rounds? (cont.)

Extend execution Ex(b) with a read from r_j: although r_j's read succeeds r_i's read, r_j's quorum replies with an older tag, so r_j returns an older value than the one r_i returned. Atomicity is violated.

Folklore belief: Reads must Write in MR environments

Question

Can we allow reads and writes to be fast (single round) and still guarantee atomicity?

Answer: YES!!

New Technique - SSO [Englert et al. 09]

SSO: Server Side Ordering
- The tag is incremented by the servers and not by the writer
- Generated tags may differ across servers
- Clients decide operation ordering based on server responses

SSO algorithm:
- Enables fast writes and reads -- the first such algorithm
- Allows unbounded participation

Traditional Writer-Server Interaction

P1: writer w sends read(); each server s replies with its tag t_s
    w finds max(t_s) and sets t_w = inc(t_s)
P2: w sends write(t_w, v); each server s replies with max(t_w, t_s)
    w returns OK

SFW Writer-Server Interaction

P1: writer w sends write(t_w, v); each server s asks "is t_s valid for v?",
    sets t_s = inc(max(t_s, t_w)), and replies with (t_s, v)
    If the replies satisfy the write predicate, w returns (fast);
    otherwise w sets t_w = max(t_s)
P2: w sends write(t_w, v); each server s replies with max(t_w, t_s)
    w returns OK

Algorithm: SFW (at a glance)

Write protocol (one or two rounds):
- P1: collect candidate tags from a quorum
- Is some tag t propagated in a bigger than (n/2 - 1)-wise intersection? (predicate PW)
  - YES: assign t to the written value and return => FAST
  - NO: propagate the unique largest tag to a quorum => SLOW

Read protocol (one or two rounds):
- P1: collect the list of writes and their tags from a quorum
- Is the max write tag t in a bigger than (n/2 - 2)-wise intersection? (predicate PR)
  - YES: return the value written by that write => FAST
  - NO: is there a confirmed tag propagated in an (n - 1)-wise intersection? => FAST
  - NO: propagate the largest confirmed tag to a quorum => SLOW

Server protocol:
- Increment the tag on receiving a write request; send the latest writes to readers/writers

Predicates: Read and Write

(figure: formal statements of the PR and PW predicates)

Lower Bounds

Theorem: No execution of a safe register implementation that uses an N-wise quorum system contains more than N consecutive, quorum-shifting, fast writes.

Theorem: It is impossible to obtain a MWMR safe register implementation that exploits an N-wise quorum system if |W| + |R| > N.

Remarks

Remark: The SSO algorithm is near optimal, since it allows up to N/2 consecutive, quorum-shifting, fast writes.

The Weak Side of SFW

- The predicates are computationally hard: NP-complete
- Restriction on the quorum system: deploys n-wise quorum systems, and guarantees fastness iff n > 3

The Good News

Approximation algorithm (APRX-SFW):
- Polynomial log-approximation: log|S| times the optimal number of fast operations

Algorithm CWFR:
- Based on quorum views, SWMR prediction tools
- Fast operations in general quorum systems
- Trades the speed of write operations: two-round writes

NP-Completeness

K-SET-INTERSECTION (captures both PR and PW): Given a set of elements U, a subset M ⊆ U, a set of subsets Q = {Q_1, ..., Q_n} s.t. Q_i ⊆ U, and an integer k ≤ |Q|, a set I ⊆ Q is a k-intersecting set if:
- |I| = k,
- ∩_{Q∈I} Q ⊆ M, and
- ∩_{Q∈I} Q ≠ ∅.

Theorem: K-SET-INTERSECTION is NP-complete (reduction from 3-SAT).
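The decision problem is easy to state in code; a brute-force checker (my sketch) enumerates every k-subset of Q, which is exponential in |Q| and motivates the approximation that follows.

```python
from itertools import combinations

def has_k_intersecting_set(U, M, Q, k):
    """Brute-force K-SET-INTERSECTION: is there a k-subset of Q whose common
    intersection is non-empty and contained in M?"""
    for I in combinations(Q, k):
        common = set.intersection(*I)
        if common and common <= M:
            return True
    return False

U = {1, 2, 3, 4}
Q = [{1, 2}, {2, 3}, {2, 4}]
assert has_k_intersecting_set(U, {2}, Q, 3)      # all three sets meet only in {2}
assert not has_k_intersecting_set(U, {4}, Q, 2)  # every pair meets in {2}, not in {4}
```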

k-SET-INTERSECTION Approximation

Greedy algorithm; uses the Set Cover greedy approximation algorithm at its core.

K-SET-COVER: Given a universe U of elements, a collection of subsets of U, S = {S_1, ..., S_z}, and a number k, find at most k sets of S such that their union covers all elements in U.

k-SET-INTERSECTION Approximation

Given (U, M, Q, k) do:
- Step 1: for each m ∈ M, build T_m = {(U \ M) \ (Q_i \ M) : m ∈ Q_i}
- Step 2: run the k-SET-COVER greedy algorithm on T_m:
  - 2a: pick the R ∈ T_m with the maximum number of uncovered elements
  - 2b: take the union of every set picked in 2a
  - 2c: if the union is U \ M, go to step 3; else if fewer than k sets were picked, go to 2a; else repeat for another m ∈ M
- Step 3: for every set (U \ M) \ (Q_i \ M) in the set cover, add Q_i to the intersecting set
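The three steps above can be sketched directly (a hedged sketch of mine; function and variable names are not from the slides). Step 1 builds the cover instance T_m for each m ∈ M, step 2 applies the greedy set-cover rule, and step 3 maps the chosen cover sets back to quorums.

```python
def k_intersection(U, M, quorums, k):
    target = U - M                                   # elements to "cover"
    for m in M:
        # Step 1: for each quorum containing m, its cover set (U\M) \ (Q_i\M)
        T = {i: target - (q - M) for i, q in enumerate(quorums) if m in q}
        chosen, covered = [], set()
        while len(chosen) < k and covered != target:
            # Step 2a: greedily pick the set with the most uncovered elements
            i = max(T, key=lambda j: len(T[j] - covered), default=None)
            if i is None or not (T[i] - covered):
                break                                # no progress possible
            chosen.append(i)
            covered |= T[i]                          # Step 2b: running union
        if covered == target:                        # Step 2c / 3: success
            return [quorums[i] for i in chosen]
    return None

U = {1, 2, 3, 4, 5}
Q = [{1, 2, 3}, {3, 4, 5}, {1, 3, 5}]
M = {3}                                              # servers holding the latest value
sol = k_intersection(U, M, Q, k=3)
assert sol is not None
assert 3 in set.intersection(*sol) and set.intersection(*sol) <= M
```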

Algorithm Rationale

For m ∈ M and each Q_i with m ∈ Q_i, let R_{m,i} = (U \ M) \ (Q_i \ M) ∈ T_m. If we can find k sets such that

  R_{m,1} ∪ ... ∪ R_{m,k} = U \ M,

then, since each R_{m,i} excludes exactly the elements of Q_i outside M, De Morgan's law gives

  Q_1 ∩ ... ∩ Q_k ⊆ M,

and since m ∈ Q_1 ∩ ... ∩ Q_k, the intersection is also non-empty. Hence {Q_1, ..., Q_k} is a k-intersecting set.

Approximation Algorithm: APRX-SFW

Adopt the k-SET-INTERSECTION approximation:
- U: the set of servers in the quorum system
- M: the servers that replied with the latest value
- k: the number of quorums required by the predicates

The log-approximation may invalidate PR and PW by a factor of log|S|. What does this mean for SFW?
- Extra communication rounds (especially for writes)
- Slower acceptance of a new value
- Correctness is not affected

Unrestricting Quorums

APRX-SFW improves computation time, but still relies on n-wise quorum systems with n > 3 to allow fast operations. Can we allow fast operations in the MWMR model when deploying general quorum systems?

Answer: YES!!

Tool: Quorum Views

Used in the SWMR model [Georgiou et al. 08]. Idea: try to determine the state of the write operation based on the distribution of the latest value in the replied quorum.

Write state in the first round of a read operation:
- Determinable => read is fast
- Undeterminable => read is slow

Determinable Write - QView(1)

All members of the replied quorum contain maxtag => the write (potentially) completed.

Determinable Write - QView(2)

Every intersection contains a member with tag < maxtag => the write of <maxtag, v> is (definitely) incomplete.

Undeterminable Write - QView(3)

There is an intersection with all of its members having tag = maxtag. QView(3) can arise both with an incomplete and with a complete write, so the write state is undeterminable => second communication round.

What happens in MWMR?

In a MWMR environment, concurrent writes produce multiple concurrent values. For values <tag1, v1>, <tag2, v2>, <tag3, v3>, let tag1 < tag2 < tag3.

Idea: Uncover the Past

Discover the latest potentially completed write. For values <tag1, v1>, <tag2, v2>, <tag3, v3>:
- <tag3, v3> not completed (some servers may still contain <tag2, v2>)
- <tag2, v2> not completed (some servers may still contain <tag1, v1>)
- <tag1, v1> potentially completed

Algorithm: CWFR

Traditional write protocol (two rounds):
- P1: query a single quorum for the latest tag
- P2: increment the max tag, send <newtag, v> to a quorum

Read protocol (one or two rounds); iterate to discover the smallest completed write:
- P1: receive replies from a quorum Q
  - QView_Q(1) -- fast: return the maxtag of the current iteration
  - QView_Q(2) -- remove the servers with maxtag and re-evaluate
  - QView_Q(3) -- slow: propagate and return maxtag

Server protocol (passive role):
- Receive requests, update the local tag, and return <tag, v>
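The read-side iteration can be sketched as follows (a simplified model of my own: tags are plain integers, replies come from one quorum, and the QView(3) check only inspects intersections of the replied quorum with the other quorums).

```python
def cwfr_read_phase1(replies, quorum, quorums):
    """replies: server -> tag for each server in `quorum`; returns (tag, speed)."""
    remaining = set(quorum)
    while remaining:
        maxtag = max(replies[s] for s in remaining)
        winners = {s for s in remaining if replies[s] == maxtag}
        if winners == remaining:                  # QView(1): write potentially
            return maxtag, "fast"                 # complete -> fast read
        for q in quorums:                         # QView(3): a whole intersection
            inter = set(q) & set(quorum)          # replied with maxtag ->
            if set(q) != set(quorum) and inter and inter <= winners:
                return maxtag, "slow"             # undeterminable, 2nd round
        remaining -= winners                      # QView(2): maxtag write is
    return None, "slow"                           # incomplete -> discard, iterate

Qs = [{1, 2, 3}, {3, 4, 5}, {1, 3, 5}]
# Server 1 saw tag 2 from an incomplete write; servers 2 and 3 still hold tag 1.
tag, speed = cwfr_read_phase1({1: 2, 2: 1, 3: 1}, {1, 2, 3}, Qs)
assert (tag, speed) == (1, "fast")   # the read uncovers the past and stays fast
```

Note how the loop realizes "uncover the past": each QView(2) step peels off a provably incomplete tag and re-evaluates the views on the surviving servers.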

Read Iteration: Discard Incomplete Tags

For values <tag1, v1>, <tag2, v2>, <tag3, v3>:
- <tag3, v3> not completed: remove the servers that contain <tag3, v3>
- <tag2, v2> not completed: remove the servers that contain <tag2, v2>
- <tag1, v1> potentially completed in Q_i -- QView(1): all remaining servers contain <tag1, v1>

Read Iteration: Discard Incomplete Tags (cont.)

For values <tag1, v1>, <tag2, v2>, <tag3, v3>:
- <tag3, v3> not completed: remove the servers that contain <tag3, v3>
- <tag2, v2> potentially completed in Q_j -- QView(3): an intersection of the remaining servers contains <tag2, v2>
- P2: propagate <tag3, v3> to a complete quorum (helping <tag3, v3> to complete)

APRX-SFW vs CWFR: NS2 Simulation

(figures: read latency vs. number of readers, and percentage of slow (two-round) reads vs. number of readers, comparing SIMPLE, CWFR, and APRX-SFW over a 4-wise, majority-based quorum system with f = 1)

APRX-SFW vs CWFR: PlanetLab

(figures: read latency vs. number of readers, and percentage of slow (two-round) reads vs. number of readers on PlanetLab, comparing SIMPLE, CWFR, and APRX-SFW over a 4-wise, majority-based quorum system with f = 1)

Conclusions

Presented two atomic register MWMR implementations that trade computation against communication:
- Algorithm APRX-SFW: polynomial approximation of the SFW predicates (log|S|-approximation); requires n-wise quorum systems with n > 3
- Algorithm CWFR: general quorum systems; trades the speed of write operations

Experiments on NS2 and PlanetLab:
- Both algorithms outperform the classic approach
- Bigger intersections favor APRX-SFW
