Do we have a quorum?


Quorum Systems

Given a set U of servers, |U| = n: a quorum system is a set Q ⊆ 2^U such that ∀ Q1, Q2 ∈ Q : Q1 ∩ Q2 ≠ ∅. Each Q ∈ Q is a quorum.

How quorum systems work: a read/write shared register x

Each server stores a (v, ts) pair.

Write(x, d):
- ask the servers in some quorum Q for their ts
- set ts_c > max({ts received} ∪ any previous ts_c)
- update some quorum Q with (d, ts_c)

Read(x):
- ask the servers in some quorum Q for (v, ts)
- select the most recent (v, ts)
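The write/read steps above can be sketched as a minimal executable example. This is an illustration, not the slides' protocol: the names (`Server`, `QuorumRegister`) are mine, it assumes a single well-behaved client and no failures, and simple majorities stand in for the abstract quorum system.

```python
# Sketch of a read/write register over majority quorums.
# Assumes one client and fail-free servers; names are illustrative.
import random

class Server:
    def __init__(self):
        self.v, self.ts = None, 0          # stored (v, ts) pair

class QuorumRegister:
    def __init__(self, n):
        self.servers = [Server() for _ in range(n)]
        self.q = n // 2 + 1                # any two majorities intersect
        self.ts_c = 0                      # client's last timestamp

    def quorum(self):
        return random.sample(self.servers, self.q)

    def write(self, d):
        ts_seen = max(s.ts for s in self.quorum())   # ask some Q for ts
        self.ts_c = max(ts_seen, self.ts_c) + 1      # ts_c > all seen
        for s in self.quorum():                      # update some Q
            if self.ts_c > s.ts:
                s.v, s.ts = d, self.ts_c

    def read(self):
        pairs = [(s.ts, s.v) for s in self.quorum()] # ask some Q
        return max(pairs)[1]                         # most recent (v, ts)
```

Because every read quorum intersects the last write quorum, the read always sees the highest timestamp despite contacting only a majority.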

What semantics?

Safe: a read not concurrent with any write returns the most recently written value.

Regular: safe, plus a read that overlaps a write returns either the old or the new value.

Atomic: reads and writes are totally ordered, so that the values returned by reads are the same as if the operations had been performed with no overlapping.

[Timeline figures: writes w1(5) and w2(6) overlapping reads r1, r2, r3, showing which return values (5 or 6) each semantics permits for the overlapping reads.]

System Model

- Universe U of servers, |U| = n.
- Byzantine faulty servers are modeled by a non-empty fail-prone system B ⊆ 2^U: no B ∈ B is contained in another, and some B ∈ B contains all the faulty servers.
- Clients are correct (this assumption can be weakened).
- Point-to-point authenticated, reliable channels: a correct process q receives a message from another correct process p if and only if p sent it.

Masking Quorum Systems [Malkhi and Reiter, 1998]

A quorum system Q is a masking quorum system for a fail-prone system B if the following properties hold:

M-Consistency: ∀ Q1, Q2 ∈ Q ∀ B1, B2 ∈ B : (Q1 ∩ Q2) \ B1 ⊈ B2
M-Availability: ∀ B ∈ B ∃ Q ∈ Q : B ∩ Q = ∅

Dissemination Quorum Systems

A masking quorum system for self-verifying data (the client can detect any modification by a faulty server):

D-Consistency: ∀ Q1, Q2 ∈ Q ∀ B ∈ B : (Q1 ∩ Q2) ⊈ B
D-Availability: ∀ B ∈ B ∃ Q ∈ Q : B ∩ Q = ∅
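The masking predicates can be checked exhaustively on a small instance. The sketch below (the n = 5, f = 1 instance and all names are my own choices, not from the slides) verifies M-Consistency and M-Availability for an f-threshold construction.

```python
# Exhaustive check of M-Consistency / M-Availability on a small
# f-threshold example (n = 5, f = 1); sets model quorums directly.
from itertools import combinations

n, f = 5, 1
U = set(range(n))
quorums = [set(c) for c in combinations(U, 4)]      # |Q| = ceil((n+2f+1)/2)
fail_prone = [set(c) for c in combinations(U, f)]   # all f-sized fault sets

# M-Consistency: (Q1 ∩ Q2) \ B1 is never contained in B2
m_consistent = all(not (((q1 & q2) - b1) <= b2)
                   for q1 in quorums for q2 in quorums
                   for b1 in fail_prone for b2 in fail_prone)

# M-Availability: every fault set leaves some quorum untouched
m_available = all(any(b.isdisjoint(q) for q in quorums)
                  for b in fail_prone)
```

With n = 5 and f = 1, any two quorums of size 4 intersect in at least 3 servers; removing one fault set leaves at least 2, which no single-server fault set can cover.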

f-threshold Masking Quorum Systems

B = {B ⊆ U : |B| = f}

The consistency properties become intersection bounds:
M-Consistency: ∀ Q1, Q2 ∈ Q : |Q1 ∩ Q2| ≥ 2f + 1
D-Consistency: ∀ Q1, Q2 ∈ Q : |Q1 ∩ Q2| ≥ f + 1
M-Availability, D-Availability: |Q| ≤ n − f

Masking: Q = {Q ⊆ U : |Q| = ⌈(n + 2f + 1)/2⌉}, which requires n ≥ 4f + 1
Dissemination: Q = {Q ⊆ U : |Q| = ⌈(n + f + 1)/2⌉}, which requires n ≥ 3f + 1
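The quorum sizes above can be computed directly; this sketch (function names are mine) also checks the 2f + 1 intersection bound for a sample masking system, using the fact that two quorums of size q over n servers share at least 2q − n of them.

```python
# Quorum sizes for f-threshold systems, per the formulas above.
from math import ceil

def masking_quorum_size(n, f):
    assert n >= 4*f + 1, "masking quorums need n >= 4f + 1"
    return ceil((n + 2*f + 1) / 2)

def dissemination_quorum_size(n, f):
    assert n >= 3*f + 1, "dissemination quorums need n >= 3f + 1"
    return ceil((n + f + 1) / 2)

# Any two masking quorums intersect in at least 2q - n >= 2f + 1 servers:
n, f = 9, 2
q = masking_quorum_size(n, f)
assert 2*q - n >= 2*f + 1
```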

A safe read/write protocol

Client c executes:

Write(d):
- ask all servers for their current timestamp t
- wait for answers from |Q| different servers
- set ts_c > max({t} ∪ any previous ts_c)
- send (d, ts_c) to all servers
- wait for |Q| acknowledgments

Read():
- ask all servers for their latest value/timestamp pair
- wait for answers from |Q| different servers
- select the most recent (v, ts) for which at least f + 1 answers agree (if any)

Reconfigurable quorums

Design a Byzantine data service that:
- monitors its environment
- uses statistical techniques to estimate the number of faulty servers
- adjusts its tolerance accordingly: the fault-tolerance threshold changes within the range [f_min, f_max]

The service is very efficient when there are no or few failures, can cope with new faults as they occur, does not require read/write operations to block, and provides strong semantic guarantees.

Managing the threshold

Keep the threshold value in a variable T. Refine the assumption on failures: for any operation o, the number of failures never exceeds f, the minimum of (a) the value of T before o and (b) any value written to T concurrently with o.

Which threshold value should we use to read T? Update T by writing to an announce set: a set of servers whose intersection with every quorum (as defined by any f in [f_min, f_max]) contains sufficiently many correct servers to allow a client to determine T's value unambiguously.

The announce set

The announce set must intersect every quorum in at least 2f_max + 1 servers. A conservative announce set size is n − f_max. Hence:

⌈(n + 2f_min + 1)/2⌉ + (n − f_max) − n ≥ 2f_max + 1, i.e. n ≥ 6f_max − 2f_min + 1
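The bound can be sanity-checked numerically. In this sketch (function name assumed), the feasibility test restates the intersection requirement for the smallest quorum in use:

```python
# Check of the announce-set bound n >= 6*f_max - 2*f_min + 1.
from math import ceil

def announce_feasible(n, f_min, f_max):
    q_min = ceil((n + 2*f_min + 1) / 2)   # smallest quorum in use
    a = n - f_max                         # conservative announce set size
    # guaranteed intersection of the quorum with the announce set
    return q_min + a - n >= 2*f_max + 1

assert announce_feasible(17, 1, 3)        # n = 6*3 - 2*1 + 1 = 17 suffices
```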

Updating T

Client c (with current threshold f) executes:

Write(d):
- ask all servers for their current timestamp t
- wait for answers from an announce set
- set ts_c > max({t} ∪ any previous ts_c)
- send (d, ts_c) to all servers
- wait for acknowledgments from an announce set

Read():
- ask all servers for their latest value/timestamp pair
- wait for answers from |Q_min| different servers
- select the most recent (v, ts) for which at least f_max + 1 answers agree (if any)

A problem

Take f_min = 1, f_max = 3, n = 17, so |Q_min| = 10 and the announce set has 14 servers. Initially T = 1; a threshold write then sets T = 2. Now, while a client is performing a threshold write to set T = 3, another client tries to read T...

Countermanding

(v, ts) is countermanded if at least f_max + 1 servers return a timestamp greater than ts.

Write(f):
- ask all servers for their current timestamp t
- wait for answers from an announce set
- set ts_c > max({t} ∪ any previous ts_c)
- send (f, ts_c) to all servers
- wait for acknowledgments from an announce set

Read():
- ask all servers for their latest value/timestamp pair
- wait for answers from |Q_min| different servers
- select the most recent non-countermanded (v, ts) for which at least f_max + 1 answers agree (if any)
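The countermanding test in the read can be sketched as follows. The representation of replies as (server, value, ts) triples is an assumption of this sketch, not the slides' data structure:

```python
# Sketch of the countermanded read rule over a list of replies,
# each a (server, value, ts) triple; representation is assumed.
from collections import Counter

def countermanded(ts, replies, f_max):
    # (v, ts) is countermanded if >= f_max + 1 servers report a larger ts
    return sum(1 for (_, _, t) in replies if t > ts) >= f_max + 1

def select_value(replies, f_max):
    # most recent (v, ts) with >= f_max + 1 agreeing, non-countermanded answers
    counts = Counter((v, t) for (_, v, t) in replies)
    ok = [(t, v) for (v, t), c in counts.items()
          if c >= f_max + 1 and not countermanded(t, replies, f_max)]
    return max(ok)[1] if ok else None
```

A stale value that enough servers still report is rejected once f_max + 1 servers have moved past its timestamp, which resolves the ambiguity in the scenario above.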

Minimizing quorum size

Who cares? Machines are cheap. But achieving independent failures is expensive:
- independently failing hardware
- independently failing software
- independent implementations of the server
- independent implementations of the underlying OS
- independent versions to maintain

A simple observation

Client c (with current threshold f) executes:

Write(d):
- ask all servers for their current timestamp t
- wait for answers from |Q| different servers
- set ts_c > max({t} ∪ any previous ts_c)
- send (d, ts_c) to all servers
- wait for |Q| acknowledgments

Read():
- ask all servers for their latest value/timestamp pair
- wait for answers from |Q| different servers
- select the most recent (v, ts) for which at least f + 1 answers agree (if any)

The channels are (asynchronous) authenticated, reliable channels: a correct process q receives a message from another correct process p if and only if p sent it.

A-Masking Quorum Systems

A quorum system Q is an a-masking quorum system for a fail-prone system B if the following properties hold for the read quorums Q_r and write quorums Q_w:

AM-Consistency: ∀ Qr ∈ Q_r ∀ Qw ∈ Q_w ∀ B1, B2 ∈ B : (Qr ∩ Qw) \ B1 ⊈ B2
AM-Availability: ∀ B ∈ B ∃ Qr ∈ Q_r : B ∩ Qr = ∅

Tradeoffs

Best known n:

                    confirmable   non-confirmable
  self-verifying    3f + 1        2f + 1
  generic           4f + 1        3f + 1

Lower bound (never two rows again!):

                               confirmable   non-confirmable
  self-verifying and generic   3f + 1        2f + 1

The intuition

Trade replication in space for replication in time. Traditional protocols use 4f + 1 servers; now 3f + 1 suffice. In both cases, the reader waits until a 4th server receives the write.

[Figures: a value V replicated across servers, with one faulty server (X) and stragglers (?); with 4f + 1 servers the read quorum alone suffices, while with 3f + 1 the reader waits until enough servers have received the write.]

The protocol

|Q_w| = ⌈(n + f + 1)/2⌉, |Q_r| = ⌈(n + 3f + 1)/2⌉

Client c executes:

Write(d):
- ask all servers for their current timestamp t
- wait for answers from |Q_w| different servers
- set ts_c > max({t} ∪ any previous ts_c)
- send (d, ts_c) to all servers
- wait for |Q_w| acknowledgments

Read():
- send read-start to a set Q_r of servers
- repeat: receive a reply (D, ts) from some server s in Q_r; set answer[s, ts] := D
- until some A in answer[·][·] is vouched for by |Q_w| servers
- send read-stop to Q_r and return A
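The "vouched for by |Q_w|" condition in the read can be sketched as below. The flat list of (server, value, ts) answers and the function names are assumptions of this sketch:

```python
# Sketch of the read's vouching check; answers is a list of
# (server, value, ts) triples (assumed representation).
from math import ceil
from collections import Counter

def quorum_sizes(n, f):
    # |Qw| and |Qr| per the formulas above
    return ceil((n + f + 1) / 2), ceil((n + 3*f + 1) / 2)

def vouched(answers, qw):
    # a value is vouched for when qw distinct servers report the
    # same (value, ts); return the most recent vouched value, if any
    counts = Counter((v, t) for (_, v, t) in answers)
    winners = [(t, v) for (v, t), c in counts.items() if c >= qw]
    return max(winners)[1] if winners else None
```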

The Slim-Fast version

1. Whenever c gets the first message from a server, it computes T = {the largest f + 1 timestamps from distinct servers}.
2. (D, ts) from answer[s][·] is discarded unless either (a) ts ∈ T or (b) ts is the latest timestamp received from s.

The Goodies

Theorem: the protocol guarantees atomic semantics.

Proof: Safety

Lemma 1: if it is live, it is atomic.

(a) After a write of ts1, no read returns an earlier ts. Suppose the write for ts1 has completed: ⌈(n + f + 1)/2⌉ servers acked the write, of which at least ⌈(n − f + 1)/2⌉ are correct. The remaining ⌊(n + f − 1)/2⌋ servers are fewer than |Q_w|, so no earlier timestamp can be vouched for.

(b) After c reads ts1, no later read returns an earlier ts. If c reads ts1, then ⌈(n + f + 1)/2⌉ servers report ts1, of which at least ⌈(n − f + 1)/2⌉ are correct; again the remaining ⌊(n + f − 1)/2⌋ servers are fewer than |Q_w|. Hence any read that starts after ts1 is read returns ts ≥ ts1.

Proof: Liveness

Lemma 2: every operation eventually terminates.

Write: trivial, because it only waits for |Q_w| ≤ n − f servers.

Read: consider T after c gets the first message from the last server, and let t_max be the largest timestamp from a correct server in T. A client never removes t_max from its answer[s][·] for a correct s. Eventually, all correct servers see a write with ts ≥ t_max and echo it to the client. Since |Q_r| = ⌈(n + 3f + 1)/2⌉, we have |Q_w| ≤ |Q_r| − f, and the read terminates.
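The counting in both lemmas can be checked mechanically. This sketch (variable names are mine) evaluates the inequalities with explicit ceilings at the minimal system size n = 3f + 1 for several f:

```python
# Arithmetic checks for Lemmas 1 and 2 at n = 3f + 1.
from math import ceil

for f in range(1, 6):
    n = 3*f + 1
    Qw = ceil((n + f + 1) / 2)       # acks a write waits for
    Qr = ceil((n + 3*f + 1) / 2)     # servers a read contacts
    correct = Qw - f                 # correct servers that hold ts1
    assert n - correct < Qw          # Lemma 1: stale values can't gather |Qw|
    assert Qw <= Qr - f              # Lemma 2: the read eventually terminates
ok = True
```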