Differential Privacy and Verification. Marco Gaboardi University at Buffalo, SUNY

Given a program P, is it differentially private?

[Diagram: a program P is fed to a verification tool, which answers yes or no.]

[Diagram: a program P together with a proof is fed to a verification tool, which answers yes or no.]

Given a differentially private program P, does it maintain its accuracy promises?

Given a differentially private program P that maintains its accuracy promises, can we guarantee that it is also efficient?

An algorithm

Algorithm 2 DualQuery
Input: database D ∈ R^|X| (normalized) and linear queries q1, ..., qk ∈ {0,1}^|X|.
Initialize: let Q = ⋃_{j=1..k} {qj, ¬qj}, let Q^1 be the uniform distribution on Q, and set T = 16 log|Q| / α², η = α/4, s = 48 log(2|X|T) / α².
For t = 1, ..., T:
  Sample s queries {qi} from Q according to Q^t.
  Let q̄ := (1/s) Σi qi.
  Find x^t with ⟨q̄, x^t⟩ ≥ max_x ⟨q̄, x⟩ − α/4.
  Update: for each q ∈ Q, Q^{t+1}_q := exp(−η ⟨q, x^t − D⟩) · Q^t_q; normalize Q^{t+1}.
Output the synthetic database D̂ := ⋃_{t=1..T} {x^t}.
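The loop above can be sketched in Python. This is an illustrative toy, not the dualquery implementation linked below: the function name `dual_query`, the tiny parameter values, and the brute-force best-response step are my own simplifications, and since the weights here are updated with exact query answers, this sketch by itself is not differentially private.

```python
import math
import random

def dual_query(hist, queries, T, s, eta, rng):
    """Toy DualQuery loop. hist: normalized histogram over the domain X;
    queries: linear queries given as 0/1 vectors over X. Returns T records."""
    n = len(hist)
    # Q holds each query together with its negation.
    Q = [list(q) for q in queries] + [[1 - v for v in q] for q in queries]
    weights = [1.0] * len(Q)          # Q^1: uniform distribution on Q
    true_ans = [sum(qv * hv for qv, hv in zip(q, hist)) for q in Q]
    synthetic = []
    for _ in range(T):
        # sample s queries according to the current distribution Q^t
        sampled = rng.choices(range(len(Q)), weights=weights, k=s)
        avg = [sum(Q[j][i] for j in sampled) / s for i in range(n)]
        # best response: here simply the single record maximizing the averaged query
        x = max(range(n), key=lambda i: avg[i])
        # multiplicative-weights update: Q^{t+1}_q ∝ exp(-η⟨q, x^t - D⟩)·Q^t_q
        for j, q in enumerate(Q):
            weights[j] *= math.exp(-eta * (q[x] - true_ans[j]))
        synthetic.append(x)
    return synthetic
```

The synthetic database is just the multiset of best responses collected over the T rounds.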

A program https://github.com/ejgallego/dualquery/

Some issues Are the algorithms bug-free? Do the implementations respect their specifications? Is the system architecture bug-free? Is the code efficient? Do the optimizations preserve privacy and accuracy? Is the actual machine code correct? Is the full stack attack-resistant?

Outline A few more words on program verification, challenges in the verification of differential privacy, a few verification methods developed so far, looking forward.

Algorithm 1 An instantiation of the SVT proposed in this paper.
Input: D, Q, Δ, T = T1, T2, ..., c.
1: ε1 = ε/2, ρ = Lap(Δ/ε1)
2: ε2 = ε − ε1, count = 0
3: for each query qi ∈ Q do
4:   νi = Lap(2cΔ/ε2)
5:   if qi(D) + νi ≥ Ti + ρ then
6:     Output ai = ⊤
7:     count = count + 1, Abort if count ≥ c.
8:   else
9:     Output ai = ⊥

Algorithm 2 SVT in Dwork and Roth 2014 [8].
Input: D, Q, Δ, T, c.
1: ε1 = ε/2, ρ = Lap(cΔ/ε1)
2: ε2 = ε − ε1, count = 0
3: for each query qi ∈ Q do
4:   νi = Lap(2cΔ/ε1)
5:   if qi(D) + νi ≥ T + ρ then
6:     Output ai = ⊤, ρ = Lap(cΔ/ε2)
7:     count = count + 1, Abort if count ≥ c.
8:   else
9:     Output ai = ⊥

Algorithm 3 SVT in Roth's 2011 Lecture Notes [15].
Input: D, Q, Δ, T, c.
1: ε1 = ε/2, ρ = Lap(Δ/ε1)
2: ε2 = ε − ε1, count = 0
3: for each query qi ∈ Q do
4:   νi = Lap(cΔ/ε2)
5:   if qi(D) + νi ≥ T + ρ then
6:     Output ai = qi(D) + νi
7:     count = count + 1, Abort if count ≥ c.
8:   else
9:     Output ai = ⊥

Algorithm 4 SVT in Lee and Clifton 2014 [13].
Input: D, Q, Δ, T, c.
1: ε1 = ε/4, ρ = Lap(Δ/ε1)
2: ε2 = ε − ε1, count = 0
3: for each query qi ∈ Q do
4:   νi = Lap(Δ/ε2)
5:   if qi(D) + νi ≥ T + ρ then
6:     Output ai = ⊤
7:     count = count + 1, Abort if count ≥ c.
8:   else
9:     Output ai = ⊥

Algorithm 5 SVT in Stoddard et al. 2014 [18].
Input: D, Q, Δ, T.
1: ε1 = ε/2, ρ = Lap(Δ/ε1)
2: ε2 = ε − ε1
3: for each query qi ∈ Q do
4:   νi = 0
5:   if qi(D) + νi ≥ T + ρ then
6:     Output ai = ⊤
7:   else
8:     Output ai = ⊥

Algorithm 6 SVT in Chen et al. 2015 [1].
Input: D, Q, Δ, T = T1, T2, ....
1: ε1 = ε/2, ρ = Lap(Δ/ε1)
2: ε2 = ε − ε1
3: for each query qi ∈ Q do
4:   νi = Lap(Δ/ε2)
5:   if qi(D) + νi ≥ Ti + ρ then
6:     Output ai = ⊤
7:   else
8:     Output ai = ⊥

Comparison of the variants:

                                             Alg.1    Alg.2    Alg.3    Alg.4           Alg.5   Alg.6
ε1                                           ε/2      ε/2      ε/2      ε/4             ε/2     ε/2
Scale of threshold noise ρ                   Δ/ε1     cΔ/ε1    Δ/ε1     Δ/ε1            Δ/ε1    Δ/ε1
Resets ρ after each ⊤ output (unnecessary)   -        Yes      -        -               -       -
Scale of query noise νi                      2cΔ/ε2   2cΔ/ε1   cΔ/ε2    Δ/ε2            0       Δ/ε2
Outputs qi(D)+νi instead of ⊤ (not private)  -        -        Yes      -               -       -
Outputs unboundedly many ⊤s (not private)    -        -        -        -               Yes     Yes
Privacy property                             ε-DP     ε-DP     ∞-DP     ((1+6c)/4)ε-DP  ∞-DP    ∞-DP
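A minimal Python sketch of the first, ε-DP variant (Algorithm 1). The names `laplace` and `svt` are mine, and the inverse-CDF Laplace sampler is included only to keep the sketch self-contained.

```python
import math
import random

def laplace(scale, rng):
    """Inverse-CDF sampler for the Laplace distribution (for self-containment)."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def svt(db, queries, thresholds, epsilon, c, delta=1.0, rng=random):
    """Sketch of the ε-DP SVT variant (Algorithm 1): a fixed noisy threshold,
    per-query noise Lap(2cΔ/ε2), and an abort after c above-threshold answers."""
    eps1 = epsilon / 2.0
    eps2 = epsilon - eps1
    rho = laplace(delta / eps1, rng)          # noise on the threshold
    count = 0
    answers = []
    for q, T in zip(queries, thresholds):
        nu = laplace(2 * c * delta / eps2, rng)
        if q(db) + nu >= T + rho:
            answers.append(True)              # output ⊤
            count += 1
            if count >= c:                    # the ⊤ budget is exhausted
                break
        else:
            answers.append(False)             # output ⊥
    return answers
```

By construction the loop never emits more than c ⊤ answers, which is exactly the point on which several of the published variants above go wrong.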

Some successful stories - I CompCert - a fully verified C compiler, seL4, CertiKOS - formal verification of OS kernels, a formal proof of the Odd Order theorem, a formal proof of the Kepler conjecture (led by T. Hales). Years of work from very specialized researchers!

Some successful stories - II Automated verification for integrated circuit design, automated verification for floating-point computations, automated verification of Boeing flight control (Astrée), automated verification of Facebook code (Infer). Here the years of work go into the design of the techniques!

Verification trade-offs: required expertise vs. granularity of the analysis vs. expressivity.

What program verification isn't: algorithm design, trial and error, program testing, system engineering, a certification process.

What program verification can help with: designing languages for non-experts, guaranteeing the correctness of algorithms, guaranteeing the correctness of code, designing automated techniques for guaranteeing differential privacy, helping provide tools for a certification process.

The challenges of differential privacy Given ε, δ ≥ 0, a mechanism M : db → O is (ε,δ)-differentially private iff for all b1, b2 : db at distance one and for every S ⊆ O: Pr[M(b1) ∈ S] ≤ exp(ε) · Pr[M(b2) ∈ S] + δ. Verifying it requires relational reasoning, probabilistic reasoning, and quantitative reasoning.
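As a quick numeric sanity check of the (ε, 0) case (not a proof): for the Laplace mechanism on a sensitivity-1 counting query, the pointwise density ratio between adjacent databases stays below e^ε, which implies the definition above with δ = 0. The grid and the example counts are arbitrary choices of mine.

```python
import math

def laplace_pdf(x, mu, b):
    # density of the Laplace distribution with mean mu and scale b
    return math.exp(-abs(x - mu) / b) / (2.0 * b)

eps = 0.5
scale = 1.0 / eps        # Laplace mechanism: scale 1/ε for sensitivity 1
c1, c2 = 10, 11          # adjacent databases: the counts differ by one record
worst = max(
    laplace_pdf(r, c1, scale) / laplace_pdf(r, c2, scale)
    for r in [x / 10.0 for x in range(-200, 400)]
)
# pointwise bound Pr[M(b1) = r] ≤ e^ε · Pr[M(b2) = r]
assert worst <= math.exp(eps) + 1e-9
```

The relational flavor of the definition is visible even here: the check runs the mechanism's density on two related inputs and compares them.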

A 10,000 ft view of program verification: programs, plus expert-provided annotations, are fed to verification tools, which rely on (semi-)decision procedures (SMT solvers, interactive theorem provers).

Work-flow: the mechanism is described as a program; an expert manually adds assertions, producing a program annotated with assertions; a proof checker generates the whole set of VCs (verification conditions); an automatic solver checks the VCs; the VCs not solvable automatically are checked with an interactive solver; the result is a verified mechanism.

Semi-decision procedures require a good decomposition of the problem; they handle logical formulas, numerical formulas, and their combination well; they have limited support for probabilistic reasoning (usually through decision procedures for counting).

Compositional reasoning about the privacy budget Sequential composition: let Mi be εi-differentially private (1 ≤ i ≤ k); then M(x) = (M1(x), ..., Mk(x)) is (Σ_{i=1}^k εi)-differentially private. We can reason about DP programs by monitoring the privacy budget; if we have basic components for privacy we can just focus on counting; it requires only limited reasoning about probabilities; this way of reasoning adapts to other compositions.
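Monitoring the budget really is just counting, as a toy accountant shows. The class name and its API below are hypothetical, purely for illustration.

```python
class BudgetMonitor:
    """Toy accountant for sequential composition: running mechanisms with
    budgets ε_1, ..., ε_k costs Σ ε_i against the global budget."""

    def __init__(self, total):
        self.total = total
        self.spent = 0.0

    def charge(self, eps):
        if self.spent + eps > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += eps
        return eps

monitor = BudgetMonitor(1.0)
monitor.charge(0.4)   # first ε-DP query
monitor.charge(0.4)   # second one: 0.8 spent so far
# a third 0.4 charge would exceed the total of 1.0 and be refused
```

Note that no probability is manipulated anywhere: the composition theorem reduces the reasoning to arithmetic over the ε values.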

Iterated CDF CDF(X) = number of records with value ≤ X. [Example: a table of records — (Joe, 29, 19144, diabetes), (Bob, 48, 19146, tumor), (Jim, 25, 34505, flu), (Alice Smey, 62, 19144, diabetes), (Bill, 39, 16544, anemia), (Sam, 61, 19144, diabetes) — yields the list of buckets CDF(30)=2, CDF(40)=3, CDF(50)=4, ..., CDF(Xn).]
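The bucket counts in the example can be recomputed directly from the ages in the table:

```python
# Ages from the example table; CDF(X) counts records with value ≤ X.
ages = [29, 48, 25, 62, 39, 61]
buckets = [30, 40, 50]
cdf = {x: sum(1 for a in ages if a <= x) for x in buckets}
# cdf == {30: 2, 40: 3, 50: 4}, matching the slide
```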

PINQ-like languages - McSherry

it-cdf (raw : data) (budget : R) (buckets : list) (ε : R) : list {
  var agent = new PINQAgentBudget(budget);
  var db = new PINQueryable<data>(raw, agent);
  foreach (var b in buckets)
    yield return db.Where(y => y.val <= b).NoisyCount(ε);
}

The agent is responsible for the budget; the raw data are accessed only through a PINQueryable; we have transformations (which contribute a scaling factor) and aggregate operations (which consume the actual budget).

Enough budget? We can check the requested local budget against the global one.
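A toy Python rendering of the same pattern. `PINQAgentBudget` and `PINQueryable` here are simplified stand-ins for the PINQ classes, not their real API: the Where-style filter is a stability-1 transformation that costs nothing, while the noisy count is an aggregation that charges ε to the agent, which enforces the local-vs-global budget check.

```python
import math
import random

class PINQAgentBudget:
    """Toy budget agent: refuses any charge exceeding the remaining budget."""
    def __init__(self, budget):
        self.remaining = budget

    def apply(self, eps):
        if eps > self.remaining:
            raise RuntimeError("insufficient privacy budget")
        self.remaining -= eps

class PINQueryable:
    """Toy queryable: `where` is a stability-1 transformation (free),
    `noisy_count` is an aggregation that charges ε to the agent."""
    def __init__(self, data, agent, rng=random):
        self.data, self.agent, self.rng = data, agent, rng

    def where(self, pred):
        return PINQueryable([x for x in self.data if pred(x)], self.agent, self.rng)

    def noisy_count(self, eps):
        self.agent.apply(eps)   # the budget is consumed here
        u = self.rng.random() - 0.5
        noise = -(1.0 / eps) * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
        return len(self.data) + noise

# it-cdf in this toy API: one noisy count per bucket, 0.25 budget each
agent = PINQAgentBudget(1.0)
db = PINQueryable([29, 48, 25, 62, 39, 61], agent)
result = [db.where(lambda y, b=b: y <= b).noisy_count(0.25) for b in [30, 40, 50]]
```

After the three counts, 0.25 of the 1.0 budget remains; a fourth query asking for more than that would be refused by the agent.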

Compositional reasoning about sensitivity GS(f) = max_{v ~ v'} |f(v) − f(v')|. It allows us to decompose the analysis/construction of a DP program; it is a metric property of the function (DMNS06); it requires only limited reasoning about probabilities; it involves the same worst-case reasoning as basic composition.
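On tiny databases, GS can even be checked by brute force. This is purely illustrative (real sensitivity analysis is symbolic); the neighbor relation assumed here is "one record added", and the helper name is mine.

```python
from itertools import product

def global_sensitivity(f, domain=(0, 1, 2), max_size=2):
    """Brute-force GS(f) = max |f(v) - f(v')| over databases at distance one,
    where a neighbor of v adds one record drawn from the domain.
    Exhaustive only for tiny databases; purely illustrative."""
    gs = 0.0
    for n in range(max_size + 1):
        for db in product(domain, repeat=n):
            for x in domain:
                gs = max(gs, abs(f(list(db) + [x]) - f(list(db))))
    return gs

assert global_sensitivity(len) == 1   # counting query: GS = 1
assert global_sensitivity(sum) == 2   # sum over values in {0, 1, 2}: GS = 2
```

The two checks mirror the compositional story: counting has sensitivity 1 regardless of the domain, while a sum inherits the largest record value.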

Fuzz-like languages - Penn

it-cdf (b : data) (buckets : list) : list {
  case buckets of
    nil => nil
  | x::xs => size (filter (fun y => y <= x) b) :: it-cdf b xs
}

How sensitive is it-cdf in b? Let's assume
- size : [1]data --o R
- filter : [∞]prop --o [1]data --o data

In the nil branch b is used with sensitivity 0 (b :[0] data); in the cons branch it is used once more than in the recursive call (b :[n+1] data). So with n buckets the function is n-sensitive:

it-cdf (b :[n] data) (buckets : list[n]) : list { ... }

Adding noise! Releasing each count through the Laplace mechanism, the type records the total privacy cost ε·n:

it-cdf (b :[ε*n] data) (buckets : list[n]) (ε : num) : nlist {
  case buckets of
    nil => nil
  | x::xs => Lap ε (size (filter (fun y => y <= x) b)) :: it-cdf b xs ε
}
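A Python transliteration of the final program (the inline Laplace sampler and the function name are mine): each bucket count pays ε, so releasing n buckets costs ε·n in total, matching the type annotation b :[ε*n] data.

```python
import math
import random

def it_cdf(b, buckets, eps, rng=random):
    """Noisy iterated CDF: one Laplace-perturbed count per bucket.
    Each count consumes ε, so n buckets consume ε·n in total."""
    if not buckets:
        return []
    x, xs = buckets[0], buckets[1:]
    count = sum(1 for y in b if y <= x)
    u = rng.random() - 0.5
    noise = -(1.0 / eps) * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return [count + noise] + it_cdf(b, xs, eps, rng)
```

What the Fuzz type system adds over this plain program is that the ε·n accounting is checked statically, rather than trusted.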

Reasoning about DP via probabilistic coupling - BGGHS For two (sub-)distributions µ1, µ2 ∈ Dist(A), we have an approximate coupling µ1 C(ε,δ)(R) µ2 iff there exists µ ∈ Dist(A × A) such that supp µ ⊆ R, πi µ ≤ µi for i = 1, 2, and max_{E ⊆ A}(πi µ(E) − e^ε µi(E), µi(E) − e^ε πi µ(E)) ≤ δ. This generalizes indistinguishability to other relations, allowing more general relational reasoning; it requires more involved reasoning about probability distances and divergences, while preserving the ability to use semi-decision logical and numerical procedures.

prHL-like languages The CDF example is similar to the previous ones: b ~ b' ⇒ (it-cdf b l) C(ε,0)(=) (it-cdf b' l). Having two copies of the program allows us to compare different parts of the same program.

Why this helps? It allows us to internalize the properties of the Laplace distribution better: for any k, (Lap(1/ε) v1) C(ε·|k + v1 − v2|, 0)(x1 + k = x2) (Lap(1/ε) v2). This can be used to assert several facts about probabilities symbolically.

Why this helps? (Lap(1/ε) v1) C(ε·|v1 − v2|, 0)(x1 = x2) (Lap(1/ε) v2) expresses |log( Pr[Lap(1/ε) v1 = r] / Pr[Lap(1/ε) v2 = r] )| ≤ ε·|v1 − v2|.

Why this helps? |v1 − v2| ≤ k ⇒ (Lap(1/ε) v1) C(2kε, 0)(x1 + k = x2) (Lap(1/ε) v2) expresses |v1 − v2| ≤ k ⇒ |log( Pr[Lap(1/ε) v1 = r + k] / Pr[Lap(1/ε) v2 = r] )| ≤ 2kε.

Why this helps? (Lap(1/ε) v1) C(0, 0)(x1 − x2 = v1 − v2) (Lap(1/ε) v2) expresses, writing k = v1 − v2, log( Pr[Lap(1/ε) v2 + k = r + k] / Pr[Lap(1/ε) v2 = r] ) ≤ 0.
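The Laplace facts above are pointwise statements about density ratios, so they can be spot-checked numerically on a grid. This is an illustrative check of mine, not a proof; the grid and parameter values are arbitrary.

```python
import math

def lap_pdf(r, v, eps):
    # density at r of v + Lap(1/ε)
    return (eps / 2.0) * math.exp(-eps * abs(r - v))

eps, v1, v2 = 0.5, 3.0, 5.0
grid = [x / 10.0 for x in range(-100, 200)]

# |log Pr[Lap(1/ε) v1 = r] / Pr[Lap(1/ε) v2 = r]| ≤ ε·|v1 − v2|
worst = max(abs(math.log(lap_pdf(r, v1, eps) / lap_pdf(r, v2, eps))) for r in grid)
assert worst <= eps * abs(v1 - v2) + 1e-9

# shifted version: compare Pr[... = r + k] against Pr[... = r] with k = |v1 − v2|
k = abs(v1 - v2)
worst_shift = max(abs(math.log(lap_pdf(r + k, v1, eps) / lap_pdf(r, v2, eps))) for r in grid)
assert worst_shift <= 2 * k * eps + 1e-9
```

The point of approximate couplings is that a verifier can discharge exactly these bounds symbolically, instead of by numeric search.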

Other works Bisimulation-based methods (Tschantz et al., Xu et al.); Fuzz with distributed code (Eigner & Maffei); satisfiability modulo counting (Fredrikson & Jha); Bayesian inference (BFGGHS); Adaptive Fuzz (Penn); accuracy bounds (BGGHS); continuous models (Sato); lightweight verification via an injective-function argument (Zhang & Kifer); relational symbolic execution for R, generating DP counterexamples (Chong & Farina & Gaboardi); formalizing the local model (Ebadi & Sands); zCDP (BGHS).

Challenges All of these tools are research projects and most of them are usable only by experts. Can we use them to certify the correctness of a library of basic mechanisms? Which non-experts should we aim for?

Other challenges Are there other fundamental principles that we can use? How can we extend them to verify accuracy and efficiency? There are several works on the verification of randomness, floating points, SMC, etc.; can we combine the different approaches? How can we internalize more involved data-model assumptions? How do we go from benchmarks to certification?