Calibrating Noise to Sensitivity in Private Data Analysis


Calibrating Noise to Sensitivity in Private Data Analysis
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith
Presented by Mamadou H. Diallo

Overview
Issue: preserving privacy in statistical databases
Assumption: a trusted server holds the data
General approach: perturb the true answer to a query by adding random noise drawn from a suitable distribution; the query function f maps databases to reals
Previous work:
1. Revealing information while preserving privacy: establishes a minimum perturbation level on the order of √n; considers a one-column database containing a single binary attribute
2. Privacy-preserving data mining on vertically partitioned databases: generalizes (1) to multiple attributes and vertically partitioned databases
In this paper:
- General functions f map databases to vectors of reals
- Privacy definition: ε-indistinguishability
- Privacy preservation: calibrate the standard deviation of the noise to the sensitivity of the function f
- A simple method for adding noise based on the Laplace distribution

Concepts
- Statistical database: a database used for statistical analysis, accessed through a query-response mechanism
- Interactive scheme: only the result of a query is perturbed
- Non-interactive scheme: the whole database is perturbed before any query is asked
- Adaptive interactive scheme: later queries may depend on the answers given so far
- Probabilistic Turing machine: at each step, randomly chooses among the available transitions according to some probability distribution
- Laplace distribution: the distribution of the difference between two independent variates with identical exponential distributions
- Probability density function: a function describing the relative likelihood of a random variable taking a given value in the observation space
- Hamming distance (two strings): the number of positions at which the corresponding symbols differ
- Transcript: the record of an interaction between a user and a privacy mechanism (queries + responses)

Adversary and Definitions
Adversary: modeled as a probabilistic interactive Turing machine with an advice tape (the probabilistic Turing machine adds randomness in the transition choices)
Notation:
- San: a database access protocol (the mechanism)
- A: the adversary
- x: a particular database, x = <x_1, x_2, ..., x_n>, where each x_i ∈ D = {0,1}^d or R^d and n is the number of entries
- T_San,A(x): a transcript (a random variable)
- d_H(·,·) over D^n: Hamming distance
- Pr[A = a]: denotes probability density in the continuous case and probability mass in the discrete case
Definition. A mechanism is ε-indistinguishable if for all pairs x, x' ∈ D^n that differ in only one entry, for all adversaries A, and for all transcripts t:
  | ln( Pr[T_A(x) = t] / Pr[T_A(x') = t] ) | ≤ ε        (ε is the leakage)
For small ε, ln(1 + ε) ≈ ε, so this roughly says (1 − ε) ≤ Pr[T_A(x) = t] / Pr[T_A(x') = t] ≤ (1 + ε) for all transcripts t

Definitions
Laplace distribution Lap(λ):
- Density function: h(y) ∝ exp(−|y|/λ)
- Mean = 0, standard deviation = √2·λ
Example: Noisy Sum
- x ∈ {0,1}^n; the user wants to learn f(x) = Σ_i x_i (the total number of 1's in the database)
- Noise: Y ~ Lap(1/ε)
- T(x_1, ..., x_n) = Σ_i x_i + Y is an ε-indistinguishable mechanism:
  - h(y)/h(y') ≤ e^{ε|y − y'|} for any real numbers y, y'
  - if x and x' differ in one entry, then |f(x) − f(x')| ≤ 1
  - Pr(T(x) = t) / Pr(T(x') = t) = h(t − f(x)) / h(t − f(x')) ≤ e^{ε|f(x) − f(x')|} ≤ e^ε
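A minimal Python sketch of the noisy-sum mechanism above (numpy's Laplace sampler is used; the function and variable names are illustrative, not from the paper):

import numpy as np

def noisy_sum(x, epsilon, rng=None):
    """Release sum(x) for a 0/1 database x with Laplace noise Y ~ Lap(1/epsilon)."""
    rng = np.random.default_rng() if rng is None else rng
    true_answer = float(np.sum(x))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)  # density ~ exp(-epsilon * |y|)
    return true_answer + noise

# Example: a database of 100 bits, privacy parameter epsilon = 0.5
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100)
print(noisy_sum(x, epsilon=0.5, rng=rng))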

Non-negligible leakage and the choice of distance measure
Non-negligible leakage:
- Noisy Sum: ε = Ω(1/n) is needed for a constant-factor approximation to f(x)
- Leakage of this order is inherent for statistical utility: if ε = o(1/n) between neighboring databases, then summing over at most n single-entry changes gives o(1) leakage between any two databases ===> no utility
Average-case distance measures (e.g., statistical distance) give no meaningful privacy guarantee at the level Ω(1/n):
- Example: T(x_1, ..., x_n) = (i, x_i), where i is chosen uniformly from {1, ..., n}
- If x and x' differ in one entry, the statistical distance between T(x) and T(x') is only 1/n
- But every transcript reveals private information about some individual
- ε-indistinguishability is not satisfied: if x and x' differ in entry i, the transcript (i, x_i) has probability 0 under x'
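A small sketch (the helper is hypothetical, not from the paper) of the counterexample above: the random-index mechanism has statistical distance only 1/n between neighboring databases, yet some transcript has probability 0 under one of them, so the ε-indistinguishability ratio is unbounded.

from fractions import Fraction

def transcript_distribution(x):
    """T(x) outputs (i, x[i]) with i uniform; return the exact output distribution."""
    n = len(x)
    dist = {}
    for i, xi in enumerate(x):
        dist[(i, xi)] = dist.get((i, xi), Fraction(0)) + Fraction(1, n)
    return dist

x  = (0, 1, 1, 0, 1)          # neighboring databases: differ only in entry 0
xp = (1, 1, 1, 0, 1)
P, Q = transcript_distribution(x), transcript_distribution(xp)
support = set(P) | set(Q)
stat_dist = sum(abs(P.get(t, 0) - Q.get(t, 0)) for t in support) / 2
print("statistical distance =", stat_dist)   # 1/n = 1/5
print("Pr[(0,0)] under x =", P.get((0, 0), 0), ", under x' =", Q.get((0, 0), 0))
# Pr[T(x)=(0,0)] / Pr[T(x')=(0,0)] is (1/5)/0: the per-transcript leakage is unbounded.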

Sensitivity and Privacy
Sensitivity:
- Sensitivity of the query function f: S(f)
- Noise added: Lap(S(f)/ε)
Definition (L1 sensitivity). The L1 sensitivity of a function f : D^n ---> R^d is the smallest number S(f) such that for all x, x' ∈ D^n that differ in a single entry, ||f(x) − f(x')||_1 ≤ S(f)
Lipschitz condition on f: if d_H(·,·) is the Hamming metric on D^n, then ||f(x) − f(x')||_1 / d_H(x, x') ≤ S(f)
Example: Sums and Histograms
(1) If D = {0, 1} and f(x) = Σ_{i ∈ {1,...,n}} x_i, then S_L1(f) = 1
(2) If D is partitioned into disjoint bins B_1, ..., B_d and f: D^n ---> Z^d is the histogram counting how many entries fall into each bin, then S_L1(f) = 2, independent of d (changing one entry moves one count down and another up)
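A minimal Python sketch (names are illustrative) that checks the two sensitivity claims by brute force over all pairs of neighboring databases of a small size n:

from itertools import product
import numpy as np

def l1_sensitivity(f, domain, n):
    """Max ||f(x) - f(x')||_1 over all x, x' in domain^n that differ in one entry."""
    best = 0.0
    for x in product(domain, repeat=n):
        for i in range(n):
            for v in domain:
                if v == x[i]:
                    continue
                xp = x[:i] + (v,) + x[i + 1:]
                best = max(best, float(np.sum(np.abs(f(x) - f(xp)))))
    return best

f_sum  = lambda x: np.array([sum(x)])                      # sum query over D = {0, 1}
f_hist = lambda x: np.bincount(np.array(x), minlength=3)   # histogram over D = {0, 1, 2}

print(l1_sensitivity(f_sum,  domain=(0, 1),    n=4))   # 1.0
print(l1_sensitivity(f_hist, domain=(0, 1, 2), n=4))   # 2.0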

Sensitivity and Privacy
Calibrating noise according to S(f):
- If Y is drawn from the Laplace distribution Lap(λ), then h(y)/h(y') ≤ exp(|y − y'|/λ)
- If Y = <L_1, ..., L_d> with the L_i drawn independently from Lap(λ), then the density satisfies h(y) ∝ exp(−||y||_1/λ)
- Consequence: Pr(z + Y = t) / Pr(z' + Y = t) ≤ exp(||z − z'||_1/λ)
- To release f(x) privately, add Laplace noise with scale λ = S(f)/ε in every coordinate
Proposition (Non-interactive Output Perturbation). For all f : D^n ---> R^d, the following mechanism is ε-indistinguishable: San_f(x) = f(x) + (Y_1, ..., Y_d), where the Y_i are drawn i.i.d. from Lap(S(f)/ε)
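A minimal sketch of the output-perturbation proposition for vector-valued queries, shown here for the histogram query from the previous slide (the sensitivity value 2 comes from that slide; function names are illustrative):

import numpy as np

def output_perturbation(f, x, sensitivity, epsilon, rng=None):
    """San_f(x) = f(x) + (Y_1, ..., Y_d), with Y_i i.i.d. Lap(sensitivity/epsilon)."""
    rng = np.random.default_rng() if rng is None else rng
    answer = np.asarray(f(x), dtype=float)
    scale = sensitivity / epsilon
    return answer + rng.laplace(loc=0.0, scale=scale, size=answer.shape)

# Histogram over d = 4 bins; changing one entry moves one count, so S_L1(f) = 2.
f_hist = lambda x: np.bincount(np.asarray(x), minlength=4)
x = [0, 2, 2, 3, 1, 0, 2]
print(output_perturbation(f_hist, x, sensitivity=2, epsilon=0.1))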

Sensitivity and Privacy
Adaptive interactive scheme:
- Transcript: t = [Q_1, a_1, Q_2, a_2, ..., Q_d, a_d]
- Queries: Q = f_1, f_2, ..., f_d, where each f_i: D^n ---> R (an adaptively chosen sequence of queries)
- San answers the i-th query with f_i(x) + Lap(λ), and refuses to answer if the sensitivity of the query exceeds a threshold
Theorem. For an arbitrary adversary A, let f_t(x) : D^n ---> R^d be its query function as parameterized by a transcript t. If λ = max_t S(f_t)/ε, the mechanism above is ε-indistinguishable.
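A minimal sketch of an interactive server in this spirit (the class, the threshold bookkeeping, and the claimed-sensitivity argument are illustrative assumptions; the scale lambda would be chosen as max_t S(f_t)/epsilon as in the theorem):

import numpy as np

class InteractiveLaplaceServer:
    """Answers adaptively chosen scalar queries with Laplace noise of fixed scale lambda."""
    def __init__(self, x, lam, sensitivity_threshold, rng=None):
        self.x = x
        self.lam = lam                        # lambda = max_t S(f_t) / epsilon
        self.threshold = sensitivity_threshold
        self.rng = np.random.default_rng() if rng is None else rng

    def answer(self, f, claimed_sensitivity):
        # Refuse queries whose sensitivity exceeds the threshold.
        if claimed_sensitivity > self.threshold:
            return None
        return f(self.x) + self.rng.laplace(scale=self.lam)

# Usage: later queries may depend on earlier answers (adaptive).
server = InteractiveLaplaceServer(x=np.array([0, 1, 1, 0, 1]), lam=1 / 0.1, sensitivity_threshold=1.0)
a1 = server.answer(lambda x: float(np.sum(x)), claimed_sensitivity=1.0)
a2 = server.answer(lambda x: float(np.sum(x[:3])) if a1 > 2 else float(np.sum(x[3:])), claimed_sensitivity=1.0)
print(a1, a2)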

Sensitivity and Privacy
Sensitivity in general metric spaces:
- Intuition: the fact that insensitive functions of a database can be released privately is not specific to the L1 metric
Definition. Let M be a metric space with distance function d_M(·,·). The sensitivity S_M(f) of a function f: D^n ---> M is the amount by which the function value can vary when a single entry of the input is changed:
  S_M(f) = sup over x, x' with d_H(x, x') = 1 of d_M(f(x), f(x'))
- The released value z ∈ M is sampled with probability density h_{f(x),ε}(z) that decays exponentially in d_M(f(x), z), scaled by ε/S_M(f); exact sampling may have to be replaced by approximate sampling
Theorem. In a metric space where h_{f(x),ε}(·) is well-defined, adding noise to f(x) as above yields an ε-indistinguishable scheme.
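A minimal sketch of this metric-space mechanism for a finite M, where the density can be normalized explicitly (the exact constant in the exponent, and hence the exact privacy parameter achieved, follows the paper's analysis; this sketch simply uses exp(−ε·d_M(f(x), z)/S_M(f)) as described above, and all names are illustrative):

import numpy as np

def metric_mechanism(fx, points, d_M, sensitivity, epsilon, rng=None):
    """Sample z from a finite metric space with weight exp(-epsilon * d_M(fx, z) / S_M(f))."""
    rng = np.random.default_rng() if rng is None else rng
    weights = np.array([np.exp(-epsilon * d_M(fx, z) / sensitivity) for z in points])
    probs = weights / weights.sum()
    return points[rng.choice(len(points), p=probs)]

# Example: release a value over the finite range {0, ..., 10} with |a - b| as the metric.
points = list(range(11))
z = metric_mechanism(fx=4, points=points, d_M=lambda a, b: abs(a - b), sensitivity=1.0, epsilon=0.5)
print(z)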

Interactive vs. Non-Interactive Mechanisms
Notation:
- D = {0, 1}^d
- g: a boolean predicate
- r = (r_1, r_2, ..., r_n), with each r_i ∈ {0, 1}^d
- g_r(i, x) = r_i ⊙ x (the inner product, modulo 2, of r_i and x)
Theorem 3 (Non-interactive Schemes Require Large Databases). Suppose that San is an ε-indistinguishable non-interactive mechanism with domain D = {0, 1}^d. For at least 2/3 of the functions of the form f_g(x) = Σ_i g(i, x_i), the following two distributions have statistical difference O(n^{4/3} ε^{2/3} 2^{−d/3}):
- Distribution 0: T_San(x) where x is drawn uniformly from {x ∈ D^n : f_g(x) = 0}
- Distribution 1: T_San(x) where x is drawn uniformly from {x ∈ D^n : f_g(x) = n}
Consequence: if n = o(2^{d/4}/ε), then for most predicates g_r(i, x) = r_i ⊙ x it is impossible to learn the relative frequency of database items satisfying the predicates g_r(i, x).

Interactive vs. Non-Interactive Mechanisms
A stronger separation for randomized response schemes:
- Randomized response: each data entry is perturbed individually
- Randomization operator Z : D ---> {0, 1}* such that T_San(x_1, ..., x_n) = Z(x_1), ..., Z(x_n)
- Strengthening the previous theorem: consider f_g where the same predicate g: D ---> {0, 1} is applied to all the entries; g_r(x) = r ⊙ x is difficult to learn from Z(x), so f(x) is difficult to learn from T_San(x) unless n is very large
Proposition (Randomized Response). Suppose that San is an ε-indistinguishable randomized response mechanism. For at least 2/3 of the values r ∈ {0, 1}^d \ {0^d}, the following two distributions have statistical difference O(n ε^{2/3} 2^{−d/3}):
- Distribution 0: T_San(x) where each x_i is drawn uniformly from {x ∈ {0, 1}^d : r ⊙ x = 0}
- Distribution 1: T_San(x) where each x_i is drawn uniformly from {x ∈ {0, 1}^d : r ⊙ x = 1}
Consequence: if n = o(2^{d/3}/ε^{2/3}), no user can learn the relative frequency of database items satisfying the predicate g_r(x) = r ⊙ x, for most values r.
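For concreteness, a minimal sketch of one classic randomized response operator (one-bit entries, i.e., d = 1; this particular Z is standard but not spelled out on the slide): each bit is kept with probability e^ε/(1 + e^ε) and flipped otherwise, which makes each Z(x_i) ε-indistinguishable, and the transcript is just the list of perturbed entries.

import numpy as np

def Z(bit, epsilon, rng):
    """Keep the bit with prob e^eps/(1+e^eps), flip it otherwise (eps-indistinguishable per entry)."""
    keep_prob = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return bit if rng.random() < keep_prob else 1 - bit

def randomized_response_transcript(x, epsilon, rng=None):
    """T_San(x_1, ..., x_n) = Z(x_1), ..., Z(x_n): each entry is perturbed individually."""
    rng = np.random.default_rng() if rng is None else rng
    return [Z(int(xi), epsilon, rng) for xi in x]

rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=10)
print(list(x))
print(randomized_response_transcript(x, epsilon=1.0, rng=rng))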

Questions?