CSCI 5520: Foundations of Data Privacy, Lecture 5
The Chinese University of Hong Kong, Spring 2015 (3 February 2015)

The interactive online mechanism that we discussed in the last two lectures allows efficient approximation of answers to arbitrary averaging queries. We showed that this mechanism is differentially private as long as the database has sufficiently many rows. If we disregard efficiency considerations, this mechanism can be generalized to handle not only counting queries, but arbitrary queries that take values in R or R^d (e.g. histograms; see Chapter 5 of the survey). However, the privacy loss grows with the sensitivity of the queries.

In general, the mechanisms that we have discussed so far do not give any privacy guarantee for queries of large or unbounded sensitivity. One example of such a query is the median. In the worst case, this number can change dramatically if we change one row of our data: if 51 of our database participants make zero HKD and 50 make 1 million HKD, then changing the salary of one person can cause the value of the median to go up by 1 million.

What should a data release mechanism do in such a case? One option is to refuse to provide an answer when the answer reveals too much private information. One might hope that such cases happen rarely and do not affect the typical utility of the mechanism. In this setting, the role of the mechanism is to determine whether providing an answer to a given query might harm the privacy of the participants in the database, and to provide the answer only if no harm is caused. The design of these mechanisms from first principles is not particularly difficult, although I am not sure how realistic the typicality assumption they make about the data is.

1 Median queries

As an instructive example, let's talk about the median query. Recall that the median of a sequence of numbers is the middle number in the sequence after the sequence is ordered. (We'll assume that the sequence has an odd number of entries.) Intuitively, if we have a sequence of numbers like 1, 1, 2, 3, 3, 3, 3, 4, 5, 5, 5, then releasing the median should not be particularly harmful to privacy, because changing a single entry in this sequence does not change the value of the median.

This observation suggests the following mechanism for median queries, given a database x consisting of a sequence of numbers: if for every row of the database, changing the value of that row does not change the median, then output median(x); otherwise output ⊥ (i.e. refuse to answer).

This mechanism is deterministic, so it is not differentially private. Where does the privacy loss occur? The reason this mechanism is not private is that making public the decision whether to release or to refuse itself entails a loss of privacy. One solution is to randomize this decision. The mechanism below incorporates a tradeoff between privacy and stability, i.e. the minimum number of rows that need to be replaced in order to change the value of the median.
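Before turning to the randomized version, here is a minimal Python sketch (mine, not from the notes) of the deterministic release-or-refuse rule just described. For an odd-length database it amounts to checking whether the two neighbours of the middle position in sorted order carry the same value as the median itself, since pushing a single entry to plus or minus infinity can only move the median to one of those neighbours.

def deterministic_median_release(x):
    # Release the median only if no single-row change can alter it; otherwise refuse.
    # Assumes an odd number of entries (at least 3). Deterministic, hence not
    # differentially private: the release/refuse decision itself leaks information.
    ys = sorted(x)
    h = len(ys) // 2
    if ys[h - 1] == ys[h] == ys[h + 1]:
        return ys[h]                 # median(x)
    return None                      # refuse to answer (the "bottom" output)

On the example sequence above (1, 1, 2, 3, 3, 3, 3, 4, 5, 5, 5) this returns 3, while on (1, 2, 3) it refuses.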

Stable median mechanism SMed. Given a database x of numbers:

  1. Calculate the minimum number ∆(x) of rows of x that need to be replaced in order to change median(x).
  2. Sample a Lap(1/ε) random variable N.
  3. If ∆(x) + N > t/ε, output median(x). Otherwise, output ⊥.

We show that this mechanism is almost always differentially private:

Theorem 1. For ε ≤ 1, the stable median mechanism is (ε, O(e^-t))-differentially private.

Proof. Let x and x' be adjacent databases. The median of x can be changed by first replacing one row of x to get x' and then replacing ∆(x') rows of x', so ∆(x) ≤ ∆(x') + 1. By symmetry we obtain ∆(x) ≥ ∆(x') - 1, so ∆ is 1-Lipschitz. We now consider two cases.

If ∆(x) > 1, then x and x' have the same median m, so the only possible answers of the mechanism on databases x and x' are m and ⊥. In particular,

  Pr[SMed(x) = m] = Pr[∆(x) + N > t/ε] ≤ Pr[(∆(x') + 1) + N > t/ε] ≤ e^ε Pr[SMed(x') = m],

and by similar reasoning Pr[SMed(x) = ⊥] ≤ e^ε Pr[SMed(x') = ⊥].

If ∆(x) = 1, then

  Pr[SMed(x) ≠ ⊥] = Pr[1 + N > t/ε] = Pr[N > t/ε - 1] = O(e^-t)

and

  Pr[SMed(x') ≠ ⊥] ≤ Pr[2 + N > t/ε] = O(e^-t)

by the tail bound on the Laplace distribution. By going over the possible events S ∈ {∅, {median(x)}, {⊥}, {median(x), ⊥}}, we conclude that for every such S, Pr[SMed(x) ∈ S] ≤ Pr[SMed(x') ∈ S] + O(e^-t).

We conclude that in either case the mechanism is (ε, O(e^-t))-differentially private.

The utility of this mechanism on a particular database x is determined by the parameter ∆(x). The probability that the mechanism outputs the median equals the probability that a Lap(1/ε) random variable takes a value greater than t/ε - ∆(x). For example, if ∆(x) ≥ 2t/ε, then by the large deviation bound from Homework 1 the median is output with probability at least 1 - O(e^-t).

More generally, this type of mechanism can be used on any query for which the value ∆(x) tends to be reasonably large. One computational issue that can come up in the general setting is that ∆(x) may be difficult to calculate (or to approximate to within sufficient accuracy). For median queries, this calculation can be done in time roughly linear in the size of x.
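To make the computation of ∆ concrete, here is a minimal Python sketch of SMed under my reading of the description above (the function names, the use of numpy for the Laplace sample, and the counting formula for ∆ are mine). The helper stability(x) sorts x and counts how many entries would have to be pushed above or below the median to move it, which runs in O(n log n) time, within a logarithmic factor of the claim above.

import numpy as np

def stability(x):
    # Delta(x): the minimum number of rows that must be replaced to change median(x).
    # With le entries <= median and ge entries >= median (n odd, middle index h),
    # moving the median up costs le - h replacements and moving it down costs ge - h.
    ys = sorted(x)
    h = len(ys) // 2
    m = ys[h]
    le = sum(1 for v in ys if v <= m)
    ge = sum(1 for v in ys if v >= m)
    return min(le, ge) - h

def stable_median(x, eps, t):
    # SMed: release median(x) only if Delta(x) plus Laplace noise clears t/eps.
    noise = np.random.default_rng().laplace(scale=1.0 / eps)   # N ~ Lap(1/eps)
    if stability(x) + noise > t / eps:
        return sorted(x)[len(x) // 2]                          # median(x)
    return None                                                 # output "bottom"

On the example sequence from Section 1, stability returns 2, matching the observation that no single change moves the median.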

2 The subsample and aggregate mechanism

The propose-test-release mechanism is a clever conceptual solution to the problem of statistical query release that bypasses the sensitivity issue altogether. So far we have viewed all queries to the data as legitimate, and the goal of the mechanism was to answer every query as accurately as possible without compromising data privacy. Let us go one step further and ask: what is the point of querying a data set? Often the reason we query data is that we need to draw some statistical conclusion (e.g. New Territories districts tend to vote for the DAB; cancer is more likely among smokers). It is often assumed that the more data we have, the more accurate our conclusions should be. A consequence of this belief is that there comes a point at which having additional data does not improve the accuracy of the results; that is, results of a similar quality can be obtained even if some of the data were discarded. Thus, a query whose answer is affected substantially by subsampling the data may not be particularly useful in the first place, and so there shouldn't be much harm in refusing to answer such a query. The subsample and aggregate mechanism answers all other queries, namely those that are robust to subsampling, in a differentially private way.

We begin with a definition that captures this notion of robustness. Our definition requires that the answer stay the same with high probability when the data is subsampled. A more realistic notion might allow for a small change in the answer. One should be able to prove similar results in that case, so for simplicity I'll stick to the more basic definition.

Definition 2. Query q is m-stable for database x ∈ D^n if Pr_y[q(x) = q(y)] ≥ 3/4, where y is obtained by sampling m independent uniformly distributed rows of x.

The constant 3/4 is somewhat arbitrary; the gap between this constant and 1/2 only affects various constants in the description of the mechanism.

The subsample and aggregate mechanism takes several independent subsamples of the data, queries the data on each subsample, and releases the answer to the query if there tends to be agreement among the subsamples. The mode of a sequence of numbers is the most frequent value occurring among them (breaking ties arbitrarily). The frequency of the mode is the number of times it occurs. For instance, the mode of (1, 2, 4, 3, 4, 4, 2) is 4 and its frequency is 3.

Subsample and aggregate mechanism SA. Given a database x ∈ D^n:

  1. Create k = ε(n/m)^3 i.i.d. databases x_1, ..., x_k ∈ D^m, where x_i is a uniform sample of m rows of x with repetition.
  2. Let f be the frequency of the mode of (q(x_1), ..., q(x_k)).
  3. Sample a Lap(2km/εn) random variable N.
  4. If f + N > 5k/8, output the mode of (q(x_1), ..., q(x_k)). Otherwise output ⊥.
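A minimal Python sketch of SA under my reading of the pseudocode above (the function name, the rounding of k to an integer, and the use of numpy and collections.Counter are mine; the query q is assumed to return a hashable value):

from collections import Counter
import numpy as np

def subsample_and_aggregate(x, q, m, eps):
    # SA: answer q on k i.i.d. subsamples and release the mode if it is frequent enough.
    n = len(x)
    k = max(1, int(eps * (n / m) ** 3))               # k = eps * (n/m)^3 subsamples
    rng = np.random.default_rng()
    answers = []
    for _ in range(k):
        rows = rng.integers(0, n, size=m)             # m rows chosen uniformly, with repetition
        answers.append(q([x[i] for i in rows]))
    mode, f = Counter(answers).most_common(1)[0]      # mode of the answers and its frequency
    noise = rng.laplace(scale=2 * k * m / (eps * n))  # N ~ Lap(2km/(eps*n))
    if f + noise > 5 * k / 8:
        return mode
    return None                                       # refuse to answer

Ties for the mode are broken arbitrarily by most_common, matching the definition above.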

In the analysis, we will assume that εn is larger than Km for a sufficiently large constant K. The utility of this mechanism will follow from the stability of q: typically about 3/4 of the answers q(x_i) will be the same as q(x), and in such a case the stable mode mechanism is likely to output the correct answer.

Theorem 3. If q is m-stable for x, then SA(x) outputs q(x) with probability at least 1 - e^-Ω(εn/m).

Proof. Let X_i be an indicator random variable for the event q(x_i) = q(x). These are i.i.d. random variables with mean at least 3/4. By the Chernoff bound, the probability that fewer than 11k/16 of them are equal to q(x) is at most 2^-Ω(k) ≤ 2^-Ω(εn/m). Assuming at least 11k/16 of the answers q(x_i) equal q(x), the mode of the answers is q(x) and f ≥ 11k/16, so the probability that f + N ≤ 5k/8 is at most the probability that N ≤ -k/16, which is at most e^-Ω(εn/m). The theorem follows by taking a union bound.

For privacy, we will argue that any specific row of x is unlikely to be represented in too many of the samples x_1, ..., x_k, so it is unlikely that too many values in the sequence (q(x_1), ..., q(x_k)) are affected by the change of a single row of x. We then apply a similar analysis to the one for the stable median mechanism.

Theorem 4. Assume m ≤ n/64. Then SA is (ε, e^-Ω(εn/m))-differentially private.

Proof. Let x, x' ∈ D^n be adjacent databases. Let X_i be an indicator random variable for the event that the row r on which x and x' differ is present in the database x_i. These are independent random variables with mean at most m/n, so by the Chernoff bound the probability that their sum exceeds (2m/n)k is at most e^-2km^2/n^2 = e^-Ω(εn/m) by our choice of k. For the rest of the proof we will assume that r is present in at most (2m/n)k of the databases x_i. Under this assumption the sequences s = (q(x_1), ..., q(x_k)) and s' = (q(x'_1), ..., q(x'_k)) can differ in at most (2m/n)k entries. Let f and f' denote the frequency of the mode of s and s', respectively. We now consider two cases.

If f > 9k/16, then the modes of s and s' must be the same, as the two sequences differ in at most (2m/n)k ≤ k/16 of their entries. In this case, the only possible outputs of the mechanism on x and x' are the common mode m of both sequences and ⊥. Then

  Pr[SA(x) = m] = Pr[f + N > 5k/8] ≤ Pr[f' + (2m/n)k + N > 5k/8] ≤ e^ε Pr[f' + N > 5k/8] = e^ε Pr[SA(x') = m],

because N is a Laplace random variable with parameter 2km/εn. Using a similar calculation we conclude that Pr[SA(x) = ⊥] ≤ e^ε Pr[SA(x') = ⊥].

If f ≤ 9k/16, then SA(x) outputs ⊥ whenever N ≤ k/16, which happens with probability at least 1 - 2^-Ω(εn/m). Also, f' ≤ f + (2m/n)k ≤ 19k/32, so SA(x') outputs ⊥ whenever N ≤ k/32, which also happens with probability at least 1 - e^-Ω(εn/m). As in the proof of Theorem 1, it follows that in this case Pr[SA(x) ∈ S] ≤ Pr[SA(x') ∈ S] + e^-Ω(εn/m) for every event S.

In either case, SA is (ε, e^-Ω(εn/m))-differentially private.
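The two facts about the Laplace distribution used repeatedly above, the tail bound and the cost of shifting the threshold, follow from a one-line calculation. Here is a short derivation (mine rather than the notes'), under the standard convention that Lap(b) has density (1/2b) e^{-|z|/b}:

  \Pr[N > a] \;=\; \int_a^{\infty} \frac{1}{2b}\, e^{-z/b}\, dz \;=\; \frac{1}{2}\, e^{-a/b} \qquad (a \ge 0),

  \Pr[N > s - c] \;\le\; e^{c/b}\, \Pr[N > s] \qquad (c \ge 0),

where the second bound holds because the density of N + c at any point z is (1/2b) e^{-|z-c|/b} ≤ e^{c/b} (1/2b) e^{-|z|/b}. Taking b = 1/ε and c = 1 gives the factor e^ε in the proof of Theorem 1; taking b = 2km/εn and c = (2m/n)k gives the factor e^ε in the proof of Theorem 4; and taking a = k/16 or a = k/32 gives its e^-Ω(εn/m) tail bounds.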

References

These notes are based on Chapter 7 of the survey The Algorithmic Foundations of Differential Privacy by Cynthia Dwork and Aaron Roth.