Locally Differentially Private Protocols for Frequency Estimation. Tianhao Wang, Jeremiah Blocki, Ninghui Li, Somesh Jha


Differential Privacy (classical setting)

Users contribute their raw data to a central database. The curator runs data mining / statistical queries over the database and adds noise to the answers.

Interpretation: the decision to include or exclude an individual's record has minimal (ε) influence on the outcome. Smaller ε means stronger privacy.

The catch: the database curator must be trusted.

Local Differential Privacy

Each user adds noise to their own data (Data + Noise) before sending it to the database; data mining / statistical queries run directly on the noisy reports.

No need to worry about an untrusted server.

The Warner Model (1965)

A survey technique for private questions. Survey question: "Are you a member of the communist party?" Each person flips a secret coin:
- Heads (w/p 0.5): answer truthfully
- Tails (w/p 0.5): answer uniformly at random
E.g., a communist answers "yes" w/p 75% and "no" w/p 25%.

Unbiased estimation of the distribution: if n_v out of n people are communists and I_v is the number of "yes" answers, we expect E[I_v] = 0.75 n_v + 0.25 (n - n_v). Let c(n_v) = (I_v - 0.25 n) / 0.5. Since E[c(n_v)] = (E[I_v] - 0.25 n) / 0.5 = n_v, this is an unbiased estimate of the number of communists.

The model provides deniability: seeing an answer, one cannot be certain about the secret.

We say a protocol P is ε-LDP iff for any inputs v and v' and any output y (here y is "yes" or "no"), Pr[P(v) = y] / Pr[P(v') = y] ≤ e^ε.

The Warner model only handles a binary attribute; we want to handle the more general setting.
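The coin-flip mechanism and the estimator c(n_v) above can be simulated directly. A minimal sketch (the function names and the synthetic population are ours, for illustration only):

```python
import random

def warner_response(is_communist: bool) -> bool:
    """One respondent's answer under Warner's randomized response.

    With probability 0.5 (heads) answer truthfully; otherwise answer
    uniformly at random.  A communist thus says "yes" w/p 0.75.
    """
    if random.random() < 0.5:          # heads: tell the truth
        return is_communist
    return random.random() < 0.5       # tails: answer yes/no at random

def estimate_communists(answers):
    """Unbiased estimator c(n_v) = (I_v - 0.25 n) / 0.5 from the slide."""
    n = len(answers)
    i_v = sum(answers)                 # number of "yes" answers
    return (i_v - 0.25 * n) / 0.5

random.seed(0)
n, true_count = 100_000, 30_000
population = [i < true_count for i in range(n)]
answers = [warner_response(x) for x in population]
print(round(estimate_communists(answers)))   # close to 30,000
```

The estimate concentrates around the true count of 30,000 while each individual answer remains deniable.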

Abstract LDP Protocol

- Encode: x = E(v) takes an input value v from domain D and outputs an encoded value x.
- Perturb: y = P(x) takes an encoded value x and outputs a noisy report y. P satisfies ε-LDP.
- Estimate: c = Est({y}) takes the reports {y} from all users and outputs an estimate c(v) for any value v in D.

We focus on frequency estimation.

Frequency Estimation Protocols

- Direct Encoding (generalized random response) [Warner 65]: generalize the binary attribute to an arbitrary domain.
- Unary Encoding (basic one-time RAPPOR) [Erlingsson et al. 14]: encode the value into a bit vector and perturb each bit.
- Binary Local Hash [Bassily and Smith 15]: encode by hashing, then perturb.

Direct Encoding (Random Response)

User: encode x = v (suppose v is from D = {1, 2, ..., d}). Toss a coin with bias p:
- Heads: report the true value, y = x
- Tails: report one of the other d - 1 values uniformly at random, each with probability q
where p = e^ε / (e^ε + d - 1) and q = (1 - p) / (d - 1) = 1 / (e^ε + d - 1), so that Pr[P(v) = v] / Pr[P(v') = v] = p / q = e^ε.

Aggregator: suppose n_v users possess value v and I_v is the number of reports of v. Then E[I_v] = n_v p + (n - n_v) q, and the unbiased estimate is c(v) = (I_v - n q) / (p - q).

Intuitively, the higher p, the more accurate the estimate. However, when d is large, p becomes small.
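The d-ary mechanism and its unbiased estimator can be sketched as follows (seed, domain size, and the synthetic value distribution are our simulation choices):

```python
import math
import random

def grr_perturb(v, d, eps):
    """Direct Encoding / generalized randomized response (a sketch).

    Report the true value w/p p = e^eps / (e^eps + d - 1); otherwise
    report one of the other d - 1 values uniformly at random.
    """
    p = math.exp(eps) / (math.exp(eps) + d - 1)
    if random.random() < p:
        return v
    return random.choice([u for u in range(d) if u != v])

def grr_estimate(reports, v, d, eps):
    """Unbiased estimate c(v) = (I_v - n*q) / (p - q) from the slide."""
    n = len(reports)
    p = math.exp(eps) / (math.exp(eps) + d - 1)
    q = 1.0 / (math.exp(eps) + d - 1)
    i_v = sum(1 for y in reports if y == v)
    return (i_v - n * q) / (p - q)

random.seed(1)
d, eps, n = 4, 1.0, 50_000
values = [i % d for i in range(n)]              # 12,500 users per value
reports = [grr_perturb(v, d, eps) for v in values]
print(round(grr_estimate(reports, 0, d, eps)))  # close to 12,500
```

Rerunning with a larger d shows the effect discussed above: p shrinks toward q and the estimates get noisier.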

Unary Encoding (Basic RAPPOR)

Encode the value v into a bit string x with x[v] = 1 and 0 elsewhere. E.g., D = {1, 2, 3, 4} and v = 3 gives x = [0, 0, 1, 0].

Perturb each bit independently: report a 1-bit as 1 w/p p = e^(ε/2) / (e^(ε/2) + 1), and a 0-bit as 1 w/p q = 1 / (e^(ε/2) + 1).

Since x and x' are unary encodings of v and v', they differ in exactly two locations, so
Pr[P(E(v)) = y] / Pr[P(E(v')) = y] ≤ (p (1 - q)) / (q (1 - p)) = e^ε.

Intuition: with unary encoding, each location can only be 0 or 1, effectively reducing d at each location to 2. When d is large, UE is better than Direct Encoding.

To estimate the frequency of each value, apply the binary estimator to each bit.
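The encode-then-flip-each-bit scheme and its per-bit estimator look like this in outline (parameter choices and names are ours):

```python
import math
import random

def ue_report(v, d, eps):
    """Basic one-time RAPPOR sketch: unary-encode v, then flip bits
    so a 1-bit is reported as 1 w/p p and a 0-bit as 1 w/p q."""
    p = math.exp(eps / 2) / (math.exp(eps / 2) + 1)
    q = 1.0 / (math.exp(eps / 2) + 1)
    return [1 if random.random() < (p if i == v else q) else 0
            for i in range(d)]

def ue_estimate(reports, v, eps):
    """Per-bit unbiased estimate, same form as randomized response."""
    p = math.exp(eps / 2) / (math.exp(eps / 2) + 1)
    q = 1.0 / (math.exp(eps / 2) + 1)
    n = len(reports)
    i_v = sum(y[v] for y in reports)
    return (i_v - n * q) / (p - q)

random.seed(2)
d, eps = 16, 1.0
values = [0] * 20_000 + [1] * 30_000
reports = [ue_report(v, d, eps) for v in values]
print(round(ue_estimate(reports, 0, eps)))  # close to 20,000
```

Note that the per-bit budget is ε/2, reflecting the fact that the two differing bit positions together must satisfy ε-LDP.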

Binary Local Hash

The original protocol description is more complicated; we describe a simpler equivalent.

Each user picks a random hash function H from D to {0, 1}, which partitions D into Group 0 and Group 1 (e.g., D = {1, 2, 3, 4} with Group 1 = {2, 4}, Group 0 = {1, 3}). The user computes the bit b = H(v) and perturbs it with probabilities p = e^ε / (e^ε + g - 1) = e^ε / (e^ε + 1) and q = 1 / (e^ε + g - 1) = 1 / (e^ε + 1) (here g = 2 groups), so that Pr[P(E(v)) = b] / Pr[P(E(v')) = b] = p / q = e^ε. The user then reports the perturbed bit together with the hash function.

Aggregator: increment I_v for each report whose bit matches v's group under that user's hash. Then E[I_v] = n_v p + (n - n_v)(q/2 + p/2) = n_v p + (n - n_v)/2, and the unbiased estimate is c(v) = (I_v - n/2) / (p - 1/2).
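A runnable sketch of this simplified BLH (for simplicity each user draws a fresh truly random function D → {0, 1}, standing in for the shared family of hash functions; seed and parameters are ours):

```python
import math
import random

def blh_report(v, d, eps):
    """Binary Local Hash sketch: hash v to one bit, then randomize it."""
    h = [random.randrange(2) for _ in range(d)]   # user's hash function
    p = math.exp(eps) / (math.exp(eps) + 1)
    b = h[v]
    if random.random() >= p:                      # flip with prob 1 - p
        b = 1 - b
    return h, b

def blh_estimate(reports, v, eps):
    """c(v) = (I_v - n/2) / (p - 1/2), where I_v counts reports whose
    perturbed bit matches that user's hash of v (reports 'supporting' v)."""
    p = math.exp(eps) / (math.exp(eps) + 1)
    n = len(reports)
    i_v = sum(1 for h, b in reports if h[v] == b)
    return (i_v - n / 2) / (p - 0.5)

random.seed(3)
d, eps = 16, 1.0
values = [0] * 10_000 + [5] * 30_000
reports = [blh_report(v, d, eps) for v in values]
print(round(blh_estimate(reports, 0, eps)))   # close to 10,000
```

For a user holding u ≠ v, the hash bit h[v] is an independent fair coin, which is where the (n - n_v)/2 term in E[I_v] comes from.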

Takeaway

Key question: maximize the utility of frequency estimation under LDP.
Key idea: a framework to generalize and optimize these protocols.
Results: Optimized Unary Encoding (OUE) and Optimized Local Hash (OLH).

By improving the frequency estimator, results in other, more complicated settings that use LDP can also be improved, e.g., private learning, frequent itemset mining, etc.

Method

We measure the utility of a mechanism by its variance. E.g., in Random Response,
Var[c(v)] = Var[(I_v - n q) / (p - q)] = Var[I_v] / (p - q)^2 ≈ n q (1 - q) / (p - q)^2.

We propose a framework called "pure" and cast existing mechanisms into it. For each output y, define a set of inputs called Support(y). Intuition: each output votes for a set of inputs; after perturbation, output y supports the inputs in Support(y). E.g., in BLH, Support(y) = {v | H(v) = y}.

Define p* and q* such that P(E(v)) supports v w/p p* and supports any other value w/p q*. E.g., in Random Response, p* = p and q* = q. "Pure" means this holds uniformly for all input-output pairs.

The optimization problem is then: minimize Var[c(v)], i.e., minimize q* (1 - q*) / (p* - q*)^2, subject to p* and q* satisfying ε-LDP.

Optimized UE (OUE)

In the original UE, each bit is perturbed independently with p = e^(ε/2) / (e^(ε/2) + 1) and q = 1 / (e^(ε/2) + 1). We want to make p* higher.

Key insight: perturb the 0s and the 1 differently! There are many 0s, so we keep them with the larger probability; there is only a single 1, so we can afford a smaller keep probability:
- For a 0-bit: keep it 0 w/p p_0 = e^ε / (e^ε + 1), i.e., report 1 w/p q_0 = 1 / (e^ε + 1)
- For the 1-bit: report 1 w/p p_1 = 1/2 (and 0 w/p q_1 = 1/2)
The privacy ratio is still
Pr[P(E(v)) = y] / Pr[P(E(v')) = y] ≤ (p_1 (1 - q_0)) / (q_0 (1 - p_1)) = (1/2 · e^ε/(e^ε + 1)) / (1/(e^ε + 1) · 1/2) = e^ε.
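The asymmetric perturbation and its matching estimator (with p* = 1/2 and q* = 1/(e^ε + 1)) can be sketched as follows; names and simulation parameters are ours:

```python
import math
import random

def oue_report(v, d, eps):
    """Optimized UE sketch: perturb 1-bits and 0-bits differently.

    The 1-bit is reported as 1 w/p 1/2; a 0-bit is reported as 1 w/p
    q = 1/(e^eps + 1).  The worst-case ratio is still e^eps.
    """
    q = 1.0 / (math.exp(eps) + 1)
    return [1 if random.random() < (0.5 if i == v else q) else 0
            for i in range(d)]

def oue_estimate(reports, v, eps):
    """Unbiased estimate with p* = 1/2, q* = 1/(e^eps + 1)."""
    q = 1.0 / (math.exp(eps) + 1)
    n = len(reports)
    i_v = sum(y[v] for y in reports)
    return (i_v - n * q) / (0.5 - q)

random.seed(4)
d, eps = 16, 1.0
values = [3] * 20_000 + [7] * 30_000
reports = [oue_report(v, d, eps) for v in values]
print(round(oue_estimate(reports, 3, eps)))   # close to 20,000
```

Compared with basic UE at the same ε, the 0-bits are flipped less often, which is exactly what lowers q* and hence the variance.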

Optimized Local Hash (OLH)

In the original BLH, the secret is compressed into one bit (g = 2 groups), perturbed, and transmitted, with p = e^ε / (e^ε + g - 1), q = 1 / (e^ε + g - 1), and Pr[P(E(v)) = y] / Pr[P(E(v')) = y] = p / q = e^ε.

Two steps cause information loss:
- Compressing: loses a lot
- Perturbation: fairly accurate

Key insight: balance the two steps. By compressing into more groups, the first step carries more information. The variance is minimized when g = e^ε + 1. Read our paper for details!
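A sketch of OLH with g = e^ε + 1 groups (we pick ε = ln 3 so that g = 4 exactly; as before, each user draws a fresh truly random function D → {0, ..., g-1} as a stand-in for the paper's shared hash family):

```python
import math
import random

def olh_report(v, d, eps):
    """OLH sketch: hash v into one of g groups, then apply
    generalized randomized response over the g groups."""
    g = int(round(math.exp(eps))) + 1
    p = math.exp(eps) / (math.exp(eps) + g - 1)
    h = [random.randrange(g) for _ in range(d)]   # user's hash function
    y = h[v]
    if random.random() >= p:                      # report another group
        y = random.choice([t for t in range(g) if t != h[v]])
    return h, y

def olh_estimate(reports, v, eps):
    """c(v) = (I_v - n/g) / (p - 1/g): a report supports v when its
    group matches h(v); a non-holder supports v w/p 1/g."""
    g = int(round(math.exp(eps))) + 1
    p = math.exp(eps) / (math.exp(eps) + g - 1)
    n = len(reports)
    i_v = sum(1 for h, y in reports if h[v] == y)
    return (i_v - n / g) / (p - 1.0 / g)

random.seed(5)
d = 16
eps = math.log(3)                     # g = e^eps + 1 = 4
values = [0] * 12_000 + [9] * 28_000
reports = [olh_report(v, d, eps) for v in values]
print(round(olh_estimate(reports, 0, eps)))   # close to 12,000
```

With g > 2 the hash step discards less information, while the g-ary randomized response stays accurate enough; g = e^ε + 1 is the sweet spot.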

Comparison of Different Mechanisms

- Direct Encoding has greater variance for larger d.
- OUE and OLH have the same variance, but OLH has smaller communication cost.
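This comparison can be checked numerically from the "pure" variance formula Var[c(v)]/n = q*(1 - q*)/(p* - q*)^2, using Direct Encoding's p and q and the 4 e^ε / (e^ε - 1)^2 closed form for OUE/OLH (a sketch; function names are ours):

```python
import math

def de_var_factor(d, eps):
    """Var[c(v)]/n for Direct Encoding: q(1-q)/(p-q)^2 with
    p = e^eps/(e^eps + d - 1), q = 1/(e^eps + d - 1)."""
    p = math.exp(eps) / (math.exp(eps) + d - 1)
    q = 1.0 / (math.exp(eps) + d - 1)
    return q * (1 - q) / (p - q) ** 2

def oue_olh_var_factor(eps):
    """Var[c(v)]/n for OUE and OLH: 4 e^eps / (e^eps - 1)^2,
    independent of the domain size d."""
    return 4 * math.exp(eps) / (math.exp(eps) - 1) ** 2

eps = 1.0
for d in (2, 16, 256):
    print(d, de_var_factor(d, eps), oue_olh_var_factor(eps))
```

The Direct Encoding factor grows roughly linearly in d, overtaking the constant OUE/OLH factor already at moderate d (for very small d, Direct Encoding is still the better choice).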

Conclusion

We survey existing LDP protocols for frequency estimation, propose the "pure" framework and cast existing protocols into it, and optimize UE and BLH to obtain OUE and OLH.

Limitations:
- The variance is linear in n, which seems inevitable; a large number of users is therefore required.
- Cannot handle large domains.

Future work: handling large domains; handling set values.

Backup: Experiment Highlights

Dataset: Kosarak (also the Rockyou dataset and a synthetic dataset).
Competitors: RAPPOR, BLH, OLH. Randomized Response is not compared because the domain is large.
Key result: OLH performs orders of magnitude better, especially when ε is large. This also confirms our analytical conclusion.

Backup: Accuracy on Frequent Values
[Plot: accuracy vs. privacy trade-off; RAPPOR1 at ε = 4.39, RAPPOR2 at ε = 7.78; axes labeled "More Accuracy" and "More Privacy"]

Backup: On Information Quality

Backup: On Answering Multiple Questions

Previous works (including centralized DP) suggest splitting the privacy budget. For example, when a user answers two questions of equal importance, each question gets budget ε/2.

In the centralized setting there are sequential composition and parallel composition: by partitioning users, one uses parallel composition; by splitting the privacy budget, one uses sequential composition. The two produce essentially equivalent results.

What about the local setting?

Backup: On Answering Multiple Questions

Measure the accuracy of the frequency estimate, normalized since the two approaches use different numbers of users per question. Assuming OLH is used:
Var[c(v)/n] = q*(1 - q*) / (n (p* - q*)^2) = 4 e^ε / (n (e^ε - 1)^2)

Two settings:
- Split the privacy budget: Var[c(v)/n] = 4 e^(ε/2) / (n (e^(ε/2) - 1)^2)
- Partition the users: Var[c(v)/(n/2)] = 8 e^ε / (n (e^ε - 1)^2)

Algebra shows it is better to partition users: writing t = e^(ε/2), the ratio of the partition variance to the split variance is 2t / (t + 1)^2, which is less than 1 for all t > 1.
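The algebra can be double-checked numerically (helper names are ours; formulas are the two normalized OLH variances above):

```python
import math

def var_split(n, eps):
    """Normalized variance when each user answers both questions,
    spending budget eps/2 on each (OLH closed form)."""
    t = math.exp(eps / 2)
    return 4 * t / (n * (t - 1) ** 2)

def var_partition(n, eps):
    """Normalized variance when users are split in half and each half
    answers one question with the full budget eps."""
    return 8 * math.exp(eps) / (n * (math.exp(eps) - 1) ** 2)

# Partitioning wins across a range of budgets:
for eps in (0.5, 1.0, 2.0, 4.0):
    print(eps, var_partition(100_000, eps) < var_split(100_000, eps))
```

All four comparisons print True: halving the users while keeping the full budget beats halving the budget while keeping all users.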

Thanks to my coauthors!