RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response

1 RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response
Úlfar Erlingsson, Vasyl Pihur, Aleksandra Korolova (Google & USC)
Presented by: Pat Pannuto

2 RAPPOR, what is it good for? (Absolutely something!)
1. Google wants to collect user metrics
2. Google doesn't want to be creepy
   Or subject to subpoenas, etc.
3. Generic tool to collect pretty much any information of interest
   Booleans
   Ordinals
   Numeric values
   Arbitrary strings (!)

3 (Refresher): Randomized Survey Mechanism
Consider a potentially embarrassing question: Did you vote for Donald Trump?
1. Flip a coin
   If heads: Say yes.
   If tails: Flip the coin again
      If heads: Say no.
      If tails: Answer truthfully
2. Writing $P(\text{reported} \mid \text{true})$:
   $P(Y \mid Y) = 0.5 + 0.25 = 0.75$; $P(Y \mid N) = 0.5 + 0 = 0.5$; $P(N \mid Y) = 0.25 + 0 = 0.25$; $P(N \mid N) = 0.25 + 0.25 = 0.5$
But what if I ask the same question again tomorrow?
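The two-coin survey above can be simulated to see why the aggregate is still recoverable. This is a minimal sketch, not part of the original slides: the function names, the 30% true "yes" rate, and the population size are all assumptions chosen for illustration. It relies on the fact that, if a fraction pi of respondents would truthfully say yes, the mechanism reports yes with probability 0.75 * pi + 0.5 * (1 - pi) = 0.5 + 0.25 * pi.

```python
import random

def randomized_answer(truth: bool, rng: random.Random) -> bool:
    """One round of the two-coin survey; returns the reported answer."""
    if rng.random() < 0.5:   # first flip is heads: always say yes
        return True
    if rng.random() < 0.5:   # second flip is heads: always say no
        return False
    return truth             # second flip is tails: answer truthfully

def estimate_true_yes_rate(reports: list[bool]) -> float:
    """Invert P(report yes) = 0.5 + 0.25 * pi to recover pi."""
    observed = sum(reports) / len(reports)
    return (observed - 0.5) / 0.25

rng = random.Random(0)
# Hypothetical population in which about 30% would truthfully answer yes.
population = [rng.random() < 0.3 for _ in range(200_000)]
reports = [randomized_answer(t, rng) for t in population]
estimate = estimate_true_yes_rate(reports)
```

No single report reveals much about its sender, yet `estimate` should land close to the assumed 30% rate because the noise averages out over many respondents.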

4 Memoization enables privacy tradeoff
The idea: Play the randomized response game twice
For actual answer A, generate a permanent randomized response R
   Client saves a permanent mapping of A's -> R's for all time
For every query, generate a noisy response randomly from R
Longitudinal attacks reveal R, not A
Noisy responses mitigate short-term tracking
Not protected: Long-term widespread tracking ("big data")
   Protected by policy, e.g. data retention rules
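A minimal sketch of the memoization idea, under stated assumptions: the class `MemoizingClient` and its methods are hypothetical, and reusing the same two-coin game for both rounds is a simplification made here, not something the slide specifies. The point it illustrates is only that R is drawn once per actual answer and every later query is a fresh noisy view of R.

```python
import random

class MemoizingClient:
    """Hypothetical client that plays the randomized response game twice."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._permanent = {}  # maps actual answer A -> permanent response R

    def _two_coin(self, value: bool) -> bool:
        # Assumed noise model: heads -> yes; tails then heads -> no;
        # tails then tails -> pass the input through unchanged.
        if self._rng.random() < 0.5:
            return True
        if self._rng.random() < 0.5:
            return False
        return value

    def report(self, actual: bool) -> bool:
        # First round: draw R once and keep it for all time.
        if actual not in self._permanent:
            self._permanent[actual] = self._two_coin(actual)
        # Second round: every query gets fresh noise on top of R.
        return self._two_coin(self._permanent[actual])

client = MemoizingClient(seed=1)
daily_reports = [client.report(True) for _ in range(30)]
```

Averaging `daily_reports` over many days converges on R rather than on the actual answer, which is the longitudinal property the slide describes.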

5 Memoization alone is not sufficient
Guarantees weaken as the true value changes
   e.g. report your age in days, every day

6 The RAPPOR Algorithm
1. Given actual value v, use h hash functions to populate a Bloom filter B of size k
2. Permanent Response: for each bit i in the Bloom filter,
   $B'_i = 1$ with probability $\tfrac{1}{2}f$; $0$ with probability $\tfrac{1}{2}f$; $B_i$ with probability $1 - f$
   where f is a parameter that controls the longitudinal privacy guarantee
3. Instantaneous Response: for each bit i in response S,
   $P(S_i = 1) = q$ if $B'_i = 1$; $p$ if $B'_i = 0$
Pipeline: v -> B -> B' -> S
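The three steps of the algorithm can be sketched as follows. This is an illustration under assumptions, not the reference implementation: the function names are invented here, SHA-256 salted with a cohort number and hash index stands in for whatever hash family a real client uses, and the parameter values in the usage lines are arbitrary.

```python
import hashlib
import random

def bloom_bits(value: str, cohort: int, h: int, k: int) -> list[int]:
    """Step 1: set up to h bits of a k-bit Bloom filter B for value v."""
    bits = [0] * k
    for i in range(h):
        # Assumed hash family: SHA-256 salted with cohort and hash index.
        digest = hashlib.sha256(f"{cohort}:{i}:{value}".encode()).digest()
        bits[int.from_bytes(digest[:8], "big") % k] = 1
    return bits

def permanent_response(bits: list[int], f: float, rng: random.Random) -> list[int]:
    """Step 2: B'_i is 1 w.p. f/2, 0 w.p. f/2, and B_i w.p. 1 - f."""
    out = []
    for b in bits:
        u = rng.random()
        if u < f / 2:
            out.append(1)
        elif u < f:
            out.append(0)
        else:
            out.append(b)
    return out

def instantaneous_response(perm: list[int], p: float, q: float,
                           rng: random.Random) -> list[int]:
    """Step 3: S_i is 1 w.p. q if B'_i is 1, and w.p. p if B'_i is 0."""
    return [1 if rng.random() < (q if b else p) else 0 for b in perm]

rng = random.Random(42)
b = bloom_bits("example.com", cohort=3, h=2, k=128)
b_prime = permanent_response(b, f=0.5, rng=rng)   # compute once, then store
s = instantaneous_response(b_prime, p=0.5, q=0.75, rng=rng)  # fresh per report
```

In a real client `b_prime` would be memoized per value, and only `s` would ever leave the machine.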

7 Variations on RAPPOR
(aka: When Pat wonders if this isn't what Google is really doing in practice)
One-Time RAPPOR
   Skip generation of S, just report the permanent response B'
Basic RAPPOR
   No Bloom filter (i.e. map responses directly to bits; equivalently h = 1)
Basic One-Time RAPPOR
   Combine the above
Key: The one-time variants don't actually memoize (fixes the space problem at the expense of longitudinal privacy)

8 How Private (and proofs!*)?
*in the paper.
Permanent Randomized Response
   $\varepsilon_\infty = 2h \ln\left(\frac{1 - \frac{1}{2}f}{\frac{1}{2}f}\right)$
   Small note: there is no k here, i.e. Bloom filters do not provide differential privacy
Instantaneous Randomized Response
   Probabilities of seeing a 1 given B set or not set ($b_i$ is bit i of the true Bloom filter B):
   $q^* = P(S_i = 1 \mid b_i = 1) = \frac{1}{2}f(p + q) + (1 - f)q$
   $p^* = P(S_i = 1 \mid b_i = 0) = \frac{1}{2}f(p + q) + (1 - f)p$
   $\varepsilon_1 = h \log\left(\frac{q^*(1 - p^*)}{p^*(1 - q^*)}\right)$
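The two bounds are easy to evaluate numerically. The sketch below uses hypothetical function names and reads the logarithm in the instantaneous bound as the natural log, which is an assumption; it is the reading under which f = 0, p = 0.5, q = 0.75, h = 1 gives the ln(3) quoted on later slides.

```python
import math

def epsilon_permanent(f: float, h: int) -> float:
    """Longitudinal bound: 2h * ln((1 - f/2) / (f/2)). Requires 0 < f <= 1."""
    return 2 * h * math.log((1 - f / 2) / (f / 2))

def epsilon_instantaneous(f: float, p: float, q: float, h: int) -> float:
    """Single-report bound: h * ln(q*(1 - p*) / (p*(1 - q*)))."""
    q_star = 0.5 * f * (p + q) + (1 - f) * q   # P(S_i = 1 | b_i = 1)
    p_star = 0.5 * f * (p + q) + (1 - f) * p   # P(S_i = 1 | b_i = 0)
    return h * math.log((q_star * (1 - p_star)) / (p_star * (1 - q_star)))

eps_one_time = epsilon_instantaneous(f=0.0, p=0.5, q=0.75, h=1)
eps_longitudinal = epsilon_permanent(f=0.5, h=2)
```

Neither function takes k as an argument, which mirrors the slide's note that the Bloom filter size does not enter the bound.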

9 Undoing all that hard work: Learning from RAPPOR-collected data
Mitigate hash collisions via cohorts
For each cohort, attempt to reconstruct aggregate real Bloom filters
   $t_{ij} = \frac{c_{ij} - \left(p + \frac{1}{2}fq - \frac{1}{2}fp\right) N_j}{(1 - f)(q - p)}$
   $c_{ij}$: count of times bit i is set in S for cohort j
   $N_j$: number of reports for cohort j
   $t_{ij}$: estimate of times bit i is set in hidden B for all reporters in cohort j
Consolidate into a vector Y of all $t_{ij}$'s, $i \in [1, k]$; $j \in [1, m]$
Create a design matrix X of size km x M, where M is the number of candidate strings
   Columns of X contain hm 1's, the concatenation of all m cohorts' Bloom filters
Lasso regression for Y ~ X, then least squares, then Bonferroni correction of 0.05/M [or Benjamini-Hochberg]
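The per-bit de-noising step can be sketched in a few lines. The function name and the numbers in the usage lines are assumptions for illustration; the Lasso and least-squares stages are not shown.

```python
def estimate_true_counts(c_j: list[float], n_j: int,
                         f: float, p: float, q: float) -> list[float]:
    """Estimate t_ij for one cohort j from observed bit counts c_ij.

    c_j holds, for each bit i, how many of the n_j reports had S_i = 1.
    """
    offset = (p + 0.5 * f * q - 0.5 * f * p) * n_j  # expected count if no true bits were set
    scale = (1 - f) * (q - p)                        # extra count per truly set bit
    return [(c - offset) / scale for c in c_j]

# Hypothetical cohort: 1000 reports, and bit 0 is truly set for 400 reporters.
# With f = 0.5, p = 0.5, q = 0.75 the expected observed count is
# 0.6875 * 400 + 0.5625 * 600 = 612.5.
t_j = estimate_true_counts([612.5, 562.5], n_j=1000, f=0.5, p=0.5, q=0.75)
```

On real data the estimates are noisy and can be negative or exceed `n_j`; the regression and significance correction that follow are what turn them into a list of detected strings.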

10 RAPPOR parameter selection
Must choose f, p, q, k, h, m
Recall, k and m do not affect privacy bounds
   $\varepsilon_\infty = 2h \ln\left(\frac{1 - \frac{1}{2}f}{\frac{1}{2}f}\right)$

11 What can Basic One-Time RAPPOR learn?
For f = 0, p = 0.5, q = 0.75, and confidence =
And a uniform distribution of strings
   Uniform -> SNR problem
For ln(3)-differential privacy:
   Roughly $\sqrt{N}/10$ strings for N samples
   1% frequency -> 1 million samples
   0.01% -> 10 billion
No theoretical analysis for real RAPPOR / non-uniform samples
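A quick arithmetic check of the scaling on this slide, reading the bound as sqrt(N)/10 learnable strings for N samples; that is the reading that reproduces both data points (1% -> 1 million, 0.01% -> 10 billion). The function names are hypothetical and the uniform-distribution setting of the slide is assumed.

```python
import math

def max_learnable_strings(n_samples: float) -> float:
    """Rough upper bound on distinct strings detectable from n_samples reports."""
    return math.sqrt(n_samples) / 10

def samples_needed(frequency: float) -> float:
    """Samples required to detect strings that each occur at the given frequency.

    Under a uniform distribution there are 1/frequency strings, so solve
    sqrt(N) / 10 = 1 / frequency for N.
    """
    return (10 / frequency) ** 2

n_for_one_percent = samples_needed(0.01)       # about 1 million
n_for_rare_strings = samples_needed(0.0001)    # about 10 billion
```

The quadratic growth is the practical cost: detecting strings 100 times rarer needs 10,000 times as many reports.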

12 Trade-off: False Discovery Rate vs Rare String Detection

13 Simulating learning a normal distribution q = 0.75, p = 0.5, ε = ln (3), f = 0

14 Exponential distribution of 1 million strings
Query: Is string present?
p = 0.5, q = 0.75, f = 0.5, h = 2, k = 128, m = 16
Also two false positives
The point: The tail is hard
Caught everything > 1%

15 Real-world data Windows Process Names Chrome Homepages - 187k reports; 10k machines - unexpected frequency - ~2% have BADAPPLE - how did they search??

16 Final thoughts
High-level concept is simple and intuitive
   2-level randomized response
Extracting information requires knowing it is there
Unclear how well client-side permanent randomized response scales

Locally Differentially Private Protocols for Frequency Estimation. Tianhao Wang, Jeremiah Blocki, Ninghui Li, Somesh Jha
