Generating Private Synthetic Data: Presentation 2

Size: px

Start display at page:

Download "Generating Private Synthetic Data: Presentation 2"

Randell Perkins
6 years ago
Views:

1 Generating Private Synthetic Data: Presentation 2 Mentor: Dr. Anand Sarwate July 17, 2015

2 Overview 1 Project Overview (Revisited) 2 Sensitivity of Mutual Information 3 Simulation 4 Results with Real Data 5 Future Work

3 Recall the process Our goal is to publish synthetic data: we learn patterns and correlations in the population in a privacy-preserving way and then generate fake individuals/records so that the population statistics are similar. Given a dataset, study the patterns and correlations in the population. Build a probabilistic graphical model. Generate synthetic data that yields similar population stats. Preserve Privacy!

4 Chow-Liu Algorithm Variables of interest X 1, X 2,..., X m with dataset of size n. Create a complete graph using X 1, X 2,..., X m as vertices. For each pair of X i, X j calculate the mutual information I (X i : X j ), then use mutual information as edge weights. Find the maximum spanning tree

5 CLNode Python Class Iˆp (X i ; X j ) = a,b ˆP(X i = a, X j = b) log ˆP(X i = a, X j = b) ˆP(X i = a) ˆP(X j = b) CLNode.py Input: Range, Data Output: Empirical Distribution, Joint Empirical Distribution, Mutual Information

6 Differential Privacy Let M be a randomized data release algorithm. We say that M is ɛ differentially private if for all D 1, D 2 that differ in only one element and any event S, Pr(M(D 1 ) S) e ɛ Pr(M(D 2 ) S)

7 Differential Privacy Let M be a randomized data release algorithm. We say that M is ɛ differentially private if for all D 1, D 2 that differ in only one element and any event S, Pr(M(D 1 ) S) e ɛ Pr(M(D 2 ) S) Differential privacy ensures that any sequence of outputs (responses to queries) is essentially equally likely to occur, independent of the presence or absence of any individual.

8 Sensitivity Let f : D R be a function. The sensitivity of f is defined as, S(f ) = max f (D) f D,D D (D ) where D and D only differ by 1 element.

9 Sensitivity of Mutual Information Let X, Y be discrete random variables with dataset of size n then S n (I (X ; Y )) 2( 1+1/n 2 log( 1+1/n 2 ) 1 1/n 2 log( 1 1/n 2 ) 1 n log( 1 n ))

10 Sensitivity of Mutual Information Let X, Y be discrete random variables with dataset of size n then S n (I (X ; Y )) 2( 1+1/n 2 log( 1+1/n 2 ) 1 1/n 2 log( 1 1/n 2 ) 1 n log( 1 n )) worst case

11 Sensitivity of Mutual Information

12 The Laplace Mechanism Given any function f : D R the Laplace mechanism is defined as: M L (D, f ( ), ɛ) = f (D) + Z where Z Lap(S(f )/ɛ) Theorem: The Laplace mechanism preserves ɛ differential privacy.

13 Doubly Symmetric Binary Source Let X 1, X 2 be binary variables, if p X1,X 2 (x 1, x 2 ) is given by, 1 a 2 a 2 a 2 1 a 2 then X 1, X 2 form a doubly symmetric binary source. X 1 Bernoulli(0.5) X 2 = (X 1 + Z 12 ) mod2, where Z 12 Bernoulli(a) X1 z12 z13 z14 X3 X2 X4 z47 z25 z26 X7 X5 X6 simulation tree

14 Simulation Results No noise added, non-private

15 Simulation Results No noise added, non-private Noise added, ɛ = 0.3

16 Simulation Results Fraction of correctly recovering the tree

17 Simulation Results Average number of edges missed

18 Adult Data Set Sample Size: (after deleting entries with missing values) 9 Variables Selected: Age, Workclass, Education, Marital Status, Occupation, Race, Gender Annual Income >50K, Working hours per week.

19 Results Hours per Week Age Occupation Race Marital Status Workclass Education Gender Income50K No noise added, non-private

20 Results Hours per Week Hours per Week Age Occupation Race Age Occupation Race Marital Status Workclass Education Marital Status Workclass Education Gender Income50K Gender Income50K No noise added, non-private Noise added, ɛ = 1, 100%

21 Results Hours per Week Hours per Week Age Occupation Race Age Occupation Race Marital Status Workclass Education Marital Status Workclass Education Gender Income50K Gender Income50K No noise added, non-private Noise added, ɛ = 0.75, 90%

22 Results Hours per Week Hours per Week Age Occupation Race Age Occupation Race Marital Status Workclass Education Marital Status Workclass Education Gender Income50K Gender Income50K No noise added, non-private Noise added, ɛ = 0.5, 60%

23 Results Hours per Week Hours per Week Age Occupation Race Age Occupation Marital Status Workclass Education Marital Status Workclass Education Gender Income50K Gender Income50K Race No noise added, non-private Noise added, ɛ = 0.25, 25%

24 Results Hours per Week Hours per Week Age Occupation Race Age Occupation Marital Status Workclass Education Marital Status Workclass Education Race Gender Income50K Gender Income50K No noise added, non-private Noise added, ɛ = 0.1, 30%

25 Retrospect Race Age Workclass Education Marital-Status Occupation Gender Income>50K Hours per Week Mutual information btw Race and others

26 Retrospect Race Age Workclass Education Marital-Status Occupation Gender Income>50K Hours per Week Mutual information btw Race and others Education Age Workclass Marital-Status Occupation Race Gender Income>50K Hours per Week Mutual information btw Education and others

27 The Next Step Generate Synthetic Data from the tree structure Use empirical conditional distributions Logistic regression Measure the performance of data generated Extend the scope into continuous data

28 Thank You

t-closeness: Privacy beyond k-anonymity and l-diversity

t-closeness: Privacy beyond k-anonymity and l-diversity paper by Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian presentation by Caitlin Lustig for CS 295D Definitions Attributes: explicit identifiers,