Enabling Accurate Analysis of Private Network Data


Enabling Accurate Analysis of Private Network Data. Michael Hay, joint work with Gerome Miklau, David Jensen, Chao Li, and Don Towsley (University of Massachusetts, Amherst) and Vibhor Rastogi and Dan Suciu (University of Washington). October 8, 2009.

Analysis of social networks. Social network derived from mobile phone call records [Onnela, PNAS 07]: 4.6M nodes and 7.0M edges, with an edge between two users if they exchanged reciprocal phone calls during an 18-week interval. Can we enable analysts to study networks in a way that protects sensitive information about participants?

How to achieve both privacy and utility? The data owner holds the private network; the analyst poses questions such as: What is the diameter? What is the maximum degree? How many 3-cliques are there? Is Alice connected to Bob? The goal is to allow aggregate statistics provided facts about individuals are not disclosed.

[Dwork, TCC 06] Query answer perturbation. The data owner holds a graph G; the analyst poses a query Q (e.g., "how many edges?") and receives a perturbed answer A = Q(G) + noise. In the running example, the true answer 11 plus noise 2.3 yields the released answer 13.3. The same mechanism applied to any neighboring graph G' (differing from G by a single edge) returns A = Q(G') + noise. Writing p = Pr[A = x | μ = Q(G)] and q = Pr[A = x | μ = Q(G')], differential privacy requires that p and q differ by at most a factor of e^ε. Dwork, McSherry, Nissim, and Smith [Dwork, TCC 06] describe an answer perturbation mechanism satisfying differential privacy; comparatively few results exist for these techniques applied to graphs.
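Why answer perturbation with Laplace noise meets the e^ε bound is a standard one-line calculation, not spelled out on the slides (Δ here denotes the query's sensitivity, defined on the next slide). With noise density proportional to exp(−ε |x − μ| / Δ):

p / q = exp( ε (|x − Q(G')| − |x − Q(G)|) / Δ ) ≤ exp( ε |Q(G) − Q(G')| / Δ ) ≤ e^ε,

using the triangle inequality and the fact that, by the definition of sensitivity, |Q(G) − Q(G')| ≤ Δ for neighboring graphs.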

[Dwork, TCC 06] Calibrating noise. The following algorithm for answering Q is ε-differentially private:

A = Q(G) + Noise(Δ / ε)

where Q(G) is the true answer, Noise(·) is a sample from the Laplace distribution, ε is the privacy parameter, and Δ is the sensitivity of Q: the maximum change in Q over any two graphs differing by a single edge.
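As a concrete illustration, here is a minimal Python sketch of this calibration rule (a sketch, not the authors' code; the function name and the ε value are my own choices, while Q(G) = 11 and Δ = 1 follow the edge-count running example):

import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    # Noise(Δ/ε): one draw from the Laplace distribution with scale Δ/ε.
    rng = rng if rng is not None else np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Edge-count query: adding or removing one edge changes the count by exactly 1,
# so Δ = 1. The true answer Q(G) = 11 is the slides' running example.
print(laplace_mechanism(11, sensitivity=1.0, epsilon=0.5))  # e.g. 13.3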

Positive results in differential privacy. Some common analyses have low sensitivity: contingency tables, histograms [Dwork, TCC 06]. Data mining algorithms can be implemented using only low-sensitivity queries: PCA, k-means, decision trees [Blum, PODS 05]. Learning theory: it is possible to learn any concept class with polynomial VC dimension, and half-space queries can be learned efficiently [Blum, STOC 08]. Many challenges remain: going beyond tabular data; finding optimal query strategies.

[ICDM 09] Accurate degree distribution estimation is possible. Degree distribution: the frequency of each degree in the graph; a widely studied property of networks. [Figure: an example graph G with its degree distribution shown three equivalent ways: as a histogram over degrees, as a CCDF, and as the degree sequence written as a nondecreasing vector.]
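To make the three views concrete, here is a small Python sketch on a toy graph (the edge list is a made-up example, not the graph from the slide):

import numpy as np

# Toy undirected graph as an edge list (hypothetical example).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]

degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

deg_seq = sorted(degree.values())          # degree sequence as a nondecreasing vector
hist = np.bincount(deg_seq)                # hist[d] = number of nodes with degree d
ccdf = 1 - np.cumsum(hist) / len(deg_seq)  # fraction of nodes with degree > d

print(deg_seq)  # [1, 2, 2, 2, 3]
print(hist)     # [0, 1, 3, 1]
print(ccdf)     # [1.0, 0.8, 0.2, 0.0]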

Using the sort constraint. [Figure: the sorted degree sequence S(G), the noisy observations (ε = 2), and the inferred degree sequence; one highlighted observation is the 19th smallest degree plus noise.] The output of the sorted degree query is not (in general) sorted. We derive a new sequence by computing the closest nondecreasing sequence, i.e., the one minimizing the L2 distance.
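Finding the closest nondecreasing sequence under L2 is exactly isotonic regression, which the pool-adjacent-violators algorithm solves in linear time. Below is a minimal self-contained Python sketch (my own implementation, not the authors' code):

import numpy as np

def isotonic_l2(y):
    # Closest nondecreasing sequence to y in L2, via pool adjacent violators.
    # Each block is [start_index, block_mean, block_size].
    blocks = []
    for i, v in enumerate(y):
        blocks.append([i, float(v), 1])
        # Merge the last two blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][1] >= blocks[-1][1]:
            s2, m2, n2 = blocks.pop()
            s1, m1, n1 = blocks.pop()
            blocks.append([s1, (m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    out = np.empty(len(y))
    for s, m, n in blocks:
        out[s:s + n] = m
    return out

print(isotonic_l2([12, 9, 11, 10, 18]))  # [10.5 10.5 10.5 10.5 18. ]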

Experimental results. [Figure: degree distributions (original, noisy, and inferred) for three networks: livejournal (n = 5.3M), orkut (n = 3.1M), and a power-law graph (α = 1.5, n = 5M), at privacy levels ε = .001 and ε = .01.]

After inference, noise only where needed. [Figure: true degree sequence, noisy observations (ε = 2), and inferred degree sequence, as before.] Standard Laplace noise is sufficient but not necessary for differential privacy. By using inference, we effectively apply a different noise distribution: more noise where it is needed, less otherwise. The improvement in accuracy depends on the sequence. Mean squared error: Θ(n/ε²) before inference versus O(d log³ n / ε²) after inference, where d is the number of distinct degrees.
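A hypothetical end-to-end run illustrates the gain (this uses isotonic_l2 from the sketch above; the synthetic sequence, the ε value, and the use of sensitivity 2 for the sorted-degree query, since one edge changes two degrees by one each, are my assumptions for illustration, not figures from the talk):

import numpy as np

rng = np.random.default_rng(0)
epsilon = 2.0
sensitivity = 2.0  # assumption: one edge changes two entries of the degree sequence

# Synthetic sorted degree sequence with only d = 3 distinct degrees among n = 500.
true_seq = np.repeat([1.0, 10.0, 18.0], [50, 400, 50])

noisy = true_seq + rng.laplace(scale=sensitivity / epsilon, size=true_seq.size)
inferred = isotonic_l2(noisy)

def mse(est):
    return float(np.mean((est - true_seq) ** 2))

print(f"MSE before inference: {mse(noisy):.3f}")     # about 2 = Laplace variance 2(Δ/ε)²
print(f"MSE after inference:  {mse(inferred):.3f}")  # much smaller: few distinct degrees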

Conclusion. It is possible to accurately estimate degree distributions while providing strong guarantees of privacy. Other findings: some network analyses cannot be accurately answered under differential privacy, such as clustering coefficient and motif analysis [Nissim, STOC 07] [PODS 09]; inference can be applied to other queries, e.g., histograms [CoRR 09]. Future work: can we generate accurate synthetic networks under differential privacy?

Questions? Additional details on our work may be found here:
[ICDM 09] M. Hay, C. Li, G. Miklau, and D. Jensen. Accurate estimation of the degree distribution of private networks. In International Conference on Data Mining (ICDM), to appear, 2009.
[CoRR 09] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially-private queries through consistency. CoRR, abs/0904.0942, 2009. (under review)
[PODS 09] V. Rastogi, M. Hay, G. Miklau, and D. Suciu. Relationship privacy: Output perturbation for queries with joins. In Principles of Database Systems (PODS), 2009.
[VLDB 08] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resisting structural re-identification in anonymized social networks. In VLDB Conference, 2008.
http://www.cs.umass.edu/~mhay/

References
[Blum, PODS 05] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical Privacy: The SuLQ Framework. In PODS, 2005.
[Blum, STOC 08] A. Blum, K. Ligett, and A. Roth. A Learning Theory Approach to Non-Interactive Database Privacy. In STOC, 2008.
[Onnela, PNAS 07] J. Onnela et al. Structure and tie strengths in mobile communication networks. PNAS, 2007.
[Liu, SIGMOD 08] K. Liu and E. Terzi. Towards identity anonymization on graphs. In SIGMOD, 2008.
[Zhou, ICDE 08] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks. In ICDE, 2008.
[Zou, VLDB 09] L. Zou, L. Chen, and T. Özsu. K-automorphism: A general framework for privacy preserving network publication. In VLDB Conference, 2009.
[Ying, SDM 08] X. Ying and X. Wu. Randomizing social networks: a spectrum preserving approach. In SIAM International Conference on Data Mining (SDM), 2008.

References (cont'd)
[Cormode, VLDB 09] G. Cormode, D. Srivastava, S. Bhagat, and B. Krishnamurthy. Class-based graph anonymization for social network data. In VLDB Conference, 2009.
[Campan, PinKDD 08] A. Campan and T. M. Truta. A clustering approach for data and structural anonymity in social networks. In PinKDD, 2008.
[Rastogi, VLDB 07] V. Rastogi, S. Hong, and D. Suciu. The boundary between privacy and utility in data publishing. In VLDB, pages 531-542, 2007.
[Dwork, TCC 06] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Third Theory of Cryptography Conference (TCC), 2006.
[Nissim, STOC 07] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In STOC, pages 75-84, 2007.