Modeling Data Correlations in Private Data Mining with Markov Model and Markov Networks. Yang Cao, Emory University


Outline
- Data Mining with Differential Privacy (DP)
- Scenario: Spatiotemporal Data Mining using DP
- Markov Chain for temporal correlations
- Gaussian Markov Random Field for user-user correlations
- Summary and open problems


Data Mining. A company or institute holds a sensitive database and mines it; once the results are made public, they are exposed to attack by a public attacker.

Privacy-Preserving Data Mining (PPDM). How? ε-Differential Privacy: the institute releases noisy data instead of the sensitive data X, so the adversary can only attack the noisy output.

What is Differential Privacy. Privacy: the right to be forgotten. DP: the output of an algorithm should NOT be significantly affected by any individual's data. Formally, M satisfies ε-DP if for any neighboring databases D, D' (differing in one record) and any output r: |log( Pr(M(Q(D)) = r) / Pr(M(Q(D')) = r) )| ≤ ε. e.g., Laplace mechanism: add Lap(1/ε) noise to Q(D) for a sensitivity-1 query. Sequential composition: e.g., running M twice yields 2ε-DP. Smaller ε means stronger privacy; e.g., 2ε-DP allows more privacy loss than ε-DP.
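The Laplace mechanism above can be sketched in a few lines of Python (a minimal illustration assuming a sensitivity-1 counting query; the function name and values are mine, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Release an answer under epsilon-DP by adding Laplace noise
    with scale sensitivity/epsilon."""
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A counting query has sensitivity 1: one individual's presence
# changes the count by at most 1, so Lap(1/eps) noise suffices.
noisy = laplace_mechanism(42, sensitivity=1, epsilon=0.5)

# Sequential composition: answering twice with epsilon each
# consumes a total privacy budget of 2*epsilon.
noisy_again = laplace_mechanism(42, sensitivity=1, epsilon=0.5)
```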

An open problem of DP on correlated data. When data are independent: M(Q(D)) vs. M(Q(D')) satisfies ε-DP. When data are correlated (e.g., u1 and u3 always report the same value), the effective guarantee is unclear: ?-DP. It is still controversial [*][**] what DP guarantees on correlated data. [*] Differential Privacy as a Causal Property, https://arxiv.org/abs/1710.05899 [**] https://github.com/frankmcsherry/blog/blob/master/posts/2016-08-29.md
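The u1/u3 example can be checked numerically. The toy sketch below (my own illustration, not from the talk) releases Q(D) = x1 + x2 with noise calibrated to sensitivity 1; if the attacker knows that x2 always equals x1, the realized log-ratio of output densities reaches 2ε rather than ε:

```python
import numpy as np

def laplace_logpdf(r, mu, scale):
    # Log-density of Laplace(mu, scale) at r.
    return -abs(r - mu) / scale - np.log(2 * scale)

eps = 1.0
scale = 1.0 / eps  # noise calibrated for sensitivity 1

# Query Q(D) = x1 + x2.  Under perfect correlation (x2 == x1),
# changing the victim's value from 0 to 1 shifts Q(D) by 2, not 1.
rs = np.linspace(-5, 5, 10001)
loss = np.max(np.abs(laplace_logpdf(rs, 0, scale) - laplace_logpdf(rs, 2, scale)))
# loss equals 2*eps: the effective privacy loss doubles.
```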

Quantifying DP on Correlated Data. A few recent papers [Cao17][Yang15][Song17] use a quantification approach to achieve ε-DP (protecting each user's private data value). Traditional approach (if the attacker knows the correlations, ε-DP may not hold): sensitive data → Laplace Mechanism Lap(1/ε) → ε-DP data. Quantification approach (protects against attackers with knowledge of the correlations): sensitive data → model data correlations + attacker inference → Laplace Mechanism Lap(1/ε′) with a calibrated ε′ → ε-DP data. [Cao17]: Markov Chain; [Yang15]: Gaussian Markov Random Field (GMRF); [Song17]: Bayesian Network.

Outline
- Data Mining with Differential Privacy
- Scenario: Spatiotemporal Data Mining using DP
- Markov Chain for temporal correlations
- Gaussian Markov Random Field for user-user correlations
- Summary and open problems

Spatiotemporal Data Mining with DP. At each timestamp t = 1, 2, 3, …: (a) Location Data — the sensitive table D_t records each user's location (e.g., u1: loc3, loc1, loc1; u2: loc2, loc4, loc5; u3: loc2, loc4, loc5; u4: loc4, loc5, loc3). (b) True Counts — a count query aggregates how many users are at each location at each t. (c) Private Counts — adding Laplace noise Lap(1/ε) to each count yields an ε-DP release r_t at each timestamp.
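The (a) → (b) → (c) pipeline can be sketched as follows (a toy reconstruction using the slide's example trajectories; ε and the RNG seed are arbitrary):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
eps = 1.0

# (a) Location data: each user's location at t = 1, 2, 3.
trajectories = {
    "u1": ["loc3", "loc1", "loc1"],
    "u2": ["loc2", "loc4", "loc5"],
    "u3": ["loc2", "loc4", "loc5"],
    "u4": ["loc4", "loc5", "loc3"],
}
locations = [f"loc{i}" for i in range(1, 6)]

# (b) True counts: how many users are at each location at each t.
true_counts = []
for t in range(3):
    c = Counter(traj[t] for traj in trajectories.values())
    true_counts.append([c.get(loc, 0) for loc in locations])

# (c) Private counts: add Lap(1/eps) noise per cell; each user
# contributes to one cell per timestamp, so each release is eps-DP.
private_counts = np.array(true_counts, float) + rng.laplace(scale=1 / eps, size=(3, 5))
```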

What types of data correlations? (a) Road Network → temporal correlation for a single user: where you can be at 8:00 depends on where you were at 7:00. (b) Social Ties (e.g., u1 and u2 are a couple; u2 and u3 are colleagues) → user-user (spatial) correlation: socially connected users tend to visit the same places, as in the location data (7:00, 8:00, 9:00) where u1: loc3, loc1, loc1 and u2: loc2, loc1, loc1 overlap.

Outline
- Data Mining with Differential Privacy
- Scenario: Spatiotemporal Data Mining using DP
- Markov Chain for temporal correlations
  - what is a Markov Chain (MC)
  - how the attacker can learn an MC from data
  - how the attacker can infer private data using an MC
- Gaussian Markov Random Field for user-user correlations
- Summary and open problems

What is a Markov Chain. A Markov chain is a stochastic process with the Markov property. 1st-order Markov property: the state at time t depends only on the state at time t-1: Pr(x_t | x_{t-1}) = Pr(x_t | x_{t-1}, …, x_1). Time-homogeneous: the transition matrix is the same after each step: for all t > 0, Pr(x_{t+1} | x_t) = Pr(x_{t+2} | x_{t+1}). Raw trajectories (7:00, 8:00, 9:00): u1: loc1, loc3, loc2; u2: loc2, loc2, loc2; u3: loc3, loc1, loc1; u4: loc1, loc2, loc2. Transition matrix (rows: state at t; columns: state at t+1; over loc1/loc2/loc3):
loc1: 0.2 0.1 0.7
loc2: 0.1 0.2 0.7
loc3: 0.3 0.4 0.3
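A minimal simulation of such a time-homogeneous chain (a sketch; the transition digits are reconstructed from the slide so that rows sum to 1, and the starting state and length are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Transition matrix over {loc1, loc2, loc3}: row i gives
# Pr(next state | current state = i), so each row sums to 1.
P = np.array([[0.2, 0.1, 0.7],
              [0.1, 0.2, 0.7],
              [0.3, 0.4, 0.3]])
assert np.allclose(P.sum(axis=1), 1.0)

def simulate(P, start, steps):
    """Sample a trajectory: each next state depends only on the
    current state (1st-order Markov property)."""
    states = [start]
    for _ in range(steps):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

traj = simulate(P, start=0, steps=10)
```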

How can the attacker learn an MC? If the attacker knows partial user trajectories, he can directly learn the transition matrix by maximum-likelihood estimation. If the attacker knows the road network, he may learn the MC using a Google-like model [*]. [*] E. Crisostomi, S. Kirkland, and R. Shorten, "A Google-like model of road network dynamics and its application to regulation and control," International Journal of Control, vol. 84, no. 3, pp. 633–651, Mar. 2011.
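Maximum-likelihood estimation of the transition matrix is just transition counting plus row normalization. A sketch using the raw trajectories from the previous slide:

```python
import numpy as np

# Raw trajectories observed by the attacker (the slide's example).
trajs = [["loc1", "loc3", "loc2"],
         ["loc2", "loc2", "loc2"],
         ["loc3", "loc1", "loc1"],
         ["loc1", "loc2", "loc2"]]
locs = ["loc1", "loc2", "loc3"]
idx = {l: i for i, l in enumerate(locs)}

# MLE: count observed transitions, then normalize each row.
counts = np.zeros((3, 3))
for traj in trajs:
    for a, b in zip(traj, traj[1:]):
        counts[idx[a], idx[b]] += 1
row_sums = counts.sum(axis=1, keepdims=True)
P_hat = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
```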

How can the attacker infer private data using an MC? [Model Attacker → Define TPL → Find structure of TPL] Model temporal correlations using a Markov chain, e.g., user i: loc1 → loc3 → loc2. Two matrices over loc1/loc2/loc3:
(a) Backward temporal correlation P_i^B = Pr(l_i^{t-1} | l_i^t) (rows: location at time t):
loc1: 0.1 0.2 0.7
loc2: 0 1 0
loc3: 0.3 0.3 0.4
(b) Forward temporal correlation P_i^F = Pr(l_i^t | l_i^{t-1}) (rows: location at time t-1):
loc1: 0.2 0.3 0.5
loc2: 0.1 0.1 0.8
loc3: 0.6 0.2 0.2
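The backward matrix need not be given separately: from the forward matrix and a prior over the earlier location, an attacker can derive it by Bayes' rule. A sketch (the uniform prior is my assumption; the forward digits are taken from the slide):

```python
import numpy as np

# Forward temporal correlation: P_F[i, j] = Pr(l^t = j | l^{t-1} = i).
P_F = np.array([[0.2, 0.3, 0.5],
                [0.1, 0.1, 0.8],
                [0.6, 0.2, 0.2]])
# Prior over the previous location (assumed uniform here).
prior = np.array([1 / 3, 1 / 3, 1 / 3])

# Bayes' rule:
#   Pr(l^{t-1}=i | l^t=j) = P_F[i,j]*prior[i] / sum_k P_F[k,j]*prior[k]
joint = prior[:, None] * P_F            # Pr(l^{t-1} = i, l^t = j)
P_B = (joint / joint.sum(axis=0)).T     # row j: Pr(l^{t-1} | l^t = j)
```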

How can the attacker infer private data using an MC? [Model Attacker] DP protects against an attacker who knows all tuples except the victim's (denoted D_K). What if the attacker additionally knows the temporal correlations? Adversary variants: (i) A_i^T(D_K, P_i^B, ∅); (ii) A_i^T(D_K, ∅, P_i^F); (iii) A_i^T(D_K, P_i^B, P_i^F).

How can the attacker infer private data using an MC? [Define TPL] Recall the definition of DP: if PL0(M) ≤ ε, then M satisfies ε-DP. Definition of TPL (Temporal Privacy Loss): the privacy loss at time t against adversary A_i^T, taking the releases at all timestamps r^1, …, r^T into account.

How can the attacker infer private data using an MC? [Define TPL] If there is no temporal correlation, TPL = PL0:
Eqn (2) = log[ Pr(r^1 | l_i^t, D_K^t) / Pr(r^1 | l_i^{t'}, D_K^t) ] + … + log[ Pr(r^t | l_i^t, D_K^t) / Pr(r^t | l_i^{t'}, D_K^t) ] + … + log[ Pr(r^T | l_i^t, D_K^t) / Pr(r^T | l_i^{t'}, D_K^t) ]
Without temporal correlation, every term except the t-th is 0, and the t-th term is exactly PL0.

How can the attacker infer private data using an MC? [Define TPL] If there is temporal correlation, TPL = ? The same sum, Eqn (2), is now hard to quantify: the terms for timestamps other than t are no longer 0, because each release r^j leaks information about l_i^t through the correlations.

How can the attacker infer private data using an MC? [Find structure of TPL] (i) A_i^T(D_K, P_i^B, ∅) → BPL, the backward privacy loss, accumulated from the past releases r^1, …, r^t; (ii) A_i^T(D_K, ∅, P_i^F) → FPL, the forward privacy loss, accumulated from the future releases r^t, …, r^T; (iii) A_i^T(D_K, P_i^B, P_i^F) → both.

How can the attacker infer private data using an MC? [Analyze BPL] BPL is determined by the backward temporal correlations: the backward privacy loss function (Eqn (6)) accumulates the loss recursively over the previous releases. How to calculate it?

How can the attacker infer private data using an MC? [Analyze FPL] FPL is determined by the forward temporal correlations: the forward privacy loss function accumulates the loss over the future releases. How to calculate it?

Calculating BPL & FPL [Privacy Quantification]. We convert the BPL/FPL calculation into finding the optimal solution of a linear-fractional programming problem. This can be solved by the simplex algorithm, but at worst-case cost O(2^n). We designed an O(n²) algorithm for quantifying BPL/FPL.
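The reduction itself is not spelled out here, but the property it exploits is generic: a linear-fractional objective over a polytope attains its optimum at a vertex, so over the probability simplex it suffices to check the unit vectors. A toy illustration (the objective coefficients are arbitrary; this is not the paper's O(n²) algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear-fractional objective f(x) = (q @ x) / (d @ x) over the
# probability simplex {x >= 0, sum(x) = 1}.
q = np.array([0.8, 0.2, 0.5])
d = np.array([0.4, 0.6, 0.5])

# The optimum lies at a vertex of the simplex (a unit vector),
# so checking the n coordinate ratios is enough: O(n).
vertex_opt = np.max(q / d)

# Sanity check against random interior points of the simplex.
xs = rng.dirichlet(np.ones(3), size=10000)
sampled = np.max((xs @ q) / (xs @ d))
assert sampled <= vertex_opt + 1e-9
```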

Calculating BPL & FPL [Privacy Quantification]. Example of BPL over time (t = 1, …, 10) under (i) strong, (ii) moderate, and (iii) no temporal correlation: with no correlation BPL stays at ε, while under temporal correlation BPL increases with t, and the stronger the correlation, the higher the accumulated privacy loss.

Calculating BPL & FPL [Upper bound]. Refer to Theorem 5 in our paper. Four cases of BPL as t grows from 1 to 100: (a) q = 0.8, d = 0.1, ε = 0.23, with P_i^B = (0.8 0.2; 0.1 0.9); (b) q = 0.8, d = 0, ε = 0.5; (c) q = 0.8, d = 0, ε = 0.23; (d) q = 1, d = 0, ε = 0.23. Depending on the transition matrix, BPL may converge to a finite supremum (cases a–c) or keep growing with t (case d).

Outline
- Data Mining with Differential Privacy
- Scenario: Spatiotemporal Data Mining using DP
- Markov Chain for temporal correlations
- Gaussian Markov Random Field for user-user correlations
  - what is a GMRF
  - how the attacker can learn a GMRF from data
  - how the attacker can infer private data using a GMRF
- Summary and open problems

What is a GMRF. A Gaussian Markov Random Field is a probabilistic graphical model in which each node is a Gaussian variable and the (undirected) edges indicate the dependencies between variables. Data correlation = (joint) distribution over the data. We choose the Gaussian Markov random field because it offers: a rich representation; easy construction; a unified (Gaussian) form; easy computation.

How can the attacker learn a GMRF? If the attacker knows user-location frequencies, the GMRF can be fit from them directly. If the attacker knows the user-user social network, the GMRF can be constructed from the weighted undirected graph: take the graph Laplacian matrix of the weighted graph (e.g., over users A, B, C, D) as the precision matrix, giving p(x) ∝ exp(-½ xᵀ Σ⁻¹ x) with Σ⁻¹ the Laplacian.
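Constructing a GMRF from a weighted graph via its Laplacian can be sketched as follows (the edge weights and the small ridge term are my assumptions; the graph Laplacian itself is singular, so the ridge makes the density proper):

```python
import numpy as np

# Weighted undirected social graph over users A, B, C, D
# (weights are illustrative, not the slide's exact example).
n = 4
W = np.zeros((n, n))
for i, j, w in [(0, 1, 2.0), (1, 2, 3.0), (2, 3, 5.0)]:
    W[i, j] = W[j, i] = w

# Graph Laplacian L = D - W serves as the GMRF precision matrix:
# p(x) ∝ exp(-0.5 * x^T L x).  Add a ridge so L is invertible.
L = np.diag(W.sum(axis=1)) - W
Lam = L + 1e-3 * np.eye(n)

# Markov property: a zero precision entry means the two users are
# conditionally independent given everyone else (A and C here).
assert Lam[0, 2] == 0.0

cov = np.linalg.inv(Lam)   # covariance of the resulting GMRF
```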

How can the attacker infer private data using a GMRF? [Model Attacker → Define SPL] Each user can define their own privacy level as R(G, δ), where G is a subgraph of the GMRF (the victim's neighborhood among the nodes x1, …, x9) and δ is the percentage of the data known to adversaries. Examples: (a) R(G, 1/3); (b) R(G, 0.5).

How can the attacker infer private data using a GMRF? [Bayesian Inference] What is the probability of releasing r if the victim is at loc_i?
Pr(r | loc_i) = Σ_{D_u} Pr(r | D) Pr(D_u | loc_i, D_K)
where D_u is the part of the data unknown to the adversary (to be inferred) and D_K is the part he knows. On the GMRF, the spatial privacy loss is
SPL = Pr(r | loc_i) / Pr(r | loc'_i)
i.e., the difference in release probability between the victim being at loc_i and at loc'_i. Computing it requires marginalizing out D_u, combining the Laplace mechanism Pr(r | D) with the data-correlation model (the GMRF).
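The inference step Pr(D_u | loc_i, D_K) on a GMRF is standard Gaussian conditioning in precision form. A sketch on a toy 3-node chain (the precision values and observed values are illustrative, not from the talk):

```python
import numpy as np

# Precision matrix of a 3-node GMRF chain x1 - x2 - x3.
Lam = np.array([[ 2.0, -1.0,  0.0],
                [-1.0,  2.0, -1.0],
                [ 0.0, -1.0,  2.0]])

# The attacker knows x2, x3 (D_K) and infers the victim's x1 (D_u).
u, k = [0], [1, 2]
x_k = np.array([1.0, 0.5])

# Gaussian conditioning in precision form:
#   x_u | x_k ~ N( -Lam_uu^{-1} Lam_uk x_k,  Lam_uu^{-1} )
Lam_uu = Lam[np.ix_(u, u)]
Lam_uk = Lam[np.ix_(u, k)]
cond_mean = -np.linalg.solve(Lam_uu, Lam_uk @ x_k)
cond_var = np.linalg.inv(Lam_uu)
```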

Impact of P_i. s: the Laplacian-smoothing parameter used when estimating P_i; the smaller s, the stronger the prior knowledge about P_i. Plots show TPL over time for s ∈ {0.005, 0.01, 0.05, 1.0}: (a) TPL for ε = 1; (b) TPL for ε = 0.1.

Impact of G_i and m. p: density of the graph. SPL on sparser graphs tends to be higher; SPL on larger G_i tends to be higher. (a) Average SPL vs. p for graphs G_0.8, G_0.5, G_0.2 and the Brightkite dataset; (b) SPL vs. |G_i| for |G_i| = 20, 30, 40, 50 under several values of m.

Outline
- Data Mining with Differential Privacy
- Scenario: Spatiotemporal Data Mining using DP
- Markov Chain for temporal correlations
- Gaussian Markov Random Field for user-user correlations
- Summary and open problems

Summary. When mining private data with DP, we need to model the data correlations. For spatiotemporal data it is natural to model temporal correlations with a Markov Chain and user-user correlations with a GMRF, treated as the attacker's knowledge. Based on inference/computation over the MC and the GMRF, we can calibrate the privacy parameter for DP: sensitive data → model data correlations + attacker inference → Laplace Mechanism Lap(1/ε′) → ε-DP data.

Open Problems. How to model spatiotemporal correlations as a whole, or find the appropriate combination of MC and GMRF? Data correlations may degrade the utility of the mined private data (we need to add more noise for the same level of privacy): how can the data miner take advantage of data correlations to improve data utility?