Modeling Data Correlations in Private Data Mining with Markov Model and Markov Networks
Yang Cao, Emory University, 207..5
Outline
- Data Mining with Differential Privacy (DP)
- Scenario: Spatiotemporal Data Mining using DP
- Markov Chain for temporal correlations
- Gaussian Markov Random Field for user-user correlations
- Summary and open problems
Data Mining
A company or institute mines a sensitive database and publishes the results to the public; an attacker can exploit the published results to attack the sensitive data. (figure: data-mining pipeline with attacker)
Privacy-Preserving Data Mining (PPDM)
How? ε-Differential Privacy. The institute releases noisy data instead of the sensitive data X, mitigating the adversary's attack. (figure: PPDM pipeline)
What is Differential Privacy
Privacy: the right to be forgotten. DP: the output of an algorithm should NOT be significantly affected by any single individual's data.
Formally, M satisfies ε-DP if, for all neighboring databases D, D' (differing in one individual's tuple) and all outputs r:
|log [ Pr(M(Q(D)) = r) / Pr(M(Q(D')) = r) ]| ≤ ε
e.g., Laplace mechanism: add Lap(1/ε) noise to Q(D).
Sequential composition: running an ε-DP mechanism M twice gives 2ε-DP.
Smaller ε means stronger privacy; e.g., 2ε-DP allows more privacy loss than ε-DP.
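A minimal sketch of the Laplace mechanism above (function and variable names are illustrative, not from the talk):

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Release true_answer + Lap(sensitivity/epsilon) noise; satisfies epsilon-DP."""
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Counting query: one person changes the count by at most 1, so sensitivity = 1.
rng = np.random.default_rng(0)
noisy_count = laplace_mechanism(42, sensitivity=1.0, epsilon=0.5, rng=rng)
```

The noise is unbiased, so repeated releases average back to the true answer; this is exactly why sequential composition degrades the guarantee to 2ε, 3ε, and so on.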
An open problem of DP on Correlated Data
When data are independent: changing one user's tuple changes the output distribution by a factor of at most e^ε, so ε-DP holds as intended.
When data are correlated (e.g., u1 and u3 always report the same value): removing one user's tuple still leaks through the correlated tuple, so the effective guarantee may be weaker than ε-DP.
It is still controversial [*][**] what DP guarantees on correlated data.
[*] Differential Privacy as a Causal Property, https://arxiv.org/abs/1710.05899
[**] https://github.com/frankmcsherry/blog/blob/master/posts/2016-08-29.md
Quantifying DP on Correlated Data
A few recent papers [Cao17][Yang15][Song17] use a quantification approach to achieve ε-DP (protecting each user's private data value).
Traditional approach (if the attacker knows correlations, ε-DP may not hold):
sensitive data → Laplace Mechanism Lap(1/ε) → ε-DP data
Quantification approach (protects against attackers with knowledge of correlations):
sensitive data → model data correlations → attacker inference → Laplace Mechanism Lap(1/ε′) → ε-DP data
[Cao17]: Markov Chain; [Yang15]: Gaussian Markov Random Field (GMRF); [Song17]: Bayesian Network
Spatiotemporal Data Mining with DP
At each timestamp, the sensitive database D_1, D_2, D_3 is released under ε-DP: the true counts r_1, r_2, r_3 are perturbed with Laplace noise Lap(1/ε).
(a) Location Data
      t=1    t=2    t=3
u1    loc3   loc1   loc1
u2    loc2   loc4   loc5
u3    loc2   loc4   loc5
u4    loc4   loc5   loc3
(b) True Counts (count query over (a): number of users at each location)
       t=1   t=2   t=3
loc1    0     1     1
loc2    2     0     0
loc3    1     0     1
loc4    1     2     0
loc5    0     1     2
(c) Private Counts: (b) with Lap(1/ε) noise added to every cell.
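The release pipeline above can be sketched as follows; the location table and the sensitivity-1 counting query come from the slide, everything else (names, ε value) is illustrative:

```python
import numpy as np

# Location table from the slide: rows = users u1..u4, columns = t=1..3.
locations = [
    ["loc3", "loc1", "loc1"],  # u1
    ["loc2", "loc4", "loc5"],  # u2
    ["loc2", "loc4", "loc5"],  # u3
    ["loc4", "loc5", "loc3"],  # u4
]
locs = ["loc1", "loc2", "loc3", "loc4", "loc5"]

def true_counts(table):
    """Count query: number of users at each location per timestamp."""
    counts = np.zeros((len(locs), len(table[0])))
    for row in table:
        for t, loc in enumerate(row):
            counts[locs.index(loc), t] += 1
    return counts

def private_counts(counts, epsilon, rng):
    """Each user affects one cell per column, so sensitivity is 1 per release."""
    return counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)

counts = true_counts(locations)
noisy = private_counts(counts, epsilon=1.0, rng=np.random.default_rng(0))
```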
What types of data correlations?
(a) Road Network → temporal correlation for a single user
(b) Social Ties (e.g., colleague and couple relations among users) → spatial (user-user) correlation
Location Data
      7:00   8:00   9:00
u1    loc3   loc1   loc1
u2    loc2   loc1   loc1
u3    loc2   loc4   loc5
u4    loc4   loc5   loc3
Outline
- Data Mining with Differential Privacy
- Scenario: Spatiotemporal Data Mining using DP
- Markov Chain for temporal correlations
  - what is a MC
  - how an attacker can learn a MC from data
  - how an attacker can infer private data using a MC
- Gaussian Markov Random Field for user-user correlations
- Summary and open problems
What is a Markov Chain
A Markov chain is a stochastic process with the Markov property.
1st-order Markov property: the state at time t depends only on the state at time t-1:
Pr(x_t | x_{t-1}) = Pr(x_t | x_{t-1}, ..., x_1)
Time-homogeneous: the transition matrix is the same at every step: for all t > 0, Pr(x_{t+1} | x_t) = Pr(x_{t+2} | x_{t+1}).
Raw Trajectories
      7:00   8:00   9:00
u1    loc1   loc3   loc2
u2    loc2   loc2   loc2
u3    loc3   loc1   loc1
u4    loc1   loc2   loc2
Transition Matrix (row: state at time t; column: state at time t+1)
       loc1   loc2   loc3
loc1   0.2    0.1    0.7
loc2   0.1    0.2    0.7
loc3   0.3    0.4    0.3
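A minimal sketch of sampling from the transition matrix above (illustrative code, not part of the talk); each step depends only on the current state, which is exactly the 1st-order Markov property:

```python
import numpy as np

# Transition matrix from the slide (each row sums to 1): state at t -> state at t+1.
P = np.array([
    [0.2, 0.1, 0.7],   # from loc1
    [0.1, 0.2, 0.7],   # from loc2
    [0.3, 0.4, 0.3],   # from loc3
])

def simulate(P, start, steps, rng):
    """Sample a trajectory; the next state depends only on the current one."""
    traj = [start]
    for _ in range(steps):
        traj.append(rng.choice(len(P), p=P[traj[-1]]))
    return traj

rng = np.random.default_rng(0)
path = simulate(P, start=0, steps=10, rng=rng)
```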
How can an attacker learn a MC
- If the attacker knows a partial user trajectory, he can directly learn the transition matrix by maximum likelihood estimation (count observed transitions and normalize each row).
- If the attacker knows the road network, he may learn the MC using a Google-like model [*].
[*] E. Crisostomi, S. Kirkland, and R. Shorten, "A Google-like model of road network dynamics and its application to regulation and control," International Journal of Control, vol. 84, no. 3, pp. 633-651, Mar. 2011.
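The MLE step can be sketched as follows, using the raw trajectories from the previous slide (function and variable names are illustrative):

```python
import numpy as np

# Observed trajectories over 3 locations (indices 0..2 = loc1..loc3),
# matching the "Raw Trajectories" table.
trajectories = [
    [0, 2, 1],  # u1: loc1 -> loc3 -> loc2
    [1, 1, 1],  # u2: loc2 -> loc2 -> loc2
    [2, 0, 0],  # u3: loc3 -> loc1 -> loc1
    [0, 1, 1],  # u4: loc1 -> loc2 -> loc2
]

def mle_transition_matrix(trajectories, n_states):
    """MLE for a time-homogeneous chain: row-normalized transition counts."""
    counts = np.zeros((n_states, n_states))
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1  # avoid division by zero for unvisited states
    return counts / row_sums

P_hat = mle_transition_matrix(trajectories, n_states=3)
```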
How can an attacker infer private data using a MC — Model Attacker, Define TPL, Find structure of TPL
Model temporal correlations using a Markov chain, e.g., user i: loc1 → loc3 → loc2.
(a) Backward temporal correlation P_i^B = Pr(l_i^{t-1} | l_i^t) (row: location at time t; column: location at time t-1)
       loc1   loc2   loc3
loc1   0.1    0.2    0.7
loc2   0      1      0
loc3   0.6    0.2    0.2
(b) Forward temporal correlation P_i^F = Pr(l_i^t | l_i^{t-1}) (row: location at time t-1; column: location at time t)
       loc1   loc2   loc3
loc1   0.2    0.3    0.5
loc2   0.1    0.1    0.8
loc3   0.3    0.3    0.4
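Given the forward correlations, an attacker can derive the backward ones by Bayes' rule; a sketch (the uniform prior over locations is an illustrative assumption):

```python
import numpy as np

# Forward correlations Pr(l^t | l^{t-1}) from the slide's table (b).
P_F = np.array([
    [0.2, 0.3, 0.5],
    [0.1, 0.1, 0.8],
    [0.3, 0.3, 0.4],
])
prior = np.array([1/3, 1/3, 1/3])  # assumed Pr(l^{t-1})

def backward_from_forward(P_F, prior):
    """Bayes: Pr(l^{t-1}=i | l^t=j) = Pr(l^t=j | l^{t-1}=i) Pr(l^{t-1}=i) / Pr(l^t=j)."""
    joint = prior[:, None] * P_F      # joint[i, j] = Pr(l^{t-1}=i, l^t=j)
    marginal_t = joint.sum(axis=0)    # Pr(l^t=j)
    return (joint / marginal_t).T     # rows indexed by l^t, columns by l^{t-1}

P_B = backward_from_forward(P_F, prior)
```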
Model Attacker
DP protects against an attacker A_i(D_K) who knows all tuples D_K except the victim's location l_i^t. What if the attacker additionally knows temporal correlations?
Adversary with temporal correlations: A_i^T(D_K, P_i^B, P_i^F), with three cases:
(i) A_i^T(D_K, P_i^B, ∅): backward correlations only
(ii) A_i^T(D_K, ∅, P_i^F): forward correlations only
(iii) A_i^T(D_K, P_i^B, P_i^F): both
Define TPL
Recall the definition of DP: if PL_0(M) ≤ ε, then M satisfies ε-DP.
Definition of TPL (Temporal Privacy Loss): the privacy loss of M against the adversary A_i^T, who also exploits temporal correlations.
Definition of TPL
Eqn(2) = log [Pr(r^1 | l_i^t, D_K^t) / Pr(r^1 | l_i^{t'}, D_K^t)] + ... + log [Pr(r^t | l_i^t, D_K^t) / Pr(r^t | l_i^{t'}, D_K^t)] + ... + log [Pr(r^T | l_i^t, D_K^t) / Pr(r^T | l_i^{t'}, D_K^t)]
If there is no temporal correlation, every term except the one at time t is 0, so TPL = PL_0.
If there is temporal correlation, the other terms are no longer 0, and Eqn(2) is hard to quantify directly: TPL = ?
Find structure of TPL
(i) A_i^T(D_K, P_i^B, ∅) incurs BPL (backward privacy loss), arising from the releases r^1, ..., r^t.
(ii) A_i^T(D_K, ∅, P_i^F) incurs FPL (forward privacy loss), arising from the releases r^t, ..., r^T.
(iii) A_i^T(D_K, P_i^B, P_i^F) incurs both.
Analyze BPL and FPL
BPL comes from backward temporal correlations; Eqn(6) gives the backward privacy loss function.
FPL comes from forward temporal correlations, with an analogous forward privacy loss function.
How can we calculate them?
Calculating BPL & FPL — Privacy Quantification
We convert the calculation of BPL/FPL into finding the optimal solution of a linear-fractional programming problem. This problem can be solved by the simplex algorithm in O(2^n); we designed an O(n^2) algorithm for quantifying BPL/FPL.
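This is not the paper's O(n^2) algorithm, but a sketch of why the linear-fractional structure helps: over a probability simplex, a linear-fractional objective attains its maximum at a vertex, so no simplex iterations are needed (the coefficient vectors q and d below are illustrative):

```python
import numpy as np

def max_linear_fractional_on_simplex(q, d):
    """
    Maximize (q . x) / (d . x) over the probability simplex {x >= 0, sum(x) = 1},
    assuming d > 0. A ratio of linear functions is quasi-linear, so the maximum
    is attained at a vertex (a unit basis vector): max_i q_i / d_i.
    """
    q, d = np.asarray(q, float), np.asarray(d, float)
    return (q / d).max()

q = np.array([0.5, 0.2, 0.9])
d = np.array([1.0, 0.4, 1.5])
best = max_linear_fractional_on_simplex(q, d)
```

Checking random points of the simplex confirms none exceeds the vertex optimum.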
Calculating BPL & FPL — Privacy Quantification
Example of BPL under different temporal correlations: (i) strong, (ii) moderate, (iii) no temporal correlation. (figure: Privacy Loss (0.0-1.0) vs. Time t = 1..10; under temporal correlation BPL grows over time, e.g., 0.18, 0.25, 0.30, 0.35, 0.39, 0.42, 0.45, 0.48, 0.50)
Calculating BPL & FPL — Upper bound (refer to Theorem 5 in our paper)
(figure: BPL vs. time t = 20..100 in four cases, each with its transition matrix P_i^B: (a) q = 0.8, d = 0.1, ε = 0.23; (b) q = 0.8, d = 0, ε = 0.5; (c) q = 0.8, d = 0, ε = 0.23; (d) q = 1, d = 0, ε = 0.23)
Outline
- Data Mining with Differential Privacy
- Scenario: Spatiotemporal Data Mining using DP
- Markov Chain for temporal correlations
- Gaussian Markov Random Field for user-user correlations
  - what is a GMRF
  - how an attacker can learn a GMRF from data
  - how an attacker can infer private data using a GMRF
- Summary and open problems
What is a GMRF
A Gaussian Markov Random Field is a probabilistic graphical model in which each node is a Gaussian variable and the (undirected) edges indicate the dependencies between variables.
Data correlation = (joint) distribution over data. We choose the Gaussian Markov random field for its:
- rich representation
- ease of construction
- unified (Gaussian) form
- ease of computation
How can an attacker learn a GMRF
- If the attacker knows user-location frequencies, he can fit a GMRF to them.
- If the attacker knows the user-user social network, he can construct a GMRF from the weighted undirected graph: use the graph Laplacian matrix as the precision matrix Σ^{-1}, so that p(x) ∝ exp(-(1/2) xᵀ Σ^{-1} x).
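A minimal sketch of the Laplacian-as-precision construction (the graph weights and the small ridge term are illustrative assumptions; the ridge is needed because a graph Laplacian is singular):

```python
import numpy as np

# Hypothetical weighted undirected graph on 4 users (0 = no edge).
W = np.array([
    [0, 2, 0, 1],
    [2, 0, 3, 0],
    [0, 3, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

# Graph Laplacian L = D - W, used as the GMRF precision matrix Sigma^{-1}.
L = np.diag(W.sum(axis=1)) - W

# The Laplacian is positive semidefinite but singular; add a small ridge
# so the Gaussian is proper.
precision = L + 0.1 * np.eye(4)
cov = np.linalg.inv(precision)
```

A zero entry in the precision matrix (no edge) means the two users are conditionally independent given everyone else, which is the Markov property of the field.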
How can an attacker infer private data using a GMRF — Model Attacker, Define SPL
Each user can define their privacy level as R(G, δ), where G is a subgraph of the GMRF and δ is the percentage of the data known by adversaries. (figure: examples R(G_1, 1/3) and R(G_1, 0.5), with known, unknown, and victim nodes)
How can an attacker infer private data using a GMRF — Bayesian Inference
What is the probability of output r if the victim is at loc_i?
Pr(r | loc_i) = Σ_{D_u} Pr(r | D) Pr(D_u | loc_i, D_K)
where D_u is the part unknown to the adversary (inferred by marginalization) and D_K is the part known to the adversary.
SPL on a GMRF (the difference in output probability between the victim being at loc_i and at loc'_i):
SPL = log [ Σ_{D_u} Pr(r | D) Pr(D_u | loc_i, D_K) / Σ_{D_u} Pr(r | D') Pr(D_u | loc'_i, D_K) ]
Marginalizing over D_u combines the Laplace mechanism Pr(r | D) with the data correlation model (GMRF).
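The adversary's core inference step is conditioning the joint Gaussian on the known users' values; a sketch with standard Gaussian conditioning formulas (the precision matrix, indices, and observed values below are illustrative assumptions):

```python
import numpy as np

# Joint zero-mean Gaussian over 3 users' values with a GMRF precision matrix;
# the adversary knows users 1 and 2 and infers user 0 (the victim).
precision = np.array([
    [ 2.0, -1.0,  0.0],
    [-1.0,  2.5, -0.5],
    [ 0.0, -0.5,  1.5],
])
cov = np.linalg.inv(precision)

def condition(cov, known_idx, unknown_idx, known_vals):
    """Conditional distribution of the unknown block given the known block."""
    S_uu = cov[np.ix_(unknown_idx, unknown_idx)]
    S_uk = cov[np.ix_(unknown_idx, known_idx)]
    S_kk = cov[np.ix_(known_idx, known_idx)]
    gain = S_uk @ np.linalg.inv(S_kk)
    cond_mean = gain @ known_vals          # zero prior mean assumed
    cond_cov = S_uu - gain @ S_uk.T        # Schur complement
    return cond_mean, cond_cov

mu, Sigma = condition(cov, known_idx=[1, 2], unknown_idx=[0],
                      known_vals=np.array([1.0, -0.5]))
```

A useful sanity check: the conditional covariance of a block given the rest equals the inverse of that block of the precision matrix.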
Impact of P_i. s: parameter of Laplacian smoothing; the smaller s, the higher the prior knowledge about P_i. (figures: TPL over time for ε = 1 and for ε = 0.1, each with s = 0.01, 0.05, 0.005, 1.0)
Impact of G_i and m. p: density of the graph.
- SPL vs. p: privacy loss on sparser graphs tends to be higher.
- SPL vs. |G_i|: privacy loss on a larger G_i tends to be higher.
(figures: (a) average SPL for ε = 1 on G0.8_50, Brightkite_50, G0.2_50; (b) SPL vs. |G_i| = 20..50 for G0.8, G0.5, G0.2 with m = 10)
Summary
When mining private data using DP, we need to model data correlations. In spatiotemporal data, we naturally model temporal correlations using a Markov chain, and user-user correlations using a GMRF, as the attacker's knowledge. Based on the inference/computation over the MC and GMRF, we can calibrate the privacy parameter for DP:
sensitive data → model data correlations → attacker inference → Laplace Mechanism Lap(1/ε′) → ε-DP data
Open Problems
- How to model spatiotemporal correlations as a whole, or find an appropriate combination of MC and GMRF?
- Data correlations may degrade the utility of the mined private data (more noise is needed for the same level of privacy): how can the data miner take advantage of data correlations to improve data utility?