Differentially Private Real-time Data Release over Infinite Trajectory Streams

Transcription:

Differentially Private Real-time Data Release over Infinite Trajectory Streams. Yang Cao, Masatoshi Yoshikawa. Department of Social Informatics, Kyoto University, Japan. 1

Outline Motivation: opportunity & privacy risk Problem definition and analysis Proposed solution Experiment results Conclusion & Future work 2

Motivation: Opportunity. A great opportunity to utilize personal real-life data: life-log data is easy to collect along with people's trajectories, <uid,time,loc, >. Trajectory streams consist of many people's trajectories (a trajectory: time-sequenced locations), and statistics of trajectory streams are useful. E.g., Count: how many people are at Pittsburgh station now? E.g., Count: how many people at Pittsburgh station have Heart Rate > 100 now? Applications: health-aware navigation systems, marketing analysis, intelligent transportation systems. Leverage statistics of trajectory streams for data-driven innovations! 4

Motivation: Privacy Risk. Publishing statistics (of personal data) is risky. E.g., with the table Name/Age/Sex/Dis.: u1 40 M HIV; u2 30 M -, release the answers to Q1: COUNT(Sex=Female) = A and Q2: COUNT(Sex=Female OR (Age=40 & Sex=Male & Employer='u1')) = B. If B = A+1, then Q3: COUNT((Sex=Female OR (Age=40 & Sex=Male & Employer='u1')) & Diagnosis='HIV') = C yields C = 1 or 0, so u1's HIV status is positively or negatively compromised! http://www.mathcs.emory.edu/~lxiong/cs573_s12/ 8

Motivation: Privacy Risk. Publishing statistics (of personal data) is risky: linkage attack on anonymized data [1][2], joining a released database with an attacker's database. [1] L. Sweeney, Simple demographics often identify people uniquely, Health (San Francisco), 2000. [2] C. Dwork, A Firm Foundation for Private Data Analysis, Commun. ACM, Jan. 2011. 9

Motivation: Privacy Risk. Publishing statistics (of personal data) is risky: linkage attack [1][2]. Personal trajectory data is highly sensitive: four spatiotemporal data points can identify 95% of individuals [3]. Goal: publish statistics of trajectory streams via a Privacy-Preserving Data Publishing (PPDP) method, enabling open data, untrusted cloud services, and data-mining outsourcing. [1] L. Sweeney, Simple demographics often identify people uniquely, Health (San Francisco), 2000. [2] C. Dwork, A Firm Foundation for Private Data Analysis, Commun. ACM, Jan. 2011. [3] Y.-A. de Montjoye et al., Unique in the Crowd: The privacy bounds of human mobility, Sci. Rep., Mar. 2013. 10

Our contributions. A rigorous and flexible PPDP framework over infinite trajectory streams: the first personalized privacy model for spatiotemporal data, with protection based on ε-differential privacy (rigorous); designed algorithms to publish counts in real time, e.g., Count: how many people are at Pittsburgh station now? Published data utility is better than previous results (flexible). Pipeline: real-time trajectory data + users' privacy preferences → privacy model & PPDP algorithm: sensitive raw statistics in real time → noisy data, publishable! 11

Outline Motivation: opportunity & privacy risk Problem definition and analysis Proposed solution Experiment results Conclusion & Future work 13

Problem Definition: PPDP over infinite trajectory streams. Data collection proceeds as follows: a trusted server collects records <uid, time, loc.>: u1 t1 park; u3 t1 park; u2 t2 bar; u3 t2 office; u3 t3 gym; u1 t5 bar; u2 t5 park; u3 t5 bar (a) raw data. (b) trajectory representation of raw data (t1 t2 t3 t4 t5): u1: park - - - bar; u2: - bar - - park; u3: park office gym - bar. (c) raw statistics, risky to publish directly: locs t1 t2 t3 t4 t5; park 2 0 0 0 1; office 0 1 0 0 0; bar 0 1 0 0 2; gym 0 0 1 0 0. The server should safely publish (c). 14
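The aggregation from the raw records in (a) to the statistics table (c) can be sketched as follows (a minimal sketch; the record layout and function name are illustrative, not from the paper):

```python
from collections import defaultdict

def build_statistics(records):
    """Aggregate raw <uid, time, loc> records into per-(location, timestamp)
    counts, i.e. the raw statistics table (c)."""
    counts = defaultdict(int)
    for uid, t, loc in records:
        counts[(loc, t)] += 1
    return dict(counts)

# The example stream from table (a)
raw = [("u1", "t1", "park"), ("u3", "t1", "park"), ("u2", "t2", "bar"),
       ("u3", "t2", "office"), ("u3", "t3", "gym"), ("u1", "t5", "bar"),
       ("u2", "t5", "park"), ("u3", "t5", "bar")]
stats = build_statistics(raw)  # e.g. stats[("park", "t1")] == 2
```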

Problem Definition: PPDP over infinite trajectory streams. How do we transform the raw statistics (c) into a safe version (c'), while keeping (c') as similar to (c) as possible? (Trusted server; tables (a) raw data, (b) trajectory representation, and (c) raw statistics as on the previous slide.) 15

Problem Definition: PPDP over infinite trajectory streams. Ad-hoc methods CANNOT provide a reliable privacy guarantee, e.g., deleting all values of 1 from the table, because it is hard to model the attacker's background knowledge in the big-data setting. (c) raw statistics (risky): locs t1 t2 t3 t4 t5; park 2 0 0 0 1; office 0 1 0 0 0; bar 0 1 0 0 2; gym 0 0 1 0 0 → ad-hoc PPDP algorithm → (c') safe(?) statistics: park 2 0 0 0 0; office 0 0 0 0 0; bar 0 0 0 0 2; gym 0 0 0 0 0. 16

Differential Privacy: a rigorous privacy definition. ε-differential privacy (ε-DP) [4] is the de facto privacy standard for statistical data publishing. A randomized algorithm A achieves ε-DP if, for any databases D and D* differing in any one individual's data, Pr[A(Q(D))] / Pr[A(Q(D*))] ≤ e^ε, where ε > 0 is a given parameter. ε is the privacy budget, a unified privacy-level control: as ε → 0, noise increases, the privacy level rises, and data utility falls. ε-DP is robust against attackers with arbitrary background knowledge, including linkage attacks. The Laplace Mechanism [4] and the Exponential Mechanism [5] can be used as sub-procedures of A. [4] C. Dwork et al., Calibrating Noise to Sensitivity in Private Data Analysis, TCC 2006. [5] F. McSherry and K. Talwar, Mechanism Design via Differential Privacy, FOCS 2007. 19
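The Laplace Mechanism referenced above can be sketched for COUNT queries (sensitivity 1). This is a generic illustration using stdlib sampling, not the paper's implementation:

```python
import math
import random

def laplace_noise(scale):
    # Laplace(0, scale) = random sign times an Exponential with mean `scale`
    return random.choice((-1.0, 1.0)) * random.expovariate(1.0 / scale)

def laplace_mechanism(counts, epsilon, sensitivity=1.0):
    """Release counts under epsilon-DP by adding Laplace(sensitivity/epsilon)
    noise to each count. Sensitivity 1 fits COUNT queries, where removing
    one individual changes any count by at most 1."""
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale) for c in counts]

# One timestamp's counts from table (c): four noisy floats near the true counts
noisy = laplace_mechanism([2, 0, 0, 1], epsilon=1.0)
```

Smaller ε gives a larger noise scale, which is exactly the utility/privacy trade-off pictured on the slide.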

Problem Analysis: PPDP over infinite trajectory streams. How to apply ε-DP to PPDP of infinite trajectory streams? It depends on what we want to protect! 20

Problem Analysis: PPDP over infinite trajectory streams. Two naive methods: (1) protect the data of each single timestamp; (2) protect each user's data over all timestamps. However, protecting only one data point, as in (1), is not safe (a separate budget ε at each of t1..t5; tables (a)-(c) as before). 22

Problem Analysis: PPDP over infinite trajectory streams. Two naive methods: (1) protect the data of each single timestamp; (2) protect each user's data over all timestamps. However, method (2) is unrealistic for infinite trajectory streams, since one budget ε would have to cover a user's entire, unbounded trajectory (tables (a)-(c) as before). 23

Problem Analysis: PPDP over infinite trajectory streams. Two naive methods: (1) protect the data of each single timestamp; (2) protect each user's data over all timestamps. Using the Laplace Mechanism [4] as a sub-procedure, the PPDP algorithm A adds suitably calibrated Laplace noise to the counts: (c) raw statistics (risky): park 2 0 0 0 1; office 0 1 0 0 0; bar 0 1 0 0 2; gym 0 0 1 0 0 → (c') ε-DP statistics: park 2.8 4.1 -0.1 2 0.9; office 0.1 2.1 2.1 2.2 0.1; bar 0.2 1.9 1.2 0.7 6.1; gym 0.9 -1.1 3.5 -0.9 1.2. [4] C. Dwork, et al., Calibrating Noise to Sensitivity in Private Data Analysis, TCC 2006. 25

Problem Analysis: PPDP over infinite trajectory streams. Two naive methods: (1) protect the data of each single timestamp; (2) protect each user's data over all timestamps. Our observation: in real life, individuals may have different privacy requirements. Diagrams: (1) protecting any one spatiotemporal data point (a budget ε at each of t1..t5); (2) protecting trajectories of any length (one ε over the whole stream). 26

Problem Analysis: PPDP over infinite trajectory streams. Our method: protect any l-trajectory, i.e., any l successive spatiotemporal data points of one user, where l_i is the privacy preference of user i (e.g., 2-trajectories for u1 and u2, a 3-trajectory for u3), under overall privacy budget ε. 27

Related work. Differential privacy on finite streams: (1) protecting the data of each single timestamp: event-level privacy [8]; (2) protecting each user's data over all timestamps: user-level privacy [8]; FAST [7]: Laplace + Kalman filter (predict/correct the noisy data). Differential privacy on infinite streams: w-event privacy [6]. [6] G. Kellaris et al., Differentially Private Event Sequences over Infinite Streams, VLDB '14. [7] L. Fan et al., FAST: Differentially Private Real-time Aggregate Monitor with Filtering and Adaptive Sampling, SIGMOD '13. [8] C. Dwork, Differential Privacy in New Settings, SODA, pp. 174-183, 2010. 28

Related work. Differential privacy on finite streams: (1) event-level privacy [8] protects the data of each single timestamp (a budget ε at every t1..t5); (2) user-level privacy [8] protects each user's data over all timestamps (one budget ε over the whole stream); FAST [7]: Laplace + Kalman filter (predict/correct the noisy data). Differential privacy on infinite streams: w-event privacy [6]. [7] L. Fan et al., FAST: Differentially Private Real-time Aggregate Monitor with Filtering and Adaptive Sampling, SIGMOD '13. [8] C. Dwork, Differential Privacy in New Settings, SODA, pp. 174-183, 2010. 29

Related work. l-trajectory privacy (l=3) (this study) vs. w-event privacy (w=3) [6]: l-trajectory privacy protects any l successive data points of each individual user's trajectory, while w-event privacy protects any w successive timestamps of the whole stream. [6] G. Kellaris et al., Differentially Private Event Sequences over Infinite Streams, VLDB '14. 30

Outline Motivation: opportunity & privacy risk Problem definition and analysis Proposed solution Experiment results Conclusion & Future work 31

Overview of our solution. A rigorous and flexible privacy model: l-trajectory privacy (challenge: proving how to achieve it with a DP mechanism; l1, l2, l3 are the preferences of users u1, u2, u3). A PPDP algorithm to publish counts in real time (challenge: real-time publishing over infinite streams): design a Greedy Algorithm (GA) to dynamically add noise at each timestamp; then, since that alone injects too much noise, re-publish the noisy data with Minimum Manhattan Distance (MMD) to the current data. Pipeline: (b) infinite trajectories → (c) raw statistics (risky) → l-trajectory privacy model & PPDP algorithm (Laplace Mechanism, GA, re-publish strategy MMD) → (c') l-trajectory private data: park 2.8 4.1 -0.1 2 0.9; office 0.1 2.1 2.1 2.2 0.1; bar 0.2 1.9 1.2 0.7 6.1; gym 0.9 -1.1 3.5 -0.9 1.2. 35

How to achieve l-trajectory privacy. We proved how to achieve it with conventional DP mechanisms (e.g., the Laplace/Exponential mechanism): let ε_i be a privacy budget variable at each timestamp t_i; the sum of the ε_i over the timestamps of any l-trajectory must be at most ε. For the example trajectories (u1: park@t1, bar@t5, with l1; u2: bar@t2, park@t5, with l2; u3: park@t1, office@t2, gym@t3, bar@t5, with l3), the constraints are ε1+ε5 ≤ ε, ε2+ε5 ≤ ε, ε1+ε2+ε3 ≤ ε, ε2+ε3+ε5 ≤ ε. 38

How to achieve l-trajectory privacy. If we knew all data in advance, the ε_i could be computed by Linear Programming: maximize (approximate) utility subject to the l-trajectory constraints ε1+ε5 ≤ ε, ε2+ε5 ≤ ε, ε1+ε2+ε3 ≤ ε, ε2+ε3+ε5 ≤ ε. 39

How to achieve l-trajectory privacy. However, we need real-time publishing over infinite streams, so Linear Programming, which requires all data in advance, cannot be applied directly; the ε_i must be allocated online. 40

How to achieve l-trajectory privacy. At each timestamp, constraints involving future budgets are still unknown: at t3 we can check ε1+ε2+ε3 ≤ ε, but ε1+ε5 ≤ ε?, ε2+ε5 ≤ ε?, and ε2+ε3+ε5 ≤ ε? depend on the not-yet-allocated ε5. 41
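The per-window budget constraint can be checked mechanically. A sketch under an assumed data layout (budgets per timestamp, each user's trajectory as a sorted timestamp list; the function name is illustrative):

```python
def satisfies_l_trajectory(budgets, trajectories, l, total_epsilon):
    """Check that, for every user, the budgets spent on any window of l
    successive data points of that user's trajectory sum to at most the
    overall budget.

    budgets: {timestamp index: epsilon_i spent at that timestamp}
    trajectories: {user id: sorted timestamps where the user has a point}
    """
    for user, ts in trajectories.items():
        for start in range(len(ts) - l + 1):
            window = ts[start:start + l]
            if sum(budgets[t] for t in window) > total_epsilon:
                return False
    return True

# The slide's example stream: u1 at t1,t5; u2 at t2,t5; u3 at t1,t2,t3,t5
budgets = {1: 0.3, 2: 0.3, 3: 0.3, 4: 0.0, 5: 0.4}
trajs = {"u1": [1, 5], "u2": [2, 5], "u3": [1, 2, 3, 5]}
satisfies_l_trajectory(budgets, trajs, 2, 1.0)  # -> True
```

With this allocation every 2-point window stays within ε = 1.0, matching the constraints ε1+ε5 ≤ ε, ε2+ε5 ≤ ε, etc.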

PPDP algorithm: GA. GA: a Greedy Algorithm that finds approximately optimal ε_i under incomplete information. Idea: exponential decay. (1) Set the unknown future budgets to 0 and compute the remaining room w in the tightest constraint (i.e., ε3 = w at most); (2) use w/2 as the value of ε3, reserving the other w/2 for future timestamps. 43
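The exponential-decay idea can be sketched for one fixed constraint window (a simplification: in the actual algorithm the binding l-trajectory window changes as the stream evolves):

```python
def greedy_budget(past_budgets_in_window, total_epsilon):
    """One greedy step: set the unknown future budgets to 0, compute the
    remaining room w in the binding window, then spend w/2 now and
    reserve w/2 for the future (exponential decay)."""
    w = total_epsilon - sum(past_budgets_in_window)
    return max(w / 2.0, 0.0)

# Spending half the remainder each step decays the per-timestamp budget
# geometrically, so any number of future timestamps still fits within eps.
eps = 1.0
spent = []
for _ in range(4):
    spent.append(greedy_budget(spent, eps))
# spent == [0.5, 0.25, 0.125, 0.0625]; the total never exceeds eps
```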

PPDP algorithm: MMD. A re-publishing strategy to improve the data utility of counts. Idea: real-life data has periodically repeated patterns, so instead of always adding fresh noise we can re-publish the adjacent (Adj) noisy data (studied in [6]), or re-publish the previously published noisy data n1..n(t-1) with Minimum Manhattan Distance (MMD) to the real data r_t of the current timestamp, selected privately via the Exponential Mechanism [5]; at each timestamp we publish n_mmd or n_adj. [5] F. McSherry and K. Talwar, Mechanism Design via Differential Privacy, FOCS 2007. [6] G. Kellaris et al., Differentially Private Event Sequences over Infinite Streams, VLDB '14. 44
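The MMD selection can be sketched with the exponential mechanism, scoring each previously published vector by its negative Manhattan distance to the current real counts (a sketch only; the paper's exact scoring and sensitivity analysis may differ):

```python
import math
import random

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def pick_mmd(published, current_real, epsilon, sensitivity=1.0):
    """Exponential mechanism over already-published noisy vectors:
    Pr[i] proportional to exp(epsilon * score_i / (2 * sensitivity)),
    with score_i = -ManhattanDistance(published_i, current_real)."""
    scores = [-manhattan(p, current_real) for p in published]
    m = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(epsilon * (s - m) / (2.0 * sensitivity)) for s in scores]
    r = random.random() * sum(weights)
    for p, w in zip(published, weights):
        r -= w
        if r <= 0:
            return p
    return published[-1]
```

With a large budget the mechanism almost always returns the truly closest vector; shrinking ε makes the choice noisier, which is the price of keeping the selection private.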

Personalized privacy control. The proposed framework combines dynamic budget allocation (GA) with a private approximation strategy. At timestamp t, the collected database D_t yields the set of uids us_t and the real statistics r_t; GA Dynamic Budget Allocation takes the history (ε_1..ε_{t-1}, us_1..us_t, n_1..n_{t-1}, r_t) and outputs budgets ε_{t,1} and ε_{t,2}; the Private Approximation Strategy chooses Adj or MMD; Private Publishing then releases the private statistics n_t under budget ε_t into the published stream. The framework is capable of publishing diverse statistical data over infinite trajectory streams; Adj/MMD is optimized for publishing counts. 45

Outline Motivation: opportunity & privacy risk Problem definition and analysis Proposed solution Experiment results Conclusion & Future work 46

Experiments. Four real-life trajectory datasets. PeopleFlow (1): people moving; 18 locations, 11,406 users, 1,694 timestamps at 5-min intervals over 6 days, 102,468 data points. Geolife (2): people moving, diverse types of mobility, sparse (Beijing); 56 locations, 170 users, 1,440 timestamps at 1-min intervals over 24 hours*, 240,990 data points. T-Drive (3): taxis moving (Beijing); 21 locations, 2,698 users, 886 timestamps at 10-min intervals over ~7 days, 37,255 data points. WorldCup98 (4): webpage click streams; 1,000 locations, 550,762 users, 722 timestamps at 1-hour intervals over ~35 days, 1,258,542 data points. 1. http://pflow.csis.u-tokyo.ac.jp/ 2. http://research.microsoft.com/en-us/projects/geolife/ 3. http://research.microsoft.com/en-us/projects/tdrive/ 4. http://ita.ee.lbl.gov/html/contrib/WorldCup.html *Aggregating 50,176 hours of timestamps into 24 hours by omitting the date. 47

Experiments: compared methods

| Method   | Budget allocation               | Approximation strategy | Real-time | Utility evaluation  |
|----------|---------------------------------|------------------------|-----------|---------------------|
| Uniform  | uniform                         | none                   | O         | bad                 |
| LP       | approximately globally optimized | none                  | X         | normal              |
| GA+Adj   | dynamic                         | republish Adj          | O         | better              |
| GA+MMD   | dynamic                         | republish MMD          | O         | best                |
| FAST [7] | fixed (uniform)                 | -                      | O         | better, not stable  |

[7] L. Fan et al., "FAST: Differentially Private Real-time Aggregate Monitor with Filtering and Adaptive Sampling," SIGMOD 2013.

Experiments: real data vs. noisy data (ll = 20, ε = 1)
(Figure: PeopleFlow counts over timestamps 150 to 300, comparing the real data against GA+MMD, FAST [7], and Uniform.)

Experiments: metrics (for all three, lower is better)
- Mean Absolute Error (MAE):
  $\mathrm{MAE}(R, N) = \frac{1}{T \cdot locs} \sum_{i=1}^{T} \sum_{j=1}^{locs} \left| r_i[j] - n_i[j] \right|$
- Mean Squared Error (MSE), sensitive to large errors:
  $\mathrm{MSE}(R, N) = \frac{1}{T \cdot locs} \sum_{i=1}^{T} \sum_{j=1}^{locs} \left( r_i[j] - n_i[j] \right)^2$
- KL-divergence, measuring the similarity of two distributions (the lower, the more similar), where $r_i[j]^*$ and $n_i[j]^*$ are the counts normalized into distributions:
  $D_{KL}(R \parallel N) = \frac{1}{T} \sum_{i=1}^{T} \sum_{j=1}^{locs} r_i[j]^* \ln\!\left( \frac{r_i[j]^*}{n_i[j]^*} \right)$
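The three metrics can be computed directly from the real and noisy count matrices R and N (T timestamps by locs locations); a straightforward sketch, with illustrative function names:

```python
import math

def mae(R, N):
    """Mean absolute error over T timestamps and locs locations."""
    T, locs = len(R), len(R[0])
    return sum(abs(r - n) for ri, ni in zip(R, N)
               for r, n in zip(ri, ni)) / (T * locs)

def mse(R, N):
    """Mean squared error; penalizes large deviations more than MAE."""
    T, locs = len(R), len(R[0])
    return sum((r - n) ** 2 for ri, ni in zip(R, N)
               for r, n in zip(ri, ni)) / (T * locs)

def kl(R, N):
    """Average per-timestamp KL divergence between the normalized real
    and noisy count distributions (lower means more similar)."""
    total = 0.0
    for ri, ni in zip(R, N):
        rs, ns = sum(ri), sum(ni)
        total += sum((r / rs) * math.log((r / rs) / (n / ns))
                     for r, n in zip(ri, ni) if r > 0 and n > 0)
    return total / len(R)
```

Note the normalization inside `kl`: each timestamp's counts are turned into a probability distribution before the divergence is taken, matching the starred values in the formula above.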

Experiments: MAE/MSE by varying ll (ε = 1)
(Figure: MAE (top row) and MSE (bottom row) for ll = 10 to 100 on PeopleFlow, Geolife, T-Drive, and WorldCup98, comparing Uniform, FAST, GA+Adj, and GA+MMD; error axes are log-scale.)

Experiments: MAE/MSE by varying ε (ll = 20)
(Figure: MAE (top row) and MSE (bottom row) for ε from 1.0E-4 to 10.0 on PeopleFlow, Geolife, T-Drive, and WorldCup98, comparing Uniform, FAST, GA+Adj, and GA+MMD; error axes are log-scale.)

Outline
- Motivation: opportunity & privacy risk
- Problem definition and analysis
- Proposed solution
- Experiment results
- Conclusion & future work

Conclusion
- Proposed l-trajectory privacy, a rigorous and flexible privacy model for spatio-temporal data.
- Proposed a PPDP framework for spatio-temporal data.
- Proposed algorithm GA+MMD for publishing private counts with high utility in real time.
(Diagram: personal data, protected by privacy models and processed by PPDP/PPDM algorithms, yields safe open data and safe mining results; business model: privacy as services/money.)

Future Work
- Improve algorithm GA+MMD for private counts: published counts should be non-negative integers.
- More flexible privacy models: ll-trajectory privacy; location-based privacy models.
- PPDP/PPDM for mining personal spatio-temporal data.

Thank you! Any questions?