Differentially Private Real-time Data Release over Infinite Trajectory Streams

Transcription:

Differentially Private Real-time Data Release over Infinite Trajectory Streams. Yang Cao, Masatoshi Yoshikawa. Department of Social Informatics, Kyoto University, Japan. 1

Outline Motivation: opportunity & privacy risk Problem definition and analysis Proposed solution Experiment results Conclusion & Future work 2

Motivation: Opportunity. A great opportunity to utilize personal real-life data: life-log data is easy to collect along with people's trajectories, <uid,time,loc, >. Trajectory streams consist of many people's trajectories (a trajectory: time-sequenced locations), and statistics of trajectory streams are useful. E.g., Count: how many people are at Pittsburgh station now? E.g., Count: how many people at Pittsburgh station have Heart Rate > 100 now? Applications: health-aware navigation systems, marketing analysis, intelligent transportation systems. Leverage statistics of trajectory streams for data-driven innovations! 4

Motivation: Privacy Risk. Publishing statistics (of personal data) is risky. E.g., with the table Name/Age/Sex/Dis.: u1 40 M HIV; u2 30 M -, release the answers to Q1: COUNT(Sex=Female) = A and Q2: COUNT(Sex=Female OR (Age=40 & Sex=Male & Employer='u1')) = B. If B = A+1, then Q3: COUNT((Sex=Female OR (Age=40 & Sex=Male & Employer='u1')) & Diagnosis='HIV') = C yields C = 1 or 0, so u1's HIV status is positively or negatively compromised! http://www.mathcs.emory.edu/~lxiong/cs573_s12/ 8

Motivation: Privacy Risk. Publishing statistics (of personal data) is risky: linkage attack on anonymized data [1][2], joining a released database with an attacker's database. [1] L. Sweeney, Simple demographics often identify people uniquely, Health (San Francisco), 2000. [2] C. Dwork, A Firm Foundation for Private Data Analysis, Commun. ACM, Jan. 2011. 9

Motivation: Privacy Risk. Publishing statistics (of personal data) is risky: linkage attack [1][2]. Personal trajectory data is highly sensitive: four spatiotemporal data points can identify 95% of individuals [3]. Goal: publish statistics of trajectory streams via a Privacy-Preserving Data Publishing (PPDP) method, enabling open data, untrusted cloud services, and data-mining outsourcing. [1] L. Sweeney, Simple demographics often identify people uniquely, Health (San Francisco), 2000. [2] C. Dwork, A Firm Foundation for Private Data Analysis, Commun. ACM, Jan. 2011. [3] Y.-A. de Montjoye et al., Unique in the Crowd: The privacy bounds of human mobility, Sci. Rep., Mar. 2013. 10

Our contributions. A rigorous and flexible PPDP framework over infinite trajectory streams: the first personalized privacy model for spatiotemporal data, with protection based on ε-differential privacy (rigorous); designed algorithms to publish counts in real time, e.g., Count: how many people are at Pittsburgh station now? Published data utility is better than previous results (flexible). Pipeline: real-time trajectory data + users' privacy preferences → privacy model & PPDP algorithm: sensitive raw statistics in real time → noisy data, publishable! 11

Outline Motivation: opportunity & privacy risk Problem definition and analysis Proposed solution Experiment results Conclusion & Future work 13

Problem Definition: PPDP over infinite trajectory streams. Data collection proceeds as follows: a trusted server collects records <uid, time, loc.>: u1 t1 park; u3 t1 park; u2 t2 bar; u3 t2 office; u3 t3 gym; u1 t5 bar; u2 t5 park; u3 t5 bar (a) raw data. (b) trajectory representation of raw data (t1 t2 t3 t4 t5): u1: park - - - bar; u2: - bar - - park; u3: park office gym - bar. (c) raw statistics, risky to publish directly: locs t1 t2 t3 t4 t5; park 2 0 0 0 1; office 0 1 0 0 0; bar 0 1 0 0 2; gym 0 0 1 0 0. The server should safely publish (c). 14
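The aggregation from the raw records in (a) to the statistics table (c) can be sketched as follows (a minimal sketch; the record layout and function name are illustrative, not from the paper):

```python
from collections import defaultdict

def build_statistics(records):
    """Aggregate raw <uid, time, loc> records into per-(location, timestamp)
    counts, i.e. the raw statistics table (c)."""
    counts = defaultdict(int)
    for uid, t, loc in records:
        counts[(loc, t)] += 1
    return dict(counts)

# The example stream from table (a)
raw = [("u1", "t1", "park"), ("u3", "t1", "park"), ("u2", "t2", "bar"),
       ("u3", "t2", "office"), ("u3", "t3", "gym"), ("u1", "t5", "bar"),
       ("u2", "t5", "park"), ("u3", "t5", "bar")]
stats = build_statistics(raw)  # e.g. stats[("park", "t1")] == 2
```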

Problem Definition: PPDP over infinite trajectory streams. How do we transform the raw statistics (c) into a safe version (c'), while keeping (c') as similar to (c) as possible? (Trusted server; tables (a) raw data, (b) trajectory representation, and (c) raw statistics as on the previous slide.) 15

Problem Definition: PPDP over infinite trajectory streams. Ad-hoc methods CANNOT provide a reliable privacy guarantee, e.g., deleting all values of 1 from the table, because it is hard to model the attacker's background knowledge in the big-data setting. (c) raw statistics (risky): locs t1 t2 t3 t4 t5; park 2 0 0 0 1; office 0 1 0 0 0; bar 0 1 0 0 2; gym 0 0 1 0 0 → ad-hoc PPDP algorithm → (c') safe(?) statistics: park 2 0 0 0 0; office 0 0 0 0 0; bar 0 0 0 0 2; gym 0 0 0 0 0. 16

Differential Privacy: a rigorous privacy definition. ε-differential privacy (ε-DP) [4] is the de facto privacy standard for statistical data publishing. A randomized algorithm A achieves ε-DP if, for any databases D and D* differing in any one individual's data, Pr[A(Q(D))] / Pr[A(Q(D*))] ≤ e^ε, where ε > 0 is a given parameter. ε is the privacy budget, a unified privacy-level control: as ε → 0, noise increases, the privacy level rises, and data utility falls. ε-DP is robust against attackers with arbitrary background knowledge, including linkage attacks. The Laplace Mechanism [4] and the Exponential Mechanism [5] can be used as sub-procedures of A. [4] C. Dwork et al., Calibrating Noise to Sensitivity in Private Data Analysis, TCC 2006. [5] F. McSherry and K. Talwar, Mechanism Design via Differential Privacy, FOCS 2007. 19
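The Laplace Mechanism referenced above can be sketched for COUNT queries (sensitivity 1). This is a generic illustration using stdlib sampling, not the paper's implementation:

```python
import math
import random

def laplace_noise(scale):
    # Laplace(0, scale) = random sign times an Exponential with mean `scale`
    return random.choice((-1.0, 1.0)) * random.expovariate(1.0 / scale)

def laplace_mechanism(counts, epsilon, sensitivity=1.0):
    """Release counts under epsilon-DP by adding Laplace(sensitivity/epsilon)
    noise to each count. Sensitivity 1 fits COUNT queries, where removing
    one individual changes any count by at most 1."""
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale) for c in counts]

# One timestamp's counts from table (c): four noisy floats near the true counts
noisy = laplace_mechanism([2, 0, 0, 1], epsilon=1.0)
```

Smaller ε gives a larger noise scale, which is exactly the utility/privacy trade-off pictured on the slide.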

Problem Analysis: PPDP over infinite trajectory streams. How to apply ε-DP to PPDP of infinite trajectory streams? It depends on what we want to protect! 20

Problem Analysis: PPDP over infinite trajectory streams. Two naive methods: (1) protect the data of each single timestamp; (2) protect each user's data over all timestamps. However, protecting only one data point, as in (1), is not safe (a separate budget ε at each of t1..t5; tables (a)-(c) as before). 22

Problem Analysis: PPDP over infinite trajectory streams. Two naive methods: (1) protect the data of each single timestamp; (2) protect each user's data over all timestamps. However, method (2) is unrealistic for infinite trajectory streams, since one budget ε would have to cover a user's entire, unbounded trajectory (tables (a)-(c) as before). 23

Problem Analysis: PPDP over infinite trajectory streams. Two naive methods: (1) protect the data of each single timestamp; (2) protect each user's data over all timestamps. Using the Laplace Mechanism [4] as a sub-procedure, the PPDP algorithm A adds suitably calibrated Laplace noise to the counts: (c) raw statistics (risky): park 2 0 0 0 1; office 0 1 0 0 0; bar 0 1 0 0 2; gym 0 0 1 0 0 → (c') ε-DP statistics: park 2.8 4.1 -0.1 2 0.9; office 0.1 2.1 2.1 2.2 0.1; bar 0.2 1.9 1.2 0.7 6.1; gym 0.9 -1.1 3.5 -0.9 1.2. [4] C. Dwork, et al., Calibrating Noise to Sensitivity in Private Data Analysis, TCC 2006. 25

Problem Analysis: PPDP over infinite trajectory streams. Two naive methods: (1) protect the data of each single timestamp; (2) protect each user's data over all timestamps. Our observation: in real life, individuals may have different privacy requirements. Diagrams: (1) protecting any one spatiotemporal data point (a budget ε at each of t1..t5); (2) protecting trajectories of any length (one ε over the whole stream). 26

Problem Analysis: PPDP over infinite trajectory streams. Our method: protect any l-trajectory, i.e., any l successive spatiotemporal data points of one user, where l_i is the privacy preference of user i (e.g., 2-trajectories for u1 and u2, a 3-trajectory for u3), under overall privacy budget ε. 27

Related work. Differential privacy on finite streams: (1) protecting the data of each single timestamp: event-level privacy [8]; (2) protecting each user's data over all timestamps: user-level privacy [8]; FAST [7]: Laplace + Kalman filter (predict/correct the noisy data). Differential privacy on infinite streams: w-event privacy [6]. [6] G. Kellaris et al., Differentially Private Event Sequences over Infinite Streams, VLDB '14. [7] L. Fan et al., FAST: Differentially Private Real-time Aggregate Monitor with Filtering and Adaptive Sampling, SIGMOD '13. [8] C. Dwork, Differential Privacy in New Settings, SODA, pp. 174-183, 2010. 28

Related work. Differential privacy on finite streams: (1) event-level privacy [8] protects the data of each single timestamp (a budget ε at every t1..t5); (2) user-level privacy [8] protects each user's data over all timestamps (one budget ε over the whole stream); FAST [7]: Laplace + Kalman filter (predict/correct the noisy data). Differential privacy on infinite streams: w-event privacy [6]. [7] L. Fan et al., FAST: Differentially Private Real-time Aggregate Monitor with Filtering and Adaptive Sampling, SIGMOD '13. [8] C. Dwork, Differential Privacy in New Settings, SODA, pp. 174-183, 2010. 29

Related work. l-trajectory privacy (l=3) (this study) vs. w-event privacy (w=3) [6]: l-trajectory privacy protects any l successive data points of each individual user's trajectory, while w-event privacy protects any w successive timestamps of the whole stream. [6] G. Kellaris et al., Differentially Private Event Sequences over Infinite Streams, VLDB '14. 30

Outline Motivation: opportunity & privacy risk Problem definition and analysis Proposed solution Experiment results Conclusion & Future work 31

Overview of our solution. A rigorous and flexible privacy model: l-trajectory privacy (challenge: proving how to achieve it with a DP mechanism; l1, l2, l3 are the preferences of users u1, u2, u3). A PPDP algorithm to publish counts in real time (challenge: real-time publishing over infinite streams): design a Greedy Algorithm (GA) to dynamically add noise at each timestamp; then, since that alone injects too much noise, re-publish the noisy data with Minimum Manhattan Distance (MMD) to the current data. Pipeline: (b) infinite trajectories → (c) raw statistics (risky) → l-trajectory privacy model & PPDP algorithm (Laplace Mechanism, GA, re-publish strategy MMD) → (c') l-trajectory private data: park 2.8 4.1 -0.1 2 0.9; office 0.1 2.1 2.1 2.2 0.1; bar 0.2 1.9 1.2 0.7 6.1; gym 0.9 -1.1 3.5 -0.9 1.2. 35

How to achieve l-trajectory privacy. We proved how to achieve it with conventional DP mechanisms (e.g., the Laplace/Exponential mechanism): let ε_i be a privacy budget variable at each timestamp t_i; the sum of the ε_i over the timestamps of any l-trajectory must be at most ε. For the example trajectories (u1: park@t1, bar@t5, with l1; u2: bar@t2, park@t5, with l2; u3: park@t1, office@t2, gym@t3, bar@t5, with l3), the constraints are ε1+ε5 ≤ ε, ε2+ε5 ≤ ε, ε1+ε2+ε3 ≤ ε, ε2+ε3+ε5 ≤ ε. 38

How to achieve l-trajectory privacy. If we knew all data in advance, the ε_i could be computed by Linear Programming: maximize (approximate) utility subject to the l-trajectory constraints ε1+ε5 ≤ ε, ε2+ε5 ≤ ε, ε1+ε2+ε3 ≤ ε, ε2+ε3+ε5 ≤ ε. 39

How to achieve l-trajectory privacy. However, we need real-time publishing over infinite streams, so Linear Programming, which requires all data in advance, cannot be applied directly; the ε_i must be allocated online. 40

How to achieve l-trajectory privacy. At each timestamp, constraints involving future budgets are still unknown: at t3 we can check ε1+ε2+ε3 ≤ ε, but ε1+ε5 ≤ ε?, ε2+ε5 ≤ ε?, and ε2+ε3+ε5 ≤ ε? depend on the not-yet-allocated ε5. 41
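The per-window budget constraint can be checked mechanically. A sketch under an assumed data layout (budgets per timestamp, each user's trajectory as a sorted timestamp list; the function name is illustrative):

```python
def satisfies_l_trajectory(budgets, trajectories, l, total_epsilon):
    """Check that, for every user, the budgets spent on any window of l
    successive data points of that user's trajectory sum to at most the
    overall budget.

    budgets: {timestamp index: epsilon_i spent at that timestamp}
    trajectories: {user id: sorted timestamps where the user has a point}
    """
    for user, ts in trajectories.items():
        for start in range(len(ts) - l + 1):
            window = ts[start:start + l]
            if sum(budgets[t] for t in window) > total_epsilon:
                return False
    return True

# The slide's example stream: u1 at t1,t5; u2 at t2,t5; u3 at t1,t2,t3,t5
budgets = {1: 0.3, 2: 0.3, 3: 0.3, 4: 0.0, 5: 0.4}
trajs = {"u1": [1, 5], "u2": [2, 5], "u3": [1, 2, 3, 5]}
satisfies_l_trajectory(budgets, trajs, 2, 1.0)  # -> True
```

With this allocation every 2-point window stays within ε = 1.0, matching the constraints ε1+ε5 ≤ ε, ε2+ε5 ≤ ε, etc.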

PPDP algorithm: GA. GA: a Greedy Algorithm that finds approximately optimal ε_i under incomplete information. Idea: exponential decay. (1) Set the unknown future budgets to 0 and compute the remaining room w in the tightest constraint (i.e., ε3 = w at most); (2) use w/2 as the value of ε3, reserving the other w/2 for future timestamps. 43
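The exponential-decay idea can be sketched for one fixed constraint window (a simplification: in the actual algorithm the binding l-trajectory window changes as the stream evolves):

```python
def greedy_budget(past_budgets_in_window, total_epsilon):
    """One greedy step: set the unknown future budgets to 0, compute the
    remaining room w in the binding window, then spend w/2 now and
    reserve w/2 for the future (exponential decay)."""
    w = total_epsilon - sum(past_budgets_in_window)
    return max(w / 2.0, 0.0)

# Spending half the remainder each step decays the per-timestamp budget
# geometrically, so any number of future timestamps still fits within eps.
eps = 1.0
spent = []
for _ in range(4):
    spent.append(greedy_budget(spent, eps))
# spent == [0.5, 0.25, 0.125, 0.0625]; the total never exceeds eps
```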

PPDP algorithm: MMD. A re-publishing strategy to improve the data utility of counts. Idea: real-life data has periodically repeated patterns, so instead of always adding fresh noise we can re-publish the adjacent (Adj) noisy data (studied in [6]), or re-publish the previously published noisy data n1..n(t-1) with Minimum Manhattan Distance (MMD) to the real data r_t of the current timestamp, selected privately via the Exponential Mechanism [5]; at each timestamp we publish n_mmd or n_adj. [5] F. McSherry and K. Talwar, Mechanism Design via Differential Privacy, FOCS 2007. [6] G. Kellaris et al., Differentially Private Event Sequences over Infinite Streams, VLDB '14. 44
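The MMD selection can be sketched with the exponential mechanism, scoring each previously published vector by its negative Manhattan distance to the current real counts (a sketch only; the paper's exact scoring and sensitivity analysis may differ):

```python
import math
import random

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def pick_mmd(published, current_real, epsilon, sensitivity=1.0):
    """Exponential mechanism over already-published noisy vectors:
    Pr[i] proportional to exp(epsilon * score_i / (2 * sensitivity)),
    with score_i = -ManhattanDistance(published_i, current_real)."""
    scores = [-manhattan(p, current_real) for p in published]
    m = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(epsilon * (s - m) / (2.0 * sensitivity)) for s in scores]
    r = random.random() * sum(weights)
    for p, w in zip(published, weights):
        r -= w
        if r <= 0:
            return p
    return published[-1]
```

With a large budget the mechanism almost always returns the truly closest vector; shrinking ε makes the choice noisier, which is the price of keeping the selection private.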

Personalized privacy control. The proposed framework combines dynamic budget allocation (GA) with a private approximation strategy. At timestamp t, the collected database D_t yields the set of uids us_t and the real statistics r_t; GA Dynamic Budget Allocation takes the history (ε_1..ε_{t-1}, us_1..us_t, n_1..n_{t-1}, r_t) and outputs budgets ε_{t,1} and ε_{t,2}; the Private Approximation Strategy chooses Adj or MMD; Private Publishing then releases the private statistics n_t under budget ε_t into the published stream. The framework is capable of publishing diverse statistical data over infinite trajectory streams; Adj/MMD is optimized for publishing counts. 45

Outline Motivation: opportunity & privacy risk Problem definition and analysis Proposed solution Experiment results Conclusion & Future work 46

Experiments. Four real-life trajectory datasets. PeopleFlow (1): people moving; 18 locations, 11,406 users, 1,694 timestamps at 5-min intervals over 6 days, 102,468 data points. Geolife (2): people moving, diverse types of mobility, sparse (Beijing); 56 locations, 170 users, 1,440 timestamps at 1-min intervals over 24 hours*, 240,990 data points. T-Drive (3): taxis moving (Beijing); 21 locations, 2,698 users, 886 timestamps at 10-min intervals over ~7 days, 37,255 data points. WorldCup98 (4): webpage click streams; 1,000 locations, 550,762 users, 722 timestamps at 1-hour intervals over ~35 days, 1,258,542 data points. 1. http://pflow.csis.u-tokyo.ac.jp/ 2. http://research.microsoft.com/en-us/projects/geolife/ 3. http://research.microsoft.com/en-us/projects/tdrive/ 4. http://ita.ee.lbl.gov/html/contrib/WorldCup.html *Aggregating 50,176 hours of timestamps into 24 hours by omitting the date. 47

Experiments: compared methods

| Method   | Budget allocation               | Approximation strategy | Real-time | Utility evaluation  |
|----------|---------------------------------|------------------------|-----------|---------------------|
| Uniform  | uniform                         | none                   | O         | bad                 |
| LP       | approximately globally optimized | none                  | X         | normal              |
| GA+Adj   | dynamic                         | republish Adj          | O         | better              |
| GA+MMD   | dynamic                         | republish MMD          | O         | best                |
| FAST [7] | fixed (uniform)                 | -                      | O         | better, not stable  |

[7] L. Fan et al., "FAST: Differentially Private Real-time Aggregate Monitor with Filtering and Adaptive Sampling," SIGMOD 2013.

Experiments: real data vs. noisy data (ll = 20, ε = 1)
(Figure: PeopleFlow counts over timestamps 150 to 300, comparing the real data against GA+MMD, FAST [7], and Uniform.)

Experiments: metrics (for all three, lower is better)
- Mean Absolute Error (MAE):
  $\mathrm{MAE}(R, N) = \frac{1}{T \cdot locs} \sum_{i=1}^{T} \sum_{j=1}^{locs} \left| r_i[j] - n_i[j] \right|$
- Mean Squared Error (MSE), sensitive to large errors:
  $\mathrm{MSE}(R, N) = \frac{1}{T \cdot locs} \sum_{i=1}^{T} \sum_{j=1}^{locs} \left( r_i[j] - n_i[j] \right)^2$
- KL-divergence, measuring the similarity of two distributions (the lower, the more similar), where $r_i[j]^*$ and $n_i[j]^*$ are the counts normalized into distributions:
  $D_{KL}(R \parallel N) = \frac{1}{T} \sum_{i=1}^{T} \sum_{j=1}^{locs} r_i[j]^* \ln\!\left( \frac{r_i[j]^*}{n_i[j]^*} \right)$
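The three metrics can be computed directly from the real and noisy count matrices R and N (T timestamps by locs locations); a straightforward sketch, with illustrative function names:

```python
import math

def mae(R, N):
    """Mean absolute error over T timestamps and locs locations."""
    T, locs = len(R), len(R[0])
    return sum(abs(r - n) for ri, ni in zip(R, N)
               for r, n in zip(ri, ni)) / (T * locs)

def mse(R, N):
    """Mean squared error; penalizes large deviations more than MAE."""
    T, locs = len(R), len(R[0])
    return sum((r - n) ** 2 for ri, ni in zip(R, N)
               for r, n in zip(ri, ni)) / (T * locs)

def kl(R, N):
    """Average per-timestamp KL divergence between the normalized real
    and noisy count distributions (lower means more similar)."""
    total = 0.0
    for ri, ni in zip(R, N):
        rs, ns = sum(ri), sum(ni)
        total += sum((r / rs) * math.log((r / rs) / (n / ns))
                     for r, n in zip(ri, ni) if r > 0 and n > 0)
    return total / len(R)
```

Note the normalization inside `kl`: each timestamp's counts are turned into a probability distribution before the divergence is taken, matching the starred values in the formula above.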

Experiments: MAE/MSE by varying ll (ε = 1)
(Figure: MAE (top row) and MSE (bottom row) for ll = 10 to 100 on PeopleFlow, Geolife, T-Drive, and WorldCup98, comparing Uniform, FAST, GA+Adj, and GA+MMD; error axes are log-scale.)

Experiments: MAE/MSE by varying ε (ll = 20)
(Figure: MAE (top row) and MSE (bottom row) for ε from 1.0E-4 to 10.0 on PeopleFlow, Geolife, T-Drive, and WorldCup98, comparing Uniform, FAST, GA+Adj, and GA+MMD; error axes are log-scale.)

Outline
- Motivation: opportunity & privacy risk
- Problem definition and analysis
- Proposed solution
- Experiment results
- Conclusion & future work

Conclusion
- Proposed l-trajectory privacy, a rigorous and flexible privacy model for spatio-temporal data.
- Proposed a PPDP framework for spatio-temporal data.
- Proposed algorithm GA+MMD for publishing private counts with high utility in real time.
(Diagram: personal data, protected by privacy models and processed by PPDP/PPDM algorithms, yields safe open data and safe mining results; business model: privacy as services/money.)

Future Work
- Improve algorithm GA+MMD for private counts: published counts should be non-negative integers.
- More flexible privacy models: ll-trajectory privacy; location-based privacy models.
- PPDP/PPDM for mining personal spatio-temporal data.

Thank you! Any questions?