R&D Centre for Mobile Applications
Czech Technical University in Prague

Spatial Extension of the Reality Mining Dataset
Michal Ficek, Lukas Kencl

Mobility-Related Applications Wanted!
- Urban planning
- Transport management
- Content delivery
- Cloud computing for mobiles
- Delay Tolerant Networks
- ...
Appropriate user-tracking datasets for studying mobility are missing!

Reality Mining Dataset
Nathan Eagle & Alex Pentland, Massachusetts Institute of Technology, 2005
- Machine-sensed data, recorded on the mobile terminal
  - communication (voice, SMS, data, duration)
  - proximity (Bluetooth devices nearby)
  - location (date, area, Cell-ID, network)
  - phone status (charging, idle, apps in use)
  - social ties (friends, colleagues)
- 100 users during 9 months
  - MIT students and staff
  - a unique and rare source of data
- Publicly available for download*
  - tens of publications
- Spatial dimension not available! It is needed to
  - exploit the dataset for further research
  - support and validate results derived to date
* http://reality.media.mit.edu/download.php

From Cell to Geographical Coordinates - I
Cell identification in a GSM / UMTS mobile network: the Cell Global Identity consists of
- Mobile Country Code (MCC)
- Mobile Network Code (MNC)
- Location Area Code (LAC)
- Cell Identifier (Cell-ID)

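The four-part identifier above can be sketched as a small record type. This is only an illustration: the class name, field names, and the numeric values are hypothetical and are not taken from the dataset. The optional MCC and MNC fields reflect that the Reality Mining Dataset stores only the LAC and Cell-ID.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CellGlobalIdentity:
    """Hypothetical container for the four parts of a Cell Global Identity."""
    mcc: Optional[int]  # Mobile Country Code (None when not recorded)
    mnc: Optional[int]  # Mobile Network Code (None when not recorded)
    lac: int            # Location Area Code
    cell_id: int        # Cell Identifier

    def is_complete(self) -> bool:
        # A lookup that needs the full identity requires all four parts.
        return self.mcc is not None and self.mnc is not None


# Arbitrary illustrative values, not real cells.
full = CellGlobalIdentity(mcc=1, mnc=2, lac=6040, cell_id=12345)
partial = CellGlobalIdentity(mcc=None, mnc=None, lac=6040, cell_id=12345)
print(full.is_complete(), partial.is_complete())
```

A record like `partial` is what a dataset holding only LAC and Cell-ID provides, which is why lookups that require the full identity cannot be used directly.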
From Cell to Geographical Coordinates - II
MCC, MNC, LAC, Cell-ID -> Longitude / Latitude
- Cell-ID list from the operator
  - not publicly available
- Cell-ID databases
  - publicly available (OpenCellID, CellDB, CellSpotting): only sparse coverage
  - commercial (Location-API.com, Cell-id Look-up API): limited access
  - need the full Cell Global Identity
- The Reality Mining Dataset contains only Location Area Code and Cell-ID
- Google Location API
  - hidden HTTP API for the My Location service
  - direct request from a plain PC is possible: the computer mimics a mobile phone
  - accepts even LAC and Cell-ID alone: LAC / Cell-ID -> Lat / Lon

Location Data Acquisition
- Almost 33,000 unique cells present in the Reality Mining Dataset
- Retrieved geographical coordinates for 46.75% of them
Reasons for the missing locations:
1. Five-year delay between dataset recording and location retrieval
2. Mobile network evolution (3G) and renumbering
3. Boston (Massachusetts, USA) area completely missing due to operator acquisitions
Figure: All retrieved cell locations

Outlier Detection and Removal - I
- Unlikely places
- Distorted trajectories
  - impossibly distant hops (between continents in 2 seconds)
- Why not simply remove distant hops?
  - because airplanes fly fast and far
- Why not use the Mobile Country Code to detect cells outside the corresponding country?
  - trajectory distortion is caused by reuse of LAC / Cell-ID pairs
  - MCC and MNC codes are missing!
Figures: Distorted trajectories; Unlikely places

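To make the "distant hop" problem concrete, the sketch below computes the speed implied by two consecutive location records using the haversine great-circle distance. It is an illustration only and not the method of the talk, which relies on LAC-clustering instead. The function names, the record layout (timestamp in seconds, latitude, longitude), and the approximate city coordinates are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def implied_speed_kmh(first, second):
    """Speed implied by two records of the form (timestamp_seconds, lat, lon)."""
    dt = second[0] - first[0]
    if dt <= 0:
        raise ValueError("records must be in strictly increasing time order")
    return haversine_km(first[1], first[2], second[1], second[2]) / (dt / 3600.0)


# Approximate coordinates of Boston and Prague (assumed for illustration).
boston = (42.36, -71.06)
prague = (50.08, 14.44)

# The same hop observed 2 seconds apart and 8 hours apart.
hop_speed = implied_speed_kmh((0, *boston), (2, *prague))
flight_speed = implied_speed_kmh((0, *boston), (8 * 3600, *prague))
print(round(hop_speed), round(flight_speed))
```

Both hops cover the same distance, so a pure distance threshold would discard the plausible flight together with the impossible 2-second jump.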
Outlier Detection and Removal - II
Observation: Location Area consistency
- cells with the same LAC form compact clusters
- Location Areas cover small areas
- a common mobile network design pattern
Figures:
- All retrieved cell locations
- Location Areas in CZ
- Location Areas from cell locations: neither compact nor covering small areas

LAC-clustering Algorithm
A heuristic extension of general agglomerative hierarchical clustering:
1. Select cells with the same LAC
2. Let each cell location be a cluster
3. repeat: merge the two closest clusters
4. until only one cluster remains
5. Use a distance criterion for forming clusters
6. Select one Location Area representative
... and iterate over all LACs
Figure annotation: 35 km GSM radio limit

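The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the slide does not specify the linkage rule, the exact distance criterion, or how the representative is chosen, so the code assumes single linkage, a 35 km cut-off (borrowed from the GSM radio limit shown on the slide), and the largest cluster as the representative. Stopping the merge loop at the cut-off gives the same clusters as building the full hierarchy and cutting it at that distance, because single-linkage merge distances never decrease. All function names are hypothetical.

```python
import math
from collections import defaultdict


def haversine_km(p, q):
    """Great-circle distance in km; p and q are tuples starting with (lat, lon)."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlmb = math.radians(q[1] - p[1])
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def agglomerate(points, max_km):
    """Single-linkage agglomerative clustering, stopped at the distance cut-off."""
    clusters = [[p] for p in points]  # step 2: every cell starts as its own cluster
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage (assumption): distance between the closest members.
                d = min(haversine_km(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_km:  # step 5: distance criterion (assumed cut-off)
            break
        clusters[i].extend(clusters[j])  # step 3: merge the two closest clusters
        del clusters[j]                  # j > i, so index i stays valid
    return clusters


def lac_representatives(cells, max_km=35.0):
    """cells: iterable of (lac, cell_id, lat, lon). Returns {lac: kept cells}."""
    by_lac = defaultdict(list)
    for lac, cell_id, lat, lon in cells:  # step 1: group cells by LAC
        by_lac[lac].append((lat, lon, cell_id))
    # Step 6 (assumption): keep the largest cluster as the Location Area representative.
    return {lac: max(agglomerate(pts, max_km), key=len) for lac, pts in by_lac.items()}


# Made-up cells: three close together and one far away sharing the same LAC.
cells = [
    (1, 101, 50.08, 14.44),
    (1, 102, 50.10, 14.50),
    (1, 103, 50.05, 14.40),
    (1, 104, 40.71, -74.00),
]
kept = lac_representatives(cells)
print(sorted(c[2] for c in kept[1]))
```

With these made-up inputs the distant cell 104 ends up in its own cluster and is treated as an outlier. The pairwise search is cubic in the number of cells per LAC, which is acceptable for a sketch but not tuned for large inputs.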
Location Data Acquisition - Outliers Removed
- Removed ~1,500 unique cells from ~15,000 cell locations with coordinates
- 42% of all unique locations in the Reality Mining Dataset left
- Locations all around the world, not only Boston!
- Note: the correctness of the result cannot be verified
Figure: LAC-clustered cell locations

Movement Trajectory Reconstruction
- Space-time cube visualization
- Missing locations will distort the trajectory
  - unknown locations: mobility info missing
- Reconstruct the user trajectory from consistent subsequences that
  - have a majority of known locations (e.g. > 95%)
  - are representatively long (e.g. > 300 locations)
Figure: Trajectory example from the Reality Mining Dataset

Finding Consistent Subsequences - I
Example
- desired subsequence length L = 5
- desired known-locations ratio C = 60% (3 out of 5)
Result: two consistent subsequences
- 6 out of 10 known (60%)
- 6 out of 9 known (approx. 67%)
Figure legend (along a time axis): known location, unknown location, consistent subsequence

Finding Consistent Subsequences - II
1. Handle the location sequence as a discrete signal:
   known location ... 1
   unknown location ... 0
2. Apply a moving average filter with window size L
3. Select locations above the desired known-locations ratio threshold C
Figure parameters: L > 300 locations, C > 95%

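The three steps above can be sketched as a sliding-window count over the 0/1 signal. This is a minimal sketch and not the authors' code. It assumes that every location inside a qualifying window is selected, that overlapping or adjacent windows merge into one subsequence, and that the threshold is inclusive; the threshold is passed as an integer count (for example 3 out of 5) to avoid floating-point comparisons. The function names and the sample signal are hypothetical.

```python
def consistent_mask(known, window, min_known):
    """Mark locations covered by at least one window holding >= min_known known locations.

    known: sequence of 1 (known location) and 0 (unknown location).
    """
    n = len(known)
    mask = [False] * n
    if n < window:
        return mask
    running = sum(known[:window])  # known count in the first window
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window by one: add the entering sample, drop the leaving one.
            running += known[start + window - 1] - known[start - 1]
        if running >= min_known:
            for k in range(start, start + window):
                mask[k] = True
    return mask


def runs(mask):
    """Return (start, end) half-open index ranges of consecutive True values."""
    out, start = [], None
    for i, flag in enumerate(mask):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(mask)))
    return out


# Made-up signal, using L = 5 and 3 known locations out of 5 (C = 60%).
signal = [1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1]
subsequences = runs(consistent_mask(signal, window=5, min_known=3))
print(subsequences)
```

Dividing the running count by the window size gives the moving average itself; comparing counts is equivalent and keeps the arithmetic exact.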
Reality Mining Dataset Locations Summary
- Fraction of unique known locations per user varies between 0% and 68%
  - retrieved locations don't cover the whole user pool
- Time spent by users at known locations is similar
  - same ratio, but different distribution (heavy / long tail)
  - users spend most of their time at only a few places
- Consistent subsequences describe approx. 0.6% to 15% of the user mobility trace
  - from a total of 9200 hours of tracking
  - depending on the parameters of the consistent subsequences
  - cell locations most likely cover business, conference & vacation trips
- Total time in consistent subsequences
  - with 95% of known locations and a length of 300 locations: approx. 326 days in all consistent subsequences

Conclusion
- Method for retrieving geographical locations for GSM / UMTS cells
  - based on querying the Google Location API
  - LAC-clustering for outlier detection and removal
  - representative movement trajectory reconstruction
- Retrieved coordinates for 42% of unique cells from the Reality Mining Dataset
  - method suitable for similar datasets
- Spatial information opens further research possibilities
  - 326 days of valuable user-mobility traces

What Next?
- What can be derived from such spatial data?
  - usage patterns at different locations and when traveling at different speeds
  - mobile user movement prediction
  - validation and support of previously published results
  - active vs. passive tracking comparison
  - correlation of mobility and behavior of the user
  - ...
- Greater level of Cell-ID obfuscation for further dataset recordings?
  - hashing / obfuscation that preserves the nature of the cellular network
- Limits of informed consent?
  - the Google Location API did not exist at the time the Reality Mining Dataset was recorded
  - can we provide trustworthy guarantees about restrictions on future information retrieval from monitored data at all?

Thank you!
Interested? Why not read our paper?
Michal Ficek, Lukas Kencl: Spatial Extension of the Reality Mining Dataset
Czech Technical University in Prague
michal.ficek@rdc.cz
www.rdc.cz