Modelling exploration and preferential attachment properties in individual human trajectories

Similar documents
Modeling the scaling properties of human mobility

Validating general human mobility patterns on massive GPS data

Exploring the Patterns of Human Mobility Using Heterogeneous Traffic Trajectory Data

Encapsulating Urban Traffic Rhythms into Road Networks

Supplementary Figures

Detecting Origin-Destination Mobility Flows From Geotagged Tweets in Greater Los Angeles Area

A universal model for mobility and migration patterns

Data from our man Zipf. Outline. Zipf in brief. Zipfian empirics. References. Zipf in brief. References. Frame 1/20. Data from our man Zipf

Exploring Human Mobility with Multi-Source Data at Extremely Large Metropolitan Scales. ACM MobiCom 2014, Maui, HI

Power Laws & Rich Get Richer

Understanding individual human mobility patterns Supplementary Material

Uncovering the Digital Divide and the Physical Divide in Senegal Using Mobile Phone Data

Socio-Economic Levels and Human Mobility

Understanding Social Characteristic from Spatial Proximity in Mobile Social Network

Spatiotemporal Patterns of Urban Human Mobility

A time series plot: a variable Y t on the vertical axis is plotted against time on the horizontal axis

6.207/14.15: Networks Lecture 12: Generalized Random Graphs

data lam=36.9 lam=6.69 lam=4.18 lam=2.92 lam=2.21 time max wavelength modulus of max wavelength cycle

Exploring Urban Areas of Interest. Yingjie Hu and Sathya Prasad

A Cloud Computing Workflow for Scalable Integration of Remote Sensing and Social Media Data in Urban Studies

Urban characteristics attributable to density-driven tie formation

Crossing Over the Bounded Domain: From Exponential To Power-law Intermeeting

Human Mobility Models for Opportunistic Networks

ESS 461 Notes and problem set - Week 1 - Radioactivity

Exploring spatial decay effect in mass media and social media: a case study of China

A Deterministic Simulation Model for Sojourn Time in Urban Cells with square street geometry

Limits and Continuity

Demand and Trip Prediction in Bike Share Systems

Friendship and Mobility: User Movement In Location-Based Social Networks. Eunjoon Cho* Seth A. Myers* Jure Leskovec

Performance of Low Density Parity Check Codes. as a Function of Actual and Assumed Noise Levels. David J.C. MacKay & Christopher P.

transportation research in policy making for addressing mobility problems, infrastructure and functionality issues in urban areas. This study explored

Guidelines on Using California Land Use/Transportation Planning Tools

Area Classification of Surrounding Parking Facility Based on Land Use Functionality

Extracting mobility behavior from cell phone data DATA SIM Summer School 2013

Predicting New Search-Query Cluster Volume

Temporal patterns in information and social systems

Prediction of Citations for Academic Papers

Department of. Computer Science. Empirical Estimation of Fault. Naixin Li and Yashwant K. Malaiya. August 20, Colorado State University

Alg2H Ch6: Investigating Exponential and Logarithmic Functions WK#14 Date:

Experiment 4 Free Fall

MAINIDEA Write the Main Idea for this section. Explain why the slope of a velocity-time graph is the average acceleration of the object.

Machine Learning Using Machine Learning to Improve Demand Response Forecasts

Data Mining and Analysis: Fundamental Concepts and Algorithms

Linear Motion 1. Scalars and Vectors. Scalars & Vectors. Scalars: fully described by magnitude (or size) alone. That is, direction is not involved.

Lessons From the Trenches: using Mobile Phone Data for Official Statistics

A framework for spatio-temporal clustering from mobile phone data

Acceleration. OpenStax College

Economic Transition and Growth

4 Bias-Variance for Ridge Regression (24 points)

Heavy Tails: The Origins and Implications for Large Scale Biological & Information Systems

Data Mining II Mobility Data Mining

Sampling Distributions. Introduction to Inference

Investigation of Possible Biases in Tau Neutrino Mass Limits

PHYSICS 653 HOMEWORK 1 P.1. Problems 1: Random walks (additional problems)

Shlomo Havlin } Anomalous Transport in Scale-free Networks, López, et al,prl (2005) Bar-Ilan University. Reuven Cohen Tomer Kalisky Shay Carmi

exponential function exponential growth asymptote growth factor exponential decay decay factor

arxiv: v1 [stat.ml] 3 Jul 2012

ECEn 370 Introduction to Probability

Analysis of Disruption Causes and Effects in a Heavy Rail System

Intrinsic and Cosmological Signatures in Gamma-Ray Burst Time Proles: Time Dilation Andrew Lee and Elliott D. Bloom Stanford Linear Accelerator Center

Museumpark Revisit: A Data Mining Approach in the Context of Hong Kong. Keywords: Museumpark; Museum Demand; Spill-over Effects; Data Mining

Regularity and Conformity: Location Prediction Using Heterogeneous Mobility Data

Determining Sample Sizes for Surveys with Data Analyzed by Hierarchical Linear Models

Estimating Large Scale Population Movement ML Dublin Meetup

Cover times of random searches

A. 8 B. 10 C. 10 D Which expression is equivalent to (x 3 ) 1?

Physics 2A Chapter 1 HW Solutions

MSc Thesis. Modelling of communication dynamics. Szabolcs Vajna. Supervisor: Prof. János Kertész

Clustering non-stationary data streams and its applications

VISUAL EXPLORATION OF SPATIAL-TEMPORAL TRAFFIC CONGESTION PATTERNS USING FLOATING CAR DATA. Candra Kartika 2015

Evaluation of urban mobility using surveillance cameras

Leverage Sparse Information in Predictive Modeling

Understanding Individual Daily Activity Space Based on Large Scale Mobile Phone Location Data

Exploring the Association Between Family Planning and Developing Telecommunications Infrastructure in Rural Peru

SETTLEMENT. Student exercises. The Kuril Biocomplexity Project:

Quantitative Economics for the Evaluation of the European Policy

A non-linear elastic/perfectly plastic analysis for plane strain undrained expansion tests

Adaptive Contact Probing Mechanisms for Delay Tolerant Applications. Wang Wei, Vikram Srinivasan, Mehul Motani

CIV3703 Transport Engineering. Module 2 Transport Modelling

A MODEL OF TRAVEL ROUTE CHOICE FOR COMMUTERS

Universal model of individual and population mobility on diverse spatial scales

Single-Trial Neural Correlates. of Arm Movement Preparation. Neuron, Volume 71. Supplemental Information

Ms. York s AP Calculus AB Class Room #: Phone #: Conferences: 11:30 1:35 (A day) 8:00 9:45 (B day)

Tomás Eiró Luis Miguel Martínez José Manuel Viegas

A model of capillary equilibrium for the centrifuge technique

SPACE Workshop NSF NCGIA CSISS UCGIS SDSU. Aldstadt, Getis, Jankowski, Rey, Weeks SDSU F. Goodchild, M. Goodchild, Janelle, Rebich UCSB

Chapter 2 Motion in One Dimension. Slide 2-1

Chapter 12 Summarizing Bivariate Data Linear Regression and Correlation

How to estimate quantiles easily and reliably

Acceleration. 3. Changing Direction occurs when the velocity and acceleration are neither parallel nor anti-parallel

Modelling data networks stochastic processes and Markov chains

Community Engagement in Cultural Routes SiTI Higher Institute on Territorial Systems for Innovation Sara Levi Sacerdotti

Social and Technological Network Analysis. Lecture 11: Spa;al and Social Network Analysis. Dr. Cecilia Mascolo

Introduction to Probability and Statistics (Continued)

The Spatial Structure of Cities: International Examples of the Interaction of Government, Topography and Markets

Inform is a series of white papers designed to provide advice on material characterization issues. Mie theory The rst 100 years

Learning Likely Locations

2007 First-Class Mail Rates for Flats* Weight (oz) Rate (dollars) Weight (oz) Rate (dollars)

Integrated Math 1 Honors Module 9H Quadratic Functions Ready, Set, Go Homework Solutions

Experiment 2. F r e e F a l l

Transcription:

1.204 Final Project 11 December 2012 J. Cressica Brazier Modelling exploration and preferential attachment properties in individual human trajectories using the methods presented in Song, Chaoming, Tal Koren, Pu Wang, and Albert-László Barabási. 2010. Modelling the Scaling Properties of Human Mobility.

1. Introduction I took the fi nal project as an opportunity to understand the next step of trajectory modeling after the Homework 3 analysis that employed the continuous time random walk (CTRW), by following the methods presented by Song et. al. (2010) in the paper "Modelling the Scaling Properties of Human Mobility" ("Song"). Song proposes an extension of the CTRW model to include the human mobility tendencies of exploration and preferential return. The authors resolve the three scaling anomalies of the asymptotic behavior of humans (see Section 2), although they still do not incorporate short-term periodic modulations such as diurnal travel patterns and correlations in the order in which locations are visited (Song 2010, 5). This project is also a starting point for learning how to work with mobile phone data, which has great potential as a proxy for traditional mobility datasets. Mobility models based on call data records (CDR) have been used to describe and approximate differences in mobility based on urban form properties across different cities (Kang et al. 2011) as well as complex and patterns of use within cities (Toole et al. 2012) that are otherwise diffi cult to document without trip journals. Therefore, in this project, I am working towards an exploration of what kinds of information are contained within mobile phone records, by making a preliminary analysis of the relationships between trajectory measures and local urban form characteristics. In the future, I hope to incorporate these measures into research on comprehensive energy use monitoring across both transport (jump distances) and building operations (pause times at different locations). The core of the individual mobility model proposed by Song contains only two parameters to describe the probability of visiting a new location, as well as one function to describe preferential return. But to consistently describe the empirical data, the modeler must also parameterize the displacement and waiting time distributions for that dataset. Furthermore, to verify that the different properties of a dataset, such as the CDR data used in this project, are consistent with the interrelationships of these parameters as proposed by Song, several more trajectory measures should be calculated. This process results in four main steps, which structure this fi nal report: Determine the rate of exploration of new locations, the individual probability of visiting new locations, and the probability of returning to a previously-visited location (section 3a) Parameterize the probability density functions for displacement and waiting time (section 3b) Determine the relationship between the number of distinct locations visited and the time duration, as well as between the frequency of visiting a location and its order of visitation, to confi rm the relationship between these parameters are consistent, as proposed by Song (section 3c) Generate simulated trajectories from the parameterized model, and compare the properties of these trajectories to the empirical data (section 4) Brazier 2

2. Methods and Data Song uses two mobility datasets, mobile phone trajectories and location-based service records, to empirically demonstrate the properties of the exploration and preferential return model. The mobile phone dataset contains the "time-resolved trajectories of three million anonymized mobile-phone users" (Song 2010, 1) for an entire year. For this project, I will use the San Francisco mobile phone dataset containing 3126 users' trajectories (3107 after removing users with S = 1), a difference in sample size that is an important consideration for the meaning of the comparison with the article's results. The trajectories span one month, instead of one year. I read the data fi le into Matlab as a matrix with the following columns for each user, after removing the semicolon delimiters: User ID Number of events 4082000002 18 Time [seconds] X [km] Y [km] Tower ID Other indices 398224-2687.8144-4311.8173 3041 1 2 398318-2687.2603-4311.6812 3119 1 2... The graphs below illustrate the expected heterogeneity in the CDR trajectors among users, with some users having frequent activity, while others contain little usable data: 2640 2650 2660 X [km] 2670 2680 2690 2700 100 148 196 244 292 340 388 436 484 532 580 628 676 724 772 800 t [hours] Y [km] 4290 4300 4310 4320 4330 4340 4350 4360 100 148 196 244 292 340 388 436 484 532 580 628 676 724 772 800 t [hours] Brazier 3

With these datasets, Song illustrates three critical differences, or scaling anomalies, between the human trajectories and CTRW models: 1. Humans visit fewer distinct locations over time, compared with the approximations of CTRW and Levy fl ights. 2. The visitation frequency distribution of humans tends towards zero, instead of becoming asymptotically constant for the least-visited locations, as it does in CTRW and Levy fl ights. 3. The mean square displacement (MSD) of humans grows more slowly than the CTRW prediction of logarithmic growth, so ultraslow diffusion should be incorporated into the model. The authors then propose a new CTRW formulation that accounts for exploration limits (new locations are added more slowly over time) and preferential return (visitation probability is linked to the historical frequency of visiting each location). The model consists of three steps (shown below in Figure 2 from Song): 1. The individual pauses for an amount of time selected from the pause distribution P(Δt) 2. The individual may then move to a new location with the probability P new = ρs -γ, or return to an already visited location with the complementary probability P ret = 1- ρs -γ 3. If the individual went exploring, s/he will move a distance selected from the jump distribution P(Δr). If the individual made a preferential return, s/he will return to a location with probability i = f i, where f i is the frequency of each location s visits, and f k ~ k -ζ. The authors validate the model by generating trajectories and comparing the resulting radius of gyration distribution to those of the original mobile phone trajectories, showing they are in agreement. Time: t + Δt Δr Time: t P new = ρs γ (i) Exploration S = 5 S = 4 P ret = 1 ρs γ (ii) Preferential Return Π i = f i f 1 f 2 f 3 S = 4 Brazier 4

3. Model Parameterization a. Measure rate of visiting new locations, ΔS vs. S; estimate γ; estimate ρ for each user ΔS 1 = 1 ΔS 2 = 1/2 = 0.5 ΔS 3 = 1/4 = 0.25 Δn 1 = 1 Δn 2 = 2 Δn 3 = 4 1 2 1 1 3 1 2 2 3 4 4 1 2 3 4 Event location index t S i In the fi gure above, I diagram the procedure for determining the rate of visiting new locations versus returning to previously visited locations. Song initially sets up a vector in which visited locations are denoted with 0 and new locations receive the value 1, then averages all the visited locations corresponding to each S; instead, I directly count the number of visited locations up to and including the next new location, Δn i. The change in the number of distinct locations corresponding to the previously visited S i is then ΔS i = 1 / Δn i. Song then 'normalizes' ΔS for each user by dividing by its average value, <ΔS>, for 1 < S < 100. The rate of change for each individual, ΔS/<ΔS>, is therefore not dependent on that individual's ρ value, which cannot be determined until the global relationship P new ~ S -γ has been established. Unfortunately, for users with S< 100, this average <ΔS> will be artifi cially high and will result in a lower γ value in the ΔS/<ΔS> ~ S -γ relationship. Only 370 out of 3100 users reach 70 distinct locations within a month, and just 130 visit one hundred different cell towers. It is diffi cult to obtain a sample size that is not affected by the averaging problem and that does not have substantial noise in the data point distribution. I illustrate this problem in the three plots below, for a user with S ~= 5, 50, and 100, respectively. The 'slope' of the fi tted line ranges from 0 < γ < 0.61, which demonstrates that users who don't span enough locations (S) have highly variable fi tted line results. For a small sample size, an accurate value of γ cannot be determined. Brazier 5

To move forward, I conducted a sensitivity analysis on the amount of users to include for determining the global ΔS/<ΔS> ~ S -γ, and the fi tted parameter range was 0.098 < γ < 0.121 for the range of users with 5 < S max < 90. I fi nally select users with S max > 50, with a corresponding model parameter of γ = 0.121. The plot below compares these results to the corresponding Figure 3a (Song 2010, 4). 10 1 a 10 1 S -0.12 S -0.21 S -0.21 Unbinned Binned S/< S> ΔS/ ΔS 10 10 1 10 2 S 10 1 1 10 100 S The Song data point mean is centered much lower than in my plot, which means that my methods require further review. In the interests of time, I did not reproduce Figure 3b, which shows the empirical and estimated relationship between the probability of return to a location versus the frequency of prior visitation, Π = f. Instead, I will directly use the stabilized frequencies f i, which are calculated in Section 3c, as a probability distribution for the simulations. I give two examples of determining ρ in the relationship ΔS = P new = ρs -0.12 for each user, below. 10 1 10 1 P new P new 10 1 10 1 10 2 S 10 1 10 1 10 2 S Brazier 6

The following histogram for ρ over all users with S max > 50 exhibits the same trend as the ρ distribution in Song's Figure 3c. The distribution is centered around ρ 0.97, compared with the average value of ρ 0.6 in Figure 3c. Again, this distribution represents a less complete dataset, in which users who have visited fewer locations by the end of one month also produce lower ρ than in Song's dataset. 0.16 c 0.14 4 0.12 0.1 3 P(ρ) 0.08 0.06 P( ρ ) 2 0.04 0.02 1 0 0.4 0.6 0.8 1 1.2 1.4 0 0 0.2 0.4 0.6 0.8 1.0 ρ b. Measure jump distance and pause time distributions, P( r) and P( t); estimate α, β To measure the distances and pauses, I must fi rst fi lter the CDR to obtain the events that are separated by a fi xed time interval (2 hours), in order to correct the bias from the widely varying interevent times that characterize the calling pattern of each user" (Song 2010, SC4). Song shows that the remaining trajectory points represent a set of known distances traveled at known time intervals, which are not "driven by the statistics of call activity." I based the following diagram of this fi ltering process on Figure S3 in the Supplementary Information (Song 2010). The shaded boxes represent events for which the interevent time interval is 2 hours. I then fi tted the probability density function to the exponential curve, P(Δr) ~ Δr - -1-α, to estimate α. The jump distance parameter α = 0.71 appears to follow a more rapidly decaying distribution, compared with Song's estimate of α = 0.55. (see next page) Brazier 7

= 10 = 20 = 10 = 5 1 2 1 2 2 3 2 1 1 1 1 3 = 2hr 10 1 T = 2 hour (D 1 ) Δr 1.71 Δr 1.55 b P(Δr) 10 2 10 3 10 4 1 10 100 Δr [km] The estimation of pause times, Δt, is slightly more involved. In order to determine the probability of fi nding a pause time of length Δt, which needs to be fi ltered from the complete trajectory according to similar criteria as the displacements (that the location at a fi xed time interval must be known for the events before and after ), I must divide the number of those pause times by the probability of sampling an equally long interval in which events may occur at multiple locations, per the diagram and relationship below (adapted from Song 2010, Figure S3). (Equation S2) n r =4 n r =4 n r =4 n r = 4 n r = 6 1 2 1 2 2 3 2 1 1 1 1 3 n w = 2 n w = 4 Brazier 8

Moreover, the probability density function of the pause times follows an exponential distribution that contains the additional term of exp(-δt/τ), where τ is the cutoff time of 17 hours in Song's case. The complete relationship is then P(Δt) ~ Δt - -1-β exp(-δt/τ).the distribution for this dataset is shown below, in comparison with the supplementary information Figure S4. The fi tted parameter is β = 0.x, versus Song's fi nding of β = 0.8. Shown below is the frequency of pauses of length n w. I could not get the measurement of n r to work in time, but the plot below is beginning to exhibit a similar distribution to Song's plod 10 5 10 4 b Frequency 10 3 10 2 10 1 10 1 10 2 Δt c. Confirm relationships S(t) ~ t μ and f k ~ k -ζ, estimate μ and ζ First I check that the number of new locations visited, S(t), has an exponential relationship with the time duration, S(t) ~ t μ. I generated a vector of the S i value at a regular time interval of Δt = 2 hours, and then averaged these values for each S i across all users. Note that this user set is not the same as the fi nal set of users for estimating γ in Section 3a. By fi tting the exponential curve, I obtained μ = 0.66, which is fairly consistent with the value of 0.6 shown in Song's Figure 1a (see below). Next I generated the Zipf plot for the frequency of visitation of each location, f -ζ, versus the ordered rank of its fi rst visitation. I did not have time to re-sample or bin the users within particular S regimes (i.e., S max 30, S max 60, etc.), but the plot below does show that the parameter ζ varies with S max, which agrees with the fi nding of potentially different γ values for different regimes of S max in Section 3a. The parameter value is similar to Song's, ζ 1.2. However, the article's Figure 1b plot also shows that the Song dataset may contain users for which S < 100, in which case my observations in Section 3a regarding the infl uence of users' different maximum S on the relationship ΔS/<ΔS> ~ S -γ may be incorrect. The comparative plots are below. Brazier 9

1 3 3 2 3 1 2 2 3 2 4 4 3 2 1 1 1 1 3 5 1 10 2 X: 739 Y: 45.3 100 Day Week Month Year r g = 32 km r g = 64 km S(t) 10 1 Empirical data S(t) 10 r g = 128 km r g = 256 km t 0.6 t 0.8 10 1 10 2 10 3 t (h) 1 1 10 100 1,000 10,000 t (h) 10 1 k 1.2 S < 30 S < 60 S > 60 b 10 1 f.k 10 2 f k 10 3 10 4 10 1 10 2 10 3 k 10 2 10 3 S = 20 S = 40 S = 60 k 1.2 1 10 100 k Brazier 10

In the Supplementary information to the article, Song derives the relationships between the parameters (γ, β, μ, ζ) which follow equations (5) and (6) (Song 2010, 3), shown below. The relationship between ζ and γ seems to dictate that γ must = 0.2 for ζ 1.2; however with γ 0.12 in my case, the steeper curves shown in the Zipf plot for f -ζ do not suggest that ζ < 1.2. For the relationship μ = β / (1 + γ), my parameter estimates imply that (could not estimate \beta in time). Therefore, there are some internal inconsistencies with the parameter estimate, which might be due to the CDR sample size. 4. Model Results If I had more time, I would set up the model generation function similar to the Levy fl ight function in Homework 3. The function would follow the procedure laid out in Section 2: 1. Accept inputs of total number of events to generate, model parameters (γ, α, β), and individual model parameters and return location probabilities (ρ, f ). 2. Generate a pause time by selecting from the pause distribution P(Δt). 3. Generate a random number between 0 and 1 and compare this number to P new = ρs -γ 4. If the number satisfi es the probability of exploration, then select a distance from the jump distribution P(Δr) and a random direction for the move. 5. If the number instead falls in the range of the complementary probability of return to a previous location, P ret = 1- ρs -γ, select a return location from the probability distribution i = f i, where f i is the frequency of each location s visits. 6. Repeat process for total number of events. In the CDRs, there is not an explicit 'pause time' between each event at different locations, so I think the above model that inserts pause times before each move is acceptable but am not certain. I also intended to calculate the radius of gyration for each user, r g, to compare the model results to the empirical data. 5. Concluding Remarks I have been exposed to some important considerations when working with CDR data, as opposed to more continuous time and locational data. I have also surmised that the parameter values of the Song model can be quite variable, especially when the dataset is small. I will leave to future work the more rigorous analysis of Brazier 11

the relationship between mobility measures, such as the 'strength of exploration' ρ, and urban form variables that had initially prompted me to undertake this project. Following Yan Ji s thesis (2011), I would have liked to assign income and population predictors according to the user s home tower area. Then I would calculate certain destination distance predictors, distance from city center and distance from local center, for each tower, as well as other urban form indicators that measure connectivity and accessibility. Based on a regression or clustering analysis, I would have liked to explore the usefulness of the parameters in Song's model compared with parameters proposed in other studies (cf. Kang 2011, Toole 2012). References Ji, Yan. 2011. Understanding Human Mobility Patterns Through Mobile Phone Records: A Cross-Cultural Study. MIT. Kang, C., X. Ma, D. Tong, and Y. Liu. 2011. Intra-urban Human Mobility Patterns: An Urban Morphology Perspective. Physica A: Statistical Mechanics and Its Applications. Song, Chaoming, Tal Koren, Pu Wang, and Albert-László Barabási. 2010. Modelling the Scaling Properties of Human Mobility. Nature Physics 6 (10) (September 12): 818 823. Toole, J. L., M. Ulm, M. C. González, and D. Bauer. 2012. Inferring Land Use from Mobile Phone Activity. In Proceedings of the ACM SIGKDD International Workshop on Urban Computing, 1 8. Brazier 12