Mining and Visualizing Spatial Interaction Patterns for Pandemic Response

Similar documents
A Framework for Discovering Spatio-temporal Cohesive Networks

Detail in network models of epidemiology: are we there yet?

ARIC Manuscript Proposal # PC Reviewed: _9/_25_/06 Status: A Priority: _2 SC Reviewed: _9/_25_/06 Status: A Priority: _2

Construction and Analysis of Climate Networks

A MULTISCALE APPROACH TO DETECT SPATIAL-TEMPORAL OUTLIERS

Flow Mapping and Multivariate Visualization of Large Spatial Interaction Data

Cell-based Model For GIS Generalization

Outline. 15. Descriptive Summary, Design, and Inference. Descriptive summaries. Data mining. The centroid

Epidemic reemergence in adaptive complex networks

SPATIAL DATA MINING. Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM

Spatial Epidemic Modelling in Social Networks

Spatial Analysis I. Spatial data analysis Spatial analysis and inference

Issues around verification, validation, calibration, and confirmation of agent-based models of complex spatial systems

Encapsulating Urban Traffic Rhythms into Road Networks

Detecting Origin-Destination Mobility Flows From Geotagged Tweets in Greater Los Angeles Area

A Preliminary Model of Community-based Integrated Information System for Urban Spatial Development

A comparison of optimal map classification methods incorporating uncertainty information

EMMA : ECDC Mapping and Multilayer Analysis A GIS enterprise solution to EU agency. Sharing experience and learning from the others

Application of Statistical Physics to Terrorism

SPATIAL ANALYSIS. Transformation. Cartogram Central. 14 & 15. Query, Measurement, Transformation, Descriptive Summary, Design, and Inference

Spatial Regression. 1. Introduction and Review. Luc Anselin. Copyright 2017 by Luc Anselin, All Rights Reserved

Using Image Moment Invariants to Distinguish Classes of Geographical Shapes

Comparison of spatial methods for measuring road accident hotspots : a case study of London

Process Driven Spatial and Network Aggregation for Pandemic Response

Visualizing and Analyzing Activities in an Integrated Space-time Environment: Temporal GIS Design and Implementation

Your web browser (Safari 7) is out of date. For more security, comfort and the best experience on this site: Update your browser Ignore

CONCEPTUAL DEVELOPMENT OF AN ASSISTANT FOR CHANGE DETECTION AND ANALYSIS BASED ON REMOTELY SENSED SCENES

CS : Spatial Data Modeling and Analysis. Geovisualization

The Recognition of Temporal Patterns in Pedestrian Behaviour Using Visual Exploration Tools

Geovisualization of Attribute Uncertainty

Applying Visual Analytics Methods to Spatial Time Series Data: Forest Fires, Phone Calls,

Neighborhood Locations and Amenities

The Challenge of Geospatial Big Data Analysis

A BASE SYSTEM FOR MICRO TRAFFIC SIMULATION USING THE GEOGRAPHICAL INFORMATION DATABASE

Creation and modification of a geological model Program: Stratigraphy

Global Scene Representations. Tilke Judd

Mapcube and Mapview. Two Web-based Spatial Data Visualization and Mining Systems. C.T. Lu, Y. Kou, H. Wang Dept. of Computer Science Virginia Tech

Appendixx C Travel Demand Model Development and Forecasting Lubbock Outer Route Study June 2014

Web Visualization of Geo-Spatial Data using SVG and VRML/X3D

Components for Accurate Forecasting & Continuous Forecast Improvement

GENERALIZATION IN THE NEW GENERATION OF GIS. Dan Lee ESRI, Inc. 380 New York Street Redlands, CA USA Fax:

Uncovering the Digital Divide and the Physical Divide in Senegal Using Mobile Phone Data

A framework for spatio-temporal clustering from mobile phone data

Principal Component Analysis, A Powerful Scoring Technique

Visitor Flows Model for Queensland a new approach

Creation and modification of a geological model Program: Stratigraphy

Chapter 6 Spatial Analysis

Interact with Strangers

An introduction to clustering techniques

GIS CONCEPTS ARCGIS METHODS AND. 3 rd Edition, July David M. Theobald, Ph.D. Warner College of Natural Resources Colorado State University

OBEUS. (Object-Based Environment for Urban Simulation) Shareware Version. Itzhak Benenson 1,2, Slava Birfur 1, Vlad Kharbash 1

Networks as vectors of their motif frequencies and 2-norm distance as a measure of similarity

Introduction to Google Mapping Tools

Module 4 Educator s Guide Overview

Contents. Interactive Mapping as a Decision Support Tool. Map use: From communication to visualization. The role of maps in GIS

Exploring the Impact of Ambient Population Measures on Crime Hotspots

Data Mining. 3.6 Regression Analysis. Fall Instructor: Dr. Masoud Yaghini. Numeric Prediction

Question 1. Feedback Lesson 4 Quiz. You submitted this quiz on Tue 27 May :04 PM MYT. You got a score of out of

Virgili, Tarragona (Spain) Roma (Italy) Zaragoza, Zaragoza (Spain)

Re-Mapping the World s Population

Spatial Decision Tree: A Novel Approach to Land-Cover Classification

Cloud-Based Computational Bio-surveillance Framework for Discovering Emergent Patterns From Big Data

GIS and Remote Sensing Support for Evacuation Analysis

Announcing a Course Sequence: March 9 th - 10 th, March 11 th, and March 12 th - 13 th 2015 Historic Charleston, South Carolina

Exploring the Patterns of Human Mobility Using Heterogeneous Traffic Trajectory Data

Neighborhood social characteristics and chronic disease outcomes: does the geographic scale of neighborhood matter? Malia Jones

Treemaps and Choropleth Maps Applied to Regional Hierarchical Statistical Data

Caesar s Taxi Prediction Services

IV Course Spring 14. Graduate Course. May 4th, Big Spatiotemporal Data Analytics & Visualization

ACCELERATING THE DETECTION VECTOR BORNE DISEASES

The AIR Severe Thunderstorm Model for the United States

Visualization of Trajectory Attributes in Space Time Cube and Trajectory Wall

BAYESIAN MODEL FOR SPATIAL DEPENDANCE AND PREDICTION OF TUBERCULOSIS

GIS CONCEPTS ARCGIS METHODS AND. 2 nd Edition, July David M. Theobald, Ph.D. Natural Resource Ecology Laboratory Colorado State University

GOVERNMENT GIS BUILDING BASED ON THE THEORY OF INFORMATION ARCHITECTURE

An Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Role of GIS in Tracking and Controlling Spread of Disease

DETECTION OF DANGEROUS MAGNETIC FIELD RANGES FROM TABLETS BY CLUSTERING ANALYSIS

Interactive GIS in Veterinary Epidemiology Technology & Application in a Veterinary Diagnostic Lab

Link to related Solstice article: Sum-graph article, Arlinghaus, Arlinghaus, and Harary.

Texas A&M University

GIS Spatial Statistics for Public Opinion Survey Response Rates

ECE 592 Topics in Data Science

1 Heuristics for the Traveling Salesman Problem

GeoComputation 2011 Session 4: Posters Discovering Different Regimes of Biodiversity Support Using Decision Tree Learning T. F. Stepinski 1, D. White

Module 2, Investigation 1: Briefing Where do we choose to live and why?

Three-Dimensional Visualization of Activity-Travel Patterns

Computing Correlation Anomaly Scores using Stochastic Nearest Neighbors

A route map to calibrate spatial interaction models from GPS movement data

CS570 Introduction to Data Mining

Human lives involve various activities

Scalable Bayesian Event Detection and Visualization

KAAF- GE_Notes GIS APPLICATIONS LECTURE 3

Enhance Security, Safety and Efficiency With Geospatial Visualization

Road Network Analysis as a Means of Socio-spatial Investigation the CoMStaR 1 Project

Cipra D. Revised Submittal 1

Clustering Analysis of London Police Foot Patrol Behaviour from Raw Trajectories

VISUAL ANALYTICS APPROACH FOR CONSIDERING UNCERTAINTY INFORMATION IN CHANGE ANALYSIS PROCESSES

Cartographic and Geospatial Futures

Transcription:

Mining and Visualizing Spatial Interaction Patterns for Pandemic Response Diansheng Guo Abstract This paper views the movements of people among locations as a spatial interaction problem, where locations interact with each other via shared visitors. The purpose of this research is to analyze the general characteristics of such spatial interactions and propose a strategy to explore and visualize such a large volume of data and reveal important patterns. The proposed strategy combines clustering, ordering, and visual techniques to efficiently present overall patterns and provide an interface for visual data mining and decisionsupport in response to possible disease outbreaks. 1 Introduction The movements of individuals between specific locations and the contacts between different groups of people are essential in modeling disease spread [7], [13], [8]. The daily activity of people from place to place forms a complex and dynamic network of spatial interactions between locations. It is a challenging problem to analyze such spatial interactions. Very often interaction data are difficult to acquire. However, researchers have begun to generate or discover such data sources, for example, generating simulation data of people s daily activities [3], [7] or using surrogate information (e.g., bank notes) to model human travel activities [5]. Interaction data is often very large, unique, and complex (in terms of potential patterns), which demands special data mining algorihtms to process and effective visual approaches to reveal/present the discovered patterns. The data used in this paper is a large collection of simulated human activities in an urban setting for a normal day, which includes over 8 million records and each record shows a specific activity (e.g., the person ID, location ID, activity type, time, duration, etc.). There are altogether over 1.6 million people and 181,295 unique locations [7]. This data set is available from http://ndssl.vbi.vt.edu/opendata. This research was partially supported by the United States Department of Homeland Security through the National Consortium for the Study of Terrorism and Responses to Terrorism (START), grant number N00140510629. However, any opinions, findings, and conclusions or recommendations in this document are those of the authors and do not necessarily reflect views of the U.S. Department of Homeland Security. Department of Geography, University of South Carolina. 709 Bull Street, Columbia, SC 29208. Email: guod@sc.edu This activity data set can be analyzed from different perspectives. This paper views such activities as a spatial interaction problem [2], [6], where locations interact with each other via the visitors that they share. 2 introduces several basic definitions and parameters to characterize such a spatial interaction network. In 3 I present an overall analysis of the characteristics of the spatial interaction network formed with the simulation data. Based on the general analysis of the spatial interaction properties, 4 proposes a strategy to aggregate data, synthesize information, and visually present patterns to allow human interpretation and decision-support during pandemic outbreaks. Figure 1: A bipartite of people s daily activities (with time ignored). Two locations are connected if they share at least one visitor. For example, locations 1 and 3 share three visitors (i.e., person 1, 2, and 3). 2 Problem Definition The activity data that involve N locations and P people is tranformed to a N N spatial interaction matrix. If the dynamics of interactions across time are also considered, then we have a space-time system defined by N N T, where T is time (in seconds). To quantify the strength of interaction between two individual locations, we can use the number of shared visitors between the two locations as an indicator of

Figure 2: The total number of location pairs (L pc ) that share exactly P s visitors. Figure 3: The connection ratio C r for all location pairs that share P s visitors. the interaction strength. In this section and the next section, the analysis focuses on interactions between individual locations, while in 4 the primary focus is on interactions between clusters of locations. The measure of interaction strength between two location clusters is presented in 4.3. 2.1 Bipartite graph. Bipartite graphs are often used to model the people-location relations [7]. As shown in Figure 1, for each day one person may visit several locations (places) and a location (place) may have many different visitors. However, existing analyses of a bipartite graph often focus on either people or locations and examine the degree distribution for each node [7], [5]. How to define the relationship between locations that do not have direct connections in the barpartite graph? In this paper, two locations are connected if they share at least one visitor for a normal day. This definition is similar to that presented in [15] but is different from that defined in [7] where two locations connect only if there is at least one person who moves directly from one location to the other location. 2.2 Spatial interaction. Not all elements in the N N spatial interaction matrix have values. Some locations do not share any visitor and thus have no connection to each other. The value for each connection, i.e., the weight for the connection between two locations can be defined by the number of visitors (P s ) they share or simply the geographic distance. Thus, we can view the spatial interaction matrix as a graph/network of locations. The density of a graph (see [16]) is defined as: (1) = 2L pc /(L(L 1)), where is the graph density, L pc is the number of edges (or connections) in the graph, and L is the total number of nodes (or vertices) involved. Here I define a very similar measure, connection ratio (C r ): (2) C r = L pc /L. As we know that at least L 1 connections are needed to connect L locations. Therefore, if C r is close to 1.0, the network or graph is barely connected and easy to break. 2.3 Location degree. As explained above, two locations are connected to each other if they share at least one visitor. All of these connections among locations form a spatial interaction network/graph. The degree of a location in this network/graph is the number of connections it has to other locations. 3 General Properties In this section a series of analysis of the spatial interaction network are presented to demonstrate several characteristics of the network/graph. 3.1 Location pair count vs. shared visitors. Figure 2 shows the total number of location pairs (L pc ) that share exactly P s visitors. The two variables exhibit a strong power-law relationship. Figure 3 shows the connection ratio C r for all location pairs that share P s visitors.

Figure 4: The relationship between the number of shared visitors (P s ) and the average distance of all location pairs that share exactly P s visitors. Dark blue points are the average distances and pink points are the standard deviation values. Figure 5: The relationship between the location degree and the number of locations that are of that degree. Except for locations of low degrees (e.g., < 100), the two variables exhibit a strong power-law relationship. See 2.3 for the definition of the degree of a location. 3.2 Geographic distance vs. shared visitors. Figure 4 presents the relationship between the number of shared visitors (P s ) and the average distance of all location pairs that share exactly P s visitors. Both the average and the standard deviation are shown. It shows that location pairs that share one or two visitors vary greatly in distances, with an average distance of about 8km. However, locations that share more visitors tend to be closer in space. Again, it shows roughly a powerlaw trend, which means that strong connections (in terms of shared visitors) are often localized. 3.3 Location degree. Figure 5 shows yet another power-law relationship, which is between the location degree and the total number of locations of that degree, especially for degrees higher than 100. 4 Mining and Visualizing Interaction Patterns The analysis result presented above indicates that strong interations are primarily between locations that are geographically close. It also indicates that it is possible to segment and partition the spatial interaction network to explore interaction patterns in an aggregated form. However, there are two major challenges. First, the large data size requires that the data mining algorithm should be scalable [9] or the data should be sampled. Second, given the complexity of potential patterns, effective visual approaches are needed help human users (or decision-makers) to visually recognize and interpret patterns [10], [11]. In this research, all locations are first grouped into a cluster hierarchy. Graph-based clustering methods are potential candiates for this task [9], [12], [4]. However, most of these clustering methods are often of complexity O(n 2 log n) or O(n 3 ), which is not efficient enough to process over 180,000 locations. Approaches that can be used to aggregate locations and reduce the data size include sampling [14], simplifying (e.g., removing weak location connections), or partitioning [1]. In this study, a sampling-based aggregation is developed (see next subsection). Then a hierarchical clustering of those aggregated locations is carried out to further synthesize the data and facilitate the detection of interaction patterns. 4.1 Location sampling. A sampling procedure is carried out to reduce the network size while preserving its major structures and patterns. Locations with a degree higher than 600 are singled out as samples (or representatives). Altogether there are 4056 sample locations. Figure 6 shows these sample locations on top of all 181,295 locations. As the map shows, most of the sample locations are within dense areas and/or close to highways. Figure 7 shows the relationship between sample locations and urban limits (boundaries) in the greater Portland metrapolitan area. Most of the sample locations are within urban limits. 4.2 Hierarchical clustering of locations. A three-step clustering procedure is used to aggregate all

Figure 6: The 4056 Sample locations (purple points) are plotted on top of all locations (gray points) in the Greater Portland area. Major highways in this area are also shown (red lines). Figure 7: This map shows the relationship between the 4056 Sample locations (purple points) and the urban areas (in yellow). Most of the sample locations are within a urban limit. locations (not just samples) into a hierarchy of clusters. First, the complete-linkage hierarchical clustering method is used to group the the 4056 sample locations into a hierarchy of clusters based on their geographic distances to each other. Second, each non-sample location is assigned to its nearest sample location based on its straight-line distances to every sample location. In other words, each sample location represents a group of locations. However, some sample locations only represent a very small number of locations if there are many other sample locations in the neighborhood (e.g., in the downtown area). Therefore, a third step is taken to recursively process the hierarchy of clusters from the bottom up. Two clusters will be merged into one if both of them are small (in terms of the number of locations they represent). After the merging process, the 4056 groups of locations (each of which is represented by a sample location) are aggregated to 248 clusters, each of which may contain one or more samples and all the locations that they represent. The average cluster size (in terms of the total number of locations they contain) is 755, with a minimum of 260 and maximum of 2002. Most of the 248 clusters are of a size between 500 and 1000. Note that locations contained in the same cluster are geographic neighbors (i.e., they form a contiguous region) as the hierarchy of clusters are derived based on geographic distances. 4.3 Measure of interaction strength. To quantify and visualize interaction patterns, a measure for interaction strength between two clusters of locations is defined in the following equation: (3) I ab = 1000 2 C ab /(S a S b ), where I ab is the interaction strength between clusters a and b, C ab is the number connections between the two clusters of locations, and S a and S b are the cluster size for a and b respectively. This measure is a normalized value that takes into account of the two cluster sizes. For example, a value of 500 for I ab indicates that 500 connections are expected between the two clusters if each of them contains 1000 locations. Note that clusters a and b can be the same cluster and in this case the interaction strength measures the internal interactions within that cluster. 4.4 1D ordering of location clusters. An ordering can be obtained from the hierarchical clustering result derived above. To derive a one-dimensional ordering from a cluster hierarchy, several different methods are available [10], [4]. In this study the optimal ordering method introduced in [4] is used to order the 248 location clusters. The purpose of the 1D ordering is twofold. First, to discover spatial interaction patterns (i.e., patterns in the N N space), we have to project location clusters to a one-dimensional ordering so that we can look at the interaction between locations in a 2D matrix (see Figure 9). Second, a one-dimensional or-

Figure 8: The ordering of the 248 clusters (each is represented by its centroid) based on their interaction strength to each other. Interestingly, the order mostly follows a spatial order (i.e., points are primarily connected to their neighbors). Figure 9: The spatial interaction matrix (248 248). Each cell has a value representing the interaction strength between two clusters of locations. dering can preserve more information than a cluster hierarchy. An ordering tries to arrange location clusters so that clusters with strong interactions (i.e., higher interaction strength values) are placed as close as possible to each other in the ordering. Figure 8 shows the ordering of the 248 clusters (each is represented by its centroid) based on their interaction strength to each other. Interestingly, although this ordering is derived based on interaction strength, it still mostly follows a spatial order (i.e., points are primarily connected to their neighbors). This indicates that strong interactions are between locations that are geographically close. 4.5 Visualizing and detecting spatial interaction patterns. The matrix view shown in Figure 9 is constructed by placing the 1D ordering of the 248 location clusters on the diagonal. Therefore the size of this interaction matrix is 248 248 and each cell has a value representing the interaction strength between two clusters. To accentuate patterns, the matrix is smoothed with a 3 3 moving average window. The color of each cell is assigned according to a 5-class classification, with higher interaction strength represented with a darker color. See 4.3 for the interpretation of the interaction strength value. This matrix is an overview of the major spatial interaction patterns present in the much larger spatial interaction network, where each node is an individual location. With the clustering, ordering, and matrix view, it is clear that locations can be segmented into a number of clusters so that most interactions are within the cluster while there are much less connections to other clusters. The detection and visualization of such a cluster structure are particularly important for planning mitigation strategies during a pandemic outbreak. For example, the discovered cluster structure can help identify the region of the highest priority for immediate action (e.g., distribution of vaccines and/or travel restriction [8]) when infectious individuals are detected at a certain place. From the matrix view we can also identify critical locations that are the connection point between different clusters. These locations are shown in the matrix as vertical lines and horizontal lines, e.g., the big cross at the top-left corner is a cluster of downtown locations, which has strong connections to many other clusters. Early detection and efficient targeting are suggested by [7] as alternative strategies to contain (or prevent) a pandemic crisis. Then these critical locations are the most important places to install sensors for early detection.

5 Conclusion and Future Work The research proposes an approach to explore spatial interaction patterns in a large simulated dataset of human activities in an urban setting (the Greater Portland area). The proposed approach combines clustering, ordering, and visual methods to present overall patterns and provide an interface for visual data mining and decision-support in response to possible pandemic outbreaks. The preliminary result shows both promises and challenges ahead. So far, the approach is limited to location interactions, ignoring the time, the actual contact of people at a location, and disease-specific factors and models. To support decision-making in a time-critical situation, the mining and visualization system needs a more interactive and effective interface to allow the inquiry of a variety of information and the integration of different data sources and models. It is ultimately desirable that simulation methods, automatic data mining algorithms, and visualization approaches are coupled together to support an iterative process of knowledge acquisition. For example, a simulation model generates the data, which is then analyzed by a data mining method to discover major patterns or anomalies, which are then presented with visual approaches to help human experts interpret patterns, formulate hypotheses, and generate what-if scenarios. The simulation model is then reconfigured acoording to the new scenario and generates a new dataset. References [1] A. Abou-rjeili and G. Karypis, Multilevel algorithms for partitioning power-law graphs, Technical Report (TR 05-034), Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 2005. [2] T. C. Bailey and A. C. Gatrell, Interactive spatial data analysis, John Wiley & Sons, Inc., New York, NY, 1995. [3] C. Barrett, J. Smith, and S. Eubank, Modern Epidemiology Modeling, Scientific American, March 2005. [4] Z. Bar-Joseph, E. D. Demaine, D. K. Gifford, A. M. Hamel, T. S. Jaakkola, and N. Srebro, K-ary Clustering with Optimal Leaf Ordering for Gene Expression Data, Bioinformatics, 19 (2003), pp. 1070-1078. [5] D. Brockmann, L. Hufnagel, and T. GEisel, The scaling laws of human travel, Nature, 439 (2006), pp. 462-465. [6] A. D. Cliff and J. K. Ord, Spatial Processes: Models and Applications, Pion, London, UK, 1981. [7] S. Eubank, H. Guclu, V. Anil Kumar, M. Marathe, A. Srinivasan, Z. Toroczkai, and N. Wang, Modeling Disease Outbreaks in Realistic Urban Social Networks, Nature, 429 (2004), pp. 180-184. [8] T. C. Germann, K. Kadau, I. M. Longini, and C. A. Macken. em Mitigation strategies for pandemic influenza in the United States, PNAS 103(15), pp. 5935-5940, 2006. [9] D. Guo, D. Peuquet, and M. Gahegan, ICEAGE: Interactive Clustering and Exploration of Large and High-dimensional Geodata, GeoInformatica 7 (2003), pp. 229-253. [10] D. Guo, Coordinating Computational and Visualization Approaches for Interactive Feature Selection and Multivariate Clustering, Information Visualization, 2 (2003), pp. 232-246. [11] D. Guo, M. Gahegan, A. M. MacEachren, and B. Zhou, Multivariate Analysis and Geovisualization with an Integrated Geographic Knowledge Discovery Approach, Cartography and Geographic Information Science, 32 (2005), pp. 113-132. [12] J. Han, M. Kamber, and A. K. H. Tung, Spatial Clustering Methods in Data Mining: a survey, Geographic Data Mining and Knowledge Discovery, edited by H. J. Miller and J. Han, Taylor & Francis, London and New York, pp. 33-50, 2001. [13] N. M. Ferguson, D. A. Cummings, S. Cauchemez, C. Fraser, S. Riley, A. Meeyai, S. Iamsirithaworn, and D. S. Burke, Strategies for Containing an Emerging Influenza Pandemic in Southeast Asia, Nature, 437 (2005), pp. 209-214. [14] D. Rafiei and S. Curial, Effectively Visualizing Large Networks Through Sampling,Proceedings of the IEEE Visualization 2005 - (VIS 05), pp. 48-56, 2005. [15] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos, Neighborhood Formation and Anomaly Detection in Bipartite Graphs, Fifth IEEE International Conference on Data Mining (ICDM 05), pp. 418-425, 2005. [16] S. Wasserman, and K. Faust, Social Network Analysis, Cambridge University Press, Cambridge, UK, 1994.