GeoSpark: A Cluster Computing Framework for Processing Spatial Data


Jia Yu
School of Computing, Informatics, and Decision Systems Engineering, Arizona State University
699 S. Mill Avenue, Tempe, AZ
jiayu2@asu.edu

Jinxuan Wu
School of Computing, Informatics, and Decision Systems Engineering, Arizona State University
699 S. Mill Avenue, Tempe, AZ
jinxuanw@asu.edu

Mohamed Sarwat
School of Computing, Informatics, and Decision Systems Engineering, Arizona State University
699 S. Mill Avenue, Tempe, AZ
msarwat@asu.edu

ABSTRACT

This paper introduces GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data. GeoSpark consists of three layers: the Apache Spark Layer, the Spatial RDD Layer, and the Spatial Query Processing Layer. The Apache Spark Layer provides basic Spark functionality, including loading and storing data on disk as well as regular RDD operations. The Spatial RDD Layer consists of three novel Spatial Resilient Distributed Datasets (SRDDs) that extend the regular Apache Spark RDD to support geometrical and spatial objects. GeoSpark provides a geometrical operations library that accesses Spatial RDDs to perform basic geometrical operations (e.g., Overlap, Intersect). System users can leverage the newly defined SRDDs to effectively develop spatial data processing programs in Spark. The Spatial Query Processing Layer efficiently executes spatial query processing algorithms (e.g., Spatial Range and Join) on SRDDs. GeoSpark adaptively decides whether a spatial index needs to be created locally on an SRDD partition to strike a balance between run time performance and memory/CPU utilization in the cluster. Extensive experiments show that GeoSpark achieves better run time performance (with reasonable memory/CPU utilization) than its Hadoop-based counterparts (e.g., SpatialHadoop) in various spatial data processing applications.

1. INTRODUCTION

The volume of available spatial data has increased tremendously.
Such data includes, but is not limited to, weather maps, socioeconomic data, vegetation indices, and more. Moreover, novel technology allows hundreds of millions of users to use their mobile devices to access their healthcare information and bank accounts, interact with friends, shop online, search for interesting places to visit on the go, ask for driving directions, and more. As a consequence, everything we do on the mobile internet leaves breadcrumbs of spatial digital traces, e.g., geo-tagged tweets and venue check-ins.

Making sense of such spatial data will be beneficial for several applications that may transform science and society. For example: (1) Socio-Economic Analysis, which includes climate change analysis, the study of deforestation, population migration, and variation in sea levels; (2) Urban Planning, which assists governments in city/regional planning, road network design, and transportation/traffic engineering; (3) Commerce and Advertisement, e.g., point-of-interest (POI) recommendation services. These applications need a powerful data management platform to handle the large volume of spatial data. The challenges in building such a platform are as follows:

Challenge I: System Scalability. The massive scale of available spatial data hinders making sense of it using traditional spatial query processing techniques. Moreover, big spatial data, besides its tremendous storage footprint, may be extremely difficult to manage and maintain.
The underlying database system must be able to digest petabytes of spatial data, store it effectively, and allow applications to retrieve it efficiently when necessary.

Challenge II: Interactive Performance. Users will not tolerate delays introduced by the underlying spatial database system while it executes queries. Instead, users need to see useful information quickly. Hence, the underlying spatial data processing system must find effective ways to process a user's request with sub-second response time.

Existing spatial database systems (DBMSs) [14] extend relational DBMSs with new data types, operators, and index structures to handle spatial operations based on the Open Geospatial Consortium standards. Even though such systems largely provide full support for spatial data, they suffer from a scalability issue: being built on a relational database system, they are not scalable enough to handle large-scale analytics over big spatial data. Recent works (e.g., [2, 6]) extend the Hadoop [5] ecosystem to perform spatial analytics at scale. Although the Hadoop-based approach achieves high scalability and exhibits excellent performance in batch-processing jobs, it performs poorly in applications that require interactive performance. Apache Spark [17], on the other hand, is an in-memory cluster computing system. Spark provides a novel data abstraction called resilient distributed datasets (RDDs) [18], which are collections of objects partitioned across a cluster of machines. Each RDD is built using parallelized transformations (e.g., filter, join, or groupBy) that can be traced back to recover the RDD data. In-memory RDDs allow Spark to outperform existing models (e.g., MapReduce). Unfortunately, Spark does not provide support for spatial data and operations. Hence, users need to perform the tedious task of programming their own spatial data processing jobs on top of Spark.

This paper presents GeoSpark (footnote 1), an in-memory cluster computing system for processing large-scale spatial data. GeoSpark extends the core of Apache Spark to support spatial data types, indexes, and operations. In other words, the system extends the resilient distributed datasets (RDDs) concept to support spatial data. This problem is quite challenging because (1) spatial data may be quite complex, e.g., the geometrical boundaries of rivers and cities, and (2) spatial (and geometric) operations (e.g., Overlap, Intersect, Convex Hull, Cartographic Distances) cannot be easily and efficiently expressed using regular RDD transformations and actions. GeoSpark extends RDDs to form Spatial RDDs (SRDDs), efficiently partitions SRDD data elements across machines, and introduces novel parallelized spatial transformations and actions for SRDDs (geometric operations that follow the Open Geospatial Consortium (OGC) [13] standard) that provide a more intuitive interface for users to write spatial data analytics programs. Moreover, GeoSpark extends the SRDD layer to execute spatial queries (e.g., Range query and Join query) on large-scale spatial datasets. After geometrical objects are retrieved in the Spatial RDD layer, users can invoke spatial query processing operations provided in the Spatial Query Processing Layer, decide how spatial objects are stored, indexed, and accessed using SRDDs, and obtain the spatial query results they require.
In summary, the key contributions of this paper are as follows:

- GeoSpark, a full-fledged cluster computing framework to load, process, and analyze large-scale spatial data in Apache Spark.
- A set of out-of-the-box Spatial Resilient Distributed Dataset (SRDD) types (e.g., PointRDD and PolygonRDD) that provide in-house support for geometrical and distance operations. SRDDs provide an Application Programming Interface (API) for Apache Spark programmers to easily develop their spatial analysis programs.
- Spatial data indexing strategies that partition the input Spatial RDD using a grid structure and assign grids to machines for parallel execution. GeoSpark also adaptively decides whether a spatial index needs to be created locally on a Spatial RDD partition to strike a balance between run time performance and memory/CPU utilization in the cluster.
- An extensive experimental evaluation that benchmarks the performance of GeoSpark in spatial analysis applications such as spatial join, spatial aggregation, and spatial co-location pattern recognition. The experiments also compare and contrast GeoSpark with existing Hadoop-based systems (i.e., SpatialHadoop).

Footnote 1: The source code is available at Sarwat/GeoSpark

The rest of this paper is organized as follows. Section 2 highlights the related work. The GeoSpark architecture is given in Section 3. Section 4 presents the Spatial Resilient Distributed Datasets (SRDDs) and Section 5 explains how GeoSpark efficiently processes spatial queries on SRDDs. Section 6 describes example use cases of GeoSpark. Section 7 experimentally evaluates GeoSpark. Finally, Section 8 concludes the paper.

2. BACKGROUND AND RELATED WORK

Spatial Database Systems. Spatial database operations are vital for spatial analysis and spatial data mining. Spatial range queries ask which spatial objects exist in a certain area. Real-life examples include "Return all parks in Phoenix" or "Return all restaurants within one mile of my current location."
In terms of format, a spatial range query takes a set of points or polygons and a query window as inputs and returns all the points/polygons that lie in the query area. Spatial join queries combine two or more datasets using a spatial predicate, such as a distance relation. Real-life examples include "Tell me all of the parks in Phoenix that have rivers" and "Tell me all of the gas stations that have grocery stores within 500 feet." A spatial join query takes a set of points, rectangles, or polygons (Set A) and a set of query windows (Set B) as inputs and returns all points and polygons that lie in each of the query windows.

Spatial query processing algorithms usually make use of spatial indexes to reduce query latency. For instance, the R-Tree [8] provides an efficient data partitioning strategy to index spatial data. Its key idea is to group nearby objects and put them in the next higher-level node of the tree. The R-Tree is a balanced search tree that offers better search speed and lower storage utilization. However, its performance can be degraded by heavy update activity. The Quad-Tree [7, 16] is another spatial index, which recursively divides a two-dimensional space into four quadrants. The Quad-Tree fits uniform data well, and heavy update activity does not affect its performance.

Parallel and Distributed Spatial Data Processing. With the development of distributed data processing systems, more and more people in the geospatial area have turned their attention to handling massive geospatial data with distributed frameworks. Hadoop-GIS [2] utilizes global partition indexing and customizable on-demand local spatial indexing to achieve efficient query processing. Hadoop-GIS can also support declarative spatial queries through an integrated architecture with Hive. SpatialHadoop [6], a comprehensive extension to Hadoop, has native support for spatial data by modifying the underlying code of Hadoop.
The spatial functions SpatialHadoop provides include grid, R-tree, and R+-tree indexes, as well as spatial range, kNN, and spatial join queries. MD-HBase [12] extends HBase, a non-relational database that runs on top of Hadoop, to support multidimensional indexes, which allow for efficient retrieval of points using range and kNN queries. Parallel SECONDO [9] combines Hadoop with SECONDO, a database that can handle non-standard data types, such as spatial data, that are usually not supported by standard systems. It uses Hadoop as the distributed task manager and performs distributed operations on the spatial DBMSs of multiple nodes. Although these systems have well-developed functions, all of them are implemented on the Hadoop framework. That means they cannot avoid the disadvantages of Hadoop, especially the large number of reads and writes on disk.
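
The Quad-Tree idea mentioned earlier in this section, recursively dividing a two-dimensional space into four quadrants, can be illustrated with a minimal sketch. The code below is not taken from GeoSpark or from any of the cited systems: the class name QuadTreeSketch, the bucket capacity of 4, and the depth limit are assumptions chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative point Quad-Tree: a full leaf splits its region into four quadrants. */
final class QuadTreeSketch {
    private static final int CAPACITY = 4;   // assumed bucket size per leaf
    private static final int MAX_DEPTH = 16; // assumed limit so identical points cannot split forever

    private final double minX, minY, maxX, maxY;
    private final int depth;
    private final List<double[]> points = new ArrayList<>();
    private QuadTreeSketch[] children; // null while this node is a leaf

    QuadTreeSketch(double minX, double minY, double maxX, double maxY) {
        this(minX, minY, maxX, maxY, 0);
    }

    private QuadTreeSketch(double minX, double minY, double maxX, double maxY, int depth) {
        this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        this.depth = depth;
    }

    /** Inserts a point; returns false if it lies outside this node's region. */
    boolean insert(double x, double y) {
        if (x < minX || x > maxX || y < minY || y > maxY) return false;
        if (children == null) {
            if (points.size() < CAPACITY || depth >= MAX_DEPTH) {
                points.add(new double[] {x, y});
                return true;
            }
            split();
        }
        for (QuadTreeSketch child : children) {
            if (child.insert(x, y)) return true; // the first quadrant that accepts the point keeps it
        }
        return false; // not reached: the four quadrants cover the whole region
    }

    /** Creates the four quadrants and moves the stored points down into them. */
    private void split() {
        double midX = (minX + maxX) / 2, midY = (minY + maxY) / 2;
        children = new QuadTreeSketch[] {
            new QuadTreeSketch(minX, minY, midX, midY, depth + 1),
            new QuadTreeSketch(midX, minY, maxX, midY, depth + 1),
            new QuadTreeSketch(minX, midY, midX, maxY, depth + 1),
            new QuadTreeSketch(midX, midY, maxX, maxY, depth + 1)
        };
        for (double[] p : points) {
            for (QuadTreeSketch child : children) {
                if (child.insert(p[0], p[1])) break;
            }
        }
        points.clear();
    }

    /** Adds every stored point inside the closed query window to out. */
    void query(double qMinX, double qMinY, double qMaxX, double qMaxY, List<double[]> out) {
        // Prune subtrees whose region does not touch the query window.
        if (qMaxX < minX || qMinX > maxX || qMaxY < minY || qMinY > maxY) return;
        for (double[] p : points) {
            if (p[0] >= qMinX && p[0] <= qMaxX && p[1] >= qMinY && p[1] <= qMaxY) out.add(p);
        }
        if (children != null) {
            for (QuadTreeSketch child : children) child.query(qMinX, qMinY, qMaxX, qMaxY, out);
        }
    }

    public static void main(String[] args) {
        QuadTreeSketch tree = new QuadTreeSketch(0, 0, 100, 100);
        for (int i = 0; i < 10; i++) tree.insert(i * 10, i * 10);
        List<double[]> hits = new ArrayList<>();
        tree.query(15, 15, 55, 55, hits);
        System.out.println(hits.size() + " points in window"); // the points at 20, 30, 40 and 50
    }
}
```

The range query prunes every quadrant that does not touch the query window, which is why such an index can reduce query latency compared with scanning all points.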

[Figure 1: GeoSpark Overview. Layers, top to bottom: Spatial Query Processing Layer (Spatial Range, Spatial Join, Spatial KNN); Spatial RDD (SRDD) Layer (Point RDD, Rectangle RDD, Polygon RDD, Geometrical Operations Library); Apache Spark Layer.]

3. GEOSPARK OVERVIEW

As depicted in Figure 1, GeoSpark consists of three main layers: (1) Apache Spark Layer: this layer provides the basic Apache Spark functionality, which includes the RDD concept along with its actions and transformations. (2) Spatial Resilient Distributed Dataset (SRDD) Layer: this layer extends the regular RDD to support geometrical objects (i.e., points, rectangles, and polygons) as well as geometrical operations on these objects. (3) Spatial Query Processing Layer: this layer harnesses and extends the SRDD layer to execute spatial queries (e.g., Range query and Join query) on large-scale spatial datasets.

Apache Spark Layer. The Apache Spark Layer consists of regular operations that are natively supported by Apache Spark. These include loading data from and saving data to persistent storage (e.g., local disk or the Hadoop Distributed File System, HDFS). For instance, PersistToFile() persists the dataset in an instance to a file on local disk or on HDFS. This operation takes a file system path from the user and persists the dataset in the instance to that path.

Spatial RDDs Layer. This layer extends Spark with spatial RDDs (SRDDs) that efficiently partition SRDD data elements across machines, and introduces novel parallelized spatial transformations and actions (for SRDDs) that provide a more intuitive interface for users to write spatial data analytics programs. The SRDD layer consists of three new RDDs: PointRDD, RectangleRDD, and PolygonRDD. A geometrical operations library is also provided for every spatial RDD.

Spatial Query Processing Layer. Built on the Spatial RDD layer, the Spatial Query Processing Layer supports spatial queries (e.g., Range query and Join query) on large-scale spatial datasets.
After geometrical objects are stored and processed in the Spatial RDD layer, the user can call the spatial queries provided in the Spatial Query Processing Layer; GeoSpark processes each query in the in-memory cluster and returns the final results to the user.

SpatialRangeQuery(): This query takes a query area from the user as input, which can be a circle or an arbitrary polygon, accesses the Spatial RDD, and finds all of the points, rectangles, or polygons that fall into the query area. The result of this query is stored in a Spatial RDD instance. The user can decide whether to perform further calculations on it or persist it on disk.

SpatialJoinQuery(): In this query, a set of query areas is joined with a Spatial RDD. The query area set, which is composed of rectangles or polygons, can also be stored in a Spatial RDD. GeoSpark then joins the two Spatial RDDs and returns a new Spatial RDD instance that is extended from the original SRDDs. In this instance, each query area is followed by the objects it contains.

4. SPATIAL RDD (SRDD) LAYER

This section describes the details of spatial RDDs. Spatial RDDs are an intuitive extension of traditional RDDs. Spatial data can be easily stored in a Spatial RDD, processed by the geometrical library, and queried swiftly by the spatial query layer.

4.1 Spatial Objects Support (SRDDs)

GeoSpark supports various spatial data input formats (e.g., Comma Separated Values, Tab Separated Values, and Well-Known Text). Unlike in the past, when users spent time parsing input formats themselves, GeoSpark users only need to specify the format name and the start column of the spatial data; GeoSpark then takes over the data transformation and automatically stores the processed data in Spatial RDDs.

At the storage level, GeoSpark takes advantage of the JTS Topology Suite [4] to support spatial objects. Each spatial object is stored as a point, rectangle, or polygon type. In terms of the type of spatial objects, spatial RDDs (SRDDs) are defined as follows:

PointRDD.
PointRDD supports all 2D Point objects (which represent points on the surface of the earth) in the following format: Longitude, Latitude. All points in a PointRDD are automatically partitioned by the Apache Spark Layer and assigned to machines accordingly.

RectangleRDD. RectangleRDD supports rectangle objects in the following format: Point A (Longitude, Latitude) and Point B (Longitude, Latitude). Point A and Point B are a pair of vertices on a diagonal of the rectangle. Rectangles in a RectangleRDD are also distributed to different machines by the Apache Spark Layer.

PolygonRDD. PolygonRDD supports arbitrary polygon objects. The required format is as follows: Point A (Longitude, Latitude), Point B (Longitude, Latitude), Point C, .... The number of columns has no upper limit. The underlying Apache Spark Layer partitions PolygonRDDs across the cluster.

4.2 SRDDs Built-in Geometrical Operations

GeoSpark provides built-in operations for spatial RDDs. Once a spatial RDD is initialized, its built-in operations become available to users. Through a well-defined invocation API, users can efficiently execute complex operations on spatial data stored in spatial RDDs without dealing with the implementation details of these functionalities. From an implementation perspective, these operations interact with the Apache Spark Layer through Map, Sort, Filter, Reduce, and so on. This procedure is transparent: users focus only on spatial analysis programming details without needing any knowledge of the underlying processes.

GeoSpark provides a set of geometrical operations called the Geometrical Operations Library. This library provides native support for geometrical operations that follow the Open Geospatial Consortium (OGC) [13] standard. Examples of the geometrical operations provided by GeoSpark are as follows:

Initialize(): Initializes a Spatial PointRDD, RectangleRDD, or PolygonRDD, which support the three common geometrical objects (point, rectangle, and polygon) and related operations. The operation parses the input data and stores it with the spatial object type. The dataset should follow the format required by the corresponding GeoSpark Spatial RDD.

Overlap(): Within one Spatial RDD, finds all of the internal objects that intersect with others geometrically.

Inside(): Within one Spatial RDD, finds all of the internal objects that are contained by others geometrically.

Disjoint(): Within one Spatial RDD, returns all of the objects that are neither intersected nor contained by a particular object or object set. The parameter of this operation is the instance of the object or object set.

MinimumBoundingRectangle(): Finds the minimum bounding rectangle of each object in a Spatial RDD, or returns one large minimum bounding rectangle that contains all of the internal objects in a Spatial RDD.

Union(): In a Spatial PolygonRDD, returns the union polygon of all polygons in the RDD.

[Figure 2: Geo-Tagged Tweets in the United States. (a) Tweets spatial distribution. (b) Tweets spatial distribution with a grid.]

4.3 SRDD Partitioning

GeoSpark automatically partitions all loaded Spatial RDDs by creating one global grid file for data partitioning. The main idea for assigning each element in a Spatial RDD to the same 2-dimensional spatial grid space is as follows: first, split the spatial space into a number of grid cells of equal geographical size, which compose a global grid file; then, traverse each element in the SRDD and assign the element to a grid cell if it overlaps with that cell. Global grids are low cost in terms of both file size and data partitioning. On the other hand, the construction of the global grid file is an iterative job that requires sorting the same dataset by coordinates multiple times.

Algorithm 1: Data Partitioning
    Data: Input Spatial RDDs
    /* Step 1: Create a global grid file that has N grids */
    Find the minimum geo-boundary for the two inputs;
    Create a grid file; /* Each grid has equal geo-size */
    /* Step 2: Assign a grid ID to each element */
    foreach spatial object in the SRDDs do
        for grid = 0 to N do
            if this grid contains / intersects with this spatial object then
                Assign this grid ID to this spatial object;
                /* Duplicates happen when one spatial object intersects with multiple grids */

To partition the input datasets, GeoSpark performs the following steps for all spatial RDDs: (1) Load the original datasets from the data source, transform them to extract the spatial information, and store the result in a regular RDD. Meanwhile, cache the RDD in memory for the next iterative job. (2) Traverse the coordinates in the input RDDs multiple times to find their Minimum Bounding Rectangles (MBRs). (3) Calculate the grid file boundary (GLB), which is the intersection of the two MBRs. (4) Pre-filter the two datasets by the GLB to remove the elements that can never overlap with others. This step is especially suitable when the two spatial datasets are not in the same spatial space and only have a small intersection area.

An example of global grids is shown in Figure 2b. The elements, points (tweets) or polygons (states), lie in many grids of the same size. The number of grids may impact query performance.
Section 6 provides more explanation of this impact. Global grid construction needs to sort the original dataset by coordinates iteratively to find the geographical boundary (Minimum Bounding Rectangle), for instance by X-coordinate and by Y-coordinate in a point set. This might be the most time-consuming step of the algorithm if it relied on another computing framework (e.g., Hadoop). However, GeoSpark mitigates this cost with the in-memory caching feature of Apache Spark. GeoSpark caches (or partially caches) the target dataset in memory while executing the first sort. Subsequent sorts on the cached data then take only milliseconds.
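
The partitioning steps of this section (find the MBR of each input, intersect the two MBRs to obtain the GLB, pre-filter by the GLB, and assign grid IDs) can be sketched in plain Java as follows. This is a single-machine illustration under stated assumptions, not GeoSpark's implementation: the names GridPartitionSketch, Rect, mbr, intersection, and gridIds are hypothetical, all objects are reduced to axis-aligned rectangles, and the grid is laid out as cols x rows equal cells with row-major IDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Single-machine sketch of global grid partitioning (all names are illustrative). */
final class GridPartitionSketch {

    /** Axis-aligned rectangle; a point is a rectangle with zero width and height. */
    static final class Rect {
        final double minX, minY, maxX, maxY;

        Rect(double minX, double minY, double maxX, double maxY) {
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        }

        /** Closed-interval test: rectangles that only touch count as intersecting. */
        boolean intersects(Rect o) {
            return minX <= o.maxX && o.minX <= maxX && minY <= o.maxY && o.minY <= maxY;
        }
    }

    /** Minimum bounding rectangle of a non-empty dataset. */
    static Rect mbr(List<Rect> objects) {
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (Rect r : objects) {
            minX = Math.min(minX, r.minX); minY = Math.min(minY, r.minY);
            maxX = Math.max(maxX, r.maxX); maxY = Math.max(maxY, r.maxY);
        }
        return new Rect(minX, minY, maxX, maxY);
    }

    /** Grid file boundary (GLB): intersection of two MBRs, or null if they are disjoint. */
    static Rect intersection(Rect a, Rect b) {
        double minX = Math.max(a.minX, b.minX), minY = Math.max(a.minY, b.minY);
        double maxX = Math.min(a.maxX, b.maxX), maxY = Math.min(a.maxY, b.maxY);
        return (minX > maxX || minY > maxY) ? null : new Rect(minX, minY, maxX, maxY);
    }

    /**
     * Returns the IDs of all grid cells that intersect the object. The list is empty
     * when the object lies outside the boundary (pre-filter) and has several entries
     * when the object spans several cells (these are the duplicates).
     */
    static List<Integer> gridIds(Rect obj, Rect boundary, int cols, int rows) {
        List<Integer> ids = new ArrayList<>();
        if (!obj.intersects(boundary)) return ids;
        double cellW = (boundary.maxX - boundary.minX) / cols;
        double cellH = (boundary.maxY - boundary.minY) / rows;
        for (int row = 0; row < rows; row++) {
            for (int col = 0; col < cols; col++) {
                Rect cell = new Rect(
                    boundary.minX + col * cellW, boundary.minY + row * cellH,
                    boundary.minX + (col + 1) * cellW, boundary.minY + (row + 1) * cellH);
                if (cell.intersects(obj)) ids.add(row * cols + col);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Rect> a = Arrays.asList(new Rect(0, 0, 2, 2), new Rect(8, 8, 10, 10));
        List<Rect> b = Arrays.asList(new Rect(-5, -5, 1, 1), new Rect(4, 4, 6, 6), new Rect(9, 9, 30, 30));
        Rect glb = intersection(mbr(a), mbr(b)); // both inputs overlap in (0, 0)-(10, 10)
        for (Rect r : b) {
            System.out.println(gridIds(r, glb, 2, 2));
        }
    }
}
```

Like Algorithm 1, the sketch tests every cell for every object. Because the cells are equal in size, the candidate cell range could instead be computed directly from the object's coordinates; the exhaustive loop is kept here only to mirror the pseudocode.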

[Figure 3: GeoSpark execution model (components shown: global grids, master, workers, local indexes).]

[Figure 4: Range query program flow, DAG and data flow. Program flow: load data (read input from files, transform input format, optionally create spatial indexes on partitions), process data (check spatial relation), store data (save output as files). DAG: Map, Filter. Data flow: text file (point and other columns) in the storage system, RDD1: Point, RDD2: Point, text file (point) in the storage system.]

5. SPATIAL QUERY PROCESSING LAYER

This section describes the implementation details of the spatial query processing layer in GeoSpark.

5.1 Execution Model

Figure 3 gives the general execution model followed by GeoSpark. To accelerate a spatial query, GeoSpark leverages the grid-partitioned Spatial RDDs together with the fast in-memory computation and DAG scheduler of Apache Spark to parallelize the query execution. Spatial indexes such as the Quad-Tree and R-Tree are also provided in the spatial query processing layer. Spatial query algorithms avoid reduce-like tasks as much as possible, because reduce-like tasks may result in data shuffles across the cluster, and a data shuffle is expensive in terms of both time and resources. Users specify whether GeoSpark should consider local spatial indexes. GeoSpark then adaptively decides whether a local spatial index should be created for a certain SRDD partition based on a tradeoff between the indexing overhead (memory and time) on the one hand and the query selectivity as well as the number of spatial objects on the other hand. Since index building is an additional overhead, GeoSpark executes a full spatial object scan or nested loops (in the case of a join) for SRDD partitions that have only very few spatial objects.

This execution model has the following advantages:

Efficient data partitioning: After the SRDD data is partitioned according to the grids, spatial relations need to be calculated only for elements that lie inside the same grid.
Clusters do not need to spend time on spatial objects in different grid cells, which are guaranteed not to intersect.

Available local spatial indexes: For elements that lie in the same grid, GeoSpark can create local spatial indexes such as a Quad-Tree or R-Tree on the fly. An index-based spatial query may exhibit much higher efficiency than scan-based or nested-loop algorithms.

In the rest of this section, we present how GeoSpark uses the aforementioned execution model to process range and join queries. The algorithm for K-Nearest Neighbor queries is omitted for the sake of space; its main idea is similar.

5.2 Spatial Range Query

Generally speaking, a spatial range query is fast and consumes few resources. Therefore, the first priority of the spatial range query is to avoid unnecessary overhead and keep the algorithm neat and efficient. Spatial indexes may also improve query performance. GeoSpark implements the spatial range query algorithm in the following steps:

1. Broadcast the query window to each machine in the cluster and create a spatial index on each Spatial RDD partition if necessary.
2. For each SRDD partition, if a spatial index has been created, use the query window to query the spatial index. Otherwise, check the spatial predicate between the query window and each spatial object in the SRDD partition. If the spatial predicate holds true, the algorithm adds the spatial object to the result set.
3. Remove the duplicate spatial objects introduced by the global grid partitioning phase.
4. Return the result to the next stage of the Spark program (if needed) or persist the result set to disk.

For a better understanding of the GeoSpark spatial range query, the DAG, data flow, and program flow are shown in Figure 4. As the data flow shows, GeoSpark processes data in parallel without data shuffle to achieve better performance.
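
Steps 2 and 3 of the range query above can be sketched as follows, using the scan-based branch (no local index). This is a sequential, single-machine illustration and not GeoSpark code: the names RangeQuerySketch and Obj are hypothetical, each object carries an integer ID so that copies created by grid partitioning can be recognized, and the predicate used is "the object lies fully inside the rectangular query window".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sequential sketch of a scan-based spatial range query over grid partitions. */
final class RangeQuerySketch {

    /** Rectangular spatial object with an ID; copies in several partitions share the ID. */
    static final class Obj {
        final int id;
        final double minX, minY, maxX, maxY;

        Obj(int id, double minX, double minY, double maxX, double maxY) {
            this.id = id;
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        }
    }

    /** Returns the distinct objects that lie fully inside the closed query window. */
    static List<Obj> rangeQuery(List<List<Obj>> partitions,
                                double qMinX, double qMinY, double qMaxX, double qMaxY) {
        Map<Integer, Obj> unique = new LinkedHashMap<>();
        for (List<Obj> partition : partitions) {      // in a cluster, one worker per partition
            for (Obj o : partition) {                 // step 2: full scan, check the predicate
                boolean inside = o.minX >= qMinX && o.maxX <= qMaxX
                              && o.minY >= qMinY && o.maxY <= qMaxY;
                if (inside) unique.putIfAbsent(o.id, o); // step 3: keep one copy per object ID
            }
        }
        return new ArrayList<>(unique.values());
    }

    public static void main(String[] args) {
        Obj a = new Obj(1, 1, 1, 2, 2);
        Obj b = new Obj(2, 4, 4, 6, 6);   // spans two grid cells, so it is stored twice
        Obj c = new Obj(3, 8, 8, 9, 9);
        List<List<Obj>> partitions = new ArrayList<>();
        partitions.add(List.of(a, b));
        partitions.add(List.of(b, c));
        for (Obj o : rangeQuery(partitions, 0, 0, 7, 7)) {
            System.out.println("object " + o.id);
        }
    }
}
```

Each partition is filtered independently of the others, which is consistent with the observation above that the range query needs no data shuffle; only the final duplicate removal looks across partitions.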
5.3 Spatial Join Query

As mentioned earlier, to accelerate a spatial join query, almost all algorithms create a spatial index or grid file. However, a spatial join operation usually iterates over the original dataset multiple times to obtain global parameters such as the boundary or spatial layout. This procedure is routine and time-consuming. Thanks to the in-memory computing feature of Apache Spark, the efficiency of iterative jobs can be significantly improved. Therefore, spatial join queries in GeoSpark can achieve significantly better performance.

GeoSpark implements the parallel join algorithm proposed in [19] and [10]. The algorithm first traverses the spatial objects in the two input SRDDs. If a spatial object lies inside one grid cell, the algorithm assigns that grid ID to the element. If an element intersects with two or more grid cells, the algorithm duplicates the element and assigns a different grid ID to each copy. The algorithm then joins the two datasets by their keys, which are grid IDs. For spatial objects (from the two SRDDs) that have the same grid ID, the algorithm calculates their spatial relations. If two elements from the two SRDDs overlap, the algorithm keeps them in the final results. The algorithm then groups the results for each rectangle. The grouped results are in the following format: Rectangle, Point, Point, Point. Finally, the algorithm removes the duplicated points and returns the result to other operations in the Spark DAG or saves the final result to disk.

Algorithm 2: Spatial join with global grids and local index
    Data: PointRDD and RectangleRDD
    Result: One Key-Values set in (Rect, Point, Point, ...)
    Return two Key-Value sets in (gridid, Rectangle) and (gridid, Point);
    /* Step 1: Local spatial join execution */
    for gridid = 0 to N do
        foreach PointRDD partition in gridid do
            Create a spatial index if necessary;
        foreach RectangleRDD partition in gridid do
            foreach Rectangle in RectangleRDD partition do
                /* Index-based query */
                if index exists then
                    Search the spatial index;
                    Record the result for this rectangle;
                /* Nested loop query */
                else
                    foreach point that has this gridid do
                        if this rectangle contains this point then
                            Record this point for this rectangle;
    Return a Key-Values set in (Rectangle, Point, Point, ...);
    /* Step 2: Remove duplicates */
    foreach rectangle that occurs more than once as Key in the result do
        Combine their Values and delete duplicates;
    Return a Key-Values set in (Rectangle, Point, Point, ...);

Minimized data shuffle. GeoSpark puts considerable effort into reducing the scale of data shuffles to achieve better performance. Data shuffle occurs twice in GeoSpark's spatial join algorithm, caused by Join and GroupByKey respectively, as shown in the DAG execution diagram in Figure 5. These two data shuffle operations are inevitable. However, their scale is significantly decreased by the filters that GeoSpark performs before them. There is also another small data shuffle when the cached Spatial RDDs are sorted for the MBRs. It is not included in the figure because it does not change the data and has no impact on the main data flow.

[Figure 5: Spatial join query DAG]

6. GEOSPARK USE CASES

This section describes three example applications as use cases for GeoSpark.

6.1 Application 1: Spatial Aggregation

Assume an environmental scientist studying the relationship between air quality and trees would like to explore the tree population in San Francisco. A query may leverage the SpatialRangeQuery() provided by GeoSpark to simply return all trees in San Francisco. Alternatively, a heat map (spatial aggregate) that shows the distribution of trees in San Francisco may also be helpful. This spatial aggregate query (i.e., heat map) needs to count all trees in every region of the map. In terms of spatial queries, the heat map is a spatial join in which the target set is the tree map of San Francisco and the query area set is the set of regions or polygons that compose the map of San Francisco. The number of regions depends on the display resolution, or granularity, of the heat map. The GeoSpark program is as follows (code given in Figure 6): (1) Call the GeoSpark PointRDD initialization method to store the dataset of trees in memory. (2) Call the GeoSpark SpatialJoinQuery() on the PointRDD. The first parameter is a set of polygons, which can be stored in a Spatial PolygonRDD or a regular list. The second one is "count", which means counting the number of trees that fall in each of the regions. (3) Use a new instance of Spatial PolygonRDD to store the result of Step (2). Step (2) returns the count for each polygon. Each tuple has the format (Polygon, count), where Polygon represents the boundaries of the spatial region. (4) Call the persistence method in Spark to persist the result PolygonRDD.

6.2 Application 2: Spatial Autocorrelation

Spatial autocorrelation studies whether neighboring spatial data points might be correlated in some non-spatial attributes. Moran's I and Geary's C are two common indices of spatial autocorrelation.
Based on these indices, analysts can tell whether these objects influence each other. Global Moran's I and Geary's C reflect the spatial autocorrelation of the whole dataset. Compared with the global indices, local Moran's I and Geary's C reflect the correlation between one specific object and its neighbors. The Moran's I and Geary's C indices are defined by two specific formulas. An important part of these formulas is finding the spatial adjacency matrix. For the global and local indices, there are a corresponding global adjacency matrix and local adjacency matrix. In this matrix, each tuple indicates whether two objects, such as points, rectangles, or polygons, are neighbors in the spatial space.

An application programmer may leverage the Spatial RDDs and the spatial query processing layer provided by GeoSpark to implement the spatial autocorrelation analysis procedure (code given in Figure 7). Assume a dataset is composed of millions of point objects. The process to find the global adjacency matrix in GeoSpark is as follows: (1) Call the GeoSpark PointRDD initialization method to store the dataset in memory. (2) Call the GeoSpark spatial join query on the PointRDD. The first parameter is the query point set itself and the second one is the query distance. In this case, we assume the query distance is 10 miles. (3) Use a new instance of Spatial PairRDD to store the result of Step (2). Step (2) returns the whole point set with a new column that specifies the neighbors of each tuple within 10 miles. The format is as follows: Point coordinates (longitude, latitude), neighbor 1 coordinates (longitude, latitude), neighbor 2 coordinates (longitude, latitude), ... (4) Call the persistence method in Spark to persist the resulting PointRDD.

    /* San Francisco Trees Heat Map */
    public void SFTreeHeatMap(){
        PointRDD USTrees = PointRDD.Initialization(SparkContext,DatasetLocation);
        PolygonRDD SFTreesRegions = USTrees.SpatialJoinQueryWithIndex(SFRegions,"RTree");
        SpatialPairRDD SFTreesCount = SFTreesRegions.CountByKey();
        SFTreesCount.Persistence(SFTreesCountResultPath);
    }

Figure 6: Trees Heat Map (Java code) in GeoSpark

    /* Global Adjacency Matrix */
    public void GlobalAdjMat() {
        PointRDD targetset = PointRDD.Initialization(SparkContext,DatasetLocation);
        RDD globaladjacentmatrix = targetset.spatialjoinquery(targetset, WITHIN, 10);
        globaladjacentmatrix.persistence(matrixlocation);
    }

Figure 7: Adjacency Matrix (Java code) in GeoSpark

6.3 Application 3: Spatial Co-location

Spatial co-location is defined as two or more species often being located in a neighborhood relationship. Ripley's K function [15] is often used to judge co-location. It is usually executed multiple times to form a 2-dimensional curve for observation. The calculation of the K function also needs the adjacency matrix between two types of objects. As mentioned in the spatial autocorrelation analysis, the adjacency matrix is the result of a join query. The procedure in GeoSpark to find this matrix has the following steps: (1) Call the GeoSpark PointRDD initialization method to store the two datasets in memory. (2) Call the GeoSpark spatial join query on one of the PointRDDs. The first parameter is the other PointRDD and the second one is the query distance. In this case, we assume the query distance is 10 miles. (3) Use a new instance of Spatial PairRDD to store the result of Step (2). Step (2) returns the whole point set with a new column that specifies the neighbors of each tuple within 10 miles. The format is as follows: Point coordinates (longitude, latitude), neighbor 1 coordinates (longitude, latitude), neighbor 2 coordinates (longitude, latitude), ... (4) Call the persistence method in the Apache Spark Layer to persist the resulting PointRDD.

7. EXPERIMENTS

This section provides a comprehensive experimental evaluation that studies the performance of GeoSpark.

Compared approaches. We compare the following spatial data processing approaches:

- GeoSpark_NoIndex: the GeoSpark approach without a spatial index. In this approach, data is only partitioned according to grids.
- GeoSpark_QuadTree: the GeoSpark approach with a spatial Quad-Tree index. In this approach, a spatial Quad-Tree is created on each partition after the data is partitioned according to grids.
- GeoSpark_RTree: the GeoSpark approach with a spatial R-Tree index. In this approach, a spatial R-Tree is created on each partition after the data is partitioned according to grids.
- SpatialHadoop_NoIndex: the SpatialHadoop approach without a spatial index.
- SpatialHadoop_RTree: the SpatialHadoop approach with a spatial R-Tree index.

Cluster. Our cluster setting on Amazon EC2 is as follows: (1) Cluster size: 17 nodes, consisting of 16 r3.2xlarge workers and 1 c4.2xlarge master. (2) Operating system per node: Ubuntu Server LTS 64-bit. (3) CPU per worker node: eight Intel Xeon processors operating at 2.5 GHz with Turbo up to 3.3 GHz. (4) Memory per worker node: 61 GB in total, with 50 GB registered in Spark and Hadoop. (5) Storage per worker node: Amazon Elastic Block Store general purpose SSD, 100 GB. (6) Max throughput per worker node: 800 MBps.

Datasets. We use three real spatial datasets from the TIGER project [3] in our experiments: Zcta GB dataset, Areawater 6.5 GB dataset, and Edges 62 GB dataset. They contain all the cities, all the lakes, and all the meaningful boundaries in the US in rectangle format, respectively. All of the datasets are preprocessed by SpatialHadoop and are publicly available on its website [1].

Metrics. We use three metrics to measure GeoSpark performance: (1) Run time: the total program run time of one spatial analysis application. (2) Memory utilization: the average memory utilization during one spatial analysis application. (3) CPU utilization: the average CPU utilization / computation power during one spatial analysis application.

Monitoring tool. We deploy Ganglia [11], a scalable distributed monitoring system for high performance computing

systems such as clusters, on our Amazon EC2 experimental cluster. Our Ganglia deployment is similar to that of Apache Spark as well as Apache Hadoop. The Ganglia client program records the CPU utilization and memory utilization of its host machine every second, and the Ganglia server program automatically collects the metadata from the clients across the cluster.

Figure 8: Approaches on different size data with GeoSpark and SpatialHadoop. Panels: (a) Run time, (b) Necessary memory utilization, (c) CPU utilization.

Figure 9: Approaches on different size clusters with GeoSpark and SpatialHadoop. Panels: (a) Run time, (b) Necessary memory utilization, (c) CPU utilization.

Table 1: Join query time with different grid numbers

    Grids       (missing)   (missing)   (missing)
    NoIndex     1094s       624s        916s
    QuadTree    397s        399s        709s
    RTree       385s        408s        735s

7.1 Global grid number in spatial join query
As mentioned in Section 5, the number of global grids may impact spatial join query performance. Generally speaking, a small grid number means heavier local queries on each node, while a large one means more time spent assigning grid IDs to elements. To better understand the grid number setting, we choose different global grid numbers and join the TIGER Zcta 1.5 GB dataset with the Edges 62 GB dataset on our 16-worker cluster. The experiment result is shown in Table 1.

As Table 1 shows, the GeoSpark join query without a local spatial index has the least run time (624s) at the second of the three tested grid numbers, while the run time of the join query with Quad-Tree and R-Tree grows sub-linearly with the grid number. This makes sense because join queries spend more time on local queries if the grid number is too small. When the grid number is too large, they spend more time in the loop that assigns grid IDs to elements. In addition, the small grid cells caused by a large number of grids may be covered by elements in the datasets, which results in many more duplicates when assigning grid IDs as well as more time spent removing those duplicates.
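The duplication effect described above can be illustrated with a small sketch. This is not GeoSpark's partitioning code: the class `GridAssignment`, the helper `cellIds`, and the uniform n-by-n grid are assumptions made only for illustration. The sketch assigns one rectangle to every grid cell it overlaps and shows that the same object is copied into more cells as the grid gets finer.

```java
import java.util.ArrayList;
import java.util.List;

public class GridAssignment {

    /** Axis-aligned rectangle, e.g. the bounding box of a spatial object (hypothetical helper type). */
    public static final class Rect {
        final double minX, minY, maxX, maxY;

        public Rect(double minX, double minY, double maxX, double maxY) {
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }
    }

    /** Keeps a cell index inside [0, n - 1] so objects on the extent border stay in the grid. */
    static int clamp(int index, int n) {
        return Math.max(0, Math.min(n - 1, index));
    }

    /**
     * Returns the IDs (row * n + col) of all cells of an n-by-n uniform grid over
     * {@code extent} that the rectangle {@code r} overlaps or touches.
     * An object that spans k cells is emitted k times, once per cell.
     */
    public static List<Integer> cellIds(Rect r, Rect extent, int n) {
        double cellWidth = (extent.maxX - extent.minX) / n;
        double cellHeight = (extent.maxY - extent.minY) / n;

        int firstCol = clamp((int) Math.floor((r.minX - extent.minX) / cellWidth), n);
        int lastCol = clamp((int) Math.floor((r.maxX - extent.minX) / cellWidth), n);
        int firstRow = clamp((int) Math.floor((r.minY - extent.minY) / cellHeight), n);
        int lastRow = clamp((int) Math.floor((r.maxY - extent.minY) / cellHeight), n);

        List<Integer> ids = new ArrayList<>();
        for (int row = firstRow; row <= lastRow; row++) {
            for (int col = firstCol; col <= lastCol; col++) {
                ids.add(row * n + col);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        Rect extent = new Rect(0, 0, 100, 100);
        Rect object = new Rect(10, 10, 35, 20);
        // The same object is assigned to more and more cells as the grid gets finer.
        for (int n : new int[] {2, 10, 50}) {
            int copies = cellIds(object, extent, n).size();
            System.out.println((n * n) + " cells -> " + copies + " copies");
        }
    }
}
```

With these illustrative inputs, the object is assigned to 1 cell on the 2-by-2 grid, 6 cells on the 10-by-10 grid, and 78 cells on the 50-by-50 grid. Every extra copy has to be produced during grid ID assignment and removed again after the join, which is the cost that grows with the grid number.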
Join queries that use a spatial index, however, still spend relatively little time on local queries even when the grid number is very small. Based on the trend in Table 1, GeoSpark decides the grid number automatically with a default grid element factor, which is the upper limit on the number of elements that can lie inside one grid. Users can also set their own grid element factor for their own cases.

7.2 Impact of data size
This section compares GeoSpark with SpatialHadoop on the TIGER Areawater 6.5 GB dataset and the TIGER Edges 62 GB dataset. Both systems are tested on the 16-node cluster. Their performance is shown in Figure 8.

As depicted in Figure 8, GeoSpark and SpatialHadoop take more run time on the large dataset than on the small one. However, GeoSpark achieves much better run time performance than SpatialHadoop on both datasets. This superiority is more obvious on the small dataset. The reason is that GeoSpark can cache a larger percentage of the intermediate data in memory for the small-scale input than for the large one. Caching more intermediate data accelerates processing.

From a memory utilization perspective, GeoSpark and SpatialHadoop may crash if the registered memory is lower than the necessary memory. Normally, Spark uses all of the remaining memory, which is the difference between the registered memory (50 GB per worker node in our setting) and the necessary memory, to cache intermediate data, because the simply encoded intermediate data in GeoSpark is very large in exchange for fast reading speed in later steps. Intermediate data that doesn't fit in the cache memory is spilled to disk. As we see in Figure 8, GeoSpark's necessary memory utilization is lower than SpatialHadoop's on both data sizes because GeoSpark can move intermediate data into cache memory swiftly instead of spilling it to disk slowly. Slow spilling may make a lot of intermediate data stay in memory for a while. However, this advantage is not as obvious on the large dataset as on the small dataset. The reason is that, due to the large-scale input, more intermediate data that doesn't fit in the cache memory has to be spilled to disk slowly and stays in memory in the meantime.

Figure 10: Spatial aggregation with GeoSpark and SpatialHadoop R-Tree index join queries. Panels: (a) Run time, (b) Necessary memory utilization, (c) CPU utilization.

Figure 11: Spatial co-location with GeoSpark and SpatialHadoop R-Tree index join queries. Panels: (a) Run time, (b) Necessary memory utilization, (c) CPU utilization.

In terms of CPU utilization, GeoSpark uses more CPU computation power than SpatialHadoop. This is the price of caching intermediate data to increase the processing speed. Caching intermediate data in memory in GeoSpark needs encoding and decoding, which is computationally expensive. This overhead is much more obvious on the large dataset because of its large-scale intermediate data.

GeoSpark_QuadTree and GeoSpark_RTree have shorter run times even though these local indexes are created on the fly, while the SpatialHadoop_RTree query time, which includes creating the index in advance, is longer than that of SpatialHadoop_NoIndex. The reason is that GeoSpark only creates small spatial indexes in the grid cells whose elements might be overlapped by others, while SpatialHadoop creates spatial indexes for all of the elements. In addition, the indexes in GeoSpark are smaller but more numerous. Therefore, GeoSpark index creation has better parallelism.
It also makes sense that, in both GeoSpark and SpatialHadoop, the indexed approaches (GeoSpark_QuadTree, GeoSpark_RTree and SpatialHadoop_RTree) save memory because of the smaller query scale but consume CPU computation power for creating indexes.

7.3 Effect of cluster size
This section compares GeoSpark's performance on different-size clusters with that of SpatialHadoop. Both are tested on the TIGER Edges 62 GB dataset. Figure 9 shows the results.

As shown in Figure 9, the run time performance of GeoSpark and SpatialHadoop improves as the number of machines in the cluster increases. GeoSpark takes less run time than SpatialHadoop, especially with GeoSpark_QuadTree and GeoSpark_RTree. The reasons, as mentioned before, are the caching of intermediate data and the better parallelism of GeoSpark index creation.

The memory utilization of GeoSpark and SpatialHadoop is also better on more powerful clusters, and GeoSpark needs less memory than SpatialHadoop on the different clusters due to the caching of intermediate data. The difference in necessary memory utilization between the two is more significant on the small cluster. Although the parallelism of GeoSpark and SpatialHadoop is worse on the small cluster, GeoSpark can still move most of the intermediate data into cache memory swiftly, while SpatialHadoop's slow spilling of intermediate data to disk is seriously affected by it.

GeoSpark and SpatialHadoop exhibit less CPU utilization on the large cluster than on the small cluster. GeoSpark still has higher CPU consumption than SpatialHadoop because of the encoding and decoding needed to cache intermediate data.

7.4 Performance of different Applications
We implement the spatial aggregation and spatial co-location analysis mentioned in Section 6 with GeoSpark_RTree and SpatialHadoop_RTree to show the strength of GeoSpark on iterative analysis.
In spatial co-location, we iteratively query GeoSpark SRDDs two times with different distances, which can be defined as neighborhood relationships in the adjacency matrix. Since SpatialHadoop doesn't natively support iterative jobs, we have to run SpatialHadoop_RTree two times for a reasonable comparison. Figures 10 and 11 show their performance. For spatial aggregation, we join the TIGER Zcta 1.5 GB dataset with the TIGER Edges 62 GB dataset. For spatial co-location, we use the first point column in both the TIGER Zcta 1.5 GB dataset and the TIGER Edges 62 GB dataset and join them together.

As shown in Figures 10 and 11, GeoSpark outperforms SpatialHadoop in both applications. Their performance also improves as the cluster size increases. Since spatial aggregation doesn't need a complex transformation of the spatial join result, the performance of GeoSpark and SpatialHadoop is similar to that on the spatial join query. However, the gap between GeoSpark and SpatialHadoop is more obvious in spatial co-location than in spatial aggregation. In terms of run time, GeoSpark takes only a quarter of the time of SpatialHadoop. There are two reasons:
(1) The sets we use in spatial co-location are two point sets. The GeoSpark PointRDD provides optimized run time performance and lower memory overhead for points, while SpatialHadoop treats them as regular spatial objects that have MBRs, without any optimizations.
(2) GeoSpark automatically caches these datasets in memory as SRDDs after loading them from the storage system. Iterative jobs like spatial co-location can access these SRDDs multiple times from memory without any data transformation or data loading, while SpatialHadoop has to read and transform the original datasets again and again.

For memory utilization on spatial co-location, the difference between GeoSpark and SpatialHadoop is also more obvious than on spatial aggregation, because of the two reasons just mentioned.

In terms of CPU utilization on spatial co-location, GeoSpark consumes much more CPU computation power than SpatialHadoop. Because it spends less time on data transformation and loading, GeoSpark spends more of its run time on encoding and decoding intermediate data in cache memory, and this step is highly computation-intensive.

8. CONCLUSION AND FUTURE WORK
This paper introduced GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data.
GeoSpark provides an API for Apache Spark programmers to easily develop spatial analysis applications. Moreover, GeoSpark provides native support for spatial data indexing and query processing algorithms in Apache Spark to efficiently analyze spatial data at scale. Extensive experiments on data sizes and cluster sizes show that GeoSpark achieves better run time performance (with reasonable memory/CPU utilization) than its MapReduce-based counterparts (e.g., SpatialHadoop) in various spatial data analysis scenarios. The proposed ideas are packaged into an open source software artifact. In the future, we plan to extend the Spark SQL engine with a set of SQL User-Defined Functions (UDFs) that map to spatial data types and proximity constraints. The input/output of these UDFs would be quite similar to the UDFs defined in PostGIS, an extension to PostgreSQL that provides a SQL interface for users to express spatial operations on geographical data. We also envision GeoSpark being used by Earth and space scientists, geographers, politicians, and commercial institutions to analyze spatial data at scale. We also expect that the scientific community will contribute to GeoSpark and add new functionality on top of it to serve novel spatial data analysis applications.

9. ACKNOWLEDGMENT
This project is supported in part by the National Geospatial-Intelligence Agency (NGA) Foresight Project.

10. REFERENCES
[1]
[2] A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. H. Saltz. Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce. Proceedings of the VLDB Endowment, PVLDB, 6(11).
[3] U. Bureau. Topologically integrated geographic encoding and referencing (TIGER).
[4] M. Davis. Secrets of the JTS topology suite. Free and Open Source Software for Geospatial.
[5] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51.
[6] A. Eldawy and M. F. Mokbel. A demonstration of SpatialHadoop: An efficient MapReduce framework for spatial data. Proceedings of the VLDB Endowment, PVLDB, 6(12).
[7] R. A. Finkel and J. L. Bentley. Quad trees: a data structure for retrieval on composite keys. Acta Informatica, 4(1):1-9.
[8] A. Guttman. R-trees: a dynamic index structure for spatial searching, volume 14. ACM.
[9] J. Lu and R. H. Guting. Parallel Secondo: Boosting Database Engines with Hadoop. In International Conference on Parallel and Distributed Systems.
[10] G. Luo, J. F. Naughton, and C. J. Ellmann. A non-blocking parallel spatial join algorithm. In Data Engineering, Proceedings. 18th International Conference on. IEEE.
[11] M. L. Massie, B. N. Chun, and D. E. Culler. The Ganglia distributed monitoring system: design, implementation, and experience. Parallel Computing, 30(7).
[12] S. Nishimura, S. Das, D. Agrawal, and A. E. Abbadi. MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. In Proceedings of the International Conference on Mobile Data Management, MDM, pages 7-16.
[13] Open Geospatial Consortium.
[14] PostGIS.
[15] B. D. Ripley. Spatial statistics, volume 575. John Wiley & Sons.
[16] H. Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys (CSUR), 16(2).
[17] Spark.
[18] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation, NSDI, pages 15-28.
[19] X. Zhou, D. J. Abel, and D. Truffet. Data partitioning for parallel spatial join processing. Geoinformatica, 2(2), 1998.


More information

SOFTWARE ARCHITECTURE DESIGN OF GIS WEB SERVICE AGGREGATION BASED ON SERVICE GROUP

SOFTWARE ARCHITECTURE DESIGN OF GIS WEB SERVICE AGGREGATION BASED ON SERVICE GROUP SOFTWARE ARCHITECTURE DESIGN OF GIS WEB SERVICE AGGREGATION BASED ON SERVICE GROUP LIU Jian-chuan*, YANG Jun, TAN Ming-jian, GAN Quan Sichuan Geomatics Center, Chengdu 610041, China Keywords: GIS; Web;

More information

Esri Overview for Mentor Protégé Program:

Esri Overview for Mentor Protégé Program: Agenda Passionate About Helping You Succeed Esri Overview for Mentor Protégé Program: Northrop Grumman CSSS Jeff Dawley 3 September 2010 Esri Overview ArcGIS as a System ArcGIS 10 - Map Production - Mobile

More information

Write a report (6-7 pages, double space) on some examples of Internet Applications. You can choose only ONE of the following application areas:

Write a report (6-7 pages, double space) on some examples of Internet Applications. You can choose only ONE of the following application areas: UPR 6905 Internet GIS Homework 1 Yong Hong Guo September 9, 2008 Write a report (6-7 pages, double space) on some examples of Internet Applications. You can choose only ONE of the following application

More information

ArcGIS Enterprise: What s New. Philip Heede Shannon Kalisky Melanie Summers Sam Williamson

ArcGIS Enterprise: What s New. Philip Heede Shannon Kalisky Melanie Summers Sam Williamson ArcGIS Enterprise: What s New Philip Heede Shannon Kalisky Melanie Summers Sam Williamson ArcGIS Enterprise is the new name for ArcGIS for Server What is ArcGIS Enterprise ArcGIS Enterprise is powerful

More information

Exploring Urban Areas of Interest. Yingjie Hu and Sathya Prasad

Exploring Urban Areas of Interest. Yingjie Hu and Sathya Prasad Exploring Urban Areas of Interest Yingjie Hu and Sathya Prasad What is Urban Areas of Interest (AOIs)? What is Urban Areas of Interest (AOIs)? Urban AOIs exist in people s minds and defined by people s

More information

GIS CONCEPTS ARCGIS METHODS AND. 3 rd Edition, July David M. Theobald, Ph.D. Warner College of Natural Resources Colorado State University

GIS CONCEPTS ARCGIS METHODS AND. 3 rd Edition, July David M. Theobald, Ph.D. Warner College of Natural Resources Colorado State University GIS CONCEPTS AND ARCGIS METHODS 3 rd Edition, July 2007 David M. Theobald, Ph.D. Warner College of Natural Resources Colorado State University Copyright Copyright 2007 by David M. Theobald. All rights

More information

Road Ahead: Linear Referencing and UPDM

Road Ahead: Linear Referencing and UPDM Road Ahead: Linear Referencing and UPDM Esri European Petroleum GIS Conference November 7, 2014 Congress Centre, London Your Work Making a Difference ArcGIS Is Evolving Your GIS Is Becoming Part of an

More information

Spatial Database. Ahmad Alhilal, Dimitris Tsaras

Spatial Database. Ahmad Alhilal, Dimitris Tsaras Spatial Database Ahmad Alhilal, Dimitris Tsaras Content What is Spatial DB Modeling Spatial DB Spatial Queries The R-tree Range Query NN Query Aggregation Query RNN Query NN Queries with Validity Information

More information

Bentley Map V8i (SELECTseries 3)

Bentley Map V8i (SELECTseries 3) Bentley Map V8i (SELECTseries 3) A quick overview Why Bentley Map Viewing and editing of geospatial data from file based GIS formats, spatial databases and raster Assembling geospatial/non-geospatial data

More information

Geometric Algorithms in GIS

Geometric Algorithms in GIS Geometric Algorithms in GIS GIS Software Dr. M. Gavrilova GIS System What is a GIS system? A system containing spatially referenced data that can be analyzed and converted to new information for a specific

More information

GIS for the Beginner on a Budget

GIS for the Beginner on a Budget GIS for the Beginner on a Budget Andre C. Bally, RLA, GIS Coordinator, Harris County Public Infrastructure Department Engineering Division This presentation, GIS for Beginners on a Budget. will briefly

More information

What are the five components of a GIS? A typically GIS consists of five elements: - Hardware, Software, Data, People and Procedures (Work Flows)

What are the five components of a GIS? A typically GIS consists of five elements: - Hardware, Software, Data, People and Procedures (Work Flows) LECTURE 1 - INTRODUCTION TO GIS Section I - GIS versus GPS What is a geographic information system (GIS)? GIS can be defined as a computerized application that combines an interactive map with a database

More information

SPATIAL INFORMATION GRID AND ITS APPLICATION IN GEOLOGICAL SURVEY

SPATIAL INFORMATION GRID AND ITS APPLICATION IN GEOLOGICAL SURVEY SPATIAL INFORMATION GRID AND ITS APPLICATION IN GEOLOGICAL SURVEY K. T. He a, b, Y. Tang a, W. X. Yu a a School of Electronic Science and Engineering, National University of Defense Technology, Changsha,

More information

GEOGRAPHY 350/550 Final Exam Fall 2005 NAME:

GEOGRAPHY 350/550 Final Exam Fall 2005 NAME: 1) A GIS data model using an array of cells to store spatial data is termed: a) Topology b) Vector c) Object d) Raster 2) Metadata a) Usually includes map projection, scale, data types and origin, resolution

More information

ArcGIS Platform For NSOs

ArcGIS Platform For NSOs ArcGIS Platform For NSOs Applying GIS and Spatial Thinking to Official Statistics Esri UC 2014 Demo Theater Applying GIS at the NSO Generic Statistical Business Process Model (GSBPM) 1 Specify Needs 2

More information

In this exercise we will learn how to use the analysis tools in ArcGIS with vector and raster data to further examine potential building sites.

In this exercise we will learn how to use the analysis tools in ArcGIS with vector and raster data to further examine potential building sites. GIS Level 2 In the Introduction to GIS workshop we filtered data and visually examined it to determine where to potentially build a new mixed use facility. In order to get a low interest loan, the building

More information

CONTEMPORARY ANALYTICAL ECOSYSTEM PATRICK HALL, SAS INSTITUTE

CONTEMPORARY ANALYTICAL ECOSYSTEM PATRICK HALL, SAS INSTITUTE CONTEMPORARY ANALYTICAL ECOSYSTEM PATRICK HALL, SAS INSTITUTE Copyright 2013, SAS Institute Inc. All rights reserved. Agenda (Optional) History Lesson 2015 Buzzwords Machine Learning for X Citizen Data

More information

Distributed Architectures

Distributed Architectures Distributed Architectures Software Architecture VO/KU (707023/707024) Roman Kern KTI, TU Graz 2015-01-21 Roman Kern (KTI, TU Graz) Distributed Architectures 2015-01-21 1 / 64 Outline 1 Introduction 2 Independent

More information

Weather Research and Forecasting (WRF) Performance Benchmark and Profiling. July 2012

Weather Research and Forecasting (WRF) Performance Benchmark and Profiling. July 2012 Weather Research and Forecasting (WRF) Performance Benchmark and Profiling July 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Intel, Dell,

More information

MapOSMatic: city maps for the masses

MapOSMatic: city maps for the masses MapOSMatic: city maps for the masses Thomas Petazzoni Libre Software Meeting July 9th, 2010 Outline 1 The story 2 MapOSMatic 3 Behind the web page 4 Pain points 5 Future work 6 Conclusion Thomas Petazzoni

More information

These modules are covered with a brief information and practical in ArcGIS Software and open source software also like QGIS, ILWIS.

These modules are covered with a brief information and practical in ArcGIS Software and open source software also like QGIS, ILWIS. Online GIS Training and training modules covered are: 1. ArcGIS, Analysis, Fundamentals and Implementation 2. ArcGIS Web Data Sharing 3. ArcGIS for Desktop 4. ArcGIS for Server These modules are covered

More information

Spatial Data, Spatial Analysis and Spatial Data Science

Spatial Data, Spatial Analysis and Spatial Data Science Spatial Data, Spatial Analysis and Spatial Data Science Luc Anselin http://spatial.uchicago.edu 1 spatial thinking in the social sciences spatial analysis spatial data science spatial data types and research

More information

A Spatial Data Infrastructure for Landslides and Floods in Italy

A Spatial Data Infrastructure for Landslides and Floods in Italy V Convegno Nazionale del Gruppo GIT Grottaminarda 14 16 giugno 2010 A Spatial Data Infrastructure for Landslides and Floods in Italy Ivan Marchesini, Vinicio Balducci, Gabriele Tonelli, Mauro Rossi, Fausto

More information

Spatial Information Retrieval

Spatial Information Retrieval Spatial Information Retrieval Wenwen LI 1, 2, Phil Yang 1, Bin Zhou 1, 3 [1] Joint Center for Intelligent Spatial Computing, and Earth System & GeoInformation Sciences College of Science, George Mason

More information

Geog 469 GIS Workshop. Managing Enterprise GIS Geodatabases

Geog 469 GIS Workshop. Managing Enterprise GIS Geodatabases Geog 469 GIS Workshop Managing Enterprise GIS Geodatabases Outline 1. Why is a geodatabase important for GIS? 2. What is the architecture of a geodatabase? 3. How can we compare and contrast three types

More information

Geographic Systems and Analysis

Geographic Systems and Analysis Geographic Systems and Analysis New York University Robert F. Wagner Graduate School of Public Service Instructor Stephanie Rosoff Contact: stephanie.rosoff@nyu.edu Office hours: Mondays by appointment

More information

A Technique for Importing Shapefile to Mobile Device in a Distributed System Environment.

A Technique for Importing Shapefile to Mobile Device in a Distributed System Environment. A Technique for Importing Shapefile to Mobile Device in a Distributed System Environment. 1 Manish Srivastava, 2 Atul Verma, 3 Kanika Gupta 1 Academy of Business Engineering and Sciences,Ghaziabad, 201001,India

More information

Introduction to Google Mapping Tools

Introduction to Google Mapping Tools Introduction to Google Mapping Tools Google s Mapping Tools Explore geographic data. Organize your own geographic data. Visualize complex data. Share your data with the world. Tell your story and educate

More information

Data Aggregation with InfraWorks and ArcGIS for Visualization, Analysis, and Planning

Data Aggregation with InfraWorks and ArcGIS for Visualization, Analysis, and Planning Data Aggregation with InfraWorks and ArcGIS for Visualization, Analysis, and Planning Stephen Brockwell President, Brockwell IT Consulting, Inc. Join the conversation #AU2017 KEYWORD Class Summary Silos

More information

Leveraging Web GIS: An Introduction to the ArcGIS portal

Leveraging Web GIS: An Introduction to the ArcGIS portal Leveraging Web GIS: An Introduction to the ArcGIS portal Derek Law Product Management DLaw@esri.com Agenda Web GIS pattern Product overview Installation and deployment Configuration options Security options

More information

TRAITS to put you on the map

TRAITS to put you on the map TRAITS to put you on the map Know what s where See the big picture Connect the dots Get it right Use where to say WOW Look around Spread the word Make it yours Finding your way Location is associated with

More information

Geometric Algorithms in GIS

Geometric Algorithms in GIS Geometric Algorithms in GIS GIS Visualization Software Dr. M. Gavrilova GIS Software for Visualization ArcView GEO/SQL Digital Atmosphere AutoDesk Visual_Data GeoMedia GeoExpress CAVE? Visualization in

More information

APPLICATION OF GIS IN ELECTRICAL DISTRIBUTION NETWORK SYSTEM

APPLICATION OF GIS IN ELECTRICAL DISTRIBUTION NETWORK SYSTEM See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/305263658 APPLICATION OF GIS IN ELECTRICAL DISTRIBUTION NETWORK SYSTEM Article October 2015

More information

GIS = Geographic Information Systems;

GIS = Geographic Information Systems; What is GIS GIS = Geographic Information Systems; What Information are we talking about? Information about anything that has a place (e.g. locations of features, address of people) on Earth s surface,

More information

Overview. Everywhere. Over everything.

Overview. Everywhere. Over everything. Cadenza Desktop Cadenza Web Cadenza Mobile Cadenza Overview. Everywhere. Over everything. The ultimate GIS and reporting suite. Provide, analyze and report data efficiently. For desktop, web and mobile.

More information

One platform for desktop, web and mobile

One platform for desktop, web and mobile One platform for desktop, web and mobile Search and filter Get access to all data thematically filter data in context factually and spatially as well as display it dynamically. Export a selection or send

More information

ENV208/ENV508 Applied GIS. Week 1: What is GIS?

ENV208/ENV508 Applied GIS. Week 1: What is GIS? ENV208/ENV508 Applied GIS Week 1: What is GIS? 1 WHAT IS GIS? A GIS integrates hardware, software, and data for capturing, managing, analyzing, and displaying all forms of geographically referenced information.

More information

METHOD OF WCS CLIENT BASED ON PYRAMID MODEL

METHOD OF WCS CLIENT BASED ON PYRAMID MODEL METHOD OF WCS CLIENT BASED ON PYRAMID MODEL Shen Shengyu, Wu Huayi* State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, 430079 Wuhan, China - shshy.whu@gmail.com,

More information

Spatial Analysis using Vector GIS THE GOAL: PREPARATION:

Spatial Analysis using Vector GIS THE GOAL: PREPARATION: PLAN 512 GIS FOR PLANNERS Department of Urban and Environmental Planning University of Virginia Fall 2006 Prof. David L. Phillips Spatial Analysis using Vector GIS THE GOAL: This tutorial explores some

More information

Analysis of Change in Land Use around Future Core Transit Corridors: Austin, TX, Eric Porter May 3, 2012

Analysis of Change in Land Use around Future Core Transit Corridors: Austin, TX, Eric Porter May 3, 2012 Analysis of Change in Land Use around Future Core Transit Corridors: Austin, TX, 1990-2006 PROBLEM DEFINITION Eric Porter May 3, 2012 This study examines the change in land use from 1990 to 2006 in the

More information

Why GIS & Why Internet GIS?

Why GIS & Why Internet GIS? Why GIS & Why Internet GIS? The Internet bandwagon Internet mapping (e.g., MapQuest) Location-based services Real-time navigation (e.g., traffic) Real-time service dispatch Business Intelligence Spatial

More information

Globally Estimating the Population Characteristics of Small Geographic Areas. Tom Fitzwater

Globally Estimating the Population Characteristics of Small Geographic Areas. Tom Fitzwater Globally Estimating the Population Characteristics of Small Geographic Areas Tom Fitzwater U.S. Census Bureau Population Division What we know 2 Where do people live? Difficult to measure and quantify.

More information

Portal for ArcGIS: An Introduction. Catherine Hynes and Derek Law

Portal for ArcGIS: An Introduction. Catherine Hynes and Derek Law Portal for ArcGIS: An Introduction Catherine Hynes and Derek Law Agenda Web GIS pattern Product overview Installation and deployment Configuration options Security options and groups Portal for ArcGIS

More information

Esri UC2013. Technical Workshop.

Esri UC2013. Technical Workshop. Esri International User Conference San Diego, California Technical Workshops July 9, 2013 CAD: Introduction to using CAD Data in ArcGIS Jeff Reinhart & Phil Sanchez Agenda Overview of ArcGIS CAD Support

More information

Large-Scale Behavioral Targeting

Large-Scale Behavioral Targeting Large-Scale Behavioral Targeting Ye Chen, Dmitry Pavlov, John Canny ebay, Yandex, UC Berkeley (This work was conducted at Yahoo! Labs.) June 30, 2009 Chen et al. (KDD 09) Large-Scale Behavioral Targeting

More information

Road & Railway Network Density Dataset at 1 km over the Belt and Road and Surround Region

Road & Railway Network Density Dataset at 1 km over the Belt and Road and Surround Region Journal of Global Change Data & Discovery. 2017, 1(4): 402-407 DOI:10.3974/geodp.2017.04.03 www.geodoi.ac.cn 2017 GCdataPR Global Change Research Data Publishing & Repository Road & Railway Network Density

More information

Joint MISTRAL/CESI lunch workshop 15 th November 2017

Joint MISTRAL/CESI lunch workshop 15 th November 2017 MISTRAL@Newcastle Joint MISTRAL/CESI lunch workshop 15 th November 2017 ITRC at Newcastle ITRC at Newcastle MISTRAL at Newcastle New approach to infrastructure data management to open-up analytics, modelling

More information

Impression Store: Compressive Sensing-based Storage for. Big Data Analytics

Impression Store: Compressive Sensing-based Storage for. Big Data Analytics Impression Store: Compressive Sensing-based Storage for Big Data Analytics Jiaxing Zhang, Ying Yan, Liang Jeff Chen, Minjie Wang, Thomas Moscibroda & Zheng Zhang Microsoft Research The Curse of O(N) in

More information

Behavioral Simulations in MapReduce

Behavioral Simulations in MapReduce Behavioral Simulations in MapReduce Guozhang Wang, Marcos Vaz Salles, Benjamin Sowell, Xun Wang, Tuan Cao, Alan Demers, Johannes Gehrke, Walker White Cornell University 1 What are Behavioral Simulations?

More information

What s New. August 2013

What s New. August 2013 What s New. August 2013 Tom Schwartzman Esri tschwartzman@esri.com Esri UC2013. Technical Workshop. What is new in ArcGIS 10.2 for Server ArcGIS 10.2 for Desktop Major Themes Why should I use ArcGIS 10.2

More information

a system for input, storage, manipulation, and output of geographic information. GIS combines software with hardware,

a system for input, storage, manipulation, and output of geographic information. GIS combines software with hardware, Introduction to GIS Dr. Pranjit Kr. Sarma Assistant Professor Department of Geography Mangaldi College Mobile: +91 94357 04398 What is a GIS a system for input, storage, manipulation, and output of geographic

More information

How to Optimally Allocate Resources for Coded Distributed Computing?

How to Optimally Allocate Resources for Coded Distributed Computing? 1 How to Optimally Allocate Resources for Coded Distributed Computing? Qian Yu, Songze Li, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr Department of Electrical Engineering, University of Southern

More information

Welcome to GST 101: Introduction to Geospatial Technology. This course will introduce you to Geographic Information Systems (GIS), cartography,

Welcome to GST 101: Introduction to Geospatial Technology. This course will introduce you to Geographic Information Systems (GIS), cartography, Welcome to GST 101: Introduction to Geospatial Technology. This course will introduce you to Geographic Information Systems (GIS), cartography, remote sensing, and spatial analysis through a series of

More information