Spatial Database. Ahmad Alhilal, Dimitris Tsaras

Spatial Database Ahmad Alhilal, Dimitris Tsaras

Content What is Spatial DB Modeling Spatial DB Spatial Queries The R-tree Range Query NN Query Aggregation Query RNN Query NN Queries with Validity Information 2

What is spatial database Spatial Relationships Spatial Data Types 1 Spatial Index 5. spatial Join aggregate NN RNN Road Network Satellite Image

SDBMS Definition A spatial database system: Is a database system A DBMS with additional capabilities for handling spatial data Offers spatial data types (SDTs) in its data model and query language Structure in space: e.g., POINT, LIN, RGION Relationships among them: (l intersects r) Supports SDT in its implementation providing at least spatial indexing (retrieving objects in particular area without scanning the whole space) efficient algorithms for spatial joins (not simply filtering the cartesian product)

Modeling: spatial primitives for objects Point: object represented only by its location in space, e.g. center of a state Line (actually a curve or polyline): representation of moving through or connections in space, e.g. road, river Region: representation of an extent in 2d-space, e.g. lake, city

Modeling: Coverages Partition: set of region objects that are required to be disjoint (adjacency or region objects with common boundaries), e.g. thematic maps Networks: embedded graph in plane consisting of set of points (vertices) and lines (edges) objects, e.g. highways, power supply lines, rivers

Modeling: a sample spatial type system (1) XT={lines, regions}, GO={points, lines, regions} Spatial predicates for topological relationships: inside: geo x regions bool intersect, meets: ext1 x ext2 bool adjacent, encloses: regions x regions bool Operations returning atomic spatial data types: intersection: lines x lines points intersection: regions x regions regions plus, minus: geo x geo geo contour: regions lines

Modeling: a sample spatial type system (2) Spatial operators returning numbers dist: geo1 x geo2 real perimeter, area: regions real Spatial operations on set of objects sum: set(obj) x (objgeo) geo A spatial aggregate function, geometric union of all attribute values, e.g. union of set of provinces determine the area of the country closest: set(obj) x (objgeo1) x geo2 set(obj) Determines within a set of objects those whose spatial attribute value has minimal distance from geometric query object Other complex operations: overlay, buffering,

Modeling: spatial relationships Topological relationships: e.g. adjacent, inside, disjoint. Direction relationships: e.g. above, below, or north_of, sothwest_of, Metric relationships: e.g. distance valid topological relationships between two simple regions : disjoint, in, touch, equal, cover, overlap

Modeling: SDBMS data model DBMS data model must be extended by SDTs at the level of atomic data types (such as integer, string), or better be open for user-defined types (OR-DBMS approach): relation states (sname: STRING; area: RGION; spop: INTGR) relation cities (cname: STRING; center: POINT; ext: RGION;cpop: INTGR); relation rivers (rname: STRING; route: LIN)

What is spatial database Spatial Relationships Spatial Data Types 1 Spatial Index 5. spatial Join aggregate NN RNN Road Network Satellite Image

Spatial Queries Range query (spatial selection, window query) e.g. find all cities that intersect window W Answer set: {c1, c2} Nearest neighbor query e.g. find the city closest to the forest F Answer: c2 Aggregate query (Count, Sum, AVG, MIN..) Spatial join e.g. find all pairs of cities and rivers that intersect Answer set: {(r1,c1), (r2,c2), (r2,c5)}

Spatial Queries: spatial selection Spatial selection: returning those objects satisfying a spatial predicate with the query object All cities in Ontario SLCT sname FROM cities c WHR c.center inside Ontario.area All rivers intersecting a query window SLCT * FROM rivers r WHR r.route intersects Window All big cities no more than 100 Kms from Toronto SLCT cname FROM cities c WHR dist(c.center, Toronto.center) < 100 and c.pop > 500k (conjunction with other predicates and query optimization)

Spatial Queries: spatial join Spatial join: A join which compares any two joined objects based on a predicate on their spatial attribute values. For each river pass through Ontario, find all cities within less than 50 Kms. SLCT r.rname, c.cname, length(intersection(r.route, c.area)) FROM rivers r, cities c WHR r.route intersects ontario.area and dist(r.route,c.area) < 50

Spatial Indexing: Operations Spatial data structures either store points or rectangles (for line or region values) Operations on those structures: insert, delete, member Query types for points: Range query: all points within a query rectangle Nearest neighbor: point closest to a query point Distance scan: enumerate points in increasing distance from a query point. Query types for rectangles Intersection query Containment query

R-Tree Motivation 10 8 6 4 2 y axis m g h l e f k i j d b a c x axis 0 2 4 6 8 10 Range query: find the objects in a given range..g. find all hotels in Boston. No index: scan through all objects. NOT FFICINT! 16

R-Tree: Clustering by Proximity 10 8 6 4 2 y axis m g h l e f d b c 3 a x axis 0 2 4 6 8 10 Root 1 2 k i j Minimum Bounding Rectangle (MBR) 1 3 4 5 6 7 2 a b c d e f g h 3 4 5 6 i j k 7 l m 17

R-Tree 10 8 6 4 2 y axis m g h 6 e f 4 5 d b c 3 a i j 7 l k x axis 0 2 4 6 8 10 Root 1 2 1 3 4 5 6 7 2 a b c d e f g h i j k l m 3 4 5 6 7 18

R-Tree 10 8 6 4 2 y axis m g h l k e f 2 i j d 1 b a c x axis 0 2 4 6 8 10 Root 1 2 1 3 4 5 6 7 2 a b c d e f g h i j k l m 3 4 5 6 7 19

R-Tree: Operation m node Utilization M Insertion : Overflow : split Deletion Underflow: re-insert Bulk load Sort objects (e.g., x-coordinates, Hilbert value) Insert sorted objects in nodes

Spatial Queries: spatial join Index-based Join The R-Tree Join (RJ) Cities Level 1 qualifying pairs: {(A1, B1), (A2, B2)} Level 0 qualifying pairs: {(a1, b1), (a2, b2)}

Range Query 10 8 6 4 2 y axis m g h l e f k 2 i j d 1 b a c x axis 0 2 4 6 8 10 Root 1 2 1 3 4 5 6 7 2 a b c d e f g h i j k l m 3 4 5 6 7 22

Range Query 10 8 6 4 2 y axis m g h l e f k 2 i j d 1 b a c x axis 0 2 4 6 8 10 Root 1 2 1 3 4 5 6 7 2 a b c d e f g h i j k l m 3 4 5 6 7 23

Content What is Spatial DB Modeling Spatial DB Spatial Queries The R-tree Range Query NN Query Aggregation Query RNN Query NN Queries with Validity Information 24

Nearest Neighbor (NN) Query Given a query location q, find the nearest object. a q.g.: given a hotel, find its nearest bar. 25

Metrics: MINDIST, MINMAXDIST MINDIST: Minimum distance between q and an MBR. 1 MINDIST q MINMAXDIST It is an lower bound of d(o, q) for every object o in 1. MINMAXDIST: compute max dist between q and each edge of 1, then take min. It is an upper bound of d(o, q) for every object o in 1. 26

Refinement Phase (Pruning) It prune the qualifying results based on three rules: Downward pruning: An MBR R is discarded if there exists another R such that MINDIST(q,R) > MINMAXDIST(q,R ) Downward pruning: An object O is discarded if there exists an R such that the Actual-Dist(q,O) > MINMAXDIST(q,R) Upward pruning: An MBR R is discarded if an object O is found such that the MINDIST(q,R) > ActualDist(q,O)

Pruning 1 in NN Query If we see an object o, prune every MBR whose MINDIST > d(o, q). 3 1 4 q 2 o Side notice: at most one object in H! 28

Pruning 2 using MINMAXDIST Prune even before we see an object! Prune 1 if exists 2 s.t. MINDIST(q, 1 ) > MINMAXDIST(q, 2 ). 1 q 2 Object o in sub-tree of 2 s.t. d(o, q) MINMAXDIST(q, 2 ) MINMAXDIST: compute max dist between q and each edge of 2, then take min. 29

Depth-First (DF) NN Algorithm

Best-First NN Algorithm Keep a heap H of index entries and objects, ordered by MINDIST. Initially, H contains the root. While H xtract the element with minimum MINDIST If it is an index entry, insert its children into H. If it is an object, return it as NN. nd while 1 q 35

Best-First NN Algorithm y axis 10 8 6 4 e b f g h 4 5 1 d 3 a m 2 6 i query j k 7 l Action Visit Root Heap 1 1 2 2 2 c x axis 0 2 4 6 8 10 Root 1 2 1 2 3 4 5 1 6 7 2 5 9 5 2 13 a b c d e f g i j h k l m 2 10 13 3 4 5 6 7 36

Best-First NN Algorithm y axis 10 8 6 4 e b f g h 4 5 1 d 3 a m 2 6 i query j k 7 l Action Visit Root follow 1 Heap 1 1 2 2 2 3 2 5 5 5 4 9 2 c x axis 0 2 4 6 8 10 Root 1 2 1 2 3 4 5 1 6 7 2 5 9 5 2 13 a b c d e f g i j h k l m 2 10 13 3 4 5 6 7 37

Best-First NN Algorithm y axis 10 8 6 4 e b f g h 4 5 1 d 3 a m 2 6 i query j k 7 l Action Visit Root follow 1 follow 2 Heap 1 1 2 2 6 2 2 3 3 2 5 5 5 5 5 5 4 4 9 9 7 13 2 c x axis 0 2 4 6 8 10 Root 1 2 1 2 3 4 5 1 6 7 2 5 9 5 2 13 a b c d e f g i j h k l m 2 10 13 3 4 5 6 7 38

Best-First NN Algorithm y axis 10 8 6 4 e b f d g 3 h 4 5 1 a 2 6 i query m j k 7 l Action Visit Root follow 1 follow 2 follow 6 Heap 1 1 2 2 2 2 3 5 6 2 3 5 i 2 3 5 5 5 5 5 5 5 4 4 4 9 9 9 7 j 13 10 7 13 k 13 2 c x axis 0 2 4 6 8 10 Root 1 2 1 2 3 4 5 1 6 7 2 5 9 5 2 13 a b c d e f g i j h k l m 2 10 13 3 4 5 6 7 39

Best-First NN Algorithm y axis 10 8 6 4 e b f d g 3 h 4 5 1 a 2 6 i query m j k 7 l Action Visit Root follow 1 follow 2 follow 6 Heap 1 1 2 2 2 2 3 5 6 2 3 5 i 2 3 5 5 5 5 5 5 5 4 4 4 9 9 9 7 j 13 10 7 13 k 13 2 c Report i and terminate x axis 0 2 4 6 8 10 Root 1 2 1 2 3 4 5 1 6 7 2 5 9 5 2 13 a b c d e f g i j h k l m 2 10 13 3 4 5 6 7 40

Discussion NN Queries Both DF and BF can be easily adapted to I. extended (instead of point) objects II. retrieval of k (>1) NN. BF is incremental; i.e., it can report the NN in increasing order of distance without a given value of k.

Aggregation Query Given a range, find some aggregate value of objects in this range. Distributive: COUNT, SUM, MIN, MAX.g. find the total number of hotels in HK. Algebraic: AVG Holistic: Median Straightforward approach: reduce to a range query. Better approach: along with each index entry, store aggregate of the sub-tree 42

Aggregation Query: COUNT 10 8 6 4 2 y axis e b c f d g h 1 a i x axis 0 2 4 6 8 10 Root :8 1 :5 2 m j k l 2 The Aggregate R-Tree ar-tree 1 :3 3 :2 4 :3 5 :3 6 :2 7 2 a b c d e f g h i j k l m 3 4 5 6 7 43

Aggregation Query: COUNT Subtree pruned! 10 8 6 4 2 y axis e b c f d g h 1 a i x axis 0 2 4 6 8 10 Root :8 1 :5 2 m j k l 2 1 :3 3 :2 4 :3 5 :3 6 :2 7 2 a b c d e f g h i j k l m 3 4 5 6 7 44

Aggregation Query: COUNT 10 8 6 4 2 y axis e b c f d g h 1 a i x axis 0 2 4 6 8 10 Root :8 1 :5 2 m j k l 2 Partially Intersects q Window 1 :3 3 :2 4 :3 5 :3 6 :2 7 2 a b c d e f g h i j k l m 3 4 5 6 7 45

Minimum Aggregate Distance Given a set P of data points and a set Q of query points, an aggregate NN query returns the data point p with the minimum aggregate distance. The aggregate distance between a data point p and Q={q 1,.., q n } is defined as adist(p,q) = f( pq 1,, pq n ), where pq i is the uclidean distance of p and q i. We use f=sum in the examples. p 1 p 6 f=sum p 1 f= max p 6 p 2 p 3 p 4 p 5 q 4 p 2 p 3 p 4 p 5 q 4 8 q 1 5 q 2 2 13 p 10 p 11 q 1 q 2 9 6 p 10 p 11 p 8 p 9 4 p 12 p 8 p 9 6 p 12 p 7 q 3 p 7 q 3 46

Multiple Query Method (MQM) Idea based on the threshold algorithm for top-k queries: Perform incremental NN queries for each point in Q and combine their results. <p 10, 7>, <p 11, 6>, Threshold T=5 (2+3) very non-yet visited point must have distance T=5 Since the best candidate found p 11 has distance > T we must continue search p 1 p 2 p 3 p 4 N 4 N 3 N 1 1 st NN of q 1 p 5 p 6 q 2 q 1 5 2 3 3 p11 p 10 N 6 1 st NN of q 2 p 9 p N 8 2 N 5 p 7 p 12 47

Multiple Query Method (MQM) Idea based on the threshold algorithm for top-k queries: Perform incremental NN queries for each point in Q and combine their results. We find p 11 again, this time as the second NN of q 1 The value of the threshold is updated to T=6 We already have a candidate with distance equal to T=6 Search terminates p 1 p 2 p 3 p 4 N 4 N 3 N 1 1 st NN of q 1 p 5 p 6 q 1 q 2 3 p11 p 10 N 6 3 2 nd NN of q 1 p 9 p N 8 2 N 5 p 7 p 12 48

Minimum Bounding Method (MBM) MQM applies multiple NN queries and may visit the same node and data point multiple times MBM Applies the MBR of Q to prune the search space with a single (DF or BF) traversal, using heuristics to prune the search space Heuristic 1: Let M be the MBR of Q, and best_dist be the distance of the best points found so far. A node N cannot contain qualifying points, if: best_dist mindist( N, M ) n Heuristic 2: A node N cannot contain qualifying points, if: q Q i mindist( N, q ) q 1 i best_dist mindist(n 2,q 1 )=5 mindist(n 1,M)=3 M 3 2 N 1 best_nn best_distance=5 q 2 N 2 mindist(n 2,M)= mindist(n 2,q 2 )=1 49

Content What is Spatial DB Modeling Spatial DB Spatial Queries The R-tree Range Query NN Queries Aggregation Query RNN Query NN Queries with Validity Information 50

Reverse NN Queries Monochromatic: given a multi-dimensional dataset P and a point q, find all the points p P that have q as their nearest neighbor Bichromatic: given a set Q of queries and a query point q, find the objects p P that are closer to q than any other point of Q.g. I want to open a hotel such that it is the NN for the most tourist sights p 1 q p 3 p 4 RNN(q)=p 1, p 2 p 2 NN(q)= p 3 51

KM Algorithm (for static datasets) Find the NN of every data point p- let the vicinity circle (p, dist(p,nn(p))) centered at p with radius equal to the uclidean distance between p and its NN. Index the MBRs of all circles with an R-tree, called the RNN-tree The reverse nearest neighbors of q are retrieved by a point location query on the RNN-tree, which returns all circles that contain q. p 3 p 4 p 1 p 2 p 3 q p 4 p 1 p 2 52

SAA Algorithm (supports updates) Divide the space around the query q into six equal regions S 1 to S 6. Find the NN p i of q in each region S i Find the NN of each p i if distance(p i, NN(p i )) < distance(p i, q) there is no RNN of q in S i otherwise, the only RNN of q in S i is p i. S 1 S 2 S 3 p 2 p 60 o 1 p 3 p 6 p 4 60 o q p 5 S 4 S 1 S 2 S 3 p 2 p 60 o 1 p 3 p 6 p 4 60 o q p 5 S 4 S 6 S 5 p 7 S 6 S 5 p 7 53

TPL Algorithm (supports updates, >2 dimensions, k-rnn) Filter-refinement approach Find the set S cnd of candidate points Find neighbors of the query point q incrementally very new neighbor prunes the search space Continue until the entire space is pruned Keep all the pruned points and nodes in a set S rfn Refinement step: eliminate false positives from S cnd First NN N 1 p' p1 N 2 p1 ^(p, q) perpendicular bisector q ^( p 1, q ) q ^( p 2, q) p2 Second NN 54

xample of TPL N 11 N 6 p 6 p 8 N 4 p 4 contents omitted N 12 data R-tree N 10 N 11 N 12 N 5 p 2 N 1 N 2 N 3 N 4 N 5 N 6 N 2 p 3 N 3 p 1 N 10 q p 5 p 7 p 1 p 3 p 2 p 6 p 4 p 8... p 7 N 1 contents omitted p 5 55

Filter Step Visit Root N 11 N 6 p 6 p 8 N 4 p 4 contents omitted N 12 data R-tree N 10 N 11 N 12 N 5 p 2 N 1 N 2 N 3 N 4 N 5 N 6 N 2 p 3 N 3 p 1 N 10 q p 5 p 7 p 1 p 3 p 2 p 6 p 4 p 8... p 7 N 1 contents omitted p 5 Action Heap S cnd S rfn visit } ro {N 10 11,N 12 56

N 11 N 6 p 6 p 8 N 4 p 4 Filter Step Visit N 10 contents omitted N 12 data R-tree N 10 N 11 N 12 N 5 p 2 N 1 N 2 N 3 N 4 N 5 N 6 N 2 p 3 N 3 p 1 N 10 q p 5 p 7 p 1 p 3 p 2 p 6 p 4 p 8... p 7 N 1 contents omitted p 5 Action Heap S cnd S rfn visit N 10 {N 3 11,N 2 1 12 } 57

N 11 N 6 p 6 p 8 N 4 p 4 Filter Step Visit N 3 contents omitted N 12 data R-tree N 10 N 11 N 12 N 5 p 2 N 1 N 2 N 3 N 4 N 5 N 6 N 2 p 3 N 3 p 1 N 10 q p 5 p 7 p 1 p 3 p 2 p 6 p 4 p 8... p 7 N 1 contents omitted p 5 Action Heap S cnd S rfn visit N 3 {N 11,N 2 1 12 }{p 1 }{p 3 } 58

Filter Step Visit N 11 N 11 N 6 p 6 N 4 contents omitted N 12 data R-tree N 10 N 11 N 12 N 5 p 2 N 1 N 2 N 3 N 4 N 5 N 6 N 2 p 3 N 3 p 1 N 10 q p 5 p 7 p 1 p 3 p 2 p 6 p 4 p 8... p 7 N 1 contents omitted p 5 Action Heap S cnd S rfn visit N 11 {N 5,N 2,N 1,N 12 } {p 1 } { p 3,N 4,N 6 } 59

Filter Step Visit N 5 N 11 N 6 p 6 N 4 contents omitted N 12 data R-tree N 10 N 11 N 12 N 5 p 2 N 1 N 2 N 3 N 4 N 5 N 6 N 2 p 3 N 3 p 1 N 10 q p 5 p 7 p 1 p 3 p 2 p 6 p 4 p 8... p 7 N 1 contents omitted p 5 Action Heap S cnd S rfn v isit N 5 { N 2, N 1, N 12 } { p 1, p 2 } { p 3, N 4, N 6,p 6 } 60

Filter Step Visit N 1 N 11 N 6 p 6 N 4 contents omitted N 12 data R-tree N 10 N 11 N 12 N 5 p 2 N 1 N 2 N 3 N 4 N 5 N 6 N 2 p 3 N 3 p 1 N 10 q p 5 p 7 p 1 p 3 p 2 p 6 p 4 p 8... p 7 N 1 contents omitted p 5 Action Heap S cnd S rfn v isit N 1 { N 12 } { p 1, p 2, p 5 } { p 3, N 4, N 6,p 6, N 2,p 7 } 61

xample (end of filter step) N 11 N 6 p 6 N 4 contents omitted N 12 data R-tree N 10 N 11 N 12 N 5 p 2 N 1 N 2 N 3 N 4 N 5 N 6 N 2 p 3 N 3 p 1 N 10 q p 5 p 7 p 1 p 3 p 2 p 6 p 4 p 8... p 7 N 1 contents omitted p 5 Action Heap S cnd S rfn { p 1, p 2, p 5 } { p 3, N 4, N 6, p 6, N 2, p 7, N 12 } 62

Refinement Step Nodes outside the cicrles centered at the candidates can be discarded p 3 invalidates candidate p 1 In order to verify candidate p 2 we need to visit N 4 p 5 is an actual RNN N 2 p 3 N 3 N 11 p 6 N 5 p 1 N 6 N 4 q p 2 contents omitted N 12 Other Work: Bichromatic RNN, Stanoi et al., VLDB 2001 Road Networks, Yiu et al., TKD 2006 Metric Spaces, Tao et al., TKD 2006 Continuous RNN., Benetis et al., VLDBJ 2006 p 7 N 1 N 10 p 5 63

Chapter 5. NN Queries with Validity Information The result of conventional NN queries may be invalidated as soon as the query or some data object(s) move Goal: To return some additional information that determines the validity of the current result The validity region is such that the NN set remains the same as long as q remains in the region. The validity region of a NN query q is the Voronoi Cell of the NN o. 64

Validity Time Settings: Static Objects, Moving Query. Client/server architecture A central server indexes the Voronoi diagram of the objects Given a query, the server returns the current NN and the validity time of the result. The current NN is the object whose Voronoi cell covers the query point Restrictions: (i) assumes a maximum speed (ii) applicable only to single NN (iii) and static objects. o q a Voronoi cell of point o 65