Nearest Neighbor Search with Keywords


Yufei Tao, KAIST. June 3, 2013.

In recent years, many search engines have started to support queries that combine keyword search with geography-related predicates (e.g., Google Maps). Such queries are often referred to as spatial keyword search. In this lecture, we will discuss one such query type, which finds use in many practical applications.

Problem (Nearest Neighbor Search with Keywords)
Let P be a set of points in ℕ². Each point p ∈ P is associated with a set W_p of terms. Given a point q ∈ ℕ², an integer k, a real value r, and a set W_q of terms, a k nearest neighbor with keywords (kNNwK) query returns the k points in P_q(r) with the smallest Euclidean distances to q, where

P_q(r) = {p ∈ P | W_q ⊆ W_p and dist(p, q) ≤ r}

and dist(p, q) is the Euclidean distance between p and q.

Example: Suppose that P consists of the black points p1, …, p8 in [0, 7] × [0, 7], and q is the white point (figure omitted). The term sets are W_p1 = {a, b}, W_p2 = {b, d}, W_p3 = {d}, W_p4 = {a, e}, W_p5 = {c, e}, W_p6 = {c, d, e}, W_p7 = {b, e}, W_p8 = {c, d}. Given q as shown, k = 1, r = 5, and W_q = {c, d}, a kNNwK query returns p6. The same query with k = 2 returns p6 and p8, the only two points whose term sets contain both c and d.

Think: What applications can you think of for this problem?

As a naive solution, we can first retrieve the set P_q of points p ∈ P such that W_q ⊆ W_p (think: how would you do so with an inverted index?). Then, we obtain P_q(r) from P_q by discarding every point whose distance to q exceeds r, and finally obtain the query result by sorting the points of P_q(r) in ascending order of their distances to q. In practice, the values of k and r are small, which makes it possible to do better than this solution.
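A minimal Python sketch of this naive solution (all names are illustrative, not from the lecture; the inverted index is taken to be a plain dict from term to point ids, and W_q is assumed non-empty):

```python
import math

def naive_knnwk(points, inv_index, q, k, r, Wq):
    """points: id -> (x, y); inv_index: term -> set of ids;
    q: query point (x, y); Wq: non-empty set of query terms."""
    # P_q: ids whose term set contains every query term, obtained by
    # intersecting the inverted lists of the terms in Wq.
    pq = set.intersection(*(inv_index.get(t, set()) for t in Wq))
    # P_q(r): keep only the candidates within distance r of q.
    in_range = [i for i in pq if math.dist(points[i], q) <= r]
    # Sort by distance to q and report the k closest.
    in_range.sort(key=lambda i: math.dist(points[i], q))
    return in_range[:k]
```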

Let us first look at a simpler problem:

Problem (Nearest Neighbor Search)
Let P be a set of points in ℝ². Given a point q ∈ ℝ² and an integer k, a k nearest neighbor (kNN) query returns the k points in P with the smallest Euclidean distances to q.

Example: Consider again the points p1, …, p8 and the white query point q from the previous example (figure omitted). Given q and k = 1, a kNN query returns p1. The same query with k = 2 returns p1 together with the point second closest to q.

Nearest neighbor search can be solved efficiently by indexing P with an R-tree T, defined as follows:
- All the leaves of T are at the same level.
- Every point of P is stored in a unique leaf node of T.
- Every internal node stores the minimum bounding rectangle (MBR) of the points stored in its subtree.
See the next slide for an example.

Example: The root u0 has entries E1 and E2, which point to the internal nodes u1 and u2; u1 contains entries E3 and E4, and u2 contains E5 and E6, pointing to the leaves u3, u4, u5, and u6, which store the points {p4, p8}, {p2, p6}, {p3, p5}, and {p1, p7}, respectively. (In the slide, the left figure shows the tree, whereas the right figure shows the points and the MBRs E1, …, E6 on the grid; figures omitted.)
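As a minimal illustration (not part of the lecture), such a tree could be represented in Python as follows; Node, mbr, children, and points are illustrative names, reused by the later sketches in this section:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]    # (x_min, y_min, x_max, y_max)

@dataclass
class Node:
    mbr: Rect                                # MBR of everything in the subtree
    children: Optional[List["Node"]] = None  # set for internal nodes
    points: Optional[List[Point]] = None     # set for leaf nodes

def mbr_of(pts: List[Point]) -> Rect:
    """The minimum bounding rectangle of a non-empty list of points."""
    xs, ys = zip(*pts)
    return (min(xs), min(ys), max(xs), max(ys))
```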

The mindist of a point p and a rectangle R is the shortest distance between p and any point on R (figure omitted). Think: How would you compute mindist(p, R)?
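One possible answer to the Think question, sketched in Python (the rectangle encoding is an assumption, not from the lecture): clamp each coordinate of p to the rectangle's extent; the clamped point is the point of R nearest to p.

```python
import math

def mindist(p, rect):
    """Shortest Euclidean distance from the point p = (x, y) to the
    axis-parallel rectangle rect = (x_min, y_min, x_max, y_max);
    0 if p lies inside rect."""
    x, y = p
    x_min, y_min, x_max, y_max = rect
    # Clamping p to the rectangle yields the point of R nearest to p.
    nearest_x = min(max(x, x_min), x_max)
    nearest_y = min(max(y, y_min), y_max)
    return math.hypot(x - nearest_x, y - nearest_y)
```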

We can answer a 1NN query with a distance browsing (also called best-first) algorithm:

algorithm best-first(T, q)
/* q is the query point; T is an R-tree */
1. S ← { the MBR of the root of T }
2. while (true)
3.     R ← the rectangle in S with the smallest mindist(q, R); remove R from S
4.     if R is a data point p then
5.         return p
6.     else if R is the MBR of an internal node u then
7.         insert into S the MBRs of all the child nodes of u
8.     else /* R is the MBR of a leaf node u */
9.         insert into S all the points stored in u

(When R is a data point p, mindist(q, R) is simply dist(q, p).)
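A runnable sketch of the algorithm, assuming the Node representation and the mindist function from the sketches above (the tie counter only keeps the heap from ever comparing Node objects):

```python
import heapq
import itertools
import math

def best_first_1nn(root, q):
    """Distance browsing for 1NN. root: the R-tree's root Node;
    q: the query point. S is a min-heap keyed on mindist."""
    tie = itertools.count()
    S = [(mindist(q, root.mbr), next(tie), root)]
    while S:
        _, _, entry = heapq.heappop(S)       # Line 3: smallest mindist first
        if isinstance(entry, tuple):         # Lines 4-5: a data point,
            return entry                     # so nothing in S can be closer
        if entry.children is not None:       # Lines 6-7: internal node
            for child in entry.children:
                heapq.heappush(S, (mindist(q, child.mbr), next(tie), child))
        else:                                # Lines 8-9: leaf node
            for p in entry.points:
                heapq.heappush(S, (math.dist(p, q), next(tie), p))
    return None                              # P is empty
```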

Example: Given a 1NN query with the query point q as shown in the earlier figure, the algorithm accesses the nodes u0, u2, u1, and u4 before returning the answer.

Think: Why is the algorithm correct?
Think: How would you extend the algorithm (easily) to answer a kNN query?
Think: If we only want to return the k nearest neighbors of a query point q that are within distance r from q, how would you extend the algorithm (easily)?

Now, let us return to the kNNwK problem. We can create a structure that combines the inverted index and the R-tree: for every term t in the dictionary, let P(t) be the set of points p ∈ P such that t ∈ W_p, and create an R-tree on P(t), i.e., one R-tree per term t.
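Concretely, the construction just groups the points by term and indexes each group; a sketch, where build_rtree stands for any R-tree construction routine (e.g., bulk loading) and is an assumed parameter, not a library call:

```python
from collections import defaultdict

def build_keyword_rtrees(points, keyword_sets, build_rtree):
    """points: id -> (x, y); keyword_sets: id -> W_p (set of terms).
    build_rtree: any routine building an R-tree from a list of points.
    Returns term t -> R-tree on P(t)."""
    groups = defaultdict(list)               # term t -> the point set P(t)
    for pid, terms in keyword_sets.items():
        for t in terms:
            groups[t].append(points[pid])
    return {t: build_rtree(pts) for t, pts in groups.items()}
```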

Example: For the dataset of our running example (point layout and W_p as before; figure omitted), the inverted lists are:

word | inverted list
a    | p1, p4
b    | p1, p2, p7
c    | p5, p6, p8
d    | p2, p3, p6, p8
e    | p4, p5, p6, p7

Create an R-tree on the points in the inverted list of each word.

We now extend the best-first algorithm to answer a 1-NNwK query:

algorithm best-first-1-nnwk(q, r, W_q)
/* q is the query point, r is a distance range, and W_q is a set of query words */
1. S ← the set of root MBRs of the R-trees of the words in W_q
2. while (true)
3.     R ← the rectangle in S with the smallest mindist(q, R); remove R from S
4.     if mindist(q, R) > r then
5.         return nothing /* no point within distance r matches all of W_q */
6.     if R is a data point p then
7.         p.cnt ← p.cnt + 1 /* p.cnt is 0 initially */
8.         if p.cnt = |W_q| then
9.             return p
10.    else if R is the MBR of an internal node u then
11.        insert into S the MBRs of all the child nodes of u
12.    else /* R is the MBR of a leaf node u */
13.        insert into S all the points stored in u
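A runnable sketch of best-first-1-nnwk, under the same assumptions as the 1NN sketch above (Node and mindist as defined earlier; all other names are illustrative). A dictionary of counters plays the role of p.cnt:

```python
import heapq
import itertools
import math
from collections import defaultdict

def best_first_1nnwk(trees, q, r, Wq):
    """trees: term -> root Node of that term's R-tree; q: query point;
    r: distance range; Wq: set of query terms. Browses all |Wq| trees
    through one shared min-heap; a point is the answer once it has been
    popped from every tree (Line 8)."""
    tie = itertools.count()                  # tie-breaker for equal keys
    S = [(mindist(q, trees[t].mbr), next(tie), trees[t])
         for t in Wq if t in trees]
    heapq.heapify(S)
    cnt = defaultdict(int)                   # plays the role of p.cnt
    while S:
        d, _, entry = heapq.heappop(S)
        if d > r:                            # Lines 4-5: out of range
            return None
        if isinstance(entry, tuple):         # Lines 6-9: a data point
            cnt[entry] += 1
            if cnt[entry] == len(Wq):        # seen in all |Wq| trees
                return entry
        elif entry.children is not None:     # Lines 10-11: internal node
            for child in entry.children:
                heapq.heappush(S, (mindist(q, child.mbr), next(tie), child))
        else:                                # Lines 12-13: leaf node
            for p in entry.points:
                heapq.heappush(S, (math.dist(p, q), next(tie), p))
    return None                              # the heap ran dry: no answer
```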

Example: For q being the point shown earlier, r = 5, k = 1, and W_q = {c, d}, the algorithm browses the R-trees of c and d through one shared priority queue. It first visits two points that contain only one of the two query keywords (so their counts can never reach 2), then visits p6 twice, once from the R-tree of c and once from that of d; it terminates upon seeing p6 the second time and returns p6.

Think: Why is Line 8 needed?
Think: This algorithm is typically much faster than the naive algorithm mentioned at the beginning when r is small. Why?
Think: How would you extend the algorithm to answer kNNwK queries?