Nearest Neighbor Search with Keywords: Compression
Yufei Tao
KAIST
June 3, 2013

In this lecture, we will continue our discussion on:

Problem (Nearest Neighbor Search with Keywords). Let P be a set of points in N^2, where N represents the set of integers. Each point p ∈ P is associated with a set W_p of terms. Given a point q ∈ N^2, an integer k, a real value r, and a set W_q of terms, a k nearest neighbor with keywords (kNNwK) query returns the k points in P_q(r) with the smallest Euclidean distances to q, where

P_q(r) = { p ∈ P | W_q ⊆ W_p and dist(p, q) ≤ r }

and dist(p, q) is the Euclidean distance between p and q.
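To make the definition concrete, here is a minimal brute-force sketch of a kNNwK query in Python (the function and argument names are ours, not from the lecture): it filters by keywords and distance, then sorts by distance to q.

```python
import math

def knnwk(P, W, q, k, r, Wq):
    """Brute-force kNNwK query.

    P  : list of (x, y) points
    W  : W[i] is the term set of P[i]
    q  : query point (x, y)
    k  : number of neighbors requested
    r  : distance threshold
    Wq : set of query terms
    """
    # P_q(r): points whose term set covers Wq and that lie within distance r of q
    cand = [p for p, terms in zip(P, W)
            if Wq <= terms and math.dist(p, q) <= r]
    cand.sort(key=lambda p: math.dist(p, q))  # ascending distance to q
    return cand[:k]
```

This scans the entire data set for every query; the rest of the lecture builds a compressed, indexed representation to avoid exactly that.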

In the previous lecture, we learned the basic ideas: For every term t, make an inverted list list(t) collecting all the points p ∈ P such that t ∈ W_p. Index each list(t) with an R-tree. Answer a query via distance browsing.

In this lecture, we will see how to implement these ideas efficiently. In particular, we will discuss: How to compress each list(t)? How to build an R-tree on each list(t)? After resolving these issues, we will obtain a structure called the spatial inverted index. For simplicity, we will focus on k = 1, namely, the 1NNwK problem; extensions to arbitrary k are straightforward and left to you. We will assume that every x- and y-coordinate is in the range [0, U − 1], where U is a power of 2 that equals the length of the x- and y-dimensions.

Each entry in list(t) has the form (id, x, y). (Think: Why do we need the id?) Instead, we will store (pid, z), where: pid is an integer called a pseudo id; z is an integer called a z-value, from which we can recover x and y uniquely.

Z-order: an encoding that converts a 2d point p = (x, y) to a one-dimensional value z(p), called the z-value of p. Suppose that x = a_1 a_2 ... a_l and y = b_1 b_2 ... b_l in binary, where l = log_2 U (recall that the x- and y-dimensions have domain [0, U − 1], and U is a power of 2). Then z(p) = a_1 b_1 a_2 b_2 ... a_l b_l in binary.

Example. Suppose x = 13, y = 25, and U = 32. Then:
x = 01101 in binary
y = 11001 in binary
z(p) = 0111100011 in binary = 483 in decimal.
Given z(p), we can decode x and y in a straightforward manner.
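Bit interleaving is easy to implement. Below is a small Python sketch (helper names are ours) that encodes and decodes z-values; z_encode(13, 25, 5) returns 483, matching the example above.

```python
def z_encode(x, y, l):
    """Interleave the l bits of x and y: z = a1 b1 a2 b2 ... al bl."""
    z = 0
    for i in range(l - 1, -1, -1):        # from the most significant bit down
        z = (z << 1) | ((x >> i) & 1)     # append the next bit of x
        z = (z << 1) | ((y >> i) & 1)     # append the next bit of y
    return z

def z_decode(z, l):
    """Undo the interleaving to recover (x, y)."""
    x = y = 0
    for i in range(l - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)   # odd positions hold x's bits
        y = (y << 1) | ((z >> (2 * i)) & 1)       # even positions hold y's bits
    return x, y
```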

Z-order: pictorial illustrations. [Figures: the z-order curve for U = 2^0, 2^1, and 2^2, with every cell labeled by its z-value (0 for U = 1; 0–3 for U = 2; 0–15 for U = 4).]

Z-order: the z-order encoding is a form of space filling curve, which aims at converting 2d points to 1d values in a proximity-preserving manner. This means: if two points p_1 and p_2 are close in 2d space, then z(p_1) and z(p_2) are often close; and if z(p_1) and z(p_2) are close, then p_1 and p_2 are often close in 2d space. (Think: Can you observe from the previous slide that the above sometimes does not hold? Do you think a perfect space filling curve could exist that makes the above always true?)

Now, let us get back to the 1NNwK problem. Let us sort all the points in P by z-value, and assign each point p ∈ P a pseudo id pid(p) that equals the rank of p in the sorted list (i.e., the first point has rank 1, the second has rank 2, ...).

Example: [Figure: the data points p_1, ..., p_8 and the query q in an 8 × 8 grid.] The z-values of the data points, in sorted order, are:

p_6: 12, p_2: 15, p_8: 23, p_4: 24, p_7: 41, p_1: 50, p_3: 52, p_5: 59

Hence, pid(p_6) = 1, pid(p_2) = 2, ...

We are ready to explain how to compress an inverted list list(t). Let S be the set of points in list(t). We sort these points in ascending order of their pseudo ids (and hence, also in ascending order of their z-values). Each entry of the list has the form (pid, z). Apply the gapping technique discussed previously: the first pair is stored as is, and for the i-th pair (i ≥ 2), store (Δpid, Δz), where Δpid is the difference between the pid of the i-th pair and that of the (i − 1)-th pair, and similarly for Δz. Store all integers using the Elias gamma code.

Example: [Figure: the same data set and query as before.] The term sets of the points are:

W_{p_1} = {a, b}, W_{p_2} = {b, d}, W_{p_3} = {d}, W_{p_4} = {a, e}, W_{p_5} = {c, e}, W_{p_6} = {c, d, e}, W_{p_7} = {b, e}, W_{p_8} = {c, d}.

Recall the z-values: p_6: 12, p_2: 15, p_8: 23, p_4: 24, p_7: 41, p_1: 50, p_3: 52, p_5: 59.

Then list(d) has pairs (1, 12), (2, 15), (3, 23), (7, 52). Hence, we store (1, 12), (1, 3), (1, 8), (4, 29).
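The example is easy to reproduce in code. Here is a Python sketch (helper names ours) that gap-encodes a sorted (pid, z) list and writes every integer in the Elias gamma code, i.e., ⌊log_2 n⌋ zeros followed by the binary representation of n:

```python
def elias_gamma(n):
    """Gamma code of a positive integer n: floor(log2 n) zeros, then n in
    binary (so 1 -> '1', 3 -> '011', 29 -> '000011101')."""
    assert n >= 1
    b = bin(n)[2:]                    # binary of n without leading zeros
    return "0" * (len(b) - 1) + b

def compress(pairs):
    """Gap-encode (pid, z) pairs sorted by pid, then gamma-code each integer.
    With prev = (0, 0), the first pair is effectively stored in original form."""
    bits, prev_pid, prev_z = [], 0, 0
    for pid, z in pairs:
        bits += [elias_gamma(pid - prev_pid), elias_gamma(z - prev_z)]
        prev_pid, prev_z = pid, z
    return "".join(bits)

# list(d) from the example:
print(compress([(1, 12), (2, 15), (3, 23), (7, 52)]))
```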

Lemma. Let n = |P|. If list(t) has r points, our compression scheme uses O(r(log(n/r) + log(U^2/r))) bits.

(Intuition: the gamma code of a gap g takes O(1 + log g) bits; the pid-gaps sum to at most n and the z-gaps to at most U^2, so by the concavity of log the totals are maximized when all r gaps are equal.)

Next, we describe a simple way to create effective R-trees. Let S be a set of 2d points, sorted by z-value in ascending order; let L be the sorted list. Given a parameter b, cut L into blocks of b consecutive points, and treat each block as a leaf node. Once all the leaf nodes have been decided, the internal nodes follow by building the tree bottom-up, level by level.
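As an illustration, here is a compact bottom-up bulk-loading sketch in Python (our own simplification of the idea, not the lecture's exact procedure): it packs the z-ordered points into blocks of b and repeats the packing one level up until a single root remains.

```python
def bulk_load(points, b):
    """points: (x, y) tuples already sorted by z-value; b: node capacity.
    A node is a pair (mbr, children), where mbr = (xmin, ymin, xmax, ymax)."""
    def mbr(rects):
        return (min(r[0] for r in rects), min(r[1] for r in rects),
                max(r[2] for r in rects), max(r[3] for r in rects))

    # each point starts out as a degenerate rectangle
    level = [((x, y, x, y), (x, y)) for x, y in points]
    while len(level) > 1:
        # cut the current level into blocks of b consecutive entries
        level = [(mbr([e[0] for e in block]), block)
                 for block in (level[i:i + b] for i in range(0, len(level), b))]
    return level[0]        # the root of the packed R-tree
```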

Example (b = 2): [Figure: the data points with leaf MBRs E_3, E_4, E_5, E_6 and their parent MBRs E_1, E_2.] The root u_0 contains E_1 and E_2, which point to internal nodes u_1 = {E_3, E_4} and u_2 = {E_5, E_6}; the leaves u_3, ..., u_6 are the blocks of size 2 in z-value order: {p_6, p_2}, {p_8, p_4}, {p_7, p_1}, {p_3, p_5} (recall the z-values p_6: 12, p_2: 15, p_8: 23, p_4: 24, p_7: 41, p_1: 50, p_3: 52, p_5: 59).

We apply this idea to create R-trees on the inverted lists. Only one issue is left. Currently, each inverted list has been compressed using the gapping technique; as a result, to decompress a point p in an inverted list, we must read the bits of all the points before p in the list. This creates a problem because, in answering a query by distance browsing, we must be able to decompress all the points in a leaf node quickly. To avoid this problem, we can instead apply the gapping idea locally in each leaf node: if a leaf node contains a sequence L of points (sorted by z-order), the (pid, z) pair of the first point in L is represented in its original form, and every subsequent point is gapped against its predecessor. One can show that the extra space overhead thus introduced is limited as long as b is not too small.
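A sketch of the leaf-local scheme (helper names ours): each leaf can now be decompressed independently of all other leaves, which is exactly what distance browsing needs.

```python
def compress_leaf(pairs):
    """Store the first (pid, z) pair verbatim; every later pair as
    (delta-pid, delta-z) against its predecessor."""
    gaps = [(p2 - p1, z2 - z1)
            for (p1, z1), (p2, z2) in zip(pairs, pairs[1:])]
    return pairs[0], gaps

def decompress_leaf(first, gaps):
    """Rebuild the leaf without reading any other leaf."""
    out = [first]
    for dp, dz in gaps:
        pid, z = out[-1]
        out.append((pid + dp, z + dz))
    return out
```

The only overhead versus global gapping is one uncompressed (pid, z) pair per leaf, i.e., roughly log n + log U^2 extra bits per b points.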