High-Dimensional Indexing by Distributed Aggregation


High-Dimensional Indexing by Distributed Aggregation Yufei Tao ITEE University of Queensland

In this lecture, we will learn a new approach for indexing high-dimensional points. The approach borrows ideas from distributed aggregation: a class of problems in distributed computing that appears rather different from the geometric setup of high-dimensional indexing. We will also learn a new way to analyze the performance of an algorithm. This method, called competitive analysis, aims to prove that an algorithm is efficient on every input of the problem.

Distributed Computation. We will now temporarily forget about high-dimensional data, and focus instead on the paradigm of distributed computing. In this paradigm, there are m servers which do not communicate with each other (think of them as geographically far apart). We want to perform certain computations on the data stored on these servers, from a computer that is connected to the m servers over a network. For this purpose, we need to acquire data from the servers over the network. The goal is to minimize the amount of communication required....

Many interesting problems can be studied in the paradigm described in the previous slide. We will look at one of them in this lecture. The problem is called distributed sum as defined in the next few slides.

Distributed Sum. Let S be a set of n objects. Each object o ∈ S has m attributes, denoted as A_1(o), A_2(o), ..., A_m(o), respectively. For each i ∈ [1, m], the A_i-values of all objects are stored on server i. Specifically, the server sorts the set {A_i(o) | o ∈ S} in non-descending order; denote the sorted list as L_i. Each object o has a score defined as score(o) = Σ_{i=1}^{m} A_i(o). We want to identify the object with the smallest score (if multiple objects have this score, all should be reported).

Example:
List L_1 (attr. A_1): (o_3, 1), (o_1, 8), (o_4, 9), (o_5, 10), (o_2, 12), (o_6, 18)
List L_2 (attr. A_2): (o_6, 2), (o_1, 3), (o_2, 4), (o_4, 5), (o_3, 6), (o_5, 7)
List L_3 (attr. A_3): (o_1, 1), (o_5, 2), (o_3, 3), (o_6, 8), (o_4, 10), (o_2, 11)
List L_4 (attr. A_4): (o_1, 2), (o_3, 3), (o_2, 10), (o_5, 13), (o_6, 20), (o_4, 27)

The answer is o_3, which has the smallest score 1 + 6 + 3 + 3 = 13.

Distributed Sum. Each server i ∈ [1, m] provides two operations:
getnext(): returns the next object in L_i (or NULL if L_i has been exhausted).
probe(o): returns the A_i-value of the requested object o directly.

Example (with the same lists L_1, ..., L_4 as before): the first getnext() on L_2 returns (o_6, 2), while the second getnext() on the same list returns (o_1, 3). Performing probe(o_5) on L_3 returns (o_5, 2).
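To make the two operations concrete, here is a minimal in-memory sketch in Python (the class name Server and its internals are illustrative assumptions; the slides only prescribe getnext and probe), populated with the four lists of the running example.

class Server:
    def __init__(self, pairs):
        # pairs: the list L_i, i.e., (object, A_i-value) pairs sorted in
        # non-descending order of the value.
        self.pairs = pairs
        self.cursor = 0               # position returned by the next getnext
        self.values = dict(pairs)     # object -> A_i(o), used by probe

    def getnext(self):
        # Returns the next pair of L_i, or None once L_i is exhausted.
        if self.cursor >= len(self.pairs):
            return None
        pair = self.pairs[self.cursor]
        self.cursor += 1
        return pair

    def probe(self, o):
        # Returns the A_i-value of the requested object o directly.
        return self.values[o]

# The four lists of the running example.
servers = [
    Server([("o3", 1), ("o1", 8), ("o4", 9), ("o5", 10), ("o2", 12), ("o6", 18)]),  # L_1
    Server([("o6", 2), ("o1", 3), ("o2", 4), ("o4", 5), ("o3", 6), ("o5", 7)]),     # L_2
    Server([("o1", 1), ("o5", 2), ("o3", 3), ("o6", 8), ("o4", 10), ("o2", 11)]),   # L_3
    Server([("o1", 2), ("o3", 3), ("o2", 10), ("o5", 13), ("o6", 20), ("o4", 27)]), # L_4
]

print(servers[1].getnext())    # ('o6', 2)
print(servers[1].getnext())    # ('o1', 3)
print(servers[2].probe("o5"))  # 2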

Distributed Sum. Our objective is to retrieve the object(s) with the smallest score using only the aforementioned operations. The cost of an algorithm is the number of operations performed.

In the running example, we can easily solve the problem by retrieving everything from all servers, but this incurs a cost of nm. The challenge is to design an algorithm that reduces this cost as much as possible.

Applications This problem underlies a high-dimensional indexing approach that we will explain later. Moreover, it directly models a class of applications that aim to find the top-1 object by aggregating scores from multiple sites: Finding the best hotel (in a region, e.g., Brisbane) by aggregating user ratings from several booking sites. Finding the most popular movie by aggregating ratings from several movie recommendation sites....

Algorithm. We now describe an algorithm named the threshold algorithm (TA) for solving the problem. Let us gain some ideas about the algorithm using our running example.

First, perform a getnext on every server. In this way, we obtain (o_3, 1) from L_1, (o_6, 2) from L_2, (o_1, 1) from L_3, and (o_1, 2) from L_4. For L_1, set the threshold τ_1 = 1, indicating that all the unseen objects on L_1 must have an A_1-value of at least 1. Similarly, set τ_2 = 2, τ_3 = 1, and τ_4 = 2.

Algorithm. Throughout the algorithm, whenever a new object o is returned by a server, we retrieve the remaining m − 1 attributes of o from the other servers using the probe operation. This allows us to calculate score(o) immediately.

As mentioned, the getnext on L_1 retrieved o_3; thus, we perform probe(o_3) on L_2, L_3, and L_4. This gives score(o_3) = 13. Similarly, we obtain score(o_6) = 48 and score(o_1) = 14.

Algorithm. We then repeat the above. Namely, for each i ∈ [1, m], perform another getnext on L_i. Suppose that it returns (o, A_i(o)); we update τ_i to A_i(o). If o has not been seen before, obtain its values on the other servers with probe(o).

Continuing our example, we retrieve the 2nd pair from each list: (o_1, 8), (o_1, 3), (o_5, 2), (o_3, 3). Since o_5 is seen for the first time, we retrieve A_1(o_5) = 10, A_2(o_5) = 7, A_4(o_5) = 13, and calculate score(o_5) = 32. Now τ_1 = 8, τ_2 = 3, τ_3 = 2, and τ_4 = 3.

Algorithm. Recall that our current best object o_3 has score 13. Since Σ_{i=1}^{4} τ_i = 8 + 3 + 2 + 3 = 16 > 13, the algorithm terminates here with o_3 as the final result (think: why?).

Pseudocode of the TA Algorithm

algorithm TA
1. τ_1 = τ_2 = ... = τ_m = −∞
2. while Σ_{j=1}^{m} τ_j ≤ score(o_best), where o_best is the object with the lowest score among the seen objects
3.    for i = 1 to m
4.        (o, A_i(o)) = getnext() on L_i
5.        τ_i = A_i(o)
6.        if o is seen for the first time then
7.            obtain score(o) by probing the values of o on the other m − 1 servers

We will refer to Lines 3-7 collectively as a round. In our previous example, the algorithm terminated after 2 rounds.
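The pseudocode can be turned into the following Python sketch, which runs against the Server objects from the earlier sketch (the function name threshold_algorithm is an assumption; the initial thresholds are taken as negative infinity, and a guard is added for the case where all lists become exhausted).

def threshold_algorithm(servers):
    # Returns (best_objects, best_score, cost), where cost counts the
    # getnext and probe operations performed.
    m = len(servers)
    tau = [float("-inf")] * m   # Line 1: thresholds
    scores = {}                 # score(o) for every object seen so far
    cost = 0
    while True:
        progressed = False
        for i, server in enumerate(servers):        # Lines 3-7: one round
            pair = server.getnext()                  # Line 4
            cost += 1
            if pair is None:                         # L_i exhausted
                continue
            progressed = True
            o, value = pair
            tau[i] = value                           # Line 5
            if o not in scores:                      # Line 6: o seen for the first time
                score = value
                for j, other in enumerate(servers):  # Line 7: probe the other servers
                    if j != i:
                        score += other.probe(o)
                        cost += 1
                scores[o] = score
        best_score = min(scores.values())
        # Line 2: terminate once no unseen object can have a smaller score.
        if sum(tau) > best_score or not progressed:
            best = [o for o, s in scores.items() if s == best_score]
            return best, best_score, cost

On the running example, threshold_algorithm(servers) reports o3 with score 13 after two rounds.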

Analysis. The cost of the TA algorithm heavily depends on the input. On the easiest input, it may finish with a cost of O(m), while on the hardest input it may incur a cost of Ω(nm).

Best case, cost O(m): every list begins with (o_1, 0) and all the remaining values are 1, so TA identifies o_1 after only a constant number of rounds.

Worst case, cost Ω(nm): lists L_1, ..., L_{m/2} contain (o_1, 0), (o_2, 0), ..., (o_{n/2}, 0) followed by (o_{n/2+1}, 1), (o_{n/2+2}, 1), ..., (o_n, 1), while lists L_{m/2+1}, ..., L_m contain (o_{n/2+1}, 0), ..., (o_n, 0) followed by (o_1, 1), ..., (o_{n/2}, 1). Every object then has the same score m/2, so the thresholds exceed the best score only after about n/2 rounds, for a total cost of Ω(nm).

Analysis. Next, we will show that the TA algorithm in fact performs reasonably well on every input (even when its cost is Ω(nm)!). More specifically, we will show that its cost can be higher than that of any algorithm by a factor of at most O(m²). Hence, when m is small, the TA algorithm cannot be improved considerably.

Analysis. Suppose that we have fixed the input, namely L_1, L_2, ..., L_m. Denote by cost_TA the number of operations performed by the TA algorithm. Consider any other algorithm α whose cost is cost_α. We will show that

cost_TA = O(m²) · cost_α.

If TA has the above property, it is said to be O(m²)-competitive.

Analysis. Lemma: cost_α ≥ m.

Proof: We will prove that α must access at least one value from each of L_1, L_2, ..., L_m. Suppose that this is not true, i.e., α terminates without accessing anything from L_i for some i ∈ [1, m]. Let o* be the object returned by α. As α knows nothing about A_i(o*), it cannot rule out the possibility that A_i(o*) is exceedingly large and prevents o* from being the best object.

Analysis. We will now prove our main theorem.

Theorem: cost_TA ≤ (m² + m) · cost_α.

Proof: First, if cost_α ≥ n, the theorem follows from the fact that cost_TA ≤ nm + n(m − 1) (at most nm getnext operations overall, plus at most m − 1 probe operations per distinct object), which is at most (m² + m) · n ≤ (m² + m) · cost_α. Next, we will focus exclusively on the scenario where cost_α < n. Notice that in this case there must be at least one object o# that has not been seen by α, that is, α does not know any attribute of o#. In the following, we will denote by o* the best object eventually reported (if multiple objects are returned, let o* be any of them).

Analysis. Proof (cont.): Let x be the number of rounds performed by TA. Recall that (i) each round incurs m getnext operations, and (ii) each getnext is followed by at most m − 1 probe operations. Therefore:

cost_TA ≤ m·x + m·x·(m − 1) = m²·x.

Next, we will prove:

cost_α ≥ x − 1.    (1)

If we manage to do so, then we will have

cost_TA ≤ m²·(cost_α + 1) = m²·cost_α + m² ≤ m²·cost_α + m·cost_α = (m² + m)·cost_α

(the last inequality uses the lemma cost_α ≥ m), which will complete the proof.

Analysis. Proof (cont.): Now it remains to prove (1). Let P_i be the sequence of the first x − 1 objects of L_i (TA accesses the first x objects of each list, one per round). Next, we will prove that α must access all the positions of at least one of P_1, ..., P_m. This will establish (1), and hence complete the whole proof.

Analysis. Proof (cont.): Let τ_i be the A_i-value of the last object in P_i (i.e., the value of the threshold τ_i at the end of round x − 1). It must hold that

Σ_{i=1}^{m} τ_i ≤ score(o*).    (2)

To see this, consider:

Case 1: o* is seen in the first x − 1 rounds. In this case, o* must be the best object at the end of round x − 1. By the fact that TA did not terminate at round x − 1, (2) holds.

Case 2: o* is seen for the first time in the x-th round. It thus follows that A_i(o*) ≥ τ_i for every i ∈ [1, m]. In this case, (2) also holds.

Analysis. Proof (cont.): Assume for contradiction that every P_i has at least one position that is not accessed by algorithm α. Recall that there is an object o# such that α has not seen any attribute of o#. As a result, α cannot rule out the possibility that o# appears in P_i for every i ∈ [1, m], namely, o# sits at the position in P_i that is missed by α. Under that possibility, score(o#) ≤ Σ_{i=1}^{m} τ_i ≤ score(o*); hence, α would be incorrect by not outputting o#.

We now return to high-dimensional spaces, and explain how the TA algorithm can be utilized for index design. For this purpose, we will again use nearest neighbor search (and its approximate version) as the representative problem. Recall: let P be a set of d-dimensional points in R^d. The (Euclidean) nearest neighbor (NN) of a query point q ∈ R^d is the point p ∈ P that has the smallest Euclidean distance to q. We denote the Euclidean distance between p and q as ‖p, q‖.

Index Creation. For every dimension i ∈ [1, d], simply sort P by coordinate i. This creates d sorted lists, P_1, P_2, ..., P_d, one for each dimension. For each P_i, we create a suitable structure (e.g., a hash table) that allows us to obtain, in O(1) time, the i-th coordinate of any given point p ∈ P.
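A minimal Python sketch of this preprocessing step follows (the names build_index, sorted_lists, and coords are illustrative assumptions; any structure with O(1) coordinate lookup would serve).

def build_index(P):
    # P: a list of d-dimensional points, each a tuple of coordinates.
    n, d = len(P), len(P[0])
    # P_i for i = 1, ..., d: point ids sorted by the i-th coordinate.
    sorted_lists = [sorted(range(n), key=lambda pid, i=i: P[pid][i]) for i in range(d)]
    # O(1) lookup of any coordinate of any point.
    coords = {pid: P[pid] for pid in range(n)}
    return sorted_lists, coords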

Answering a NN Query. We will do so by turning the query into a distributed sum problem. Let q = (q[1], q[2], ..., q[d]) be the query point. Suppose that, somehow, we have obtained d sorted lists L_1, L_2, ..., L_d of P where, in L_i (i ∈ [1, d]), the points p ∈ P have been sorted in ascending order of (p[i] − q[i])². Then, the goal is essentially to minimize

score(p) = Σ_{i=1}^{d} (p[i] − q[i])²

which is a distributed sum problem with m = d lists.

Answering a NN Query. At first glance, it appears that we do not have L_1, L_2, ..., L_d of P. In fact, we do. It is left to you to think about: how to implement the getnext operation in O(1) time (except for the first getnext, which takes O(log n) time), and how to implement the probe operation in O(1) time. One possibility is sketched below.
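The sketch below assumes the index of the previous sketch is available: for dimension i, keep two cursors in P_i that start at the position of q[i] (found by one binary search) and expand outward, so that every subsequent getnext merges the two directions in O(1) time, while probe is a dictionary lookup. The class name DimensionServer is an illustrative assumption.

from bisect import bisect_left

class DimensionServer:
    # Enumerates the points of P in ascending order of (p[i] - q[i])^2.
    def __init__(self, sorted_list, coords, i, q):
        self.order = sorted_list      # point ids sorted by coordinate i
        self.coords = coords
        self.i = i
        self.q = q
        keys = [coords[pid][i] for pid in sorted_list]
        # The first getnext costs O(log n): locate q[i] in the sorted order.
        self.right = bisect_left(keys, q[i])
        self.left = self.right - 1

    def getnext(self):
        # Each call is O(1): advance whichever frontier is closer to q[i].
        i, q = self.i, self.q
        lval = abs(self.coords[self.order[self.left]][i] - q[i]) if self.left >= 0 else float("inf")
        rval = abs(self.coords[self.order[self.right]][i] - q[i]) if self.right < len(self.order) else float("inf")
        if lval == float("inf") and rval == float("inf"):
            return None
        if lval <= rval:
            pid = self.order[self.left]
            self.left -= 1
            return pid, lval * lval
        pid = self.order[self.right]
        self.right += 1
        return pid, rval * rval

    def probe(self, pid):
        # O(1): the i-th coordinate is available from the lookup structure.
        diff = self.coords[pid][self.i] - self.q[self.i]
        return diff * diff

Running the TA algorithm over the d DimensionServer objects of a query q then returns the point minimizing Σ_i (p[i] − q[i])², i.e., the exact NN of q.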

2D Illustration (figure omitted): the algorithm accesses (at most) the points of P in the shaded area, where p is the NN of q.

Recall that the TA algorithm is (2m²)-competitive. In other words, the NN query algorithm is good only if m = d is small. Unfortunately, this contradicts our goal of supporting high-dimensional queries. Interestingly, the problem is significantly alleviated if it suffices to retrieve approximate NNs, in which case we only need to consider m = O(log n), no matter how large d is.

ε-approximate Nearest Neighbor Search. Let P be a set of d-dimensional points in R^d. Given a query point q ∈ R^d, an ε-approximate nearest neighbor query returns a point p ∈ P satisfying

‖p, q‖ ≤ (1 + ε) · ‖p*, q‖

where p* is the NN of q in P.

The Johnson-Lindenstrauss (JL) Lemma. We will not state the lemma formally, but we will explain how to use it for ε-approximate NN search. Our description will trade some rigor for simplicity, but should work quite well in practice.

First, pick m = c·log_2 n random lines l_1, l_2, ..., l_m passing through the origin, where c is a sufficiently large constant (depending on ε). Then, project all the points of P onto the subspace defined by l_1, l_2, ..., l_m. Let P' be the resulting dataset; note that P' is m-dimensional.

Given a query q in the original space, we obtain its projection q' in the subspace. Then, we find the (exact) NN p' of q' in the subspace (using the TA algorithm as explained before), and return the original point p ∈ P whose projection is p'. The JL lemma states that p is a (1 + ε)-approximate NN of q with high probability.
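A rough Python sketch of this pipeline follows (assumptions: NumPy, Gaussian random directions, and an arbitrary constant c = 4; for brevity the exact NN in the projected space is found by brute force rather than by the TA algorithm).

import numpy as np

def project(P, m, seed=0):
    # Pick m random directions through the origin and project the n x d
    # dataset P onto them, yielding an n x m dataset P'.
    rng = np.random.default_rng(seed)
    d = P.shape[1]
    lines = rng.standard_normal((m, d))
    lines /= np.linalg.norm(lines, axis=1, keepdims=True)
    return P @ lines.T, lines

def approximate_nn(P, q, c=4):
    # Returns a point of P that is, with high probability, an approximate NN
    # of q; the approximation quality depends on the (heuristic) choice of c.
    n = P.shape[0]
    m = max(1, int(c * np.log2(n)))
    P_proj, lines = project(P, m)
    q_proj = lines @ q
    best = np.argmin(np.sum((P_proj - q_proj) ** 2, axis=1))
    return P[best]

# Example usage (hypothetical data):
# P = np.random.default_rng(1).standard_normal((1000, 50))
# q = np.zeros(50)
# print(approximate_nn(P, q))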