Lecture 14 October 16, 2014

Similar documents
Lecture 17 03/21, 2017

Locality Sensitive Hashing

Optimal Data-Dependent Hashing for Approximate Near Neighbors

Geometry of Similarity Search

Similarity Search in High Dimensions II. Piotr Indyk MIT

4 Locality-sensitive hashing using stable distributions

Lecture 9 Nearest Neighbor Search: Locality Sensitive Hashing.

Lecture 8 January 30, 2014

Algorithms for Data Science: Lecture on Finding Similar Items

Optimal Lower Bounds for Locality Sensitive Hashing (except when q is tiny)

Lower bounds on Locality Sensitive Hashing

Proximity problems in high dimensions

Approximate Nearest Neighbor (ANN) Search in High Dimensions

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)

arxiv: v2 [cs.ds] 10 Sep 2016

On Symmetric and Asymmetric LSHs for Inner Product Search

LOCALITY PRESERVING HASHING. Electrical Engineering and Computer Science University of California, Merced Merced, CA 95344, USA

Beyond Locality-Sensitive Hashing

Parameter-free Locality Sensitive Hashing for Spherical Range Reporting

A Geometric Approach to Lower Bounds for Approximate Near-Neighbor Search and Partial Match

Algorithms for Nearest Neighbors

Improved Consistent Sampling, Weighted Minhash and L1 Sketching

arxiv: v1 [cs.db] 2 Sep 2014

Super-Bit Locality-Sensitive Hashing

Bloom Filters and Locality-Sensitive Hashing

Optimal Data-Dependent Hashing for Approximate Near Neighbors

16 Embeddings of the Euclidean metric

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

Lecture 6 September 13, 2016

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT)

Lattice-based Locality Sensitive Hashing is Optimal

Hash-based Indexing: Application, Impact, and Realization Alternatives

Locality-sensitive Hashing without False Negatives

Distribution-specific analysis of nearest neighbor search and classification

Lecture 4: Hashing and Streaming Algorithms

Geometric View of Machine Learning Nearest Neighbor Classification. Slides adapted from Prof. Carpuat

arxiv: v2 [cs.ds] 21 May 2017

Lecture 6: Expander Codes

Linear Spectral Hashing

Lecture 1 September 3, 2013

Lecture 2 September 4, 2014

LOCALITY SENSITIVE HASHING FOR BIG DATA

Approximate Nearest Neighbor Problem in High Dimensions. Alexandr Andoni

Linear Sketches A Useful Tool in Streaming and Compressive Sensing

Lecture 16 Oct. 26, 2017

Jeffrey D. Ullman Stanford University

Lecture 3 September 15, 2014

Dimension Reduction in Kernel Spaces from Locality-Sensitive Hashing

Term Filtering with Bounded Error

CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction

Lecture 4 February 2nd, 2017

Solutions to Problem Set 4

Lecture 4 Thursday Sep 11, 2014

arxiv: v3 [cs.ds] 7 Jan 2016

6.854 Advanced Algorithms

Approximate Voronoi Diagrams

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

On the Optimality of the Dimensionality Reduction Method

A Comparison of Extended Fingerprint Hashing and Locality Sensitive Hashing for Binary Audio Fingerprints

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Randomized Algorithms

Faster Algorithms for Sparse Fourier Transform

Nearest Neighbor Preserving Embeddings

Nearest Neighbor Searching Under Uncertainty. Wuzhou Zhang Supervised by Pankaj K. Agarwal Department of Computer Science Duke University

Optimal compression of approximate Euclidean distances

Lecture 04: Balls and Bins: Birthday Paradox. Birthday Paradox

Locality Sensitive Hashing Revisited: Filling the Gap Between Theory and Algorithm Analysis

Set Similarity Search Beyond MinHash

Similarity searching, or how to find your neighbors efficiently

CMSC 422 Introduction to Machine Learning Lecture 4 Geometry and Nearest Neighbors. Furong Huang /

Mining Data Streams. The Stream Model Sliding Windows Counting 1 s

Approximate Pattern Matching and the Query Complexity of Edit Distance

An Algorithmist's Toolkit Nov. 10, Lecture 17

Lecture: Face Recognition and Feature Reduction

Lecture 18 Oct. 30, 2014

COS598D Lecture 3 Pseudorandom generators from one-way functions

Ch. 10 Vector Quantization. Advantages & Design

Lecture Lecture 9 October 1, 2015

Lecture 7: Passive Learning

High-Dimensional Indexing by Distributed Aggregation

Higher Cell Probe Lower Bounds for Evaluating Polynomials

Faster Algorithms for Sparse Fourier Transform

Sieving for shortest vectors in lattices using angular locality-sensitive hashing

LSH Forest: Practical Algorithms Made Theoretical

Lecture 24 Nov. 20, 2014

Lecture: Face Recognition and Feature Reduction

6.1 Occupancy Problem

High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction

arxiv: v2 [cs.ds] 17 Apr 2018

Mining Data Streams. The Stream Model. The Stream Model Sliding Windows Counting 1 s

COMS 4771 Introduction to Machine Learning. Nakul Verma

Sketching Techniques for Collaborative Filtering

A New Algorithm for Finding Closest Pair of Vectors

CMPUT 675: Approximation Algorithms Fall 2014

Locality-Sensitive Hashing for Chi2 Distance

Optimal Las Vegas Locality Sensitive Data Structures

Making Nearest Neighbors Easier. Restrictions on Input Algorithms for Nearest Neighbor Search: Lecture 4. Outline. Chapter XI

Practical and Optimal LSH for Angular Distance

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

Block Heavy Hitters Alexandr Andoni, Khanh Do Ba, and Piotr Indyk

Transcription:

CS 224: Advanced Algorithms (Fall 2014), Prof. Jelani Nelson
Lecture 14, October 16, 2014
Scribe: Jao-ke Chin-Lee

1 Overview

In the last lecture we covered learning topic models with guest lecturer Rong Ge. In this lecture we cover nearest neighbor search with guest lecturer Piotr Indyk.

2 Introduction

We first define the nearest neighbor search problem.

Definition 1 (Nearest Neighbor Search). Given a set P of n points in $\mathbb{R}^d$, for any query q, return the $p \in P$ minimizing $\|p - q\|$, i.e. we wish to find the point closest to q under some metric.

Definition 2 (r-near Neighbor Search). Given a parameter r and a set P of n points in $\mathbb{R}^d$, for any query q, return a $p \in P$ such that $\|p - q\| \le r$ if any such point exists, i.e. we wish to find a point within distance r of q under some metric.

Note we can consider r-near neighbor search to be a decision-problem formulation of nearest neighbor search.

Nearest neighbor search, especially in high dimensions, often appears in machine learning. Consider the problems of determining image similarity or analyzing handwriting: given classified image data, we wish to determine the class of a query image by finding the closest known examples; the dimension here is the number of pixels. Another example is near-duplicate retrieval, as discussed in the last lecture, in which the dimension is the number of words.

Example (d = 2). We briefly consider the simple case d = 2, i.e. finding the nearest point in the plane. The standard solution is to use a Voronoi diagram, which splits the plane into regions according to proximity, so the problem reduces to point location within the diagram. We can do this by building a BST-based point-location structure that quickly determines the cell containing any given point: this takes O(n) space and $O(\log n)$ query time. When we consider higher dimensions d > 2, however, the Voronoi diagram has size $n^{\lceil d/2 \rceil}$, which is prohibitively large. Also note we can always perform a linear scan in O(dn) time.
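For concreteness, here is a minimal sketch of the O(dn) linear-scan baseline just mentioned. It is our own illustration (not from the lecture); the function names and the use of NumPy are assumptions.

import numpy as np

def nearest_neighbor(P, q):
    """Exact nearest neighbor by linear scan: O(dn) time, no preprocessing.
    P is an (n, d) array of points, q is a length-d query vector."""
    dists = np.linalg.norm(P - q, axis=1)   # all n distances, each costing O(d)
    return P[np.argmin(dists)]

def r_near_neighbor(P, q, r):
    """Decision version: return some point within distance r of q, or None."""
    dists = np.linalg.norm(P - q, axis=1)
    idx = np.flatnonzero(dists <= r)
    return P[idx[0]] if idx.size > 0 else None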

Therefore, we instead consider approximations.

3 Approximate Nearest Neighbor

We define the approximation version of nearest neighbor search.

Definition 3 (c-approximate Nearest Neighbor Search). Given an approximation factor c and a set P of n points in $\mathbb{R}^d$, for any query q, return a $p' \in P$ such that $\|p' - q\| \le cr$, where $r = \|p - q\|$ and p is the true nearest neighbor, i.e. we wish to find a point at most a factor c farther from q than the closest point.

Definition 4 (c-approximate r-near Neighbor Search). Given an approximation factor c, a parameter r, and a set P of n points in $\mathbb{R}^d$, for any query q, if any $p \in P$ exists with $\|p - q\| \le r$, return a $p' \in P$ such that $\|p' - q\| \le cr$, i.e. we wish to find a point at most a factor c outside the range r around q.

Also note that if we enumerate all approximate near neighbors, we can find the exact near neighbor. We build a data structure to solve these problems; if, however, in c-approximate r-near neighbor search there is no point within distance r, the data structure is allowed to return anything, so we can compute the distance to the returned point to double-check: if that distance places it outside the ball of radius cr, we know there was no such point.

Now we consider algorithms to solve these search problems.

4 Algorithms

Most algorithms are randomized, i.e. for any query we succeed with high probability over the initial randomness used to build the data structure (which is the only randomness introduced into the system). In fact, because the data structure exhibits the desired behavior with probability bounded below by a constant, by repeating the construction independently we can bring the probability of failure arbitrarily close to 0. Some algorithms and their bounds appear below. We will focus on locality-sensitive hashing as presented by Indyk and Motwani [IM98], Gionis, Indyk and Motwani [GIM99], and Charikar [Cha02], and touch on a few others that have appeared in recent work.

5 Locality-Sensitive Hashing

Recall that when we covered hashing, collisions happened more or less independently of the inputs. Here, we will use hash functions whose probability of collision depends on the similarity between the points being hashed.

Definition 5 (Sensitivity). We call a family H of functions $h : \mathbb{R}^d \to U$ (where U is our hashing universe) $(P_1, P_2, r, cr)$-sensitive for a distance function D if for any p, q:

- $D(p, q) < r \Rightarrow \Pr[h(p) = h(q)] > P_1$, i.e. if p and q are close, the probability of collision is high; and
- $D(p, q) > cr \Rightarrow \Pr[h(p) = h(q)] < P_2$, i.e. if p and q are far, the probability of collision is low.

Example (Hamming distance). Suppose we pick a coordinate i uniformly at random and set $h(p) = p_i$, i.e. we hash the length-d bitstring p to its i-th bit, and let D(p, q) be the Hamming distance between p and q, i.e. the number of positions in which p and q differ. Then the probability of collision is exactly $1 - D(p, q)/d$ (since $D(p, q)/d$ is the probability of picking a coordinate on which p and q differ).
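For instance, with d = 100, r = 10 and c = 2, this family is (0.9, 0.8, 10, 20)-sensitive in the notation of Definition 5. The short sketch below is our own illustration (the function names are made up); it estimates the collision probability empirically and should match $1 - D(p, q)/d$.

import random

def bit_sample(p, i):
    """One bit-sampling hash: h_i(p) = p[i]."""
    return p[i]

def estimated_collision_prob(p, q, trials=100_000):
    """Monte Carlo estimate of Pr_i[h_i(p) = h_i(q)] over a uniformly random coordinate i."""
    d = len(p)
    hits = 0
    for _ in range(trials):
        i = random.randrange(d)              # the same coordinate is used for both points
        hits += bit_sample(p, i) == bit_sample(q, i)
    return hits / trials

p = [0, 1, 1, 0, 1, 0, 0, 1]
q = [0, 1, 0, 0, 1, 0, 1, 1]                 # Hamming distance 2, d = 8
print(estimated_collision_prob(p, q))        # close to 1 - 2/8 = 0.75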

5.1 Algorithm

We use functions of the form $g(p) = (h_1(p), h_2(p), \ldots, h_k(p))$, i.e. concatenations of k functions drawn from the family H.

In preprocessing, we build up our data structure: after selecting $g_1, \ldots, g_L$ independently and at random (where k and L are functions of c and r that we will compute later), we hash every $p \in P$ to the buckets $g_1(p), \ldots, g_L(p)$.

When querying, we proceed as follows:

- retrieve points from buckets $g_1(q), \ldots, g_L(q)$ until either we have retrieved the points from all L buckets, or we have retrieved more than 3L points in total;
- answer the query based on the retrieved points (whether the procedure terminated in the first or the second case above).

Note that one hash evaluation takes time O(d), so with L buckets to examine, the total hashing time is O(dL).
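Before the analysis, here is a compact sketch of this preprocessing/query scheme for Hamming space, using the bit-sampling family from the example above. It is our own illustrative code, not taken from the lecture; the class HammingLSH and its parameters are made up, and k and L are passed in directly rather than derived from c and r as in the analysis below.

import random
from collections import defaultdict

def hamming(p, q):
    """Hamming distance between two equal-length bit sequences."""
    return sum(a != b for a, b in zip(p, q))

class HammingLSH:
    def __init__(self, d, k, L):
        self.d, self.k, self.L = d, k, L
        # g_j concatenates k bit-sampling hashes, i.e. reads k random coordinates.
        self.coords = [[random.randrange(d) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _g(self, j, p):
        return tuple(p[i] for i in self.coords[j])

    def insert(self, p):
        for j in range(self.L):
            self.tables[j][self._g(j, p)].append(p)

    def query(self, q, r, c):
        """Return some point within distance c*r of q, or None (may fail with constant probability)."""
        retrieved = 0
        for j in range(self.L):
            for p in self.tables[j].get(self._g(j, q), []):
                if hamming(p, q) <= c * r:   # verify candidates; far points are false positives
                    return p
                retrieved += 1
                if retrieved > 3 * self.L:   # the 3L stopping rule from above
                    return None
        return None

With k and L set as in the analysis below, each query succeeds with constant probability, which can then be amplified by independent repetitions.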

Next we will prove space and query performance bounds as well as the correctness of the parameters.

5.2 Analysis

We will prove two lemmata for query bounds, the second specifically for Hamming LSH.

Lemma 6. The above algorithm solves c-approximate r-near neighbor search with $L = C n^{\rho}$ hash tables, where $\rho = \log P_1 / \log P_2$ and C is a function of $P_1$ and $P_2$ (for $P_1$ bounded away from 0, C is a constant), and it succeeds with constant probability for any fixed query q.

Proof. Let p be a point with $\|p - q\| \le r$ (if no such point exists, there is nothing to prove), and define

- $FAR(q) = \{p' \in P : \|p' - q\| > cr\}$, the set of points far from q;
- $B_i(q) = \{p' \in P : g_i(p') = g_i(q)\}$, the set of points in the same bucket as q under $g_i$;
- $E_1$: the event that $g_i(p) = g_i(q)$ for some $i = 1, \ldots, L$, i.e. the near point p collides with q in some bucket;
- $E_2$: the event that $\sum_{i=1}^{L} |B_i(q) \cap FAR(q)| < 3L$, i.e. the total number of far points in the scanned buckets does not exceed 3L.

We will show that $E_1$ and $E_2$ hold simultaneously with constant probability. Set $k = \log_{1/P_2} n$. Observe that for $p' \in FAR(q)$,
$$\Pr[g_i(p') = g_i(q)] \le P_2^{k} \le 1/n.$$
Since there are at most n far points, $\mathbb{E}[|B_i(q) \cap FAR(q)|] \le 1$ for each table, hence
$$\mathbb{E}\Big[\sum_{i=1}^{L} |B_i(q) \cap FAR(q)|\Big] \le L,$$
and by Markov's inequality
$$\Pr\Big[\sum_{i=1}^{L} |B_i(q) \cap FAR(q)| \ge 3L\Big] \le \frac{1}{3}.$$
On the other hand, for the near point p,
$$\Pr[g_i(p) = g_i(q)] \ge P_1^{k} \ge P_1^{1 + \log_{1/P_2} n} = P_1 \cdot n^{-\rho} =: 1/L$$
(we choose L accordingly), so
$$\Pr[g_i(p) \ne g_i(q) \text{ for all } i = 1, \ldots, L] \le \left(1 - \frac{1}{L}\right)^{L} \le \frac{1}{e}.$$
Therefore
$$\Pr[E_1 \text{ fails}] + \Pr[E_2 \text{ fails}] \le \frac{1}{3} + \frac{1}{e} \approx 0.7, \qquad \text{so} \qquad \Pr[E_1 \cap E_2] \ge 1 - \Big(\frac{1}{3} + \frac{1}{e}\Big) \approx 0.3.$$
When both events hold, the query either retrieves p itself, or it stops after retrieving more than 3L points, in which case at least one retrieved point is not in FAR(q) and hence lies within distance cr of q; either way the answer is correct. Moreover, by querying several independently built copies of the data structure, we can make the failure probability arbitrarily small.

Lemma 7. For the Hamming LSH functions above, we have $\rho \le 1/c$.

Proof. Observe that for the Hamming distance, $P_1 = 1 - r/d$ and $P_2 = 1 - cr/d$, so it suffices to show $\rho = \log P_1 / \log P_2 \le 1/c$, or equivalently $P_1^{c} \ge P_2$. But $(1 - x)^{c} \ge 1 - cx$ for any $0 < x < 1$ and $c > 1$, so we are done.

Also note that the space used is $O(nL) = O(n^{1+\rho})$, so we have the desired space bound as well.

6 Beyond

Now we briefly consider metrics beyond the Hamming distance, and ways to reduce the exponent $\rho$. (Final project ideas?)

6.1 Random Projection LSH for L2

This covers work by Datar, Immorlica, Indyk and Mirrokni [DIIM04]. We define
$$h_{X,b}(p) = \left\lfloor \frac{p \cdot X + b}{w} \right\rfloor,$$
where $w \approx r$, $X = (X_1, \ldots, X_d)$ is a vector of i.i.d. Gaussian random variables, and b is a scalar chosen uniformly at random from $[0, w]$. Conceptually, then, our hash function takes p, projects it onto X, shifts it by some amount b, and then determines which interval of length w contains it.
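A minimal sketch of sampling one such hash function, assuming w is simply set to r (our own illustration; make_l2_hash and its arguments are not from the notes):

import numpy as np

def make_l2_hash(d, r, rng=None):
    """Sample one h_{X,b}: project onto a random Gaussian direction X, add a random
    shift b drawn uniformly from [0, w), and report the index of the containing interval."""
    rng = rng or np.random.default_rng()
    w = r                                    # the notes take w on the order of r
    X = rng.standard_normal(d)               # X_1, ..., X_d i.i.d. Gaussian
    b = rng.uniform(0.0, w)
    return lambda p: int(np.floor((np.dot(p, X) + b) / w))

h = make_l2_hash(d=128, r=1.0)
# Nearby points (relative to r) fall in the same width-w interval with higher probability
# than far points; as in Section 5.1, k such functions are concatenated into each g_i.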

This only gives a small improvement over the 1/c bound, however (please refer to the slides for more details). Therefore we consider alternative ways of projecting.

6.2 Ball Lattice Hashing

This covers work by Andoni and Indyk [AI06]. Instead of projecting onto $\mathbb{R}^1$, we project onto $\mathbb{R}^t$ for constant t. We quantize the projected space with a lattice of balls; when the projected point lands in empty space, we rehash until we do hit a ball (observe that using a grid would degenerate back to the one-dimensional case). With this we have $\rho = 1/c^2 + O(\log t / \sqrt{t})$, but hashing time increases to $t^{O(t)}$ because the chance of hitting a ball gets smaller and smaller in higher dimensions.

For more details and analysis, as well as other LSH schemes (including data-dependent hashing, in which we cluster related data until what remains looks random, which yields better bounds) and other similarity measures (including the Jaccard coefficient we explored in pset 2), please refer to the slides.

References

[AI06] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. FOCS '06, 459-468, 2006.

[Cha02] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. STOC '02, 380-388, 2002.

[DIIM04] Mayur Datar, Nicole Immorlica, Piotr Indyk and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. SCG '04, 253-262, 2004.

[GIM99] Aristides Gionis, Piotr Indyk and Rajeev Motwani. Similarity search in high dimensions via hashing. VLDB '99, 518-529, 1999.

[IM98] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. STOC '98, 604-613, 1998.