Locality Sensitive Hashing
February 1, 2016

1 LSH in Hamming space

The following discussion focuses on the notion of Locality Sensitive Hashing (LSH), which was first introduced in [5]. We focus on the case of the Hamming metric, but LSH can be seen as a general framework which applies to several metrics, e.g. the $\ell_2$ metric.

Definition 1 (Hamming distance). Given two strings $x, y \in \{0,1\}^d$, the Hamming distance $d_H(x, y)$ is the number of positions at which $x$ and $y$ differ. For example, let $x = 10010$ and $y = 10100$. Then $d_H(x, y) = 2$.

We focus on the problem of Approximate Nearest Neighbor search in subsets of $(\{0,1\}^d, d_H)$ when the dimension is high (assume $d \geq \log n$). Instead of solving the Approximate Nearest Neighbor problem directly, we solve the Approximate Near Neighbor problem, which is defined as follows.

Definition 2 (Approximate Near Neighbor problem). Let $P \subseteq \{0,1\}^d$. Given $r > 0$ and $\epsilon > 0$, build a data structure such that for any query $q \in \{0,1\}^d$:
- if $\exists p \in P$ with $d_H(p, q) \leq r$, then report a point $p' \in P$ with $d_H(p', q) \leq (1+\epsilon) r$;
- if $\forall p \in P$, $d_H(p, q) > (1+\epsilon) r$, then report "no".

The data structure that we present here is randomized, so there is a probability of failure. More precisely, the following will be proven.

Theorem 3. Let $P \subseteq \{0,1\}^d$. Given $r > 0$ and $\epsilon > 0$, the LSH data structure satisfies the following. Fix any query $q \in \{0,1\}^d$:
- if $\exists p \in P$ with $d_H(p, q) \leq r$ and the preprocessing succeeds for $q$, then the data structure reports a point $p' \in P$ with $d_H(p', q) \leq (1+\epsilon) r$;
- if $\forall p \in P$, $d_H(p, q) > (1+\epsilon) r$, then it reports "no".
The preprocessing succeeds for $q$ with constant probability. The space required is $O(dn + n^{1 + \frac{1}{1+\epsilon}} \log n)$, the preprocessing time is $O(d n^{1 + \frac{1}{1+\epsilon}} \log n)$, and the query time is $O(d n^{\frac{1}{1+\epsilon}} \log n)$.

The method is based on hash functions with the property that, with good probability, they map similar strings (or, more generally, points) to the same bucket.

Definition 4. Let $r_1 < r_2$ and $p_1 > p_2$. We call a family $\mathcal{H}$ of hash functions $(r_1, r_2, p_1, p_2)$-sensitive if for any $x, y \in \{0,1\}^d$:
- $d_H(x, y) \leq r_1 \implies \Pr[h(x) = h(y)] \geq p_1$,
- $d_H(x, y) \geq r_2 \implies \Pr[h(x) = h(y)] \leq p_2$.

In the Hamming metric case we define the following family of functions.

Definition 5 (Family of hash functions). Let $\mathcal{H} = \{h_i(x) = x_i \mid x = (x_1, \ldots, x_d),\ i \in \{1, \ldots, d\}\}$. Obviously, $|\mathcal{H}| = d$. Pick $h \in \mathcal{H}$ uniformly at random. Then $\Pr[h(x) = h(y)] = 1 - \frac{d_H(x, y)}{d}$.

Corollary 6. The family $\mathcal{H}$ is $(r, cr, 1 - \frac{r}{d}, 1 - \frac{cr}{d})$-sensitive, where $r > 0$, $c > 1$.

However, the probabilities $1 - \frac{r}{d}$ and $1 - \frac{cr}{d}$ can be close to each other.
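To make the bit-sampling family of Definition 5 concrete, here is a minimal Python sketch (not part of the original notes): it draws a random $h_i \in \mathcal{H}$ and estimates the collision probability $\Pr[h(x) = h(y)] = 1 - d_H(x, y)/d$ by sampling. The function names (hamming, sample_h, collision_estimate) are illustrative only.

```python
import random

def hamming(x, y):
    """Hamming distance between two equal-length bit strings."""
    return sum(a != b for a, b in zip(x, y))

def sample_h(d):
    """Draw h_i uniformly from the bit-sampling family H = {h_i(x) = x_i}."""
    i = random.randrange(d)
    return lambda x: x[i]

def collision_estimate(x, y, trials=100_000):
    """Monte Carlo estimate of Pr[h(x) = h(y)] over a random h in H."""
    d = len(x)
    hits = 0
    for _ in range(trials):
        h = sample_h(d)
        if h(x) == h(y):
            hits += 1
    return hits / trials

if __name__ == "__main__":
    x = [1, 0, 0, 1, 0]
    y = [1, 0, 1, 0, 0]
    # d_H(x, y) = 2, so the estimate should be close to 1 - 2/5 = 0.6.
    print(hamming(x, y), collision_estimate(x, y), 1 - hamming(x, y) / len(x))
```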

Definition 7. Given a parameter $k$, define the new family $G(\mathcal{H})$:
$$G(\mathcal{H}) = \{\, g : \{0,1\}^d \to \{0,1\}^k \mid g(x) = (h_1(x), \ldots, h_k(x)) \,\}.$$
In other words, a function $g$ chosen uniformly at random from $G(\mathcal{H})$ projects a point $p \in \{0,1\}^d$ onto $k$ randomly and independently chosen coordinates. Obviously, $|G(\mathcal{H})| = d^k$. Now we choose $L$ functions $g_1, \ldots, g_L \in G(\mathcal{H})$ uniformly at random.

Preprocessing algorithm.
  for i from 1 to L do
    Pick g_i uniformly at random from G(H).
    For each p in P, place p in bucket g_i(p) of hash table T_i.
The preprocessing time is $O(L \cdot n \cdot d \cdot k)$. The space usage is $L$ hash tables with $n$ pointers to strings per table, i.e. $O(L \cdot n)$; in order to store the $n$ points themselves we need $O(d \cdot n)$ space.

Query algorithm.
  for i from 1 to L do
    for each string p in bucket g_i(q) do
      if the number of retrieved strings exceeds 3L then return "no"
      if d_H(q, p) <= cr then return p
  return "no"
The query time is $O(L(k + d))$.

Let $p^*$ be any $r$-near neighbor of $q$. The execution of our algorithm is successful if both of the following events happen:
- $A$: $\exists i \in \{1, \ldots, L\}$ s.t. $g_i(p^*) = g_i(q)$.
- $B$: fewer than $3L$ "useless" strings (points at distance greater than $cr$ from $q$) lie in the buckets $g_i(q)$, $i \in \{1, \ldots, L\}$.

Let $p_1 = 1 - \frac{r}{d}$ and $p_2 = 1 - \frac{cr}{d}$. For any given $j$, $\Pr[g_j(p^*) = g_j(q)] \geq p_1^k$. Setting $k = \log_{1/p_2} n$ yields
$$\Pr[g_j(p^*) = g_j(q)] \geq p_1^{\log_{1/p_2} n} = n^{-\frac{\ln 1/p_1}{\ln 1/p_2}}.$$
Hence, $\Pr[\neg A] \leq \big(1 - n^{-\frac{\ln 1/p_1}{\ln 1/p_2}}\big)^L$. Setting $L = n^{\frac{\ln 1/p_1}{\ln 1/p_2}}$ we obtain
$$\Pr[\neg A] \leq \Big(1 - \frac{1}{L}\Big)^L \leq \frac{1}{e}, \quad \text{so } \Pr[A] \geq 1 - \frac{1}{e}.$$
Now let $p' \in P$ with $d_H(p', q) \geq c r$. For any given $j$, $\Pr[g_j(p') = g_j(q)] \leq p_2^k = \frac{1}{n}$. Hence the expected number of strings $p' \in P$ with $d_H(p', q) \geq cr$ that lie in the same bucket as $q$, over all $L$ tables, is at most $L$, so by Markov's inequality ($\Pr[X \geq \alpha] \leq \frac{E[X]}{\alpha}$ for $\alpha > 0$) we get $\Pr[B] \geq 1 - \frac{1}{3} = \frac{2}{3}$.

After setting the parameters we conclude: query time $O(d n^{\frac{\ln 1/p_1}{\ln 1/p_2}} \log n)$, space $O(dn + n^{1 + \frac{\ln 1/p_1}{\ln 1/p_2}} \log n)$, preprocessing time $O(d n^{1 + \frac{\ln 1/p_1}{\ln 1/p_2}} \log n)$. Finally, we note that for $c = 1 + \epsilon$,
$$\frac{\ln 1/p_1}{\ln 1/p_2} = \frac{\ln(1 - r/d)}{\ln(1 - (1+\epsilon) r/d)} \leq \frac{1}{1+\epsilon};$$
we omit the technical details.
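The preprocessing and query algorithms above can be summarized in the following hedged Python sketch, with $k = \log_{1/p_2} n$ and $L = n^{\rho}$, $\rho = \frac{\ln 1/p_1}{\ln 1/p_2}$, chosen as in the analysis. The class name HammingLSH and its interface are assumptions for illustration, and the sketch assumes $cr < d$ so that $p_2 > 0$.

```python
import math
import random
from collections import defaultdict

class HammingLSH:
    """LSH for Hamming space: L tables, each keyed by k sampled coordinates."""

    def __init__(self, points, r, c):
        self.points = points
        n, d = len(points), len(points[0])
        p1 = 1 - r / d          # collision probability at distance <= r
        p2 = 1 - c * r / d      # collision probability at distance >= cr (assumes cr < d)
        self.cr = c * r
        self.k = max(1, math.ceil(math.log(n) / math.log(1 / p2)))   # k = log_{1/p2} n
        rho = math.log(1 / p1) / math.log(1 / p2)
        self.L = max(1, math.ceil(n ** rho))                          # L = n^rho
        # Each g_i is a tuple of k independently sampled coordinates; T_i maps g_i(p) -> bucket.
        self.gs = [tuple(random.randrange(d) for _ in range(self.k)) for _ in range(self.L)]
        self.tables = [defaultdict(list) for _ in range(self.L)]
        for idx, p in enumerate(points):
            for g, table in zip(self.gs, self.tables):
                table[tuple(p[i] for i in g)].append(idx)

    def query(self, q):
        """Return the index of a point within distance cr of q, or None ("no")."""
        retrieved = 0
        for g, table in zip(self.gs, self.tables):
            for idx in table.get(tuple(q[i] for i in g), []):
                retrieved += 1
                if retrieved > 3 * self.L:
                    return None
                p = self.points[idx]
                if sum(a != b for a, b in zip(p, q)) <= self.cr:
                    return idx
        return None
```

Here a dictionary keyed by the projected $k$-bit tuple plays the role of the hash table $T_i$; a production implementation would hash that tuple to a bucket index rather than store it explicitly.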

Hence, P r[a] (1 n ln 1/p 1 ln 1/p ) L. Setting L = n ln 1/p 1 ln 1/p P r[a] (1 1 L )L 1 e. we obtain: Let p P s.t. d H (p, q) c r. Given j, P r[g j (p ) = g j (q)] p k = 1 n. The expected number of strings p P s.t. d H (p, q) cr and also lie in the same bucket with q, is L. Hence 1, P r[b] 1 3. After setting the parameters we conclude: Query time: O(dn ln 1/p 1 ln 1/p log n), Space: O(dn + n 1+ ln 1/p 1 ln 1/p log n), Preprocessing time O(dn 1+ ln 1/p 1 ln 1/p log n). We finally notice that ln 1/p 1 c=1+ɛ = ln 1/p and we omit the technical details. ln(1 r/d) ln(1 (1 + ɛ)r/d) 1 1 + ɛ High probability. The probability can be amplified by repetition. We can achieve 1 n c for any constant c > 0 by building O(log n) data structures as in Theorem 3. Solving the Approximate Nearest Neighbor problem. The idea is to do binary search over the range of distances 1,..., d. Better complexity bounds can be obtained by binary search over the distances 1, (1 + ɛ), (1 + ɛ),..., d. However, in other metrics it is not obvious that someone can solve the Approximate Nearest Neighbor problem with Approximate Near Neighbor data structures. A solution to this problem is obtained in [4] and can be stated as follows. Theorem 8. Let P be a given set of n points in a metric space, and let c = 1 + ɛ > 1, f (0, 1), and γ (1/n, 1) be prescribed parameters. Assume that we are given a data structure for the (c, r)-approximate near neighbor that uses space S(n, c, f), has query time Q(n, c, f), and has failure probability f. Then there exists a data structure for answering c(1 + O(γ))-NN queries in time O(log n)q(n, c, f) with failure probability O(f log n). The resulting data structure uses O(S(n, c, f)/γ log n) space. LSH in l In the previous section we have seen an LSH family for the Hamming metric. It is known that the data structure obtained there can be used in order to solve the problem in l. This is obtained by a non-trivial reduction which translates the ANN problem in l to the ANN problem in the Hamming space [4]. The first LSH function directly applicable to the l metric can be described as follows. Definition 9 (LSH family for l ). Let p R d and v N(0, 1) d. Let also w a parameter (to be defined later) and t [0, w] chosen uniformly at random. Then, h(p) = p, v + t. w 1 Recall Markov s inquality: P r(x α) E[X], where α > 0. α Meaning the d-dimensional standard normal distribution. 3

2 LSH in $\ell_2$

In the previous section we saw an LSH family for the Hamming metric. It is known that the data structure obtained there can be used to solve the problem in $\ell_2$ as well; this is obtained by a non-trivial reduction which translates the ANN problem in $\ell_2$ to the ANN problem in the Hamming space [4]. The first LSH function directly applicable to the $\ell_2$ metric can be described as follows.

Definition 9 (LSH family for $\ell_2$). Let $p \in \mathbb{R}^d$ and $v \sim N(0,1)^d$, i.e. $v$ is drawn from the $d$-dimensional standard normal distribution. Let also $w$ be a parameter (to be defined later) and $t \in [0, w]$ chosen uniformly at random. Then
$$h(p) = \left\lfloor \frac{\langle p, v \rangle + t}{w} \right\rfloor.$$

Now let $p, q \in \mathbb{R}^d$. We have
$$\Pr[h(p) = h(q)] = \int_0^w \Pr\big[\,|\langle p, v \rangle - \langle q, v \rangle| = x\,\big] \Big(1 - \frac{x}{w}\Big)\, dx.$$
We have seen (in the notes on the Johnson-Lindenstrauss lemma, jl.pdf) that $\langle p, v \rangle - \langle q, v \rangle = \langle p - q, v \rangle \sim N(0, \|p - q\|^2)$. Hence,
$$\Pr[h(p) = h(q) \mid w] = \int_0^w \frac{2}{\sqrt{2\pi}\,\|p - q\|} \exp\Big(-\frac{x^2}{2\|p - q\|^2}\Big) \Big(1 - \frac{x}{w}\Big)\, dx.$$
Notice that in $\ell_2$ we can assume w.l.o.g. that $r = 1$. Then, for approximation ratio $1 + \epsilon$, we need to make the following two probabilities as distinct as possible:
$$\Pr[h(p) = h(q) \mid \|p - q\| = 1, w] = \int_0^w \frac{2}{\sqrt{2\pi}} \exp\Big(-\frac{x^2}{2}\Big) \Big(1 - \frac{x}{w}\Big)\, dx,$$
$$\Pr[h(p) = h(q) \mid \|p - q\| = 1 + \epsilon, w] = \int_0^w \frac{2}{\sqrt{2\pi}\,(1+\epsilon)} \exp\Big(-\frac{x^2}{2(1+\epsilon)^2}\Big) \Big(1 - \frac{x}{w}\Big)\, dx.$$
By the previous discussion of the LSH framework, we need to focus on minimizing the term
$$\rho_w = \frac{\log\big(1/\Pr[h(p) = h(q) \mid \|p - q\| = 1, w]\big)}{\log\big(1/\Pr[h(p) = h(q) \mid \|p - q\| = 1 + \epsilon, w]\big)}.$$
Indeed, in [3] they prove the following.

Lemma 10. There exists $w$ such that $\rho_w \leq \frac{1}{1+\epsilon}$.

The above has been verified by numerical computations.

Some intuition behind the LSH family. We will now try to give a more intuitive description of the LSH family defined above. First we randomly project the points (to target dimension roughly $\log n / \epsilon^2$, which approximately preserves distances), and then we apply a randomly shifted grid with cell side-width $w$ (the random shift gives a positive probability that two close points land in the same cell). The functions $g : \mathbb{R}^d \to \mathbb{N}^k$ implied by the above discussion (recall that in the LSH scheme we concatenate $k$ functions of the base family $\mathcal{H}$) are then just the id of the corresponding cell in the randomly shifted grid.
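A minimal sketch of a single hash function from this $\ell_2$ family (random projection, random shift $t$, grid of width $w$), and of the concatenated function $g$, might look as follows; the helper names and the parameter values in the example are illustrative only.

```python
import math
import random

def make_l2_hash(d, w):
    """Sample h(p) = floor((<p, v> + t) / w) with v ~ N(0,1)^d and t ~ U[0, w]."""
    v = [random.gauss(0.0, 1.0) for _ in range(d)]
    t = random.uniform(0.0, w)
    def h(p):
        return math.floor((sum(pi * vi for pi, vi in zip(p, v)) + t) / w)
    return h

def make_l2_g(d, w, k):
    """Concatenate k independent hashes: g(p) is the id of a cell in a shifted grid."""
    hs = [make_l2_hash(d, w) for _ in range(k)]
    return lambda p: tuple(h(p) for h in hs)

if __name__ == "__main__":
    random.seed(0)
    g = make_l2_g(d=3, w=4.0, k=5)
    p = [0.0, 0.0, 0.0]
    q = [0.1, -0.1, 0.05]   # close to p, so g(p) and g(q) should usually agree
    far = [10.0, -7.0, 3.0]
    print(g(p), g(q), g(far))
```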

Better results. In [1], they achieve a better exponent (roughly $1/(1+\epsilon)^2$), which is known to be nearly optimal. In [2], they achieve an even better exponent by designing an algorithmic scheme which depends on the dataset and is no longer oblivious to the points, namely data-dependent LSH.

References

[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117-122, 2008.

[2] A. Andoni and I. Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proc. of the 47th Annual ACM Symposium on Theory of Computing, STOC '15, pages 793-801, New York, NY, USA, 2015. ACM.

[3] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. of the Twentieth Annual Symposium on Computational Geometry, SCG '04, pages 253-262, New York, NY, USA, 2004. ACM.

[4] S. Har-Peled, P. Indyk, and R. Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(14):321-350, 2012.

[5] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annual ACM Symposium on Theory of Computing, STOC '98, pages 604-613, 1998.