Locality Sensitive Hashing

February 1, 2016

1 LSH in Hamming space

The following discussion focuses on the notion of Locality Sensitive Hashing, which was first introduced in [5]. We focus on the case of the Hamming metric, but LSH can be seen as a general framework which applies to several metrics, e.g. the $\ell_2$ metric.

Definition 1 (Hamming distance). Given two strings $x, y \in \{0,1\}^d$, the Hamming distance $d_H(x, y)$ is the number of positions at which $x$ and $y$ differ. For example, let $x = 10010$ and $y = 10100$. Then $d_H(x, y) = 2$, since the strings differ exactly in the third and fourth positions.

We focus on the problem of Approximate Nearest Neighbor search in subsets of $(\{0,1\}^d, d_H)$ when the dimension is high (assume $d \geq \log n$). Instead of solving the Approximate Nearest Neighbor problem directly, we solve the Approximate Near Neighbor problem, which is defined as follows.

Definition 2 (Approximate Near Neighbor problem). Let $P \subseteq \{0,1\}^d$. Given $r > 0$ and $\epsilon > 0$, build a data structure s.t. for any query $q \in \{0,1\}^d$ it does the following:

- if $\exists p \in P$ s.t. $d_H(p, q) \leq r$, then it reports a point $p' \in P$ s.t. $d_H(p', q) \leq (1+\epsilon) r$,
- if $\forall p \in P$, $d_H(p, q) > (1+\epsilon) r$, then it reports "no".

The data structure that we present here is randomized and there is a probability of failure. More precisely, the following will be proven.

Theorem 3. Let $P \subseteq \{0,1\}^d$. Given $r > 0$ and $\epsilon > 0$, the LSH data structure satisfies the following: fix any query $q \in \{0,1\}^d$;

- if $\exists p \in P$ s.t. $d_H(p, q) \leq r$ and the preprocessing succeeds for $q$, then the data structure reports a point $p' \in P$ s.t. $d_H(p', q) \leq (1+\epsilon) r$,
- if $\forall p \in P$, $d_H(p, q) > (1+\epsilon) r$, then it reports "no".

The preprocessing succeeds for $q$ with constant probability. The space required is $O(dn + n^{1+\frac{1}{1+\epsilon}} \log n)$, the preprocessing time is $O(d n^{1+\frac{1}{1+\epsilon}} \log n)$ and the query time is $O(d n^{\frac{1}{1+\epsilon}} \log n)$.

The method is based on the idea of using hash functions which have the nice property that similar strings (or, more generally, points) are probably mapped to the same bucket.

Definition 4. Let $r_1 < r_2$ and $p_1 > p_2$.
We call a family $H$ of hash functions $(r_1, r_2, p_1, p_2)$-sensitive if for any $x, y \in \{0,1\}^d$:

- $d_H(x, y) \leq r_1 \implies \Pr[h(x) = h(y)] \geq p_1$,
- $d_H(x, y) \geq r_2 \implies \Pr[h(x) = h(y)] \leq p_2$.

In the Hamming metric case we define the following family of functions.

Definition 5 (Family of hash functions). Let $H = \{h_i(x) = x_i \mid x = (x_1, \dots, x_d),\ i \in \{1, \dots, d\}\}$. Obviously, $|H| = d$. Pick $h \in H$ uniformly at random. Then $\Pr[h(x) = h(y)] = 1 - \frac{d_H(x, y)}{d}$.

Corollary 6. The family $H$ is $(r, cr, 1 - \frac{r}{d}, 1 - \frac{cr}{d})$-sensitive, where $r > 0$, $c > 1$.

However, the probabilities $1 - \frac{r}{d}$ and $1 - \frac{cr}{d}$ can be close to each other. To amplify the gap, we concatenate hash functions.

Definition 7. Given a parameter $k$, define the new family $G(H) = \{g : \{0,1\}^d \to \{0,1\}^k \mid g(x) = (h_1(x), \dots, h_k(x))\}$.

In other words, a function $g$ chosen uniformly at random from $G(H)$ projects a point $p \in \{0,1\}^d$ onto $k$ randomly and independently chosen coordinates. Obviously, $|G(H)| = d^k$. Now, we choose uniformly at random $L$ functions $g_1, \dots, g_L \in G(H)$.

Preprocessing algorithm.

for i from 1 to L do
    Pick uniformly at random $g_i \in G(H)$.
    For each $p \in P$, assign $p$ to bucket $g_i(p)$ (in hash table $T_i$).

The preprocessing time is $O(L \cdot n \cdot d \cdot k)$. The space usage: $L$ hash tables and $n$ pointers to strings per table $\implies O(L \cdot n)$. In order to store the $n$ points themselves we need $O(d \cdot n)$ space.

Query algorithm.

for i from 1 to L do
    for each string $p$ in bucket $g_i(q)$ do
        if the number of retrieved strings exceeds $3L$ then return "no"
        if $d_H(q, p) \leq cr$ then return $p$

The query time is $O(L(k + d))$.

Let $p^*$ be any $r$-near neighbor of $q$. The execution of our algorithm is successful if both of the following events happen:

A: $\exists i \in \{1, \dots, L\}$ s.t. $g_i(p^*) = g_i(q)$.
B: Fewer than $3L$ useless strings (i.e. strings $p$ with $d_H(p, q) > cr$) lie in the buckets $g_1(q), \dots, g_L(q)$ in total.

Let $p_1 = 1 - \frac{r}{d}$, $p_2 = 1 - \frac{cr}{d}$. Given $j$, $\Pr[g_j(p^*) = g_j(q)] \geq p_1^k$. Setting $k = \log_{1/p_2} n$ yields $\Pr[g_j(p^*) = g_j(q)] \geq n^{-\frac{\ln 1/p_1}{\ln 1/p_2}}$.
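The preprocessing and query algorithms above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `build_lsh` and `query_lsh` and the use of bit-strings for points are choices made here, while the $k$-coordinate projections, the $L$ hash tables, and the $3L$ retrieval cutoff follow the pseudocode.

```python
import random

def build_lsh(P, d, k, L, seed=0):
    """Preprocessing: L hash tables; each g_i projects onto k random coordinates."""
    rng = random.Random(seed)
    # Each g_i is a tuple of k coordinate indices chosen uniformly at random.
    gs = [tuple(rng.randrange(d) for _ in range(k)) for _ in range(L)]
    tables = [dict() for _ in range(L)]
    for p in P:  # p is a 0/1 string of length d
        for g, table in zip(gs, tables):
            key = tuple(p[i] for i in g)  # bucket id g_i(p)
            table.setdefault(key, []).append(p)
    return gs, tables

def query_lsh(q, gs, tables, cr):
    """Return some p with d_H(q, p) <= cr, or None ("no") after 3L retrievals."""
    L = len(tables)
    retrieved = 0
    for g, table in zip(gs, tables):
        for p in table.get(tuple(q[i] for i in g), []):
            retrieved += 1
            if retrieved > 3 * L:
                return None  # report "no"
            if sum(a != b for a, b in zip(q, p)) <= cr:  # d_H(q, p) <= cr
                return p
    return None
```

Following the analysis, one would set $k = \lceil \log_{1/p_2} n \rceil$ and $L = \lceil n^{\ln(1/p_1)/\ln(1/p_2)} \rceil$ before calling `build_lsh`.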
Hence, $\Pr[\neg A] \leq \left(1 - n^{-\frac{\ln 1/p_1}{\ln 1/p_2}}\right)^L$. Setting $L = n^{\frac{\ln 1/p_1}{\ln 1/p_2}}$ we obtain $\Pr[\neg A] \leq \left(1 - \frac{1}{L}\right)^L \leq \frac{1}{e}$, hence $\Pr[A] \geq 1 - \frac{1}{e}$.

Now let $p' \in P$ s.t. $d_H(p', q) > c r$. Given $j$, $\Pr[g_j(p') = g_j(q)] \leq p_2^k = \frac{1}{n}$. The expected number of strings $p' \in P$ s.t. $d_H(p', q) > cr$ which lie in the same bucket as $q$, over all $L$ hash tables, is at most $L$. Hence, by Markov's inequality ($\Pr[X \geq \alpha] \leq \frac{E[X]}{\alpha}$, where $\alpha > 0$), $\Pr[\neg B] \leq \frac{1}{3}$, i.e. $\Pr[B] \geq \frac{2}{3}$. Overall, both events happen with probability at least $1 - \frac{1}{e} - \frac{1}{3} > 0.29$, a constant.

After setting the parameters we conclude:

Query time: $O(d n^{\frac{\ln 1/p_1}{\ln 1/p_2}} \log n)$,
Space: $O(dn + n^{1+\frac{\ln 1/p_1}{\ln 1/p_2}} \log n)$,
Preprocessing time: $O(d n^{1+\frac{\ln 1/p_1}{\ln 1/p_2}} \log n)$.

We finally notice that, for $c = 1 + \epsilon$,
$$\frac{\ln 1/p_1}{\ln 1/p_2} = \frac{\ln(1 - r/d)}{\ln(1 - (1+\epsilon) r/d)} \leq \frac{1}{1+\epsilon},$$
and we omit the technical details.

High probability. The probability of success can be amplified by repetition. We can achieve failure probability $\frac{1}{n^c}$ for any constant $c > 0$ by building $O(\log n)$ data structures as in Theorem 3.

Solving the Approximate Nearest Neighbor problem. The idea is to do binary search over the range of distances $1, \dots, d$. Better complexity bounds can be obtained by binary search over the distances $1, (1+\epsilon), (1+\epsilon)^2, \dots, d$. However, in other metrics it is not obvious that one can solve the Approximate Nearest Neighbor problem with Approximate Near Neighbor data structures. A solution to this problem is obtained in [4] and can be stated as follows.

Theorem 8. Let $P$ be a given set of $n$ points in a metric space, and let $c = 1 + \epsilon > 1$, $f \in (0, 1)$, and $\gamma \in (1/n, 1)$ be prescribed parameters. Assume that we are given a data structure for the $(c, r)$-approximate near neighbor problem that uses space $S(n, c, f)$, has query time $Q(n, c, f)$, and has failure probability $f$. Then there exists a data structure for answering $c(1 + O(\gamma))$-NN queries in time $O(\log n) \cdot Q(n, c, f)$ with failure probability $O(f \log n)$. The resulting data structure uses $O(\frac{S(n, c, f)}{\gamma} \log n)$ space.

2 LSH in $\ell_2$

In the previous section we have seen an LSH family for the Hamming metric. It is known that the data structure obtained there can be used in order to solve the problem in $\ell_2$.
This is obtained by a non-trivial reduction which translates the ANN problem in $\ell_2$ to the ANN problem in the Hamming space [4]. The first LSH function directly applicable to the $\ell_2$ metric can be described as follows.

Definition 9 (LSH family for $\ell_2$). Let $p \in \mathbb{R}^d$ and let $v \sim N(0,1)^d$, i.e. $v$ follows the $d$-dimensional standard normal distribution. Let also $w$ be a parameter (to be defined later) and $t \in [0, w]$ chosen uniformly at random. Then,
$$h(p) = \left\lfloor \frac{\langle p, v \rangle + t}{w} \right\rfloor.$$
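A minimal Python sketch of sampling one function from this family, using the standard library's `random.gauss` for the normal coordinates. The function name `make_l2_hash` and the particular values of $w$ and the seed are illustrative choices, not part of the definition.

```python
import math
import random

def make_l2_hash(d, w, seed=None):
    """Sample one h from the family of Definition 9:
    h(p) = floor((<p, v> + t) / w), with v ~ N(0,1)^d and t uniform in [0, w]."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]  # d-dimensional standard normal
    t = rng.uniform(0.0, w)                      # random shift
    def h(p):
        return math.floor((sum(pi * vi for pi, vi in zip(p, v)) + t) / w)
    return h
```

Identical points always collide, and nearby points collide with higher probability because their projections $\langle p, v\rangle$ and $\langle q, v\rangle$ differ by a Gaussian of standard deviation $\|p - q\|$.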
Now let $p, q \in \mathbb{R}^d$. We have
$$\Pr[h(p) = h(q)] = \int_0^w \Pr\big[|\langle p, v \rangle - \langle q, v \rangle| = x\big] \left(1 - \frac{x}{w}\right) dx.$$
We have seen (in the notes on the Johnson-Lindenstrauss lemma, jl.pdf) that $\langle p, v \rangle - \langle q, v \rangle = \langle p - q, v \rangle \sim N(0, \|p-q\|^2)$. Hence,
$$\Pr[h(p) = h(q) \mid w] = \int_0^w \frac{2}{\sqrt{2\pi}\,\|p-q\|} \exp\left(-\frac{x^2}{2\|p-q\|^2}\right) \left(1 - \frac{x}{w}\right) dx.$$
Notice that in $\ell_2$ we can assume w.l.o.g. that $r = 1$. Then, for approximation ratio $1 + \epsilon$, we need to make the following two probabilities as distinct as possible:
$$\Pr[h(p) = h(q) \mid \|p-q\| = 1, w] = \int_0^w \frac{2}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) \left(1 - \frac{x}{w}\right) dx,$$
$$\Pr[h(p) = h(q) \mid \|p-q\| = 1+\epsilon, w] = \int_0^w \frac{2}{\sqrt{2\pi}\,(1+\epsilon)} \exp\left(-\frac{x^2}{2(1+\epsilon)^2}\right) \left(1 - \frac{x}{w}\right) dx.$$
By the previous discussion on the LSH framework, we can see that we need to focus on minimizing the term
$$\rho_w = \frac{\log(1/\Pr[h(p) = h(q) \mid \|p-q\| = 1, w])}{\log(1/\Pr[h(p) = h(q) \mid \|p-q\| = 1+\epsilon, w])}.$$
Indeed, in [3] they prove the following.

Lemma 10. There exists $w$ such that $\rho_w \leq \frac{1}{1+\epsilon}$.

The above has been verified by numerical computations.

Some intuition behind the LSH family. We will now try to give a more intuitive description of the LSH family defined above. First we randomly project the points (to target dimension $O(\log n / \epsilon^2)$, which approximately preserves distances) and then apply a randomly shifted grid with cell side-width $w$; the random shift implies a positive probability of two close points falling in the same cell. The functions $g : \mathbb{R}^d \to \mathbb{Z}^k$ implied by the above discussion (recall that in the LSH scheme we concatenate $k$ functions of the basic family $H$) just return the id of the corresponding cell in the randomly shifted grid.

Better results. In [1], a better exponent (roughly $1/(1+\epsilon)^2$) is achieved, which is known to be nearly optimal. In [2] an even better exponent is achieved, by designing an algorithmic scheme which depends on the dataset and is no longer oblivious to the points, namely data-dependent LSH.

References

[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117-122, 2008.
[2] A. Andoni and I. Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proc. of the 47th Annual ACM Symposium on Theory of Computing, STOC'15, pages 793-801, New York, NY, USA, 2015. ACM.

[3] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG'04, pages 253-262, New York, NY, USA, 2004. ACM.

[4] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(14):321-350, 2012.

[5] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annual ACM Symp. Theory of Computing, STOC'98, pages 604-613, 1998.