Chapter 17
Approximate Nearest Neighbor (ANN) Search in High Dimensions
By Sariel Har-Peled, February 4
(This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 License. To view a copy of this license, visit the Creative Commons website or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.)

17.1 ANN on the Hypercube

Hypercube and Hamming distance

Definition The set of points H^d = {0,1}^d is the d-dimensional hypercube. A point p = (p_1,...,p_d) ∈ H^d can be interpreted, naturally, as the binary string p_1 p_2 ... p_d. The Hamming distance d_H(p, q) between p, q ∈ H^d is the number of coordinates in which p and q disagree.

It is easy to verify that the Hamming distance complies with the triangle inequality, and is thus a metric.

As we saw previously, all we need in order to solve (1+ε)-ANN efficiently is to solve the approximate near neighbor problem efficiently. Namely, given a set P of n points in H^d, a radius r > 0 and a parameter ε > 0, we want to decide for a query point q whether d_H(q, P) ≤ r or d_H(q, P) ≥ (1+ε)r, where d_H(q, P) = min_{p ∈ P} d_H(q, p).

Definition For a set P of points, a data-structure D = D_NearNbr(P, r, (1+ε)r) solves the approximate near neighbor problem if, given a query point q, it works as follows.
Near: If d_H(q, P) ≤ r, then D outputs a point p ∈ P such that d_H(p, q) ≤ (1+ε)r.
Far: If d_H(q, P) ≥ (1+ε)r, then D outputs that d_H(q, P) > r.
Don't know: If r ≤ d_H(q, P) ≤ (1+ε)r, then D can return either of the above answers.

Given such a data-structure, one can answer an approximate nearest-neighbor query using O(log((log d)/ε)) approximate near-neighbor queries. Indeed, the desired distance d_H(q, P) is an integer in the range 0, 1, ..., d. We can build a D_NearNbr data-structure for each of the distances (1+ε)^i, for i = 1,...,M, where M = O(ε^{-1} log d). Performing a binary search over these distances, using the approximate near-neighbor data-structures, resolves the approximate nearest-neighbor query and requires O(log M) near-neighbor queries.
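To make the reduction concrete, here is a minimal Python sketch of the binary search over the radii (1+ε)^i. The helper names are ours; in particular near_nbr[i] is an assumed stand-in for the D_NearNbr structure built for the ith radius, not part of the construction itself.

    def hamming(p, q):
        # Hamming distance between two equal-length 0/1 sequences.
        return sum(a != b for a, b in zip(p, q))

    def ann_by_binary_search(q, near_nbr, radii):
        # near_nbr[i](q) returns a point within (1+eps)*radii[i] of q, or None
        # if every point of P is farther than radii[i] (in the "don't know"
        # zone either answer may be returned).  Binary search for the smallest
        # radius that still reports a point: O(log M) near-neighbor queries.
        lo, hi, best = 0, len(radii) - 1, None
        while lo <= hi:
            mid = (lo + hi) // 2
            p = near_nbr[mid](q)
            if p is not None:
                best, hi = p, mid - 1   # a point was found; try a smaller radius
            else:
                lo = mid + 1            # no point this close; try a larger radius
        return best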

As such, in the following we concentrate on constructing the approximate near-neighbor data-structure (i.e., D_NearNbr).

Construction of the near-neighbor data-structure

On sense and sensitivity

Let P = {p_1,...,p_n} be a subset of the vertices of the hypercube in d dimensions. In the following we assume that d = n^{O(1)}. Let r, ε > 0 be two prespecified parameters. We are interested in building an approximate near neighbor data-structure (i.e., D_NearNbr) for balls of radius r in the Hamming distance.

Definition Let U be a (small) positive integer. A family F of functions (defined over H^d) is (r, R, α, β)-sensitive if for any q, s ∈ H^d we have:
(A) If q ∈ b(s, r), then Pr[f(q) = f(s)] ≥ α.
(B) If q ∉ b(s, R), then Pr[f(q) = f(s)] ≤ β,
where f is a function picked randomly from F, r < R, and α > β.

Intuitively, if we can construct an (r, R, α, β)-sensitive family, then we can distinguish between two points that are close together and two points that are far away from each other. Of course, the probabilities α and β might be very close to each other, and we need a way to do amplification.

A simple sensitive family. A priori it is not even clear that such a sensitive family exists, but it turns out that the family that exposes one random coordinate is sensitive.

Lemma Let f_i(p) denote the function that returns the ith coordinate of p, for i = 1,...,d. Consider the family of functions F = {f_1,...,f_d}. Then, for any r > 0 and ε > 0, the family F is (r, (1+ε)r, α, β)-sensitive, where α = 1 − r/d and β = 1 − r(1+ε)/d.

Proof: If q, s ∈ {0,1}^d are at Hamming distance at most r from each other, then they differ in at most r coordinates. The probability that a random f ∈ F projects onto a coordinate on which q and s agree is therefore at least 1 − r/d. Similarly, if d_H(q, s) ≥ (1+ε)r, then the probability that a random f ∈ F projects onto a coordinate on which q and s agree is at most 1 − (1+ε)r/d.

A family with a large sensitivity gap. Let k be a parameter to be specified shortly, and consider the family of functions G that concatenates k of the given functions. Formally, let

G = combine(F, k) = { g | g(p) = ( f_1(p),..., f_k(p) ), for f_1,...,f_k ∈ F }

be the set of all such functions.
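As an illustration, a random function g ∈ combine(F, k) amounts to nothing more than picking k coordinates independently and reading them off; the following short sketch (helper names ours) does exactly that.

    import random

    def random_g(d, k, rng=random):
        # A random g in combine(F, k): choose k coordinates independently
        # (with repetition) and map p to the k-tuple of those coordinates.
        coords = [rng.randrange(d) for _ in range(k)]
        return lambda p: tuple(p[i] for i in coords)

For q, s at Hamming distance at most r, each chosen coordinate agrees with probability at least 1 − r/d, so Pr[g(q) = g(s)] ≥ (1 − r/d)^k, exactly as in the lemma that follows.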

Lemma Given an (r, R, α, β)-sensitive family F, the family G = combine(F, k) is (r, R, α^k, β^k)-sensitive.

Proof: For two fixed points q, s ∈ H^d such that d_H(q, s) ≤ r, we have that Pr[f(q) = f(s)] ≥ α for a random f ∈ F. As such, for a random g ∈ G (whose k components are picked independently), we have

Pr[g(q) = g(s)] = Pr[f_1(q) = f_1(s) and f_2(q) = f_2(s) and ... and f_k(q) = f_k(s)] = ∏_{i=1}^{k} Pr[f_i(q) = f_i(s)] ≥ α^k.

Similarly, if d_H(q, s) > R, then Pr[g(q) = g(s)] = ∏_{i=1}^{k} Pr[f_i(q) = f_i(s)] ≤ β^k.

The above lemma implies that we can build a family that has a large gap between its lower and upper sensitivities; namely, α^k/β^k = (α/β)^k can be made arbitrarily large. The problem is that if α^k is too small, then we will have to use too many functions to detect whether or not there is a point close to the query point.

Nevertheless, consider the task of building a data-structure that finds all the points of P = {p_1,...,p_n} that are equal, under a given function g ∈ G = combine(F, k), to a query point. To this end, we compute the strings g(p_1),...,g(p_n) and store them (together with their associated points) in a hash table (or a prefix tree). Now, given a query point q, we compute g(q) and fetch from this data-structure all the stored strings equal to it. Clearly, this is a simple and efficient data-structure. All the points colliding with q are the natural candidates to be the nearest neighbor of q. By not storing the points explicitly, but using pointers into the original input set, we get the following easy result.

Lemma Given a function g ∈ G = combine(F, k) (see the lemma above), and a set P ⊆ H^d of n points, one can construct a data-structure, in O(nk) time and using O(nk) additional space, such that given a query point q, one can report all the points in X = { p ∈ P | g(p) = g(q) } in O(k + |X|) time.

Amplifying sensitivity. Our task is now to amplify the sensitive family we currently have. To this end, for two τ-dimensional points x and y, let x ≈ y be the boolean function that returns true if there exists an index i such that x_i = y_i, and false otherwise. Thus, while the regular "=" operator requires the vectors to be equal in all coordinates (i.e., it is ∧_i (x_i = y_i)), the operator x ≈ y is ∨_i (x_i = y_i). The construction of the previous lemma, using this alternative equality operator, provides us with the required amplification.

Lemma Given an (r, R, α^k, β^k)-sensitive family G, the family H = combine(G, τ), if one uses the operator ≈ to check equality, is (r, R, 1 − (1 − α^k)^τ, 1 − (1 − β^k)^τ)-sensitive.

Proof: For two fixed points q, s ∈ H^d such that d_H(q, s) ≤ r, we have that Pr[g(q) = g(s)] ≥ α^k for a random g ∈ G. As such, for a random h = (g_1,...,g_τ) ∈ H, we have

Pr[h(q) ≈ h(s)] = Pr[g_1(q) = g_1(s) or g_2(q) = g_2(s) or ... or g_τ(q) = g_τ(s)] = 1 − ∏_{i=1}^{τ} Pr[g_i(q) ≠ g_i(s)] ≥ 1 − (1 − α^k)^τ.

Similarly, if d_H(q, s) > R, then Pr[h(q) ≈ h(s)] = 1 − ∏_{i=1}^{τ} Pr[g_i(q) ≠ g_i(s)] ≤ 1 − (1 − β^k)^τ.
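The hash-table data-structure of the single-function lemma above is short enough to spell out; storing indices rather than copies of the points is what keeps the additional space at O(nk). The helper names below are ours, a sketch rather than the official implementation.

    from collections import defaultdict

    def build_table(points, g):
        # Store g(p) for every input point, keyed by the k-tuple g(p).
        table = defaultdict(list)
        for idx, p in enumerate(points):
            table[g(p)].append(idx)   # an index acts as a pointer into the input set
        return table

    def colliding_points(table, g, q):
        # All indices of points p with g(p) = g(q); O(k + |X|) expected time.
        return table.get(g(q), [])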

To see the effect of the above lemma, it is useful to play with a concrete example. Consider an (r, R, α^k, β^k)-sensitive family where β^k = α^k/2, and yet α^k is very small. Setting τ = 1/α^k, the resulting family is (roughly) (r, R, 1 − 1/e, 1 − 1/√e)-sensitive. Namely, the gap has shrunk, but the threshold sensitivity is considerably higher. In particular, it is now a constant, and the gap is also a constant.

Using the above lemma as a data-structure to store P is more involved than before. Indeed, a random function h = (g_1,...,g_τ) ∈ H = combine(G, τ) requires us to build τ data-structures, one for each of the functions g_1,...,g_τ, using the earlier lemma. Now, given a query point, we retrieve all the points of P that collide with it under any of these functions, by querying each of these data-structures.

Lemma Given a function h ∈ H = combine(G, τ) (see the lemma above), and a set P ⊆ H^d of n points, one can construct a data-structure, in O(nkτ) time and using O(nkτ) additional space, such that given a query point q, one can report all the points in X = { p ∈ P | h(p) ≈ h(q) } in O(kτ + |X|) time.

The near-neighbor data-structure and handling a query

We construct the data-structure D of the above lemma, with parameters k and τ to be determined shortly, for a random function h ∈ H. Given a query point q, we retrieve all the points that collide with it under h, and scan them one by one, computing their distance to q. As soon as we encounter a point s ∈ P such that d_H(q, s) ≤ R, the data-structure returns true together with s.

Let us assume that we know that the expected number of points of P \ b(q, R) (where R = (1+ε)r) that collide with q in D is L (we will determine the value of L below). To ensure a worst-case query time, the query aborts after checking 4L + 1 points and returns false. Naturally, the data-structure also returns false if all the points encountered are at distance larger than R from q. Clearly, the query time of this data-structure is O(kτ + dL).
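A minimal sketch of this query procedure, assuming the random functions g_1,...,g_τ are given as Python callables (names and layout ours):

    from collections import defaultdict

    def build_tables(points, gs):
        # One hash table per g_j (j = 1..tau), as in the construction above.
        tables = []
        for g in gs:
            t = defaultdict(list)
            for idx, p in enumerate(points):
                t[g(p)].append(idx)
            tables.append(t)
        return tables

    def near_neighbor_query(q, points, gs, tables, R, L):
        # Scan the points colliding with q under at least one g_j (i.e. with
        # h(p) ~ h(q)), stop at the first point within R = (1+eps)r, and abort
        # after 4L+1 candidates to guarantee the worst-case query time.
        checked, seen = 0, set()
        for g, t in zip(gs, tables):
            for idx in t.get(g(q), ()):
                if idx in seen:
                    continue
                seen.add(idx)
                if checked >= 4 * L + 1:
                    return None              # too many far collisions: give up
                checked += 1
                dist = sum(a != b for a, b in zip(points[idx], q))
                if dist <= R:
                    return points[idx]       # a point within (1+eps)r of q
        return None                          # no near point found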

We are left with the task of fine-tuning the parameters τ and k to get the fastest possible query time, while the data-structure has a reasonable probability of success. Figuring out the right values is technically tedious, and we do it next.

Setting the parameters

If there exists p ∈ P such that d_H(q, p) ≤ r, then the probability of this point colliding with q under the function h is φ ≥ 1 − (1 − α^k)^τ. Let us demand that this data-structure succeeds with probability at least 3/4. To this end, we set τ = ⌈4/α^k⌉, so that

φ ≥ 1 − (1 − α^k)^τ ≥ 1 − exp(−α^k τ) ≥ 1 − exp(−4) ≥ 3/4,    (17.1)

since 1 − x ≤ exp(−x), for x ≥ 0.

Lemma The expected number of points of P \ b(q, R) colliding with the query point is L = O(n(β/α)^k).

Proof: Consider the points of P \ b(q, R). We would like to bound the number of points of this set that collide with the query point. Observe that the probability of a point p ∈ P \ b(q, R) to collide with the query point is

ψ = 1 − (1 − β^k)^τ = 1 − (1 − β^k)^{⌈4/α^k⌉} ≤ 1 − exp(−8 β^k/α^k) ≤ 1 − (1 − 8 β^k/α^k) = 8 (β/α)^k,

since 1 − x ≥ exp(−2x), for x ∈ [0, 1/2], and exp(−z) ≥ 1 − z, for z ≥ 0. Namely, the expected number of points of P \ b(q, R) colliding with the query point is ψ n = O(n(β/α)^k).

By the data-structure lemma above, extracting the O(L) colliding points takes O(kτ + L) time, and computing the distance from the query point to each of them takes O(Ld) time. As such, the query time is

O(kτ + Ld) = O( kτ + nd(β/α)^k ).

To minimize the query time, we approximately solve the equation requiring the above two terms to be equal (we ignore d since, intuitively, it should be small compared to n). Up to constants, kτ ≈ k/α^k and n(β/α)^k = n β^k/α^k, so we get

k/α^k ≈ n β^k/α^k   ⟺   k ≈ n β^k   ⟺   1/β^k ≈ n/k.

Setting k = ⌈log_{1/β} n⌉, we have that β^k ≤ 1/n and, by Eq. (17.1),

τ = ⌈4/α^k⌉ = O( exp( k ln(1/α) ) ) = O( exp( (ln n / ln(1/β)) · ln(1/α) ) ) = O(n^ρ),  for ρ = ln(1/α) / ln(1/β).    (17.2)

Lemma (A) For x ∈ [0, 1) and t ≥ 1 such that 1 − tx > 0, we have ln(1 − x)/ln(1 − tx) ≤ 1/t.
(B) For α = 1 − r/d and β = 1 − r(1+ε)/d, we have that ρ = ln(1/α)/ln(1/β) ≤ 1/(1+ε).

Proof: (A) Since ln(1 − tx) < 0, the claim is equivalent to t ln(1 − x) ≥ ln(1 − tx). This in turn is equivalent to g(x) = (1 − tx) − (1 − x)^t ≤ 0. This is trivially true for x = 0. Furthermore, taking the derivative, we see that g'(x) = −t + t(1 − x)^{t−1}, which is non-positive for x ∈ [0, 1) and t ≥ 1. Therefore, g is non-increasing in the interval of interest, and so g(x) ≤ 0 for all values in this interval.
(B) Indeed,

ρ = ln(1/α) / ln(1/β) = ln α / ln β = ln(1 − r/d) / ln(1 − (1+ε) r/d) ≤ 1/(1+ε),

by part (A), applied with x = r/d and t = 1 + ε.

In the following, it would be convenient to consider d to be considerably larger than r. This can be ensured by (conceptually) padding the points with fake coordinates that are all zero. It is easy to verify that this hack does not affect the algorithm's performance in any way; it is just a trick to make our analysis simpler. In particular, we assume that d > 2(1+ε)r.
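Plugging concrete numbers into these choices is instructive. The following throwaway computation is ours; the specific values of n, d, r and ε are chosen only for illustration.

    import math

    def lsh_parameters(n, d, r, eps):
        # alpha = 1 - r/d, beta = 1 - (1+eps)r/d, k = ceil(log_{1/beta} n),
        # tau = ceil(4/alpha^k), rho = ln(1/alpha)/ln(1/beta); see Eq. (17.2).
        alpha = 1.0 - r / d
        beta = 1.0 - (1.0 + eps) * r / d
        k = math.ceil(math.log(n) / math.log(1.0 / beta))
        tau = math.ceil(4.0 / alpha ** k)
        rho = math.log(1.0 / alpha) / math.log(1.0 / beta)
        return k, tau, rho

    k, tau, rho = lsh_parameters(n=10**6, d=400, r=50, eps=1.0)
    # Here rho is roughly 0.46, below 1/(1+eps) = 0.5, as promised by part (B).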

Lemma For α = 1 − r/d, β = 1 − r(1+ε)/d, and n and d as above, we have that (i) τ = O(n^{1/(1+ε)}), (ii) k = O(ln n), and (iii) L = O(n^{1/(1+ε)}).

Proof: By Eq. (17.2), τ = ⌈4/α^k⌉ = O(n^ρ) = O(n^{1/(1+ε)}), by part (B) of the lemma above. Now, β = 1 − r(1+ε)/d ≥ 1/2, since we assumed that d > 2(1+ε)r. As such, we have k = ⌈log_{1/β} n⌉ = ⌈ln n / ln(1/β)⌉ = O(ln n). By the lemma above, L = O(n(β/α)^k). Now β^k ≤ 1/n, and as such L = O(1/α^k) = O(τ) = O(n^{1/(1+ε)}).

The result

Theorem Given a set P of n points on the hypercube H^d, and parameters ε > 0 and r > 0, one can build a data-structure D = D_NearNbr(P, r, (1+ε)r) that solves the approximate near neighbor problem (see the definition above). The data-structure answers a query successfully with high probability. In addition:
(A) The query time is O( d n^{1/(1+ε)} log n ).
(B) The preprocessing time to build this data-structure is O( n^{1+1/(1+ε)} log² n ).
(C) The space required to store this data-structure is O( nd + n^{1+1/(1+ε)} log² n ).

Proof: Our building block is the data-structure described above. By Markov's inequality, the probability that the algorithm has to abort because of too many collisions with points of P \ b(q, (1+ε)r) is bounded by 1/4 (since the algorithm tries 4L + 1 points). Also, if there is a point inside b(q, r), the algorithm finds it with probability at least 3/4, by Eq. (17.1). As such, with probability at least 1/2 this data-structure returns the correct answer. By the lemma above, its query time is O(kτ + Ld).

This basic data-structure succeeds only with constant probability. To achieve high probability, we construct O(log n) such data-structures and perform the near-neighbor query in each one of them. As such, the query time is

O( (kτ + Ld) log n ) = O( n^{1/(1+ε)} log² n + d n^{1/(1+ε)} log n ) = O( d n^{1/(1+ε)} log n ),

by the previous lemma, and since d = Ω(lg n) if P contains n distinct points of H^d. As for the preprocessing time, by the two lemmas above, it is O( nkτ log n ) = O( n^{1+1/(1+ε)} log² n ). Finally, this data-structure requires O(dn) space to store the input points. By the lemma above, we need an additional O( nkτ log n ) = O( n^{1+1/(1+ε)} log² n ) space.

In the hypercube case, when d = n^{O(1)}, we can build M = O( log_{1+ε} d ) = O(ε^{-1} log d) such data-structures, corresponding to the radii r_1,...,r_M, where r_i = (1+ε)^i for i = 1,...,M, so that a (1+ε)-ANN query can be answered by a binary search over these data-structures.

Theorem Given a set P of n points on the hypercube H^d (where d = n^{O(1)}), and a parameter ε > 0, one can build a data-structure that answers approximate nearest-neighbor queries (under the Hamming distance) using O( dn + n^{1+1/(1+ε)} ε^{-1} log² n log d ) space, such that given a query point q, one can return a (1+ε)-ANN of q in P (under the Hamming distance) in O( d n^{1/(1+ε)} log n log(ε^{-1} log d) ) time. The result returned is correct with high probability.

Remark The data-structure of the above theorem requires the queries to be oblivious of its random choices. Indeed, for any fixed instantiation of the data-structure there exist query points for which it fails. In particular, if we perform a sequence of ANN queries using such a data-structure, where the queries depend on earlier returned answers, then the guarantee of success with high probability is no longer implied by the above analysis (it might still hold for other reasons, naturally).

17.2 LSH and ANN in Euclidean Space

Preliminaries

Lemma Let X = (X_1,...,X_d) be a vector of d independent variables, each with the normal distribution N = N(0,1), and let v = (v_1,...,v_d) ∈ IR^d. Then ⟨v, X⟩ = Σ_i v_i X_i is distributed as ‖v‖ Z, where Z ∼ N.

Proof: By the lemma from previous lectures (restated at the end of this chapter), the point X has the multidimensional normal distribution N^d. If ‖v‖ = 1 then the claim holds by the symmetry of the normal distribution. Indeed, let e_1 = (1, 0,...,0). By the symmetry of the d-dimensional normal distribution, we have that ⟨v, X⟩ ∼ ⟨e_1, X⟩ = X_1 ∼ N. Otherwise, ⟨v, X⟩ / ‖v‖ ∼ N, and as such ⟨v, X⟩ ∼ N(0, ‖v‖²), which is indeed the distribution of ‖v‖ Z.

Definition A distribution D over IR is called p-stable if there exists p ≥ 0 such that for any n real numbers v_1,...,v_n and n independent variables X_1,...,X_n with distribution D, the random variable Σ_i v_i X_i has the same distribution as the variable (Σ_i |v_i|^p)^{1/p} X, where X is a random variable with distribution D.

By the above lemma, the normal distribution is a 2-stable distribution.

Locality Sensitive Hashing

Let p, q be two points in IR^d. We want to perform an experiment that decides whether ‖p − q‖ ≤ 1 or ‖p − q‖ ≥ η, where η = 1 + ε. We randomly choose a vector v from the d-dimensional normal distribution N^d (which is 2-stable). Next, let r be a parameter, and let t be a random number chosen uniformly from the interval [0, r]. For p ∈ IR^d, consider the random hash function

h(p) = ⌊ (⟨p, v⟩ + t) / r ⌋.    (17.3)

Assume that the distance between p and q is η, and that the distance between the projections of the two points onto the direction v is β. Then the probability that p and q get the same hash value is max(1 − β/r, 0), since this is the probability that the random shift does not separate them. Indeed, consider the line spanned by v as the x-axis, and assume that q is projected to the point r and p is projected to r − β (assuming β ≤ r). Clearly, q and p get mapped to the same value by h(·) if and only if t falls inside a subinterval of [0, r] of length r − β, and this happens with probability 1 − β/r, as claimed.
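Before computing the collision probability explicitly, here is a direct implementation of the hash function of Eq. (17.3); the helper name is ours, and the randomness (the Gaussian direction v and the shift t) is made explicit.

    import math
    import random

    def make_hash(d, r, rng=random):
        # One random hash function h(p) = floor((<p, v> + t) / r), where v has d
        # independent N(0,1) entries (2-stable) and t is uniform in [0, r].
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        t = rng.uniform(0.0, r)
        def h(p):
            return math.floor((sum(pi * vi for pi, vi in zip(p, v)) + t) / r)
        return h

Concatenation and OR-amplification of such functions then proceed exactly as in the hypercube case.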

As such, we have that the probability of collision is

α(η) = Pr[ h(p) = h(q) ] = ∫_{β=0}^{r} Pr[ |⟨p, v⟩ − ⟨q, v⟩| = β ] (1 − β/r) dβ.

However, since v is chosen from a 2-stable distribution, we have that ⟨p, v⟩ − ⟨q, v⟩ = ⟨p − q, v⟩ ∼ N(0, ‖p − q‖²). Since we are considering the absolute value of this variable, we need to multiply its density by two. Thus, we have

α(η, r) = ∫_{β=0}^{r} ( 2 / (√(2π) η) ) exp( −β²/(2η²) ) (1 − β/r) dβ.

Intuitively, we care about the difference α(1, r) − α(1+ε, r), and we would like to make it as large as possible (by choosing the right value of r). Unfortunately, this integral is unfriendly, and we have to resort to numerical computation. In fact, if we are going to use this hashing scheme for constructing locality sensitive hashing, as in the hypercube case, then we care about the ratio

ρ(1+ε) = min_r  ln(1/α(1, r)) / ln(1/α(1+ε, r)),

see Eq. (17.2). (A short numerical sketch of this computation is given at the end of this subsection.) The following is verified using numerical computations on a computer.

Lemma ([DNIM04]) One can choose r such that ρ(1+ε) ≤ 1/(1+ε).

This lemma implies that the hash functions defined by Eq. (17.3) are (1, 1+ε, α', β')-sensitive and, furthermore, that ρ = ln(1/α')/ln(1/β') ≤ 1/(1+ε), for some values of α' and β'. As such, we can use this hashing family to construct an approximate near neighbor data-structure D_NearNbr(P, r, (1+ε)r) for a set P of points in IR^d. Following the same argumentation as in the hypercube case, we have the following.

Theorem Given a set P of n points in IR^d and parameters ε > 0 and r > 0, one can build a data-structure D_NearNbr = D_NearNbr(P, r, (1+ε)r), such that given a query point q, one can decide:
If b(q, r) ∩ P ≠ ∅, then D_NearNbr returns a point u ∈ P such that ‖u − q‖ ≤ (1+ε)r.
If b(q, (1+ε)r) ∩ P = ∅, then D_NearNbr returns that no point of P is within distance r of q.
In any other case, either of the answers is correct.
The query time is O( d n^{1/(1+ε)} log n ) and the space used is O( dn + n^{1+1/(1+ε)} log n ). The result returned is correct with high probability.
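The numerical computation alluded to above can be reproduced with a few lines of code. The crude midpoint rule and the small set of candidate values of r below are ours and serve only to illustrate how an estimate in the spirit of [DNIM04] is obtained.

    import math

    def alpha(eta, r, steps=2000):
        # Midpoint-rule evaluation of
        # alpha(eta, r) = int_0^r (2/(sqrt(2*pi)*eta)) exp(-b^2/(2*eta^2)) (1 - b/r) db.
        h = r / steps
        total = 0.0
        for i in range(steps):
            b = (i + 0.5) * h
            dens = 2.0 / (math.sqrt(2.0 * math.pi) * eta) * math.exp(-b * b / (2.0 * eta * eta))
            total += dens * (1.0 - b / r) * h
        return total

    def rho(eps, r_values=(0.5, 1.0, 2.0, 3.0, 4.0, 6.0, 8.0)):
        # rho(1+eps) = min over r of ln(1/alpha(1, r)) / ln(1/alpha(1+eps, r)).
        return min(math.log(1.0 / alpha(1.0, r)) / math.log(1.0 / alpha(1.0 + eps, r))
                   for r in r_values)

    # For eps = 1, rho(1.0) comes out a bit below 1/2, in line with the lemma above.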

ANN in High Dimensional Euclidean Space

Unlike the hypercube case, where we could do a direct binary search on the distances, here we need to use the reduction from ANN to near-neighbor queries. We will need the following result (which follows from what we saw in previous lectures).

Theorem Given a set P of n points in IR^d, one can construct a data-structure D that answers (1+ε)-ANN queries by performing O(log(n/ε)) (1+ε)-approximate near-neighbor queries. The total number of points stored at the approximate near-neighbor data-structures of D is O( n ε^{-1} log(n/ε) ).

Constructing the data-structure of this theorem requires building a low quality HST. Unfortunately, the previous constructions we saw for HSTs are either exponential in the dimension or take quadratic time. We next present a faster scheme.

Low quality HST in high dimensional Euclidean space

Lemma Let P be a set of n points in IR^d. One can compute an nd-HST of P in O(dn log² n) time (note that the constant hidden by the O notation does not depend on d).

Proof: Our construction is based on a recursive decomposition of the point set. In each stage, we split the point set into two subsets. We recursively compute an nd-HST for each subset, and we merge the two trees into a single tree by creating a new root vertex, assigning it an appropriate value, and hanging the two subtrees from this node.

To carry this out, we try to separate the set into two subsets that are far away from each other. Let R = R(P) be the minimum axis-parallel box containing P, and let ν = ℓ(P) = Σ_{i=1}^{d} ‖I_i(R)‖, where I_i(R) is the projection of R to the ith dimension. Clearly, one can find an axis-parallel strip H of width at least ν/((n−1)d), such that there is at least one point of P on each of its sides, and there are no points of P inside H. Indeed, to find this strip, project the point set onto the ith dimension, and find the longest interval between two consecutive points. Repeat this process for i = 1,...,d, and use the longest interval encountered. Clearly, the strip H corresponding to this interval has width at least ν/((n−1)d). On the other hand, diam(P) ≤ ν.

Now recursively continue the construction of two trees T⁺, T⁻, for P⁺, P⁻, respectively, where P⁺, P⁻ is the partition of P into two sets induced by H. We hang T⁺ and T⁻ from the root node v, and set Δ_v = ν. We claim that the resulting tree T is an nd-HST. To this end, observe that diam(P) ≤ Δ_v, and for a point p ∈ P⁻ and a point q ∈ P⁺, we have ‖p − q‖ ≥ ν/((n−1)d) ≥ Δ_v/(nd), which implies the claim.

To implement this construction efficiently, we use balanced search trees to store the points in their order along each coordinate. Let D_1,...,D_d be these trees, where D_i stores the points of P in ascending order according to the ith axis, for i = 1,...,d. We augment them, so that for every node v ∈ D_i we know the largest empty interval along the ith axis among the points of P_v (i.e., the points stored in the subtree of v in D_i). Thus, finding the widest strip to split along can be done in O(d log n) time.

Now, we need to split the d trees into two families of d trees. Assume we split according to the first axis. We can split D_1 in O(log n) time using the split operation provided by the search tree (treaps, for example, can do this split in O(log n) time). Assume that this splits P into two sets L and R with |L| ≤ |R|. We still need to split the other d − 1 search trees. This is done by deleting all the points of L from those trees, and building d − 1 new search trees for L. This takes O(|L| d log n) time, which we charge to the points of L. Since in every split only the points in the smaller portion of the split get charged, it follows that every point can be charged at most O(log n) times during the construction. Thus, the overall construction time is O(dn log² n).
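One split step of the above construction looks as follows in code. This naive version (names ours) re-sorts along every axis at each split, so it does not achieve the O(dn log² n) bound of the proof; it is only meant to show the geometric splitting rule.

    def widest_gap_split(points):
        # Find, over all axes, the widest empty interval between two consecutive
        # projected points, and split the point set along it.
        d = len(points[0])
        best_gap, best_axis, threshold = -1.0, 0, 0.0
        for axis in range(d):
            coords = sorted(p[axis] for p in points)
            for a, b in zip(coords, coords[1:]):
                if b - a > best_gap:
                    best_gap, best_axis, threshold = b - a, axis, (a + b) / 2.0
        left = [p for p in points if p[best_axis] <= threshold]
        right = [p for p in points if p[best_axis] > threshold]
        return left, right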

The overall result

Plugging the HST lemma into the reduction theorem above, we have:

Theorem Given a set P of n points in IR^d and a parameter ε > 0, one can build an ANN data-structure using O( dn + n^{1+1/(1+ε)} ε^{-2} log³(n/ε) ) space, such that given a query point q, one can return a (1+ε)-ANN of q in P in O( d n^{1/(1+ε)} log n · log(n/ε) ) time. The result returned is correct with high probability. The construction time is O( d n^{1+1/(1+ε)} ε^{-2} log³(n/ε) ).

Proof: We compute the low quality HST using the lemma above. This takes O(dn log² n) time. Using this HST, we can construct the data-structure D of the reduction theorem, where we do not yet compute the D_NearNbr data-structures. We then traverse D and construct the D_NearNbr data-structures using the Euclidean near-neighbor theorem above. We only need to prove the bound on the space. Observe that we need to store each point only once, since every other place can refer to the point by a pointer; this accounts for the O(nd) term. The other term comes from plugging the space bound of the near-neighbor theorem into the bound on the total number of stored points in the reduction theorem.

Bibliographical notes

Section 17.1 follows the exposition of Indyk and Motwani [IM98]. The fact that one can perform approximate nearest neighbor search in high dimensions in time and space polynomial in the dimension is quite surprising. One can reduce the approximate near-neighbor problem in Euclidean space to the same problem on the hypercube (we sketch the details below). Together with the reduction from ANN to approximate near-neighbor (seen in previous lectures), this implies that one can answer ANN queries in high dimensional Euclidean space with similar performance. Kushilevitz, Ostrovsky and Rabani [KOR00] offered an alternative data-structure with somewhat inferior performance.

The value of the results shown in this write-up depends to a large extent on the reader's perspective. Indeed, for a small value of ε > 0, the query time O(d n^{1/(1+ε)}) is very close to linear in n, and is almost equivalent to just scanning the points. Thus, from the low-dimensional perspective, where ε is assumed to be small, this result is only slightly sublinear. On the other hand, if one is willing to pick ε to be large (say 10), then the result is clearly better than the naive algorithm, yielding a running time of (roughly) n^{1/11} for an ANN query.

The idea of doing locality sensitive hashing directly in Euclidean space, as done in Section 17.2, is not shocking after one has seen the Johnson-Lindenstrauss lemma. It is taken from a recent paper of Datar et al. [DNIM04]. In particular, the current analysis, which relies on computerized estimates, is far from being satisfactory. It would be nice to have a simpler and more elegant scheme for this case. This is an open problem for further research. Another open problem is to improve the performance of the LSH scheme.

The low-quality high-dimensional HST construction of the lemma above is taken from [Har01]. The running time of this lemma can be further improved to O(dn log n) by a more careful and involved implementation; see [CK95] for details.

From approximate near-neighbor in IR^d to approximate near-neighbor on the hypercube. The reduction is quite involved, and we only sketch the details. Let P be a set of n points in IR^d. We first reduce the dimension to k = O(ε^{-2} log n) using the Johnson-Lindenstrauss lemma. Next, we embed this space into ℓ_1^{k'} (this is the space IR^{k'}, where distances are measured in the L_1 metric instead of the regular L_2 metric), where k' = O(k/ε²). This can be done with distortion (1+ε).

Let Q be the resulting set of points in IR^{k'}. We want to solve approximate near-neighbor queries on this set of points, for radius r. As a first step, we partition the space into cells by taking a grid with sidelength (say) k'r and randomly translating it, clipping the points into the grid cells. It is now sufficient to solve the approximate near-neighbor problem inside each grid cell (which has bounded diameter as a function of r), since only with small probability does the random shift separate a query point from its near-neighbor. We amplify the probability of success by repeating this a polylogarithmic number of times. Thus, we can assume that P is contained inside a cube of sidelength k'nr, that it lies in IR^{k'}, and that the distance metric is the L_1 metric.

We next snap the points of P to a grid of sidelength (say) εr/k'. Thus, every point of P now has integer coordinates, bounded by a polynomial in log n and 1/ε. Next, we write the coordinates of the points of P using unary notation. (Thus, using 6 bits per coordinate, the point (2, 5) would be written as (110000, 111110).) It is now easy to verify that the Hamming distance between the resulting strings is equal to the L_1 distance between the points (a short code sketch of this encoding appears just before the bibliography). Thus, we can solve the near-neighbor problem for points in IR^d by solving it on the hypercube under the Hamming distance. See Indyk and Motwani [IM98] for more details. This relationship indicates that ANN on the hypercube is essentially equivalent to ANN in Euclidean space. In particular, making progress on ANN on the hypercube would probably lead to similar progress on the Euclidean ANN problem.

We have only scratched the surface of proximity problems in high dimensions. The interested reader is referred to the survey by Indyk [Ind04] for more information.

From previous lectures

Lemma (A) The multidimensional normal distribution is symmetric; that is, for any two points p, q ∈ IR^d such that ‖p‖ = ‖q‖, we have that g(p) = g(q), where g(·) is the density function of the multidimensional normal distribution N^d.
(B) The projection of the normal distribution onto any direction is a one dimensional normal distribution.
(C) Picking d variables X_1,...,X_d independently from the one dimensional normal distribution N results in a point (X_1,...,X_d) that has the multidimensional normal distribution N^d.
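For completeness, the unary encoding step of the reduction sketched above is short enough to spell out. The bound max_coord below stands for whatever polynomial bound the snapping step produces, and the concrete numbers are ours.

    def unary_encode(point, max_coord):
        # Each coordinate c becomes c ones followed by (max_coord - c) zeros, so the
        # Hamming distance between encodings equals the L1 distance between points.
        bits = []
        for c in point:
            bits.extend([1] * c + [0] * (max_coord - c))
        return bits

    p, q = unary_encode((2, 5), 6), unary_encode((4, 1), 6)
    assert sum(a != b for a, b in zip(p, q)) == abs(2 - 4) + abs(5 - 1)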

Bibliography

[CK95] P. B. Callahan and S. R. Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields. J. Assoc. Comput. Mach., 42:67-90, 1995.

[DNIM04] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. 20th Annu. ACM Sympos. Comput. Geom., pages 253-262, 2004.

[Har01] S. Har-Peled. A replacement for Voronoi diagrams of near linear size. In Proc. 42nd Annu. IEEE Sympos. Found. Comput. Sci., pages 94-103, 2001.

[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annu. ACM Sympos. Theory Comput., pages 604-613, 1998.

[Ind04] P. Indyk. Nearest neighbors in high-dimensional spaces. In J. E. Goodman and J. O'Rourke, editors, Handbook of Discrete and Computational Geometry, chapter 39. CRC Press LLC, Boca Raton, FL, 2nd edition, 2004.

[KOR00] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM J. Comput., 30(2):457-474, 2000.
