Chapter 17
Approximate Nearest Neighbor (ANN) Search in High Dimensions
By Sariel Har-Peled, February 4
(This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 License. To view a copy of this license, visit the Creative Commons website or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.)

17.1 ANN on the Hypercube

Hypercube and Hamming distance

Definition The set of points H^d = {0,1}^d is the d-dimensional hypercube. A point p = (p_1,...,p_d) ∈ H^d can be interpreted, naturally, as the binary string p_1 p_2 ... p_d. The Hamming distance d_H(p, q) between p, q ∈ H^d is the number of coordinates in which p and q disagree.

It is easy to verify that the Hamming distance complies with the triangle inequality, and is thus a metric.

As we saw previously, all we need in order to solve (1+ε)-ANN efficiently is to solve the approximate near neighbor problem efficiently. Namely, given a set P of n points in H^d, a radius r > 0 and a parameter ε > 0, we want to decide for a query point q whether d_H(q, P) ≤ r or d_H(q, P) ≥ (1+ε)r, where d_H(q, P) = min_{p ∈ P} d_H(q, p).

Definition For a set P of points, a data-structure D = D_NearNbr(P, r, (1+ε)r) solves the approximate near neighbor problem if, given a query point q, it works as follows.
Near: If d_H(q, P) ≤ r, then D outputs a point p ∈ P such that d_H(p, q) ≤ (1+ε)r.
Far: If d_H(q, P) ≥ (1+ε)r, then D outputs that d_H(q, P) > r.
Don't know: If r ≤ d_H(q, P) ≤ (1+ε)r, then D can return either of the above answers.

Given such a data-structure, one can answer an approximate nearest-neighbor query using O(log((log d)/ε)) approximate near-neighbor queries. Indeed, the desired distance d_H(q, P) is an integer in the range 0, 1, ..., d. We can build a D_NearNbr data-structure for each of the distances (1+ε)^i, for i = 1,...,M, where M = O(ε^{-1} log d). Performing a binary search over these distances, using the approximate near-neighbor data-structures, resolves the approximate nearest-neighbor query and requires O(log M) near-neighbor queries.
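To make the reduction concrete, here is a minimal Python sketch of the binary search over the radii (1+ε)^i. The helper names are ours; in particular near_nbr[i] is an assumed stand-in for the D_NearNbr structure built for the ith radius, not part of the construction itself.

    def hamming(p, q):
        # Hamming distance between two equal-length 0/1 sequences.
        return sum(a != b for a, b in zip(p, q))

    def ann_by_binary_search(q, near_nbr, radii):
        # near_nbr[i](q) returns a point within (1+eps)*radii[i] of q, or None
        # if every point of P is farther than radii[i] (in the "don't know"
        # zone either answer may be returned).  Binary search for the smallest
        # radius that still reports a point: O(log M) near-neighbor queries.
        lo, hi, best = 0, len(radii) - 1, None
        while lo <= hi:
            mid = (lo + hi) // 2
            p = near_nbr[mid](q)
            if p is not None:
                best, hi = p, mid - 1   # a point was found; try a smaller radius
            else:
                lo = mid + 1            # no point this close; try a larger radius
        return best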

As such, in the following we concentrate on constructing the approximate near-neighbor data-structure (i.e., D_NearNbr).

Construction of the near-neighbor data-structure

On sense and sensitivity

Let P = {p_1,...,p_n} be a subset of the vertices of the hypercube in d dimensions. In the following we assume that d = n^{O(1)}. Let r, ε > 0 be two prespecified parameters. We are interested in building an approximate near neighbor data-structure (i.e., D_NearNbr) for balls of radius r in the Hamming distance.

Definition Let U be a (small) positive integer. A family F of functions (defined over H^d) is (r, R, α, β)-sensitive if for any q, s ∈ H^d we have:
(A) If q ∈ b(s, r), then Pr[f(q) = f(s)] ≥ α.
(B) If q ∉ b(s, R), then Pr[f(q) = f(s)] ≤ β,
where f is a function picked randomly from F, r < R, and α > β.

Intuitively, if we can construct an (r, R, α, β)-sensitive family, then we can distinguish between two points that are close together and two points that are far away from each other. Of course, the probabilities α and β might be very close to each other, and we need a way to do amplification.

A simple sensitive family. A priori it is not even clear that such a sensitive family exists, but it turns out that the family that exposes one random coordinate is sensitive.

Lemma Let f_i(p) denote the function that returns the ith coordinate of p, for i = 1,...,d. Consider the family of functions F = {f_1,...,f_d}. Then, for any r > 0 and ε > 0, the family F is (r, (1+ε)r, α, β)-sensitive, where α = 1 − r/d and β = 1 − r(1+ε)/d.

Proof: If q, s ∈ {0,1}^d are at Hamming distance at most r from each other, then they differ in at most r coordinates. The probability that a random f ∈ F projects onto a coordinate on which q and s agree is therefore at least 1 − r/d. Similarly, if d_H(q, s) ≥ (1+ε)r, then the probability that a random f ∈ F projects onto a coordinate on which q and s agree is at most 1 − (1+ε)r/d.

A family with a large sensitivity gap. Let k be a parameter to be specified shortly, and consider the family of functions G that concatenates k of the given functions. Formally, let

G = combine(F, k) = { g | g(p) = ( f_1(p),..., f_k(p) ), for f_1,...,f_k ∈ F }

be the set of all such functions.
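As an illustration, a random function g ∈ combine(F, k) amounts to nothing more than picking k coordinates independently and reading them off; the following short sketch (helper names ours) does exactly that.

    import random

    def random_g(d, k, rng=random):
        # A random g in combine(F, k): choose k coordinates independently
        # (with repetition) and map p to the k-tuple of those coordinates.
        coords = [rng.randrange(d) for _ in range(k)]
        return lambda p: tuple(p[i] for i in coords)

For q, s at Hamming distance at most r, each chosen coordinate agrees with probability at least 1 − r/d, so Pr[g(q) = g(s)] ≥ (1 − r/d)^k, exactly as in the lemma that follows.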

Lemma Given an (r, R, α, β)-sensitive family F, the family G = combine(F, k) is (r, R, α^k, β^k)-sensitive.

Proof: For two fixed points q, s ∈ H^d such that d_H(q, s) ≤ r, we have that Pr[f(q) = f(s)] ≥ α for a random f ∈ F. As such, for a random g ∈ G (whose k components are picked independently), we have

Pr[g(q) = g(s)] = Pr[f_1(q) = f_1(s) and f_2(q) = f_2(s) and ... and f_k(q) = f_k(s)] = ∏_{i=1}^{k} Pr[f_i(q) = f_i(s)] ≥ α^k.

Similarly, if d_H(q, s) > R, then Pr[g(q) = g(s)] = ∏_{i=1}^{k} Pr[f_i(q) = f_i(s)] ≤ β^k.

The above lemma implies that we can build a family that has a large gap between its lower and upper sensitivities; namely, α^k/β^k = (α/β)^k can be made arbitrarily large. The problem is that if α^k is too small, then we will have to use too many functions to detect whether or not there is a point close to the query point.

Nevertheless, consider the task of building a data-structure that finds all the points of P = {p_1,...,p_n} that are equal, under a given function g ∈ G = combine(F, k), to a query point. To this end, we compute the strings g(p_1),...,g(p_n) and store them (together with their associated points) in a hash table (or a prefix tree). Now, given a query point q, we compute g(q) and fetch from this data-structure all the stored strings equal to it. Clearly, this is a simple and efficient data-structure. All the points colliding with q are the natural candidates to be the nearest neighbor of q. By not storing the points explicitly, but using pointers into the original input set, we get the following easy result.

Lemma Given a function g ∈ G = combine(F, k) (see the lemma above), and a set P ⊆ H^d of n points, one can construct a data-structure, in O(nk) time and using O(nk) additional space, such that given a query point q, one can report all the points in X = { p ∈ P | g(p) = g(q) } in O(k + |X|) time.

Amplifying sensitivity. Our task is now to amplify the sensitive family we currently have. To this end, for two τ-dimensional points x and y, let x ≈ y be the boolean function that returns true if there exists an index i such that x_i = y_i, and false otherwise. Thus, while the regular "=" operator requires the vectors to be equal in all coordinates (i.e., it is ∧_i (x_i = y_i)), the operator x ≈ y is ∨_i (x_i = y_i). The construction of the previous lemma, using this alternative equality operator, provides us with the required amplification.

Lemma Given an (r, R, α^k, β^k)-sensitive family G, the family H = combine(G, τ), if one uses the operator ≈ to check equality, is (r, R, 1 − (1 − α^k)^τ, 1 − (1 − β^k)^τ)-sensitive.

Proof: For two fixed points q, s ∈ H^d such that d_H(q, s) ≤ r, we have that Pr[g(q) = g(s)] ≥ α^k for a random g ∈ G. As such, for a random h = (g_1,...,g_τ) ∈ H, we have

Pr[h(q) ≈ h(s)] = Pr[g_1(q) = g_1(s) or g_2(q) = g_2(s) or ... or g_τ(q) = g_τ(s)] = 1 − ∏_{i=1}^{τ} Pr[g_i(q) ≠ g_i(s)] ≥ 1 − (1 − α^k)^τ.

Similarly, if d_H(q, s) > R, then Pr[h(q) ≈ h(s)] = 1 − ∏_{i=1}^{τ} Pr[g_i(q) ≠ g_i(s)] ≤ 1 − (1 − β^k)^τ.
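The hash-table data-structure of the single-function lemma above is short enough to spell out; storing indices rather than copies of the points is what keeps the additional space at O(nk). The helper names below are ours, a sketch rather than the official implementation.

    from collections import defaultdict

    def build_table(points, g):
        # Store g(p) for every input point, keyed by the k-tuple g(p).
        table = defaultdict(list)
        for idx, p in enumerate(points):
            table[g(p)].append(idx)   # an index acts as a pointer into the input set
        return table

    def colliding_points(table, g, q):
        # All indices of points p with g(p) = g(q); O(k + |X|) expected time.
        return table.get(g(q), [])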

To see the effect of the above lemma, it is useful to play with a concrete example. Consider an (r, R, α^k, β^k)-sensitive family where β^k = α^k/2, and yet α^k is very small. Setting τ = 1/α^k, the resulting family is (roughly) (r, R, 1 − 1/e, 1 − 1/√e)-sensitive. Namely, the gap has shrunk, but the threshold sensitivity is considerably higher. In particular, it is now a constant, and the gap is also a constant.

Using the above lemma as a data-structure to store P is more involved than before. Indeed, a random function h = (g_1,...,g_τ) ∈ H = combine(G, τ) requires us to build τ data-structures, one for each of the functions g_1,...,g_τ, using the earlier lemma. Now, given a query point, we retrieve all the points of P that collide with it under any of these functions, by querying each of these data-structures.

Lemma Given a function h ∈ H = combine(G, τ) (see the lemma above), and a set P ⊆ H^d of n points, one can construct a data-structure, in O(nkτ) time and using O(nkτ) additional space, such that given a query point q, one can report all the points in X = { p ∈ P | h(p) ≈ h(q) } in O(kτ + |X|) time.

The near-neighbor data-structure and handling a query

We construct the data-structure D of the above lemma, with parameters k and τ to be determined shortly, for a random function h ∈ H. Given a query point q, we retrieve all the points that collide with it under h, and scan them one by one, computing their distance to q. As soon as we encounter a point s ∈ P such that d_H(q, s) ≤ R, the data-structure returns true together with s.

Let us assume that we know that the expected number of points of P \ b(q, R) (where R = (1+ε)r) that collide with q in D is L (we will determine the value of L below). To ensure a worst-case query time, the query aborts after checking 4L + 1 points and returns false. Naturally, the data-structure also returns false if all the points encountered are at distance larger than R from q. Clearly, the query time of this data-structure is O(kτ + dL).
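A minimal sketch of this query procedure, assuming the random functions g_1,...,g_τ are given as Python callables (names and layout ours):

    from collections import defaultdict

    def build_tables(points, gs):
        # One hash table per g_j (j = 1..tau), as in the construction above.
        tables = []
        for g in gs:
            t = defaultdict(list)
            for idx, p in enumerate(points):
                t[g(p)].append(idx)
            tables.append(t)
        return tables

    def near_neighbor_query(q, points, gs, tables, R, L):
        # Scan the points colliding with q under at least one g_j (i.e. with
        # h(p) ~ h(q)), stop at the first point within R = (1+eps)r, and abort
        # after 4L+1 candidates to guarantee the worst-case query time.
        checked, seen = 0, set()
        for g, t in zip(gs, tables):
            for idx in t.get(g(q), ()):
                if idx in seen:
                    continue
                seen.add(idx)
                if checked >= 4 * L + 1:
                    return None              # too many far collisions: give up
                checked += 1
                dist = sum(a != b for a, b in zip(points[idx], q))
                if dist <= R:
                    return points[idx]       # a point within (1+eps)r of q
        return None                          # no near point found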

We are left with the task of fine-tuning the parameters τ and k to get the fastest possible query time, while the data-structure has a reasonable probability of success. Figuring out the right values is technically tedious, and we do it next.

Setting the parameters

If there exists p ∈ P such that d_H(q, p) ≤ r, then the probability of this point colliding with q under the function h is φ ≥ 1 − (1 − α^k)^τ. Let us demand that this data-structure succeeds with probability at least 3/4. To this end, we set τ = ⌈4/α^k⌉, so that

φ ≥ 1 − (1 − α^k)^τ ≥ 1 − exp(−α^k τ) ≥ 1 − exp(−4) ≥ 3/4,    (17.1)

since 1 − x ≤ exp(−x), for x ≥ 0.

Lemma The expected number of points of P \ b(q, R) colliding with the query point is L = O(n(β/α)^k).

Proof: Consider the points of P \ b(q, R). We would like to bound the number of points of this set that collide with the query point. Observe that the probability of a point p ∈ P \ b(q, R) to collide with the query point is

ψ = 1 − (1 − β^k)^τ = 1 − (1 − β^k)^{⌈4/α^k⌉} ≤ 1 − exp(−8 β^k/α^k) ≤ 1 − (1 − 8 β^k/α^k) = 8 (β/α)^k,

since 1 − x ≥ exp(−2x), for x ∈ [0, 1/2], and exp(−z) ≥ 1 − z, for z ≥ 0. Namely, the expected number of points of P \ b(q, R) colliding with the query point is ψ n = O(n(β/α)^k).

By the data-structure lemma above, extracting the O(L) colliding points takes O(kτ + L) time, and computing the distance from the query point to each of them takes O(Ld) time. As such, the query time is

O(kτ + Ld) = O( kτ + nd(β/α)^k ).

To minimize the query time, we approximately solve the equation requiring the above two terms to be equal (we ignore d since, intuitively, it should be small compared to n). Up to constants, kτ ≈ k/α^k and n(β/α)^k = n β^k/α^k, so we get

k/α^k ≈ n β^k/α^k   ⟺   k ≈ n β^k   ⟺   1/β^k ≈ n/k.

Setting k = ⌈log_{1/β} n⌉, we have that β^k ≤ 1/n and, by Eq. (17.1),

τ = ⌈4/α^k⌉ = O( exp( k ln(1/α) ) ) = O( exp( (ln n / ln(1/β)) · ln(1/α) ) ) = O(n^ρ),  for ρ = ln(1/α) / ln(1/β).    (17.2)

Lemma (A) For x ∈ [0, 1) and t ≥ 1 such that 1 − tx > 0, we have ln(1 − x)/ln(1 − tx) ≤ 1/t.
(B) For α = 1 − r/d and β = 1 − r(1+ε)/d, we have that ρ = ln(1/α)/ln(1/β) ≤ 1/(1+ε).

Proof: (A) Since ln(1 − tx) < 0, the claim is equivalent to t ln(1 − x) ≥ ln(1 − tx). This in turn is equivalent to g(x) = (1 − tx) − (1 − x)^t ≤ 0. This is trivially true for x = 0. Furthermore, taking the derivative, we see that g'(x) = −t + t(1 − x)^{t−1}, which is non-positive for x ∈ [0, 1) and t ≥ 1. Therefore, g is non-increasing in the interval of interest, and so g(x) ≤ 0 for all values in this interval.
(B) Indeed,

ρ = ln(1/α) / ln(1/β) = ln α / ln β = ln(1 − r/d) / ln(1 − (1+ε) r/d) ≤ 1/(1+ε),

by part (A), applied with x = r/d and t = 1 + ε.

In the following, it would be convenient to consider d to be considerably larger than r. This can be ensured by (conceptually) padding the points with fake coordinates that are all zero. It is easy to verify that this hack does not affect the algorithm's performance in any way; it is just a trick to make our analysis simpler. In particular, we assume that d > 2(1+ε)r.
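Plugging concrete numbers into these choices is instructive. The following throwaway computation is ours; the specific values of n, d, r and ε are chosen only for illustration.

    import math

    def lsh_parameters(n, d, r, eps):
        # alpha = 1 - r/d, beta = 1 - (1+eps)r/d, k = ceil(log_{1/beta} n),
        # tau = ceil(4/alpha^k), rho = ln(1/alpha)/ln(1/beta); see Eq. (17.2).
        alpha = 1.0 - r / d
        beta = 1.0 - (1.0 + eps) * r / d
        k = math.ceil(math.log(n) / math.log(1.0 / beta))
        tau = math.ceil(4.0 / alpha ** k)
        rho = math.log(1.0 / alpha) / math.log(1.0 / beta)
        return k, tau, rho

    k, tau, rho = lsh_parameters(n=10**6, d=400, r=50, eps=1.0)
    # Here rho is roughly 0.46, below 1/(1+eps) = 0.5, as promised by part (B).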

Lemma For α = 1 − r/d, β = 1 − r(1+ε)/d, and n and d as above, we have that (i) τ = O(n^{1/(1+ε)}), (ii) k = O(ln n), and (iii) L = O(n^{1/(1+ε)}).

Proof: By Eq. (17.2), τ = ⌈4/α^k⌉ = O(n^ρ) = O(n^{1/(1+ε)}), by part (B) of the lemma above. Now, β = 1 − r(1+ε)/d ≥ 1/2, since we assumed that d > 2(1+ε)r. As such, we have k = ⌈log_{1/β} n⌉ = ⌈ln n / ln(1/β)⌉ = O(ln n). By the lemma above, L = O(n(β/α)^k). Now β^k ≤ 1/n, and as such L = O(1/α^k) = O(τ) = O(n^{1/(1+ε)}).

The result

Theorem Given a set P of n points on the hypercube H^d, and parameters ε > 0 and r > 0, one can build a data-structure D = D_NearNbr(P, r, (1+ε)r) that solves the approximate near neighbor problem (see the definition above). The data-structure answers a query successfully with high probability. In addition:
(A) The query time is O( d n^{1/(1+ε)} log n ).
(B) The preprocessing time to build this data-structure is O( n^{1+1/(1+ε)} log² n ).
(C) The space required to store this data-structure is O( nd + n^{1+1/(1+ε)} log² n ).

Proof: Our building block is the data-structure described above. By Markov's inequality, the probability that the algorithm has to abort because of too many collisions with points of P \ b(q, (1+ε)r) is bounded by 1/4 (since the algorithm tries 4L + 1 points). Also, if there is a point inside b(q, r), the algorithm finds it with probability at least 3/4, by Eq. (17.1). As such, with probability at least 1/2 this data-structure returns the correct answer. By the lemma above, its query time is O(kτ + Ld).

This basic data-structure succeeds only with constant probability. To achieve high probability, we construct O(log n) such data-structures and perform the near-neighbor query in each one of them. As such, the query time is

O( (kτ + Ld) log n ) = O( n^{1/(1+ε)} log² n + d n^{1/(1+ε)} log n ) = O( d n^{1/(1+ε)} log n ),

by the previous lemma, and since d = Ω(lg n) if P contains n distinct points of H^d. As for the preprocessing time, by the two lemmas above, it is O( nkτ log n ) = O( n^{1+1/(1+ε)} log² n ). Finally, this data-structure requires O(dn) space to store the input points. By the lemma above, we need an additional O( nkτ log n ) = O( n^{1+1/(1+ε)} log² n ) space.

In the hypercube case, when d = n^{O(1)}, we can build M = O( log_{1+ε} d ) = O(ε^{-1} log d) such data-structures, corresponding to the radii r_1,...,r_M, where r_i = (1+ε)^i for i = 1,...,M, so that a (1+ε)-ANN query can be answered by a binary search over these data-structures.

Theorem Given a set P of n points on the hypercube H^d (where d = n^{O(1)}), and a parameter ε > 0, one can build a data-structure that answers approximate nearest-neighbor queries (under the Hamming distance) using O( dn + n^{1+1/(1+ε)} ε^{-1} log² n log d ) space, such that given a query point q, one can return a (1+ε)-ANN of q in P (under the Hamming distance) in O( d n^{1/(1+ε)} log n log(ε^{-1} log d) ) time. The result returned is correct with high probability.

Remark The data-structure of the above theorem requires the queries to be oblivious of its random choices. Indeed, for any fixed instantiation of the data-structure there exist query points for which it fails. In particular, if we perform a sequence of ANN queries using such a data-structure, where the queries depend on earlier returned answers, then the guarantee of success with high probability is no longer implied by the above analysis (it might still hold for other reasons, naturally).

17.2 LSH and ANN in Euclidean Space

Preliminaries

Lemma Let X = (X_1,...,X_d) be a vector of d independent variables, each with the normal distribution N = N(0,1), and let v = (v_1,...,v_d) ∈ IR^d. Then ⟨v, X⟩ = Σ_i v_i X_i is distributed as ‖v‖ Z, where Z ∼ N.

Proof: By the lemma from previous lectures (restated at the end of this chapter), the point X has the multidimensional normal distribution N^d. If ‖v‖ = 1 then the claim holds by the symmetry of the normal distribution. Indeed, let e_1 = (1, 0,...,0). By the symmetry of the d-dimensional normal distribution, we have that ⟨v, X⟩ ∼ ⟨e_1, X⟩ = X_1 ∼ N. Otherwise, ⟨v, X⟩ / ‖v‖ ∼ N, and as such ⟨v, X⟩ ∼ N(0, ‖v‖²), which is indeed the distribution of ‖v‖ Z.

Definition A distribution D over IR is called p-stable if there exists p ≥ 0 such that for any n real numbers v_1,...,v_n and n independent variables X_1,...,X_n with distribution D, the random variable Σ_i v_i X_i has the same distribution as the variable (Σ_i |v_i|^p)^{1/p} X, where X is a random variable with distribution D.

By the above lemma, the normal distribution is a 2-stable distribution.

Locality Sensitive Hashing

Let p, q be two points in IR^d. We want to perform an experiment that decides whether ‖p − q‖ ≤ 1 or ‖p − q‖ ≥ η, where η = 1 + ε. We randomly choose a vector v from the d-dimensional normal distribution N^d (which is 2-stable). Next, let r be a parameter, and let t be a random number chosen uniformly from the interval [0, r]. For p ∈ IR^d, consider the random hash function

h(p) = ⌊ (⟨p, v⟩ + t) / r ⌋.    (17.3)

Assume that the distance between p and q is η, and that the distance between the projections of the two points onto the direction v is β. Then the probability that p and q get the same hash value is max(1 − β/r, 0), since this is the probability that the random shift does not separate them. Indeed, consider the line spanned by v as the x-axis, and assume that q is projected to the point r and p is projected to r − β (assuming β ≤ r). Clearly, q and p get mapped to the same value by h(·) if and only if t falls inside a subinterval of [0, r] of length r − β, and this happens with probability 1 − β/r, as claimed.
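Before computing the collision probability explicitly, here is a direct implementation of the hash function of Eq. (17.3); the helper name is ours, and the randomness (the Gaussian direction v and the shift t) is made explicit.

    import math
    import random

    def make_hash(d, r, rng=random):
        # One random hash function h(p) = floor((<p, v> + t) / r), where v has d
        # independent N(0,1) entries (2-stable) and t is uniform in [0, r].
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        t = rng.uniform(0.0, r)
        def h(p):
            return math.floor((sum(pi * vi for pi, vi in zip(p, v)) + t) / r)
        return h

Concatenation and OR-amplification of such functions then proceed exactly as in the hypercube case.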

As such, we have that the probability of collision is

α(η) = Pr[ h(p) = h(q) ] = ∫_{β=0}^{r} Pr[ |⟨p, v⟩ − ⟨q, v⟩| = β ] (1 − β/r) dβ.

However, since v is chosen from a 2-stable distribution, we have that ⟨p, v⟩ − ⟨q, v⟩ = ⟨p − q, v⟩ ∼ N(0, ‖p − q‖²). Since we are considering the absolute value of this variable, we need to multiply its density by two. Thus, we have

α(η, r) = ∫_{β=0}^{r} ( 2 / (√(2π) η) ) exp( −β²/(2η²) ) (1 − β/r) dβ.

Intuitively, we care about the difference α(1, r) − α(1+ε, r), and we would like to make it as large as possible (by choosing the right value of r). Unfortunately, this integral is unfriendly, and we have to resort to numerical computation. In fact, if we are going to use this hashing scheme for constructing locality sensitive hashing, as in the hypercube case, then we care about the ratio

ρ(1+ε) = min_r  ln(1/α(1, r)) / ln(1/α(1+ε, r)),

see Eq. (17.2). (A short numerical sketch of this computation is given at the end of this subsection.) The following is verified using numerical computations on a computer.

Lemma ([DNIM04]) One can choose r such that ρ(1+ε) ≤ 1/(1+ε).

This lemma implies that the hash functions defined by Eq. (17.3) are (1, 1+ε, α', β')-sensitive and, furthermore, that ρ = ln(1/α')/ln(1/β') ≤ 1/(1+ε), for some values of α' and β'. As such, we can use this hashing family to construct an approximate near neighbor data-structure D_NearNbr(P, r, (1+ε)r) for a set P of points in IR^d. Following the same argumentation as in the hypercube case, we have the following.

Theorem Given a set P of n points in IR^d and parameters ε > 0 and r > 0, one can build a data-structure D_NearNbr = D_NearNbr(P, r, (1+ε)r), such that given a query point q, one can decide:
If b(q, r) ∩ P ≠ ∅, then D_NearNbr returns a point u ∈ P such that ‖u − q‖ ≤ (1+ε)r.
If b(q, (1+ε)r) ∩ P = ∅, then D_NearNbr returns that no point of P is within distance r of q.
In any other case, either of the answers is correct.
The query time is O( d n^{1/(1+ε)} log n ) and the space used is O( dn + n^{1+1/(1+ε)} log n ). The result returned is correct with high probability.
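The numerical computation alluded to above can be reproduced with a few lines of code. The crude midpoint rule and the small set of candidate values of r below are ours and serve only to illustrate how an estimate in the spirit of [DNIM04] is obtained.

    import math

    def alpha(eta, r, steps=2000):
        # Midpoint-rule evaluation of
        # alpha(eta, r) = int_0^r (2/(sqrt(2*pi)*eta)) exp(-b^2/(2*eta^2)) (1 - b/r) db.
        h = r / steps
        total = 0.0
        for i in range(steps):
            b = (i + 0.5) * h
            dens = 2.0 / (math.sqrt(2.0 * math.pi) * eta) * math.exp(-b * b / (2.0 * eta * eta))
            total += dens * (1.0 - b / r) * h
        return total

    def rho(eps, r_values=(0.5, 1.0, 2.0, 3.0, 4.0, 6.0, 8.0)):
        # rho(1+eps) = min over r of ln(1/alpha(1, r)) / ln(1/alpha(1+eps, r)).
        return min(math.log(1.0 / alpha(1.0, r)) / math.log(1.0 / alpha(1.0 + eps, r))
                   for r in r_values)

    # For eps = 1, rho(1.0) comes out a bit below 1/2, in line with the lemma above.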

ANN in High Dimensional Euclidean Space

Unlike the hypercube case, where we could do a direct binary search on the distances, here we need to use the reduction from ANN to near-neighbor queries. We will need the following result (which follows from what we saw in previous lectures).

Theorem Given a set P of n points in IR^d, one can construct a data-structure D that answers (1+ε)-ANN queries by performing O(log(n/ε)) (1+ε)-approximate near-neighbor queries. The total number of points stored at the approximate near-neighbor data-structures of D is O( n ε^{-1} log(n/ε) ).

Constructing the data-structure of this theorem requires building a low quality HST. Unfortunately, the previous constructions we saw for HSTs are either exponential in the dimension or take quadratic time. We next present a faster scheme.

Low quality HST in high dimensional Euclidean space

Lemma Let P be a set of n points in IR^d. One can compute an nd-HST of P in O(dn log² n) time (note that the constant hidden by the O notation does not depend on d).

Proof: Our construction is based on a recursive decomposition of the point set. In each stage, we split the point set into two subsets. We recursively compute an nd-HST for each subset, and we merge the two trees into a single tree by creating a new root vertex, assigning it an appropriate value, and hanging the two subtrees from this node.

To carry this out, we try to separate the set into two subsets that are far away from each other. Let R = R(P) be the minimum axis-parallel box containing P, and let ν = ℓ(P) = Σ_{i=1}^{d} ‖I_i(R)‖, where I_i(R) is the projection of R to the ith dimension. Clearly, one can find an axis-parallel strip H of width at least ν/((n−1)d), such that there is at least one point of P on each of its sides, and there are no points of P inside H. Indeed, to find this strip, project the point set onto the ith dimension, and find the longest interval between two consecutive points. Repeat this process for i = 1,...,d, and use the longest interval encountered. Clearly, the strip H corresponding to this interval has width at least ν/((n−1)d). On the other hand, diam(P) ≤ ν.

Now recursively continue the construction of two trees T⁺, T⁻, for P⁺, P⁻, respectively, where P⁺, P⁻ is the partition of P into two sets induced by H. We hang T⁺ and T⁻ from the root node v, and set Δ_v = ν. We claim that the resulting tree T is an nd-HST. To this end, observe that diam(P) ≤ Δ_v, and for a point p ∈ P⁻ and a point q ∈ P⁺, we have ‖p − q‖ ≥ ν/((n−1)d) ≥ Δ_v/(nd), which implies the claim.

To implement this construction efficiently, we use balanced search trees to store the points in their order along each coordinate. Let D_1,...,D_d be these trees, where D_i stores the points of P in ascending order according to the ith axis, for i = 1,...,d. We augment them, so that for every node v ∈ D_i we know the largest empty interval along the ith axis among the points of P_v (i.e., the points stored in the subtree of v in D_i). Thus, finding the widest strip to split along can be done in O(d log n) time.

Now, we need to split the d trees into two families of d trees. Assume we split according to the first axis. We can split D_1 in O(log n) time using the split operation provided by the search tree (treaps, for example, can do this split in O(log n) time). Assume that this splits P into two sets L and R with |L| ≤ |R|. We still need to split the other d − 1 search trees. This is done by deleting all the points of L from those trees, and building d − 1 new search trees for L. This takes O(|L| d log n) time, which we charge to the points of L. Since in every split only the points in the smaller portion of the split get charged, it follows that every point can be charged at most O(log n) times during the construction. Thus, the overall construction time is O(dn log² n).
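One split step of the above construction looks as follows in code. This naive version (names ours) re-sorts along every axis at each split, so it does not achieve the O(dn log² n) bound of the proof; it is only meant to show the geometric splitting rule.

    def widest_gap_split(points):
        # Find, over all axes, the widest empty interval between two consecutive
        # projected points, and split the point set along it.
        d = len(points[0])
        best_gap, best_axis, threshold = -1.0, 0, 0.0
        for axis in range(d):
            coords = sorted(p[axis] for p in points)
            for a, b in zip(coords, coords[1:]):
                if b - a > best_gap:
                    best_gap, best_axis, threshold = b - a, axis, (a + b) / 2.0
        left = [p for p in points if p[best_axis] <= threshold]
        right = [p for p in points if p[best_axis] > threshold]
        return left, right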

The overall result

Plugging the HST lemma into the reduction theorem above, we have:

Theorem Given a set P of n points in IR^d and a parameter ε > 0, one can build an ANN data-structure using O( dn + n^{1+1/(1+ε)} ε^{-2} log³(n/ε) ) space, such that given a query point q, one can return a (1+ε)-ANN of q in P in O( d n^{1/(1+ε)} log n · log(n/ε) ) time. The result returned is correct with high probability. The construction time is O( d n^{1+1/(1+ε)} ε^{-2} log³(n/ε) ).

Proof: We compute the low quality HST using the lemma above. This takes O(dn log² n) time. Using this HST, we can construct the data-structure D of the reduction theorem, where we do not yet compute the D_NearNbr data-structures. We then traverse D and construct the D_NearNbr data-structures using the Euclidean near-neighbor theorem above. We only need to prove the bound on the space. Observe that we need to store each point only once, since every other place can refer to the point by a pointer; this accounts for the O(nd) term. The other term comes from plugging the space bound of the near-neighbor theorem into the bound on the total number of stored points in the reduction theorem.

Bibliographical notes

Section 17.1 follows the exposition of Indyk and Motwani [IM98]. The fact that one can perform approximate nearest neighbor search in high dimensions in time and space polynomial in the dimension is quite surprising. One can reduce the approximate near-neighbor problem in Euclidean space to the same problem on the hypercube (we sketch the details below). Together with the reduction from ANN to approximate near-neighbor (seen in previous lectures), this implies that one can answer ANN queries in high dimensional Euclidean space with similar performance. Kushilevitz, Ostrovsky and Rabani [KOR00] offered an alternative data-structure with somewhat inferior performance.

The value of the results shown in this write-up depends to a large extent on the reader's perspective. Indeed, for a small value of ε > 0, the query time O(d n^{1/(1+ε)}) is very close to linear in n, and is almost equivalent to just scanning the points. Thus, from the low-dimensional perspective, where ε is assumed to be small, this result is only slightly sublinear. On the other hand, if one is willing to pick ε to be large (say 10), then the result is clearly better than the naive algorithm, yielding a running time of (roughly) n^{1/11} for an ANN query.

The idea of doing locality sensitive hashing directly in Euclidean space, as done in Section 17.2, is not shocking after one has seen the Johnson-Lindenstrauss lemma. It is taken from a recent paper of Datar et al. [DNIM04]. In particular, the current analysis, which relies on computerized estimates, is far from being satisfactory. It would be nice to have a simpler and more elegant scheme for this case. This is an open problem for further research. Another open problem is to improve the performance of the LSH scheme.

The low-quality high-dimensional HST construction of the lemma above is taken from [Har01]. The running time of this lemma can be further improved to O(dn log n) by a more careful and involved implementation; see [CK95] for details.

From approximate near-neighbor in IR^d to approximate near-neighbor on the hypercube. The reduction is quite involved, and we only sketch the details. Let P be a set of n points in IR^d. We first reduce the dimension to k = O(ε^{-2} log n) using the Johnson-Lindenstrauss lemma. Next, we embed this space into ℓ_1^{k'} (this is the space IR^{k'}, where distances are measured in the L_1 metric instead of the regular L_2 metric), where k' = O(k/ε²). This can be done with distortion (1+ε).

Let Q be the resulting set of points in IR^{k'}. We want to solve approximate near-neighbor queries on this set of points, for radius r. As a first step, we partition the space into cells by taking a grid with sidelength (say) k'r and randomly translating it, clipping the points into the grid cells. It is now sufficient to solve the approximate near-neighbor problem inside each grid cell (which has bounded diameter as a function of r), since only with small probability does the random shift separate a query point from its near-neighbor. We amplify the probability of success by repeating this a polylogarithmic number of times. Thus, we can assume that P is contained inside a cube of sidelength k'nr, that it lies in IR^{k'}, and that the distance metric is the L_1 metric.

We next snap the points of P to a grid of sidelength (say) εr/k'. Thus, every point of P now has integer coordinates, bounded by a polynomial in log n and 1/ε. Next, we write the coordinates of the points of P using unary notation. (Thus, using 6 bits per coordinate, the point (2, 5) would be written as (110000, 111110).) It is now easy to verify that the Hamming distance between the resulting strings is equal to the L_1 distance between the points (a short code sketch of this encoding appears just before the bibliography). Thus, we can solve the near-neighbor problem for points in IR^d by solving it on the hypercube under the Hamming distance. See Indyk and Motwani [IM98] for more details. This relationship indicates that ANN on the hypercube is essentially equivalent to ANN in Euclidean space. In particular, making progress on ANN on the hypercube would probably lead to similar progress on the Euclidean ANN problem.

We have only scratched the surface of proximity problems in high dimensions. The interested reader is referred to the survey by Indyk [Ind04] for more information.

From previous lectures

Lemma (A) The multidimensional normal distribution is symmetric; that is, for any two points p, q ∈ IR^d such that ‖p‖ = ‖q‖, we have that g(p) = g(q), where g(·) is the density function of the multidimensional normal distribution N^d.
(B) The projection of the normal distribution onto any direction is a one dimensional normal distribution.
(C) Picking d variables X_1,...,X_d independently from the one dimensional normal distribution N results in a point (X_1,...,X_d) that has the multidimensional normal distribution N^d.
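For completeness, the unary encoding step of the reduction sketched above is short enough to spell out. The bound max_coord below stands for whatever polynomial bound the snapping step produces, and the concrete numbers are ours.

    def unary_encode(point, max_coord):
        # Each coordinate c becomes c ones followed by (max_coord - c) zeros, so the
        # Hamming distance between encodings equals the L1 distance between points.
        bits = []
        for c in point:
            bits.extend([1] * c + [0] * (max_coord - c))
        return bits

    p, q = unary_encode((2, 5), 6), unary_encode((4, 1), 6)
    assert sum(a != b for a, b in zip(p, q)) == abs(2 - 4) + abs(5 - 1)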

Bibliography

[CK95] P. B. Callahan and S. R. Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields. J. Assoc. Comput. Mach., 42:67-90, 1995.

[DNIM04] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. 20th Annu. ACM Sympos. Comput. Geom., pages 253-262, 2004.

[Har01] S. Har-Peled. A replacement for Voronoi diagrams of near linear size. In Proc. 42nd Annu. IEEE Sympos. Found. Comput. Sci., pages 94-103, 2001.

[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annu. ACM Sympos. Theory Comput., pages 604-613, 1998.

[Ind04] P. Indyk. Nearest neighbors in high-dimensional spaces. In J. E. Goodman and J. O'Rourke, editors, Handbook of Discrete and Computational Geometry, chapter 39. CRC Press LLC, Boca Raton, FL, 2nd edition, 2004.

[KOR00] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM J. Comput., 30(2):457-474, 2000.
