Approximate Nearest Neighbor (ANN) Search in High Dimensions
Chapter 17
Approximate Nearest Neighbor (ANN) Search in High Dimensions
By Sariel Har-Peled, February 4

(This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 License. To view a copy of this license, visit the Creative Commons web site, or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.)

17.1 ANN on the Hypercube

Hypercube and Hamming distance

Definition. The set of points H^d = {0,1}^d is the d-dimensional hypercube. A point p = (p_1, ..., p_d) ∈ H^d can be interpreted, naturally, as a binary string p_1 p_2 ... p_d. The Hamming distance d_H(p, q) between p, q ∈ H^d is the number of coordinates where p and q disagree.

It is easy to verify that the Hamming distance complies with the triangle inequality, and is thus a metric.

As we saw previously, all we need to solve (1+ε)-ANN efficiently is to efficiently solve the approximate near neighbor problem. Namely, given a set P of n points in H^d, a radius r > 0 and a parameter ε > 0, we want to decide for a query point q whether d_H(q, P) ≤ r or d_H(q, P) ≥ (1+ε)r, where d_H(q, P) = min_{p ∈ P} d_H(q, p).

Definition. For a set P of points, a data-structure D = D_NearNbr(P, r, (1+ε)r) solves the approximate near neighbor problem if, given a query point q, it works as follows.

  Near: If d_H(q, P) ≤ r, then D outputs a point p ∈ P such that d_H(p, q) ≤ (1+ε)r.
  Far: If d_H(q, P) ≥ (1+ε)r, then D outputs "d_H(q, P) > r".
  Don't know: If r ≤ d_H(q, P) ≤ (1+ε)r, then D may return either of the above answers.

Given such a data-structure, one can answer an approximate nearest neighbor query using O(log((log d)/ε)) queries to approximate near-neighbor data-structures. Indeed, the desired distance d_H(q, P) is an integer in the range 0, 1, ..., d. We can build a D_NearNbr data-structure for each of the distances (1+ε)^i, for i = 1, ..., M, where M = O(ε^{-1} log d). Performing
a binary search over these distances, using the corresponding approximate near-neighbor data-structures, resolves the approximate nearest-neighbor query with O(log M) near-neighbor queries. As such, in the following we concentrate on constructing the approximate near-neighbor data-structure (i.e., D_NearNbr).

Construction of the near-neighbor data-structure

On sense and sensitivity

Let P = {p_1, ..., p_n} be a subset of the vertices of the hypercube in d dimensions. In the following we assume that d = n^{O(1)}. Let r, ε > 0 be two prespecified parameters. We are interested in building an approximate near neighbor data-structure (i.e., D_NearNbr) for balls of radius r in the Hamming distance.

Definition. A family F of functions (defined over H^d) is (r, R, α, β)-sensitive if for any q, s ∈ H^d, we have:
  (A) If q ∈ b(s, r), then Pr[f(q) = f(s)] ≥ α.
  (B) If q ∉ b(s, R), then Pr[f(q) = f(s)] ≤ β,
where f is a randomly picked function from F, r < R, and α > β.

Intuitively, if we can construct an (r, R, α, β)-sensitive family, then we can distinguish between two points that are close together and two points that are far away from each other. Of course, the probabilities α and β might be very close to each other, and we need a way to do amplification.

A simple sensitive family. A priori it is not even clear that such a sensitive family exists, but it turns out that the family that exposes one random coordinate is sensitive.

Lemma. Let f_i(p) denote the function that returns the ith coordinate of p, for i = 1, ..., d. Consider the family of functions F = {f_1, ..., f_d}. Then, for any r > 0 and ε, the family F is (r, (1+ε)r, α, β)-sensitive, where α = 1 - r/d and β = 1 - r(1+ε)/d.

Proof: If q, s ∈ {0,1}^d are at distance at most r from each other (under the Hamming distance), then they differ in at most r coordinates. The probability that a random f ∈ F projects onto a coordinate on which q and s agree is therefore at least 1 - r/d.
Similarly, if d_H(q, s) ≥ (1+ε)r, then the probability that a random f ∈ F maps into a coordinate on which q and s agree is at most 1 - (1+ε)r/d.

A family with a large sensitivity gap. Let k be a parameter to be specified shortly, and consider the family of functions G that concatenates k of the given functions. Formally, let

    G = combine(F, k) = { g | g(p) = (f_1(p), ..., f_k(p)), for f_1, ..., f_k ∈ F }

be the set of all such functions.

Lemma. Given an (r, R, α, β)-sensitive family F, the family G = combine(F, k) is (r, R, α^k, β^k)-sensitive.
Proof: For two fixed points q, s ∈ H^d such that d_H(q, s) ≤ r, we have that Pr[f(q) = f(s)] ≥ α for a random f ∈ F. As such, for a random g ∈ G, we have that

    Pr[g(q) = g(s)] = Pr[f_1(q) = f_1(s) and f_2(q) = f_2(s) and ... and f_k(q) = f_k(s)]
                    = ∏_{i=1}^{k} Pr[f_i(q) = f_i(s)] ≥ α^k.

Similarly, if d_H(q, s) > R, then Pr[g(q) = g(s)] = ∏_{i=1}^{k} Pr[f_i(q) = f_i(s)] ≤ β^k.

The above lemma implies that we can build a family that has a gap between the lower and upper sensitivities; namely, α^k/β^k = (α/β)^k is arbitrarily large. The problem is that if α^k is too small, then we will have to use too many functions to detect whether or not there is a point close to the query point.

Nevertheless, consider the task of building a data-structure that finds all the points of P = {p_1, ..., p_n} that are equal, under a given function g ∈ G = combine(F, k), to a query point. To this end, we compute the strings g(p_1), ..., g(p_n) and store them (together with their associated points) in a hash table (or a prefix tree). Now, given a query point q, we compute g(q) and fetch from this data-structure all the stored strings equal to it. Clearly, this is a simple and efficient data-structure. All the points colliding with q are the natural candidates to be the nearest neighbor of q. By not storing the points explicitly, but instead using pointers into the original input set, we get the following easy result.

Lemma. Given a function g ∈ G = combine(F, k) (see the lemma above), and a set P ⊆ H^d of n points, one can construct a data-structure, in O(nk) time and using O(nk) additional space, such that given a query point q, one can report all the points in X = { p ∈ P | g(p) = g(q) } in O(k + |X|) time.

Amplifying sensitivity. Our task is now to amplify the sensitive family we currently have. To this end, for two τ-dimensional points x and y, let x ≈ y be the boolean function that returns true if there exists an index i such that x_i = y_i, and false otherwise.
Now, the regular = operator requires the vectors to be equal in all coordinates (i.e., it is ∧_i (x_i = y_i)), while x ≈ y is ∨_i (x_i = y_i). The previous construction, using this alternative equality operator, provides us with the required amplification.

Lemma. Given an (r, R, α^k, β^k)-sensitive family G, the family H = combine(G, τ), if one uses the ≈ operator to check equality, is (r, R, 1-(1-α^k)^τ, 1-(1-β^k)^τ)-sensitive.

Proof: For two fixed points q, s ∈ H^d such that d_H(q, s) ≤ r, we have that Pr[g(q) = g(s)] ≥ α^k for a random g ∈ G. As such, for a random h ∈ H, we have that

    Pr[h(q) ≈ h(s)] = Pr[g_1(q) = g_1(s) or g_2(q) = g_2(s) or ... or g_τ(q) = g_τ(s)]
                    = 1 - ∏_{i=1}^{τ} Pr[g_i(q) ≠ g_i(s)] ≥ 1 - (1-α^k)^τ.

Similarly, if d_H(q, s) > R, then Pr[h(q) ≈ h(s)] = 1 - ∏_{i=1}^{τ} Pr[g_i(q) ≠ g_i(s)] ≤ 1 - (1-β^k)^τ.
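To make the lemma concrete, the amplified collision probability p ↦ 1 - (1-p^k)^τ can be evaluated directly. The following minimal Python sketch (the function name and the sample values α = 0.9, β = 0.6 are ours, not from the text) shows how the AND step over k coordinates followed by the OR step over τ copies separates near from far pairs:

```python
def amplified(p, k, tau):
    """Collision probability after concatenating k functions (AND over k
    coordinates) and combining tau such concatenations with the ~= operator
    (OR over tau tables): p -> 1 - (1 - p**k)**tau."""
    return 1 - (1 - p ** k) ** tau

# A modest sensitivity gap alpha = 0.9 vs. beta = 0.6 gets amplified:
near = amplified(0.9, k=5, tau=8)   # close pairs still collide almost surely
far = amplified(0.6, k=5, tau=8)    # far pairs collide much more rarely
print(near, far)
```

Increasing k pushes both probabilities down (widening the ratio), and τ then lifts the "near" probability back up to a constant.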
To see the effect of the last lemma, it is useful to play with a concrete example. Consider an (r, R, α^k, β^k)-sensitive family where β^k = α^k/2, and yet α^k is very small. Setting τ = 1/α^k, the resulting family is (roughly) (r, R, 1 - 1/e, 1 - 1/√e)-sensitive. Namely, the gap has shrunk, but the threshold sensitivity is considerably higher. In particular, it is now a constant, and the gap is also a constant.

Using the family H as a data-structure to store P is more involved than before. Indeed, a random function h = (g_1, ..., g_τ) ∈ H = combine(G, τ) requires us to build τ data-structures, one for each of the functions g_1, ..., g_τ, using the hash-table construction above. Now, given a query point, we retrieve all the points of P that collide with it under each one of these functions, by querying each of these data-structures.

Lemma. Given a function h ∈ H = combine(G, τ), and a set P ⊆ H^d of n points, one can construct a data-structure, in O(nkτ) time and using O(nkτ) additional space, such that given a query point q, one can report all the points in X = { p ∈ P | h(p) ≈ h(q) } in O(kτ + |X|) time.

The near-neighbor data-structure and handling a query

We construct the data-structure D of the lemma above, with parameters k and τ to be determined shortly, for a random function h ∈ H. Given a query point q, we retrieve all the points that collide with it under h, and scan these points one by one, computing their distance to q. As soon as we encounter a point s ∈ P such that d_H(q, s) ≤ R, the data-structure returns true together with s.

Assume we know that the expected number of points of P \ b(q, R) (here R = (1+ε)r) that collide with q in D is L (we will figure out the value of L below). To ensure a worst-case query time, the query aborts after checking 4L + 1 points and returns false. Naturally, the data-structure also returns false if all the points encountered are at distance larger than R from q.
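The τ-table structure and its query procedure can be sketched in a few lines of Python (a toy sketch under our own naming; each table keys the points by k randomly chosen coordinates, i.e., one function g_j ∈ G, and the query scans candidates colliding in any table, giving up after a fixed budget — the 4L + 1 cutoff):

```python
import random

def hamming(p, q):
    """Hamming distance between two equal-length 0/1 tuples."""
    return sum(a != b for a, b in zip(p, q))

def build(points, d, k, tau, rng):
    """Build tau hash tables; table j keys each point by k random coordinates
    (one function g_j from G = combine(F, k)). Points are stored by index,
    i.e., as pointers into the input set."""
    tables = []
    for _ in range(tau):
        idx = [rng.randrange(d) for _ in range(k)]
        table = {}
        for j, p in enumerate(points):
            table.setdefault(tuple(p[i] for i in idx), []).append(j)
        tables.append((idx, table))
    return tables

def near_query(tables, points, q, R, budget):
    """Scan the points colliding with q in some table; return the first point
    within distance R, or None after examining `budget` far candidates."""
    checked = 0
    for idx, table in tables:
        for j in table.get(tuple(q[i] for i in idx), []):
            if hamming(points[j], q) <= R:
                return points[j]
            checked += 1
            if checked > budget:
                return None
    return None
```

A point identical to q collides with it in every table, so it is always found; a far point collides in a given table only with probability at most β^k. (The sketch may examine the same far point once per table; de-duplicating the candidates is an easy refinement.)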
Clearly, the query time of this data-structure is O(kτ + dL). We are left with the task of fine-tuning the parameters τ and k to get the fastest possible query time, while keeping a reasonable probability that the data-structure succeeds. Figuring out the right values is technically tedious, and we do it next.

Setting the parameters

If there exists p ∈ P such that d_H(q, p) ≤ r, then the probability of this point colliding with q under the function h is φ ≥ 1 - (1-α^k)^τ. Let us demand that the data-structure succeeds with probability at least 3/4. To this end, we set τ = 4⌈1/α^k⌉, so that

    φ ≥ 1 - (1-α^k)^τ ≥ 1 - exp(-α^k τ) ≥ 1 - exp(-4) ≥ 3/4,     (17.1)

since 1 - x ≤ exp(-x) for x ≥ 0.

Lemma. The expected number of points of P \ b(q, R) colliding with the query point is L = O(n (β/α)^k).
Proof: Consider the points of P \ b(q, R). We would like to bound the number of points of this set that collide with the query point. The probability that a point p ∈ P \ b(q, R) collides with the query point is

    ψ = 1 - (1-β^k)^τ = 1 - (1-β^k)^{4⌈1/α^k⌉} ≤ 1 - exp(-8 β^k/α^k) ≤ 8 β^k/α^k = 8 (β/α)^k,

since 1 - x ≥ exp(-2x) for x ∈ [0, 1/2], and exp(-z) ≥ 1 - z for z ≥ 0. Namely, the expected number of points of P \ b(q, R) colliding with the query point is ψ n.

By the reporting lemma above, extracting the O(L) colliding points takes O(kτ + L) time, and computing the distance from the query point to each of them takes O(Ld) time. As such, the query time is

    O(kτ + Ld) = O(kτ + n d (β/α)^k).

To minimize the query time, we approximately solve the equation requiring the above two terms to be equal (we ignore d since, intuitively, it should be small compared to n). Requiring kτ = n(β/α)^k, and using τ ≈ 4/α^k, we get

    k/α^k ≈ n β^k/α^k   ⟺   k ≈ n β^k   ⟺   (1/β)^k ≈ n/k.

Setting k = ⌈log_{1/β} n⌉ = ⌈ln n / ln(1/β)⌉, we have that β^k ≤ 1/n and, by Eq. (17.1), that

    τ = 4⌈1/α^k⌉ = O( exp( ln n · ln(1/α) / ln(1/β) ) ) = O(n^ρ),   where   ρ = ln(1/α) / ln(1/β).     (17.2)

Lemma. (A) For x ∈ [0, 1) and t ≥ 1 such that 1 - tx > 0, we have ln(1-x)/ln(1-tx) ≤ 1/t.
(B) For α = 1 - r/d and β = 1 - r(1+ε)/d, we have that ρ = ln(1/α)/ln(1/β) ≤ 1/(1+ε).

Proof: (A) Since ln(1-tx) < 0, the claim is equivalent to t ln(1-x) ≥ ln(1-tx), which in turn is equivalent to g(x) = (1-tx) - (1-x)^t ≤ 0. This is trivially true for x = 0. Furthermore, taking the derivative, we see that g'(x) = -t + t(1-x)^{t-1}, which is non-positive for x ∈ [0, 1) and t ≥ 1. Therefore, g is non-increasing in the interval of interest, and so g(x) ≤ 0 for all values in this interval.

(B) Indeed,

    ρ = ln(1/α)/ln(1/β) = ln α / ln β = ln(1 - r/d) / ln(1 - (1+ε) r/d) ≤ 1/(1+ε),

by part (A), applied with x = r/d and t = 1 + ε.

In the following, it will be convenient to consider d to be considerably larger than r. This can be ensured by (conceptually) padding the points with fake coordinates that are all zero.
It is easy to verify that this hack does not affect the algorithm's performance in any way; it is just a trick to make the analysis simpler. In particular, we assume from now on that d > 2(1+ε)r.

Lemma. For α = 1 - r/d, β = 1 - r(1+ε)/d, and n and d as above, we have that (i) τ = O(n^{1/(1+ε)}), (ii) k = O(ln n), and (iii) L = O(n^{1/(1+ε)}).
Proof: By Eq. (17.1) and Eq. (17.2), τ = 4⌈1/α^k⌉ = O(n^ρ) = O(n^{1/(1+ε)}), by part (B) of the lemma above. Now, β = 1 - r(1+ε)/d ≥ 1/2, since we assumed that d > 2(1+ε)r. As such, we have

    k = ⌈log_{1/β} n⌉ = ⌈ln n / ln(1/β)⌉ = O(ln n).

By the lemma above, L = O(n (β/α)^k). Now β^k ≤ 1/n, and as such L = O(1/α^k) = O(τ) = O(n^{1/(1+ε)}).

The result

Theorem. Given a set P of n points on the hypercube H^d, and parameters ε > 0 and r > 0, one can build a data-structure D = D_NearNbr(P, r, (1+ε)r) that solves the approximate near neighbor problem (see the definition above). The data-structure answers a query successfully with high probability. In addition:
  (A) The query time is O(d n^{1/(1+ε)} log n).
  (B) The preprocessing time to build this data-structure is O(n^{1+1/(1+ε)} log² n).
  (C) The space required to store this data-structure is O(nd + n^{1+1/(1+ε)} log² n).

Proof: Our building block is the data-structure described above. By Markov's inequality, the probability that the algorithm has to abort because of too many collisions with points of P \ b(q, (1+ε)r) is bounded by 1/4 (since the algorithm tries 4L + 1 points). Also, if there is a point inside b(q, r), the algorithm finds it with probability at least 3/4, by Eq. (17.1). As such, with probability at least 1/2, this data-structure returns the correct answer; its query time is O(kτ + Ld).

This data-structure succeeds only with constant probability. To achieve high probability, we construct O(log n) such data-structures and perform the near-neighbor query in each one of them. As such, the query time is

    O((kτ + Ld) log n) = O(n^{1/(1+ε)} log² n + d n^{1/(1+ε)} log n) = O(d n^{1/(1+ε)} log n),

by the lemma above, and since d = Ω(lg n) if P contains n distinct points of H^d.

As for the preprocessing time, it is O(nkτ log n) = O(n^{1+1/(1+ε)} log² n). Finally, this data-structure requires O(dn) space to store the input points, plus an additional O(nkτ log n) = O(n^{1+1/(1+ε)} log² n) space.
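Under the hypercube sensitivities α = 1 - r/d and β = 1 - (1+ε)r/d, the parameter choices above are easy to compute explicitly. The following Python sketch (our own helper, not from the text) returns k, τ and the exponent ρ of Eq. (17.2):

```python
import math

def lsh_params(n, d, r, eps):
    """k = ceil(log_{1/beta} n), so beta**k <= 1/n; tau = 4*ceil(1/alpha**k),
    so the success probability of Eq. (17.1) is at least 3/4;
    rho = ln(1/alpha)/ln(1/beta) is the query-time exponent."""
    alpha = 1 - r / d
    beta = 1 - (1 + eps) * r / d
    k = math.ceil(math.log(n) / math.log(1 / beta))
    tau = 4 * math.ceil(1 / alpha ** k)
    rho = math.log(1 / alpha) / math.log(1 / beta)
    return k, tau, rho

k, tau, rho = lsh_params(n=1000, d=1000, r=10, eps=1.0)
print(k, tau, rho)   # rho <= 1/(1+eps) = 0.5, as the lemma guarantees
```

Note that τ grows like n^ρ, which is where the n^{1/(1+ε)} factor in the query time comes from.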
In the hypercube case, where d = n^{O(1)}, we can now build M = O(log_{1+ε} d) = O(ε^{-1} log d) such data-structures, corresponding to the radii r_1, ..., r_M, where r_i = (1+ε)^i for i = 1, ..., M, so that a (1+ε)-ANN query can be answered by a binary search over these data-structures.

Theorem. Given a set P of n points on the hypercube H^d (where d = n^{O(1)}), and a parameter ε > 0, one can build a data-structure that answers approximate nearest-neighbor queries (under the Hamming distance) using O(dn + n^{1+1/(1+ε)} ε^{-1} log² n log d) space, such that given a query point q, one can return a (1+ε)-ANN of q in P (under the Hamming distance) in O(d n^{1/(1+ε)} log n log(ε^{-1} log d)) time. The result returned is correct with high probability.
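The binary search in the theorem can be sketched as follows (a Python sketch with our own naming; `near_nbr(q, i)` is a stand-in that we assume answers the D_NearNbr decision at radius r_i = (1+ε)^i, returning a witness point or None, and we assume these answers are monotone in i, as they are with high probability):

```python
import math

def ann_by_binary_search(near_nbr, q, d, eps):
    """Reduce (1+eps)-ANN to O(log M) near-neighbor decisions by binary
    search over the radii r_i = (1+eps)**i, i = 1..M, M = O(log d / eps)."""
    M = math.ceil(math.log(d) / math.log(1 + eps))
    lo, hi = 1, M
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        p = near_nbr(q, mid)
        if p is not None:
            best, hi = p, mid - 1    # success: try a smaller radius
        else:
            lo = mid + 1             # failure: the answer lies further out
    return best
```

The witness returned at the smallest succeeding radius index is a (1+ε)-ANN.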
Remark. The result of the above theorem requires the queries to be oblivious of the (randomized) data-structure. Indeed, for any instantiation of the data-structure there exist query points on which it fails. In particular, if we perform a sequence of ANN queries using such a data-structure, where the queries depend on earlier returned answers, then the high-probability guarantee of success is no longer implied by the above analysis (it might still hold for other reasons, naturally).

17.2 LSH and ANN in Euclidean Space

Preliminaries

Lemma. Let X = (X_1, ..., X_d) be a vector of d independent variables with standard normal distribution N, and let v = (v_1, ..., v_d) ∈ IR^d. Then ⟨v, X⟩ = Σ_i v_i X_i is distributed as ‖v‖ Z, where Z ∼ N.

Proof: By the lemma quoted at the end of this chapter (from previous lectures), the point X has the multidimensional normal distribution N^d. As such, if ‖v‖ = 1 then the claim holds by the symmetry of the normal distribution. Indeed, let e_1 = (1, 0, ..., 0). By the symmetry of the d-dimensional normal distribution, we have that ⟨v, X⟩ ∼ ⟨e_1, X⟩ = X_1 ∼ N. Otherwise, ⟨v/‖v‖, X⟩ ∼ N, and as such ⟨v, X⟩ ∼ N(0, ‖v‖²), which is indeed the distribution of ‖v‖ Z.

Definition. A distribution D over IR is called p-stable, for p ≥ 0, if for any n real numbers v_1, ..., v_n and n independent variables X_1, ..., X_n with distribution D, the random variable Σ_i v_i X_i has the same distribution as the variable (Σ_i |v_i|^p)^{1/p} X, where X is a random variable with distribution D.

By the lemma above, the normal distribution is 2-stable.

Locality Sensitive Hashing

Let p, q be two points in IR^d. We want to perform an experiment to decide whether ‖p - q‖ ≤ 1 or ‖p - q‖ ≥ η, where η = 1 + ε. We randomly choose a vector v from the d-dimensional normal distribution N^d (which is 2-stable). Next, let r be a parameter, and let t be a random number chosen uniformly from the interval [0, r]. For p ∈ IR^d, consider the random hash function

    h(p) = ⌊(⟨p, v⟩ + t)/r⌋.     (17.3)

Assume that the distance between p and q is η, and that the distance between the projections of the two points onto the direction v is β. Then, the probability that p and q get the same hash value is max(1 - β/r, 0), since this is the probability that the random shift will not separate them. Indeed, take the line through v to be the x-axis, and assume (translating so that the relevant cell boundary is at 0) that q is projected to 0 and p is projected to β, for β ≤ r. Clearly, p and q get mapped to the same value by h(·) if and only if t ∈ [0, r - β), which happens with probability 1 - β/r, as claimed.
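A sketch of this hash function, together with a numerical evaluation of the collision probability derived next, might look as follows in Python (names and the sample parameters are ours; the integral is evaluated by a simple midpoint rule):

```python
import math
import random

def make_hash(d, r, rng):
    """One Euclidean LSH function h(p) = floor((<p, v> + t)/r), with v drawn
    from the d-dimensional normal distribution (2-stable) and t ~ Unif[0, r]."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    t = rng.uniform(0.0, r)
    return lambda p: math.floor((sum(x * w for x, w in zip(p, v)) + t) / r)

def alpha(eta, r, steps=4000):
    """Collision probability of two points at distance eta (midpoint rule):
    alpha(eta, r) = int_0^r 2/(sqrt(2*pi)*eta) * exp(-b^2/(2*eta^2)) * (1 - b/r) db."""
    h = r / steps
    total = 0.0
    for i in range(steps):
        b = (i + 0.5) * h
        total += (2.0 / (math.sqrt(2 * math.pi) * eta)
                  * math.exp(-b * b / (2 * eta * eta)) * (1 - b / r) * h)
    return total

# Close pairs collide more often than far pairs:
print(alpha(1.0, 4.0), alpha(2.0, 4.0))
```

Sweeping r and minimizing log(1/α(1, r))/log(1/α(1+ε, r)) reproduces the numerical optimization mentioned below.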
As such, the probability of collision is

    α(η) = Pr[h(p) = h(q)] = ∫_{β=0}^{r} Pr[ |⟨p, v⟩ - ⟨q, v⟩| = β ] (1 - β/r) dβ.

However, since v is chosen from a 2-stable distribution, we have that ⟨p, v⟩ - ⟨q, v⟩ = ⟨p - q, v⟩ ∼ N(0, ‖p - q‖²). Since we are considering the absolute value of this variable, we need to multiply its density by two. Thus, we have

    α(η, r) = ∫_{β=0}^{r} (2 / (√(2π) η)) exp(-β² / (2η²)) (1 - β/r) dβ.

Intuitively, we care about the difference α(1+ε, r) - α(1, r), and we would like to make the gap as large as possible (by choosing the right value of r). Unfortunately, this integral is unfriendly, and we have to resort to numerical computation. In fact, if we are going to use this hashing scheme for constructing locality sensitive hashing, as in the hypercube case, then we care about the ratio

    ρ(1+ε) = min_r  log(1/α(1, r)) / log(1/α(1+ε, r)),

see Eq. (17.2). The following is verified using numerical computations on a computer.

Lemma ([DNIM04]). One can choose r such that ρ(1+ε) ≤ 1/(1+ε).

This lemma implies that the hash functions defined by Eq. (17.3) are (1, 1+ε, α', β')-sensitive and, furthermore, that ρ = log(1/α')/log(1/β') ≤ 1/(1+ε), for appropriate values of α' and β'. As such, we can use this hashing family to construct an approximate near neighbor data-structure D_NearNbr(P, r, (1+ε)r) for a set P of points in IR^d. Following the same argumentation as in the hypercube case, we have the following.

Theorem. Given a set P of n points in IR^d, and parameters ε > 0 and r > 0, one can build a data-structure D_NearNbr = D_NearNbr(P, r, (1+ε)r), such that given a query point q, one can decide:
  If b(q, r) ∩ P ≠ ∅, then D_NearNbr returns a point u ∈ P such that ‖u - q‖ ≤ (1+ε)r.
  If b(q, (1+ε)r) ∩ P = ∅, then D_NearNbr returns that no point is within distance r of q.
In any other case, either of the answers is correct. The query time is O(d n^{1/(1+ε)} log n) and the space used is O(dn + n^{1+1/(1+ε)} log n).
The result returned is correct with high probability.

ANN in High Dimensional Euclidean Space

Unlike the hypercube case, where we could do a direct binary search over the distances, here we need to use the reduction from ANN to near-neighbor queries. We will need the following result (which follows from what we have seen in previous lectures).
Theorem. Given a set P of n points in IR^d, one can construct a data-structure D that answers (1+ε)-ANN queries by performing O(log(n/ε)) (1+ε)-approximate near-neighbor queries. The total number of points stored in these approximate near-neighbor data-structures of D is O(n ε^{-1} log(n/ε)).

Constructing the data-structure of this theorem requires building a low quality HST. Unfortunately, the HST constructions seen previously are exponential in the dimension, or take quadratic time. We next present a faster scheme.

Low quality HST in high dimensional Euclidean space

Lemma. Let P be a set of n points in IR^d. One can compute an nd-HST of P in O(dn log² n) time (note that the constant hidden by the O notation does not depend on d).

Proof: Our construction is based on a recursive decomposition of the point set. In each stage, we split the point set into two subsets, recursively compute an nd-HST for each of them, and merge the two trees into a single tree, by creating a new vertex, assigning it an appropriate value, and hanging the two subtrees from this node.

To carry this out, we try to separate the set into two subsets that are far from each other. Let R = R(P) be the minimum axis-parallel box containing P, and let

    ν = ℓ(P) = Σ_{i=1}^{d} ‖I_i(R)‖,

where I_i(R) is the projection of R to the ith dimension. Clearly, one can find an axis-parallel strip H of width at least ν/((n-1)d), such that there is at least one point of P on each of its sides, and there are no points of P inside H. Indeed, to find this strip, project the point set onto the ith dimension, and find the longest interval between two consecutive points; repeat this process for i = 1, ..., d, and use the longest interval encountered. Clearly, the strip H corresponding to this interval is of width at least ν/((n-1)d). On the other hand, diam(P) ≤ ν.

Now, recursively continue the construction of two trees T^+ and T^-, for P^+ and P^-, respectively, where P^+, P^- is the splitting of P into two sets by H.
We hang T^+ and T^- from the root node v, and set Δ_v = ν. We claim that the resulting tree T is an nd-HST. To this end, observe that diam(P) ≤ Δ_v, and for a point p ∈ P^- and a point q ∈ P^+, we have ‖p - q‖ ≥ ν/((n-1)d) ≥ Δ_v/(nd), which implies the claim.

To implement this construction efficiently, we use balanced search trees to store the points sorted by each coordinate. Let D_1, ..., D_d be those trees, where D_i stores the points of P in ascending order according to the ith axis, for i = 1, ..., d. We augment them so that for every node v ∈ D_i we know the largest empty interval along the ith axis among the points P_v (i.e., the points stored in the subtree of v in D_i). Thus, finding the largest strip to split along can be done in O(d log n) time.

Next, we need to split the d trees into two families of d trees. Assume we split according to the first axis. We can split D_1 in O(log n) time using the splitting operation provided by the search tree (treaps, for example, support such a split in O(log n) time). Assume that this splits P into two sets L and R, where |L| ≤ |R|. We still need to split the other d - 1 search trees. This is done by deleting all the points of L from those trees, and building d - 1 new search trees for L. This takes O(|L| d log n) time, which we charge to the points of L. Since in every split only the points in the smaller portion of the split get charged, it follows that every point can be charged at most O(log n) times during the construction. Thus, the overall construction time is O(dn log² n).
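The splitting step — find the axis carrying the longest empty gap between consecutive coordinates — is easy to sketch directly. The following Python helper (our own naming; this is the naive O(dn log n)-per-split version, without the augmented search trees used in the proof to make the whole construction fast) returns the split axis, a threshold inside the gap, and the gap width:

```python
def widest_empty_strip(points):
    """Return (axis, threshold, gap): the widest empty axis-parallel strip,
    found by sorting each coordinate and taking the largest consecutive gap.
    The threshold is the midpoint of that gap, so splitting P by
    p[axis] < threshold realizes the strip H of the proof."""
    d = len(points[0])
    best = (0, 0.0, -1.0)          # (axis, midpoint of gap, gap width)
    for i in range(d):
        xs = sorted(p[i] for p in points)
        for a, b in zip(xs, xs[1:]):
            if b - a > best[2]:
                best = (i, (a + b) / 2, b - a)
    return best
```

Recursing on the two sides of the returned threshold, and labeling each internal node with ν = ℓ(P), yields the nd-HST of the lemma.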
The overall result

Plugging the HST construction into the reduction above, we have:

Theorem. Given a set P of n points in IR^d, and parameters ε > 0 and r > 0, one can build an ANN data-structure using O(dn + n^{1+1/(1+ε)} ε^{-2} log³(n/ε)) space, such that given a query point q, one can return a (1+ε)-ANN of q in P in O(d n^{1/(1+ε)} (log n) log(n/ε)) time. The result returned is correct with high probability. The construction time is O(d n^{1+1/(1+ε)} ε^{-2} log³(n/ε)).

Proof: We compute the low quality HST using the lemma above, in O(dn log² n) time. Using this HST, we construct the reduction data-structure D, but without yet building the D_NearNbr data-structures. We then traverse D, and construct the required D_NearNbr data-structures using the Euclidean near-neighbor theorem. We only need to prove the bound on the space. Observe that we need to store each point only once, since every other place can refer to the point by a pointer; this gives the O(nd) term. The other term follows by plugging the bound of the near-neighbor theorem into the bound of the reduction theorem.

Bibliographical notes

Section 17.1 follows the exposition of Indyk and Motwani [IM98]. The fact that one can perform approximate nearest neighbor search in high dimensions in time and space polynomial in the dimension is quite surprising. One can reduce the approximate near-neighbor problem in Euclidean space to the same question on the hypercube (we sketch the details below). Together with the reduction from ANN to approximate near-neighbor (seen in previous lectures), this implies that one can answer ANN queries in high dimensional Euclidean space with similar performance. Kushilevitz, Ostrovsky and Rabani [KOR00] offered an alternative data-structure with somewhat inferior performance.

The value of the results shown in this write-up depends to a large extent on the reader's perspective. Indeed, for a small value of ε > 0, the query time O(dn^{1/(1+ε)}) is very close to a linear dependency on n, and is almost equivalent to just scanning the points.
Thus, from the low-dimensional perspective, where ε is assumed to be small, this result is only slightly sublinear. On the other hand, if one is willing to pick ε to be large (say 10), then the result is clearly better than the naive algorithm, yielding a running time for an ANN query of (roughly) n^{1/11}.

The idea of doing locality sensitive hashing directly in Euclidean space, as done in Section 17.2, is not shocking after seeing the Johnson-Lindenstrauss lemma. It is taken from a recent paper of Datar et al. [DNIM04]. In particular, the current analysis, which relies on computerized estimates, is far from being satisfactory; it would be nice to have a simpler and more elegant scheme for this case. This is an open problem for further research. Another open problem is to improve the performance of the LSH scheme.
The low-quality high-dimensional HST construction of the lemma above is taken from [Har01]. The running time can be further improved to O(dn log n) by a more careful and involved implementation; see [CK95] for details.

From approximate near-neighbor in IR^d to approximate near-neighbor on the hypercube. The reduction is quite involved, and we only sketch the details. Let P be a set of n points in IR^d. We first reduce the dimension to k = O(ε^{-2} log n) using the Johnson-Lindenstrauss lemma. Next, we embed this space into ℓ_1^{k'} (this is the space IR^{k'}, where distances are measured under the L_1 metric instead of the regular L_2 metric), where k' = O(k/ε²). This can be done with distortion (1+ε).

Let Q be the resulting set of points in IR^{k'}. We want to answer approximate near-neighbor queries on this set of points, for radius r. As a first step, we partition the space into cells by taking a grid with side length (say) k'r, translating it randomly, and grouping the points by the grid cell containing them. It is now sufficient to solve the approximate near-neighbor problem inside each grid cell (which has diameter bounded as a function of r), since only with small probability does the random grid separate a query point from a near neighbor; we amplify the success probability by repeating this a polylogarithmic number of times.

Thus, we can assume that P is contained inside a cube of side length k'nr, that it lies in IR^{k'}, and that the distance metric is the L_1 metric. We next snap the points of P to a grid of side length (say) εr/k'. Now every point of P has integer coordinates, bounded by a polynomial in log n and 1/ε. Next, we write the coordinates of the points of P in unary. (Thus, with 5 bits per coordinate, the point (2, 5) would be written as (11000, 11111).) It is now easy to verify that the Hamming distance on the resulting strings is equivalent to the L_1 distance between the points.
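The unary step of the sketch is the only part that is trivial to verify in code. A minimal Python sketch (our own helper names; we encode an integer x as x ones followed by zeros, as in the example above):

```python
def unary(x, m):
    """Encode an integer 0 <= x <= m as m bits: x ones followed by m - x zeros."""
    return (1,) * x + (0,) * (m - x)

def embed(point, m):
    """Concatenate the unary encodings of the coordinates; the Hamming
    distance between two embedded points then equals their L1 distance."""
    out = ()
    for x in point:
        out += unary(x, m)
    return out

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

p, q = (2, 5), (4, 1)
print(hamming(embed(p, 6), embed(q, 6)))   # equals |2-4| + |5-1| = 6
```

Indeed, the encodings of x and y disagree in exactly |x - y| positions, so Hamming distance and L_1 distance coincide coordinate by coordinate.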
Thus, we can solve the near-neighbor problem for points in IR^d by solving it on the hypercube under the Hamming distance. See Indyk and Motwani [IM98] for more details. This relationship indicates that ANN on the hypercube is equivalent to ANN in Euclidean space. In particular, making progress on ANN on the hypercube would probably lead to similar progress on the Euclidean ANN problem.

We have only scratched the surface of proximity problems in high dimensions. The interested reader is referred to the survey by Indyk [Ind04] for more information.

From previous lectures

Lemma. (A) The multidimensional normal distribution is symmetric; that is, for any two points p, q ∈ IR^d such that ‖p‖ = ‖q‖, we have that g(p) = g(q), where g(·) is the density function of the multidimensional normal distribution N^d.
(B) The projection of the normal distribution onto any direction is a one dimensional normal distribution.
(C) Picking d variables X_1, ..., X_d from the one dimensional normal distribution N results in a point (X_1, ..., X_d) that has the multidimensional normal distribution N^d.

Bibliography

[CK95] P. B. Callahan and S. R. Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields. J. Assoc. Comput. Mach., 42:67-90, 1995.
[DNIM04] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. 20th Annu. ACM Sympos. Comput. Geom., 2004.

[Har01] S. Har-Peled. A replacement for Voronoi diagrams of near linear size. In Proc. 42nd Annu. IEEE Sympos. Found. Comput. Sci., 2001.

[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annu. ACM Sympos. Theory Comput., 1998.

[Ind04] P. Indyk. Nearest neighbors in high-dimensional spaces. In J. E. Goodman and J. O'Rourke, editors, Handbook of Discrete and Computational Geometry, chapter 39. CRC Press LLC, Boca Raton, FL, 2nd edition, 2004.

[KOR00] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM J. Comput., 30(2), 2000.
More informationFly Cheaply: On the Minimum Fuel Consumption Problem
Journal of Algorithms 41, 330 337 (2001) doi:10.1006/jagm.2001.1189, available online at http://www.idealibrary.com on Fly Cheaply: On the Minimum Fuel Consumption Problem Timothy M. Chan Department of
More informationSimilarity Search in High Dimensions II. Piotr Indyk MIT
Similarity Search in High Dimensions II Piotr Indyk MIT Approximate Near(est) Neighbor c-approximate Nearest Neighbor: build data structure which, for any query q returns p P, p-q cr, where r is the distance
More informationProximity in the Age of Distraction: Robust Approximate Nearest Neighbor Search
Proximity in the Age of Distraction: Robust Approximate Nearest Neighbor Search Sariel Har-Peled Sepideh Mahabadi November 24, 2015 Abstract We introduce a new variant of the nearest neighbor search problem,
More informationLecture 7: Passive Learning
CS 880: Advanced Complexity Theory 2/8/2008 Lecture 7: Passive Learning Instructor: Dieter van Melkebeek Scribe: Tom Watson In the previous lectures, we studied harmonic analysis as a tool for analyzing
More informationarxiv:cs/ v1 [cs.cg] 7 Feb 2006
Approximate Weighted Farthest Neighbors and Minimum Dilation Stars John Augustine, David Eppstein, and Kevin A. Wortman Computer Science Department University of California, Irvine Irvine, CA 92697, USA
More informationApproximate Clustering via Core-Sets
Approximate Clustering via Core-Sets Mihai Bădoiu Sariel Har-Peled Piotr Indyk 22/02/2002 17:40 Abstract In this paper, we show that for several clustering problems one can extract a small set of points,
More informationSimilarity searching, or how to find your neighbors efficiently
Similarity searching, or how to find your neighbors efficiently Robert Krauthgamer Weizmann Institute of Science CS Research Day for Prospective Students May 1, 2009 Background Geometric spaces and techniques
More informationApproximate Voronoi Diagrams
CS468, Mon. Oct. 30 th, 2006 Approximate Voronoi Diagrams Presentation by Maks Ovsjanikov S. Har-Peled s notes, Chapters 6 and 7 1-1 Outline Preliminaries Problem Statement ANN using PLEB } Bounds and
More informationOn Approximating the Depth and Related Problems
On Approximating the Depth and Related Problems Boris Aronov Polytechnic University, Brooklyn, NY Sariel Har-Peled UIUC, Urbana, IL 1: Motivation: Operation Inflicting Freedom Input: R - set red points
More informationApproximating a Convex Body by An Ellipsoid
Chapter 1 Approximating a Convex Body by An Ellipsoid By Sariel Har-Peled, May 10, 010 1 Is there anything in the Geneva Convention about the rules of war in peacetime? Stalnko wanted to know, crawling
More informationarxiv: v2 [cs.ds] 3 Oct 2017
Orthogonal Vectors Indexing Isaac Goldstein 1, Moshe Lewenstein 1, and Ely Porat 1 1 Bar-Ilan University, Ramat Gan, Israel {goldshi,moshe,porately}@cs.biu.ac.il arxiv:1710.00586v2 [cs.ds] 3 Oct 2017 Abstract
More informationCoresets for k-means and k-median Clustering and their Applications
Coresets for k-means and k-median Clustering and their Applications Sariel Har-Peled Soham Mazumdar November 7, 2003 Abstract In this paper, we show the existence of small coresets for the problems of
More informationClustering in High Dimensions. Mihai Bădoiu
Clustering in High Dimensions by Mihai Bădoiu Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Bachelor of Science
More informationTransforming Hierarchical Trees on Metric Spaces
CCCG 016, Vancouver, British Columbia, August 3 5, 016 Transforming Hierarchical Trees on Metric Spaces Mahmoodreza Jahanseir Donald R. Sheehy Abstract We show how a simple hierarchical tree called a cover
More informationRandomized Algorithms
Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models
More informationAlgorithms, Geometry and Learning. Reading group Paris Syminelakis
Algorithms, Geometry and Learning Reading group Paris Syminelakis October 11, 2016 2 Contents 1 Local Dimensionality Reduction 5 1 Introduction.................................... 5 2 Definitions and Results..............................
More informationGeometric Optimization Problems over Sliding Windows
Geometric Optimization Problems over Sliding Windows Timothy M. Chan and Bashir S. Sadjad School of Computer Science University of Waterloo Waterloo, Ontario, N2L 3G1, Canada {tmchan,bssadjad}@uwaterloo.ca
More informationLecture 18: March 15
CS71 Randomness & Computation Spring 018 Instructor: Alistair Sinclair Lecture 18: March 15 Disclaimer: These notes have not been subjected to the usual scrutiny accorded to formal publications. They may
More informationChapter 23. Fast Fourier Transform Introduction. By Sariel Har-Peled, November 28, Version: 0.11
Chapter 23 Fast Fourier Transform By Sariel Har-Peled, November 28, 208 Version: 0 But now, reflecting further, there begins to creep into his breast a touch of fellow-feeling for his imitators For it
More informationPartitioning Metric Spaces
Partitioning Metric Spaces Computational and Metric Geometry Instructor: Yury Makarychev 1 Multiway Cut Problem 1.1 Preliminaries Definition 1.1. We are given a graph G = (V, E) and a set of terminals
More informationProjective Clustering in High Dimensions using Core-Sets
Projective Clustering in High Dimensions using Core-Sets Sariel Har-Peled Kasturi R. Varadarajan January 31, 2003 Abstract In this paper, we show that there exists a small core-set for the problem of computing
More information16 Embeddings of the Euclidean metric
16 Embeddings of the Euclidean metric In today s lecture, we will consider how well we can embed n points in the Euclidean metric (l 2 ) into other l p metrics. More formally, we ask the following question.
More informationReporting Neighbors in High-Dimensional Euclidean Space
Reporting Neighbors in High-Dimensional Euclidean Space Dror Aiger Haim Kaplan Micha Sharir Abstract We consider the following problem, which arises in many database and web-based applications: Given a
More informationAn efficient approximation for point-set diameter in higher dimensions
CCCG 2018, Winnipeg, Canada, August 8 10, 2018 An efficient approximation for point-set diameter in higher dimensions Mahdi Imanparast Seyed Naser Hashemi Ali Mohades Abstract In this paper, we study the
More informationAnalysis of Algorithms I: Perfect Hashing
Analysis of Algorithms I: Perfect Hashing Xi Chen Columbia University Goal: Let U = {0, 1,..., p 1} be a huge universe set. Given a static subset V U of n keys (here static means we will never change the
More informationOptimal Data-Dependent Hashing for Approximate Near Neighbors
Optimal Data-Dependent Hashing for Approximate Near Neighbors Alexandr Andoni Ilya Razenshteyn January 7, 015 Abstract We show an optimal data-dependent hashing scheme for the approximate near neighbor
More informationWolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig
Multimedia Databases Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 13 Indexes for Multimedia Data 13 Indexes for Multimedia
More informationSome Useful Background for Talk on the Fast Johnson-Lindenstrauss Transform
Some Useful Background for Talk on the Fast Johnson-Lindenstrauss Transform Nir Ailon May 22, 2007 This writeup includes very basic background material for the talk on the Fast Johnson Lindenstrauss Transform
More informationSpace Exploration via Proximity Search
Space Exploration via Proximity Search Sariel Har-Peled Nirman Kumar David M. Mount Benjamin Raichel July 7, 2018 arxiv:1412.1398v1 [cs.cg] 3 Dec 2014 Abstract We investigate what computational tasks can
More informationGeometry of Similarity Search
Geometry of Similarity Search Alex Andoni (Columbia University) Find pairs of similar images how should we measure similarity? Naïvely: about n 2 comparisons Can we do better? 2 Measuring similarity 000000
More informationB490 Mining the Big Data
B490 Mining the Big Data 1 Finding Similar Items Qin Zhang 1-1 Motivations Finding similar documents/webpages/images (Approximate) mirror sites. Application: Don t want to show both when Google. 2-1 Motivations
More informationLower Bounds for Testing Bipartiteness in Dense Graphs
Lower Bounds for Testing Bipartiteness in Dense Graphs Andrej Bogdanov Luca Trevisan Abstract We consider the problem of testing bipartiteness in the adjacency matrix model. The best known algorithm, due
More informationCS 372: Computational Geometry Lecture 14 Geometric Approximation Algorithms
CS 372: Computational Geometry Lecture 14 Geometric Approximation Algorithms Antoine Vigneron King Abdullah University of Science and Technology December 5, 2012 Antoine Vigneron (KAUST) CS 372 Lecture
More informationLecture 5: Hashing. David Woodruff Carnegie Mellon University
Lecture 5: Hashing David Woodruff Carnegie Mellon University Hashing Universal hashing Perfect hashing Maintaining a Dictionary Let U be a universe of keys U could be all strings of ASCII characters of
More informationPolynomial Representations of Threshold Functions and Algorithmic Applications. Joint with Josh Alman (Stanford) and Timothy M.
Polynomial Representations of Threshold Functions and Algorithmic Applications Ryan Williams Stanford Joint with Josh Alman (Stanford) and Timothy M. Chan (Waterloo) Outline The Context: Polynomial Representations,
More informationNearest Neighbor Preserving Embeddings
Nearest Neighbor Preserving Embeddings Piotr Indyk MIT Assaf Naor Microsoft Research Abstract In this paper we introduce the notion of nearest neighbor preserving embeddings. These are randomized embeddings
More informationMultimedia Databases 1/29/ Indexes for Multimedia Data Indexes for Multimedia Data Indexes for Multimedia Data
1/29/2010 13 Indexes for Multimedia Data 13 Indexes for Multimedia Data 13.1 R-Trees Multimedia Databases Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig
More informationLecture 22. m n c (k) i,j x i x j = c (k) k=1
Notes on Complexity Theory Last updated: June, 2014 Jonathan Katz Lecture 22 1 N P PCP(poly, 1) We show here a probabilistically checkable proof for N P in which the verifier reads only a constant number
More informationHigher Cell Probe Lower Bounds for Evaluating Polynomials
Higher Cell Probe Lower Bounds for Evaluating Polynomials Kasper Green Larsen MADALGO, Department of Computer Science Aarhus University Aarhus, Denmark Email: larsen@cs.au.dk Abstract In this paper, we
More informationLecture 2: A Las Vegas Algorithm for finding the closest pair of points in the plane
Randomized Algorithms Lecture 2: A Las Vegas Algorithm for finding the closest pair of points in the plane Sotiris Nikoletseas Professor CEID - ETY Course 2017-2018 Sotiris Nikoletseas, Professor Randomized
More information12.1. Branching processes Galton-Watson Process
Chapter 1 Min Cut To acknowledge the corn - This purely American expression means to admit the losing of an argument, especially in regard to a detail; to retract; to admit defeat. It is over a hundred
More informationCell-Probe Lower Bounds for Prefix Sums and Matching Brackets
Cell-Probe Lower Bounds for Prefix Sums and Matching Brackets Emanuele Viola July 6, 2009 Abstract We prove that to store strings x {0, 1} n so that each prefix sum a.k.a. rank query Sumi := k i x k can
More informationStreaming and communication complexity of Hamming distance
Streaming and communication complexity of Hamming distance Tatiana Starikovskaya IRIF, Université Paris-Diderot (Joint work with Raphaël Clifford, ICALP 16) Approximate pattern matching Problem Pattern
More informationError Detection and Correction: Small Applications of Exclusive-Or
Error Detection and Correction: Small Applications of Exclusive-Or Greg Plaxton Theory in Programming Practice, Fall 2005 Department of Computer Science University of Texas at Austin Exclusive-Or (XOR,
More informationTail Inequalities Randomized Algorithms. Sariel Har-Peled. December 20, 2002
Tail Inequalities 497 - Randomized Algorithms Sariel Har-Peled December 0, 00 Wir mssen wissen, wir werden wissen (We must know, we shall know) David Hilbert 1 Tail Inequalities 1.1 The Chernoff Bound
More information1 Cryptographic hash functions
CSCI 5440: Cryptography Lecture 6 The Chinese University of Hong Kong 23 February 2011 1 Cryptographic hash functions Last time we saw a construction of message authentication codes (MACs) for fixed-length
More informationBloom Filters and Locality-Sensitive Hashing
Randomized Algorithms, Summer 2016 Bloom Filters and Locality-Sensitive Hashing Instructor: Thomas Kesselheim and Kurt Mehlhorn 1 Notation Lecture 4 (6 pages) When e talk about the probability of an event,
More informationWELL-SEPARATED PAIR DECOMPOSITION FOR THE UNIT-DISK GRAPH METRIC AND ITS APPLICATIONS
WELL-SEPARATED PAIR DECOMPOSITION FOR THE UNIT-DISK GRAPH METRIC AND ITS APPLICATIONS JIE GAO AND LI ZHANG Abstract. We extend the classic notion of well-separated pair decomposition [10] to the unit-disk
More informationLower bounds on Locality Sensitive Hashing
Lower bouns on Locality Sensitive Hashing Rajeev Motwani Assaf Naor Rina Panigrahy Abstract Given a metric space (X, X ), c 1, r > 0, an p, q [0, 1], a istribution over mappings H : X N is calle a (r,
More information12 Hash Tables Introduction Chaining. Lecture 12: Hash Tables [Fa 10]
Calvin: There! I finished our secret code! Hobbes: Let s see. Calvin: I assigned each letter a totally random number, so the code will be hard to crack. For letter A, you write 3,004,572,688. B is 28,731,569½.
More informationAlgorithms for Nearest Neighbors
Algorithms for Nearest Neighbors Background and Two Challenges Yury Lifshits Steklov Institute of Mathematics at St.Petersburg http://logic.pdmi.ras.ru/~yura McGill University, July 2007 1 / 29 Outline
More informationCSE 190, Great ideas in algorithms: Pairwise independent hash functions
CSE 190, Great ideas in algorithms: Pairwise independent hash functions 1 Hash functions The goal of hash functions is to map elements from a large domain to a small one. Typically, to obtain the required
More informationRandomized Algorithms III Min Cut
Chapter 11 Randomized Algorithms III Min Cut CS 57: Algorithms, Fall 01 October 1, 01 11.1 Min Cut 11.1.1 Problem Definition 11. Min cut 11..0.1 Min cut G = V, E): undirected graph, n vertices, m edges.
More informationOptimal Lower Bounds for Locality Sensitive Hashing (except when q is tiny)
Innovations in Computer Science 20 Optimal Lower Bounds for Locality Sensitive Hashing (except when q is tiny Ryan O Donnell Yi Wu 3 Yuan Zhou 2 Computer Science Department, Carnegie Mellon University,
More information1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:
CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at
More informationPartitions and Covers
University of California, Los Angeles CS 289A Communication Complexity Instructor: Alexander Sherstov Scribe: Dong Wang Date: January 2, 2012 LECTURE 4 Partitions and Covers In previous lectures, we saw
More informationThe min cost flow problem Course notes for Optimization Spring 2007
The min cost flow problem Course notes for Optimization Spring 2007 Peter Bro Miltersen February 7, 2007 Version 3.0 1 Definition of the min cost flow problem We shall consider a generalization of the
More information6.842 Randomness and Computation Lecture 5
6.842 Randomness and Computation 2012-02-22 Lecture 5 Lecturer: Ronitt Rubinfeld Scribe: Michael Forbes 1 Overview Today we will define the notion of a pairwise independent hash function, and discuss its
More informationLecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1
Lecture 10 Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1 Recap Sublinear time algorithms Deterministic + exact: binary search Deterministic + inexact: estimating diameter in a
More informationLecture 23: Alternation vs. Counting
CS 710: Complexity Theory 4/13/010 Lecture 3: Alternation vs. Counting Instructor: Dieter van Melkebeek Scribe: Jeff Kinne & Mushfeq Khan We introduced counting complexity classes in the previous lecture
More informationNearest-Neighbor Searching Under Uncertainty
Nearest-Neighbor Searching Under Uncertainty Wuzhou Zhang Joint work with Pankaj K. Agarwal, Alon Efrat, and Swaminathan Sankararaman. To appear in PODS 2012. Nearest-Neighbor Searching S: a set of n points
More informationTrace Reconstruction Revisited
Trace Reconstruction Revisited Andrew McGregor 1, Eric Price 2, and Sofya Vorotnikova 1 1 University of Massachusetts Amherst {mcgregor,svorotni}@cs.umass.edu 2 IBM Almaden Research Center ecprice@mit.edu
More information15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018
15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science
More informationLecture 6: September 22
CS294 Markov Chain Monte Carlo: Foundations & Applications Fall 2009 Lecture 6: September 22 Lecturer: Prof. Alistair Sinclair Scribes: Alistair Sinclair Disclaimer: These notes have not been subjected
More informationb + O(n d ) where a 1, b > 1, then O(n d log n) if a = b d d ) if a < b d O(n log b a ) if a > b d
CS161, Lecture 4 Median, Selection, and the Substitution Method Scribe: Albert Chen and Juliana Cook (2015), Sam Kim (2016), Gregory Valiant (2017) Date: January 23, 2017 1 Introduction Last lecture, we
More informationLecture 1 : Data Compression and Entropy
CPS290: Algorithmic Foundations of Data Science January 8, 207 Lecture : Data Compression and Entropy Lecturer: Kamesh Munagala Scribe: Kamesh Munagala In this lecture, we will study a simple model for
More information1 Randomized Computation
CS 6743 Lecture 17 1 Fall 2007 1 Randomized Computation Why is randomness useful? Imagine you have a stack of bank notes, with very few counterfeit ones. You want to choose a genuine bank note to pay at
More informationNearest-Neighbor Searching Under Uncertainty
Nearest-Neighbor Searching Under Uncertainty Pankaj K. Agarwal Department of Computer Science Duke University pankaj@cs.duke.edu Alon Efrat Department of Computer Science The University of Arizona alon@cs.arizona.edu
More informationarxiv: v1 [cs.ds] 9 Apr 2018
From Regular Expression Matching to Parsing Philip Bille Technical University of Denmark phbi@dtu.dk Inge Li Gørtz Technical University of Denmark inge@dtu.dk arxiv:1804.02906v1 [cs.ds] 9 Apr 2018 Abstract
More informationOptimal Tree-decomposition Balancing and Reachability on Low Treewidth Graphs
Optimal Tree-decomposition Balancing and Reachability on Low Treewidth Graphs Krishnendu Chatterjee Rasmus Ibsen-Jensen Andreas Pavlogiannis IST Austria Abstract. We consider graphs with n nodes together
More informationDesign and Analysis of Algorithms
CSE 101, Winter 2018 Design and Analysis of Algorithms Lecture 5: Divide and Conquer (Part 2) Class URL: http://vlsicad.ucsd.edu/courses/cse101-w18/ A Lower Bound on Convex Hull Lecture 4 Task: sort the
More informationApproximating MAX-E3LIN is NP-Hard
Approximating MAX-E3LIN is NP-Hard Evan Chen May 4, 2016 This lecture focuses on the MAX-E3LIN problem. We prove that approximating it is NP-hard by a reduction from LABEL-COVER. 1 Introducing MAX-E3LIN
More informationHigh Dimensional Geometry, Curse of Dimensionality, Dimension Reduction
Chapter 11 High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction High-dimensional vectors are ubiquitous in applications (gene expression data, set of movies watched by Netflix customer,
More informationRamsey partitions and proximity data structures
Ramsey partitions and proximity data structures Manor Mendel The Open University of Israel Assaf Naor Microsoft Research Abstract This paper addresses two problems lying at the intersection of geometric
More informationA Generalized Turán Problem and its Applications
A Generalized Turán Problem and its Applications Lior Gishboliner Asaf Shapira Abstract The investigation of conditions guaranteeing the appearance of cycles of certain lengths is one of the most well-studied
More informationLimitations of Algorithm Power
Limitations of Algorithm Power Objectives We now move into the third and final major theme for this course. 1. Tools for analyzing algorithms. 2. Design strategies for designing algorithms. 3. Identifying
More informationCS Communication Complexity: Applications and New Directions
CS 2429 - Communication Complexity: Applications and New Directions Lecturer: Toniann Pitassi 1 Introduction In this course we will define the basic two-party model of communication, as introduced in the
More informationInterval Selection in the streaming model
Interval Selection in the streaming model Pascal Bemmann Abstract In the interval selection problem we are given a set of intervals via a stream and want to nd the maximum set of pairwise independent intervals.
More information20.1 2SAT. CS125 Lecture 20 Fall 2016
CS125 Lecture 20 Fall 2016 20.1 2SAT We show yet another possible way to solve the 2SAT problem. Recall that the input to 2SAT is a logical expression that is the conunction (AND) of a set of clauses,
More informationCS 6820 Fall 2014 Lectures, October 3-20, 2014
Analysis of Algorithms Linear Programming Notes CS 6820 Fall 2014 Lectures, October 3-20, 2014 1 Linear programming The linear programming (LP) problem is the following optimization problem. We are given
More informationAsymptotic redundancy and prolixity
Asymptotic redundancy and prolixity Yuval Dagan, Yuval Filmus, and Shay Moran April 6, 2017 Abstract Gallager (1978) considered the worst-case redundancy of Huffman codes as the maximum probability tends
More informationSpace-Time Tradeoffs for Approximate Nearest Neighbor Searching
1 Space-Time Tradeoffs for Approximate Nearest Neighbor Searching SUNIL ARYA Hong Kong University of Science and Technology, Kowloon, Hong Kong, China THEOCHARIS MALAMATOS University of Peloponnese, Tripoli,
More information1 Cryptographic hash functions
CSCI 5440: Cryptography Lecture 6 The Chinese University of Hong Kong 24 October 2012 1 Cryptographic hash functions Last time we saw a construction of message authentication codes (MACs) for fixed-length
More informationOn the Optimality of the Dimensionality Reduction Method
On the Optimality of the Dimensionality Reduction Method Alexandr Andoni MIT andoni@mit.edu Piotr Indyk MIT indyk@mit.edu Mihai Pǎtraşcu MIT mip@mit.edu Abstract We investigate the optimality of (1+)-approximation
More informationLecture Hardness of Set Cover
PCPs and Inapproxiability CIS 6930 October 5, 2009 Lecture Hardness of Set Cover Lecturer: Dr. My T. Thai Scribe: Ying Xuan 1 Preliminaries 1.1 Two-Prover-One-Round Proof System A new PCP model 2P1R Think
More information