Coding for Random Projections and Approximate Near Neighbor Search

Ping Li
Department of Statistics & Biostatistics, Department of Computer Science
Rutgers University, Piscataway, NJ 08854, USA

Michael Mitzenmacher
School of Engineering and Applied Sciences
Harvard University, Cambridge, MA 02138, USA

Anshumali Shrivastava
Department of Computer Science
Cornell University, Ithaca, NY 14853, USA

Abstract

This technical note compares two coding (quantization) schemes for random projections in the context of sub-linear time approximate near neighbor search. The first scheme is based on uniform quantization [4], while the second scheme utilizes a uniform quantization plus a uniformly random offset [1] (which has been popular in practice). The prior work [4] compared the two schemes in the context of similarity estimation and training linear classifiers, with the conclusion that the step of random offset is not necessary and may hurt the performance (depending on the similarity level). The task of near neighbor search is related to similarity estimation, but with important distinctions, and requires its own study. In this paper, we demonstrate that in the context of near neighbor search, the step of random offset is not needed either and may hurt the performance (sometimes significantly so, depending on the similarity and other parameters).

For approximate near neighbor search, when the target similarity level is high (e.g., correlation > 0.85), our analysis suggests using a uniform quantization to build hash tables, with a bin width $w = 1 \sim 1.5$. On the other hand, when the target similarity level is not that high, it is preferable to use larger $w$ values (e.g., $w \geq 2 \sim 3$). This is equivalent to saying that it suffices to use only a small number of bits (or even just 1 bit) to code each hashed value in the context of sublinear time near neighbor search. An extensive experimental study on two reasonably large datasets confirms the theoretical finding.

Coding for building hash tables is a different task from coding for similarity estimation. For near neighbor search, we need coding of the projected data to determine which buckets the data points should be placed in (and the coded values are not stored). For similarity estimation, the purpose of coding is to accurately estimate the similarities using small storage space. Therefore, if necessary, we can actually code the projected data twice (with different bin widths).

In this paper, we do not study the important issue of "re-ranking" of retrieved data points by using estimated similarities. That step is needed when exact (all pairwise) similarities cannot be practically stored or computed on the fly. In a concurrent work [5], we demonstrate that the retrieval accuracy can be further improved by using nonlinear estimators of the similarities based on a 2-bit coding scheme.

1 Introduction

This paper focuses on the comparison of two quantization schemes for random projections in the context of sublinear time near neighbor search. The task of near neighbor search is to identify a set of data points which are most similar (in some measure of similarity) to a query data point. Efficient algorithms for near neighbor search have numerous applications in search, databases, machine learning, recommender systems, computer vision, etc. Developing efficient algorithms for finding near neighbors has been an active research topic since the early days of modern computing [2]. Near neighbor search with extremely high-dimensional data (e.g., texts or images) is still a challenging task and an active research problem.

Among many types of similarity measures, the (squared) Euclidean distance (denoted by $d$) and the correlation (denoted by $\rho$) are most commonly used. Without loss of generality, consider two high-dimensional data vectors $u, v \in \mathbb{R}^D$. The squared Euclidean distance and correlation are defined as follows:

$$ d = \sum_{i=1}^{D} |u_i - v_i|^2, \qquad \rho = \frac{\sum_{i=1}^{D} u_i v_i}{\sqrt{\sum_{i=1}^{D} u_i^2}\,\sqrt{\sum_{i=1}^{D} v_i^2}} \tag{1} $$

In practice, it appears that the correlation is more often used than the distance, partly because $\rho$ is nicely normalized within $-1$ and $1$. In fact, in this study, we will assume that the marginal $l_2$ norms $\sum_{i=1}^{D} |u_i|^2$ and $\sum_{i=1}^{D} |v_i|^2$ are known. This is a reasonable assumption. Computing the marginal $l_2$ norms only requires scanning the data once, which is anyway needed during the data collection process. In machine learning practice, it is common to first normalize the data (to have unit $l_2$ norm) before feeding the data to classification (e.g., SVM) or clustering (e.g., K-means) algorithms.

For convenience, throughout this paper, we assume unit $l_2$ norms, i.e.,

$$ \rho = \frac{\sum_{i=1}^{D} u_i v_i}{\sqrt{\sum_{i=1}^{D} u_i^2}\,\sqrt{\sum_{i=1}^{D} v_i^2}} = \sum_{i=1}^{D} u_i v_i, \qquad \text{where } \sum_{i=1}^{D} u_i^2 = \sum_{i=1}^{D} v_i^2 = 1 \tag{2} $$
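As a small illustration of the two similarity measures, the sketch below computes $d$ and $\rho$ for two normalized vectors and checks that, under the unit-norm assumption, the two are tied by $d = 2(1 - \rho)$. The helper names and the example vectors are illustrative assumptions, not part of the paper.

```python
import math

def squared_distance(u, v):
    """Squared Euclidean distance d = sum_i |u_i - v_i|^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def correlation(u, v):
    """Correlation rho = <u, v> / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normalize(u):
    """Rescale u to unit l2 norm."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]

# Two toy vectors (hypothetical data), normalized to unit l2 norm.
u = normalize([3.0, 4.0, 0.0])
v = normalize([4.0, 3.0, 0.0])

rho = correlation(u, v)
d = squared_distance(u, v)
# For unit-norm vectors, d and 2 * (1 - rho) agree up to rounding.
print(rho, d, 2.0 * (1.0 - rho))
```

Because of this identity, a target similarity and a target distance can be used interchangeably once the data are normalized.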
1.1 Random Projections

As an effective tool for dimensionality reduction, the idea of random projections is to multiply the data, e.g., $u, v \in \mathbb{R}^D$, with a random normal projection matrix $R \in \mathbb{R}^{D \times k}$ (where $k \ll D$), to generate:

$$ x = u \times R \in \mathbb{R}^k, \quad y = v \times R \in \mathbb{R}^k, \quad R = \{r_{ij}\}_{i=1}^{D}{}_{j=1}^{k}, \quad r_{ij} \sim N(0, 1) \text{ i.i.d.} \tag{3} $$

The method of random projections has become popular for large-scale machine learning applications such as classification, regression, matrix factorization, singular value decomposition, near neighbor search, etc.

The potential benefits of coding with a small number of bits arise because the (uncoded) projected data, $x_j = \sum_{i=1}^{D} u_i r_{ij}$ and $y_j = \sum_{i=1}^{D} v_i r_{ij}$, being real-valued numbers, are neither convenient/economical for storage and transmission, nor well-suited for indexing. The focus of this paper is on approximate (sublinear time) near neighbor search in the framework of locality sensitive hashing [3]. In particular, we will compare two coding (quantization) schemes of random projections [1, 4] in the context of near neighbor search.

1.2 Uniform Quantization

The recent work [4] proposed an intuitive coding scheme, based on a simple uniform quantization:

$$ h_w^{(j)}(u) = \lfloor x_j / w \rfloor, \qquad h_w^{(j)}(v) = \lfloor y_j / w \rfloor \tag{4} $$

where $w > 0$ is the bin width and $\lfloor \cdot \rfloor$ is the standard floor operation. The following theorem is proved in [4] about the collision probability $P_w = \Pr\left(h_w^{(j)}(u) = h_w^{(j)}(v)\right)$.

Theorem 1

$$ P_w = \Pr\left(h_w^{(j)}(u) = h_w^{(j)}(v)\right) = 2 \sum_{i=0}^{\infty} \int_{iw}^{(i+1)w} \phi(z) \left[ \Phi\left(\frac{(i+1)w - \rho z}{\sqrt{1 - \rho^2}}\right) - \Phi\left(\frac{iw - \rho z}{\sqrt{1 - \rho^2}}\right) \right] dz \tag{5} $$

In addition, $P_w$ is a monotonically increasing function of $\rho$.

The fact that $P_w$ is a monotonically increasing function of $\rho$ makes (4) a suitable coding scheme for approximate near neighbor search in the general framework of locality sensitive hashing (LSH).

1.3 Uniform Quantization with Random Offset

[1] proposed the following well-known coding scheme, which uses windows and a random offset:

$$ h_{w,q}^{(j)}(u) = \left\lfloor \frac{x_j + q_j}{w} \right\rfloor, \qquad h_{w,q}^{(j)}(v) = \left\lfloor \frac{y_j + q_j}{w} \right\rfloor \tag{6} $$

where $q_j \sim \text{uniform}(0, w)$. [1] showed that the collision probability can be written as

$$ P_{w,q} = \Pr\left(h_{w,q}^{(j)}(u) = h_{w,q}^{(j)}(v)\right) = \int_0^w \frac{1}{\sqrt{d}}\, 2\phi\left(\frac{t}{\sqrt{d}}\right) \left(1 - \frac{t}{w}\right) dt \tag{7} $$

where $d = \|u - v\|^2 = 2(1 - \rho)$ is the squared Euclidean distance between $u$ and $v$. Compared with (6), the scheme (4) does not use the additional randomization with $q \sim \text{uniform}(0, w)$ (i.e., the offset). [4] elaborated the following advantages of (4) in the context of similarity estimation:

1. Operationally, $h_w$ is simpler than $h_{w,q}$.
2. With a fixed $w$, $h_w$ is always more accurate than $h_{w,q}$, often significantly so.
3. For each coding scheme, one can separately find the optimum bin width $w$. The optimized $h_w$ is also more accurate than the optimized $h_{w,q}$, often significantly so.
4. $h_w$ requires a smaller number of bits than $h_{w,q}$.

In this paper, we will compare $h_{w,q}$ with $h_w$ in the context of sublinear time near neighbor search.

1.4 Sublinear Time c-Approximate Near Neighbor Search

Consider a data vector $u$. Suppose there exists another vector whose Euclidean distance ($\sqrt{d}$) from $u$ is at most $\sqrt{d_0}$ (the target distance). The goal of $c$-approximate $\sqrt{d_0}$-near neighbor algorithms is to return data vectors (with high probability) whose Euclidean distances from $u$ are at most $c \times \sqrt{d_0}$ with $c > 1$.

Recall that, in our definition, $d = 2(1 - \rho)$ is the squared Euclidean distance. To be consistent with [1], we present the results in terms of $\sqrt{d}$. Corresponding to the target distance $\sqrt{d_0}$, the target similarity $\rho_0$ can be computed from $d_0 = 2(1 - \rho_0)$, i.e., $\rho_0 = 1 - d_0/2$. To simplify the presentation, we focus on $\rho \geq 0$ (as is common in practice), i.e., $0 \leq d \leq 2$. Once we fix a target similarity $\rho_0$, $c$ cannot exceed a certain value:

$$ c \sqrt{2(1 - \rho_0)} \leq \sqrt{2} \implies c \leq \sqrt{\frac{1}{1 - \rho_0}} \tag{8} $$

For example, when $\rho_0 = 0.5$, we must have $1 \leq c \leq \sqrt{2}$.
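To make the definitions of this section concrete, the following minimal sketch codes one unit-norm vector with both schemes, $h_w$ from (4) and $h_{w,q}$ from (6), and evaluates the bound (8) on $c$. All function names, the sizes `D`, `k`, the bin width, and the fixed seed are illustrative assumptions; only the Python standard library is used.

```python
import math
import random

def projection_matrix(D, k, rng):
    """D x k matrix R with i.i.d. N(0, 1) entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(D)]

def project(u, R):
    """x = u R, i.e. x_j = sum_i u_i * r_ij."""
    return [sum(u_i * row[j] for u_i, row in zip(u, R)) for j in range(len(R[0]))]

def h_w(x, w):
    """Uniform quantization: floor(x_j / w)."""
    return [math.floor(x_j / w) for x_j in x]

def h_wq(x, q, w):
    """Uniform quantization with offsets q_j drawn once from uniform(0, w)."""
    return [math.floor((x_j + q_j) / w) for x_j, q_j in zip(x, q)]

def max_c(rho0):
    """Largest admissible c for a target similarity 0 <= rho0 < 1, as in (8)."""
    return math.sqrt(1.0 / (1.0 - rho0))

rng = random.Random(0)           # fixed seed, only for reproducibility
D, k, w = 100, 16, 1.5           # illustrative sizes and bin width
R = projection_matrix(D, k, rng)
q = [rng.uniform(0.0, w) for _ in range(k)]   # offsets shared by all data vectors

u = [rng.gauss(0.0, 1.0) for _ in range(D)]
norm = math.sqrt(sum(a * a for a in u))
u = [a / norm for a in u]        # unit l2 norm, as assumed in the paper

x = project(u, R)
codes_w = h_w(x, w)
codes_wq = h_wq(x, q, w)
print(codes_w)
print(codes_wq)
print(max_c(0.5))                # close to sqrt(2)
```

Since the offset lies in $[0, w]$, each offset code either equals the plain code or exceeds it by one; the offsets must be stored and reused for every data vector and query, which is the operational overhead that $h_w$ avoids.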

Under the general framework, the performance of an LSH algorithm largely depends on the difference (gap) between the two collision probabilities $P^{(1)}$ and $P^{(2)}$ (respectively corresponding to $\sqrt{d_0}$ and $c\sqrt{d_0}$):

$$ P_w^{(1)} = \Pr\left(h_w(u) = h_w(v)\right) \quad \text{when } d = \|u - v\|_2^2 = d_0 \tag{9} $$

$$ P_w^{(2)} = \Pr\left(h_w(u) = h_w(v)\right) \quad \text{when } d = \|u - v\|_2^2 = c^2 d_0 \tag{10} $$

Corresponding to $h_{w,q}$, the collision probabilities $P_{w,q}^{(1)}$ and $P_{w,q}^{(2)}$ are analogously defined.

A larger difference between $P^{(1)}$ and $P^{(2)}$ implies a more efficient LSH algorithm. The following $G$ values ($G_w$ for $h_w$ and $G_{w,q}$ for $h_{w,q}$) characterize the gaps:

$$ G_w = \frac{\log 1/P_w^{(1)}}{\log 1/P_w^{(2)}}, \qquad G_{w,q} = \frac{\log 1/P_{w,q}^{(1)}}{\log 1/P_{w,q}^{(2)}} \tag{11} $$

A smaller $G$ (i.e., a larger difference between $P^{(1)}$ and $P^{(2)}$) leads to a potentially more efficient LSH algorithm, and an exponent below $1/c$ is particularly desirable [3]. The general theory says the query time for $c$-approximate $\sqrt{d_0}$-near neighbor is dominated by $O(N^G)$ distance evaluations, where $N$ is the total number of data vectors in the collection. This is better than $O(N)$, the cost of a linear scan.

2 Comparison of the Collision Probabilities

To help understand the intuition why $h_w$ may lead to better performance than $h_{w,q}$, in this section we examine their collision probabilities $P_w$ and $P_{w,q}$, which can be expressed in terms of the standard normal pdf and cdf functions, $\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ and $\Phi(x) = \int_{-\infty}^{x} \phi(x)\,dx$:

$$ P_{w,q} = \Pr\left(h_{w,q}^{(j)}(u) = h_{w,q}^{(j)}(v)\right) = 2\Phi\left(\frac{w}{\sqrt{d}}\right) - 1 - \frac{2}{\sqrt{2\pi}\, w/\sqrt{d}} + \frac{2}{w/\sqrt{d}}\, \phi\left(\frac{w}{\sqrt{d}}\right) \tag{12} $$

$$ P_w = \Pr\left(h_w^{(j)}(u) = h_w^{(j)}(v)\right) = 2 \sum_{i=0}^{\infty} \int_{iw}^{(i+1)w} \phi(z) \left[ \Phi\left(\frac{(i+1)w - \rho z}{\sqrt{1 - \rho^2}}\right) - \Phi\left(\frac{iw - \rho z}{\sqrt{1 - \rho^2}}\right) \right] dz \tag{13} $$

It is clear that $P_{w,q} \to 1$ as $w \to \infty$.

Figure 1 plots both $P_w$ and $P_{w,q}$ for selected $\rho$ values. The difference between $P_w$ and $P_{w,q}$ becomes apparent when $w$ is not small. For example, when $\rho = 0$, $P_w$ quickly approaches the limit 0.5 while $P_{w,q}$ keeps increasing (to 1) as $w$ increases. Intuitively, the fact that $P_{w,q} \to 1$ when $\rho = 0$ is undesirable because it means two orthogonal vectors will have the same coded value. Thus, it is not surprising that $h_w$ will have better performance than $h_{w,q}$, for both similarity estimation and sublinear time near neighbor search.
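The collision probabilities and gaps discussed above can be evaluated numerically. The sketch below is one possible way to do so, under stated assumptions: $P_{w,q}$ uses its closed form with $d = 2(1 - \rho)$, while $P_w$ is approximated by truncating the infinite sum at a hypothetical cutoff `z_max` and applying a midpoint rule inside each bin. The function names, the cutoff, and the step count are illustrative choices rather than the authors' procedure.

```python
import math

def phi(x):
    """Standard normal pdf."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_wq(w, rho):
    """Closed-form collision probability of the offset scheme (needs rho < 1)."""
    a = w / math.sqrt(2.0 * (1.0 - rho))      # a = w / sqrt(d)
    return (2.0 * Phi(a) - 1.0
            - 2.0 / (math.sqrt(2.0 * math.pi) * a)
            + 2.0 / a * phi(a))

def p_w(w, rho, z_max=8.0, steps_per_bin=200):
    """Collision probability of uniform quantization (needs |rho| < 1).

    The sum over bins stops once the bin starts beyond z_max, and each
    bin integral is approximated with the midpoint rule.
    """
    s = math.sqrt(1.0 - rho * rho)
    total = 0.0
    i = 0
    while i * w < z_max:
        lo, hi = i * w, (i + 1) * w
        dz = (hi - lo) / steps_per_bin
        for n in range(steps_per_bin):
            z = lo + (n + 0.5) * dz
            total += phi(z) * (Phi((hi - rho * z) / s) - Phi((lo - rho * z) / s)) * dz
        i += 1
    return 2.0 * total

def gap(p, w, rho0, c):
    """G = log(1/P1) / log(1/P2), with P1 at d0 and P2 at c^2 * d0."""
    rho_far = 1.0 - c * c * (1.0 - rho0)      # similarity at squared distance c^2 * d0
    return math.log(p(w, rho0)) / math.log(p(w, rho_far))

# Orthogonal vectors (rho = 0) with a wide bin: P_w stays near 0.5,
# while P_{w,q} keeps growing towards 1.
print(p_w(10.0, 0.0), p_wq(10.0, 0.0))
print(gap(p_w, 1.5, 0.9, 1.5), gap(p_wq, 1.5, 0.9, 1.5))
```

Passing the probability function as an argument to `gap` keeps the comparison symmetric: the same target similarity $\rho_0$ and approximation factor $c$ are applied to both schemes.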

Figure 1: Collision probabilities, $P_w$ and $P_{w,q}$, for $\rho$ = 0, 0.25, 0.5, 0.75, 0.9, and 0.99. The scheme $h_w$ has smaller collision probabilities than the scheme [1] $h_{w,q}$, especially when $w > 2$.

3 Theoretical Comparison of the Gaps

Figure 2 compares $G_w$ with $G_{w,q}$ at their optimum $w$ values, as functions of $c$, for a wide range of target similarity $\rho_0$ levels. Basically, at each $c$ and $\rho_0$, we choose the $w$ to minimize $G_w$ and the $w$ to minimize $G_{w,q}$. This figure illustrates that $G_w$ is smaller than $G_{w,q}$, noticeably so in the low similarity region.

Figure 3, Figure 4, Figure 5, and Figure 6 present $G_w$ and $G_{w,q}$ as functions of $w$, for $\rho_0 = 0.99$, $\rho_0 = 0.95$, $\rho_0 = 0.9$ and $\rho_0 = 0.5$, respectively. In each figure, we plot the curves for a wide range of $c$ values. These figures illustrate where the optimum $w$ values are obtained. Clearly, in the high similarity region, the smallest $G$ values are obtained at low $w$ values, especially at small $c$. In the low (or moderate) similarity region, the smallest $G$ values are usually attained at relatively large $w$.

In practice, we normally have to pre-specify a $w$, for all $c$ and $\rho_0$ values. In other words, the "optimum" $G$ values presented in Figure 2 are in general not attainable. Therefore, Figure 7, Figure 8, Figure 9, and Figure 10 present $G_w$ and $G_{w,q}$ as functions of $c$, for $\rho_0 = 0.99$, $\rho_0 = 0.95$, $\rho_0 = 0.9$ and $\rho_0 = 0.5$, respectively. In each figure, we plot the curves for a wide range of $w$ values. These figures again confirm that $G_w$ is smaller than $G_{w,q}$.
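The selection of an optimum bin width at a given $c$ and $\rho_0$ can be sketched as a simple grid search. The example below does this for the offset scheme only, because $P_{w,q}$ has a closed form; the same loop would apply to $G_w$ with a numerical evaluation of $P_w$. The grid of $w$ values and the function names are assumptions made for illustration.

```python
import math

def p_wq(w, rho):
    """Closed-form collision probability of the offset scheme, d = 2 (1 - rho)."""
    a = w / math.sqrt(2.0 * (1.0 - rho))
    cdf = 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    return 2.0 * cdf - 1.0 - 2.0 / (math.sqrt(2.0 * math.pi) * a) + 2.0 * pdf / a

def gap_wq(w, rho0, c):
    """G_{w,q} = log(1/P1) / log(1/P2) at target similarity rho0 and factor c."""
    rho_far = 1.0 - c * c * (1.0 - rho0)      # similarity at squared distance c^2 * d0
    return math.log(p_wq(w, rho0)) / math.log(p_wq(w, rho_far))

def best_w(rho0, c, grid):
    """Bin width on the grid that minimizes the gap."""
    return min(grid, key=lambda w: gap_wq(w, rho0, c))

grid = [0.1 * n for n in range(1, 101)]       # hypothetical grid: w = 0.1, ..., 10
rho0, c = 0.9, 1.5
w_star = best_w(rho0, c, grid)
g_star = gap_wq(w_star, rho0, c)
print(w_star, g_star, 1.0 / c)
```

Comparing `g_star` with `1 / c` mirrors the reference line used in the figures; fixing `w` in advance and sweeping `c` instead corresponds to the pre-specified setting described in the last paragraph.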

Figure 2: Comparison of the optimum gaps (smaller the better) for $h_w$ and $h_{w,q}$. For each $\rho_0$ and $c$, we can find the smallest gaps individually for $h_w$ and $h_{w,q}$, over the entire range of $w$. We can see that for all target similarity levels $\rho_0$, both $h_{w,q}$ and $h_w$ exhibit better performance than $1/c$. $h_w$ always has a smaller gap than $h_{w,q}$, although in the high similarity region both schemes perform similarly.

Figure 3: The gaps $G_w$ and $G_{w,q}$ as functions of $w$, for $\rho_0 = 0.99$. In each panel, we plot both $G_w$ and $G_{w,q}$ for a particular $c$ value. The plots illustrate where the optimum $w$ values are obtained.

Figure 4: The gaps $G_w$ and $G_{w,q}$ as functions of $w$, for $\rho_0 = 0.95$ and a range of $c$ values.

Figure 5: The gaps $G_w$ and $G_{w,q}$ as functions of $w$, for $\rho_0 = 0.9$ and a range of $c$ values.

Figure 6: The gaps $G_w$ and $G_{w,q}$ as functions of $w$, for $\rho_0 = 0.5$ and a range of $c$ values.

Figure 7: The gaps $G_w$ and $G_{w,q}$ as functions of $c$, for $\rho_0 = 0.99$. In each panel, we plot both $G_w$ and $G_{w,q}$ for a particular $w$ value.

Figure 8: The gaps $G_w$ and $G_{w,q}$ as functions of $c$, for $\rho_0 = 0.95$. In each panel, we plot both $G_w$ and $G_{w,q}$ for a particular $w$ value.

Figure 9: The gaps $G_w$ and $G_{w,q}$ as functions of $c$, for $\rho_0 = 0.9$. In each panel, we plot both $G_w$ and $G_{w,q}$ for a particular $w$ value.

Figure 10: The gaps $G_w$ and $G_{w,q}$ as functions of $c$, for $\rho_0 = 0.5$. In each panel, we plot both $G_w$ and $G_{w,q}$ for a particular $w$ value.

4 Optimal Gaps

To view the optimal gaps more closely, Figure 11 and Figure 12 plot the best gaps (left panels) and the optimal $w$ values (right panels) at which the best gaps are attained, for selected values of $c$ and the entire range of $\rho$. The results can be summarized as follows:

• At any $\rho$ and $c$, the optimal gap $G_{w,q}$ is always at least as large as the optimal gap $G_w$. At relatively low similarities, the optimal $G_{w,q}$ can be substantially larger than the optimal $G_w$.
• When the target similarity level $\rho$ is high (e.g., $\rho > 0.85$), for both schemes $h_w$ and $h_{w,q}$, the optimal $w$ values are relatively low, for example, $w = 1 \sim 1.5$ when $0.85 < \rho < 0.9$. In this region, both $h_{w,q}$ and $h_w$ behave similarly.
• When the target similarity level $\rho$ is not so high, for $h_w$, it is best to use a large value of $w$, in particular $w \geq 2 \sim 3$. In comparison, for $h_{w,q}$, the optimal $w$ values grow smoothly with decreasing $\rho$.

These plots again confirm the previous comparisons: (i) we should always replace $h_{w,q}$ with $h_w$; (ii) if we use $h_w$ and target very high similarity, a good choice of $w$ might be $w = 1 \sim 1.5$; (iii) if we use $h_w$ and the target similarity is not too high, then we can safely use $w = 2 \sim 3$.

We should also mention that, although the optimal $w$ values for $h_w$ appear to exhibit a "jump" in the right panels of Figure 11 and Figure 12, the choice of $w$ does not influence the performance much, as shown in previous plots. In Figures 3 to 6, we have seen that even when the optimal $w$ appears to approach $\infty$, the actual gaps do not differ much between $w = 3$ and $w \gg 3$. In the real-data evaluations in the next section, we will see the same phenomenon for $h_w$.

Note that the Gaussian density decays rapidly at the tail, for example, $1 - \Phi(6) = 9.9 \times 10^{-10}$. If we choose $w = 1.5$, or 2, or 3, then we just need a small number of bits to code each hashed value.
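The remark about the Gaussian tail and the number of bits can be checked with a short calculation. The sketch below assumes, purely for illustration, that projected values outside $(-6, 6)$ are ignored, and counts the bins of $\lfloor x / w \rfloor$ that intersect this range; the cutoff and the function names are assumptions and not a rule stated in the paper.

```python
import math

def tail_prob(t):
    """Pr(Z > t) for a standard normal Z; erfc avoids cancellation in 1 - Phi(t)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def bits_per_code(w, cutoff=6.0):
    """Bits needed to index the bins of floor(x / w) that intersect (-cutoff, cutoff)."""
    n_bins = 2 * math.ceil(cutoff / w)        # bins on the negative and positive side
    return math.ceil(math.log2(n_bins))

print(tail_prob(6.0))                          # on the order of 1e-9
for w in (1.5, 2.0, 3.0, 6.0):
    print(w, bits_per_code(w))
```

Under this assumed cutoff, the bin widths recommended above need only two or three bits per hashed value, and a sufficiently wide bin reduces the code to a single sign bit.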

Figure 11: Left panels: the optimal (smallest) gaps at given $c$ values and the entire range of $\rho$. We can see that $G_{w,q}$ is always larger than $G_w$, confirming that it is better to use $h_w$ instead of $h_{w,q}$. Right panels: the optimal values of $w$ at which the optimal gaps are attained. When the target similarity $\rho$ is very high, it is best to use a relatively small $w$. When the target similarity is not that high, if we use $h_w$, it is best to use $w > 3$.

Figure 12: Left panels: the optimal (smallest) gaps at given $c$ values and the entire range of $\rho$. We can see that $G_{w,q}$ is always larger than $G_w$, confirming that it is better to use $h_w$ instead of $h_{w,q}$. Right panels: the optimal values of $w$ at which the optimal gaps are attained. When the target similarity $\rho$ is very high, it is best to use a relatively small $w$. When the target similarity is not that high, if we use $h_w$, it is best to use $w > 3$.

5 An Experimental Study

Two datasets, Peekaboom and Youtube, are used in our experiments for validating the theoretical results. Peekaboom is a standard image retrieval dataset, which is divided into two subsets, one with 998 data points and another with […] data points. We use the larger subset for building hash tables and the smaller subset for query data points. The reported experimental results are averaged over all query data points.

Available in the UCI repository, Youtube is a multi-view dataset. For simplicity, we only use the largest set of audio features. The original training set, with […] data points, is used for building hash tables. 5 data points, randomly selected from the original test set, are used as query data points.

We use the standard (K, L)-LSH implementation [3]. We generate $K \times L$ independent hash functions $h_{i,j}$, $i = 1$ to $K$, $j = 1$ to $L$. For each hash table $j$, $j = 1$ to $L$, we concatenate $K$ hash functions $\langle h_{1,j}, h_{2,j}, h_{3,j}, \ldots, h_{K,j} \rangle$. For each data point, we compute the hash values and place them (in fact, their pointers) into the appropriate buckets of hash table $j$. In the query phase, we compute the hash value of the query data point using the same hash functions to find the bucket to which the query data point belongs, and only search for near neighbors among the data points in that bucket of hash table $j$. We repeat the process for each hash table, and the final retrieved data points are the union of the retrieved data points in all the hash tables. Ideally, the number of retrieved data points will be substantially smaller than the total number of data points. We use the term fraction retrieved to indicate the ratio of the number of retrieved data points over the total number of data points. A smaller value of fraction retrieved is more desirable.

To thoroughly evaluate the two coding schemes, we conduct extensive experiments on the two datasets, by using many combinations of $K$ (from 3 to 4) and $L$ (from […] to 2). At each choice of $(K, L)$, we vary $w$ from 0.5 to 5. Thus, the total number of combinations is large, and the experiments are very time-consuming.

There are many ways to evaluate the performance of an LSH scheme. We could specify a threshold of similarity and only count the retrieved data points whose (exact) similarity is above the threshold as "true positives". To avoid specifying a threshold, and considering the fact that in practice people often would like to retrieve the top-T nearest neighbors, we take a simple approach by computing the recall based on the top-T neighbors. For example, suppose that, among the retrieved data points, 7 belong to the top-T. Then the recall value would be 7/T = 7% if T = 100. Ideally, we hope the recalls will be as high as possible, while keeping the fraction retrieved as low as possible.

Figure 13 presents the results on Youtube for T = 100 and target recalls up to 0.99. In every panel, we set a target recall threshold. At every bin width $w$, we find the smallest fraction retrieved over a wide range of LSH parameters, $K$ and $L$. Note that, if the target recall is high (e.g., 0.95), we basically have to effectively lower the target threshold $\rho$, so that we do not have to go down the re-ranked list too far. The plots show that, for high target recalls, we need to use relatively large $w$ (e.g., $w \geq 2 \sim 3$), and for low target recalls, we should use a relatively small $w$ (e.g., $w = 1.5$).

Figures 14 to 18 present similar results on the Youtube dataset for T = 50, 20, 10, 5, 3. We only include plots with relatively high recalls, which are often more useful in practice. Figures 19 to 24 present the results on the Peekaboom dataset, which are essentially very similar to the results on the Youtube dataset.

These plots confirm the previous theoretical analysis: (i) it is essentially always better to use $h_w$ instead of $h_{w,q}$, i.e., the random offset is not needed; (ii) when using $h_w$ and the target recall is high (which essentially means the target similarity is low), it is better to use a relatively large $w$ (e.g., $w = 2 \sim 3$); (iii) when using $h_w$ and the target recall is low, it is better to use a smaller $w$ (e.g., $w = 1.5$); (iv) when using $h_w$, the influence of $w$ is not large as long as it is in a reasonable range, which is important in practice.
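The (K, L)-LSH procedure and the two evaluation quantities of this section (recall over the top-T neighbors and fraction retrieved) can be sketched as follows, using the uniform quantization $h_w$ as the hash function. This is a minimal illustration on synthetic unit-norm data; the function names, the data sizes, the seed, and the values of `K`, `L`, `w`, and `T` are all assumptions and do not reproduce the experimental setup of the paper.

```python
import math
import random

def unit(x):
    """Rescale x to unit l2 norm."""
    n = math.sqrt(sum(a * a for a in x))
    return [a / n for a in x]

def bucket_key(x, directions, w):
    """Concatenate K codes floor(<x, r> / w) into one bucket key."""
    return tuple(math.floor(sum(a * b for a, b in zip(x, r)) / w) for r in directions)

def build_tables(data, K, L, w, rng):
    """Build L hash tables, each keyed by K independent quantized projections."""
    D = len(data[0])
    directions = [[[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(K)]
                  for _ in range(L)]
    tables = [{} for _ in range(L)]
    for idx, x in enumerate(data):
        for dirs, table in zip(directions, tables):
            table.setdefault(bucket_key(x, dirs, w), []).append(idx)
    return directions, tables

def retrieve(q, directions, tables, w):
    """Union, over all L tables, of the bucket that the query falls into."""
    found = set()
    for dirs, table in zip(directions, tables):
        found.update(table.get(bucket_key(q, dirs, w), []))
    return found

def recall_and_fraction(q, data, found, T):
    """Recall over the T most similar points (by inner product) and fraction retrieved."""
    sims = [sum(a * b for a, b in zip(q, x)) for x in data]
    top = sorted(range(len(data)), key=lambda i: sims[i], reverse=True)[:T]
    return sum(i in found for i in top) / T, len(found) / len(data)

rng = random.Random(1)                        # fixed seed, only for reproducibility
data = [unit([rng.gauss(0.0, 1.0) for _ in range(20)]) for _ in range(500)]
directions, tables = build_tables(data, K=4, L=8, w=1.5, rng=rng)

query = data[0]                               # querying with a stored point, for illustration
found = retrieve(query, directions, tables, w=1.5)
recall, fraction = recall_and_fraction(query, data, found, T=10)
print(recall, fraction)
```

In an evaluation along the lines described above, this would be repeated over many query points and many $(K, L, w)$ settings, keeping for each target recall the setting with the smallest fraction retrieved.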

Figure 13: Youtube Top 100. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-100) with respect to $w$ for both coding schemes $h_w$ and $h_{w,q}$. Lower is better.

Figure 14: Youtube Top 50. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-50) with respect to $w$ for both coding schemes $h_w$ and $h_{w,q}$.

Figure 15: Youtube Top 20. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-20) with respect to $w$ for both coding schemes $h_w$ and $h_{w,q}$.

Figure 16: Youtube Top 10. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-10) with respect to $w$ for both coding schemes $h_w$ and $h_{w,q}$.

Figure 17: Youtube Top 5. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-5) with respect to $w$ for both coding schemes $h_w$ and $h_{w,q}$.

Figure 18: Youtube Top 3. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-3) with respect to $w$ for both coding schemes $h_w$ and $h_{w,q}$.

Figure 19: Peekaboom Top 100. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-100) with respect to $w$ for both coding schemes $h_w$ and $h_{w,q}$.

Figure 20: Peekaboom Top 50. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-50) with respect to $w$ for both coding schemes $h_w$ and $h_{w,q}$.

Figure 21: Peekaboom Top 20. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-20) with respect to $w$ for both coding schemes $h_w$ and $h_{w,q}$.

Figure 22: Peekaboom Top 10. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-10) with respect to $w$ for both coding schemes $h_w$ and $h_{w,q}$.

Figure 23: Peekaboom Top 5. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-5) with respect to $w$ for both coding schemes $h_w$ and $h_{w,q}$.

Figure 24: Peekaboom Top 3. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-3) with respect to $w$ for both coding schemes $h_w$ and $h_{w,q}$.

6 Conclusion

We have compared two quantization (coding) schemes for random projections in the context of sublinear time approximate near neighbor search. The recently proposed scheme based on uniform quantization [4] is simpler than the influential existing work [1] (which used uniform quantization with a random offset). Our analysis confirms that, under the general theory of LSH, the new scheme [4] is simpler and more accurate than [1]. In other words, the step of random offset in [1] is not needed and may hurt the performance.

Our analysis provides practical guidelines for using the proposed coding scheme to build hash tables. Our recommendation is to use a bin width of about $w = 1.5$ when the target similarity is high and a bin width of about $w = 3$ when the target similarity is not that high. In addition, with the proposed coding scheme based on uniform quantization (without the random offset), the performance is not very sensitive to $w$, which makes the scheme convenient in practical applications.

References

[1] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, Brooklyn, NY, 2004.
[2] Jerome H. Friedman, F. Baskett, and L. Shustek. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000-1006, 1975.
[3] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604-613, Dallas, TX, 1998.
[4] Ping Li, Michael Mitzenmacher, and Anshumali Shrivastava. Coding for random projections. Technical report, arXiv:1308.2218, 2013.
[5] Ping Li, Michael Mitzenmacher, and Anshumali Shrivastava. Coding for random projections and nonlinear estimators. Technical report, 2014.


