arxiv: v1 [cs.ds] 19 Dec 2016


Similarity preserving compressions of high dimensional sparse data

Raghav Kulkarni (LinkedIn Bangalore), Rameshwar Pratap (IIIT Bangalore)

ABSTRACT

The rise of the internet has resulted in an explosion of data consisting of millions of articles, images, songs, and videos. Most of this data is high dimensional and sparse. The need to perform an efficient search for similar objects in such high dimensional big datasets is becoming increasingly common. Even with the rapid growth in computing power, the brute-force search for such a task is impractical and at times impossible. Therefore algorithmic solutions such as Locality Sensitive Hashing (LSH) are required to achieve the desired efficiency in search. Any similarity search method that achieves the efficiency uses one or both of the following methods:
1. Compress the data by reducing its dimension while preserving the similarities between any pair of data-objects.
2. Limit the search space by grouping the data-objects based on their similarities.
Typically 2 is obtained as a consequence of 1.

Our focus is on high dimensional sparse data, where the standard compression schemes, such as LSH for Hamming distance (Gionis, Indyk and Motwani 7), become inefficient in both 1 and 2 due to at least one of the following reasons:
1. No efficient compression schemes mapping binary vectors to binary vectors.
2. Compression length is nearly linear in the dimension and grows inversely with the sparsity.
3. Randomness used grows linearly with the product of dimension and compression length.

We propose an efficient compression scheme mapping binary vectors into binary vectors and simultaneously preserving Hamming distance and Inner Product. Our schemes avoid all the above mentioned drawbacks for high dimensional sparse data. The length of our compression depends only on the sparsity and is independent of the dimension of the data. Moreover our schemes provide a one-shot solution for Hamming distance and Inner Product, and work in the streaming setting as well.
In contrast with the local projection strategies used by most of the previous schemes, our scheme combines (using sparsity) the following two strategies:
1. Partitioning the dimensions into several buckets,
2. Then obtaining global linear summaries in each of these buckets.
We generalize our scheme for real-valued data and obtain compressions for Euclidean distance, Inner Product, and k-way Inner Product.

1. INTRODUCTION

Technological advancements have led to the generation of huge amounts of data over the web such as texts, images, audios, and videos. Needless to say, most of these datasets are high dimensional. Searching for similar data-objects in such massive and high dimensional datasets is becoming a fundamental subroutine in many scenarios like clustering, classification, nearest neighbors, ranking etc. However, due to the curse of dimensionality a brute-force way to compute the similarity scores on such data sets is very expensive and at times infeasible. Therefore it is quite natural to investigate techniques that compress the dimension of the dataset while preserving the similarity between data objects. There are various compressing schemes that have already been studied for different similarity measures. We would like to emphasize that any such compressing scheme is useful only when it satisfies the following guarantee: when data objects are nearby under the desired similarity measure, they should remain nearby in the compressed version, and when they are far, they should remain far in the compressed version. In the case of probabilistic compression schemes the above should happen with high probability. Below we discuss a few such notable schemes. In this work we consider binary and real-valued datasets. For binary data we focus on Hamming distance and Inner product, while for real-valued data we focus on Euclidean distance and Inner product.

1.1 Examples of similarity preserving compressions

Data objects in a dataset can be considered as points (vectors) in high dimensional space. Suppose we have n vectors (binary or real-valued) in d-dimensional space.
Gionis, Indyk, Motwani 7 proposed a data structure to solve the approximate nearest neighbor (c-NN) problem in binary data for Hamming distance. Their scheme is popularly known as Locality Sensitive Hashing (LSH). Intuitively, their data structure can be viewed as a compression of a binary vector, which is obtained by projecting it on randomly chosen bit positions.

The JL transform 0 suggests a compressing scheme for real-valued data. For any $\epsilon > 0$, it compresses the dimension of the points from $d$ to $O(\log n / \epsilon^2)$ while preserving the Euclidean distance between any pair of points within a factor of $(1 \pm \epsilon)$.

Given two vectors $u, v \in \mathbb{R}^d$, the inner product similarity between them is defined as $\langle u, v \rangle := \sum_{i=1}^{d} u[i]\, v[i]$. Ata Kabán suggested a compression scheme for real data which preserves inner product via random projection. On the contrary, if the input data is binary, and it is desirable to get the compression only in binary data, then to the best of our knowledge no such compression scheme is available which achieves a non-trivial compression. However, with some sparsity assumption (a bound on the number of 1s), there are some schemes available which via asymmetric padding (adding a few extra bits to the vector) reduce the inner product similarity of the original data to the Hamming 3, and Jaccard similarity (see Preliminaries for a definition) 4. Then the compression scheme for Hamming or Jaccard can be applied on the padded version of the data.

Binary data can also be viewed as a collection of sets; then the underlying similarity measure of interest can be the Jaccard similarity. Broder et al. 5, 6, 4 suggested a compression scheme for preserving Jaccard similarity between sets which is popularly known as Minwise permutations.

1.2 Our focus: High dimensional sparse data

In this work, we focus on high dimensional sparse data. In many real-life scenarios, a data object is represented as a very high-dimensional but sparse vector, i.e. the number of all possible attributes (features) is huge, however, each data object has only a very small subset of attributes. For example, in the bag-of-words representation of text data, the number of dimensions equals the size of the vocabulary, which is large. However, each data point, say a document, contains only a small number of words in the vocabulary, leading to a sparse vector representation. The bag-of-words representation is also commonly used for image data. Data-sparsity is commonly prevalent in audio and video data as well.

1.3 Shortcomings of earlier schemes for high dimensional sparse data

The quality of any compression scheme can be evaluated based on the following two parameters: the compression-length, and the amount of randomness required for the compression.
The compression-length is defined as the dimension of the data after compression. Ideally, it is desirable to have both of these be small while preserving a desired accuracy in the compression. Below we will notice that most of the above mentioned compression schemes become infeasible in the case of high dimensional sparse datasets as their compression-length is very high, and the amount of randomness required for the compression is quite large.

Hamming distance: Consider the problem of finding c-NN (see Definition 10) for Hamming distance in binary data. In the LSH scheme, the size of the hashtable determines the compression-length. The size of the hashtable is $K = O(\log_{1/p_2} n)$ (see Definition 11). If $r = O(1)$, then the size of the hashtable is $K = O(\log_{1/p_2} n) = O\left(\frac{d}{cr} \log n\right) = O(d \log n)$, which is linear in the dimension. Further, in order to randomly choose a bit position between 1 and d, it is required to generate $O(\log d)$ many random bits. Moreover, as the size of a hash table is K, and the number of hash tables is L, it is required to generate $O(KL \log d)$ many random bits to create the hashtables, which becomes quite large especially when K is linear in d.

Euclidean distance: In order to achieve a compression that preserves the distance between any pair of points, due to the JL transform 0, it is required to project the input matrix on a random matrix of dimensions $d \times k$, where $k = O(\log n / \epsilon^2)$. Each entry of the random matrix is chosen from $\{\pm 1\}$ with probability 1/2 (see Achlioptas), or from a normal distribution (see 0). The compression-length in this scheme is $O(\log n / \epsilon^2)$, and it requires $O(d \log n / \epsilon^2)$ randomness.

Inner product: No compression scheme is known which compresses binary data into binary data while preserving Inner product. However, using the asymmetric padding scheme of 3, 4 it is possible to get a compression via the Hamming or Jaccard similarity measure; the shortcomings of Jaccard and Hamming then carry forward to such a scheme. Further, in the case of real-valued data the compression scheme of Ata Kabán has compression-length $O(\log n / \epsilon^2)$, and requires $O(d \log n / \epsilon^2)$ randomness.
Jaccard Similarity: Minhash permutations 5, 6, 4 suggest a compression scheme for preserving Jaccard similarity for a collection of sets. A major disadvantage of this scheme is that for high dimensional data computing permutations is very expensive, and further, in order to achieve a reasonable accuracy in the compression a large number of repetitions might be required. A further disadvantage is that it requires a substantially large amount of randomness that grows polynomially in the dimension.

Lack of good binary to binary compression schemes. To summarize the above, there are two main compression schemes currently available for binary to binary compression. The first one is LSH and the second one is the JL-transform. LSH requires the compression size to be linear in the dimension, and the JL-transform can achieve logarithmic compression size but it will compress binary vectors to real vectors. The analogue of the JL-transform which compresses binary vectors to binary vectors requires the compression-length to be linear in the number of data points (see Lemma 7). Since both the dimension and the number of data points can be large, these schemes are inefficient. In this paper we propose an efficient binary to binary compression scheme for sparse data which works simultaneously for both Hamming distance and Inner Product.
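The bit-sampling view of LSH discussed in this subsection (compressing a binary vector by projecting it on randomly chosen bit positions) can be illustrated with a minimal sketch. The function names and the choice of sampling with replacement are assumptions made for illustration; they are not taken from the paper or from any library.

```python
import random

def sample_positions(d, k, seed=0):
    """Pick k bit positions from range(d), uniformly and with replacement (an assumption)."""
    rng = random.Random(seed)
    return [rng.randrange(d) for _ in range(k)]

def project(u, positions):
    """Compress the binary vector u by keeping only the bits at the sampled positions."""
    return [u[i] for i in positions]

def hamming(u, v):
    """Number of positions where u and v differ."""
    return sum(a != b for a, b in zip(u, v))

# Two sparse vectors in d = 1000 dimensions with three 1s each.
d, k = 1000, 20
positions = sample_positions(d, k, seed=1)
u = [0] * d
v = [0] * d
for i in (3, 250, 777):
    u[i] = 1
for i in (3, 251, 900):
    v[i] = 1

# For sparse data most sampled positions are 0 in both vectors, so a short
# projection carries little information about the original distance.
print(hamming(u, v), hamming(project(u, positions), project(v, positions)))
```

Each sampled position costs about log d random bits, which is where the O(KL log d) randomness estimate above comes from.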

1.4 Our contribution

In this work we present a compressing scheme for high dimensional sparse data. In contrast with the local projection strategies used by most of the previous schemes such as LSH 9, 7 and JL 0, our scheme combines (using sparsity) the following two step approach:
1. Partitioning the dimensions into several buckets,
2. Then obtaining global linear summaries of each of these buckets.
We present our results below.

1.4.1 For binary data

For binary data, our compression scheme provides a one-shot solution for both Hamming distance and Inner product: the compressed data preserves both. Moreover, the compression-length depends only on the sparsity of the data and is independent of the dimension of the data. We first informally state our compression scheme for binary data; see Definition 12 for a formal definition. Given a binary vector $u \in \{0,1\}^d$, our scheme compresses it into an $N$-dimensional binary vector (say) $u' \in \{0,1\}^N$ as follows, where $N$ is to be specified later. We randomly map each bit position (say) $\{i\}_{i=1}^{d}$ of the original data to an integer $\{j\}_{j=1}^{N}$. To compute the $j$-th bit of the compressed vector $u'$ we check which bit positions have been mapped to $j$, we compute the parity of the bits located at those positions, and assign it to $u'[j]$.

[Figure: an example of the binary compression.]

In the following theorems let $\psi$ denote the maximum number of 1s in any vector. We state our results for binary data as follows:

Theorem 1. Consider a set $U$ of binary vectors $\{u_i\}_{i=1}^{n} \subseteq \{0,1\}^d$, a positive integer $r$, and $\epsilon > 0$. If $\epsilon r > 3 \log n$, we set $N = O(\psi^2)$; if $\epsilon r < 3 \log n$, we set $N = O(\psi^2 \log^2 n)$, and compress them into a set $U'$ of binary vectors $\{u_i'\}_{i=1}^{n} \subseteq \{0,1\}^N$ using our Binary Compression Scheme. Then for all $u_i, u_j \in U$,
if $d_H(u_i, u_j) < r$, then $\Pr[d_H(u_i', u_j') < r] = 1$,
if $d_H(u_i, u_j) \geq (1+\epsilon) r$, then $\Pr[d_H(u_i', u_j') < r] < 1/n$.

Theorem 2. Consider a set $U$ of binary vectors $\{u_i\}_{i=1}^{n} \subseteq \{0,1\}^d$, a positive integer $r$, and $\epsilon > 0$. If $\epsilon r > 3 \log n$, we set $N = O(\psi^2)$; if $\epsilon r < 3 \log n$, we set $N = O(\psi^2 \log^2 n)$, and compress them into a set $U'$ of binary vectors $\{u_i'\}_{i=1}^{n} \subseteq \{0,1\}^N$ using our Binary Compression Scheme. Then for all $u_i, u_j \in U$ the following is true with probability at least $1 - 1/n$:
$(1-\epsilon)\, IP(u_i, u_j) \leq IP(u_i', u_j') \leq (1+\epsilon)\, IP(u_i, u_j)$.

In the following theorem, we strengthen our result of Theorem 1, and show a compression bound which is independent of the dimension and the sparsity, and depends only on the Hamming distance between the vectors. However, we could show our result only in expectation, and only for a pair of vectors.

Theorem 3. Consider two binary vectors $u, v \in \{0,1\}^d$, which get compressed into vectors $u', v' \in \{0,1\}^N$ using our Binary Compression Scheme. If we set $N = O(r^2)$, then
if $d_H(u, v) < r$, then $\Pr[d_H(u', v') < r] = 1$, and
if $d_H(u, v) \geq 4r$, then $E[d_H(u', v')] > 2r$.

Remark 1. To the best of our knowledge, ours is the first efficient binary to binary compression scheme for preserving Hamming distance and Inner product. For Hamming distance, in fact, our scheme obtains the no-false-negative guarantee analogous to the one obtained in a recent paper by Pagh.

Remark 2. When r is constant, as mentioned above, LSH 7 requires compression length linear in the dimension. However, due to Theorem 3, our compression length is only constant.

Remark 3. Our compression length is $O(\psi^2 \log^2 n)$, which is independent of the dimension d, whereas other schemes such as LSH may require the compression length to grow linearly in d, and the analogue of the JL-transform for binary to binary compression requires compression length growing linearly in n (see Lemma 7).

Remark 4. The randomness used by our compression scheme is $O(d \log N)$, which grows logarithmically in the compression length N, whereas the JL-transform uses randomness growing linearly in the compression length. For all-pair compression for n data points we use $O(d(\log \psi + \log\log n))$ randomness, which grows logarithmically in the sparsity and sub-logarithmically in the number of data points.

1.4.2 For real-valued data

We generalize our scheme for real-valued data also and obtain compressions for Euclidean distance, Inner product, and k-way Inner product.
We first state our compression scheme as follows: Given a vector $a \in \mathbb{R}^d$, our scheme compresses it into an $N$-dimensional vector (say) $\alpha$ as follows. We randomly map each coordinate position (say) $\{i\}_{i=1}^{d}$ of the original data to an integer $\{j\}_{j=1}^{N}$. To compute the $j$-th coordinate of the compressed vector $\alpha$ we check which coordinates of the original data have been mapped to $j$, we multiply the numbers located at those positions with a random variable $x_i$, compute their summation, and assign it to $\alpha[j]$, where $x_i$ takes a value in $\{-1, +1\}$ with probability 1/2 each.

[Figure: an example of the real-valued compression.]

In the following we present our main result for real-valued data, which is a compression bound for preserving the k-way inner product. For a set of k vectors $\{\alpha_i\}_{i=1}^{k} \subseteq \mathbb{R}^d$, their k-way inner product is defined as
$\langle \alpha_1 \alpha_2 \ldots \alpha_k \rangle = \sum_{j=1}^{d} \alpha_1[j]\, \alpha_2[j] \cdots \alpha_k[j]$,
where $\alpha_1[j]$ denotes the $j$-th coordinate of the vector $\alpha_1$.

Theorem 4. Consider a set of k vectors $\{a_i\}_{i=1}^{k} \subseteq \mathbb{R}^d$, which get compressed into vectors $\{\alpha_i\}_{i=1}^{k} \subseteq \mathbb{R}^N$ using our Real Compression Scheme. If we set $N = 10\Psi^k/\epsilon^2$, where $\Psi = \max\{\|a_i\|^2\}_{i=1}^{k}$ and $\epsilon > 0$, then the following holds:
$\Pr\left[\,\left|\langle \alpha_1 \alpha_2 \ldots \alpha_k \rangle - \langle a_1 a_2 \ldots a_k \rangle\right| > \epsilon\,\right] < 1/10$.

Remark 5. An advantage of our compression scheme is that it can be constructed in the streaming model quite efficiently. The only requirement is that in the case of binary data the maximum number of 1s in the vectors of the stream should be bounded, and in the case of real-valued data the norm of the vectors should be bounded.

1.5 Comparison with previous work

A major advantage of our compression scheme is that it provides a one-shot solution for different similarity measures: the binary compression scheme preserves both Hamming distance and Inner product, and the real-valued compression scheme preserves Euclidean distance, Inner product, and k-way Inner product. The second main advantage of our compression scheme is that for binary data it gives a binary to binary compression, as opposed to the binary to real compression by the JL-transform. The third main advantage is that its compression size is independent of the dimension and depends only on the sparsity, as opposed to the scheme of Gionis, Indyk, Motwani 7, which requires linear size compression. For real-valued data our results are weaker compared to previously known works but they generalize to the k-way inner product, which none of the previous work does. Another advantage of our real-valued compression scheme is that when the number of points is small (constant), then for preserving pairwise Inner Product or Euclidean distance we have a clear advantage in the amount of randomness required for the compression: the randomness required by our scheme grows logarithmically in the compression length, whereas the other schemes require randomness which grows linearly in the compression length.

Potential applications

A potential use of our result is to improve approximate nearest neighbor search via composing with LSH. Due to the curse of dimensionality many search algorithms scale poorly in high dimensional data. So, if it is possible to get a succinct compression of the data while preserving the similarity score between pairs of data points, then such a compression naturally helps efficient search. One can first compress the input such that the desired similarity measure is preserved, and then apply a collision based hashing algorithm such as LSH 7, 9 for efficient approximate nearest neighbor (c-NN) search on the compressed data. As our compression scheme provides a guarantee similar to that of Definition 11, one can construct a data structure for LSH for the approximate nearest neighbor problem. Thus, our similarity preserving compression scheme leads to an efficient approximate nearest neighbor search.

There are many similarity based algorithmic methods used in large scale learning and information retrieval, e.g., frequent itemset mining, ROCK clustering 8. One could potentially obtain an algorithmic speed-up in these methods via our compression schemes. Recently, compression based on LSH for inner product has been used to speed up forward and back-propagation in neural networks 5. One could potentially use our scheme to take advantage of sparsity and obtain a further speed-up.

Organization of the paper

In Section 2, we present the necessary background which helps to understand the paper. In Section 3, we present our compression scheme for high dimensional sparse binary data. In Section 4, we present our compression scheme for high dimensional sparse real data. Finally in Section 5, we conclude our discussion, and state some possible extensions of the work.
2. BACKGROUND

Notations
$N$: number of coordinates/bit positions in the compressed data.
$\psi$: upper bound on the number of 1s in any binary vector.
$\Psi$: upper bound on the norm of any real-valued vector.
$\|a\|$: $l_2$ norm of the vector a.
$a[i]$: i-th bit position (coordinate) of the binary (real-valued) vector a.
$d_H(u, v)$: Hamming distance between binary vectors u and v.
$IP(a, b)$: Inner product between binary/real-valued vectors a and b.

2.1 Probability background

Definition 1. The Variance of a random variable X, denoted $Var(X)$, is defined as the expected value of the squared deviation of X from its mean:
$Var(X) = E[(X - E(X))^2] = E(X^2) - E(X)^2$.

Definition 2. Let X and Y be jointly distributed random variables. The Covariance of X and Y, denoted $Cov(X, Y)$, is defined as
$Cov(X, Y) = E[(X - E(X))(Y - E(Y))]$.

Fact 3. Let X be a random variable and $\lambda$ be a constant. Then, $Var(\lambda + X) = Var(X)$ and $Var(\lambda X) = \lambda^2 Var(X)$.

Fact 4. Let $X_1, X_2, \ldots, X_n$ be a set of n random variables. Then,
$Var\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} Var(X_i) + \sum_{i \neq j} Cov(X_i, X_j)$.

Fact 5. Let X and Y be a pair of random variables and $\lambda$ be a constant. Then, $Cov(\lambda X, \lambda Y) = \lambda^2 Cov(X, Y)$.

Fact 6 (Chebyshev's inequality). Let X be a random variable having finite mean and finite non-zero variance $\sigma^2$. Then for any real number $\lambda > 0$,
$\Pr[\,|X - E(X)| \geq \lambda\sigma\,] \leq \frac{1}{\lambda^2}$.

2.2 Similarity measures and their respective compression schemes

Hamming distance. Let $u, v \in \{0,1\}^d$ be two binary vectors; then the Hamming distance between these two vectors is the number of bit positions where they differ. To the best of our knowledge, there does not exist any non-trivial compression scheme which provides compression guarantees similar to those the JL-lemma provides for Euclidean distance. In the following lemma, we show that for a set of n binary vectors an analogous JL-type binary to binary compression (if it exists) may require compression length linear in n. Further, a collision based hashing scheme such as LSH due to Gionis et al. 7 (see Subsection 2.3) can be considered as a binary to binary compression scheme, where the size of the hashtable determines the compression-length. (A collision occurs when two objects hash to the same hash value.) Their technique includes randomly choosing bit positions and checking if the query and input vectors match exactly at those bit positions.

Lemma 7. Consider a set of n binary vectors; then an analogous JL-type binary to binary compression (if it exists) may require compression length linear in n.

Proof. Consider a set of n binary vectors $\{e_i\}_{i=1}^{n}$ (standard unit vectors), and the zero vector $e_0$.
The Hamming distance between $e_0$ and any $e_i$ is 1, and the Hamming distance between any pair of vectors $e_i$ and $e_j$ for $i \neq j$ is 2. Let f be a map which maps these points into binary vectors of dimension k while preserving the distance between any pair of vectors within a factor of $1 \pm \epsilon$, for a parameter $\epsilon > 0$. Thus, these n points $\{f(e_i)\}_{i=1}^{n}$ are within distance at most $1 + \epsilon$ from $f(e_0)$, and any two points $f(e_i)$ and $f(e_j)$ for $i \neq j$ are at distance at least $2(1 - \epsilon)$. However, the total number of points at distance at most $1 + \epsilon$ from $f(e_0)$ is $O(k^{1+\epsilon})$, and the distance between any two points $f(e_i)$ and $f(e_j)$ for $i \neq j$ is non-zero, so each point of $\{e_i\}_{i=1}^{n}$ has its distinct image. Thus $O(k^{1+\epsilon})$ should be at least n, which gives $k = \Omega(n^{1/(1+\epsilon)})$. Thus the compression length can be linear in n.

Euclidean distance. Given two vectors $a, b \in \mathbb{R}^d$, the Euclidean distance between them is denoted as $\|a, b\|$ and defined as $\sqrt{\sum_{i=1}^{d} (a[i] - b[i])^2}$. A classical result by Johnson and Lindenstrauss 0 suggests a compressing scheme which, for any set D of n vectors in $\mathbb{R}^d$, preserves the pairwise Euclidean distance between any pair of vectors in D.

Lemma 8 (JL transform 0). For any $\epsilon \in (0, 1)$, and any integer n, let k be a positive integer such that $k = O(\log n / \epsilon^2)$. Then for any set D of n vectors in $\mathbb{R}^d$, there is a map $f : \mathbb{R}^d \to \mathbb{R}^k$ such that for any pair of vectors a, b in D:
$(1 - \epsilon)\,\|a, b\|^2 \leq \|f(a), f(b)\|^2 \leq (1 + \epsilon)\,\|a, b\|^2$.
Furthermore, the mapping f can be found in randomized polynomial time.

In several follow-up works on the JL lemma, the function f has been regarded as a random projection matrix $R \in \mathbb{R}^{d \times k}$, and can be constructed element-wise using Gaussian entries due to Indyk and Motwani 9, or uniform $\{+1, -1\}$ entries due to Achlioptas.

Inner product. Given two vectors $u, v \in \mathbb{R}^d$, the Inner product $\langle u, v \rangle$ between them is defined as $\langle u, v \rangle := \sum_{i=1}^{d} u[i]\, v[i]$. Compression schemes which preserve Inner product have been studied quite a lot in recent times. In the case of binary data, along with some sparsity assumption (a bound on the number of 1s), there are some schemes available which by padding (adding a few extra bits to the vector) reduce the Inner product of the original data to the Hamming 3, and Jaccard similarity 4.
Then the compression scheme for Hamming or Jaccard can be applied on the padded version of the data. Similarly, in the case of real-valued data, a similar padding technique is known which reduces Inner product to Euclidean distance 3. Recently, an interesting work by Ata Kabán suggested a compression scheme via the random projection method. Their scheme approximately preserves the Inner Product between any pair of input points and their compression bound matches the bound of the JL-transform 0.

Jaccard similarity. Binary vectors can also be considered as sets over the universe of all possible features, where a set contains only those elements which have non-zero entries in the corresponding binary vector. For example two vectors $u, v \in \{0,1\}^d$ can be viewed as two sets $u, v \subseteq \{1, 2, \ldots, d\}$. Here, the underlying similarity measure of interest is the Jaccard similarity, which is defined as follows: $JS(u, v) = \frac{|u \cap v|}{|u \cup v|}$. A celebrated work by Broder et al. 5, 6, 4 suggested a technique to compress a collection of sets while preserving the Jaccard similarity between any pair of sets. Their technique includes taking a random permutation of $\{1, 2, \ldots, d\}$ and assigning to each set the element which maps to the minimum under that permutation. This compression scheme is popularly known as Minwise hashing.

Definition 9 (Minwise Hash function). Let $\pi$ be a permutation over $\{1, \ldots, d\}$; then for a set $u \subseteq \{1, \ldots, d\}$, $h_\pi(u) = \arg\min_{i} \pi(i)$ for $i \in u$. Then due to 5, 6, 4,
$\Pr[h_\pi(u) = h_\pi(v)] = \frac{|u \cap v|}{|u \cup v|}$.

2.3 Locality Sensitive Hashing

LSH suggests an algorithm, or alternatively a data structure, for efficient approximate nearest neighbor (c-NN) search in high dimensional space. We formally state it as follows:

Definition 10 (c-approximate Nearest Neighbor (c-NN)). Let D be a set of points in $\mathbb{R}^d$, and $Sim(\cdot,\cdot)$ be a desired similarity measure. Then for parameters $S, c > 0$, the c-NN problem is to construct a data structure that, given any query point q, reports a cS-near neighbor of q in D if there is an S-near neighbor of q in D. Here, we say a point $x \in D$ is an S-near neighbor of q if $Sim(q, x) > S$.

In the following we define the concept of locality sensitive hashing (LSH), which suggests a data structure to solve the c-NN problem.

Definition 11 (Locality sensitive hashing 9). Let D be a set of n vectors in $\mathbb{R}^d$, and U be the hashing universe. Then, a family H of functions from D to U is called $(S, cS, p_1, p_2)$-sensitive for a similarity measure $Sim(\cdot,\cdot)$
if for any $x, y \in D$,
if $Sim(x, y) \geq S$, then $\Pr_{h \in H}[h(x) = h(y)] \geq p_1$,
if $Sim(x, y) \leq cS$, then $\Pr_{h \in H}[h(x) = h(y)] \leq p_2$.

Clearly, any such scheme is interesting only when $p_1 > p_2$, and $c < 1$. Let K, L be the parameters of the data structure for LSH, where K is the number of hashes in each hash table, and L is the number of hash tables; then due to 9, 7, we have $K = O(\log_{1/p_2} n)$ and $L = O(n^\rho \log n)$, where $\rho = \frac{\log p_1}{\log p_2}$. Thus, given a family of $(S, cS, p_1, p_2)$-sensitive hash functions, and using the result of 9, 7, one can construct a data structure for c-NN with $O(n^\rho \log n)$ query time and $O(n^{1+\rho})$ space.

2.3.1 How to convert similarity preserving compression schemes to LSH?

LSH schemes for various similarity measures can be viewed as first compressing the input such that the desired similarity measure is preserved, and then applying collision based hashing on top of it. If any similarity preserving compression scheme provides a guarantee similar to that of Definition 11, then for a similarity threshold S and parameter c, one can construct a data structure for LSH (hash-tables with parameters K and L) for the c-NN problem via 9.

3. A COMPRESSION SCHEME FOR HIGH DIMENSIONAL SPARSE BINARY DATA

We first formally define our Compression Scheme as follows:

Definition 12 (Binary Compression Scheme). Let N be the number of buckets. For i = 1 to d, we randomly assign the i-th position to a bucket number $b(i) \in \{1, \ldots, N\}$. Then a vector $u \in \{0,1\}^d$ is compressed into a vector $u' \in \{0,1\}^N$ as follows:
$u'[j] = \sum_{i : b(i) = j} u[i] \pmod 2$.

Note 13. For brevity we denote the Binary Compression Scheme as BCS.

Some intuition. Consider two binary vectors $u, v \in \{0,1\}^d$. We call a bit position active if at least one of the vectors u and v has value 1 in that position. Let $\psi$ be the maximum number of 1s in any vector; then there could be at most $2\psi$ active positions shared between the vectors u and v. Further, using the BCS, let u and v get compressed into binary vectors $u', v' \in \{0,1\}^N$. In the compressed vectors, we call a particular bit position pure if the number of active positions mapped to that position is at most one; otherwise we call it corrupted.
It is easy to see that the contribution of the pure bit positions in $u', v'$ towards Hamming distance (or Inner product similarity) is exactly equal to the contribution of the bit positions in u, v which get mapped to the pure bit positions. The maximum possible number of corrupted bits in the compressed data is $\psi$, because in the worst case it is possible that all the $2\psi$ active bit positions got paired up during compression. The deviation of the Hamming distance (or Inner product similarity) between $u'$ and $v'$ from that of u and v corresponds to the number of corrupted bit positions shared between $u'$ and $v'$. The lemma below analyses it.

[Figure: an example illustrating pure and corrupted bit positions.]

Lemma 14. Consider two binary vectors $u, v \in \{0,1\}^d$, which get compressed into vectors $u', v' \in \{0,1\}^N$ using the BCS, and suppose $\psi$ is the maximum number of 1s in any vector. Then for an integer $r \geq 1$, and $\epsilon > 0$, the probability that $u'$ and $v'$ share more than $\epsilon r$ corrupted positions is at most $\left(\frac{2\psi}{\sqrt{N}}\right)^{\epsilon r}$.

Proof. We first calculate the probability that a particular bit position gets corrupted between $u'$ and $v'$. As there are at most $2\psi$ active positions shared between the vectors u and v, the number of ways of pairing two active positions from $2\psi$ active positions is at most $\binom{2\psi}{2}$, and such a pairing will result in a corrupted bit position in $u'$ or $v'$. Then, the probability that a particular bit position in $u'$ or $v'$ gets corrupted is at most $\binom{2\psi}{2}/N \leq \frac{4\psi^2}{N}$. Further, if the deviation of the Hamming distance (or Inner product similarity) between $u'$ and $v'$ from that of u and v is more than $\epsilon r$, then at least $\epsilon r$ corrupted positions are shared between $u'$ and $v'$, which implies that at least $\epsilon r$ pairs of active positions in u and v got paired up during compression. The number of possible ways of pairing $\epsilon r$ active positions from $2\psi$ active positions is at most $(2\psi)^{\epsilon r}$. Since the probability that a pair of active positions got mapped to the same bit position in the compressed data is $\frac{1}{N}$, the probability that $\epsilon r$ pairs of active positions got mapped to $\epsilon r$ distinct bit positions in the compressed data is at most $\left(\frac{1}{\sqrt{N}}\right)^{\epsilon r}$. Thus, by the union bound, the probability that at least $\epsilon r$ corrupted bit positions are shared between $u'$ and $v'$ is at most $\frac{(2\psi)^{\epsilon r}}{(\sqrt{N})^{\epsilon r}} = \left(\frac{2\psi}{\sqrt{N}}\right)^{\epsilon r}$.

In the following lemma we generalize the above result to a set of n binary vectors. We suggest a compression bound such that any pair of compressed vectors shares only a very small number of corrupted bits, with high probability.

Lemma 15. Consider a set U of n binary vectors $\{u_i\}_{i=1}^{n} \subseteq \{0,1\}^d$, which get compressed into a set $U'$ of binary vectors $\{u_i'\}_{i=1}^{n} \subseteq \{0,1\}^N$ using the BCS. Then for any positive integer r, and $\epsilon > 0$,
if $\epsilon r > 3 \log n$, and we set $N = 16\psi^2$, then the probability that some pair $u_i', u_j' \in U'$ shares more than $\epsilon r$ corrupted positions is at most $1/n$;
if $\epsilon r < 3 \log n$, and we set $N = 144\psi^2 \log^2 n$, then the probability that some pair $u_i', u_j' \in U'$ shares more than $\epsilon r$ corrupted positions is at most $1/n$.

Proof. In the first case, for a fixed pair of compressed vectors $u_i'$ and $u_j'$, due to Lemma 14, the probability that they share more than $\epsilon r$ corrupted positions is at most $\left(\frac{2\psi}{\sqrt{N}}\right)^{\epsilon r}$. If $\epsilon r > 3 \log n$, and $N = 16\psi^2$, then the above probability is at most
$\left(\frac{2\psi}{\sqrt{16\psi^2}}\right)^{\epsilon r} = \left(\frac{2\psi}{4\psi}\right)^{\epsilon r} < \left(\frac{1}{2}\right)^{3 \log n} \leq \frac{1}{n^3}$.
As there are at most $\binom{n}{2}$ pairs of vectors, the probability that some pair of compressed vectors shares more than $\epsilon r$ corrupted positions is at most $\binom{n}{2}\frac{1}{n^3} < \frac{1}{n}$.

In the second case, as $\epsilon r < 3 \log n$, we cannot upper bound the desired probability as in the first case. Here we use a trick: in the input data we replicate each bit position $3 \log n$ times, which turns a d-dimensional vector into a $3d \log n$-dimensional one, and as a consequence the Hamming distance (or Inner product similarity) is also scaled up by a multiplicative factor of $3 \log n$. We now apply the compression scheme on these scaled vectors; then for a fixed pair of compressed vectors $u_i'$ and $u_j'$, the probability that they have more than $3\epsilon r \log n$ corrupted positions is at most $\left(\frac{6\psi \log n}{\sqrt{N}}\right)^{3\epsilon r \log n}$. As we set $N = 144\psi^2 \log^2 n$, the above probability is at most
$\left(\frac{6\psi \log n}{\sqrt{144\psi^2 \log^2 n}}\right)^{3\epsilon r \log n} < \left(\frac{1}{2}\right)^{3 \log n} \leq \frac{1}{n^3}$.
The final probability follows by applying the union bound over all $\binom{n}{2}$ pairs.

Remark 6. We would like to emphasize that using the BCS, for any pair of vectors, the Hamming distance between them in the compressed version is always less than or equal to their original Hamming distance. Thus, this compression scheme has only one-sided error for the Hamming case. However, in the case of inner product similarity this compression scheme can possibly have two-sided error, as the inner product in the compressed version can be smaller or higher than the inner product of the original input.
We illustrate this by the following example, where the compression scheme assigns both bit positions of the input to one bit of the compressed data. If $u = (1, 0)$ and $v = (0, 1)$, then $IP(u, v) = 0$; and after compression $u' = 1$ and $v' = 1$, which gives $IP(u', v') = 1$. If $u = (1, 1)$ and $v = (1, 1)$, then $IP(u, v) = 2$, and after compression $u' = 0$ and $v' = 0$, which gives $IP(u', v') = 0$. As a consequence of Lemma 15 and the above remark, we present our compression guarantee for the Hamming distance and Inner product similarity.
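The Binary Compression Scheme and the two examples of Remark 6 can be checked with a short sketch. This is a minimal illustration under stated assumptions: the function names (`make_bucket_map`, `bcs_compress`, and so on) are hypothetical, buckets are numbered from 0 rather than 1, and Python's `random` module stands in for whatever source of randomness an actual implementation would use.

```python
import random

def make_bucket_map(d, n_buckets, seed=0):
    """Assign each of the d bit positions to one of n_buckets buckets uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(n_buckets) for _ in range(d)]

def bcs_compress(u, bucket_map, n_buckets):
    """Bit j of the output is the parity (sum mod 2) of the bits of u mapped to bucket j."""
    out = [0] * n_buckets
    for i, bit in enumerate(u):
        out[bucket_map[i]] ^= bit
    return out

def hamming(u, v):
    """Number of positions where u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def inner_product(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Remark 6: both input positions are assigned to the single output bit.
collide = [0, 0]

u, v = [1, 0], [0, 1]
cu, cv = bcs_compress(u, collide, 1), bcs_compress(v, collide, 1)
# The inner product grows from 0 to 1 after compression.

u2, v2 = [1, 1], [1, 1]
cu2, cv2 = bcs_compress(u2, collide, 1), bcs_compress(v2, collide, 1)
# The inner product shrinks from 2 to 0 after compression.

print(inner_product(u, v), inner_product(cu, cv))
print(inner_product(u2, v2), inner_product(cu2, cv2))
```

Because the parity map is linear over GF(2), the compressed difference of two vectors is the compression of their difference, which is why the Hamming distance can only shrink while the inner product can move in either direction.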

Theorem 1. Consider a set U of binary vectors $\{u_i\}_{i=1}^{n} \subseteq \{0,1\}^d$, a positive integer r, and $\epsilon > 0$. If $\epsilon r > 3 \log n$, we set $N = O(\psi^2)$; if $\epsilon r < 3 \log n$, we set $N = O(\psi^2 \log^2 n)$, and compress them into a set $U'$ of binary vectors $\{u_i'\}_{i=1}^{n} \subseteq \{0,1\}^N$ using BCS. Then for all $u_i, u_j \in U$,
if $d_H(u_i, u_j) < r$, then $\Pr[d_H(u_i', u_j') < r] = 1$,
if $d_H(u_i, u_j) \geq (1+\epsilon) r$, then $\Pr[d_H(u_i', u_j') < r] < 1/n$.

Theorem 2. Consider a set U of binary vectors $\{u_i\}_{i=1}^{n} \subseteq \{0,1\}^d$, a positive integer r, and $\epsilon > 0$. If $\epsilon r > 3 \log n$, we set $N = O(\psi^2)$; if $\epsilon r < 3 \log n$, we set $N = O(\psi^2 \log^2 n)$, and compress them into a set $U'$ of binary vectors $\{u_i'\}_{i=1}^{n} \subseteq \{0,1\}^N$ using BCS. Then for all $u_i, u_j \in U$ the following is true with probability at least $1 - 1/n$:
$(1-\epsilon)\, IP(u_i, u_j) \leq IP(u_i', u_j') \leq (1+\epsilon)\, IP(u_i, u_j)$.

3.1 A tighter analysis for Hamming distance

In this subsection, we strengthen our analysis for the Hamming case, and show a compression bound which is independent of the dimension and the sparsity, and depends only on the Hamming distance between the vectors. However, we could show our result only in expectation, and only for a pair of vectors.

For a pair of vectors $u, v \in \{0,1\}^d$, we say that a bit position is unmatched if exactly one of the vectors has value 1 in that position and the other one has value 0. We say that a bit position in the compressed data is an odd-bit if an odd number of unmatched positions get mapped to that bit. Let u and v get compressed into vectors $u'$ and $v'$ using the BCS. Our observation is that each odd-bit position in the compressed data contributes 1 to the Hamming distance in the compressed data. We illustrate this with an example: let $u[i, j, k] = (1, 0, 1)$, $v[i, j, k] = (0, 1, 0)$ and let i, j, k get mapped to bit position (say) $i'$ in the compressed data; then $u'[i'] = 0$, $v'[i'] = 1$, and clearly $d_H(u'[i'], v'[i']) = 1$.

Theorem 3. Consider two binary vectors $u, v \in \{0,1\}^d$, which get compressed into vectors $u', v' \in \{0,1\}^N$ using BCS. If we set $N = O(r^2)$, then
if $d_H(u, v) < r$, then $\Pr[d_H(u', v') < r] = 1$, and
if $d_H(u, v) \geq 4r$, then $E[d_H(u', v')] > 2r$.

Proof. Let $\psi_u$ denote the number of unmatched bit positions between u and v.
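As mentioned earlier, if an odd number of unmatched bit positions gets mapped to a particular bit of the compressed data, then that bit position contributes to the Hamming distance. We call such a bit position an odd-bit position. To bound the Hamming distance in the compressed data, we need to bound the number of such odd-bit positions. We first calculate the probability that a particular bit position, say the $k$-th position in the compressed data, is odd. We denote this by $\Pr_k^{odd}$ and compute it using the following binomial distribution:

$$\Pr_k^{odd} = \sum_{i \bmod 2 = 1} \binom{\psi_u}{i} \left(\frac{1}{N}\right)^{i} \left(1 - \frac{1}{N}\right)^{\psi_u - i}.$$

Similarly, we compute the probability that the $k$-th bit is even:

$$\Pr_k^{even} = \sum_{i \bmod 2 = 0} \binom{\psi_u}{i} \left(\frac{1}{N}\right)^{i} \left(1 - \frac{1}{N}\right)^{\psi_u - i}.$$

We have

$$\Pr_k^{even} + \Pr_k^{odd} = 1. \qquad (1)$$

Further,

$$\begin{aligned}
\Pr_k^{even} - \Pr_k^{odd} &= \sum_{i \bmod 2 = 0} \binom{\psi_u}{i} \left(\frac{1}{N}\right)^{i} \left(1 - \frac{1}{N}\right)^{\psi_u - i} - \sum_{i \bmod 2 = 1} \binom{\psi_u}{i} \left(\frac{1}{N}\right)^{i} \left(1 - \frac{1}{N}\right)^{\psi_u - i} \\
&= \left(1 - \frac{1}{N} - \frac{1}{N}\right)^{\psi_u} = \left(1 - \frac{2}{N}\right)^{\psi_u}. \qquad (2)
\end{aligned}$$

Thus, we have the following from Equation 1 and Equation 2:

$$\Pr_k^{odd} = \frac{1}{2}\left(1 - \left(1 - \frac{2}{N}\right)^{\psi_u}\right) \ge \frac{1}{2}\left(1 - \exp\left(-\frac{2\psi_u}{N}\right)\right). \qquad (3)$$

The last inequality follows as $1 - x \le e^{-x}$ for $x < 1$. Thus the expected number of odd-bits is at least $\frac{N}{2}\left(1 - \exp\left(-\frac{2\psi_u}{N}\right)\right)$. We now split into two cases, $\psi_u < 20r$ and $\psi_u \ge 20r$, and address them one by one.

Case 1: $\psi_u < 20r$. We complete this case using Lemma 4. It is easy to verify that in the case of Hamming distance the analysis of Lemma 4 also holds if we consider "unmatched" bits instead of "active" bits. Thus, the probability that at least $r$ corrupted bit positions are shared between $u$ and $v$ is at most $\left(\frac{2\psi_u}{\sqrt{N}}\right)^{r}$. We wish to set the value of $N$ such that $u$ and $v$ share more than $r$ corrupted positions with probability at most $1/3$. If we set $N = 4\psi_u^2\, 3^{2/r}$, then the above probability is at most

$$\left(\frac{2\psi_u}{\sqrt{4\psi_u^2\, 3^{2/r}}}\right)^{r} = \frac{1}{3}.$$

Thus, when $N = 4\psi_u^2\, 3^{2/r} = O(\psi_u^2) = O(r^2)$ (as $\psi_u < 20r$ and $3^{2/r} \le 9$ for every $r \ge 1$), with probability at least $2/3$, at most $r$ corrupted bits are shared between $u$ and $v$. As a consequence, since $d_H(u, v) \ge 4r$, we have $E[d_H(u', v')] > \frac{2}{3} \cdot 3r = 2r$.

Case 2: $\psi_u \ge 20r$. We continue here from Equation 3:

$$\begin{aligned}
\text{Expected number of odd buckets} &\ge \frac{N}{2}\left(1 - \exp\left(-\frac{2\psi_u}{N}\right)\right) \ge \frac{N}{2}\left(1 - \exp\left(-\frac{40r}{N}\right)\right) \\
&= 4r^2\left(1 - \exp\left(-\frac{5}{r}\right)\right) \qquad (4) \\
&> 4r^2 \cdot \frac{1}{2r} = 2r. \qquad (5)
\end{aligned}$$

Equality 4 follows by setting $N = 8r^2$, and Inequality 5 holds as $1 - \exp\left(-\frac{5}{r}\right) > \frac{1}{2r}$ for $r \ge 1$. Finally, Case 1 and Case 2 complete the proof of the theorem.

4. A COMPRESSION SCHEME FOR HIGH DIMENSIONAL SPARSE REAL DATA

We first define our compression scheme for real-valued data.

Definition 6. (Real-valued Compression Scheme) Let $N$ be the number of buckets. For $i = 1$ to $d$, we randomly assign the $i$-th position to the bucket number $b(i) \in \{1, \ldots, N\}$. Then, for $j = 1$ to $N$, the $j$-th coordinate of the compressed vector $\alpha$ is computed as follows:

$$\alpha_j = \sum_{i : b(i) = j} a_i x_i,$$

where each $x_i$ is a random variable that takes a value in $\{-1, +1\}$, each with probability $1/2$.

Note 7. For brevity we denote our Real-valued Compression Scheme as RCS.

We first present our compression guarantee for preserving the inner product for a pair of real-valued vectors.

Lemma 8. Consider two vectors $a, b \in \mathbb{R}^d$, which get compressed into vectors $\alpha, \beta \in \mathbb{R}^N$ using the RCS. If we set $N = \frac{10\Psi^2}{\epsilon^2}$, where $\Psi = \max\{\|a\|^2, \|b\|^2\}$ and $\epsilon > 0$, then the following holds:

$$\Pr\left[\,|\langle\alpha, \beta\rangle - \langle a, b\rangle| > \epsilon\,\right] < 1/10.$$

Proof. Let $a, b \in \mathbb{R}^d$ be two vectors such that $a = [a_1, a_2, \ldots, a_d]$ and $b = [b_1, b_2, \ldots, b_d]$. Let $\{x_i\}_{i=1}^{d}$ be a set of $d$ random variables such that each $x_i$ takes a value in $\{-1, +1\}$ with probability $1/2$, and let $z_i^{(k)}$ be a random variable that takes the value 1 if the $i$-th dimension of the vector is mapped to the $k$-th bucket of the compressed vector and 0 otherwise. Using the compression scheme RCS, let the vectors $a, b$ get compressed into vectors $\alpha$ and $\beta$, where $\alpha = [\alpha_1, \ldots, \alpha_k, \ldots, \alpha_N]$ such that $\alpha_k = \sum_{i=1}^{d} a_i x_i z_i^{(k)}$, and $\beta = [\beta_1, \ldots, \beta_k, \ldots, \beta_N]$ such that $\beta_k = \sum_{i=1}^{d} b_i x_i z_i^{(k)}$. We now compute the inner product of the compressed vectors $\langle\alpha, \beta\rangle$:

$$\begin{aligned}
\langle\alpha, \beta\rangle &= \sum_{k=1}^{N} \alpha_k \beta_k = \sum_{k=1}^{N} \Big(\sum_{i=1}^{d} a_i x_i z_i^{(k)}\Big)\Big(\sum_{i=1}^{d} b_i x_i z_i^{(k)}\Big) \\
&= \sum_{k=1}^{N} \Big(\sum_{i=1}^{d} a_i b_i x_i^2 \big(z_i^{(k)}\big)^2 + \sum_{i \ne j} a_i b_j x_i x_j z_i^{(k)} z_j^{(k)}\Big) && (6) \\
&= \sum_{k=1}^{N} \Big(\sum_{i=1}^{d} a_i b_i z_i^{(k)} + \sum_{i \ne j} a_i b_j x_i x_j z_i^{(k)} z_j^{(k)}\Big) && (7) \\
&= \sum_{i=1}^{d} a_i b_i + \sum_{k=1}^{N} \sum_{i \ne j} a_i b_j x_i x_j z_i^{(k)} z_j^{(k)} \\
&= \langle a, b\rangle + \sum_{k=1}^{N} \sum_{i \ne j} a_i b_j x_i x_j z_i^{(k)} z_j^{(k)}. && (8)
\end{aligned}$$

Equation 7 follows from Equation 6 because $x_i^2 = 1$ as $x_i = \pm 1$, and $\big(z_i^{(k)}\big)^2 = z_i^{(k)}$ as $z_i^{(k)}$ takes value either 1 or 0. The line after Equation 7 uses $\sum_{k=1}^{N} z_i^{(k)} = 1$, since each dimension is mapped to exactly one bucket.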
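The decomposition in Equation 8 can be checked numerically. The following is a minimal sketch, assuming that bucket assignments are uniform and independent and that both vectors are compressed with the same assignments and signs. The function names (`rcs_compress`, `cross_terms`, `demo`) are hypothetical, buckets are indexed from 0 rather than 1, and integer entries are used so the comparison is exact.

```python
import random


def rcs_compress(vec, buckets, signs, n_buckets):
    """RCS sketch: coordinate j is the signed sum of the entries in bucket j."""
    compressed = [0] * n_buckets
    for value, bucket, sign in zip(vec, buckets, signs):
        compressed[bucket] += value * sign
    return compressed


def dot(p, q):
    return sum(x * y for x, y in zip(p, q))


def cross_terms(a, b, buckets, signs):
    """Sum of a_i * b_j * x_i * x_j over pairs i != j sharing a bucket."""
    d = len(a)
    return sum(
        a[i] * b[j] * signs[i] * signs[j]
        for i in range(d)
        for j in range(d)
        if i != j and buckets[i] == buckets[j]
    )


def demo(seed=0, d=40, n_buckets=8):
    rng = random.Random(seed)
    # Sparse integer vectors (illustrative data only).
    a = [rng.randint(-3, 3) if rng.random() < 0.25 else 0 for _ in range(d)]
    b = [rng.randint(-3, 3) if rng.random() < 0.25 else 0 for _ in range(d)]
    buckets = [rng.randrange(n_buckets) for _ in range(d)]
    signs = [rng.choice((-1, 1)) for _ in range(d)]
    alpha = rcs_compress(a, buckets, signs, n_buckets)
    beta = rcs_compress(b, buckets, signs, n_buckets)
    return dot(alpha, beta), dot(a, b) + cross_terms(a, b, buckets, signs)


lhs, rhs = demo()
print(lhs == rhs)  # True
```

The equality holds for every choice of buckets and signs, not only on average, because Equation 8 is an algebraic identity. Randomness only enters when the cross terms are averaged.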
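We continue from Equation 8 and compute the expectation and the variance of the random variable $\langle\alpha, \beta\rangle$. We first compute the expectation of the random variable $\langle\alpha, \beta\rangle$ as follows:

$$\begin{aligned}
E[\langle\alpha, \beta\rangle] &= E\Big[\langle a, b\rangle + \sum_{k=1}^{N} \sum_{i \ne j} a_i b_j x_i x_j z_i^{(k)} z_j^{(k)}\Big] \\
&= \langle a, b\rangle + E\Big[\sum_{k=1}^{N} \sum_{i \ne j} a_i b_j x_i x_j z_i^{(k)} z_j^{(k)}\Big] \\
&= \langle a, b\rangle + \sum_{k=1}^{N} \sum_{i \ne j} E\big[a_i b_j x_i x_j z_i^{(k)} z_j^{(k)}\big] \\
&= \langle a, b\rangle + \sum_{k=1}^{N} \sum_{i \ne j} a_i b_j\, E\big[x_i x_j z_i^{(k)} z_j^{(k)}\big] && (9) \\
&= \langle a, b\rangle. && (10)
\end{aligned}$$

Equation 9 holds due to the linearity of expectation. Equation 10 holds because $E\big[x_i x_j z_i^{(k)} z_j^{(k)}\big] = 0$, as both $x_i$ and $x_j$ take a value in $\{-1, +1\}$ each with probability 0.5, which leads to $E[x_i x_j] = 0$.

We now compute the variance of the random variable $\langle\alpha, \beta\rangle$ as follows:

$$\begin{aligned}
\mathrm{Var}(\langle\alpha, \beta\rangle) &= \mathrm{Var}\Big(\langle a, b\rangle + \sum_{i \ne j} \sum_{k=1}^{N} a_i b_j x_i x_j z_i^{(k)} z_j^{(k)}\Big) \\
&= \mathrm{Var}\Big(\sum_{i \ne j} \sum_{k=1}^{N} a_i b_j x_i x_j z_i^{(k)} z_j^{(k)}\Big) && (11) \\
&= \mathrm{Var}\Big(\sum_{i \ne j} \sum_{k=1}^{N} \xi_{ij}^{(k)}\Big) && (12) \\
&= \sum_{i \ne j} \mathrm{Var}\Big(\sum_{k=1}^{N} \xi_{ij}^{(k)}\Big) + \sum_{(i,j) \ne (p,q)} \mathrm{Cov}\Big(\sum_{k=1}^{N} \xi_{ij}^{(k)}, \sum_{k=1}^{N} \xi_{pq}^{(k)}\Big). && (13)
\end{aligned}$$

Equation 11 holds due to Fact 3; Equation 12 holds as we denote the expression $a_i b_j x_i x_j z_i^{(k)} z_j^{(k)}$ by the variable $\xi_{ij}^{(k)}$; Equation 13 holds due to Fact 4. We now bound the values of the two terms of Equation 13. For the first term,

$$\mathrm{Var}\Big(\sum_{k=1}^{N} \xi_{ij}^{(k)}\Big) = \sum_{k=1}^{N} \mathrm{Var}\big(\xi_{ij}^{(k)}\big) + \sum_{k \ne l} \mathrm{Cov}\big(\xi_{ij}^{(k)}, \xi_{ij}^{(l)}\big). \qquad (14)$$

Equation 14 holds due to Fact 4. We bound the values of the two terms of Equation 14 one by one as follows.

$$\begin{aligned}
\sum_{i \ne j} \sum_{k=1}^{N} \mathrm{Var}\big(\xi_{ij}^{(k)}\big) &= \sum_{i \ne j} \sum_{k=1}^{N} \mathrm{Var}\big(a_i b_j x_i x_j z_i^{(k)} z_j^{(k)}\big) \\
&= \sum_{i \ne j} a_i^2 b_j^2 \sum_{k=1}^{N} \mathrm{Var}\big(x_i x_j z_i^{(k)} z_j^{(k)}\big) && (15) \\
&= \sum_{i \ne j} a_i^2 b_j^2 \sum_{k=1}^{N} \Big(E\big[\big(x_i x_j z_i^{(k)} z_j^{(k)}\big)^2\big] - E\big[x_i x_j z_i^{(k)} z_j^{(k)}\big]^2\Big) && (16) \\
&= \sum_{i \ne j} a_i^2 b_j^2 \sum_{k=1}^{N} E\big[z_i^{(k)} z_j^{(k)}\big] && (17) \\
&= \sum_{i \ne j} a_i^2 b_j^2 / N \le \|a\|^2 \|b\|^2 / N. && (18)
\end{aligned}$$

Equation 15 holds due to Fact 3; Equation 16 holds by the definition of variance; Equation 17 holds as $x_i^2 = x_j^2 = 1$, $\big(z_i^{(k)}\big)^2 = z_i^{(k)}$, and $E\big[x_i x_j z_i^{(k)} z_j^{(k)}\big] = 0$; finally, Equation 18 holds as $E\big[z_i^{(k)} z_j^{(k)}\big] = 1/N^2$ for $i \ne j$ and $\sum_{i \ne j} a_i^2 b_j^2 \le \sum_{i} a_i^2 \sum_{j} b_j^2 = \|a\|^2 \|b\|^2$.

We now bound the second term of Equation 14.

$$\begin{aligned}
\mathrm{Cov}\big(\xi_{ij}^{(k)}, \xi_{ij}^{(l)}\big) &= \mathrm{Cov}\big(a_i b_j x_i x_j z_i^{(k)} z_j^{(k)},\; a_i b_j x_i x_j z_i^{(l)} z_j^{(l)}\big) \\
&= a_i^2 b_j^2\, \mathrm{Cov}\big(x_i x_j z_i^{(k)} z_j^{(k)},\; x_i x_j z_i^{(l)} z_j^{(l)}\big) && (19) \\
&= a_i^2 b_j^2 \Big(E\big[x_i x_j z_i^{(k)} z_j^{(k)} \cdot x_i x_j z_i^{(l)} z_j^{(l)}\big] - E\big[x_i x_j z_i^{(k)} z_j^{(k)}\big]\, E\big[x_i x_j z_i^{(l)} z_j^{(l)}\big]\Big) && (20) \\
&= a_i^2 b_j^2\, E\big[x_i^2 x_j^2 z_i^{(k)} z_j^{(k)} z_i^{(l)} z_j^{(l)}\big] && (21) \\
&= a_i^2 b_j^2\, E\big[z_i^{(k)} z_j^{(k)} z_i^{(l)} z_j^{(l)}\big] = 0. && (22)
\end{aligned}$$

Equation 19 holds due to Fact 5; Equation 20 holds by the definition of covariance; Equation 21 holds as $E\big[x_i x_j z_i^{(k)} z_j^{(k)}\big] = 0$; finally, Equation 22 holds as in our compression scheme each dimension of the input is mapped to a unique coordinate (bucket) in the compressed vector, which implies that at least one of the random variables $z_i^{(k)}$ and $z_i^{(l)}$ has to be zero.

We now bound the second term of Equation 13.

$$\begin{aligned}
\mathrm{Cov}\Big(\sum_{k} \xi_{ij}^{(k)}, \sum_{k} \xi_{pq}^{(k)}\Big) &= E\Big[\sum_{k} \xi_{ij}^{(k)} \sum_{k} \xi_{pq}^{(k)}\Big] - E\Big[\sum_{k} \xi_{ij}^{(k)}\Big]\, E\Big[\sum_{k} \xi_{pq}^{(k)}\Big] \\
&= E\Big[\sum_{k} \xi_{ij}^{(k)} \sum_{k} \xi_{pq}^{(k)}\Big] && (23) \\
&= E\Big[\sum_{k} a_i b_j x_i x_j z_i^{(k)} z_j^{(k)} \sum_{k} a_p b_q x_p x_q z_p^{(k)} z_q^{(k)}\Big] \\
&= a_i b_j a_p b_q\, E\Big[x_i x_j x_p x_q \sum_{k} z_i^{(k)} z_j^{(k)} \sum_{k} z_p^{(k)} z_q^{(k)}\Big] = 0. && (24)
\end{aligned}$$

Equation 23 holds as $E\big[\sum_{k} \xi_{ij}^{(k)}\big]$ is equal to zero, because $E\big[\sum_{k} \xi_{ij}^{(k)}\big] = \sum_{k} E\big[\xi_{ij}^{(k)}\big]$ and $E\big[\xi_{ij}^{(k)}\big] = E\big[a_i b_j x_i x_j z_i^{(k)} z_j^{(k)}\big] = 0$. A similar argument holds for the other term as well. Equation 24 holds as $E[x_i x_j x_p x_q]$ is equal to zero, because each variable in the expectation term takes a value between $+1$ and $-1$ with probability 0.5.

Thus, we have $E[\langle\alpha, \beta\rangle] = \langle a, b\rangle$, and Equation 13 in conjunction with Equations 14, 18, 22, 24 gives

$$\mathrm{Var}(\langle\alpha, \beta\rangle) \le \|a\|^2 \|b\|^2 / N \le \Psi^2 / N,$$

where $\Psi = \max\{\|a\|^2, \|b\|^2\}$.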
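The expectation claim $E[\langle\alpha, \beta\rangle] = \langle a, b\rangle$ can be confirmed by brute force on a tiny instance. The sketch below is illustrative only: it assumes bucket assignments and signs are uniform and independent, enumerates every combination for $d = 4$ and $N = 3$, and averages exactly with `fractions.Fraction`. The function names and the sample vectors are hypothetical, and the sketch does not attempt to verify the variance bound.

```python
from fractions import Fraction
from itertools import product


def compressed_inner_product(a, b, buckets, signs, n_buckets):
    """Inner product of the RCS images of a and b for one fixed
    choice of bucket assignment and signs."""
    alpha = [0] * n_buckets
    beta = [0] * n_buckets
    for a_i, b_i, bucket, sign in zip(a, b, buckets, signs):
        alpha[bucket] += a_i * sign
        beta[bucket] += b_i * sign
    return sum(x * y for x, y in zip(alpha, beta))


def exact_mean(a, b, n_buckets):
    """Average of the compressed inner product over every bucket
    assignment and every sign pattern, all equally likely."""
    d = len(a)
    total = 0
    count = 0
    for buckets in product(range(n_buckets), repeat=d):
        for signs in product((-1, 1), repeat=d):
            total += compressed_inner_product(a, b, buckets, signs, n_buckets)
            count += 1
    return Fraction(total, count)


a = [2, 0, -1, 3]
b = [1, 4, 0, -2]
true_ip = sum(x * y for x, y in zip(a, b))
print(exact_mean(a, b, 3), true_ip)  # -4 -4
```

Exhaustive enumeration is only feasible for very small $d$ and $N$ (here $3^4 \cdot 2^4 = 1296$ cases), but it removes sampling noise from the check.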

Thus, by Chebyshev's inequality (see Fact 6), we have

$$\Pr\left[\,|\langle\alpha, \beta\rangle - \langle a, b\rangle| > \epsilon\,\right] < \frac{\Psi^2}{\epsilon^2 N} = 1/10.$$

The last equality follows as we set $N = \frac{10\Psi^2}{\epsilon^2}$.

Using a similar analysis we can generalize our result to the $k$-way inner product. We state our result as follows:

Theorem 4. Consider a set of $k$ vectors $\{a_i\}_{i=1}^{k} \in \mathbb{R}^d$, which get compressed into vectors $\{\alpha_i\}_{i=1}^{k} \in \mathbb{R}^N$ using the RCS. If we set $N = \frac{10\Psi^k}{\epsilon^2}$, where $\Psi = \max\{\|a_i\|^2\}_{i=1}^{k}$ and $\epsilon > 0$, then the following holds:

$$\Pr\left[\,|\langle\alpha_1 \alpha_2 \ldots \alpha_k\rangle - \langle a_1 a_2 \ldots a_k\rangle| > \epsilon\,\right] < 1/10.$$

We can also generalize the result of Lemma 8 to Euclidean distance. Consider a pair of vectors $a, b \in \mathbb{R}^d$ which get compressed into vectors $\alpha, \beta \in \mathbb{R}^N$ using the compression scheme RCS. Let $\|\alpha, \beta\|^2$ denote the squared Euclidean distance between the vectors $\alpha, \beta$. Using an analysis similar to that of Lemma 8, we can compute the expectation and variance of the random variable $\|\alpha, \beta\|^2$:

$$E\left[\|\alpha, \beta\|^2\right] = \|a, b\|^2, \quad \text{and} \quad \mathrm{Var}\left(\|\alpha, \beta\|^2\right) \le \frac{\|a\|^2 \|b\|^2}{N} \le \frac{\Psi^2}{N},$$

where $\Psi = \max\{\|a\|^2, \|b\|^2\}$. Thus, due to Chebyshev's inequality (see Fact 6), we have the following result for Euclidean distance.

Theorem 5. Consider two vectors $a, b \in \mathbb{R}^d$, which get compressed into vectors $\alpha, \beta \in \mathbb{R}^N$ using the RCS. If we set $N = \frac{10\Psi^2}{\epsilon^2}$, where $\Psi = \max\{\|a\|^2, \|b\|^2\}$ and $\epsilon > 0$, then the following holds:

$$\Pr\left[\,\left|\|\alpha, \beta\|^2 - \|a, b\|^2\right| > \epsilon\,\right] < 1/10.$$

Remark 7. In order to compress a pair of data points our scheme requires $O(d \log N)$ randomness, which grows logarithmically in the compression length, whereas the other schemes require randomness which grows linearly in the compression length. Thus, when the number of points is a small constant, then for preserving a pairwise inner product or Euclidean distance we have a clear advantage in the amount of randomness required for the compression. We also believe that using a more sophisticated concentration result, such as a martingale inequality, it is possible to obtain a tighter concentration guarantee and, as a consequence, a smaller compression length.

5. CONCLUSION AND OPEN QUESTIONS

In this work, to the best of our knowledge, we obtain the first efficient binary to binary compression scheme for preserving Hamming distance and Inner Product for high dimensional sparse data.
For Hamming distance, in fact, our scheme obtains the "no-false-negative" guarantee analogous to the one obtained in the recent paper by Pagh [12]. Contrary to the "local projection" approach of previous schemes, we first randomly partition the dimensions, and then take a "global summary" within each partition. The compression length of our scheme depends only on the sparsity and is independent of the dimension, as opposed to previously known schemes. We also obtain a generalization of our result to the real-valued setting. Our work leaves open several questions: improving the bounds of our compression scheme, and extending it to other similarity measures such as Cosine and Jaccard similarity, are the major open questions of our work.

6. REFERENCES

[1] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671-687, 2003.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of 20th International Conference on Very Large Data Bases, September 12-15, 1994, Santiago de Chile, Chile, 1994.
[3] D. Bera and R. Pratap. Frequent-itemset mining using locality-sensitive hashing. In Computing and Combinatorics - 22nd International Conference, COCOON 2016, Ho Chi Minh City, Vietnam, August 2-4, 2016, Proceedings, pages 143-155, 2016.
[4] A. Z. Broder. Identifying and filtering near-duplicate documents. In Combinatorial Pattern Matching, 11th Annual Symposium, CPM 2000, Montreal, Canada, June 21-23, 2000, Proceedings, pages 1-10, 2000.
[5] A. Z. Broder. Min-wise independent permutations: Theory and practice. In Automata, Languages and Programming, 27th International Colloquium, ICALP 2000, Geneva, Switzerland, July 9-15, 2000, Proceedings, page 808, 2000.
[6] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations (extended abstract). In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, May 23-26, 1998, 1998.
[7] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, pages 518-529, 1999.
[8] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. Inf. Syst., 25(5), 2000.
[9] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, May 23-26, 1998, 1998.
[10] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Conference in modern analysis and probability (New Haven, Conn., 1982), Amer. Math. Soc., Providence, R.I., pages 189-206, 1983.
[11] A. Kaban. Improved bounds on the dot product under random projection and random sign projection. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, 2015.
[12] R. Pagh. Locality-sensitive hashing without false negatives. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pages 1-9, 2016.
[13] A. Shrivastava and P. Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, Quebec, Canada, pages 2321-2329, 2014.
[14] A. Shrivastava and P. Li. Asymmetric minwise hashing for indexing binary inner products and set containment. In Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015, pages 981-991, 2015.
[15] R. Spring and A. Shrivastava. Scalable and sustainable deep learning via randomized hashing. CoRR, 2016.
