Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment


1 Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment. Anshumali Shrivastava and Ping Li, Cornell University and Rutgers University. WWW 2015, Florence, Italy, May 21st 2015. (Will join Rice Univ. as TT Asst. Prof., Fall 2015.)

2 What are we solving? Minwise hashing is widely popular for search and retrieval. Major complaint: document length is unnecessarily penalized. We precisely fix this and provide a practical solution. Other consequence: an algorithmic improvement for binary maximum inner product search (MIPS).

3 Outline: Motivation; Asymmetric LSH for General Inner Products; Asymmetric Minwise Hashing; Faster Sampling; Experimental Results.

4 Shingle-Based Representation. The shingle-based representation (bag-of-words) is widely adopted: a document is represented as a set of tokens over a vocabulary Ω. Example sentence: Five Kitchen Berkley. Shingle representation (uni-grams): {Five, Kitchen, Berkley}. Shingle representation (bi-grams): {Five Kitchen, Kitchen Berkley}.

5 Shingle-Based Representation. The shingle-based representation (bag-of-words) is widely adopted: a document is represented as a set of tokens over a vocabulary Ω. Example sentence: Five Kitchen Berkley. Shingle representation (uni-grams): {Five, Kitchen, Berkley}. Shingle representation (bi-grams): {Five Kitchen, Kitchen Berkley}. Sparse binary high-dimensional data is everywhere: sets can be represented as binary vectors indicating presence/absence, the vocabulary is typically huge in practice, and modern big-data systems use only the binary data matrix.

6 Resemblance (Jaccard) Similarity. The popular resemblance (Jaccard) similarity between two sets (or binary vectors) X, Y ⊆ Ω is defined as
R = |X ∩ Y| / |X ∪ Y| = a / (f_x + f_y − a),
where a = |X ∩ Y|, f_x = |X|, f_y = |Y|, and |·| denotes cardinality. For the binary (0/1) vector representation of a set (locations of nonzeros),
a = |X ∩ Y| = x^T y,  f_x = nonzeros(x),  f_y = nonzeros(y),
where x and y are the binary vector equivalents of the sets X and Y respectively.
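
A minimal sketch of these quantities in Python, with documents as sets of tokens (the example sets are illustrative):

def resemblance(X, Y):
    # R = |X ∩ Y| / |X ∪ Y| = a / (f_x + f_y - a)
    a = len(X & Y)
    f_x, f_y = len(X), len(Y)
    return a / (f_x + f_y - a)

print(resemblance({"five", "guys"}, {"five", "kitchen", "berkley"}))  # a=1, f_x=2, f_y=3 -> 1/4 = 0.25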

7 Minwise Hashing (Broder 97). The standard practice in the search industry: given a random permutation π (or a random hash function) over Ω, i.e., π : Ω → Ω with Ω = {0, 1, ..., D − 1}, the MinHash is given by
h_π(x) = min(π(x)).
An elementary probability argument shows that
Pr(min(π(X)) = min(π(Y))) = |X ∩ Y| / |X ∪ Y| = R.
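
A small illustration of this collision probability, assuming explicit random permutations over a vocabulary of size D (the slow one-pass-per-hash variant; faster schemes appear later in the talk):

import random

def minhash(X, perm):
    # h_pi(X) = min over x in X of pi(x)
    return min(perm[x] for x in X)

D = 10
random.seed(0)
perms = [random.sample(range(D), D) for _ in range(500)]  # 500 independent permutations

X, Y = {0, 1, 2, 3}, {2, 3, 4, 5}
est = sum(minhash(X, p) == minhash(Y, p) for p in perms) / len(perms)
print(est)  # close to R = |X ∩ Y| / |X ∪ Y| = 2/6 ≈ 0.33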

8 Traditional Minwise Hashing Computation [2]. 1. Uniformly sample a permutation π over the attributes. 2. Shuffle the vectors under π. 3. The hash value is the smallest index that is not zero. (The slide shows three example binary vectors S1, S2, S3.)

9 Traditional Minwise Hashing Computation [2]. (Same steps as above; the slide adds the shuffled vectors π(S1), π(S2), π(S3).)

10 Traditional Minwise Hashing Computation [2]. (Same steps as above; the slide adds the resulting hash values h_π(S1), h_π(S2), h_π(S3).)

11 Traditional Minwise Hashing Computation [2]. For any two binary vectors S1, S2 we always have Pr(h_π(S1) = h_π(S2)) = |S1 ∩ S2| / |S1 ∪ S2| = R (the Jaccard similarity).

[2] This is very inefficient; we recently found faster ways (ICML 2014 and UAI 2014).

12 Locality Sensitive Hashing (LSH) and Sub-linear Search. Locality sensitive: a family of (randomized) hash functions h such that
Pr_h[h(x) = h(y)] = f(sim(x, y)),
where f is monotonically increasing [3]. MinHash is an LSH for resemblance (Jaccard similarity).
[3] A stronger sufficient condition than the classical one.

13 Locality Sensitive Hashing (LSH) and Sub-linear Search. Locality sensitive: a family of (randomized) hash functions h such that
Pr_h[h(x) = h(y)] = f(sim(x, y)),
where f is monotonically increasing [3]. MinHash is an LSH for resemblance (Jaccard similarity). Well known: the existence of an LSH for a similarity implies fast search algorithms with query time O(n^ρ), ρ < 1 (Indyk & Motwani 98). The quantity ρ depends on the property f; smaller is better. (A sketch of the standard bucketing index built from such a family follows.)
[3] A stronger sufficient condition than the classical one.
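
A rough sketch of how an LSH family is turned into a sub-linear search index: the classic construction concatenates K hashes per bucket key and repeats over L tables (the parameters and data below are illustrative, using MinHash as the family):

import random
from collections import defaultdict

def minhash(X, perm):
    return min(perm[x] for x in X)

def build_index(data, perms, K, L):
    # L tables; the bucket key in table t is the tuple of K concatenated MinHashes
    tables = [defaultdict(list) for _ in range(L)]
    for idx, X in enumerate(data):
        for t in range(L):
            key = tuple(minhash(X, perms[t * K + k]) for k in range(K))
            tables[t][key].append(idx)
    return tables

def query(q, tables, perms, K):
    # union of the L probed buckets; candidates are then verified exactly
    cand = set()
    for t, table in enumerate(tables):
        key = tuple(minhash(q, perms[t * K + k]) for k in range(K))
        cand.update(table.get(key, []))
    return cand

D, K, L = 20, 2, 5
random.seed(1)
perms = [random.sample(range(D), D) for _ in range(K * L)]
data = [{0, 1, 2, 3}, {2, 3, 4, 5}, {10, 11, 12}]
print(query({0, 1, 2, 4}, build_index(data, perms, K, L), perms, K))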

14 Known Complaints with Resemblance.
R = |X ∩ Y| / |X ∪ Y| = a / (f_x + f_y − a).
Consider the text descriptions of two restaurants:
1. Five Guys Burgers and Fries, Downtown Brooklyn, New York → {five, guys, burgers, and, fries, downtown, brooklyn, new, york}
2. Five Kitchen Berkley → {five, kitchen, berkley}
Search query (Q): Five Guys → {five, guys}

15 Known Complaints with Resemblance.
R = |X ∩ Y| / |X ∪ Y| = a / (f_x + f_y − a).
Consider the text descriptions of two restaurants:
1. Five Guys Burgers and Fries, Downtown Brooklyn, New York → {five, guys, burgers, and, fries, downtown, brooklyn, new, york}
2. Five Kitchen Berkley → {five, kitchen, berkley}
Search query (Q): Five Guys → {five, guys}
Resemblance with the two descriptions:
1. |X ∩ Q| = 2, |X ∪ Q| = 9, R = 2/9 ≈ 0.22
2. |X ∩ Q| = 1, |X ∪ Q| = 4, R = 1/4 = 0.25
Resemblance penalizes the size of the document.

16 Alternatives: Set Containment and Inner Product. For many applications (e.g., record matching, plagiarism detection, etc.) Jaccard containment is more suited than resemblance. The Jaccard containment of X w.r.t. Q is
J_C = |X ∩ Q| / |Q| = a / f_q.    (1)
Some observations: 1. It does not penalize the size of the text. 2. Its ordering is the same as the ordering of the inner products a (or overlap). 3. It gives the desirable ordering in the previous example.
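
A quick check of the two orderings on the restaurant example above (a sketch; the token sets are copied from the slides):

def resemblance(X, Q):
    a = len(X & Q)
    return a / (len(X) + len(Q) - a)

def containment(X, Q):
    # J_C = |X ∩ Q| / |Q| = a / f_q ; same ordering as the inner product a
    return len(X & Q) / len(Q)

X1 = {"five", "guys", "burgers", "and", "fries", "downtown", "brooklyn", "new", "york"}
X2 = {"five", "kitchen", "berkley"}
Q = {"five", "guys"}
print(resemblance(X1, Q), resemblance(X2, Q))   # 2/9 ≈ 0.22 vs 1/4 = 0.25: wrong order
print(containment(X1, Q), containment(X2, Q))   # 1.0 vs 0.5: desired order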

17 LSH Framework Not Sufficient for Inner Products. Locality sensitive requirement:
Pr_h[h(x) = h(y)] = f(x^T y),
where f is monotonically increasing.
Theorem (Shrivastava and Li, NIPS 2014): impossible for dot products. For inner products we can have x and y such that x^T y > x^T x, so self-similarity is not the highest similarity. Under any hash function Pr(h(x) = h(x)) = 1, but we would need Pr(h(x) = h(y)) > Pr(h(x) = h(x)) = 1.

18 LSH Framework Not Sufficient for Inner Products. (Same requirement and theorem as above.) For binary inner products it is still impossible: x^T y ≤ x^T x always holds, but we would instead need x, y, z such that x^T y > z^T z.

19 LSH Framework Not Sufficient for Inner Products. (Same as above.) Hopeless to find a locality sensitive hashing!

20 Asymmetric LSH (ALSH) for General Inner Products. Shrivastava and Li (NIPS 2014): despite the absence of an LSH, maximum inner product search (MIPS) is still efficient via an extended framework.

21 Asymmetric LSH (ALSH) for General Inner Products. Shrivastava and Li (NIPS 2014): despite the absence of an LSH, maximum inner product search (MIPS) is still efficient via an extended framework. The asymmetric LSH framework, the idea: 1. Construct two transformations P(·) and Q(·) (with P ≠ Q) along with a randomized hash function h. 2. P(·), Q(·) and h satisfy
Pr_h[h(P(x)) = h(Q(q))] = f(x^T q),  with f monotonic.

22 Small Things That Made a BIG Difference. Shrivastava and Li NIPS 2014 construction (L2-ALSH): 1. P(x): scale the data to shrink all norms below 0.83; append ||x||², ||x||⁴, and ||x||⁸ to the vector x (just 3 scalars). 2. Q(q): normalize; append three 0.5's to the vector q. 3. h: use the standard LSH family for L2 distance. Caution: scaling is asymmetry in a strict sense; it changes the distribution (e.g., the variance) of the hashes. First practical and provable algorithm for general MIPS. [Figure: precision-recall curves for top-ranked retrieval on Movielens and NetFlix (K = 256 hashes), comparing against L2LSH and SRP.]
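
A sketch of the two L2-ALSH transformations just described, assuming the scaling constant 0.83, m = 3 appended scalars, and a standard L2 LSH function; the constants and names are illustrative, not the authors' exact implementation:

import numpy as np

def P(x, max_norm, U=0.83, m=3):
    # preprocessing transform: scale so every norm stays below U, then append ||x||^2, ||x||^4, ||x||^8
    x = (U / max_norm) * np.asarray(x, dtype=float)
    sq = float(np.dot(x, x))                       # ||x||^2 after scaling
    return np.concatenate([x, [sq, sq ** 2, sq ** 4][:m]])

def Q(q, m=3):
    # query transform: normalize to unit norm, then append m copies of 1/2
    q = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.concatenate([q, 0.5 * np.ones(m)])

def l2_hash(v, w, b, r=2.5):
    # standard LSH for L2 distance: floor((w . v + b) / r)
    return int(np.floor((np.dot(w, v) + b) / r))

rng = np.random.default_rng(0)
x, q = rng.random(5), rng.random(5)
w, b = rng.standard_normal(5 + 3), rng.uniform(0, 2.5)
print(l2_hash(P(x, max_norm=np.linalg.norm(x)), w, b), l2_hash(Q(q), w, b))

Minimizing ||P(x) − Q(q)||₂ then (approximately) maximizes x^T q, so collisions under h become monotonic in the inner product.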

23 A Generic Recipe: Even Better ALSH for MIPS. The recipe: start with a similarity S(q, x) for which we have an LSH (or ALSH); design P(·) and Q(·) such that S(Q(q), P(x)) is monotonic in q^T x; use extra dimensions. Improved ALSH (Sign-ALSH) construction for general MIPS: S(q, x) = q^T x / (||q||₂ ||x||₂) and SimHash [4] (Shrivastava and Li UAI 2015). [Table: comparison of Sign-ALSH, L2-ALSH, and Cone Trees on MNIST, WEBSPAM, and RCV1.]
[4] Expected after Shrivastava and Li ICML 2014, Coding for Random Projections.

24 Binary MIPS: A Sampling-Based ALSH. Idea: sample an index i uniformly; if x_i = 1 and q_i = 1, make the hashes collide, else not. With i drawn uniformly, the query-side hash H_S(Q(q)) is just q_i (0 or 1), while the preprocessing-side hash H_S(P(x)) is 1 if x_i = 1 and 2 if x_i = 0, so a collision happens exactly when x_i = q_i = 1. Then
Pr(H_S(P(x)) = H_S(Q(q))) = a/D,
which is monotonic in the inner product a.
Problems: 1. The hash is informative only if x_i = 1; otherwise it merely indicates query or not. 2. For sparse data, with D ≫ f ≥ a, a/D ≈ 0 and almost all hashes are uninformative.
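
A small simulation of this sampling hash (the encoding of the two sides follows the description above; the data values are illustrative):

import random

def sample_hash(x, i, is_query):
    # collision iff x[i] == 1 on both sides; otherwise the value only reveals query vs. data
    if is_query:
        return x[i]                  # 0 or 1
    return 1 if x[i] == 1 else 2     # 1 or 2

D = 8
x = [1, 0, 1, 1, 0, 0, 1, 0]   # data vector
q = [1, 1, 0, 1, 0, 0, 0, 0]   # query vector, inner product a = 2
random.seed(0)
idx = [random.randrange(D) for _ in range(10000)]
est = sum(sample_hash(x, i, False) == sample_hash(q, i, True) for i in idx) / len(idx)
print(est)   # ≈ a / D = 2 / 8 = 0.25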

25 A Closer Look at MinHash. Collision probability:
Pr(h_π(x) = h_π(q)) = a / (f_x + f_q − a) ≥ a/D.
Useful: a / (f_x + f_q − a) is very sensitive w.r.t. a compared to a/D (since D ≫ f). This is the core reason why MinHash is better than random sampling.
Problem: a / (f_x + f_q − a) is not monotonic in a (the inner product), so it is not an LSH for the binary inner product (though a good heuristic!).
Why we are biased in favor of MinHash: SL, In Defense of MinHash over SimHash, AISTATS 2014, shows that for binary data MinHash is provably superior to SimHash. Already some hope to beat the state-of-the-art Sign-ALSH for binary data.

26 The Fix: Asymmetric Minwise Hashing. Let M be the maximum sparsity of the data vectors, M = max_{x ∈ C} f_x. Define P : {0,1}^D → {0,1}^{D+M} and Q : {0,1}^D → {0,1}^{D+M} as:
P(x) = [x; 1; 1; ...; 1; 0; 0; ...; 0]   (append M − f_x ones followed by f_x zeros)
Q(q) = [q; 0; 0; ...; 0]   (append M zeros)

27 The Fix: Asymmetric Minwise Hashing. Let M be the maximum sparsity of the data vectors, M = max_{x ∈ C} f_x. Define P : {0,1}^D → {0,1}^{D+M} and Q : {0,1}^D → {0,1}^{D+M} as:
P(x) = [x; 1; 1; ...; 1; 0; 0; ...; 0]   (append M − f_x ones followed by f_x zeros)
Q(q) = [q; 0; 0; ...; 0]   (append M zeros)
After the transformation:
R = |P(x) ∩ Q(q)| / |P(x) ∪ Q(q)| = a / (M + f_q − a),   monotonic in the inner product a.
Also, M + f_q − a ≪ D (M is of the order of the sparsity; handle outliers separately). Note: to get rid of f_q, change P(·) to P(Q(·)) and Q(·) to Q(P(·)).
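
A tiny sketch of the padding and the resulting resemblance, with sets of nonzero coordinates standing in for the binary vectors (the values are illustrative):

def pad_P(X, D, M):
    # P(x): add (M - f_x) ones in new coordinates; the trailing f_x zeros contribute no nonzeros
    return X | {D + j for j in range(M - len(X))}

def pad_Q(Y):
    # Q(q): only zeros are appended, so the set of nonzero coordinates is unchanged
    return set(Y)

def resemblance(A, B):
    return len(A & B) / len(A | B)

D, M = 10, 6
x = {0, 2, 3, 5}      # f_x = 4
q = {0, 3, 7}         # f_q = 3, inner product a = 2
print(resemblance(pad_P(x, D, M), pad_Q(q)))   # a / (M + f_q - a) = 2/7
print(2 / (M + len(q) - 2))                    # same value; monotone in a for a fixed query

Running ordinary MinHash on the padded sets therefore gives collision probability a / (M + f_q − a), i.e., an ALSH for binary MIPS.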

28 Asymmetric Minwise Hashing: Alternative View.
P'(x) = [x; M − f_x; 0],   Q'(q) = [q; 0; M − f_q].
The weighted Jaccard similarity between P'(x) and Q'(q) is
R_W = Σ_i min(P'(x)_i, Q'(q)_i) / Σ_i max(P'(x)_i, Q'(q)_i) = a / (2M − a).
Use fast consistent weighted sampling (CWS) to get the asymmetric MinHash in O(f_x) time instead of O(2M), where M ≫ f_x.

29 Asymmetric Minwise Hashing: Alternative View.
P'(x) = [x; M − f_x; 0],   Q'(q) = [q; 0; M − f_q].
The weighted Jaccard similarity between P'(x) and Q'(q) is
R_W = Σ_i min(P'(x)_i, Q'(q)_i) / Σ_i max(P'(x)_i, Q'(q)_i) = a / (2M − a).
Use fast consistent weighted sampling (CWS) to get the asymmetric MinHash in O(f_x) time instead of O(2M), where M ≫ f_x.
Alternative view: the M − f_x term favors larger documents in proportion to f_x, which cancels the inherent bias of MinHash towards smaller sets. A novel bias correction, which works well in practice.
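
A quick numerical check of the weighted-Jaccard identity (a consistent weighted sampling scheme such as Ioffe's CWS would then hash these weighted vectors; the vectors below are illustrative):

def weighted_jaccard(u, v):
    num = sum(min(a, b) for a, b in zip(u, v))
    den = sum(max(a, b) for a, b in zip(u, v))
    return num / den

def P_prime(x, M):
    return x + [M - sum(x), 0]        # [x; M - f_x; 0]

def Q_prime(q, M):
    return q + [0, M - sum(q)]        # [q; 0; M - f_q]

M = 6
x = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0]    # f_x = 4
q = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]    # f_q = 3, a = 2
print(weighted_jaccard(P_prime(x, M), Q_prime(q, M)))   # a / (2M - a) = 2/10 = 0.2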

30 Theoretical Comparisons. The collision probability is monotonic in the inner product, so asymmetric minwise hashing is an ALSH for binary MIPS.
ρ_MH-ALSH = log( (S/M) / (2 − S/M) ) / log( (cS/M) / (2 − cS/M) );
ρ_Sign = log( 1 − (1/π) cos⁻¹(S/M) ) / log( 1 − (1/π) cos⁻¹(cS/M) ).
[Figure: theoretical ρ as a function of c for ρ_MH-ALSH and ρ_Sign.]

31 Theoretical Comparisons. The collision probability is monotonic in the inner product, so asymmetric minwise hashing is an ALSH for binary MIPS.
ρ_MH-ALSH = log( (S/M) / (2 − S/M) ) / log( (cS/M) / (2 − cS/M) );
ρ_Sign = log( 1 − (1/π) cos⁻¹(S/M) ) / log( 1 − (1/π) cos⁻¹(cS/M) ).
[Figure: theoretical ρ as a function of c for ρ_MH-ALSH and ρ_Sign.]
Asymmetric minwise hashing is significantly better than Sign-ALSH (SL UAI 2015), as expected after SL AISTATS 2014.
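
A short script comparing the two ρ values, using the collision probabilities exactly as written above (the values of s = S/M and c are illustrative; smaller ρ means a faster query):

import math

def rho_mh_alsh(s, c):
    # asymmetric MinHash collision probability p = (a/M) / (2 - a/M)
    p1, p2 = s / (2 - s), (c * s) / (2 - c * s)
    return math.log(p1) / math.log(p2)

def rho_sign(s, c):
    # sign random projection collision probability p = 1 - arccos(a/M) / pi
    p1, p2 = 1 - math.acos(s) / math.pi, 1 - math.acos(c * s) / math.pi
    return math.log(p1) / math.log(p2)

s = 0.5
for c in (0.9, 0.7, 0.5, 0.3):
    print(c, round(rho_mh_alsh(s, c), 3), round(rho_sign(s, c), 3))   # MH-ALSH is consistently smaller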

32 Complaints with MinHash: Costly Sampling. (The slide shows the example binary vectors S1, S2, S3.)

33 Complaints with MinHash: Costly Sampling. (The slide adds the shuffled vectors π(S1), π(S2), π(S3) and the resulting hash values.)

34 Complaints with MinHash: Costly Sampling. We process the entire vector to compute one MinHash: O(d). The search time is dominated by hashing the query: O(KLd). Training and testing time are dominated by the hashing time: O(kd). Parallelization is possible but not energy efficient (LSK WWW 2012).

35 Complaints with MinHash: Costly Sampling. In addition, storing only the minimum seems quite wasteful.

36 Solution: One Pass for All Hashes. 1. Sketching: bin the coordinates and compute the minimum non-zero index in each bin (illustrated on the permuted vectors π(S1) and π(S2)).

37 Solution: One Pass for All Hashes. 1. Sketching: bin the coordinates and compute the minimum non-zero index in each bin. The permuted vectors π(S1) and π(S2) are split into six bins, Bin 0 through Bin 5.

38 Solution: One Pass for All Hashes. (Same binning illustration as above.)

39 Solution: One Pass for All Hashes. The bin-wise minima give the sketches OPH(S1) and OPH(S2); bins containing no non-zero entry are empty (E).

40 Solution: One Pass for All Hashes. 2. Fill empty bins: borrow from the right (circular) with a shift, giving the final hashes H(S1) and H(S2).

41 Solution: One Pass for All Hashes. (Same filling illustration as above.)

42 Solution: One Pass for All Hashes. An empty bin takes the value of the next non-empty bin to its right (circularly), offset by a constant C.

43 Solution: One Pass for All Hashes. Borrowed values are offset by C per hop, so values borrowed across different numbers of hops remain distinguishable.

44 Solution: One Pass for All Hashes. 1. Sketching: bin the coordinates and compute the minimum non-zero index in each bin, giving OPH(S1) and OPH(S2) with possibly empty bins. 2. Fill empty bins: borrow from the right (circular) with a shift of C per hop, giving H(S1) and H(S2). Then
Pr(H_j(S1) = H_j(S2)) = R   for each bin j ∈ {0, 1, 2, ..., k − 1}.
Cost: O(d + k) instead of the traditional O(dk)!
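
A rough sketch of the binning-plus-densification idea described on these slides (the borrowing direction and the shift constant C follow the slides; this is not the exact implementation from the one permutation hashing papers):

import random

def one_permutation_hashes(X, perm, D, k, C=10**6):
    # 1) permute and bin: keep the minimum offset of a nonzero inside each of the k bins
    bin_size = D // k                      # assume D divisible by k for this sketch
    bins = [None] * k
    for x in X:
        p = perm[x]
        b, off = p // bin_size, p % bin_size
        if bins[b] is None or off < bins[b]:
            bins[b] = off
    # 2) densify: fill an empty bin from the next non-empty bin to the right (circular),
    #    adding C per hop so that borrowed values stay distinguishable
    hashes = []
    for j in range(k):
        hops = 0
        while bins[(j + hops) % k] is None:
            hops += 1
        hashes.append(bins[(j + hops) % k] + hops * C)
    return hashes

D, k = 24, 6
random.seed(0)
perm = random.sample(range(D), D)
S1, S2 = {0, 3, 7, 11, 19}, {0, 3, 7, 11, 21}
h1, h2 = one_permutation_hashes(S1, perm, D, k), one_permutation_hashes(S2, perm, D, k)
print(sum(a == b for a, b in zip(h1, h2)) / k)   # each of the k slots collides with probability ≈ R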

45 Speedup. [Figure: on WEBSPAM, the ratio of the old to the new hashing time as a function of the number of hash evaluations. Caption: the ratio of old and new hashing time indicates a linear-time speedup.]

46 Datasets and Baselines. Table: datasets EP2006, MNIST, NEWS20, and NYTIMES, with the number of queries, number of training points, dimensionality, and the mean ± std of nonzeros per vector. Competing schemes: 1. Asymmetric minwise hashing (the proposed scheme). 2. Traditional minwise hashing (MinHash). 3. L2-based asymmetric LSH for inner products (L2-ALSH). 4. SimHash-based asymmetric LSH for inner products (Sign-ALSH).

47 Actual Savings in Retrieval. [Figure: fraction of a linear scan vs. recall on EP2006, MNIST, NEWS20, and NYTimes at several Top-K thresholds, comparing the proposed scheme with MinHash, L2-ALSH, and Sign-ALSH.]

48 Ranking Verification. [Figure: precision-recall curves when ranking with the top 32, 64, and 128 hashes on EP2006, MNIST, NEWS20, and NYTimes, comparing the proposed scheme with MinHash, L2-ALSH, and Sign-ALSH.]

49 Conclusions. Minwise hashing has an inherent bias towards smaller sets. Using the recent line of work on asymmetric LSH, we can remove this bias with asymmetric transformations. Asymmetric minwise hashing leads to an algorithmic improvement over the state-of-the-art hashing schemes for binary MIPS. We can obtain huge speedups using the recent line of work on one permutation hashing. The final algorithm performs very well in practice compared to popular schemes.

50 References.
Asymmetric LSH framework and improvements:
Shrivastava & Li. Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). NIPS 2014.
Shrivastava & Li. Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS). UAI 2015.
Efficient replacements of minwise hashing:
Li et al. One Permutation Hashing. NIPS 2012.
Shrivastava & Li. Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search. ICML 2014.
Shrivastava & Li. Improved Densification of One Permutation Hashing. UAI 2014.
Minwise hashing is superior to SimHash for binary data:
Shrivastava & Li. In Defense of MinHash over SimHash. AISTATS 2014.
