Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment


1 Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment. Anshumali Shrivastava and Ping Li, Cornell University and Rutgers University. WWW 2015, Florence, Italy, May 21st 2015. (Will join Rice Univ. as TT Asst. Prof., Fall 2015.)

2 What are we solving? Minwise hashing is widely popular for search and retrieval. Major complaint: document length is unnecessarily penalized. We precisely fix this and provide a practical solution. Other consequence: an algorithmic improvement for binary maximum inner product search (MIPS).

3 Outline: Motivation; Asymmetric LSH for General Inner Products; Asymmetric Minwise Hashing; Faster Sampling; Experimental Results.

4 Shingle-Based Representation. The shingle-based representation (bag-of-words) is widely adopted: a document is represented as a set of tokens over a vocabulary Ω. Example sentence: Five Kitchen Berkley. Shingle representation (uni-grams): {Five, Kitchen, Berkley}. Shingle representation (bi-grams): {Five Kitchen, Kitchen Berkley}.

5 Shingle-Based Representation. The shingle-based representation (bag-of-words) is widely adopted: a document is represented as a set of tokens over a vocabulary Ω. Example sentence: Five Kitchen Berkley. Shingle representation (uni-grams): {Five, Kitchen, Berkley}. Shingle representation (bi-grams): {Five Kitchen, Kitchen Berkley}. Sparse binary high-dimensional data is everywhere: sets can be represented as binary vectors indicating presence/absence, the vocabulary is typically huge in practice, and modern big-data systems use only the binary data matrix.

6 Resemblance (Jaccard) Similarity. The popular resemblance (Jaccard) similarity between two sets (or binary vectors) X, Y ⊆ Ω is defined as
R = |X ∩ Y| / |X ∪ Y| = a / (f_x + f_y − a),
where a = |X ∩ Y|, f_x = |X|, f_y = |Y|, and |·| denotes cardinality. For the binary (0/1) vector representation of a set (locations of nonzeros),
a = |X ∩ Y| = x^T y,  f_x = nonzeros(x),  f_y = nonzeros(y),
where x and y are the binary vector equivalents of the sets X and Y respectively.
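
A minimal sketch of these quantities in Python, with documents as sets of tokens (the example sets are illustrative):

def resemblance(X, Y):
    # R = |X ∩ Y| / |X ∪ Y| = a / (f_x + f_y - a)
    a = len(X & Y)
    f_x, f_y = len(X), len(Y)
    return a / (f_x + f_y - a)

print(resemblance({"five", "guys"}, {"five", "kitchen", "berkley"}))  # a=1, f_x=2, f_y=3 -> 1/4 = 0.25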

7 Minwise Hashing (Broder 97). The standard practice in the search industry: given a random permutation π (or a random hash function) over Ω, i.e., π : Ω → Ω with Ω = {0, 1, ..., D − 1}, the MinHash is given by
h_π(x) = min(π(x)).
An elementary probability argument shows that
Pr(min(π(X)) = min(π(Y))) = |X ∩ Y| / |X ∪ Y| = R.
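
A small illustration of this collision probability, assuming explicit random permutations over a vocabulary of size D (the slow one-pass-per-hash variant; faster schemes appear later in the talk):

import random

def minhash(X, perm):
    # h_pi(X) = min over x in X of pi(x)
    return min(perm[x] for x in X)

D = 10
random.seed(0)
perms = [random.sample(range(D), D) for _ in range(500)]  # 500 independent permutations

X, Y = {0, 1, 2, 3}, {2, 3, 4, 5}
est = sum(minhash(X, p) == minhash(Y, p) for p in perms) / len(perms)
print(est)  # close to R = |X ∩ Y| / |X ∪ Y| = 2/6 ≈ 0.33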

8 Traditional Minwise Hashing Computation [2]. 1. Uniformly sample a permutation π over the attributes. 2. Shuffle the vectors under π. 3. The hash value is the smallest index that is not zero. (The slide shows three example binary vectors S1, S2, S3.)

9 Traditional Minwise Hashing Computation [2]. (Same steps as above; the slide adds the shuffled vectors π(S1), π(S2), π(S3).)

10 Traditional Minwise Hashing Computation [2]. (Same steps as above; the slide adds the resulting hash values h_π(S1), h_π(S2), h_π(S3).)

11 Traditional Minwise Hashing Computation [2]. For any two binary vectors S1, S2 we always have Pr(h_π(S1) = h_π(S2)) = |S1 ∩ S2| / |S1 ∪ S2| = R (the Jaccard similarity).

[2] This is very inefficient; we recently found faster ways (ICML 2014 and UAI 2014).

12 Locality Sensitive Hashing (LSH) and Sub-linear Search. Locality sensitive: a family of (randomized) hash functions h such that
Pr_h[h(x) = h(y)] = f(sim(x, y)),
where f is monotonically increasing [3]. MinHash is an LSH for resemblance (Jaccard similarity).
[3] A stronger sufficient condition than the classical one.

13 Locality Sensitive Hashing (LSH) and Sub-linear Search. Locality sensitive: a family of (randomized) hash functions h such that
Pr_h[h(x) = h(y)] = f(sim(x, y)),
where f is monotonically increasing [3]. MinHash is an LSH for resemblance (Jaccard similarity). Well known: the existence of an LSH for a similarity implies fast search algorithms with query time O(n^ρ), ρ < 1 (Indyk & Motwani 98). The quantity ρ depends on the property f; smaller is better. (A sketch of the standard bucketing index built from such a family follows.)
[3] A stronger sufficient condition than the classical one.
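
A rough sketch of how an LSH family is turned into a sub-linear search index: the classic construction concatenates K hashes per bucket key and repeats over L tables (the parameters and data below are illustrative, using MinHash as the family):

import random
from collections import defaultdict

def minhash(X, perm):
    return min(perm[x] for x in X)

def build_index(data, perms, K, L):
    # L tables; the bucket key in table t is the tuple of K concatenated MinHashes
    tables = [defaultdict(list) for _ in range(L)]
    for idx, X in enumerate(data):
        for t in range(L):
            key = tuple(minhash(X, perms[t * K + k]) for k in range(K))
            tables[t][key].append(idx)
    return tables

def query(q, tables, perms, K):
    # union of the L probed buckets; candidates are then verified exactly
    cand = set()
    for t, table in enumerate(tables):
        key = tuple(minhash(q, perms[t * K + k]) for k in range(K))
        cand.update(table.get(key, []))
    return cand

D, K, L = 20, 2, 5
random.seed(1)
perms = [random.sample(range(D), D) for _ in range(K * L)]
data = [{0, 1, 2, 3}, {2, 3, 4, 5}, {10, 11, 12}]
print(query({0, 1, 2, 4}, build_index(data, perms, K, L), perms, K))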

14 Known Complaints with Resemblance.
R = |X ∩ Y| / |X ∪ Y| = a / (f_x + f_y − a).
Consider the text descriptions of two restaurants:
1. Five Guys Burgers and Fries, Downtown Brooklyn, New York → {five, guys, burgers, and, fries, downtown, brooklyn, new, york}
2. Five Kitchen Berkley → {five, kitchen, berkley}
Search query (Q): Five Guys → {five, guys}

15 Known Complaints with Resemblance.
R = |X ∩ Y| / |X ∪ Y| = a / (f_x + f_y − a).
Consider the text descriptions of two restaurants:
1. Five Guys Burgers and Fries, Downtown Brooklyn, New York → {five, guys, burgers, and, fries, downtown, brooklyn, new, york}
2. Five Kitchen Berkley → {five, kitchen, berkley}
Search query (Q): Five Guys → {five, guys}
Resemblance with the two descriptions:
1. |X ∩ Q| = 2, |X ∪ Q| = 9, R = 2/9 ≈ 0.22
2. |X ∩ Q| = 1, |X ∪ Q| = 4, R = 1/4 = 0.25
Resemblance penalizes the size of the document.

16 Alternatives: Set Containment and Inner Product. For many applications (e.g., record matching, plagiarism detection, etc.) Jaccard containment is more suited than resemblance. The Jaccard containment of X w.r.t. Q is
J_C = |X ∩ Q| / |Q| = a / f_q.    (1)
Some observations: 1. It does not penalize the size of the text. 2. Its ordering is the same as the ordering of the inner products a (or overlap). 3. It gives the desirable ordering in the previous example.
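
A quick check of the two orderings on the restaurant example above (a sketch; the token sets are copied from the slides):

def resemblance(X, Q):
    a = len(X & Q)
    return a / (len(X) + len(Q) - a)

def containment(X, Q):
    # J_C = |X ∩ Q| / |Q| = a / f_q ; same ordering as the inner product a
    return len(X & Q) / len(Q)

X1 = {"five", "guys", "burgers", "and", "fries", "downtown", "brooklyn", "new", "york"}
X2 = {"five", "kitchen", "berkley"}
Q = {"five", "guys"}
print(resemblance(X1, Q), resemblance(X2, Q))   # 2/9 ≈ 0.22 vs 1/4 = 0.25: wrong order
print(containment(X1, Q), containment(X2, Q))   # 1.0 vs 0.5: desired order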

17 LSH Framework Not Sufficient for Inner Products. Locality sensitive requirement:
Pr_h[h(x) = h(y)] = f(x^T y),
where f is monotonically increasing.
Theorem (Shrivastava and Li, NIPS 2014): impossible for dot products. For inner products we can have x and y such that x^T y > x^T x, so self-similarity is not the highest similarity. Under any hash function Pr(h(x) = h(x)) = 1, but we would need Pr(h(x) = h(y)) > Pr(h(x) = h(x)) = 1.

18 LSH Framework Not Sufficient for Inner Products. (Same requirement and theorem as above.) For binary inner products it is still impossible: x^T y ≤ x^T x always holds, but we would instead need x, y, z such that x^T y > z^T z.

19 LSH Framework Not Sufficient for Inner Products. (Same as above.) Hopeless to find a locality sensitive hashing!

20 Asymmetric LSH (ALSH) for General Inner Products. Shrivastava and Li (NIPS 2014): despite the absence of an LSH, maximum inner product search (MIPS) is still efficient via an extended framework.

21 Asymmetric LSH (ALSH) for General Inner Products. Shrivastava and Li (NIPS 2014): despite the absence of an LSH, maximum inner product search (MIPS) is still efficient via an extended framework. The asymmetric LSH framework, the idea: 1. Construct two transformations P(·) and Q(·) (with P ≠ Q) along with a randomized hash function h. 2. P(·), Q(·) and h satisfy
Pr_h[h(P(x)) = h(Q(q))] = f(x^T q),  with f monotonic.

22 Small Things That Made a BIG Difference. Shrivastava and Li NIPS 2014 construction (L2-ALSH): 1. P(x): scale the data to shrink all norms below 0.83; append ||x||², ||x||⁴, and ||x||⁸ to the vector x (just 3 scalars). 2. Q(q): normalize; append three 0.5's to the vector q. 3. h: use the standard LSH family for L2 distance. Caution: scaling is asymmetry in a strict sense; it changes the distribution (e.g., the variance) of the hashes. First practical and provable algorithm for general MIPS. [Figure: precision-recall curves for top-ranked retrieval on Movielens and NetFlix (K = 256 hashes), comparing against L2LSH and SRP.]
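
A sketch of the two L2-ALSH transformations just described, assuming the scaling constant 0.83, m = 3 appended scalars, and a standard L2 LSH function; the constants and names are illustrative, not the authors' exact implementation:

import numpy as np

def P(x, max_norm, U=0.83, m=3):
    # preprocessing transform: scale so every norm stays below U, then append ||x||^2, ||x||^4, ||x||^8
    x = (U / max_norm) * np.asarray(x, dtype=float)
    sq = float(np.dot(x, x))                       # ||x||^2 after scaling
    return np.concatenate([x, [sq, sq ** 2, sq ** 4][:m]])

def Q(q, m=3):
    # query transform: normalize to unit norm, then append m copies of 1/2
    q = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.concatenate([q, 0.5 * np.ones(m)])

def l2_hash(v, w, b, r=2.5):
    # standard LSH for L2 distance: floor((w . v + b) / r)
    return int(np.floor((np.dot(w, v) + b) / r))

rng = np.random.default_rng(0)
x, q = rng.random(5), rng.random(5)
w, b = rng.standard_normal(5 + 3), rng.uniform(0, 2.5)
print(l2_hash(P(x, max_norm=np.linalg.norm(x)), w, b), l2_hash(Q(q), w, b))

Minimizing ||P(x) − Q(q)||₂ then (approximately) maximizes x^T q, so collisions under h become monotonic in the inner product.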

23 A Generic Recipe: Even Better ALSH for MIPS. The recipe: start with a similarity S(q, x) for which we have an LSH (or ALSH); design P(·) and Q(·) such that S(Q(q), P(x)) is monotonic in q^T x; use extra dimensions. Improved ALSH (Sign-ALSH) construction for general MIPS: S(q, x) = q^T x / (||q||₂ ||x||₂) and SimHash [4] (Shrivastava and Li UAI 2015). [Table: comparison of Sign-ALSH, L2-ALSH, and Cone Trees on MNIST, WEBSPAM, and RCV1.]
[4] Expected after Shrivastava and Li ICML 2014, Coding for Random Projections.

24 Binary MIPS: A Sampling-Based ALSH. Idea: sample an index i uniformly; if x_i = 1 and q_i = 1, make the hashes collide, else not. With i drawn uniformly, the query-side hash H_S(Q(q)) is just q_i (0 or 1), while the preprocessing-side hash H_S(P(x)) is 1 if x_i = 1 and 2 if x_i = 0, so a collision happens exactly when x_i = q_i = 1. Then
Pr(H_S(P(x)) = H_S(Q(q))) = a/D,
which is monotonic in the inner product a.
Problems: 1. The hash is informative only if x_i = 1; otherwise it merely indicates query or not. 2. For sparse data, with D ≫ f ≥ a, a/D ≈ 0 and almost all hashes are uninformative.
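
A small simulation of this sampling hash (the encoding of the two sides follows the description above; the data values are illustrative):

import random

def sample_hash(x, i, is_query):
    # collision iff x[i] == 1 on both sides; otherwise the value only reveals query vs. data
    if is_query:
        return x[i]                  # 0 or 1
    return 1 if x[i] == 1 else 2     # 1 or 2

D = 8
x = [1, 0, 1, 1, 0, 0, 1, 0]   # data vector
q = [1, 1, 0, 1, 0, 0, 0, 0]   # query vector, inner product a = 2
random.seed(0)
idx = [random.randrange(D) for _ in range(10000)]
est = sum(sample_hash(x, i, False) == sample_hash(q, i, True) for i in idx) / len(idx)
print(est)   # ≈ a / D = 2 / 8 = 0.25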

25 A Closer Look at MinHash. Collision probability:
Pr(h_π(x) = h_π(q)) = a / (f_x + f_q − a) ≥ a/D.
Useful: a / (f_x + f_q − a) is very sensitive w.r.t. a compared to a/D (since D ≫ f). This is the core reason why MinHash is better than random sampling.
Problem: a / (f_x + f_q − a) is not monotonic in a (the inner product), so it is not an LSH for the binary inner product (though a good heuristic!).
Why we are biased in favor of MinHash: SL, In Defense of MinHash over SimHash, AISTATS 2014, shows that for binary data MinHash is provably superior to SimHash. Already some hope to beat the state-of-the-art Sign-ALSH for binary data.

26 The Fix: Asymmetric Minwise Hashing. Let M be the maximum sparsity of the data vectors, M = max_{x ∈ C} f_x. Define P : {0,1}^D → {0,1}^{D+M} and Q : {0,1}^D → {0,1}^{D+M} as:
P(x) = [x; 1; 1; ...; 1; 0; 0; ...; 0]   (append M − f_x ones followed by f_x zeros)
Q(q) = [q; 0; 0; ...; 0]   (append M zeros)

27 The Fix: Asymmetric Minwise Hashing. Let M be the maximum sparsity of the data vectors, M = max_{x ∈ C} f_x. Define P : {0,1}^D → {0,1}^{D+M} and Q : {0,1}^D → {0,1}^{D+M} as:
P(x) = [x; 1; 1; ...; 1; 0; 0; ...; 0]   (append M − f_x ones followed by f_x zeros)
Q(q) = [q; 0; 0; ...; 0]   (append M zeros)
After the transformation:
R = |P(x) ∩ Q(q)| / |P(x) ∪ Q(q)| = a / (M + f_q − a),   monotonic in the inner product a.
Also, M + f_q − a ≪ D (M is of the order of the sparsity; handle outliers separately). Note: to get rid of f_q, change P(·) to P(Q(·)) and Q(·) to Q(P(·)).
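
A tiny sketch of the padding and the resulting resemblance, with sets of nonzero coordinates standing in for the binary vectors (the values are illustrative):

def pad_P(X, D, M):
    # P(x): add (M - f_x) ones in new coordinates; the trailing f_x zeros contribute no nonzeros
    return X | {D + j for j in range(M - len(X))}

def pad_Q(Y):
    # Q(q): only zeros are appended, so the set of nonzero coordinates is unchanged
    return set(Y)

def resemblance(A, B):
    return len(A & B) / len(A | B)

D, M = 10, 6
x = {0, 2, 3, 5}      # f_x = 4
q = {0, 3, 7}         # f_q = 3, inner product a = 2
print(resemblance(pad_P(x, D, M), pad_Q(q)))   # a / (M + f_q - a) = 2/7
print(2 / (M + len(q) - 2))                    # same value; monotone in a for a fixed query

Running ordinary MinHash on the padded sets therefore gives collision probability a / (M + f_q − a), i.e., an ALSH for binary MIPS.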

28 Asymmetric Minwise Hashing: Alternative View.
P'(x) = [x; M − f_x; 0],   Q'(q) = [q; 0; M − f_q].
The weighted Jaccard similarity between P'(x) and Q'(q) is
R_W = Σ_i min(P'(x)_i, Q'(q)_i) / Σ_i max(P'(x)_i, Q'(q)_i) = a / (2M − a).
Use fast consistent weighted sampling (CWS) to get the asymmetric MinHash in O(f_x) time instead of O(2M), where M ≫ f_x.

29 Asymmetric Minwise Hashing: Alternative View.
P'(x) = [x; M − f_x; 0],   Q'(q) = [q; 0; M − f_q].
The weighted Jaccard similarity between P'(x) and Q'(q) is
R_W = Σ_i min(P'(x)_i, Q'(q)_i) / Σ_i max(P'(x)_i, Q'(q)_i) = a / (2M − a).
Use fast consistent weighted sampling (CWS) to get the asymmetric MinHash in O(f_x) time instead of O(2M), where M ≫ f_x.
Alternative view: the M − f_x term favors larger documents in proportion to f_x, which cancels the inherent bias of MinHash towards smaller sets. A novel bias correction, which works well in practice.
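
A quick numerical check of the weighted-Jaccard identity (a consistent weighted sampling scheme such as Ioffe's CWS would then hash these weighted vectors; the vectors below are illustrative):

def weighted_jaccard(u, v):
    num = sum(min(a, b) for a, b in zip(u, v))
    den = sum(max(a, b) for a, b in zip(u, v))
    return num / den

def P_prime(x, M):
    return x + [M - sum(x), 0]        # [x; M - f_x; 0]

def Q_prime(q, M):
    return q + [0, M - sum(q)]        # [q; 0; M - f_q]

M = 6
x = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0]    # f_x = 4
q = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]    # f_q = 3, a = 2
print(weighted_jaccard(P_prime(x, M), Q_prime(q, M)))   # a / (2M - a) = 2/10 = 0.2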

30 Theoretical Comparisons. The collision probability is monotonic in the inner product, so asymmetric minwise hashing is an ALSH for binary MIPS.
ρ_MH-ALSH = log( (S/M) / (2 − S/M) ) / log( (cS/M) / (2 − cS/M) );
ρ_Sign = log( 1 − (1/π) cos⁻¹(S/M) ) / log( 1 − (1/π) cos⁻¹(cS/M) ).
[Figure: theoretical ρ as a function of c for ρ_MH-ALSH and ρ_Sign.]

31 Theoretical Comparisons. The collision probability is monotonic in the inner product, so asymmetric minwise hashing is an ALSH for binary MIPS.
ρ_MH-ALSH = log( (S/M) / (2 − S/M) ) / log( (cS/M) / (2 − cS/M) );
ρ_Sign = log( 1 − (1/π) cos⁻¹(S/M) ) / log( 1 − (1/π) cos⁻¹(cS/M) ).
[Figure: theoretical ρ as a function of c for ρ_MH-ALSH and ρ_Sign.]
Asymmetric minwise hashing is significantly better than Sign-ALSH (SL UAI 2015), as expected after SL AISTATS 2014.
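
A short script comparing the two ρ values, using the collision probabilities exactly as written above (the values of s = S/M and c are illustrative; smaller ρ means a faster query):

import math

def rho_mh_alsh(s, c):
    # asymmetric MinHash collision probability p = (a/M) / (2 - a/M)
    p1, p2 = s / (2 - s), (c * s) / (2 - c * s)
    return math.log(p1) / math.log(p2)

def rho_sign(s, c):
    # sign random projection collision probability p = 1 - arccos(a/M) / pi
    p1, p2 = 1 - math.acos(s) / math.pi, 1 - math.acos(c * s) / math.pi
    return math.log(p1) / math.log(p2)

s = 0.5
for c in (0.9, 0.7, 0.5, 0.3):
    print(c, round(rho_mh_alsh(s, c), 3), round(rho_sign(s, c), 3))   # MH-ALSH is consistently smaller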

32 Complaints with MinHash: Costly Sampling. (The slide shows the example binary vectors S1, S2, S3.)

33 Complaints with MinHash: Costly Sampling. (The slide adds the shuffled vectors π(S1), π(S2), π(S3) and the resulting hash values.)

34 Complaints with MinHash: Costly Sampling. We process the entire vector to compute one MinHash: O(d). The search time is dominated by hashing the query: O(KLd). Training and testing time are dominated by the hashing time: O(kd). Parallelization is possible but not energy efficient (LSK WWW 2012).

35 Complaints with MinHash: Costly Sampling. In addition, storing only the minimum seems quite wasteful.

36 Solution: One Pass for All Hashes. 1. Sketching: bin the coordinates and compute the minimum non-zero index in each bin (illustrated on the permuted vectors π(S1) and π(S2)).

37 Solution: One Pass for All Hashes. 1. Sketching: bin the coordinates and compute the minimum non-zero index in each bin. The permuted vectors π(S1) and π(S2) are split into six bins, Bin 0 through Bin 5.

38 Solution: One Pass for All Hashes. (Same binning illustration as above.)

39 Solution: One Pass for All Hashes. The bin-wise minima give the sketches OPH(S1) and OPH(S2); bins containing no non-zero entry are empty (E).

40 Solution: One Pass for All Hashes. 2. Fill empty bins: borrow from the right (circular) with a shift, giving the final hashes H(S1) and H(S2).

41 Solution: One Pass for All Hashes. (Same filling illustration as above.)

42 Solution: One Pass for All Hashes. An empty bin takes the value of the next non-empty bin to its right (circularly), offset by a constant C.

43 Solution: One Pass for All Hashes. Borrowed values are offset by C per hop, so values borrowed across different numbers of hops remain distinguishable.

44 Solution: One Pass for All Hashes. 1. Sketching: bin the coordinates and compute the minimum non-zero index in each bin, giving OPH(S1) and OPH(S2) with possibly empty bins. 2. Fill empty bins: borrow from the right (circular) with a shift of C per hop, giving H(S1) and H(S2). Then
Pr(H_j(S1) = H_j(S2)) = R   for each bin j ∈ {0, 1, 2, ..., k − 1}.
Cost: O(d + k) instead of the traditional O(dk)!
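
A rough sketch of the binning-plus-densification idea described on these slides (the borrowing direction and the shift constant C follow the slides; this is not the exact implementation from the one permutation hashing papers):

import random

def one_permutation_hashes(X, perm, D, k, C=10**6):
    # 1) permute and bin: keep the minimum offset of a nonzero inside each of the k bins
    bin_size = D // k                      # assume D divisible by k for this sketch
    bins = [None] * k
    for x in X:
        p = perm[x]
        b, off = p // bin_size, p % bin_size
        if bins[b] is None or off < bins[b]:
            bins[b] = off
    # 2) densify: fill an empty bin from the next non-empty bin to the right (circular),
    #    adding C per hop so that borrowed values stay distinguishable
    hashes = []
    for j in range(k):
        hops = 0
        while bins[(j + hops) % k] is None:
            hops += 1
        hashes.append(bins[(j + hops) % k] + hops * C)
    return hashes

D, k = 24, 6
random.seed(0)
perm = random.sample(range(D), D)
S1, S2 = {0, 3, 7, 11, 19}, {0, 3, 7, 11, 21}
h1, h2 = one_permutation_hashes(S1, perm, D, k), one_permutation_hashes(S2, perm, D, k)
print(sum(a == b for a, b in zip(h1, h2)) / k)   # each of the k slots collides with probability ≈ R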

45 Speedup. [Figure: on WEBSPAM, the ratio of the old to the new hashing time as a function of the number of hash evaluations. Caption: the ratio of old and new hashing time indicates a linear-time speedup.]

46 Datasets and Baselines. Table: datasets EP2006, MNIST, NEWS20, and NYTIMES, with the number of queries, number of training points, dimensionality, and the mean ± std of nonzeros per vector. Competing schemes: 1. Asymmetric minwise hashing (the proposed scheme). 2. Traditional minwise hashing (MinHash). 3. L2-based asymmetric LSH for inner products (L2-ALSH). 4. SimHash-based asymmetric LSH for inner products (Sign-ALSH).

47 Actual Savings in Retrieval. [Figure: fraction of a linear scan vs. recall on EP2006, MNIST, NEWS20, and NYTimes at several Top-K thresholds, comparing the proposed scheme with MinHash, L2-ALSH, and Sign-ALSH.]

48 Ranking Verification. [Figure: precision-recall curves when ranking with the top 32, 64, and 128 hashes on EP2006, MNIST, NEWS20, and NYTimes, comparing the proposed scheme with MinHash, L2-ALSH, and Sign-ALSH.]

49 Conclusions. Minwise hashing has an inherent bias towards smaller sets. Using the recent line of work on asymmetric LSH, we can remove this bias with asymmetric transformations. Asymmetric minwise hashing leads to an algorithmic improvement over the state-of-the-art hashing schemes for binary MIPS. We can obtain huge speedups using the recent line of work on one permutation hashing. The final algorithm performs very well in practice compared to popular schemes.

50 References.
Asymmetric LSH framework and improvements:
Shrivastava & Li. Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). NIPS 2014.
Shrivastava & Li. Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS). UAI 2015.
Efficient replacements of minwise hashing:
Li et al. One Permutation Hashing. NIPS 2012.
Shrivastava & Li. Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search. ICML 2014.
Shrivastava & Li. Improved Densification of One Permutation Hashing. UAI 2014.
Minwise hashing is superior to SimHash for binary data:
Shrivastava & Li. In Defense of MinHash over SimHash. AISTATS 2014.
