Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

Similar documents
PROBABILISTIC HASHING TECHNIQUES FOR BIG DATA

Efficient Data Reduction and Summarization

Probabilistic Hashing for Efficient Search and Learning

On Symmetric and Asymmetric LSHs for Inner Product Search

arxiv: v2 [stat.ml] 13 Nov 2014

arxiv: v1 [stat.me] 18 Jun 2014

Fast Approximate k-way Similarity Search. Anshumali Shrivastava. Ping Li

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

Algorithms for Data Science: Lecture on Finding Similar Items

Bloom Filters and Locality-Sensitive Hashing

One Permutation Hashing for Efficient Search and Learning

arxiv: v1 [cs.it] 16 Aug 2017

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

Nearest Neighbor Classification Using Bottom-k Sketches

4 Locality-sensitive hashing using stable distributions

Piazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)

Hash-based Indexing: Application, Impact, and Realization Alternatives

LSH-SAMPLING BREAKS THE COMPUTATIONAL CHICKEN-AND-EGG LOOP IN ADAPTIVE STOCHASTIC GRADIENT ESTIMATION

Randomized Algorithms

CS6931 Database Seminar. Lecture 6: Set Operations on Massive Data

Similarity Search. Stony Brook University CSE545, Fall 2016

CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction

CS60021: Scalable Data Mining. Similarity Search and Hashing. Sourangshu Bhattacharya

Algorithms for Data Science

High Dimensional Search: Min-Hashing, Locality Sensitive Hashing

Ad Placement Strategies

Similarity Search in High Dimensions II. Piotr Indyk MIT

Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence. Mikkel Thorup University of Copenhagen

COMPSCI 514: Algorithms for Data Science

Retrieval by Content. Part 2: Text Retrieval Term Frequency and Inverse Document Frequency. Srihari: CSE 626 1

Supplemental Material for Discrete Graph Hashing

Improved Consistent Sampling, Weighted Minhash and L1 Sketching

Dimension Reduction in Kernel Spaces from Locality-Sensitive Hashing

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT)

Lecture 17 03/21, 2017

ECS289: Scalable Machine Learning

Stochastic Generative Hashing

Analysis of Algorithms I: Perfect Hashing

Lecture 14 October 16, 2014

Lecture 3 Sept. 4, 2014

Bloom Filters, Minhashes, and Other Random Stuff

Composite Quantization for Approximate Nearest Neighbor Search

Minwise hashing for large-scale regression and classification with sparse data

Linear Spectral Hashing

Distribution-specific analysis of nearest neighbor search and classification

arxiv: v1 [stat.ml] 23 Nov 2017

Set Similarity Search Beyond MinHash

CS246 Final Exam, Winter 2011

Introduction to Information Retrieval

B490 Mining the Big Data

Super-Bit Locality-Sensitive Hashing

Text Analytics (Text Mining)

1 Finding Similar Items

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman

Practical and Optimal LSH for Angular Distance

Algorithms for Nearest Neighbors

A New Space for Comparing Graphs

Locality Sensitive Hashing

Some Useful Background for Talk on the Fast Johnson-Lindenstrauss Transform

Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics

Lecture: Analysis of Algorithms (CS )

arxiv: v2 [cs.ds] 17 Apr 2018

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Decoupled Collaborative Ranking

The Power of Asymmetry in Binary Hashing

Boolean and Vector Space Retrieval Models

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32

Large-scale Collaborative Ranking in Near-Linear Time

V.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74

Today's topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara

There are two things that are particularly nice about the first basis

Lecture 3: Pivoted Document Length Normalization

Ranked Retrieval (2)

Sparse and Robust Optimization and Applications

Parameter Norm Penalties. Sargur N. Srihari

Finding similar items

Fast Hierarchical Clustering from the Baire Distance

Nystrom Method for Approximating the GMM Kernel

Geometry of Similarity Search

Tabulation-Based Hashing

Introduction to Hash Tables

More on Estimation. Maximum Likelihood Estimation.

Recommendation Systems

Optimization for Machine Learning

CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) =

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology

Lecture 9 Nearest Neighbor Search: Locality Sensitive Hashing.

SQL-Rank: A Listwise Approach to Collaborative Ranking

Introduction to Randomized Algorithms III

Optimal Data-Dependent Hashing for Approximate Near Neighbors

Text Analytics (Text Mining)

Hashing. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing. Philip Bille

Hashing. Hashing. Dictionaries. Dictionaries. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing

SPATIAL INDEXING. Vaibhav Bajpai

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Exploiting Primal and Dual Sparsity for Extreme Classification 1

Sparse Fourier Transform (lecture 1)

Vector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.

Transcription:

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment
Anshumali Shrivastava and Ping Li
Cornell University and Rutgers University
WWW 2015, Florence, Italy, May 21st 2015
(Anshumali Shrivastava will join Rice University as a tenure-track Assistant Professor in Fall 2015.)

What are we solving?
Minwise hashing is widely popular for search and retrieval.
Major complaint: document length is unnecessarily penalized.
We precisely fix this and provide a practical solution.
Other consequence: an algorithmic improvement for binary maximum inner product search (MIPS).

Outline
- Motivation
- Asymmetric LSH for General Inner Products
- Asymmetric Minwise Hashing
- Faster Sampling
- Experimental Results

Shingle-Based Representation
Shingle-based representation (bag-of-words) is widely adopted: a document is represented as a set of tokens over a vocabulary Ω.
Example sentence: "Five Kitchen Berkeley".
Shingle representation (uni-grams): {five, kitchen, berkeley}
Shingle representation (bi-grams): {five kitchen, kitchen berkeley}
Sparse binary high-dimensional data is everywhere:
- Sets can be represented as binary vectors indicating presence/absence.
- The vocabulary is typically huge in practice.
- Modern big-data systems use only the binary data matrix.

Resemblance (Jaccard) Similarity
The popular resemblance (Jaccard) similarity between two sets (or binary vectors) X, Y ⊆ Ω is defined as
R = |X ∩ Y| / |X ∪ Y| = a / (f_x + f_y − a),
where a = |X ∩ Y|, f_x = |X|, f_y = |Y|, and |·| denotes cardinality.
For the binary (0/1) vector representation of a set (locations of nonzeros):
a = |X ∩ Y| = x^T y;  f_x = nonzeros(x);  f_y = nonzeros(y),
where x and y are the binary vector equivalents of the sets X and Y respectively.
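To make the binary-vector view concrete, here is a minimal sketch in Python (our own illustrative code, not from the talk; names and the example vectors are assumptions) that computes a, f_x, f_y and the resemblance R from 0/1 vectors.

```python
import numpy as np

def resemblance(x, y):
    """Resemblance R = a / (f_x + f_y - a) for binary (0/1) vectors x and y."""
    x, y = np.asarray(x), np.asarray(y)
    a = int(x @ y)                          # overlap a = |X ∩ Y| = x^T y
    f_x, f_y = int(x.sum()), int(y.sum())   # nonzeros(x), nonzeros(y)
    return a / (f_x + f_y - a)

# Example: X = {1, 3, 4}, Y = {3, 4, 6} over Omega = {0, ..., 7}
x = np.array([0, 1, 0, 1, 1, 0, 0, 0])
y = np.array([0, 0, 0, 1, 1, 0, 1, 0])
print(resemblance(x, y))   # a = 2, f_x = f_y = 3  ->  2 / 4 = 0.5
```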

Minwise Hashing (Broder '97)
The standard practice in the search industry. Given a random permutation π (or a random hash function) over Ω, i.e., π : Ω → Ω with Ω = {0, 1, ..., D − 1}, the MinHash is
h_π(x) = min(π(x)).
An elementary probability argument shows that
Pr(min(π(X)) = min(π(Y))) = |X ∩ Y| / |X ∪ Y| = R.

Traditional Minwise Hashing Computation²
1. Uniformly sample a permutation over the attributes, π : [1, D] → [1, D].
2. Shuffle the vectors under π.
3. The hash value is the smallest index whose entry is not zero.
[Worked example on the slide: binary vectors S_1, S_2, S_3 over indices 1..15, their permuted versions π(S_1), π(S_2), π(S_3), and the resulting MinHash values, e.g. h_π(S_1) = 2.]
For any two binary vectors S_1, S_2 we always have
Pr(h_π(S_1) = h_π(S_2)) = |S_1 ∩ S_2| / |S_1 ∪ S_2| = R (the Jaccard similarity).
² This step is very inefficient; we recently found faster ways (ICML 2014 and UAI 2014).
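A small sketch of the permutation-based computation above (again our own illustrative code, not the authors' implementation; the example sets are assumptions): it draws random permutations, takes the minimum permuted index of each set, and checks empirically that the collision fraction approaches the resemblance R.

```python
import random

def minhash(s, perm):
    """MinHash of a set of indices s under the permutation perm (index -> permuted position)."""
    return min(perm[i] for i in s)

def collision_rate(s1, s2, D, trials=2000, seed=0):
    """Fraction of random permutations for which the two MinHashes collide."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        perm = list(range(D))
        rng.shuffle(perm)                   # uniformly random permutation of {0, ..., D-1}
        hits += minhash(s1, perm) == minhash(s2, perm)
    return hits / trials

S1, S2, D = {1, 3, 4, 7}, {3, 4, 6, 7}, 15
R = len(S1 & S2) / len(S1 | S2)             # 3 / 5 = 0.6
print(R, collision_rate(S1, S2, D))         # the estimate should be close to 0.6
```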

Locality Sensitive Hashing (LSH) and Sub-linear Search
Locality sensitive: a family of (randomized) hash functions h such that
Pr_h[h(x) = h(y)] = f(sim(x, y)),
where f is monotonically increasing.³
MinHash is an LSH for resemblance (Jaccard similarity).
Well known: the existence of an LSH for a similarity implies fast search algorithms with query time O(n^ρ), ρ < 1 (Indyk & Motwani '98).
The quantity ρ is a property that depends on f; smaller is better.
³ A stronger sufficient condition than the classical one.

Known Complaints with Resemblance
R = |X ∩ Y| / |X ∪ Y| = a / (f_x + f_y − a)
Consider the text descriptions of two restaurants:
1. "Five Guys Burgers and Fries Downtown Brooklyn New York" → {five, guys, burgers, and, fries, downtown, brooklyn, new, york}
2. "Five Kitchen Berkeley" → {five, kitchen, berkeley}
Search query (Q): "Five Guys" → {five, guys}
Resemblance with the two descriptions:
1. |X ∩ Q| = 2, |X ∪ Q| = 9, R = 2/9 ≈ 0.22
2. |X ∩ Q| = 1, |X ∪ Q| = 4, R = 1/4 = 0.25
Resemblance penalizes the size of the document.

Alternatives: Set Containment and Inner Product
For many applications (e.g. record matching, plagiarism detection), Jaccard containment is better suited than resemblance.
Jaccard containment of X with respect to Q:
J_C = |X ∩ Q| / |Q| = a / f_q.    (1)
Some observations:
1. It does not penalize the size of the text.
2. Its ordering is the same as the ordering of the inner products a (overlap).
3. It gives the desirable ordering in the previous example.
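A quick check of these observations on the restaurant example from the previous slide (illustrative code only): containment J_C ranks the long description above the short one, matching the overlap a, while resemblance does the opposite.

```python
X1 = {"five", "guys", "burgers", "and", "fries", "downtown", "brooklyn", "new", "york"}
X2 = {"five", "kitchen", "berkeley"}
Q  = {"five", "guys"}

def resemblance(X, Q):
    return len(X & Q) / len(X | Q)           # a / (f_x + f_q - a)

def containment(X, Q):
    return len(X & Q) / len(Q)               # J_C = a / f_q

print(resemblance(X1, Q), resemblance(X2, Q))   # 2/9 ≈ 0.22 < 1/4 = 0.25 : wrong order
print(containment(X1, Q), containment(X2, Q))   # 2/2 = 1.0  > 1/2 = 0.5  : desired order
```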

LSH Framework Not Sufficient for Inner Products
Locality sensitive requirement:
Pr_h[h(x) = h(y)] = f(x^T y),
where f is monotonically increasing.
Theorem (Shrivastava and Li, NIPS 2014): impossible for dot products.
For general inner products we can have x and y such that x^T y > x^T x, i.e. self-similarity is not the highest similarity. Under any hash function Pr(h(x) = h(x)) = 1, but we would need Pr(h(x) = h(y)) > Pr(h(x) = h(x)) = 1.
For binary inner products: still impossible. Here x^T y ≤ x^T x is always true; instead we need x, y, z such that x^T y > z^T z.
Hopeless to find a locality sensitive hashing scheme!

Asymmetric LSH (ALSH) for General Inner Products
Shrivastava and Li (NIPS 2014): despite there being no LSH, maximum inner product search (MIPS) is still efficient via an extended framework.
Asymmetric LSH framework, the idea:
1. Construct two transformations P(·) and Q(·) (with P ≠ Q) along with a randomized hash function h.
2. P(·), Q(·) and h satisfy Pr_h[h(P(x)) = h(Q(q))] = f(x^T q), where f is monotonic.

Small Things That Made a BIG Difference
Shrivastava and Li NIPS 2014 construction (L2-ALSH):
1. P(x): scale the data so that norms shrink below 0.83; append ||x||², ||x||⁴, and ||x||⁸ to the vector x (just 3 scalars).
2. Q(q): normalize; append three 0.5's to the vector q.
3. h: use the standard LSH family for L2 distance.
Caution: the scaling is asymmetry in a strict sense; it changes the distribution (e.g. the variance) of the hashes.
The first practical and provable algorithm for general MIPS.
[Figure: precision-recall curves on Movielens and NetFlix (Top-10, K = 256) comparing the construction against L2LSH and SRP.]
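A rough sketch of this L2-ALSH construction (our own code, with assumed parameters: m = 3 appended scalars, scaling bound U = 0.83, bucket width r = 2.5; it follows the recipe above only loosely and is not the authors' implementation).

```python
import numpy as np

def alsh_preprocess(x, max_norm, U=0.83, m=3):
    """P(x): scale so every norm is at most U < 1, then append ||x||^2, ||x||^4, ||x||^8."""
    x = np.asarray(x, float) * (U / max_norm)
    sq = float(x @ x)                                        # ||x||^2 after scaling
    return np.concatenate([x, [sq ** (2 ** i) for i in range(m)]])

def alsh_query(q, m=3):
    """Q(q): normalize to unit norm, then append m copies of 0.5."""
    q = np.asarray(q, float)
    return np.concatenate([q / np.linalg.norm(q), [0.5] * m])

def l2_lsh(v, a, b, r=2.5):
    """Standard LSH for L2 distance: h(v) = floor((a . v + b) / r)."""
    return int(np.floor((a @ v + b) / r))

rng = np.random.default_rng(0)
data = rng.random((100, 20))
max_norm = np.linalg.norm(data, axis=1).max()
a, b = rng.normal(size=20 + 3), rng.uniform(0, 2.5)
codes = [l2_lsh(alsh_preprocess(x, max_norm), a, b) for x in data]   # preprocessing side
q_code = l2_lsh(alsh_query(rng.random(20)), a, b)                    # query side (asymmetric)
```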

A Generic Recipe: Even Better ALSH for MIPS
The recipe: start with a similarity S(q, x) for which we have an LSH (or ALSH); design P(·) and Q(·) such that S(Q(q), P(x)) is monotonic in q^T x; use extra dimensions.
Improved ALSH (Sign-ALSH) construction for general MIPS: S(q, x) = q^T x / (||q||_2 ||x||_2) together with SimHash.⁴ (Shrivastava and Li UAI 2015)
[Table: hashing computations required on MNIST, WEBSPAM, and RCV1 by Sign-ALSH, L2-ALSH, and cone trees; Sign-ALSH requires the fewest on each dataset.]
⁴ Expected after Shrivastava and Li, ICML 2014, Codings for Random Projections.

Binary MIPS: A Sampling-Based ALSH
Idea: sample an index i uniformly; if x_i = 1 and q_i = 1, force a hash collision, otherwise not:
H_S(f(x)) = 1 if x_i = 1 (i drawn uniformly); 0 if x_i = 0 and f = Q (for the query); 2 if x_i = 0 and f = P (while preprocessing).
Then Pr(H_S(P(x)) = H_S(Q(y))) = a / D, which is monotonic in the inner product a.
Problems:
1. The hash is only informative if x_i = 1; otherwise it merely indicates whether the vector is a query or not.
2. For sparse data with D ≫ f ≥ a, a/D ≈ 0, so almost all hashes are un-informative.

A Closer Look at MinHash
Collision probability: Pr(h_π(x) = h_π(q)) = a / (f_x + f_q − a) ≫ a / D.
Useful: a / (f_x + f_q − a) is very sensitive w.r.t. a compared to a / D (since D ≫ f). This is the core reason why MinHash is better than random sampling.
Problem: a / (f_x + f_q − a) is not monotonic in a (the inner product), so MinHash is not an LSH for binary inner product (though it is a good heuristic!).
Why are we biased in favor of MinHash? SL, In Defense of MinHash over SimHash, AISTATS 2014: for binary data MinHash is provably superior to SimHash. So there is already some hope of beating the state-of-the-art Sign-ALSH for binary data.

The Fix: Asymmetric Minwise Hashing
Let M be the maximum sparsity of the data vectors: M = max_{x ∈ C} |x|.
Define P : [0, 1]^D → [0, 1]^{D+M} and Q : [0, 1]^D → [0, 1]^{D+M} as
P(x) = [x; 1; 1; ...; 1; 0; 0; ...; 0]  (M − f_x ones followed by f_x zeros),
Q(q) = [q; 0; 0; ...; 0]  (M zeros).
After the transformation:
R = |P(x) ∩ Q(q)| / |P(x) ∪ Q(q)| = a / (M + f_q − a), which is monotonic in the inner product a.
Also, M + f_q − a ≪ D (M is of the order of the sparsity; handle outliers separately).
Note: to get rid of f_q, change P(·) to P(Q(·)) and Q(·) to Q(P(·)).
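A minimal sketch of the padding transformations P and Q in set form (our own illustrative code; the index layout and example sets are assumptions): after padding, the plain Jaccard similarity of P(x) and Q(q) equals a / (M + f_q − a), which is monotonic in a.

```python
def P(x, M, D):
    """Preprocessing: append M - f_x new 'one' indices; the remaining f_x padding slots stay zero."""
    return set(x) | {D + j for j in range(M - len(x))}

def Q(q, M, D):
    """Query: append only zeros, i.e. no new indices."""
    return set(q)

def jaccard(A, B):
    return len(A & B) / len(A | B)

D, M = 20, 6
x = {0, 2, 5, 9}                 # f_x = 4
q = {2, 5, 7}                    # f_q = 3, overlap a = 2
a, f_q = len(x & q), len(q)
print(jaccard(P(x, M, D), Q(q, M, D)))   # |P ∩ Q| / |P ∪ Q| = a / (M + f_q - a) = 2/7
print(a / (M + f_q - a))                 # same value
```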

Asymmetric Minwise Hashing: Alternative View
P'(x) = [x; M − f_x; 0],   Q'(q) = [q; 0; M − f_q].
The weighted Jaccard similarity between P'(x) and Q'(q) is
R_W = Σ_i min(P'(x)_i, Q'(q)_i) / Σ_i max(P'(x)_i, Q'(q)_i) = a / (2M − a).
Fast consistent weighted sampling (CWS) gives the asymmetric MinHash in O(f_x) time instead of O(2M), where M ≫ f_x.
Alternative view: the M − f_x weight favors larger documents in proportion to f_x, which cancels the inherent bias of MinHash towards smaller sets. A novel bias correction, which works well in practice.
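A short numeric check of this alternative view (illustrative code, not the authors' CWS implementation; the example vectors are assumptions): the weighted Jaccard of P'(x) = [x; M − f_x; 0] and Q'(q) = [q; 0; M − f_q] indeed equals a / (2M − a).

```python
import numpy as np

def weighted_jaccard(u, v):
    return np.minimum(u, v).sum() / np.maximum(u, v).sum()

D, M = 20, 6
x = np.zeros(D); x[[0, 2, 5, 9]] = 1           # f_x = 4
q = np.zeros(D); q[[2, 5, 7]] = 1              # f_q = 3, overlap a = 2
Px = np.concatenate([x, [M - x.sum(), 0.0]])   # P'(x) = [x; M - f_x; 0]
Qq = np.concatenate([q, [0.0, M - q.sum()]])   # Q'(q) = [q; 0; M - f_q]
a = float(x @ q)
print(weighted_jaccard(Px, Qq), a / (2 * M - a))   # both equal 2/10 = 0.2
```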

Theoretical Comparisons
The collision probability is monotonic in the inner product ⟹ asymmetric minwise hashing is an ALSH for binary MIPS.
ρ_{MH-ALSH} = log( (S/M) / (2 − S/M) ) / log( (cS/M) / (2 − cS/M) );
ρ_{Sign} = log( 1 − (1/π) cos⁻¹(S/M) ) / log( 1 − (1/π) cos⁻¹(cS/M) ).
[Figure: theoretical ρ curves, ρ_{MH-ALSH} vs. ρ_{Sign}, as functions of the approximation factor c for several values of S/M; smaller ρ is better.]
Asymmetric minwise hashing is significantly better than Sign-ALSH (SL UAI 2015), as expected after SL AISTATS 2014.

Complaints with MinHash: Costly Sampling
[Worked example repeated from before: S_1, S_2, S_3, their permuted versions, and the resulting MinHash values.]
- The entire vector must be processed to compute one MinHash: O(d).
- Search time is dominated by hashing the query: O(KLd).
- Training and testing time are dominated by the hashing time: O(kd).
- Parallelization is possible but not energy efficient (LSK WWW 2012).
- Storing only the minimum seems quite wasteful.

Solution: One Pass for All Hashes (One Permutation Hashing)
1. Sketching: bin the permuted vector and compute the minimum non-zero index in each bin.
[Figure: π(S_1) and π(S_2) split into Bins 0-5; OPH(S_1) and OPH(S_2) record the minimum non-zero position per bin, with E marking empty bins.]
2. Fill empty bins: borrow from the right (circularly) with a shift.
[Figure: after densification, each previously empty bin of H(S_1) and H(S_2) holds the borrowed value plus an offset, e.g. 1+C, 2+C, 1+2C.]
Pr(H_j(S_1) = H_j(S_2)) = R for j ∈ {0, 1, 2, ..., k − 1}.
O(d + k) instead of the traditional O(dk)!
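A compact sketch of the two steps above (our own illustrative code following the densification-by-rotation idea; the bin layout and the offset C are assumptions, not the authors' exact scheme): one random permutation, k bins, the minimum non-zero position per bin, and empty bins filled circularly from the right with a shift.

```python
import random

def oph_densified(s, D, k, perm, C=None):
    """One-permutation hashes of the set s: k bins over a single permutation of {0, ..., D-1};
    empty bins are filled circularly from the right, adding a shift C per borrowed bin."""
    size = D // k                            # assume k divides D for simplicity
    if C is None:
        C = size + 1                         # offset larger than any within-bin value
    bins = [None] * k
    for i in s:
        p = perm[i]                          # position after the permutation
        b, offset = p // size, p % size
        if bins[b] is None or offset < bins[b]:
            bins[b] = offset                 # minimum non-zero position within this bin
    dense = []
    for j in range(k):                       # densification: borrow from the right
        t, shift = j, 0
        while bins[t % k] is None:
            t, shift = t + 1, shift + C
        dense.append(bins[t % k] + shift)
    return dense

D, k = 24, 6
perm = list(range(D)); random.Random(0).shuffle(perm)
print(oph_densified({1, 3, 4, 7, 13}, D, k, perm))   # k hash values from one permutation
```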

Speedup
[Figure: on WEBSPAM, the ratio of the old to the new hashing time versus the number of hash evaluations. The ratio of old and new hashing time indicates a linear-time speedup.]

Datasets and Baselines
[Table: datasets EP2006, MNIST, NEWS20, and NYTIMES, listing for each the number of query points, the number of training points, the dimensionality, and the mean ± std of nonzeros per vector.]
Competing schemes:
1. Asymmetric minwise hashing
2. Traditional minwise hashing (MinHash)
3. L2-based asymmetric LSH for inner products (L2-ALSH)
4. SimHash-based asymmetric LSH for inner products (Sign-ALSH)

Actual Savings in Retrieval
[Figure: fraction of the data scanned relative to a linear scan versus recall, on EP2006, MNIST, NEWS20, and NYTimes, at several Top-k thresholds, comparing asymmetric minwise hashing with MinHash, L2-ALSH, and Sign-ALSH.]

Ranking Verification
[Figure: precision-recall curves on EP2006, MNIST, NEWS20, and NYTimes using 32, 64, and 128 hashes, comparing asymmetric minwise hashing with MinHash, L2-ALSH, and Sign-ALSH.]

Conclusions
- Minwise hashing has an inherent bias towards smaller sets.
- Using the recent line of work on asymmetric LSH, we can fix this existing bias with asymmetric transformations.
- Asymmetric minwise hashing leads to an algorithmic improvement over state-of-the-art hashing schemes for binary MIPS.
- We can obtain huge speedups using the recent line of work on one permutation hashing.
- The final algorithm performs very well in practice compared to popular schemes.

References
Asymmetric LSH framework and improvements:
- Shrivastava & Li. Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). NIPS 2014.
- Shrivastava & Li. Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS). UAI 2015.
Efficient replacements for minwise hashing:
- Li et al. One Permutation Hashing. NIPS 2012.
- Shrivastava & Li. Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search. ICML 2014.
- Shrivastava & Li. Improved Densification of One Permutation Hashing. UAI 2014.
Minwise hashing is superior to SimHash for binary data:
- Shrivastava & Li. In Defense of MinHash over SimHash. AISTATS 2014.