Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment
1 Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment
Anshumali Shrivastava and Ping Li, Cornell University and Rutgers University
WWW 2015, Florence, Italy, May 21st 2015
Will join Rice Univ. as TT Asst. Prof., Fall 2015
Anshumali Shrivastava and Ping Li, WWW 2015, May 21st 2015, 1 / 27
2 What are we solving?
Minwise hashing is widely popular for search and retrieval. Major complaint: document length is unnecessarily penalized. We precisely fix this and provide a practical solution. Other consequence: an algorithmic improvement for binary maximum inner product search (MIPS).
3 Outline
- Motivation
- Asymmetric LSH for General Inner Products
- Asymmetric Minwise Hashing
- Faster Sampling
- Experimental Results
4 Shingle Based Representation
Shingle based representation (Bag-of-Words) is widely adopted: a document is represented as a set of tokens over a vocabulary Ω.
Example sentence: "Five Kitchen Berkley". Shingle representation (uni-grams): {five, kitchen, berkley}. Shingle representation (bi-grams): {five kitchen, kitchen berkley}.
Sparse binary high dimensional data is everywhere: sets can be represented as binary vectors indicating presence/absence. The vocabulary is typically huge in practice, and modern big data systems use only a binary data matrix.
6 Resemblance (Jaccard) Similarity
The popular resemblance (Jaccard) similarity between two sets (or binary vectors) X, Y ⊆ Ω is defined as
R = |X ∩ Y| / |X ∪ Y| = a / (f_x + f_y − a),
where a = |X ∩ Y|, f_x = |X|, f_y = |Y|, and |·| denotes cardinality. In the binary (0/1) vector representation of a set (locations of nonzeros):
a = |X ∩ Y| = x^T y; f_x = nonzeros(x); f_y = nonzeros(y),
where x and y are the binary vector equivalents of the sets X and Y respectively.
7 Minwise Hashing (Broder 97)
The standard practice in the search industry. Given a random permutation π (or a random hash function) over Ω, i.e., π : Ω → Ω with Ω = {0, 1, ..., D − 1}, the MinHash is given by
h_π(x) = min(π(x)).
An elementary probability argument shows that
Pr(min(π(X)) = min(π(Y))) = |X ∩ Y| / |X ∪ Y| = R.
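The MinHash estimator can be sketched in a few lines; here is a minimal simulation (the universe size D, the sets X and Y, and the trial count are illustrative choices, not from the slides):

```python
import random

def minhash(s, perm):
    """MinHash of set s under a permutation perm (dict mapping element -> rank)."""
    return min(perm[e] for e in s)

def jaccard(x, y):
    return len(x & y) / len(x | y)

# Small illustrative universe Omega = {0, ..., 9} and two example sets.
D = 10
X = {0, 1, 2, 3}
Y = {1, 2, 3, 4, 5}   # |X ∩ Y| = 3, |X ∪ Y| = 6, so R = 0.5

rng = random.Random(42)
trials = 20000
collisions = 0
for _ in range(trials):
    p = list(range(D))
    rng.shuffle(p)                 # a fresh uniform permutation
    perm = dict(enumerate(p))
    collisions += minhash(X, perm) == minhash(Y, perm)

# The empirical collision rate should be close to the Jaccard similarity.
print(collisions / trials, jaccard(X, Y))
```

The collision probability equals R exactly, so averaging many independent MinHashes gives an unbiased estimate of the Jaccard similarity.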
8 Traditional Minwise Hashing Computation²
1. Uniformly sample a permutation over the attributes, π : [0, D − 1] → [0, D − 1].
2. Shuffle the vectors under π.
3. The hash value is the smallest index that is not zero.
(Slide example: binary vectors S1, S2, S3 and their shuffled versions π(S1), π(S2), π(S3), giving h_π(S1) = 2, h_π(S2) = 0, h_π(S3) = 0.)
For any two binary vectors S1, S2 we always have
Pr(h_π(S1) = h_π(S2)) = |S1 ∩ S2| / |S1 ∪ S2| = R (the Jaccard similarity).
² This is very inefficient; we recently found faster ways (ICML 2014 and UAI 2014).
12 Locality Sensitive Hashing (LSH) and Sub-linear Search
Locality sensitive: a (randomized) family of hash functions h s.t. Pr_h[h(x) = h(y)] = f(sim(x, y)), where f is monotonically increasing.³
MinHash is an LSH for resemblance (Jaccard similarity).
Well known: the existence of an LSH for a similarity implies fast search algorithms with query time O(n^ρ), ρ < 1 (Indyk & Motwani 98). The quantity ρ is a property dependent on f; smaller is better.
³ A stronger sufficient condition than the classical one.
14 Known Complaints with Resemblance
R = |X ∩ Y| / |X ∪ Y| = a / (f_x + f_y − a).
Consider text descriptions of two restaurants:
1. Five Guys Burgers and Fries Downtown Brooklyn New York → {five, guys, burgers, and, fries, downtown, brooklyn, new, york}
2. Five Kitchen Berkley → {five, kitchen, berkley}
Search query (Q): Five Guys → {five, guys}
Resemblance with the descriptions:
1. |X ∩ Q| = 2, |X ∪ Q| = 9, R = 2/9 ≈ 0.22
2. |X ∩ Q| = 1, |X ∪ Q| = 4, R = 1/4 = 0.25
Resemblance penalizes the size of the document.
16 Alternatives: Set Containment and Inner Product
For many applications (e.g. record matching, plagiarism detection, etc.) Jaccard containment is more suited than resemblance. The Jaccard containment w.r.t. Q between X and Q is
J_C = |X ∩ Q| / |Q| = a / f_q.    (1)
Some observations:
1. It does not penalize the size of the text.
2. For a fixed query, its ordering is the same as the ordering of the inner products a (the overlap).
3. It gives the desirable ordering in the previous example.
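The restaurant example above can be checked directly; this snippet computes both measures on those shingle sets:

```python
def resemblance(x, q):
    """Jaccard similarity |X ∩ Q| / |X ∪ Q|."""
    return len(x & q) / len(x | q)

def containment(x, q):
    """Jaccard containment w.r.t. the query: |X ∩ Q| / |Q|."""
    return len(x & q) / len(q)

doc1 = {"five", "guys", "burgers", "and", "fries",
        "downtown", "brooklyn", "new", "york"}
doc2 = {"five", "kitchen", "berkley"}
query = {"five", "guys"}

# Resemblance ranks the short, less relevant document higher (2/9 < 1/4) ...
print(resemblance(doc1, query), resemblance(doc2, query))
# ... while containment ranks the relevant document first (1.0 > 0.5).
print(containment(doc1, query), containment(doc2, query))
```

Containment orders candidates exactly as the overlap a does for a fixed query, which is why it does not penalize longer documents.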
17 LSH Framework Not Sufficient for Inner Products
Locality sensitive requirement: Pr_h[h(x) = h(y)] = f(x^T y), where f is monotonically increasing.
Theorem (Shrivastava and Li NIPS 2014): impossible for dot products. For inner products we can have x and y s.t. x^T y > x^T x, i.e., self similarity is not the highest similarity. Under any hash function Pr(h(x) = h(x)) = 1, but we would need Pr(h(x) = h(y)) > Pr(h(x) = h(x)) = 1.
For binary inner products: still impossible. Here x^T y ≤ x^T x is always true, but we would instead need x, y, z such that x^T y > z^T z.
Hopeless to find a locality sensitive hashing!
20 Asymmetric LSH (ALSH) for General Inner Products
Shrivastava and Li (NIPS 2014): despite there being no LSH, maximum inner product search (MIPS) is still efficient via an extended framework.
Asymmetric LSH framework, the idea:
1. Construct two transformations P(·) and Q(·) (P ≠ Q) along with a randomized hash function h.
2. P(·), Q(·) and h satisfy Pr_h[h(P(x)) = h(Q(q))] = f(x^T q), with f monotonic.
22 Small Things that Made a BIG Difference
Shrivastava and Li NIPS 2014 construction (L2-ALSH):
1. P(x): scale the data to shrink all norms below 0.83; append ||x||², ||x||⁴, and ||x||⁸ to the vector x (just 3 scalars).
2. Q(q): normalize; append three 0.5's to the vector q.
3. h: use the standard LSH family for L2 distance.
Caution: scaling is asymmetry in the strict sense; it changes the distribution (e.g. the variance) of the hashes.
The first practical and provable algorithm for general MIPS.
(Figure: precision-recall curves on Movielens and Netflix, K = 256, comparing the proposal with L2LSH and SRP.)
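A minimal sketch of this L2-ALSH transform pair (U = 0.83 and the three appended terms follow the slide; the `max_norm` normalizer and the example vectors are assumptions for illustration):

```python
import math

def l2_alsh_P(x, U=0.83, max_norm=1.0):
    """P(x): scale so every norm is below U, then append ||x||^2, ||x||^4, ||x||^8."""
    x = [xi * (U / max_norm) for xi in x]
    n = sum(xi * xi for xi in x)          # squared norm after scaling
    return x + [n, n ** 2, n ** 4]

def l2_alsh_Q(q, m=3):
    """Q(q): normalize to unit norm, then append m values of 0.5."""
    norm = math.sqrt(sum(qi * qi for qi in q))
    return [qi / norm for qi in q] + [0.5] * m

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# With these transforms, ||P(x) - Q(q)||^2 collapses to a constant minus
# 2 q̂ᵀx' plus a vanishing norm term, so for a fixed query a smaller L2
# distance corresponds to a larger inner product; standard L2 LSH then applies.
q = [1.0, 0.0]
x_hi = [0.9, 0.1]   # large inner product with q
x_lo = [0.2, 0.8]   # small inner product with q
Pq = l2_alsh_Q(q)
print(sq_dist(l2_alsh_P(x_hi), Pq) < sq_dist(l2_alsh_P(x_lo), Pq))  # True
```

Repeated squaring drives the appended norm term toward zero because all scaled norms are below 0.83 < 1, which is what makes the reduction to L2 near-neighbor search work.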
23 A Generic Recipe: Even Better ALSH for MIPS
The recipe: start with a similarity S(q, x) for which we have an LSH (or ALSH); design P(·) and Q(·) such that S(Q(q), P(x)) is monotonic in q^T x; use extra dimensions.
Improved ALSH (Sign-ALSH) construction for general MIPS: S(q, x) = q^T x / (||q||₂ ||x||₂) and SimHash.⁴ (Shrivastava and Li UAI 2015)
(Table: Sign-ALSH vs L2-ALSH vs cone trees on MNIST, WEBSPAM, and RCV1.)
⁴ Expected after Shrivastava and Li ICML 2014, Codings for Random Projections.
24 Binary MIPS: A Sampling Based ALSH
Idea: sample an index i uniformly; if x_i = 1 and q_i = 1, make the hashes collide, otherwise not. With i drawn uniformly:
H_S(f(x)) = 1 if x_i = 1; otherwise 0 if f = Q (for the query) and 2 if f = P (while preprocessing).
Then Pr(H_S(P(x)) = H_S(Q(y))) = a / D, which is monotonic in the inner product a.
Problems:
1. The hash is only informative if x_i = 1; otherwise it just indicates query-or-not.
2. For sparse data, with D ≫ f ≫ a, a/D ≈ 0 and almost all hashes are uninformative.
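The sampling scheme above can be simulated as follows (the 0/1/2 output convention mirrors the case definition; D, the example vectors, and the trial count are illustrative assumptions):

```python
import random

def sampling_hash(vec, i, is_query):
    """Collide only when both vectors have a 1 at the sampled index i."""
    if vec[i] == 1:
        return 1
    return 0 if is_query else 2   # asymmetric sentinel values prevent 0-0 collisions

D = 8
x = [1, 0, 1, 1, 0, 0, 0, 0]     # data vector, preprocessed with P
q = [1, 1, 0, 1, 0, 0, 0, 0]     # query vector, transformed with Q
a = sum(xi & qi for xi, qi in zip(x, q))   # binary inner product = 2

rng = random.Random(0)
trials = 50000
hits = sum(
    sampling_hash(x, i, False) == sampling_hash(q, i, True)
    for i in (rng.randrange(D) for _ in range(trials))
)
# Collision probability is a / D = 0.25: monotone in a, but tiny when D >> a.
print(hits / trials)
```

The empirical rate matches a/D, illustrating both why this is a valid ALSH and why it is nearly useless for very sparse, high dimensional data.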
25 A Closer Look at MinHash
Collision probability: Pr(h_π(x) = h_π(q)) = a / (f_x + f_q − a).
Useful: a / (f_x + f_q − a) is very sensitive w.r.t. a compared to a/D (since D ≫ f). This is the core reason why MinHash is better than random sampling.
Problem: a / (f_x + f_q − a) is not monotonic in a (the inner product), so MinHash is not an LSH for the binary inner product (though it is a good heuristic!).
Why are we biased in favor of MinHash? SL "In Defense of MinHash over SimHash" AISTATS 2014 ⇒ for binary data MinHash is provably superior to SimHash. So there is already some hope of beating the state-of-the-art Sign-ALSH for binary data.
26 The Fix: Asymmetric Minwise Hashing
Let M be the maximum sparsity of the data vectors: M = max_{x ∈ C} |x|. Define P : [0, 1]^D → [0, 1]^{D+M} and Q : [0, 1]^D → [0, 1]^{D+M} as:
P(x) = [x; 1; 1; 1; ...; 1; 0; 0; 0; ...; 0]  (M − f_x 1's followed by f_x zeros)
Q(q) = [q; 0; 0; 0; ...; 0]  (M zeros)
After the transformation:
R = |P(x) ∩ Q(q)| / |P(x) ∪ Q(q)| = a / (M + f_q − a),  monotonic in the inner product a.
Also, M + f_q − a ≪ D (M is of the order of the sparsity; handle outliers separately).
Note: to get rid of f_q, change P(·) to P(Q(·)) and Q(·) to Q(P(·)).
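A small end-to-end check of this padding trick in set form (the `("pad", j)` dummy elements, the example sets, and the trial count are illustrative assumptions):

```python
import random

def pad_data(x, M):
    """P(x): add (M - f_x) shared dummy 1-coordinates; appended zeros change nothing."""
    return set(x) | {("pad", j) for j in range(M - len(x))}

def pad_query(q, M):
    """Q(q): the M appended coordinates are all zero, so the set is unchanged."""
    return set(q)

def collision_rate(A, B, trials=20000, seed=7):
    """Empirical MinHash collision rate via random rankings of the universe."""
    rng = random.Random(seed)
    elems = sorted(A | B, key=repr)
    hits = 0
    for _ in range(trials):
        rank = {e: rng.random() for e in elems}   # a random permutation in disguise
        hits += min(rank[e] for e in A) == min(rank[e] for e in B)
    return hits / trials

q = {1, 2}
x_small = {1, 3}                      # inner product a = 1
x_large = {1, 2, 3, 4, 5, 6, 7, 8}    # inner product a = 2
M = 8

# Plain MinHash prefers the small set (R = 1/3 > 2/8) despite its smaller overlap...
print(collision_rate(x_small, q), collision_rate(x_large, q))
# ...asymmetric padding reverses the order: a/(M + f_q - a) gives 1/9 vs 2/8.
print(collision_rate(pad_data(x_small, M), pad_query(q, M)),
      collision_rate(pad_data(x_large, M), pad_query(q, M)))
```

Because every data vector is padded up to exactly M ones while the query is untouched, collision probability becomes a/(M + f_q − a), which orders candidates by the inner product a.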
28 Asymmetric Minwise Hashing: Alternative View
P'(x) = [x; M − f_x; 0]
Q'(x) = [x; 0; M − f_x]
The weighted Jaccard between P'(x) and Q'(q) is
R_W = Σ_i min(P'(x)_i, Q'(q)_i) / Σ_i max(P'(x)_i, Q'(q)_i) = a / (2M − a).
Use fast consistent weighted sampling (CWS) to get the asymmetric MinHash in O(f_x) time instead of O(2M), where M ≫ f_x.
Alternative view: the M − f_x term favors larger documents in proportion to f_x, which cancels the inherent bias of MinHash towards smaller sets. A novel bias correction, which works well in practice.
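The weighted-Jaccard identity above is easy to verify numerically (D = 6, M = 4, and the two binary vectors are illustrative choices):

```python
def weighted_jaccard(u, v):
    """R_W = sum_i min(u_i, v_i) / sum_i max(u_i, v_i)."""
    num = sum(min(a, b) for a, b in zip(u, v))
    den = sum(max(a, b) for a, b in zip(u, v))
    return num / den

# Binary vectors over D = 6 dimensions, M = maximum sparsity of the data.
x = [1, 1, 1, 0, 0, 0]           # f_x = 3
q = [1, 1, 0, 1, 0, 0]           # f_q = 3, inner product a = 2
M = 4

P = x + [M - sum(x), 0]          # P'(x) = [x; M - f_x; 0]
Q = q + [0, M - sum(q)]          # Q'(q) = [q; 0; M - f_q]

# R_W = a / (2M - a) = 2 / 6
print(weighted_jaccard(P, Q))
```

The two appended coordinates never overlap between P' and Q', so the numerator stays a while the denominator becomes (f_x + f_q − a) + (2M − f_x − f_q) = 2M − a.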
30 Theoretical Comparisons
The collision probability is monotonic in the inner product ⇒ asymmetric minwise hashing is an ALSH for binary MIPS.
ρ_MH-ALSH = log((S/M) / (2 − S/M)) / log((cS/M) / (2 − cS/M));  ρ_Sign = log(1 − (1/π) cos⁻¹(S/M)) / log(1 − (1/π) cos⁻¹(cS/M))
(Figure: theoretical ρ curves, ρ_MH-ALSH vs ρ_Sign, as functions of the approximation ratio c.)
Asymmetric minwise hashing is significantly better than Sign-ALSH (SL UAI 2015), as expected after SL AISTATS 2014.
32 Complaints with MinHash: Costly Sampling
(Slide example: vectors S1, S2, S3 and their permutations, with h_π(S1) = 2, h_π(S2) = 0, h_π(S3) = 0.)
- Processing the entire vector to compute one minhash is O(d).
- Search time is dominated by hashing the query: O(KLd).
- Training and testing time are dominated by the hashing time: O(kd).
- Parallelization is possible but not energy efficient (LSK WWW 2012).
- Storing only the minimum seems quite wasteful.
36 Solution: One Pass for All Hashes
1. Sketching: bin the coordinates and compute the minimum non-zero index in each bin.
(Slide example: π(S1) and π(S2) are split into Bin 0 through Bin 5; bins that receive no non-zero are marked with the empty symbol E, giving sketches OPH(S1) and OPH(S2) with a few empty bins each.)
2. Fill empty bins: borrow from the right (circular) with a shift C, so an empty bin copies the value of the nearest non-empty bin to its right, offset by C, 2C, ... per bin skipped (in the example, entries like 1+C and 1+2C).
Then Pr(H_j(S1) = H_j(S2)) = R for j = {0, 1, 2, ..., k − 1}: O(d + k) instead of the traditional O(dk)!
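The two steps above can be sketched as follows (a minimal simulation: the offset constant C, bin count k, and example sets are assumptions, and an explicit permutation stands in for the cheap universal hash used in practice):

```python
import random

def one_permutation_hashes(s, D, k, seed=0):
    """Split one random permutation of [0, D) into k equal bins; keep the
    minimum offset of s inside each bin (None marks an empty bin)."""
    rng = random.Random(seed)
    perm = list(range(D))
    rng.shuffle(perm)
    pos = {e: i for i, e in enumerate(perm)}
    bin_size = D // k
    bins = [None] * k
    for e in s:                       # a single pass over the nonzeros: O(d)
        b, off = divmod(pos[e], bin_size)
        if bins[b] is None or off < bins[b]:
            bins[b] = off
    return bins

def densify(bins, C=1000):
    """Fill each empty bin from the nearest non-empty bin to the right
    (circular), offset by t*C so borrowed values remain distinguishable."""
    k = len(bins)
    out = list(bins)
    for j in range(k):
        if out[j] is None:
            t = 1
            while bins[(j + t) % k] is None:
                t += 1
            out[j] = bins[(j + t) % k] + t * C
    return out

# k = 4 hash values from a single pass over each set.
h1 = densify(one_permutation_hashes({0, 1, 2, 3}, D=12, k=4, seed=0))
h2 = densify(one_permutation_hashes({1, 2, 3, 4, 5}, D=12, k=4, seed=0))
print(h1, h2)
```

Averaged over permutations, the per-bin agreement of the densified sketches tracks the Jaccard similarity (here R = 3/6 = 0.5), at cost O(d + k) rather than O(dk).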
45 Speedup
(Figure: ratio of old to new hashing time on WEBSPAM vs the number of hash evaluations; the ratio indicates a linear time speedup.)
46 Datasets and Baselines
(Table: dataset statistics, i.e. number of queries, number of training points, dimensionality, and mean ± std of nonzeros, for EP2006, MNIST, NEWS20, and NYTIMES.)
Competing schemes:
1. Asymmetric minwise hashing (proposed)
2. Traditional minwise hashing (MinHash)
3. L2-based asymmetric LSH for inner products (L2-ALSH)
4. SimHash-based asymmetric LSH for inner products (Sign-ALSH)
47 Actual Savings in Retrieval
(Figure: fraction of a linear scan vs recall for the proposed method, MinHash, L2-ALSH, and Sign-ALSH on EP2006, MNIST, NEWS20, and NYTimes, at several Top-k levels.)
48 Ranking Verification
(Figure: precision-recall curves for the proposed method, MinHash, L2-ALSH, and Sign-ALSH with 32, 64, and 128 hashes, on EP2006, MNIST, NEWS20, and NYTimes.)
49 Conclusions
Minwise hashing has an inherent bias towards smaller sets. Using the recent line of work on asymmetric LSH, we can fix this bias with asymmetric transformations. Asymmetric minwise hashing leads to an algorithmic improvement over the state-of-the-art hashing scheme for binary MIPS. We can obtain huge speedups using the recent line of work on one permutation hashing. The final algorithm performs very well in practice compared to popular schemes.
50 References
Asymmetric LSH framework and improvements:
- Shrivastava & Li. Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). NIPS 2014.
- Shrivastava & Li. Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS). UAI 2015.
Efficient replacements of minwise hashing:
- Li et al. One Permutation Hashing. NIPS 2012.
- Shrivastava & Li. Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search. ICML 2014.
- Shrivastava & Li. Improved Densification of One Permutation Hashing. UAI 2014.
Minwise hashing is superior to SimHash for binary data:
- Shrivastava & Li. In Defense of MinHash over SimHash. AISTATS 2014.
PROBABILISTIC HASHING TECHNIQUES FOR BIG DATA
PROBABILISTIC HASHING TECHNIQUES FOR BIG DATA A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of
More informationEfficient Data Reduction and Summarization
Ping Li Efficient Data Reduction and Summarization December 8, 2011 FODAVA Review 1 Efficient Data Reduction and Summarization Ping Li Department of Statistical Science Faculty of omputing and Information
More informationProbabilistic Hashing for Efficient Search and Learning
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 1 Probabilistic Hashing for Efficient Search and Learning Ping Li Department of Statistical Science Faculty of omputing
More informationOn Symmetric and Asymmetric LSHs for Inner Product Search
Behnam Neyshabur Nathan Srebro Toyota Technological Institute at Chicago, Chicago, IL 6637, USA BNEYSHABUR@TTIC.EDU NATI@TTIC.EDU Abstract We consider the problem of designing locality sensitive hashes
More informationarxiv: v2 [stat.ml] 13 Nov 2014
Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS) arxiv:4.4v [stat.ml] 3 Nov 4 Anshumali Shrivastava Department of Computer Science Computing and Information
More informationarxiv: v1 [stat.me] 18 Jun 2014
roved Densification of One Permutation Hashing arxiv:406.4784v [stat.me] 8 Jun 204 Anshumali Shrivastava Department of Computer Science Computing and Information Science Cornell University Ithaca, NY 4853,
More informationFast Approximate k-way Similarity Search. Anshumali Shrivastava. Ping Li
Beyond Pairwise: Provably Fast Algorithms for Approximate k-way Similarity Search NIPS 2013 1 Fast Approximate k-way Similarity Search Anshumali Shrivastava Dept. of Computer Science Cornell University
More informationCOMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from
COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from http://www.mmds.org Distance Measures For finding similar documents, we consider the Jaccard
More informationDATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining
More informationAlgorithms for Data Science: Lecture on Finding Similar Items
Algorithms for Data Science: Lecture on Finding Similar Items Barna Saha 1 Finding Similar Items Finding similar items is a fundamental data mining task. We may want to find whether two documents are similar
More informationBloom Filters and Locality-Sensitive Hashing
Randomized Algorithms, Summer 2016 Bloom Filters and Locality-Sensitive Hashing Instructor: Thomas Kesselheim and Kurt Mehlhorn 1 Notation Lecture 4 (6 pages) When e talk about the probability of an event,
More informationOne Permutation Hashing for Efficient Search and Learning
One Permutation Hashing for Efficient Search and Learning Ping Li Dept. of Statistical Science ornell University Ithaca, NY 43 pingli@cornell.edu Art Owen Dept. of Statistics Stanford University Stanford,
More informationarxiv: v1 [cs.it] 16 Aug 2017
Efficient Compression Technique for Sparse Sets Rameshwar Pratap rameshwar.pratap@gmail.com Ishan Sohony PICT, Pune ishangalbatorix@gmail.com Raghav Kulkarni Chennai Mathematical Institute kulraghav@gmail.com
More informationThe University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.
The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment 1 Caramanis/Sanghavi Due: Thursday, Feb. 7, 2013. (Problems 1 and
More informationNearest Neighbor Classification Using Bottom-k Sketches
Nearest Neighbor Classification Using Bottom- Setches Søren Dahlgaard, Christian Igel, Miel Thorup Department of Computer Science, University of Copenhagen AT&T Labs Abstract Bottom- setches are an alternative
More information4 Locality-sensitive hashing using stable distributions
4 Locality-sensitive hashing using stable distributions 4. The LSH scheme based on s-stable distributions In this chapter, we introduce and analyze a novel locality-sensitive hashing family. The family
More informationPiazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)
4/0/9 Tim Althoff, UW CS547: Machine Learning for Big Data, http://www.cs.washington.edu/cse547 Piazza Recitation session: Review of linear algebra Location: Thursday, April, from 3:30-5:20 pm in SIG 34
More informationHash-based Indexing: Application, Impact, and Realization Alternatives
: Application, Impact, and Realization Alternatives Benno Stein and Martin Potthast Bauhaus University Weimar Web-Technology and Information Systems Text-based Information Retrieval (TIR) Motivation Consider
More informationLSH-SAMPLING BREAKS THE COMPUTA- TIONAL CHICKEN-AND-EGG LOOP IN ADAP- TIVE STOCHASTIC GRADIENT ESTIMATION
LSH-SAMPLING BREAKS THE COMPUTA- TIONAL CHICKEN-AND-EGG LOOP IN ADAP- TIVE STOCHASTIC GRADIENT ESTIMATION Beidi Chen, Yingchen Xu & Anshumali Shrivastava Department of Computer Science Rice University
More informationRandomized Algorithms
Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models
More informationCS6931 Database Seminar. Lecture 6: Set Operations on Massive Data
CS6931 Database Seminar Lecture 6: Set Operations on Massive Data Set Resemblance and MinWise Hashing/Independent Permutation Basics Consider two sets S1, S2 U = {0, 1, 2,...,D 1} (e.g., D = 2 64 ) f1
More informationSimilarity Search. Stony Brook University CSE545, Fall 2016
Similarity Search Stony Brook University CSE545, Fall 20 Finding Similar Items Applications Document Similarity: Mirrored web-pages Plagiarism; Similar News Recommendations: Online purchases Movie ratings
More informationCS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction
CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction Tim Roughgarden & Gregory Valiant April 12, 2017 1 The Curse of Dimensionality in the Nearest Neighbor Problem Lectures #1 and
More informationCS60021: Scalable Data Mining. Similarity Search and Hashing. Sourangshu Bha>acharya
CS62: Scalable Data Mining Similarity Search and Hashing Sourangshu Bha>acharya Finding Similar Items Distance Measures Goal: Find near-neighbors in high-dim. space We formally define near neighbors as
More informationAlgorithms for Data Science
Algorithms for Data Science CSOR W4246 Eleni Drinea Computer Science Department Columbia University Tuesday, December 1, 2015 Outline 1 Recap Balls and bins 2 On randomized algorithms 3 Saving space: hashing-based
More informationHigh Dimensional Search Min- Hashing Locality Sensi6ve Hashing
High Dimensional Search Min- Hashing Locality Sensi6ve Hashing Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata September 8 and 11, 2014 High Support Rules vs Correla6on of
More informationAd Placement Strategies
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January
More informationSimilarity Search in High Dimensions II. Piotr Indyk MIT
Similarity Search in High Dimensions II Piotr Indyk MIT Approximate Near(est) Neighbor c-approximate Nearest Neighbor: build data structure which, for any query q returns p P, p-q cr, where r is the distance
More informationBottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence. Mikkel Thorup University of Copenhagen
Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence Mikkel Thorup University of Copenhagen Min-wise hashing [Broder, 98, Alta Vita] Jaccard similary of sets A and B
More informationCOMPSCI 514: Algorithms for Data Science
Similar documents:

- COMPSCI 514: Algorithms for Data Science, Lecture 9: Similarity Queries — Arya Mazumdar (University of Massachusetts Amherst, Fall 2018)
- Retrieval by Content, Part 2: Text Retrieval — Term Frequency and Inverse Document Frequency — Srihari (CSE 626)
- Supplemental Material for Discrete Graph Hashing — Wei Liu, Cun Mu, Sanjiv Kumar, Shih-Fu Chang (IBM T. J. Watson Research Center, Columbia University, Google Research)
- Improved Consistent Sampling, Weighted Minhash and L1 Sketching — Sergey Ioffe (Google)
- Dimension Reduction in Kernel Spaces from Locality-Sensitive Hashing — Alexandr Andoni, Piotr Indyk
- Metric Embedding of Task-Specific Similarity — Greg Shakhnarovich (Brown University), joint work with Trevor Darrell (MIT)
- CS 224: Advanced Algorithms, Lecture 17 — Piotr Indyk (Spring 2017)
- ECS289: Scalable Machine Learning — Cho-Jui Hsieh (UC Davis)
- Stochastic Generative Hashing — B. Dai, R. Guo, S. Kumar, N. He, L. Song
- Analysis of Algorithms I: Perfect Hashing — Xi Chen (Columbia University)
- CS 224: Advanced Algorithms, Lecture 14 — Jelani Nelson (Fall 2014)
- CS 395T: Sublinear Algorithms, Lecture 3 — Eric Price (Fall 2014)
- Bloom Filters, Minhashes, and Other Random Stuff — Brian Brubach (University of Maryland, College Park; StringBio 2018)
- Composite Quantization for Approximate Nearest Neighbor Search — Jingdong Wang (Microsoft Research; ICML 2014)
- Minwise hashing for large-scale regression and classification with sparse data — Nicolai Meinshausen (ETH Zürich), Rajen Shah (University of Cambridge)
- Linear Spectral Hashing — Zalán Bodó, Lehel Csató (Babeş-Bolyai University)
- Distribution-specific analysis of nearest neighbor search and classification — Sanjoy Dasgupta (UC San Diego)
- Practical Hash Functions for Similarity Estimation and Dimensionality Reduction — Søren Dahlgaard, Mathias Bæk Tejs Knudsen, Mikkel Thorup (University of Copenhagen)
- Set Similarity Search Beyond MinHash — Tobias Christiani (IT University of Copenhagen)
- CS246 Final Exam, Winter 2011 (Stanford)
- Introduction to Information Retrieval, IIR 19: Size Estimation & Duplicate Detection — Hinrich Schütze (Universität Stuttgart)
- B490 Mining the Big Data: Finding Similar Items — Qin Zhang
- Super-Bit Locality-Sensitive Hashing — Jianqiu Ji, Jianmin Li, Shuicheng Yan, Bo Zhang, Qi Tian (Tsinghua University)
- CSE6242 / CX4242: Data & Visual Analytics — Text Analytics (Text Mining) — Duen Horng (Polo) Chau (Georgia Tech)
- Finding Similar Items (chapter on similarity and distance measures)
- Finding Similar Sets: Applications, Shingling, Minhashing, Locality-Sensitive Hashing, Distance Measures (modified from Jeff Ullman)
- Practical and Optimal LSH for Angular Distance — Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, Ludwig Schmidt
- Algorithms for Nearest Neighbors: Background and Two Challenges — Yury Lifshits (Steklov Institute of Mathematics)
- A New Space for Comparing Graphs — Anshumali Shrivastava, Ping Li (ASONAM 2014)
- Locality Sensitive Hashing (lecture notes, February 2016)
- Some Useful Background for Talk on the Fast Johnson-Lindenstrauss Transform — Nir Ailon
- Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics — Chengjie Qin, Florin Rusu (UC Merced)
- Lecture: Analysis of Algorithms (CS483-001) — Amarda Shehu (Spring 2017)
- Distance-Sensitive Hashing — Martin Aumüller, Rasmus Pagh (BARC and IT University of Copenhagen)
- Parallel and Distributed Stochastic Learning: Towards Scalable Learning for Big Data Intelligence — Wu-Jun Li (Nanjing University)
- Decoupled Collaborative Ranking — Jun Hu, Ping Li (WWW 2017)
- The Power of Asymmetry in Binary Hashing — Behnam Neyshabur, Yury Makarychev, Russ Salakhutdinov, Nati Srebro
- Boolean and Vector Space Retrieval Models
- CS 473: Algorithms, Lecture 10: Universal Hashing — Ruta Mehta (UIUC, Spring 2018)
- Large-scale Collaborative Ranking in Near-Linear Time — Liwei Wu, Cho-Jui Hsieh, James Sharpnack (KDD 2017)
- V.4 MapReduce: System Architecture, Programming Model, Hadoop
- CS435 Big Data: Minhash Signatures and Locality Sensitive Hashing with MapReduce — Sangmi Lee Pallickara
- Orthogonality and the Gram-Schmidt Process
- CS 6740: Advanced Language Technologies, Lecture 3: Pivoted Document Length Normalization — Lillian Lee
- Text Technologies for Data Science (INFR11145): Ranked Retrieval (2) — Walid Magdy
- Sparse and Robust Optimization and Applications — Laurent El Ghaoui (UC Berkeley)
- Parameter Norm Penalties — Sargur N. Srihari
- Finding Similar Items — CSE 344 section notes
- Fast Hierarchical Clustering from the Baire Distance — Pedro Contreras, Fionn Murtagh (Royal Holloway, University of London)
- Nystrom Method for Approximating the GMM Kernel — Ping Li (Rutgers University)
- Geometry of Similarity Search — Alex Andoni (Columbia University)
- Tabulation-Based Hashing — Mihai Pătrașcu, Mikkel Thorup
- Introduction to Hash Tables
- More on Estimation: Maximum Likelihood Estimation
- Recommendation Systems
- Optimization for Machine Learning, Lecture 3-A: Convexity — Suvrit Sra (MIT)
- CS 125 Section #12: (More) Probability and Randomized Algorithms
- Latent Semantic Indexing (LSI), CE-324: Modern Information Retrieval — M. Soleymani (Sharif University of Technology)
- COMS 4995-3: Advanced Algorithms, Lecture 9: Nearest Neighbor Search and Locality Sensitive Hashing — Alex Andoni
- SQL-Rank: A Listwise Approach to Collaborative Ranking — Liwei Wu, Cho-Jui Hsieh, James Sharpnack (ICML 2018)
- Introduction to Randomized Algorithms III — Joaquim Madeira
- Optimal Data-Dependent Hashing for Approximate Near Neighbors — Alexandr Andoni, Ilya Razenshteyn
- Hashing: Dictionaries, Chained Hashing, Universal Hashing, Static Dictionaries and Perfect Hashing — Philip Bille
- Spatial Indexing — Vaibhav Bajpai
- Linear Regression (learning a mapping from observations/features to continuous labels)
- Exploiting Primal and Dual Sparsity for Extreme Classification — Ian E. H. Yen, Xiangru Huang, Kai Zhong, Pradeep Ravikumar, Inderjit Dhillon (Carnegie Mellon)
- Sparse Fourier Transform, Lecture 1 — Michael Kapralov
- Vector Spaces, DS-GA 1013 / MATH-GA 2824: Optimization-based Data Analysis — Carlos Fernandez-Granda (NYU)