Fast Approximate k-way Similarity Search. Anshumali Shrivastava. Ping Li

Size: px

Start display at page:

Download "Fast Approximate k-way Similarity Search. Anshumali Shrivastava. Ping Li"

Cornelius Gallagher
6 years ago
Views:

1 Beyond Pairwise: Provably Fast Algorithms for Approximate k-way Similarity Search NIPS Fast Approximate k-way Similarity Search Anshumali Shrivastava Dept. of Computer Science Cornell University Ping Li Dept. of Statistics & Biostatistics Dept. of Computer Science Rutgers University

2 Beyond Pairwise: Provably Fast Algorithms for Approximate k-way Similarity Search NIPS The 3-way Resemblance Standard Jaccard or 2-way resemblance is one of the widely used similarity measures over set representations (e.gs 1,S 2 ) of documents defined as R = Sim(S 1,S 2 ) = S 1 S 2 S 1 S 2 3-way resemblance is a natural extension defined over 3 sets as: R 3way = Sim(S 1,S 2,S 3 ) = S 1 S 2 S 3 S 1 S 2 S 3 Can also be thought of as normalized co-occurrence.

3 Beyond Pairwise: Provably Fast Algorithms for Approximate k-way Similarity Search NIPS A Simple Experiment with Google Sets Problem: Given two (or a set of words)w 1 andw 2, complete the set by finding more words representing the set (or words that are semantically similar). Competing Methods: Google: The original Google s algorithm available via Google spreadsheet. 3-way resemblance (3-way): Use 3-way resemblance w 1 w 2 w w 1 w 2 w every wordw and report top 5 words. to rank Sum Resemblance (SR) : Use the sum of pairwise resemblance w 1 w w 1 w + w 2 w w 2 w and report top 5 words based on this similarity. Pairwise Intersection (PI): Retrieve top 100 words based on pairwise resemblance for eachw 1 andw 2 independently. Report the common words. In our experiments, all methods except Google use binary term-document representation generated from 1M wikipedia documents collected from Wikidump.

4 Beyond Pairwise: Provably Fast Algorithms for Approximate k-way Similarity Search NIPS Google Sets: Results JAGUAR AND TIGER GOOGLE 3-WAY SR PI MILKY AND WAY GOOGLE 3-WAY SR PI LION LEOPARD CAT dance GALAXY even LEOPARD CHEETAH LEOPARD STARS STARS another CHEETAH LION litre SPACE EARTH still CAT PANTHER bmw the LIGHT back DOG CAT chasis UNIVERSE SPACE TIME

5 Beyond Pairwise: Provably Fast Algorithms for Approximate k-way Similarity Search NIPS Improving Retrieval Problem: Refine search in presence of more than one representative query. Scenarios: Pairwise: Just one queryq, rank elementebased on resemblance q e q e. 3-way NNbor: Two representative queriesq 1 andq 2, rank based on 3-way resemblance q 1 q 2 e q 1 q 2 e. 4-way NNbor: Three representative queriesq 1,q 2 andq 3, rank based on 4-way resemblance q 1 q 2 q 3 e q 1 q 2 q 3 e.

6 Beyond Pairwise: Provably Fast Algorithms for Approximate k-way Similarity Search NIPS Improving Retrieval: Results Table 1: Percentage of top candidates with the same labels as that of query (queries) retrieved using various similarity criteria. Higher value indicates better retrieval quality. TOP PAIRWISE 3-WAY 4-WAY MNIST WEBSPAM

7 Beyond Pairwise: Provably Fast Algorithms for Approximate k-way Similarity Search NIPS Why 3-way Resemblance? SR PI 3-way Custom Quality? Efficient? Poor Poor Looks Good No Yes Yes (this work) Say Good? Note: Linear run time is not acceptable in applications like search.

8 Beyond Pairwise: Provably Fast Algorithms for Approximate k-way Similarity Search NIPS way Search Problems and c-approximate Versions Given two setss 1 ands 2, finds 3 C maximizing S 1 S 2 S 3 S 1 S 2 S 3. O(n) c-approximate Version (3-wayc-NN): Given two query setss 1 ands 2, if there exists 3 C withsim(s 1,S 2,S 3 ) R 0, then we report some S 3 C so that S 1 S 2 S 3 S 1 S 2 S 3 cr 0 with probability 1 δ. Given sets 1, find setss 2,S 3 C maximizing S 1 S 2 S 3 S 1 S 2 S 3. O(n2 ) c-approximate Version (3-wayc-CP): Given a query sets 1, if there exist a pair of sets 2,S 3 C withsim(s 1,S 2,S 3 ) R 0, then we report sets S 2,S 3 C so that S 1 S 2 S 3 S 1 S 2 S 3 cr 0 with probability 1 δ. FindS 1,S 2,S 3 C maximizing S 1 S 2 S 3 S 1 S 2 S 3. O(n3 ) c-approximate Version (3-wayc-BC): If there exist setss 1,S 2,S 3 C withsim(s 1,S 2,S 3 ) R 0, then we report setss 1,S 2,S 3 C so that S 1 S 2 S 3 S 1 S 2 S 3 cr 0 with probability 1 δ.

9 Beyond Pairwise: Provably Fast Algorithms for Approximate k-way Similarity Search NIPS Key Ideas: Probabilistic Indexing Given three setss 1,S 2,S 3 Ω and an independent random permutation π : Ω Ω, we have the following: Pr(min(π(S 1 ))=min(π(s 2 ))=min(π(s 3 ))) = R 3way. This estimator leads to an efficient indexing scheme. If we map every elements C to the hash bucket indexed by B(S) = [minπ(s);minπ(s)], given querys 1,S 2 we probe only the bucketb (S 1,S 2 ) = [minπ(s 1 );minπ(s 2 )] and we do better than random! This idea can be converted into a provably fast algorithm forc-nn search by adding two more handles K and L to control the probability.

10 Beyond Pairwise: Provably Fast Algorithms for Approximate k-way Similarity Search NIPS Main Algorithmic Results Theorem 1 ForR 3way c-nn one can construct a data structure with O(n ρ log 1/cR0 n) query time ando(n 1+ρ ) space. Theorem 2 ForR 3way c-cp one can construct a data structure with O(n 2ρ log 1/cR0 n) query time ando(n 1+2ρ ) space. Theorem 3 ForR 3way c-bc there exist an algorithm with running time O(n 1+2ρ log 1/cR0 n). ρ R 0 = R 0 = c Figure 1: log 1/c log1/c+log1/r 0 Plot of ρ = 1 < 1 values with respect to c for various thresholdsr 0

11 Beyond Pairwise: Provably Fast Algorithms for Approximate k-way Similarity Search NIPS Are there more k-way similarities which are efficient? Theorem 4 Any PGF transformation on3-way resemblancer 3way admits efficient c-nn search. wherepgf(s) = i=1 p is i with allp i 0 satisfying i=1 p i = 1 Corollary 1 e R3way 1 admits efficientc-nn search. Theorem 5 Weighted 3-way resemblance, defined as Sim(x,y,z) = min{x i,y i,z i } i max{x i,y i,z i }, naturally enjoys all efficiently guarantees of 3-way resemblance using consistent weighted sampling instead of Minhash.

12 Beyond Pairwise: Provably Fast Algorithms for Approximate k-way Similarity Search NIPS Conclusions 3-way (and higher) resemblance seems a natural choice for many interesting search problems, and at the same time it admits efficient search algorithms. The idea of probabilistic hashing can reduce the computational requirements significantly. More Possibilities Joint Recommendations: Users A and B would like to watch a movie together. Profile of each person represented as a binary sparse vector over a giant universe of attributes. For example: actors, actresses, genres, directors, etc, which she/he likes. Represent movie M as binary vectors over the same universe. A natural measure to maximize is A B M A B M.

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment Anshumali Shrivastava and Ping Li Cornell University and Rutgers University WWW 25 Florence, Italy May 2st 25 Will Join