Similarity Search. Stony Brook University CSE545, Fall 2016

1 Similarity Search Stony Brook University CSE545, Fall 2016

2 Finding Similar Items Applications Document Similarity: Mirrored web-pages Plagiarism; Similar News Recommendations: Online purchases Movie ratings Entity Resolution Fingerprint Matching

3 Finding Similar Items: What we will cover Set Similarity Shingling Minhashing Locality-sensitive hashing Embeddings Distance Metrics High-Degree of Similarity

4 Document Similarity Challenge: How to represent the document in a way that can be efficiently encoded and compared?

5 Shingles Goal: Convert documents to sets

6 Shingles Goal: Convert documents to sets k-shingles (aka character n-grams ) - sequence of k characters

7 Shingles Goal: Convert documents to sets k-shingles (aka character n-grams) - sequence of k characters E.g. k=2, doc = "abcdabd": shingles(doc, 2) = {ab, bc, cd, da, bd}

8 Shingles Goal: Convert documents to sets k-shingles (aka character n-grams) - sequence of k characters E.g. k=2, doc = "abcdabd": shingles(doc, 2) = {ab, bc, cd, da, bd} Similar documents will have many common shingles; changing words or their order has minimal effect. In practice use 5 < k < 10.

9 Shingles Goal: Convert documents to sets k-shingles (aka character n-grams) - sequence of k characters k should be large enough that any given shingle appearing in a document is highly unlikely (e.g. < .1% chance) E.g. k=2, doc = "abcdabd": shingles(doc, 2) = {ab, bc, cd, da, bd} Can hash large shingles to smaller values (e.g. 9-shingles into 4 bytes) Can also use words (aka n-grams). Similar documents will have many common shingles; changing words or their order has minimal effect. In practice use 5 < k < 10.
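
A minimal sketch of the shingling step in Python, following the slide's example (the function name shingles follows the slide's notation):

    def shingles(doc, k):
        # Set of all length-k character substrings of doc
        return {doc[i:i + k] for i in range(len(doc) - k + 1)}

    print(shingles("abcdabd", 2))  # {'ab', 'bc', 'cd', 'da', 'bd'}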

10 Shingles Problem: Even if hashing, sets of shingles are large (e.g. 4 bytes => 4x the size of the document).

11 Minhashing Goal: Convert sets to shorter ids, signatures

12 Minhashing - Background Goal: Convert sets to shorter ids, signatures Characteristic Matrix: rows are set elements (e.g. shingles), columns are sets; an entry is 1 when the set contains the element. Jaccard Similarity: sim(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2| (Leskovec et al., 2014)

13 Minhashing - Background Goal: Convert sets to shorter ids, signatures Characteristic Matrix: rows are set elements (e.g. shingles), columns are sets; an entry is 1 when the set contains the element -- often very sparse! (lots of zeros) Jaccard Similarity: sim(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2| (Leskovec et al., 2014)

14 Minhashing - Background Characteristic Matrix:
       S1  S2
  ab    1   1
  bc    0   1
  de    0   1
  ah    1   1
  ha    0   0
  ed    1   1
  ca    0   1
Jaccard Similarity: sim(S1, S2)?

15 Minhashing - Background Characteristic Matrix (1s to count marked with *):
       S1   S2
  ab    1*   1*
  bc    0    1*
  de    0    1*
  ah    1*   1*
  ha    0    0
  ed    1*   1*
  ca    0    1*
Jaccard Similarity: sim(S1, S2)?

16 Minhashing - Background Characteristic Matrix (as above) Jaccard Similarity: sim(S1, S2) = 3/6 (# rows where both have a 1 / # rows where at least one has a 1)

17 Minhashing - Background Characteristic Matrix (as above) How many different rows are possible?

18 Minhashing - Background Characteristic Matrix (as above) How many different rows are possible? 1, 1 -- type a; 1, 0 -- type b; 0, 1 -- type c; 0, 0 -- type d. sim(S1, S2) = a / (a + b + c)
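
The row-type counting can be made concrete with a small sketch (the helper name jaccard_from_columns is an illustrative choice); the columns are S1 and S2 from the matrix above:

    def jaccard_from_columns(col1, col2):
        # a = rows of type (1,1); b = (1,0); c = (0,1); type-d rows (0,0) are ignored
        a = sum(1 for x, y in zip(col1, col2) if x == 1 and y == 1)
        b = sum(1 for x, y in zip(col1, col2) if x == 1 and y == 0)
        c = sum(1 for x, y in zip(col1, col2) if x == 0 and y == 1)
        return a / (a + b + c)

    # Columns S1, S2 in row order ab, bc, de, ah, ha, ed, ca (as above)
    S1 = [1, 0, 0, 1, 0, 1, 0]
    S2 = [1, 1, 1, 1, 0, 1, 1]
    print(jaccard_from_columns(S1, S2))  # 3 / 6 = 0.5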

19 Shingles Problem: Even if hashing, sets of shingles are large (e.g. 4 bytes => 4x the size of the document).

20 Minhashing Characteristic Matrix:
       S1  S2  S3  S4
  ab    1   0   1   0
  bc    1   0   0   1
  de    0   1   0   1
  ah    0   1   0   1
  ha    0   1   0   1
  ed    1   0   1   0
  ca    1   0   1   0
(Leskovec et al., 2014)

21 Minhashing Characteristic Matrix (as above) Minhash function: h Based on a permutation of the rows in the characteristic matrix, h maps sets to the first row where the set appears. (Leskovec et al., 2014)

22 Minhashing Characteristic Matrix (as above), permuted order: 1 ha, 2 ed, 3 ab, 4 bc, 5 ca, 6 ah, 7 de Minhash function: h Based on a permutation of the rows in the characteristic matrix, h maps sets to the first row where the set appears. (Leskovec et al., 2014)

24 Minhashing Characteristic Matrix (as above), permuted order: 1 ha, 2 ed, 3 ab, 4 bc, 5 ca, 6 ah, 7 de Minhash function: h Based on a permutation of the rows in the characteristic matrix, h maps sets to the first row where the set appears. h(S1) = ed # permuted row 2 h(S2) = ha # permuted row 1 h(S3) = ? (Leskovec et al., 2014)

25 Minhashing Characteristic Matrix (as above), permuted order: 1 ha, 2 ed, 3 ab, 4 bc, 5 ca, 6 ah, 7 de Minhash function: h Based on a permutation of the rows in the characteristic matrix, h maps sets to the first row where the set appears. h(S1) = ed # permuted row 2 h(S2) = ha # permuted row 1 h(S3) = ed # permuted row 2 h(S4) = ? (Leskovec et al., 2014)

26 Minhashing Characteristic Matrix (as above), permuted order: 1 ha, 2 ed, 3 ab, 4 bc, 5 ca, 6 ah, 7 de Minhash function: h Based on a permutation of the rows in the characteristic matrix, h maps sets to the first row where the set appears. h(S1) = ed # permuted row 2 h(S2) = ha # permuted row 1 h(S3) = ed # permuted row 2 h(S4) = ha # permuted row 1 (Leskovec et al., 2014)

27 Minhashing Characteristic Matrix (as above) Minhash function: h Based on a permutation of the rows in the characteristic matrix, h maps sets to rows. Signature matrix: M Record the first row where each set had a 1 in the given permutation:
       S1  S2  S3  S4
  h1    2   1   2   1
h1(S1) = ed # permuted row 2 h1(S2) = ha # permuted row 1 h1(S3) = ed # permuted row 2 h1(S4) = ha # permuted row 1 (Leskovec et al., 2014)

30-34 Minhashing Characteristic Matrix (as above) Minhash function: h Based on a permutation of the rows in the characteristic matrix, h maps sets to rows. Signature matrix: M Record the first row where each set had a 1 in the given permutation. Two further permutations fill in the remaining signature rows:
       S1  S2  S3  S4
  h1    2   1   2   1
  h2    2   1   4   1
  h3    1   2   1   2
(Leskovec et al., 2014)

35 Minhashing Property of the signature matrix: for any hi (i.e. any row), the probability that hi(S1) = hi(S2) is the same as Sim(S1, S2). Characteristic matrix and signature matrix M as above. (Leskovec et al., 2014)

36 Minhashing Property of the signature matrix: for any hi (i.e. any row), the probability that hi(S1) = hi(S2) is the same as Sim(S1, S2). Thus, the similarity of signatures S1, S2 is the fraction of minhash functions (i.e. rows) in which they agree. (Leskovec et al., 2014)

37 Minhashing Property of the signature matrix (as above). Thus, the similarity of signatures S1, S2 is the fraction of minhash functions (i.e. rows) in which they agree. Estimate with a random sample of permutations (i.e. ~100). (Leskovec et al., 2014)

38 Minhashing Similarity of signatures = fraction of minhash functions (i.e. rows) in which they agree. Estimated Sim(S1, S3) = # rows where signatures agree / # rows = 2/3 (Leskovec et al., 2014)

39 Minhashing Similarity of signatures = fraction of minhash functions (i.e. rows) in which they agree. Estimated Sim(S1, S3) = # rows where signatures agree / # rows = 2/3 Real Sim(S1, S3) = type a / (a + b + c) = 3/4 (Leskovec et al., 2014)

40 Minhashing Similarity of signatures = fraction of minhash functions (i.e. rows) in which they agree. Estimated Sim(S1, S3) = 2/3 Real Sim(S1, S3) = type a / (a + b + c) = 3/4 Try Sim(S1, S4) and Sim(S1, S2). (Leskovec et al., 2014)
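
A sketch of permutation-based minhashing on the characteristic matrix above, estimating similarity as the fraction of rows on which signatures agree (100 permutations, per the slide; the dictionary layout and seed are illustrative choices):

    import random

    # Characteristic matrix columns for S1..S4, row order ab, bc, de, ah, ha, ed, ca (as above)
    cm = {"S1": [1, 1, 0, 0, 0, 1, 1],
          "S2": [0, 0, 1, 1, 1, 0, 0],
          "S3": [1, 0, 0, 0, 0, 1, 1],
          "S4": [0, 1, 1, 1, 1, 0, 0]}

    def minhash(col, perm):
        # First row (in permuted order) where the set has a 1;
        # returns the row's index -- only agreement matters for the estimate
        for r in perm:
            if col[r] == 1:
                return r

    random.seed(0)
    perms = [random.sample(range(7), 7) for _ in range(100)]
    sig = {s: [minhash(col, p) for p in perms] for s, col in cm.items()}

    # Similarity of signatures = fraction of minhash functions on which they agree
    est = sum(a == b for a, b in zip(sig["S1"], sig["S3"])) / len(perms)
    print(est)  # close to the true Jaccard similarity 3/4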

41 Minhashing To implement Problem: Can't actually do permutations (huge space) Can't randomly grab rows according to an order (random disk seeks = slow!)

42 Minhashing To implement Problem: Can't reasonably do permutations (huge space) Can't randomly grab rows according to an order (random disk seeks = slow!) Solution: Use random hash functions. Setup: Pick ~100 hash functions, hashes Store M[i][s] = a potential minimum h_i(r) # initialized to infinity (num hashes x num sets)

43 Minhashing To implement Problem: Can't reasonably do permutations (huge space) Can't randomly grab rows according to an order (random disk seeks = slow!) Solution: Use random hash functions. Setup: Pick ~100 hash functions, hashes Store M[i][s] = a potential minimum h_i(r) # initialized to infinity (num hashes x num sets) Algorithm:
    for r in rows of cm:  # cm is the characteristic matrix
        compute h_i(r) for all i in hashes  # produces ~100 precomputed values
        for each set s in row r:
            if cm[r][s] == 1:
                for i in hashes:  # check which hash produces the smallest value
                    if h_i(r) < M[i][s]:
                        M[i][s] = h_i(r)

44 Minhashing To implement (setup and algorithm as on the previous slide) Known as efficient minhashing.

45 Minhashing What hash functions to use? Start with a decent function E.g. h1(x) = ascii(string) % large_prime_number Add a random multiple and addition E.g. h2(x) = (a*ascii(string) + b) % large_prime_number
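
Combining the last few slides, a minimal runnable sketch in Python. The choice of prime, the number of hash functions, and hashing the row index r (rather than the shingle string itself, as the slide does) are illustrative assumptions:

    import random

    P = 2_147_483_647  # a large prime (assumed choice; any large prime works)
    NUM_HASHES = 100
    # One pair (a_i, b_i) per hash function: h_i(r) = (a_i * r + b_i) % P
    coeffs = [(random.randrange(1, P), random.randrange(P)) for _ in range(NUM_HASHES)]

    def minhash_signatures(cm, num_sets):
        # cm: list of rows, each row a list of 0/1 flags over the sets
        M = [[float("inf")] * num_sets for _ in range(NUM_HASHES)]
        for r, row in enumerate(cm):
            hs = [(a * r + b) % P for a, b in coeffs]  # precompute h_i(r) for all i
            for s, bit in enumerate(row):
                if bit == 1:
                    for i in range(NUM_HASHES):
                        if hs[i] < M[i][s]:  # keep the running minimum per hash, per set
                            M[i][s] = hs[i]
        return M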

46 Minhashing Problem: Even if hashing, sets of shingles are large (e.g. 4 bytes => 4x the size of the document).

47 Minhashing Problem: Even if hashing, sets of shingles are large (e.g. 4 bytes => 4x the size of the document). New Problem: Even if the signatures are small, it can be computationally expensive to find similar pairs. E.g. 1m documents: 1,000,000 choose 2 ≈ 500,000,000,000 pairs

48 Locality-Sensitive Hashing Goal: find pairs of minhashes likely to be similar (in order to then test more precisely for similarity). Candidate pairs: pairs of elements to be evaluated for similarity.

49 Locality-Sensitive Hashing Goal: find pairs of minhashes likely to be similar (in order to then test more precisely for similarity). Candidate pairs: pairs of elements to be evaluated for similarity. If we wanted the similarity for all pairs of documents, could anything be done?

50 Locality-Sensitive Hashing Goal: find pairs of minhashes likely to be similar (in order to then test more precisely for similarity). Candidate pairs: pairs of elements to be evaluated for similarity. Approach: Hash multiple times: similar items are likely in the same bucket.

51 Locality-Sensitive Hashing Goal: find pairs of minhashes likely to be similar (in order to then test more precisely for similarity). Candidate pairs: pairs of elements to be evaluated for similarity. Approach: Hash multiple times: similar items are likely in the same bucket. Applied to MinHash: hash columns of the signature matrix; candidate pairs are those that end up in the same bucket. (LSH is a type of near-neighbor search)

52 Locality-Sensitive Hashing Step 1: Add bands (divide the signature matrix into b bands of r rows each) (Leskovec et al., 2014)

53 Locality-Sensitive Hashing Step 1: Add bands Can be tuned to catch most true-positives with least false-positives. (Leskovec et al., 2014)

54 Locality-Sensitive Hashing Step 2: Hash columns within bands (Leskovec et al., 2014)

57 Locality-Sensitive Hashing Step 2: Hash columns within bands Criteria for being a candidate pair: they end up in the same bucket for at least 1 band. (Leskovec et al., 2014)

58 Locality-Sensitive Hashing Step 2: Hash columns within bands Simplification: There are enough buckets compared to rows per band that columns must be identical in order to hash to the same bucket. Thus, we only need to check whether columns are identical within a band. (Leskovec et al., 2014)
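
A sketch of the banding step under the simplification above: the band's slice of each column serves directly as the bucket key, so two columns share a bucket only if they are identical within that band. The function name and in-memory signature matrix M (b*r rows, one column per set) are assumptions for illustration:

    from collections import defaultdict
    from itertools import combinations

    def lsh_candidate_pairs(M, b, r):
        # M: signature matrix with b*r rows; columns are sets
        candidates = set()
        num_sets = len(M[0])
        for band in range(b):
            buckets = defaultdict(list)
            for s in range(num_sets):
                # The band's column slice is the bucket key (identical-within-band rule)
                key = tuple(M[band * r + i][s] for i in range(r))
                buckets[key].append(s)
            for cols in buckets.values():  # any two columns in a bucket are candidates
                candidates.update(combinations(cols, 2))
        return candidates

With the 100-row signature matrix above, the next slide's setup corresponds to lsh_candidate_pairs(M, b=20, r=5).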

59 Realistic Example: Probabilities of agreement 100,000 documents 100 random permutations/hash functions/rows => if 4-byte integers then 40MB to hold the signature matrix => still 100k choose 2 is a lot (~5 billion) 20 bands of 5 rows Want 80% Jaccard Similarity

60 Realistic Example: Probabilities of agreement (setup as above: 100,000 documents, 100 hash functions/rows, 20 bands of 5 rows, want 80% Jaccard Similarity) P(S1 == S2 in a band): probability S1 and S2 agree within a given band

61 Realistic Example: Probabilities of agreement (setup as above) For any row, P(S1 == S2) = .8 P(S1 == S2 in a given band) = .8^5 = .328 => P(S1 != S2 in a given band) = 1 - .328 = .672 P(S1 != S2 in all 20 bands): probability S1 and S2 do not agree in any band

62 Realistic Example: Probabilities of agreement (setup as above) For any row, P(S1 == S2) = .8 P(S1 == S2 in a given band) = .8^5 = .328 => P(S1 != S2 in a given band) = 1 - .328 = .672 P(S1 != S2 in all 20 bands) = .672^20 = .00035

63 Realistic Example: Probabilities of agreement (setup as above) For any row, P(S1 == S2) = .8 P(S1 == S2 in a given band) = .8^5 = .328 => P(S1 != S2 in a given band) = 1 - .328 = .672 P(S1 != S2 in all 20 bands) = .672^20 = .00035 What if we want 40% Jaccard Similarity?
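
The same arithmetic in a few lines; the 40% case answers the question above (the helper name p_candidate is an illustrative choice):

    def p_candidate(s, r=5, b=20):
        # Probability that a pair with Jaccard similarity s agrees in at least one band
        return 1 - (1 - s ** r) ** b

    print(p_candidate(0.8))  # ~0.9996: misses only ~.00035 of 80%-similar pairs
    print(p_candidate(0.4))  # ~0.186: most 40%-similar pairs never become candidates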

64 Document Similarity Pipeline Shingling Minhashing Locality-sensitive hashing

65 Distance Metrics The pipeline gives us a way to find near-neighbors in high-dimensional space based on Jaccard Distance (1 - Jaccard Sim).

66 Distance Metrics The pipeline gives us a way to find near-neighbors in high-dimensional space based on Jaccard Distance (1 - Jaccard Sim). Typical properties of a distance metric d: d(x, x) = 0 d(x, y) = d(y, x) d(x, y) <= d(x, z) + d(z, y)

67 Distance Metrics The pipeline gives us a way to find near-neighbors in high-dimensional space based on Jaccard Distance (1 - Jaccard Sim). There are other metrics of similarity, e.g.: Euclidean Distance Cosine Distance Edit Distance Hamming Distance

68 Distance Metrics The pipeline gives us a way to find near-neighbors in high-dimensional space based on Jaccard Distance (1 - Jaccard Sim). There are other metrics of similarity, e.g.: Euclidean Distance ("L2 Norm") Cosine Distance Edit Distance Hamming Distance

70 Locality Sensitive Hashing - Theory LSH can be generalized to many distance metrics by converting output to a probability and providing a lower bound on the probability of being similar.

71 Locality Sensitive Hashing - Theory LSH can be generalized to many distance metrics by converting output to a probability and providing a lower bound on the probability of being similar. E.g. for Euclidean distance: Choose random lines (analogous to hash functions in minhashing) Project the two points onto each line; they match if the two points fall within the same interval.
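
A rough sketch of that idea for points in the plane; the interval width w, the 20 random lines, and the example points are illustrative assumptions:

    import math
    import random

    def random_line_hash(w=4.0):
        # Project onto a random unit direction; the bucket is which
        # width-w interval along the line the projection lands in
        theta = random.uniform(0, 2 * math.pi)
        ux, uy = math.cos(theta), math.sin(theta)
        offset = random.uniform(0, w)  # random shift of the interval grid
        return lambda x, y: int((x * ux + y * uy + offset) // w)

    hashes = [random_line_hash() for _ in range(20)]
    p, q = (1.0, 2.0), (1.5, 2.2)  # two nearby points
    print(sum(h(*p) == h(*q) for h in hashes), "of", len(hashes), "lines match")

Close points project to the same interval on most lines; distant points rarely do, which gives the lower bound on the match probability.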
