Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Size: px

Start display at page:

Download "Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University"

Lindsay Murphy
6 years ago
Views:

If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our

1 Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

2 Many problems can be expressed as finding similar objects: Find near(est)-neighbors Example Applications: Pages with similar words For duplicate detection, clustering by topic Customers who purchased similar products knn classification, collaborative filtering Images with similar features Image recommendation Record linkage (deduplication) J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

3 [Hays and Efros, SIGGRAPH 7] nearest neighbors from a collection of million images J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 3

4 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 4 Given: (High dimensional) data points x, x, For example: Image is a vector of pixel colors [ ] And some distance function d(x, x ) Which quantifies the distance between x and x Goal: Find all pairs of data points (x i, x j ) that are within some distance threshold d x i, x j s

5 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 5 Given: (High dimensional) data points x, x, For example: Image is a vector of pixel colors [ ] And some distance function d(x, x ) Which quantifies the distance between x and x Goal: Find all pairs of data points (x i, x j ) that are within some distance threshold d x i, x j s Naïve solution would take O N where N is the number of data points

6 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 6 Hash objects to buckets such that objects that are similar hash to the same bucket Only compare candidates in each bucket Benefits: Instead of O(N ) comparisons, we need O(N) to find similar documents Hash functions depend on similarity functions

7 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 7 Goal: Given a large number (N in the millions or billions) of documents, find near duplicate pairs Applications: Mirror websites, or approximate mirrors Similar news articles at many news sites Problems: Too many documents to compare all pairs

8 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 8 Shingling: Convert documents to sets Simple approaches: Document = set of words appearing in document Document = set of important words

9 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 9 Need to account for ordering of words! Document = set of Shingles A set of k-shingles (or k-grams) is a set of k- sequence tokens that appears in the doc Tokens can be characters, words Example: k=; D = abcab Set of -shingles: S(D ) = {ab, bc, ca}

10 A natural similarity measure is the Jaccard similarity: sim(d, D ) = C C / C C J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

11 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Encode sets using / (bit, boolean) vectors One dimension per element in the universal set Interpret set intersection as bitwise AND, and set union as bitwise OR Example: C = ; C = Size of intersection = 3; size of union = 4, Jaccard similarity = 3/4 Distance: d(c,c ) = (Jaccard similarity) = /4

12 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Shingles Rows = elements (shingles) Columns = sets (documents) in row e and column s if and only if e is a member of s Column similarity is the Jaccard similarity of the corresponding sets Typical matrix is sparse! Example: sim(c,c ) =? Documents

13 Shingles Rows = elements (shingles) Columns = sets (documents) in row e and column s if and only if e is a member of s Column similarity is the Jaccard similarity of the corresponding sets Typical matrix is sparse! Example: sim(c,c ) =? Size of intersection = 3; size of union = 6, Jaccard similarity = 3/6 d(c,c ) = (Jaccard similarity) = 3/6 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Documents 3

14 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 5 Suppose we need to find near-duplicate documents among N = million documents Naïvely, we would have to compute pairwise Jaccard similarities for every pair of docs N(N )/ 5* comparisons At 5 secs/day and 6 comparisons/sec, it would take 5 days For N = million, it takes more than a year

15 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 6 Key Idea: hash each column C to a small signature h(c): () h(c) is small enough that the signature fits in RAM () sim(c, C ) is the same as the similarity of signatures h(c ) and h(c ) Locality sensitive hashing: If sim(c,c ) is high, then with high prob. h(c ) = h(c ) If sim(c,c ) is low, then with high prob. h(c ) h(c ) Expect that most pairs of near duplicate docs hash into the same bucket!

16 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 7 Input matrix (Shingles x Documents)

17 Signature matrix M J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, nd element of the permutation is the first to map to a 4 th element of the permutation is the first to map to a Input matrix (Shingles x Documents) Permutation

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.

18 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 9 Imagine the rows of the boolean matrix permuted under random permutation Define a hash function h (C) = the index of the first (in the permuted order ) row in which column C has value : h (C) = min (C) Use several (e.g., ) independent hash functions (that is, permutations) to create a signature of a column

19 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Permuting rows even once is prohibitive Row hashing! Pick K hash functions k i Ordering under k i gives a random row permutation! How to pick a random hash function h(x)? Universal hashing: h a,b (x)= (a x+b) mod N where: a,b random integers

20 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, One-pass implementation For each column C Initialize all hash values for each permutation i: sig(c)[i] = For each row If there is a in column C Update hash value of column C if the row number in the current permutation is smaller than current value If k i < sig(c)[i], then sig(c)[i] k i

21 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

22 4 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Signature matrix M 4 Input matrix (Shingles x Documents) Permutation

23 One bit matching (given a ) Pr[h (C ) = h (C )] = J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 5 Signature matrix M 4

24 One bit matching (given ) Pr[h (C ) = h (C )] = Sim(C, C) J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 6 Signature matrix M 4

25 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 8 Given cols C and C, rows may be classified as: C C A B C D a = # rows of type A, etc. Note: sim(c, C ) = a/(a +b +c) Then: Pr[h(C ) = h(c )] = Sim(C, C ) Look down the cols C and C until we see a If it s a type-a row, then h(c ) = h(c ) If a type-b or type-c row, then not

26 9 One given bit matching Pr[h (C ) = h (C )] = sim(c, C ) The expected similarity of the signatures (defined as fractions of matching values) sim[h (C ) = h (C )] = J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Signature matrix M 4

27 3 One given bit matching: Pr[h (C ) = h (C )] = sim(c, C ) The expected similarity of the signatures (defined as fractions of matching values) sim[h (C ) = h (C )] = sim(c, C ) J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Signature matrix M 4

28 3 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Similarities: Col/Col Sig/Sig Signature matrix M 4 Input matrix (Shingles x Documents) Permutation

29 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 3 Key Idea: hash each column C to a small signature h(c): () h(c) is small enough that the signature fits in RAM () sim(c, C ) is the same as the similarity of signatures h(c ) and h(c ) Locality sensitive hashing: If sim(c,c ) is high, then with high prob. h(c ) = h(c ) If sim(c,c ) is low, then with high prob. h(c ) h(c ) Expect that most pairs of near duplicate docs hash into the same bucket!

30 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 33 Probability of sharing a bucket No chance if t < s Similarity threshold s Probability = if t > s Similarity t =sim(c, C ) of two sets

31 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 34 Probability of matching bit Similarity t =sim(c, C ) of two sets

32 35 One given bit matching: Pr[h (C ) = h (C )] = sim(c, C ) Similarity of signatures: sim[h (C ) = h (C )] = sim(c, C ) All bits matching (AND): Pr[h(C ) = h(c )] = J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Signature matrix M 4

33 36 One given bit matching: Pr[h (C ) = h (C )] = sim(c, C ) Similarity of signatures: sim[h (C ) = h (C )] = sim(c, C ) All bits matching (AND): Pr[h(C ) = h(c )] = sim(c, C ) K J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Signature matrix M 4

34 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 37 One given bit matching: Pr[h (C ) = h (C )] = sim(c, C ) Similarity of signatures: sim[h (C ) = h (C )] = sim(c, C ) All bits matching (AND): Pr[h(C ) = h(c )] = sim(c, C ) K Any bit matching (OR): Pr[any h (C ) = h (C )] = Signature matrix M 4

35 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 38 One given bit matching: Pr[h (C ) = h (C )] = sim(c, C ) Similarity of signatures: sim[h (C ) = h (C )] = sim(c, C ) All bits matching (AND): Pr[h(C ) = h(c )] = sim(c, C ) K Any bit matching (OR): Pr[any h (C ) = h (C )] = - (-sim(c, C )) K Signature matrix M 4

36 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, b bands r rows per band One signature Signature matrix M

37 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 48 Divide matrix M into b bands of r rows Candidate column pairs are those that hash to the same values for any band Tune b and r to catch most similar pairs, but few non-similar pairs

38 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 5 Columns C and C have similarity t Prob. of one given band (r rows) matching = Prob. of any band matching =

39 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 5 Columns C and C have similarity t Prob. of one given band (r rows) matching = t r Prob. of any band matching = - ( - t r ) b

40 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 54 Probability of sharing a bucket s ~ (/b) /r - ( -t r ) b Similarity t=sim(c, C ) of two sets

41 Prob. sharing a bucket J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 55 Tradeoff of false positive and false negatives Example: 5 hash-functions (r=5, b=) Blue area: False Negative rate Green area: False Positive rate Similarity

42 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 6

43 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 6 Given a (d,d,p,p )-sensitive family F AND-construction AND of r members of F (d,d,p r,p r )-sensitive Mirrors the effect of r rows in a single band OR-construction OR of b members of F (d,d,(-p ) b,(-p ) b )-sensitive Mirrors the effect of combining multiple bands Select r and b to increase p and decrease p

44 Hashing function: a vector of d dimensions is hashed into the ith bit value (d, d, -d/d, -d/d)-sensitive AND/OR constructions J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 63

45 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 64 Distance function: d = theta A randomly chosen vector v f Given two vectors x and y, f(x) = f(y) iff v f x and v f y have the same sign (d,d, -d/8, -d/8)-sensitive

46 A randomly chosen line with segments of length a A point is hashed to the bucket in which its projection onto the line lies (d,d, p,p)-sensitive J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 65

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from http://www.mmds.org Distance Measures For finding similar documents, we consider the Jaccard