Database Systems CSE 514

Size: px

Start display at page:

Download "Database Systems CSE 514"

Ginger Mills
5 years ago
Views:

1 Database Systems CSE 514 Lecture 8: Data Cleaning and Sampling CSEP514 - Winter

2 Announcements WQ7 was due last night (did you remember?) HW6 is due on Sunday Weston will go over it in the section The final exam is next Tuesday, in class CSEP514 - Winter

3 Data Cleaning Cover slides 2-20 from the Data Cleaning Tutorial CSEP514 - Winter

4 Deduplication Other names: entity resolution, record linkage,... Cover (quickly!) slides of the ER tutorial CSEP514 - Winter

5 Record Linkage The same object is represented in different ways in two relations Problem: find matching pairs Examples Typos: Woshington v.s. Washington Different naming conventions: IBM v.s. International Business Machines Corporation 5

6 Company1 Example Company1 CName1 Microsoft Corp Apple Computer Apples, Pears, and More... Other attributes CName2 Microsoft Inc Apple Corporation Apples and Pears Farm Other attributes... 6

7 Fellegi & Sunter Model Each relation has several attributes: R(A1,A2,...,Ak) S(A1,A2,...,Ak) Fix two records r and s: we have a score of similarity for each pair of attributes: sim(r.a1,s.a1), sim(r.a2,s.a2),..., sim(r.ak,s.ak) Using these scores, decide if r = s or r s This lecture will discuss only single attribute 7

8 Similarity Join another name for Fuzzy Matching SELECT * FROM Company1, Company2 WHERE cname1 cname2 Intended meaning: retrieve all pairs that are sufficiently similar 8

9 Outline for Today Define similarity s1 s2 Jaccard, Hamming, Edit distance Given a set of sets s1, s2,... compute efficiently all pairs (si,sj) s.t. si sj Min-hashes Locality Sensitive Hashing CSEP514 - Winter

10 Similarity Functions CSEP514 - Winter

11 What is Similar? Similarity function sim(s1,s2): Sim(s1,s2) > k means s1, s2 are similar Distance function dist(s1,s2): Dist(s1,s2) < k means s1, s2 are similar 11

12 Two Approaches Q-grams Simple, and can lead to efficient optimizations Edit Distance More accurate, but less amenable to optimizations 12

13 Q-Grams Given a string s, a q-gram is a substring of length q Usually q = 3 washington woshington s1 = {was, ash, shi, hin, ing, ngt, gto, ton} s2 = {wos, osh, shi, hin, ing, ngt, gto, ton} Variation: may include beginning and end: ##w, #wa, on$, n$$

14 Hamming Distance H(s1, s2) = s1 Δ s2 = s1-s2 + s2-s1 s1 s1 is similar to s2 if H(s1, s2) < c, for some constant c s2

15 Hamming Distance s1 = {was, ash, shi, hin, ing, ngt, gto, ton} s2 = {wos, osh, shi, hin, ing, ngt, gto, ton} H(s1, s2) =? 15

16 Hamming Distance s1 = {was, ash, shi, hin, ing, ngt, gto, ton} s2 = {wos, osh, shi, hin, ing, ngt, gto, ton} H(s1, s2) = 4 16

17 Jaccard Similarity J(s1,s2) = s1 s2 / s1 s2 Note: J(s1,s2) [0,1] s1 s1 is similar to s2 if J(s1, s2) > k, for some constant k s2

18 Jaccard Similarity s1 = {was, ash, shi, hin, ing, ngt, gto, ton} s2 = {wos, osh, shi, hin, ing, ngt, gto, ton} J(s1, s2) =? 18

19 Jaccard Similarity s1 = {was, ash, shi, hin, ing, ngt, gto, ton} s2 = {wos, osh, shi, hin, ing, ngt, gto, ton} J(s1, s2) = 6/10 =

20 They are related! Suppose s1 = s2 = L s1 s2 = L H(s1,s2) s1 s2 = L + H(s1,s2) J(s1,s2) = (L H(s1,s2)) / (L + H(s1,s2)) 20

21 Representing q-grams Company (id, name, ) CQ(id, qgram) Id Name... 1 Washingto n The q-gram schema Id Qgram 1 was 1 ash 1 shi 1 hin

22 Naive Similarity Joins SELECT * FROM Company1, Company2 WHERE cname1 cname2 > 6 Company1(id1, cname1, ) Company2(id2, cname2, ) CQ1(id1, qgram1) CQ2(id2, qgram2) Rewrite the query over the q-gram schema 22

23 Naive Similarity Joins SELECT * FROM Company1, Company2 WHERE cname1 cname2 > 6 SELECT x.*, y.* FROM Company1 x, Company2 y, CQ1 u, CQ2 v WHERE x.id = u.id and y.id = v.id and u.qgram = v.qgram GROUP BY x.id, y.id HAVING count(*) > 6 23

24 Edit Distance Sometimes none of the similarity/distance measures are good enough Need to use edit distance Models more accurately the typing mistakes Slower than Jaccard or Hamming Use Jaccard or Hamming for pruning, then compute edit distance to remove false positives 24

25 The Levenstein Distance Distance is shortest sequence of edit commands that transform s to t. Simplest set of operations: Copy character from s over to t Delete a character in s (cost 1) Insert a character in t (cost 1) Substitute one character for another (cost 1)

26 Levenstein distance - example distance( William Cohen, Willliam Cohon ) s1 W I L L I A M _ C O H E N s2 op W I L L L I A M _ C O H O N C C C C I C C C C C C C S C cost

27 Levenstein distance - example distance( William Cohen, Willliam Cohon ) s1 W I L L gap I A M _ C O H E N s2 W I L L L I A M _ C O H O N C C C C I C C C C C C C S C op cost

28 Computing Levenstein distance 1/4 D(i,j) = score of best alignment from s1..si to t1..tj = min D(i-1,j-1), if si=tj //copy D(i-1,j-1)+1, if si!=tj //substitute D(i-1,j)+1 //insert D(i,j-1)+1 //delete

29 Computing Levenstein distance 2/4 D(i,j) = score of best alignment from s1..si to t1..tj = min D(i-1,j-1) + d(si,tj) //subst/copy D(i-1,j)+1 //insert D(i,j-1)+1 //delete (simplify by letting d(c,d)=0 if c=d, 1 else) also let D(i,0)=i (for i inserts) and D(0,j)=j

30 Computing Levenstein distance 3/4 D(i-1,j-1) + d(si,tj) //subst/copy D(i,j)= min D(i-1,j)+1 //insert D(i,j-1)+1 //delete C O H E N M C C O H N = D(s,t)

31 Computing Levenstein distance 4/4 D(i-1,j-1) + d(si,tj) //subst/copy D(i,j) = min D(i-1,j)+1 //insert D(i,j-1)+1 //delete A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (may be more than 1) C O H E N M C C O H N

32 Needleman-Wunch distance D(i-1,j-1) + d(si,tj) //subst/copy D(i,j) = min D(i-1,j) + G //insert D(i,j-1) + G //delete d(c,d) is an arbitrary distance function on characters (e.g. related to typo frequencies, amino acid substitutibility, etc) G = gap cost William Cohen Wukkuan Cigeb

33 Pre-filtering If editdistance(s1, s2) k, then Qgrams(s1) Qgrams(s2) v Where v = max( s1, s2 ) 1 (k-1) * q Where q = length of the q-gram Thus, can use any of the signature schemes 33

34 Jaccard Recap: Jaccard similarity: J(S, T) = S T / S T Problem: given large collection of sets S1, S2,, Sn, and given a threshold s, find all pairs Si, Sj s.t. J(Si, Sj) > s 34

35 Similarity Joins CSEP514 - Winter

36 Similarity joins Problem: We are given n sets s1, s2,..., sn Find quickly all pairs s.t. J(Si,Sj) > s False positives, false negatives are OK 36

37 Application 1: Collaborative Filtering We have n customers: 1, 2,, n Each customer i buys a set of items Si We would like to recommend items bought by customer j, if J(Si, Sj) > s 37

38 Application 2: Similar Documents Given n documents d1, d2,, dn Let Si be the set of q-grams for document i Want to find all pairs of similar documents, i.e. for which J(Si, Sj) > s 38

39 Example You work for a copyright violation detection company Customer: has 10 6 documents Web: has pages Your job is to find almost identical documents 39

40 The Signature Method For each Si, compute a signature Sig(Si) s.t. J(Si,Sj) > s çè Sig(Si) Sig(Sj) emptyset With high probability 40

41 The Signature Method Step 1: compute all pairs i,j for which Sig(Si) Sig(Sj) emptyset This is a join operation! Step 2: for all such pairs, return (i,j) if J(Si,Sj) > s Hopefully only a few such pairs Both false positives and false negatives are possible 41

42 Discussion We will assume that each Si is a string, and we associate it with the set of its q-grams The signature Sig(Si) is also a set of something (to be explained) CSEP514 - Winter

43 Signature 1 = the q-grams! Obviously: s1 s2 > k è s1 s2 emptyset SELECT DISTINCT x.*, y.* FROM Company1 x, Company2 y, CQ1 u, CQ2 v WHERE x.id = u.id and y.id = v.id and u.qgram = v.qgram Lots of false positives 43

44 Signature 2: Prefix Filter Suppose all sets si have same cardinality L Order the q-grams lexicographically Define sig(s) = {the first L-k+1 smallest q- grams in q} 44

45 Signature 2: Prefix Filter Example: L = 8, k = 6, L k + 1 = 3 s1 = {was, ash, shi, hin, ing, ngt, gto, ton} = {ash, gto, hin, ing, ngt, shi, ton, was} Sig(s1) = {ash, gto, hin} 45

46 Signature 2: Prefix Filter Fact: if s1 s2 > k then sig(s1) sig(s2) emptyset Why? 46

47 Signature 3: Minhash Let π be an arbitrary permutation of the domain For each i, let: mh(si) = {the smallest element in Si according to π} 47

48 Example The entire domain is {a,b,c,d,e,f,g,h} The set Si is Si = {a,b,c,e,f} Suppose we choose the permutation: π = d,g,c,h,b,f,a,e Then what is mh(si) =? 48

49 Minhash Main property: Probability(mh(Si) = mh(sj)) = J(Si,Sj) Why? 49

50 Warmup Question Choose a random permutation π of {a,b,c,z} S= a b c d What is the probability that mh(s) = c? 50

51 Warmup Question Choose a random permutation π of {a,b,c,z} S= a b c d What is the probability that mh(s) = c? Answer: P = ¼ (each of a,b,c,d can be the min) 51

52 Main Property Sj Si f a b c e d What is Prob(mh(Si) = mh(sj))? 52

53 Main Property Sj Si f a b c e d What is Prob(mh(Si) = mh(sj))? 53 Answer: 2/6 = J(Si,Sj)

54 Computing Minhashes We use a hash function (which we assume is random) mh(s) { v = ; forall x in S do if h(x) < v then {v = h(x); y = x;} return y; } 54

55 Example The set Si is Si = {a,b,c,e,f} Compute h: h(a)=77,h(b)=55,h(c)=33,h(e)=88,h(f)=66 Then what is mh(si) =? 55

56 Using Minhashes for Fuzzy Join Recall: we have n sets S1,, Sn Want to return all pairs s.t. J(Si,Sj) > s Compute mh(s1),, mh(sn) Return pairs for which mh(si)=mh(sj) The probability of returning (Si,Sj) is J(Si,Sj) Probability of a false positive = J(Si,Sj) Probability of a false negative = 1-J(Si,Sj) 56

57 Example We have n = 1,000,000 records S1,..., Sn We want pairs s.t. J(Si,Sj) > 0.99 There are ordered pairs (Si,Sj) Suppose 2,000,000 pairs have J(Si,Sj)>0.99 the others have J(Si,Sj) = 0.5 # false positives = 0.5 * This is bad # false negatives = 0.01 * 2,000,000 = 20,000

58 Naïve Idea Use r independent hash functions h 1,, h r For each Si, compute a block of minhashes: MH(Si) = the r minhashes for each j=1,,r Return all pairs where MH(Si)=MH(Sj) The probability of returning (Si,Sj) is (J(Si,Sj)) r Probability of a false positive = (J(Si,Sj)) r Probability of a false negative = 1-(J(Si,Sj)) r 58

59 Example r = 2 The set Si is Si = {a,b,c,e,f} Compute h1: h1(a)=77,h1(b)=55,h1(c)=33,h1(e)=88,h1(f)=66 Compute h2: h2(a)=22,h2(b)=66,h2(c)=55,h2(e)=11,h2(f)=44 Then what is MH(Si) =? 59

60 Example r = 2 The set Si is Si = {a,b,c,e,f} Compute h1: h1(a)=77,h1(b)=55,h1(c)=33,h1(e)=88,h1(f)=66 Compute h2: h2(a)=22,h2(b)=66,h2(c)=55,h2(e)=11,h2(f)=44 Then what is MH(Si) =? Answer: MH(Si) = ce (an ordered pair) 60

61 Signature 4: LSH Locality Sensitive Hashing b blocks of minhashes MH 1,, MH b Each MH k has size r Total size of the signature: m = b*r Return pairs (Si,Sj) s.t. there exists k MH k (S i ) = MH k (S j ) Still a single simple join! 61

62 Example n=7 m=6 b=3 r=2 S1 S2 S3 S4 S5 S6 S r rows Each rectangle (=band) becomes one new hash value

63 Analysis Goal: want to compute the probability that Sig(MH(Si)) Sig(MH(Sj)) emptyset, as a function of s = J(Si, Sj) Let s = J(Si, Sj) 63

64 Analysis Si Sj J(Si,Sj) = s What is the probability that two entries are equal?

65 Analysis Si Sj J(Si,Sj) = s What is the probability that two entries are equal? 7 7 Answer: s 65

66 Analysis Si Sj J(Si,Sj) = s What is the probability that two entries are equal? r rows 66

67 Analysis Si Sj J(Si,Sj) = s What is the probability that two entries are equal? r rows Answer: s r 67

68 Analysis Si Sj J(Si,Sj) = s What is the probability that some pair of bands are equal? b bands 68

69 Analysis Si Sj J(Si,Sj) = s What is the probability that some pair of bands are equal? b bands Answer: 1-(1-s r 69) b

70 This is precisely the probability that Sig(Si) Sig(Sj ) emptyset Si Sj J(Si,Sj) = s What is the probability that some pair of bands are equal? b bands Answer: 1-(1-s r ) b 70

71 Analysis 1 P(s) = 1-(1-s r ) b 0 s 0 (1/b) 1/r 71 1

72 Putting it together You work for a copyright violation detection company Customers: have documents 1, 2, 3,., 10 6 Web: has pages 1, 2, 3,., Your job is to find almost identical documents 72

73 Doc DocID Qgram 1 To 1 o b 1 be 1 be 1 e o 1 or 1 or 1 r n 1 no Step 1: Q-grams Web URL Qgram abc.com Oba abc.com Bam abc.com ama abc.com ma abc.com a h bcd.com

74 Step 2: Compute m Min-hashes docid mh1 mh2 mh3 mh4 mh To que Ion or Not url mh1 mh2 mh3 mh4 mh50 0 abc.com sec ret def ens bcd.com Note: m = a few hundreds 74

75 Step 3: Compute Signatures docid h(mh1,,mh20) h(mh21,,mh40) docsig docid Sig websig url Sig @1 abc.com @ @2 abc.com 3232@2 1 abc.com @25 abc.com @ @1 bcd.com 23423@

76 Step 4: Find docs with common signatures SELECT DISTINCT docsig.docid, websig.url FROM docsig, websig WHERE docsig.sig = websig.sig Note: this is a SINGLE JOIN query!! Still need to filter out the false positives (easy) 76

CSE 321 Discrete Structures

CSE 321 Discrete Structures March 3 rd, 2010 Lecture 22 (Supplement): LSH 1 Jaccard Jaccard similarity: J(S, T) = S*T / S U T Problem: given large collection of sets S1, S2,, Sn, and given a threshold