Today s topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara

Size: px

Start display at page:

Download "Today s topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara"

Rosemary Cain
6 years ago
Views:

1 Spring 2017 W5.B.1 CS435 BIG DATA Today s topics PART 1. LARGE SCALE DATA ANALYSIS USING MAPREDUCE FAQs Minhash Minhash signature Calculating Minhash with MapReduce Locality Sensitive Hashing Sangmi Lee Pallickara Computer Science, W5.B.2 W5.B.3 FAQs Compute the similarity between the two strings s1=farmers Insurance, s2 = Liberty Insurance V W CosSimilarity(V,W ) = cos(α) = V W = What is the Jaccard similarity for the same case? W5.B.4 How can we compute TF/IDF based similarity analysis? W5.B.5 Programming Assignment 2 Using n-grams 1

2 Spring 2017 W5.B.6 W5.B.7 Using n-grams A string is divided into smaller tokens of size n. q-grams or n-grams Size of n-gram is string with a length n Size of n-gram words is string with a length n words Generating n-grams Slide a window of size n over the string Generating n-grams s1 = Henri Waternoose s2 = Henry Waternose Generate 3-grams Q-grams of s1 = {##H, #He, Hen, enr, nri, ri_, i_w, _Wa, Wat, ate, ter, ern, rno, noo, oos, ose, se#, e## Q-grams of s2 = {##H, #He, Hen, enr, nry, ry_, y_w, _Wa, Wat, ate, ter, ern, rno, nos, ose, se#, e## W5.B.8 W5.B.9 n-gram based token similarity n-grams of s1 = {##H, #He, Hen, enr, nri, ri_, i_w, _Wa, Wat, ate, ter, ern, rno, noo, oos, ose, se#, e## n-grams of s2 = {##H, #He, Hen, enr, nry, ry_, y_w, _Wa, Wat, ate, ter, ern, rno, nos, ose, se#, e## 13 overlaps among total 22 distinct q-grams. Overlaps: number of two item pairs having overlap Jaccard similarity StringJaccard(s 1,s 2 ) = = 0.59 n-gram based token similarity n-grams of s1 = {##H, #He, Hen, enr, nri, ri_, i_w, _Wa, Wat, ate, ter, ern, rno, noo, oos, ose, se#, e## n-grams of s2 = {##H, #He, Hen, enr, nry, ry_, y_w, _Wa, Wat, ate, ter, ern, rno, nos, ose, se#, e## Using the cosine similarity with tf-idf weights CosSimilarity(V,W ) =? W5.B.10 W5.B.11 Using n-grams Using the same similarity measures used in other token based similarity computations How large should n be? [1/2] n should be picked large enough that the probability of any given n-gram appearing in any given document is low Less sensitive to typographical errors discretization Vs. discretisation What if we change the size of n? Our corpus of documents is s Picking n = 5 Printable ASCII characters 27 5 = 14,348,907 possible 5-grams Is it really true with real s? 2

3 Spring 2017 W5.B.12 W5.B.13 How large should n be? [2/2] All characters do not appear with equal probability Imagine that there are only 20 characters and estimate the number of n-grams as 20 n For research articles, a choice of n = 9 is considered safe Applications of Near-neighbor search W5.B.14 W5.B.15 Near Neighbor(NN) Search Also known as proximity search, similarity search Finding closest or most similar points The post-office problem Assigning to a residence the nearest post office Hashing n-grams Creating buckets and use the bucket numbers as the shingles For the set of 9-grams, Map each of those 9-grams to a bucket number in the range 0 to bytes of data is compacted to 4 bytes Can hash-based approaches provide NN search results? W5.B.16 W5.B.17 Similarity-preserving summaries of set Signatures Replacing large sets of n-grams by much smaller representations Applications of Near-neighbor search: MinHashing We should be able to compare the signatures of two sets and estimate the Jaccard similarity Of the underlying sets from the signatures alone 3

4 Spring 2017 W5.B.18 W5.B.19 Matrix representations of sets Element {a,b,c,d,e S1 = {a,d S2 = {c S3 = {b,d,e S4 = {a,c,d MinHashing Signature generating algorithm Minhash of the characteristic matrix Minhash(π) of a set is the number of the row (element) with first non-zero in the permuted order π π = (b,e,a,d,c) Element Element Permuted order 1 Permuted order 2 W5.B.20 W5.B.21 MinHash and Jaccard Similarity Theorem: P(minhash(S) = minhash (T)) =JaccardSIM (S,T) MinHash and Jaccard Similarity Theorem: P(minhash(S) = minhash (T)) =JaccardSIM (S,T) Proof: X = number of rows with 1 for both S and T (e.g. x = 1) Y = number of rows with either S or T have 1, but not both (e.g. y = 2) Z = number of rows with both 0 (e.g. z = 2) Element S T b 0 0 e 0 0 a 1 0 d 1 1 c 0 1 Proof: X = number of rows with 1 for both S and T (e.g. x = 1) Y = number of rows with either S or T have 1, but not both (e.g. y = 2) Z = number of rows with both 0 (e.g. z = 2) Element S T b 0 0 e 0 0 a 1 0 d 1 1 c 0 1 P(minhash(S) = minhash (T)) Probability that a row of type X is before all of the rows of type Y in a random permuted order is P(minhash(S) = minhash (T)) Probability that a row of type X is before all of the rows of type Y in a random permuted order is, X/(X+Y) = JaccardSIM (S,T) W5.B.22 W5.B.23 MinHash and permuting the order What if we change the order of Elements? Can we simulate the effect of a random permutation? It is NOT feasible to permute a large characteristic matrix explicitly N element will need N! permutations! Element Element Permuted order 1 Permuted order 2 MinHash Signatures Pick (at random) some number n of permutations of the rows of the characteristic matrix M E.g. 100 permutations or several hundred permutations The minhash functions are determined by these permutations h 1, h 2, h 3,. h n MinHash signature for S Vector [h 1 (S), h 2 (S), h 3 (S),, h n (S)] 4

25 Representing the MinHash Signatures Form a signature matrix The i th column of M is replaced by the min hash signature of the i th column Compressed form for a sparse matrix Element >10 9 elements?

5 Spring 2017 W5.B.24 W5.B.25 Representing the MinHash Signatures Form a signature matrix The i th column of M is replaced by the min hash signature of the i th column Compressed form for a sparse matrix Element >10 9 elements? Using a random hash function [1/2] A Random hash function Maps row numbers to as many buckets as there are rows 0,1,, k-1 to bucket numbers 0 ~ k-1 Maps some pairs of integers to the same bucket Leaves other buckets unfilled However, not too many collisions W5.B.26 W5.B.27 Using a random hash function [2/2] Pick n randomly chosen hash functions h 1, h 2,. h n on the rows In the signature matrix, let SIG(i,c) be the element of the signature matrix for the i th hash function and column c Initially, the set SIG(i,c) to for all i and c For each row r Compute h 1 (r), h 2 (r),. h n (r) For each column c If c has 0 in row, do nothing If c has 1 in row r, then for each i = 1,2,,n set SIG(i,c) to the smaller of the current value of SIG(i,c) and h i (r) Row (element) S1 Row S2 (element) S1 S3 S2 S4 S3 Hash S4 1 Hash M 0 1 Hash Hash N-2 1 Hash 0 M N-1 0 Hash 0 M N>>M W5.B.28 W5.B.29 (Row # +1) mod 5 Row (element) Hash Hash Hash M Hash M Row (element) X+1 mod x +1 mod 5 (3 x Row # +1) mod 5 5

6 Spring 2017 W5.B.30 W5.B.31 h1 h2 h 1 (x) = (x + 1) mod 5 h 2 (x) = (3x +1) mod 5 In the previous slide, h 1 (0) and h 2 (0) are both 1 The row numbered 0 has 1 s in S1 and S4 h1 1 1 h2 1 1 h1 1 1 h2 1 1 Row number 1 Only in S3 is 1 Hash value H 1 (1)=2 and h 2 (1)=4 h h W5.B.32 W5.B.33 h h Row number 2 S2 and S4 Hash value h 1 (2)=3 and h 2 (2)=2 h h Row number 3 S1, S3 and S4 have 1 Hash value h 1 (3)=4 and h 2 (3)=0 h h h h Calculating a single MinHash value with MapReduce W5.B.34 Suppose you have some documents and have stored n-grams of these documents in a large table W5.B.35 Approach 1. Partitioning by rows The table must be partitioned across the processors by rows Each processor receives a subset of the n-grams for every document Each column n-grams for a single document Each row r r th n-gram for all the documents Can each processor compute the minhash for each document? No. It should pass the hash values and group them based on the document ID (Assume that document ID is stored separately) Compute a minhash value for each of your documents using a single hash function 6

7 Spring 2017 W5.B.36 W5.B.37 Design your map function What should be the input <key, value>? What should be the output <key, value>? map (k: docid, v: (row_number,freq) { if (freq >0) emit_intermediate(ik = k, iv =hash(row_number)) Design your reduce function MapReduce system will group the hashes by documentid Reduce function will find the minimum hash value within the group reduce (ik: docid, ivs: hashval[]) { var minhash := INFINITY for each iv in ivs { if iv < minhash { minhash := iv emit_final(fk = ik, fv = minhash) W5.B.38 W5.B.39 Approach 2. Partitioning by column All the k-grams for a document go to the same processor Do you need a reduce function? map (k: docid, v: kgram_row_with_value[]) { var minhash := INFINITY for each kgram with non 0 vlaue in v { var h := hash(kgram_row) if h < minhash { minhash := h emit_intermediate(ik = k, iv = minhash) W5.B.40 W5.B.41 Finding the most similar documents Number of pairs of documents to compare is too high Locality sensitive hashing 1M documents, signatures of length 250 ( 4 Byte each ) 1M x 1,000Bytes = 1GB Number of comparisons = 1M C 2 (Half of trillion pairs) 1ms per calculation of similarity 6 days to complete computing 7

8 Spring 2017 W5.B.42 W5.B.43 Locality-sensitive hashing (LSH) Do we need to calculate the similarity for all of the pairs? Near-neighbor search LSH Hash items several times, so that similar items are more likely to be hashed to the same bucket than dissimilar items Assign the candidate pair to the same bucket Reduce false positives Dissimilar pairs in the same bucket Reduce false negatives Similar pairs in different buckets W5.B.44 Dividing a signature matrix into 4 bands and 3 rows per band BAND 1 BAND 2 BAND 3 D 10 D 11 D 12 D D11(1,2,1) and D13 (1,2,1) will go to the same bucket (1,3,1) and (0,3,0) will NOT go to the same bucket unless they show the same values in the other band The first band of MinHash signature W5.B.45 Analysis of the Banding Technique [1/2] Suppose that we use b bands of r rows each Suppose that a particular pair of documents have Jaccard Similarity s P(the minhash signatures for these documents agree in any one particular row of the signature matrix) = s BAND 4 W5.B.46 W5.B.47 Analysis of the Banding Technique [2/2] The probability that documents (signatures) become a candidate pair (Signatures should match at least in ONE band): Threshold The value of similarity s to determine two documents are similar An approximation of the threshold is (1/b) 1/r P(the signatures agree in all rows of one particular band)=s r P(the signatures do not agree in at least one row of a particular band)=1-s r P(the signatures do not agree in all rows of any of the bands)= (1-s r ) b P(the signatures agree in all the rows of at least one band)= 1-(1-s r ) b 8

9 Spring 2017 W5.B.48 W5.B.49 For example If there are 16 bands and each band contains 4 rows S=0.5 If there are 8 bands and each band contains 8 rows S=0.77 If there are 4 bands and each band contains 16 rows S=0.91 Suppose that there are 20 bands (b=20), and each band includes 5 rows (r=5). P(the signatures agree in all the rows of at least one band)=1-(1-s r ) b s 1-(1-s r ) b

Finding similar items

Finding similar items CSE 344, section 10 June 2, 2011 In this section, we ll go through some examples of finding similar item sets. We ll directly compare all pairs of sets being considered using the