COMPSCI 514: Algorithms for Data Science

Arya Mazumdar, University of Massachusetts at Amherst, Fall 2018

Lecture 9: Similarity Queries

A few words about the exam
The exam is on Thursday (Oct 4), two days from now, in this room at this time.
Duration: 1 hour.
Syllabus: everything up to the end of the last class.
Focus: clustering.
An extensive list of chapters was provided on Piazza; concentrate on the topics that were discussed in detail in class.

From the two books
From the Blum et al. book:
  Ch 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9.1, 3.9.2, 3.9.4. All exercises in Ch 3 are relevant.
  Ch 4.1, 4.8. Ex. 4.1-4.5, 4.11, 4.12, 4.25, 4.55-4.57.
  Ch 7.1, 7.2, 7.3, 7.4, 7.5. Ex. 7.1-7.21, 7.25-7.29.
From MMDS (exercises are at the end of each section):
  Ch 5.1, 5.5
  Ch 7.1, 7.3, 7.4
  Ch 10.4
  Ch 11.1, 11.2, 11.3
Only things that were covered in class are in the syllabus.
Concentrate on topics that were not covered by the first homework.
It is fine if you learn something by mistake.

Finding similar items
[Hays and Efros, SIGGRAPH 2007]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Finding near-neighbors in a high dimensional space
Pages with similar words
  For data de-duplication, classification by topics
Customers who purchased similar products
  Product classification
Images with similar features
The space is high dimensional
The database can be huge

Finding near-neighbors in a high dimensional space
Near neighbors: points that are only a small distance away
We have to define a distance or similarity, for example the Jaccard distance/similarity
For any two sets A and B, the Jaccard similarity is $|A \cap B| / |A \cup B|$
For any two sets A and B, the Jaccard distance is $1 - |A \cap B| / |A \cup B|$

Jaccard metric
For any two sets A and B, the Jaccard similarity is $|A \cap B| / |A \cup B|$
For any two sets A and B, the Jaccard distance is $1 - |A \cap B| / |A \cup B|$

... into one of set intersection, we use a technique called shingling, which is introduced in Section 3.2.

3.1.1 Jaccard Similarity of Sets
The Jaccard similarity of sets S and T is $|S \cap T| / |S \cup T|$, that is, the ratio of the size of the intersection of S and T to the size of their union. We shall denote the Jaccard similarity of S and T by SIM(S, T).

Example 3.1: In Fig. 3.1 we see two sets S and T. There are three elements in their intersection and a total of eight elements that appear in S or T or both. Thus, SIM(S, T) = 3/8.

Figure 3.1: Two sets with Jaccard similarity 3/8 (similarity = 3/8; distance = 5/8)
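The two definitions can be checked on a small case. The following is a minimal sketch, not code from the lecture: the function names and the concrete integer sets are assumptions, chosen only so that the sets share 3 of 8 elements as in Example 3.1.

```python
def jaccard_similarity(a, b):
    """Return |a ∩ b| / |a ∪ b| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption of this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Return 1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical sets: 3 common elements, 8 elements in the union.
S = {1, 2, 3, 4, 5}
T = {3, 4, 5, 6, 7, 8}
similarity = jaccard_similarity(S, T)  # 3/8
distance = jaccard_distance(S, T)      # 5/8
```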

Running example: Similarity of documents
Jaccard distance for textual similarity
To find plagiarism
  No simple automated process (documents are not exactly the same)
To find the same news from different sources
To match mirror pages

Documents to sets: Shingling
Jaccard distance for textual similarity
Document: a string of characters
k-shingle: a substring of length k within the document
Example: abcdabd
  Set of 2-shingles: {ab, bc, cd, da, bd}
  Bag of 2-shingles: {ab, bc, cd, da, ab, bd}
We might remove whitespace altogether
We might replace a sequence of whitespace characters by a single whitespace
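The shingling step above can be sketched in a few lines of Python. The function names are assumptions made for illustration; the example string is the one from the slide.

```python
from collections import Counter


def shingle_set(doc, k):
    """Set of all k-shingles (length-k substrings) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def shingle_bag(doc, k):
    """Bag (multiset) of k-shingles, keeping repeated shingles."""
    return Counter(doc[i:i + k] for i in range(len(doc) - k + 1))


def collapse_whitespace(doc):
    """Replace every run of whitespace by a single space."""
    return " ".join(doc.split())


set_2 = shingle_set("abcdabd", 2)  # {ab, bc, cd, da, bd}
bag_2 = shingle_bag("abcdabd", 2)  # ab appears twice
```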

Shingling: choosing the shingle size
k = 1: bad
k should be picked large enough that the probability of any given shingle appearing in any given document is low.
k very large: not enough statistics: bad
For large documents, k = 9 is considered safe
Hashing shingles: give each shingle a number and then, instead of the set of shingles, keep the set of numbers
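One possible way to "give each shingle a number" is to hash it to a 32-bit integer. The sketch below uses CRC32 from Python's standard library purely as an illustrative choice; the lecture does not prescribe a particular hash function, and the function name is an assumption.

```python
import zlib


def hashed_shingles(doc, k=9):
    """Set of 32-bit integers, one per distinct k-shingle of doc.

    CRC32 is an assumption of this sketch; any hash that spreads
    shingles well over the integer range would serve the same purpose.
    """
    return {
        zlib.crc32(doc[i:i + k].encode("utf-8"))
        for i in range(len(doc) - k + 1)
    }


numbers = hashed_shingles("abcdabd", k=2)
```

Storing a 4-byte number instead of a 9-character shingle reduces space, and Jaccard similarity can then be computed on the sets of numbers.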

Shingles built from words
First rule: avoid stop words: and, such, to, etc.
Use a stop word followed by the next two or three words: "and other world leaders", etc.

Summarizing the shingles
How to go from sets to signatures: the characteristic matrix of sets

3.3.1 Matrix Representation of Sets
Before explaining how it is possible to construct small signatures from large sets, it is helpful to visualize a collection of sets as their characteristic matrix. The columns of the matrix correspond to the sets, and the rows correspond to elements of the universal set from which elements of the sets are drawn. There is a 1 in row r and column c if the element for row r is a member of the set for column c. Otherwise the value in position (r, c) is 0.

Element  S1  S2  S3  S4
a         1   0   0   1
b         0   0   1   0
c         0   1   0   1
d         1   0   1   1
e         0   0   1   0

Figure 3.2: A matrix representing four sets

U = {a, b, c, d, e}; S1 = {a, d}; S2 = {c}, etc.

Example 3.6: In Fig. 3.2 is an example of a matrix representing sets chosen from the universal set {a, b, c, d, e}. Here, S1 = {a, d}, S2 = {c}, S3 = {b, d, e}, and S4 = {a, c, d}. The top row and leftmost column are not part of the matrix, but are present only to remind us what the rows and columns represent.
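As a small illustration, the characteristic matrix of Example 3.6 can be built directly from the sets. This is a sketch only; the function name and the list-of-lists layout are assumptions, not something the lecture specifies.

```python
def characteristic_matrix(universe, sets):
    """Rows follow the order of universe, columns follow the order of sets.

    Entry (r, c) is 1 if the element for row r belongs to the set for
    column c, and 0 otherwise.
    """
    return [[1 if element in s else 0 for s in sets] for element in universe]


universe = ["a", "b", "c", "d", "e"]
sets = [{"a", "d"}, {"c"}, {"b", "d", "e"}, {"a", "c", "d"}]  # S1..S4
matrix = characteristic_matrix(universe, sets)
```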

Sets to Signatures: Minhashing
Hash the columns of the characteristic matrix
How? Pick a permutation of the rows
The hash value of a column is the element in the first row, in the permuted order, where the column has a 1

Minhashing
In this section, we shall learn how a minhash is computed in principle, and in later sections we shall see how a good approximation to minhash is computed in practice.

To minhash a set represented by a column of the characteristic matrix, pick a permutation of the rows. The minhash value of any column is the number of the first row, in the permuted order, in which the column has a 1.

Example 3.7: Let us suppose we pick the order of rows beadc for the matrix of Fig. 3.2. This permutation defines a minhash function h that maps sets to rows. Let us compute the minhash value of set S1 according to h. The first column, which is the column for set S1, has 0 in row b, so we proceed to row e, the second in the permuted order. There is again a 0 in the column for S1, so we proceed to row a, where we find a 1. Thus, h(S1) = a.

Permutation of the rows: b, e, a, d, c

Element  S1  S2  S3  S4
b         0   0   1   0
e         0   0   1   0
a         1   0   0   1
d         1   0   1   1
c         0   1   0   1

Figure 3.3: A permutation of the rows of Fig. 3.2

h(S1) = a; h(S2) = c; h(S3) = b; h(S4) = a

It is important to remember that the characteristic matrix is unlikely to be the way the data is stored, but it is useful as a way to visualize the data. For one reason not to store data as a matrix, these matrices are almost always sparse (they have many more 0's than 1's) in practice. It saves space to represent a sparse matrix of 0's and 1's by the positions in which the 1's appear. For another ...
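The in-principle computation of Example 3.7 can be sketched as a scan over the rows in permuted order. The function name and the dictionary of sets are assumptions made for this illustration; the sets and the row order beadc come from the example.

```python
def minhash(column_set, permuted_rows):
    """Return the first row, in permuted order, whose element is in the set."""
    for row in permuted_rows:
        if row in column_set:
            return row
    return None  # Assumption of this sketch: an empty set has no minhash.


sets = {
    "S1": {"a", "d"},
    "S2": {"c"},
    "S3": {"b", "d", "e"},
    "S4": {"a", "c", "d"},
}
order = ["b", "e", "a", "d", "c"]  # the permutation beadc
values = {name: minhash(s, order) for name, s in sets.items()}
```

Working on the sets themselves, rather than on matrix columns, matches the remark that the characteristic matrix is usually too sparse to store explicitly.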

Why Minhashing?
For any two sets S and T, the probability that their minhashes are equal, under a uniformly random permutation of the rows, is
$P(h(S) = h(T)) = \mathrm{SIM}(S, T)$,
the Jaccard similarity of the two sets. Why?

Rows of Fig. 3.2 in the order b, e, a, d, c:

Element  S1  S2  S3  S4
b         0   0   1   0
e         0   0   1   0
a         1   0   0   1
d         1   0   1   1
c         0   1   0   1
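One standard way to answer the "Why?" is to classify the rows by what the two columns contain. The row types X, Y, Z and the counts x, y below are notation introduced for this sketch, and the argument assumes that S and T are not both empty.

```latex
% Row types for the columns of S and T (notation assumed for this sketch):
%   type X: both columns have a 1
%   type Y: exactly one of the two columns has a 1
%   type Z: both columns have a 0
\begin{align*}
x &= |S \cap T| && \text{(number of type X rows)} \\
y &= |S \cup T| - |S \cap T| && \text{(number of type Y rows)} \\
\mathrm{SIM}(S, T) &= \frac{|S \cap T|}{|S \cup T|} = \frac{x}{x + y}
\end{align*}
% Scan the rows in the permuted order. Type Z rows affect neither h(S) nor h(T).
% Under a uniformly random permutation, the first row that is not of type Z is
% equally likely to be any of the x + y rows of type X or Y.
% If it is of type X, both columns have their first 1 there, so h(S) = h(T).
% If it is of type Y, only one column has a 1 there, so h(S) \neq h(T).
\[
P\bigl(h(S) = h(T)\bigr) = \frac{x}{x + y} = \mathrm{SIM}(S, T)
\]
```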

Minhash signature
Take n (say 100) random permutations
Each defines a minhash function: $h_1, h_2, \ldots, h_n$
The minhash signature of S is the vector $[h_1(S), h_2(S), \ldots, h_n(S)]$
n is much smaller than the number of rows of the characteristic matrix
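A signature built from explicit random permutations can be sketched as follows. This follows the in-principle definition only; the function names, the fixed seed, and the use of the fraction of agreeing positions as a similarity estimate are assumptions of this sketch, and a practical system would typically avoid materializing permutations of a large universe.

```python
import random


def random_permutations(universe, n, seed=0):
    """Return n independent random orderings of the universe (seeded for repeatability)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(n):
        perm = list(universe)
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def minhash_signature(s, permutations):
    """Vector [h_1(S), ..., h_n(S)]: first element of s in each permuted order."""
    return [next((row for row in perm if row in s), None) for perm in permutations]


def estimate_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


universe = ["a", "b", "c", "d", "e"]
S1, S2, S3, S4 = {"a", "d"}, {"c"}, {"b", "d", "e"}, {"a", "c", "d"}
perms = random_permutations(universe, n=100)
sig1 = minhash_signature(S1, perms)
sig2 = minhash_signature(S2, perms)
sig3 = minhash_signature(S3, perms)
sig4 = minhash_signature(S4, perms)
estimate_1_4 = estimate_similarity(sig1, sig4)  # approximates SIM(S1, S4) = 2/3
```

Because each position agrees with probability equal to the Jaccard similarity, the fraction of agreeing positions is an estimate of that similarity, and it becomes more accurate as n grows.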