Assignment 3: Chapter 2 & 3 (2.6, 3.8)


Neha Aggarwal
Comp 578 Data Mining, Fall

2.6 Q.18 This exercise compares and contrasts some similarity and distance measures.

(a) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.

x = 0101010001
y = 0100011000

Ans. Hamming distance: the Hamming distance is the number of bit positions at which the two vectors differ. Here x and y differ in the 4th, 7th, and 10th positions, so the Hamming distance is 3.

Jaccard measure: let f_ij denote the number of positions at which x has value i and y has value j. Then
f_00 = 5, f_01 = 1, f_10 = 2, f_11 = 2
J = f_11 / (f_01 + f_10 + f_11) = 2/5 = 0.4
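As a quick check, here is a minimal Python sketch (assuming the two vectors given above) that computes both measures:

# Hamming distance and Jaccard similarity for two binary vectors
x = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
y = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0]

# Hamming distance: number of positions where the bits differ
hamming = sum(a != b for a, b in zip(x, y))

# Jaccard similarity: f11 / (f01 + f10 + f11); 0-0 matches are ignored
f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
jaccard = f11 / (f01 + f10 + f11)

print(hamming)   # 3
print(jaccard)   # 0.4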

(b) Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain.

Ans. Hamming distance is a dissimilarity measure that counts the number of positions at which the two vectors differ. For binary vectors this amounts to an XOR operation: it counts the positions whose values are (0,1) or (1,0). In part (a), the Hamming distance is 3.

The Simple Matching Coefficient (SMC) is a similarity measure that is useful when the data are symmetric, i.e., when both 0 and 1 values carry information. SMC counts the bit positions that are equal. From the example in (a),
SMC = (f_11 + f_00) / (f_01 + f_10 + f_11 + f_00) = 7/10

Although SMC is a similarity measure, it can easily be turned into a dissimilarity measure as 1 - SMC; in the example above this gives 3/10. Both the Hamming distance and SMC can therefore be used to measure the dissimilarity between vectors. The only difference is that 1 - SMC expresses the number of mismatching bits as a fraction of the total number of bits, whereas the Hamming distance simply counts the mismatching bits. The Hamming distance is therefore the approach more similar to SMC.

Jaccard measure: assume that x and y describe two transactions, with binary attributes where 0 means an item was not purchased and 1 means it was purchased. When measuring the similarity between x and y, an attribute where both values are 0 carries little information, since 0 only signifies that the item was not purchased. Because the number of items a customer does not purchase far outweighs the number purchased, the Jaccard measure discards these (0,0) matches, which also reduces computation.

Cosine similarity is mostly used to measure document similarity. Since two documents are unlikely to contain many of the same words, similarity should not depend on the number of shared 0 values. Both the Jaccard measure and cosine similarity ignore 0-0 matches, so Jaccard is the approach more similar to the cosine measure; cosine similarity, however, can also handle non-binary vectors.

(c) Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, you think would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)

Ans. The Hamming distance between two vectors is the number of positions at which the two values differ; in other words, it is the minimum number of bits that must be changed to convert one binary vector into the other. The Jaccard measure is a similarity measure that ignores 0-0 matches, i.e., it treats the absence of an attribute from both vectors as carrying no real information. Although different species share many genes, their gene structures still differ considerably, and different species may have different numbers of genes. The Jaccard measure ignores the genes that are absent from both organisms and counts only the matches between the two gene vectors, whereas the Hamming distance mainly reflects how the gene structures differ. Thus, the similarity between organisms of different species is more accurately represented by the Jaccard similarity.
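To make the relationship between SMC and the Hamming distance concrete, a small sketch using the same vectors as in part (a):

x = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
y = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0]
n = len(x)

matches = sum(a == b for a, b in zip(x, y))    # f11 + f00 = 7
smc = matches / n                              # 7/10 = 0.7, so 1 - SMC = 3/10
hamming = sum(a != b for a, b in zip(x, y))    # 3
print(smc, hamming, hamming == n - matches)    # 0.7 3 True

In other words, the Hamming distance is just the count of mismatching bits, and 1 - SMC is the same count divided by the total number of bits.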

(d) If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share > 99.9% of the same genes.)

Ans. The Hamming distance would be a good measure for comparing the genetic makeup of two organisms of the same species. Human beings share more than 99.9% of the same genes, so the useful information about what sets two people apart lies in the differences in their gene structure. The Jaccard similarity would be dominated by the overwhelming number of matching bits and would be close to 1 for any pair of humans, whereas the Hamming distance ignores the matching bits and directly counts the differences between the two genomes.

2.6 Q.19 For the following vectors, x and y, calculate the indicated similarity or distance measures. The formulas for the different measures are:

Cosine similarity
cos(x, y) = (x . y) / (||x|| ||y||), where x . y = sum_{k=1..n} x_k y_k, ||x|| = sqrt(sum_{k=1..n} x_k^2) = sqrt(x . x), and ||y|| = sqrt(y . y)

Correlation
corr(x, y) = s_xy / (s_x s_y), where
s_xy = (1/(n-1)) sum_{k=1..n} (x_k - mean(x)) (y_k - mean(y))
s_x = sqrt( (1/(n-1)) sum_{k=1..n} (x_k - mean(x))^2 )
s_y = sqrt( (1/(n-1)) sum_{k=1..n} (y_k - mean(y))^2 )

Jaccard coefficient
J = f_11 / (f_01 + f_10 + f_11)

Euclidean distance
d(x, y) = sqrt( sum_{k=1..n} (x_k - y_k)^2 )
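Before working through the parts, here is a small set of Python helper functions implementing the formulas above (a sketch; the function names are my own):

import math

def cosine(x, y):
    # cos(x, y) = (x . y) / (||x|| ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def correlation(x, y):
    # corr(x, y) = s_xy / (s_x * s_y); undefined (returned as nan) if a std is 0
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return float('nan') if s_x == 0 or s_y == 0 else s_xy / (s_x * s_y)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def jaccard(x, y):
    # binary vectors only: f11 / (f01 + f10 + f11) = f11 / (n - f00)
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return f11 / (len(x) - f00)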

(a) x = (1, 1, 1, 1), y = (2, 2, 2, 2): cosine, correlation, Euclidean

Cosine similarity: x . y = 1*2 + 1*2 + 1*2 + 1*2 = 8; ||x|| = sqrt(1+1+1+1) = 2; ||y|| = sqrt(4+4+4+4) = 4.
CS = 8 / (2 * 4) = 1

Correlation: mean(x) = 1, mean(y) = 2.
s_xy = (1/3)[(1-1)(2-2) + (1-1)(2-2) + (1-1)(2-2) + (1-1)(2-2)] = 0
Since s_x = s_y = 0 as well, the correlation is 0/0 and is therefore undefined.

Euclidean distance = sqrt((1-2)^2 + (1-2)^2 + (1-2)^2 + (1-2)^2) = 2

(b) x = (0, 1, 0, 1), y = (1, 0, 1, 0): cosine, correlation, Euclidean, Jaccard

Cosine similarity: x . y = 0*1 + 1*0 + 0*1 + 1*0 = 0, so CS = 0.

Correlation: mean(x) = 1/2, mean(y) = 1/2.
s_xy = (1/3)[(0-1/2)(1-1/2) + (1-1/2)(0-1/2) + (0-1/2)(1-1/2) + (1-1/2)(0-1/2)] = (1/3)[-1/4 - 1/4 - 1/4 - 1/4] = -1/3
s_x = sqrt((1/3)[(0-1/2)^2 + (1-1/2)^2 + (0-1/2)^2 + (1-1/2)^2]) = sqrt((1/3)[1/4 + 1/4 + 1/4 + 1/4]) = sqrt(1/3)
s_y = sqrt((1/3)[(1-1/2)^2 + (0-1/2)^2 + (1-1/2)^2 + (0-1/2)^2]) = sqrt(1/3)
Correlation = s_xy / (s_x s_y) = (-1/3) / (sqrt(1/3) * sqrt(1/3)) = (-1/3) / (1/3) = -1

Euclidean distance = sqrt((0-1)^2 + (1-0)^2 + (0-1)^2 + (1-0)^2) = 2

Jaccard measure: f_11 = 0, therefore J = 0.

(c) x = (0, -1, 0, 1), y = (1, 0, -1, 0): cosine, correlation, Euclidean

Cosine similarity: x . y = 0*1 + (-1)*0 + 0*(-1) + 1*0 = 0, so CS = 0.

Correlation: mean(x) = 0, mean(y) = 0.
s_xy = (1/3)[(0-0)(1-0) + (-1-0)(0-0) + (0-0)(-1-0) + (1-0)(0-0)] = (1/3)[0] = 0
Therefore, the correlation is 0.

Euclidean distance = sqrt((0-1)^2 + (-1-0)^2 + (0+1)^2 + (1-0)^2) = 2

(d) x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1): cosine, correlation, Jaccard

Cosine similarity: x . y = 1*1 + 1*1 + 0*1 + 1*0 + 0*0 + 1*1 = 3; ||x|| = sqrt(4) = 2; ||y|| = sqrt(4) = 2.
CS = 3 / (2 * 2) = 3/4 = 0.75

Correlation: mean(x) = 2/3, mean(y) = 2/3.
s_xy = (1/5)[(1-2/3)(1-2/3) + (1-2/3)(1-2/3) + (0-2/3)(1-2/3) + (1-2/3)(0-2/3) + (0-2/3)(0-2/3) + (1-2/3)(1-2/3)]
     = (1/5)[1/9 + 1/9 - 2/9 - 2/9 + 4/9 + 1/9] = (1/5)(3/9) = 1/15
s_x = sqrt((1/5)[1/9 + 1/9 + 4/9 + 1/9 + 4/9 + 1/9]) = sqrt((1/5)(12/9)) = sqrt(4/15)
s_y = sqrt((1/5)[1/9 + 1/9 + 1/9 + 4/9 + 4/9 + 1/9]) = sqrt((1/5)(12/9)) = sqrt(4/15)
Correlation = s_xy / (s_x s_y) = (1/15) / (4/15) = 1/4 = 0.25

Jaccard coefficient: f_11 = 3, f_01 = 1, f_10 = 1, so J = 3/5 = 0.6.

(e) x = (2, -1, 0, 2, 0, -3), y = (-1, 1, -1, 0, 0, -1): cosine, correlation

Cosine similarity: x . y = 2*(-1) + (-1)*1 + 0*(-1) + 2*0 + 0*0 + (-3)*(-1) = -2 - 1 + 3 = 0, so CS = 0.

Correlation: mean(x) = 0, mean(y) = 0.
s_xy = (1/5)[(2)(-1) + (-1)(1) + (0)(-1) + (2)(0) + (0)(0) + (-3)(-1)] = (1/5)[-2 - 1 + 3] = 0
Therefore, the correlation is 0.
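The hand calculations above can be cross-checked with the helper functions sketched earlier, for example:

pairs = {
    'a': ([1, 1, 1, 1], [2, 2, 2, 2]),
    'b': ([0, 1, 0, 1], [1, 0, 1, 0]),
    'c': ([0, -1, 0, 1], [1, 0, -1, 0]),
    'd': ([1, 1, 0, 1, 0, 1], [1, 1, 1, 0, 0, 1]),
    'e': ([2, -1, 0, 2, 0, -3], [-1, 1, -1, 0, 0, -1]),
}
for name, (x, y) in pairs.items():
    # correlation for (a) prints as nan, since both standard deviations are 0
    print(name, 'cosine =', round(cosine(x, y), 4), 'correlation =', round(correlation(x, y), 4))

# Euclidean distance is asked for in parts (a)-(c), Jaccard for the binary parts (b) and (d)
for name in ['a', 'b', 'c']:
    print(name, 'euclidean =', euclidean(*pairs[name]))       # 2.0 in each case
print('b jaccard =', jaccard(*pairs['b']), 'd jaccard =', jaccard(*pairs['d']))  # 0.0 and 0.6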

3.8 Q.1 Obtain one of the data sets available at the UCI Machine Learning Repository and apply as many of the different visualization techniques described in the chapter as possible.

Ans. I used the Haberman's Survival data set. The attributes are:
1. age of the patient at the time of operation
2. year of operation (19XX)
3. number of positive axillary nodes detected

Each record belongs to one of two classes:
Class 1 if the patient survived 5 years or longer
Class 2 if the patient died within 5 years
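A minimal sketch of how the data could be loaded for plotting, assuming the haberman.data file downloaded from the UCI repository (the column names here are my own):

import pandas as pd

# haberman.data is a headerless CSV: age, year of operation, positive axillary nodes, class
cols = ['age', 'year', 'nodes', 'class']
df = pd.read_csv('haberman.data', header=None, names=cols)
print(df['class'].value_counts(normalize=True))   # roughly 74% Class 1, 26% Class 2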

Histograms
[Histogram: count vs. number of positive axillary nodes]
[Histogram: count vs. year of operation (19XX)]
[Histogram: count vs. age]

Box plots
[Box plot for Class 1: min, lower quartile, median, upper quartile, and max of age, year of operation, and axillary nodes]
[Box plot for Class 2: min, lower quartile, median, upper quartile, and max of age, year of operation, and axillary nodes]

Scatter plots
[Scatter plot of year of operation vs. age at the time of operation, colored by class]
[Scatter plot of number of axillary nodes detected vs. age at the time of operation, colored by class]
[Scatter plot of number of axillary nodes detected vs. year of operation, colored by class]

Pie chart
[Pie chart of the class distribution: Class 1 = 74%, Class 2 = 26%]

Percentile plot
[Percentile plots of age, year of operation, and number of axillary nodes detected]
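The figures above could be reproduced along these lines with pandas and matplotlib (a sketch under the same loading assumptions as before, not the exact commands used for the original plots):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('haberman.data', header=None, names=['age', 'year', 'nodes', 'class'])

# Histograms of the three attributes
for col in ['nodes', 'year', 'age']:
    df[col].plot(kind='hist', title=col)
    plt.show()

# Box plots of the three attributes, one figure per class
for cls in [1, 2]:
    df[df['class'] == cls][['age', 'year', 'nodes']].plot(kind='box', title='Class %d' % cls)
    plt.show()

# Scatter plots of each attribute pair, colored by class
for xcol, ycol in [('age', 'year'), ('age', 'nodes'), ('year', 'nodes')]:
    for cls, marker in [(1, 'o'), (2, 'x')]:
        subset = df[df['class'] == cls]
        plt.scatter(subset[xcol], subset[ycol], marker=marker, label='Class %d' % cls)
    plt.xlabel(xcol); plt.ylabel(ycol); plt.legend(); plt.show()

# Pie chart of the class distribution
df['class'].value_counts().plot(kind='pie', autopct='%.0f%%')
plt.show()

# Percentile plot: attribute value at each percentile
p = np.arange(0, 101)
for col in ['age', 'year', 'nodes']:
    plt.plot(p, np.percentile(df[col], p), label=col)
plt.xlabel('percentile'); plt.legend(); plt.show()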

Reference: Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach, Vipin Kumar
