THE UNIVERSITY OF CHICAGO
Graduate School of Business
Business 41912, Spring Quarter 2016, Mr. Ruey S. Tsay

Lecture 10: Cluster Analysis and Multidimensional Scaling

1 Basic Concept and Similarity Measures

Cluster analysis attempts to group similar observations together based on some distance measure. It differs from the classification discussed before, because the number of groups is assumed to be known in classification, whereas in cluster analysis it is not. The inputs of cluster analysis are similarity measures or data from which similarities can be computed.

Similarity measures: There are many measures available; they all involve certain degrees of subjectivity. Important considerations include the nature of the variables, scales of measurement, and subject matter knowledge. Below are some distances and similarity coefficients for pairs of items.

1. Euclidean distance between two p-dimensional observations $x$ and $y$:
   \[ d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2} = \sqrt{(x - y)'(x - y)}. \]

2. Statistical distance:
   \[ d(x, y) = \sqrt{(x - y)' S^{-1} (x - y)}, \]
   where $S$ is the sample covariance matrix.

3. Minkowski metric:
   \[ d(x, y) = \left[ \sum_{i=1}^{p} |x_i - y_i|^m \right]^{1/m}. \]

4. Canberra metric:
   \[ d(x, y) = \sum_{i=1}^{p} \frac{|x_i - y_i|}{x_i + y_i}. \]

5. Czekanowski coefficient:
   \[ d(x, y) = 1 - \frac{2 \sum_{i=1}^{p} \min(x_i, y_i)}{\sum_{i=1}^{p} (x_i + y_i)}. \]

Remark. A distance measure needs to satisfy the following basic properties:

   $d(P, Q) = d(Q, P)$

   $d(P, Q) > 0$ if $P \neq Q$
   $d(P, Q) = 0$ if $P = Q$
   $d(P, Q) \leq d(P, R) + d(R, Q)$, the triangle inequality.

When items cannot be represented by meaningful distance measurements, pairs of items are often compared on the basis of the presence or absence of certain characteristics, which leads to binary variables. Similar items have more characteristics in common than do dissimilar items. After tabulating the binary variables, one can define similarity coefficients between items as a distance measure. See Table 12.1 of the textbook for a list of similarity coefficients.

2 Hierarchical Clustering Methods

Hierarchical clustering techniques proceed by either a series of successive mergers or a series of successive divisions. Agglomerative hierarchical methods start with the individual objects and merge similar objects into groups iteratively. Divisive hierarchical methods work in the opposite direction: they start with a single group and iteratively divide the objects into subgroups so that objects in one subgroup are far from those of the other subgroup. Following the textbook, we discuss agglomerative hierarchical procedures. In particular, we focus on three linkage methods. The results of a cluster analysis can be shown by a two-dimensional diagram known as a dendrogram.

Suppose that there are $N$ objects. Associated with them is an $N \times N$ symmetric matrix of distances (or similarities) $D = [d_{ij}]$. For any two groups $U$ and $V$, denote the distance between them by $d(U, V) = d_{UV}$.

Single linkage: (minimum distance or nearest neighbor)

1. Find the minimum distance between the objects to form a cluster of two objects, say $(UV)$.
2. The distance between $(UV)$ and any other object $W$ is defined as $d_{(UV)W} = \min\{d_{UW}, d_{VW}\}$.
3. Find a new group from the resulting new distance matrix.
4. Iterate the process. See the demonstration.

Complete linkage: (maximum distance)

The procedure is similar to that of single linkage: objects are grouped together by the minimum distance, but the distance is updated using the maximum, $d_{(UV)W} = \max\{d_{UW}, d_{VW}\}$.

Average linkage: (average distance)

Updates the distance using the average. Consider two subgroups $(UV)$ and $W$. Then,
\[ d_{(UV)W} = \frac{\sum_{i} \sum_{k} d_{ik}}{N_{(UV)} N_W}, \]
where $i$ and $k$ denote members of $(UV)$ and $W$, respectively, and $N_{(UV)}$ and $N_W$ are the numbers of objects in the subgroups $(UV)$ and $W$.

Demonstration: In R, the command for hierarchical clustering is hclust.

Demonstration of hierarchical clustering analysis
*** Data: Table 12-4, 11 languages ***
Commands used "hclust", "dist", and "plot".
*** Created a data matrix based on Example 12.2 on page 680.

da=read.table("t12-3a.dat")
da
(output omitted: 11 x 11 matrix)
colnames(da) <- c("e","n","da","du","g","fr","sp","i","p","h","fi")
dd=10-da   # compute a distance measure
d1=as.dist(dd)
m1=hclust(d1,method="single")

plot(m1)
# The program groups "Fr", "Sp" and "I" in one step.
# See Figure 12.4 on page 685.
m2=hclust(d1)   # default is "complete" linkage.
plot(m2)   # See Figure 12.7 on page 687.
m3=hclust(d1,method="average")
plot(m3)   # See Figure 12.9 on page 690.

*** Public Utility Data on Table 12.4
da=read.table("t12-4.dat")
dim(da)
[1] 22 9
da[1,]
(output omitted: one row of 9 columns, ending with Arizona)
x=da[,1:8]
aa=cor(x)
print(aa,digits=3)
(output omitted: 8 x 8 correlation matrix)
d1=as.dist(aa)
print(d1,digits=3)
(output omitted: lower triangle of the correlation matrix)

d2=-d1
# High correlations mean more similar. Distance should be shorter.
# Thus, the sign is changed.
m4=hclust(d2)
plot(m4)   # See Figure 12.8 on page 689.
x1=var(x)
print(x1,digits=3)
(output omitted: 8 x 8 covariance matrix)
se=sqrt(diag(x1))
print(se,digits=3)
(output omitted: standard deviations of the 8 variables)
a1=diag(se)
a1inv=solve(a1)
y=as.matrix(x)%*%a1inv
print(var(y),digits=3)
(output omitted: 8 x 8 matrix)
dd=dist(y,method="euclidean")
print(dd,digits=2)

(Output edited. This is Table 12.6 of the textbook, p.691.)
m5=hclust(dd)
plot(m5)   # See the corresponding figure in the textbook.

Nonhierarchical Clustering Methods

K-means method: MacQueen (1967) suggested the term K-means for an algorithm that assigns each item to the cluster having the nearest centroid. In its simplest version, the process consists of three steps.

1. Partition the items into K initial clusters.
2. Proceed through the list of items, assigning an item to the cluster whose centroid is nearest. Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.
3. Repeat Step 2 until no more reassignments take place.

Demonstration: In R, the command is kmeans.

*** Non-hierarchical clustering method ***
*** Data: The standardized public utility companies.
help(kmeans)
m6=kmeans(y,4)
names(m6)
[1] "cluster" "centers" "withinss" "size"
m6$cluster
(output omitted)
print(m6$centers,digits=3)
(output omitted: 4 x 8 matrix of cluster centers)
m6$size

(output omitted)
m6$withinss
(output omitted)
d1=dist(m6$centers,method="euclidean")
print(d1,digits=3)
(output omitted)

*** 5 groups ***
m7=kmeans(y,5)
m7$cluster
(output omitted)
m7$size
(output omitted)
m7$withinss
(output omitted)
print(m7$centers,digits=3)
(output omitted: 5 x 8 matrix of cluster centers)
d1=dist(m7$centers,method="euclidean")
print(d1,digits=3)
(output omitted)

Model-based approach to clustering with applications

We shall discuss the following two papers. You may download them from the UC library.

1. Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis

and density estimation. Journal of the American Statistical Association, 97.

2. Frühwirth-Schnatter, S. and Kaufmann, S. (2008). Model-based clustering of multiple time series. Journal of Business & Economic Statistics, 26.

Multidimensional Scaling

Problem: Suppose that there are $N$ items with $N(N-1)/2$ pairwise similarities (or distances). Multidimensional scaling seeks a representation of the items in a few dimensions such that the inter-item proximities nearly match the original similarities (or distances).

Stress: The numerical measure of closeness between the original similarities and the fitted values of the similarities.

In practice, if only the rank orders of the $N(N-1)/2$ original similarities are used, the process is called nonmetric multidimensional scaling. If the actual magnitudes of the original similarities are used, the process is called metric multidimensional scaling. The latter is also known as principal coordinate analysis.

Denote the similarity between items $i$ and $j$ by $s_{ij}$. Assume there are no ties in the similarities. Arrange the similarities in a strictly ascending order as
\[ s_{i_1 k_1} < s_{i_2 k_2} < \cdots < s_{i_M k_M}, \qquad (1) \]
where $M = N(N-1)/2$. (If distances are used, arrange in a strictly decreasing order.) Multidimensional scaling attempts to find a q-dimensional configuration of the $N$ items such that the distances $d_{ik}^{(q)}$ between the pairs of items match the ordering in (1). A perfect match occurs if
\[ d_{i_1 k_1}^{(q)} > d_{i_2 k_2}^{(q)} > \cdots > d_{i_M k_M}^{(q)}. \qquad (2) \]

Kruskal (1964) proposed stress as a measure of closeness. It is defined as
\[ \mathrm{Stress}(q) = \left\{ \frac{\sum_{i<k} \left( d_{ik}^{(q)} - \hat{d}_{ik}^{(q)} \right)^2}{\sum_{i<k} \left[ d_{ik}^{(q)} \right]^2} \right\}^{1/2}, \]
where $d_{ik}^{(q)}$ is the distance between items $i$ and $k$ in the q-dimensional configuration and $\hat{d}_{ik}^{(q)}$ is the fitted distance, that is, a reference value that corresponds to a perfect match with the ordering of the similarities $s_{ik}$.

General guidelines:

Stress    Goodness of Fit
20%       Poor
10%       Fair
5%        Good
2.5%      Excellent
0%        Perfect
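The Stress measure above can be sketched in a few lines of R. This is only an illustration under stated assumptions, not the procedure used in the textbook: the helper name kruskal_stress is hypothetical, the fitted values are obtained by monotone (isotonic) regression with isoreg from the stats package, the input is a set of dissimilarities (so larger input values should go with larger configuration distances), and the built-in eurodist data with a cmdscale configuration stand in for the course data.

```r
# Sketch: Kruskal's Stress for a given configuration (hypothetical helper).
# delta  : a "dist" object of original dissimilarities
# config : an N x q matrix of coordinates
kruskal_stress <- function(delta, config) {
  d     <- as.vector(dist(config))   # configuration distances d_ik^(q)
  delta <- as.vector(delta)          # original dissimilarities, same pair order
  ord   <- order(delta)              # rank order of the original dissimilarities
  dhat  <- numeric(length(d))
  # Fitted values: monotone regression of d along the rank order of delta
  dhat[ord] <- isoreg(d[ord])$yf
  sqrt(sum((d - dhat)^2) / sum(d^2))
}

# Assumed example data: road distances between European cities (built into R).
delta <- eurodist
for (q in 1:4) {
  cfg <- cmdscale(delta, k = q)      # classical MDS used as a trial configuration
  cat("q =", q, " Stress =", round(kruskal_stress(delta, cfg), 3), "\n")
}
```

The printed values can be read against the guideline table above. Note that a classical MDS configuration is not chosen to minimize Stress, so a dedicated nonmetric algorithm may give smaller values for the same q.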

Takane, et al. (1977) proposed an alternative measure of closeness,
\[ \mathrm{SStress} = \left[ \frac{\sum_{i<k} \left( d_{ik}^2 - \hat{d}_{ik}^2 \right)^2}{\sum_{i<k} d_{ik}^4} \right]^{1/2}. \]
The value of SStress is always between 0 and 1. Any value less than 0.1 is typically taken to mean that there is a good representation of the objects by the points in the given configuration.

Algorithm: See page 709 of the textbook.

1. Obtain the $N(N-1)/2$ similarities between distinct pairs of items. Order the similarities as in (1). Typically, distances are used.
2. Using a trial configuration in q dimensions, determine the inter-item distances $d_{ik}^{(q)}$ and the estimates $\hat{d}_{ik}^{(q)}$ so as to minimize the Stress or SStress measure.
3. Using the $\hat{d}_{ik}^{(q)}$'s, move the points around to obtain an improved configuration.
4. Choose the dimension q by checking the Stress or SStress measure.

Demonstration: In R, the multidimensional scaling command is cmdscale. It stands for classical multidimensional scaling.

setwd("c:/teaching/ama")
da=read.table("t12-4.dat")
dim(da)
[1] 22 9
x=da[,1:8]
s1=sqrt(diag(var(x)))
print(s1,digits=3)
(output omitted: standard deviations of the 8 variables)
sinv=diag(1/s1)
y=as.matrix(x)%*%sinv
dd=dist(y,method="euclidean")
help(cmdscale)
mm=cmdscale(dd,4)   # use 4 dimensions to fit the data (q=4)
print(mm,digits=3)
(output omitted: 22 x 4 matrix of coordinates)

m2=cmdscale(dd,2)   # Use 2-dimensional fit.
plot(m2[,1],m2[,2])
text(m2[,1],m2[,2],c(1:22))

** Another example: Airline distance
da=read.table("t12-7m.dat",header=TRUE)
da
(output omitted: columns Ata Bos Cin Col Dal Ind LRk LAs Mem StL Spo Tpa)

d1=as.dist(da)
d1
(output omitted: lower-triangular table of distances; the first entry, Bos to Ata, is 1068)
mc=cmdscale(d1,2)
plot(mc[,1],mc[,2])   # See the figure on page 711.
text(mc[,1],mc[,2],colnames(da))
plot(-mc[,1],-mc[,2])   # To match the textbook figure precisely.
text(-mc[,1],-mc[,2],colnames(da))

*** A third example ****
da=read.table("t12-9.dat")
dim(da)
[1] 25 7
x=da[,2:7]
x
(output omitted: 25 rows, columns V2 to V7)

s1=sqrt(diag(var(x)))
ainv=diag(1/s1)
y=as.matrix(x)%*%ainv
d1=dist(y,method="euclidean")
md=cmdscale(d1,2,eig=TRUE)
names(md)
[1] "points" "eig" "x" "ac" "GOF"
plot(-md$points[,1],-md$points[,2])   # See the figure on page 715.
text(-md$points[,1],-md$points[,2],da[,1])
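The "eig" and "GOF" components returned by cmdscale with eig=TRUE offer one way to judge how many dimensions to keep in classical scaling. The following is a minimal sketch, not part of the original demonstration: the helper name gof_by_dim is hypothetical and the built-in eurodist data are used in place of the course files. It relies on the documented behavior that GOF is a vector of length 2 whose second entry is the sum of the first k eigenvalues divided by the sum of the positive eigenvalues.

```r
# Sketch: goodness of fit of classical MDS as a function of the dimension q.
# gof_by_dim is a hypothetical helper; eurodist is an assumed example data set.
gof_by_dim <- function(delta, qmax) {
  sapply(1:qmax, function(q) cmdscale(delta, k = q, eig = TRUE)$GOF[2])
}

gof <- gof_by_dim(eurodist, 4)
print(round(gof, 3))

# The same quantity computed by hand from the eigenvalues, for q = 2.
fit <- cmdscale(eurodist, k = 2, eig = TRUE)
ev  <- fit$eig
manual <- sum(ev[1:2]) / sum(pmax(ev, 0))
print(all.equal(fit$GOF[2], manual))
```

A value close to 1 at small q suggests that a low-dimensional plot, such as the two-dimensional ones above, loses little information. This criterion is specific to classical scaling and is not the same as the Stress or SStress measures.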
