Distances and similarities
Based in part on slides from the textbook and slides of Susan Holmes
October 3, 2012
Similarities

Start with $X$, which we assume is centered and standardized. The PCA loadings were given by the eigenvectors of the correlation matrix, which is a measure of similarity. The first 2 (or any 2) PCA scores yield an $n \times 2$ matrix that can be visualized as a scatter plot. Similarly, the first 2 (or any 2) PCA loadings yield a $p \times 2$ matrix that can be visualized as a scatter plot.
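A minimal sketch of these two scatter plots, assuming a centered and standardized data matrix $X$ (here random, as a stand-in for a real data set):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: n cases, p features
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # center and standardize

U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = U * s          # PCA scores (n x p)
loadings = Vt.T         # PCA loadings (p x p)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(scores[:, 0], scores[:, 1])      # first 2 scores: one point per case
ax2.scatter(loadings[:, 0], loadings[:, 1])  # first 2 loadings: one point per feature
plt.show()
```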
Olympic data

[Figure: Olympic data example]
Similarities

The PCA loadings were given by the eigenvectors of the correlation matrix, which is a measure of similarity. The visualization of the cases based on PCA scores was determined by the eigenvectors of $XX^T$, an $n \times n$ matrix. To see this, remember that $X = U\Sigma V^T$, so

$$XX^T = U\Sigma^2 U^T, \qquad X^TX = V\Sigma^2 V^T.$$
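A quick numerical check of these two identities (a sketch; $X$ is an arbitrary example matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
S2 = np.diag(s**2)

# X X^T = U Sigma^2 U^T  (n x n similarity between cases)
assert np.allclose(X @ X.T, U @ S2 @ U.T)
# X^T X = V Sigma^2 V^T  (p x p similarity between features)
assert np.allclose(X.T @ X, Vt.T @ S2 @ Vt)
```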
Similarities

The matrix $XX^T$ is a measure of similarity between cases. The matrix $X^TX$ is a measure of similarity between features. Structure in the two similarity matrices yields insight into the set of cases, or the set of features...
Distances

Distances are inversely related to similarities. If $A$ and $B$ are similar, then $d(A, B)$ should be small, i.e. they should be near. If $A$ and $B$ are distant, then they should not be similar. For a data matrix, there is a natural distance between cases:

$$d(x_i, x_k)^2 = \|x_i - x_k\|_2^2 = \sum_{j=1}^p (X_{ij} - X_{kj})^2 = (XX^T)_{ii} - 2(XX^T)_{ik} + (XX^T)_{kk}$$
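A sketch verifying this identity numerically ($X$ is an arbitrary example matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 4))
S = X @ X.T  # similarity matrix between cases

i, k = 0, 3
d2_direct = np.sum((X[i] - X[k])**2)
d2_from_S = S[i, i] - 2 * S[i, k] + S[k, k]
assert np.allclose(d2_direct, d2_from_S)
```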
Distances

This suggests a natural transformation from a similarity matrix $S$ to a distance matrix $D$:

$$D_{ik} = (S_{ii} - 2S_{ik} + S_{kk})^{1/2}$$

The reverse transformation is not so obvious. Some suggestions from your book:

$$S_{ik} = -D_{ik}, \qquad S_{ik} = e^{-D_{ik}}, \qquad S_{ik} = \frac{1}{1 + D_{ik}}$$
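A minimal sketch of both directions (the reverse map is one of the suggested choices):

```python
import numpy as np

def sim_to_dist(S):
    """D_ik = (S_ii - 2 S_ik + S_kk)^(1/2), applied elementwise."""
    diag = np.diag(S)
    return np.sqrt(np.maximum(diag[:, None] - 2 * S + diag[None, :], 0))

def dist_to_sim(D, how="exp"):
    """One of the suggested reverse maps: -D, exp(-D), or 1/(1+D)."""
    if how == "neg":
        return -D
    if how == "exp":
        return np.exp(-D)
    return 1.0 / (1.0 + D)
```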
Distances

A distance (or a metric) on a set $S$ is a function $d : S \times S \to [0, +\infty)$ that satisfies

- $d(x, x) = 0$; $d(x, y) = 0 \Rightarrow x = y$
- $d(x, y) = d(y, x)$
- $d(x, y) \le d(x, z) + d(z, y)$

If $d(x, y) = 0$ for some $x \ne y$, then $d$ is a pseudo-metric.
Similarities

A similarity on a set $S$ is a function $s : S \times S \to \mathbb{R}$ and should satisfy

- $s(x, x) \ge s(x, y)$ for all $x \ne y$
- $s(x, y) = s(y, x)$

By adding a constant, we can often assume that $s(x, y) \ge 0$.
Examples: nominal data

The simplest example for nominal data is just the discrete metric

$$d(x, y) = \begin{cases} 0 & x = y \\ 1 & \text{otherwise.} \end{cases}$$

The corresponding similarity would be

$$s(x, y) = \begin{cases} 1 & x = y \\ 0 & \text{otherwise} \end{cases} = 1 - d(x, y)$$
Examples: ordinal data

If $S$ is ordered, we can think of $S$ as (or identify $S$ with) a subset of the non-negative integers. If $|S| = m$, then a natural distance is

$$d(x, y) = \frac{|x - y|}{m - 1}$$

The corresponding similarity would be $s(x, y) = 1 - d(x, y)$.
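A sketch of both of these distances on small examples (the values and scale are illustrative):

```python
def discrete_dist(x, y):
    """Discrete metric for nominal data."""
    return 0 if x == y else 1

def ordinal_dist(x, y, m):
    """|x - y| / (m - 1) for ordinal values coded 0, 1, ..., m-1."""
    return abs(x - y) / (m - 1)

print(discrete_dist("red", "blue"))   # 1
print(ordinal_dist(1, 3, m=5))        # 0.5
print(1 - ordinal_dist(1, 3, m=5))    # corresponding similarity: 0.5
```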
Examples: vectors of continuous data

If $S = \mathbb{R}^k$, there are lots of distances determined by norms. The Minkowski $p$ or $\ell_p$ norm, for $p \ge 1$:

$$d(x, y) = \|x - y\|_p = \left( \sum_{i=1}^k |x_i - y_i|^p \right)^{1/p}$$

Examples:

- $p = 2$: the usual Euclidean distance, $\ell_2$
- $p = 1$: $d(x, y) = \sum_{i=1}^k |x_i - y_i|$, the taxicab distance, $\ell_1$
- $p = \infty$: $d(x, y) = \max_{1 \le i \le k} |x_i - y_i|$, the sup norm, $\ell_\infty$
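A sketch comparing these three norms on one pair of vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(np.linalg.norm(x - y, ord=2))       # Euclidean: sqrt(13)
print(np.linalg.norm(x - y, ord=1))       # taxicab: 5.0
print(np.linalg.norm(x - y, ord=np.inf))  # sup norm: 3.0
```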
Examples: vectors of continuous data

If $0 < p < 1$, this is not a norm, but it still defines a metric (taking $d(x, y) = \sum_{i=1}^k |x_i - y_i|^p$, without the $1/p$ power). The preceding transformations can be used to construct similarities.

Examples: binary vectors

If $S = \{0, 1\}^k \subset \mathbb{R}^k$, the vectors can be thought of as vectors of bits. The $\ell_1$ norm counts the number of mismatched bits. This is known as Hamming distance.
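A sketch of Hamming distance as the $\ell_1$ norm on bit vectors:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0])
y = np.array([1, 1, 0, 1, 0])

hamming = np.sum(np.abs(x - y))  # l1 norm = number of mismatched bits
print(hamming)                   # 2
```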
Example: Mahalanobis distance

Given $\Sigma \in \mathbb{R}^{k \times k}$ that is positive definite, we define the Mahalanobis distance on $\mathbb{R}^k$ by

$$d_\Sigma(x, y) = \left( (x - y)^T \Sigma^{-1} (x - y) \right)^{1/2}$$

This is the usual Euclidean distance, with a change of basis given by a rotation and stretching of the axes. If $\Sigma$ is only non-negative definite, then we can replace $\Sigma^{-1}$ with $\Sigma^{+}$, its pseudo-inverse. This yields a pseudo-metric because it fails the test $d(x, y) = 0 \Rightarrow x = y$.
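A sketch of the positive definite case ($\Sigma$ here is an arbitrary example covariance):

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])  # example positive definite covariance
x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])

diff = x - y
# (x - y)^T Sigma^{-1} (x - y), computed via a linear solve
d = np.sqrt(diff @ np.linalg.solve(Sigma, diff))
print(d)
```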
Example: similarities for binary vectors

Given binary vectors $x, y \in \{0, 1\}^k$, we can summarize their agreement in a $2 \times 2$ table:

              x = 0    x = 1    total (y)
    y = 0     f_00     f_01     f_0.
    y = 1     f_10     f_11     f_1.
    total (x) f_.0     f_.1     k
Example: similarities for binary vectors

We define the simple matching coefficient (SMC) similarity by

$$\mathrm{SMC}(x, y) = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}} = \frac{f_{00} + f_{11}}{k} = 1 - \frac{\|x - y\|_1}{k}$$

The Jaccard coefficient ignores entries where $x_i = y_i = 0$:

$$J(x, y) = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}.$$
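A sketch computing both coefficients on example bit vectors:

```python
import numpy as np

x = np.array([1, 0, 0, 1, 1, 0])
y = np.array([1, 1, 0, 1, 0, 0])

f00 = np.sum((x == 0) & (y == 0))
f11 = np.sum((x == 1) & (y == 1))
k = len(x)

smc = (f00 + f11) / k      # counts matches of both kinds
jaccard = f11 / (k - f00)  # ignores 0-0 matches
print(smc, jaccard)        # 4/6 and 2/4
```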
Example: cosine similarity & correlation

Given vectors $x, y \in \mathbb{R}^k$, the cosine similarity is defined as

$$-1 \le \cos(x, y) = \frac{\langle x, y \rangle}{\|x\|_2 \, \|y\|_2} \le 1.$$

The correlation between two vectors is defined as

$$-1 \le \mathrm{cor}(x, y) = \frac{\langle x - \bar{x}\mathbf{1}, \; y - \bar{y}\mathbf{1} \rangle}{\|x - \bar{x}\mathbf{1}\|_2 \, \|y - \bar{y}\mathbf{1}\|_2} \le 1.$$
Example: correlation

An alternative, perhaps more familiar definition:

$$\mathrm{cor}(x, y) = \frac{S_{xy}}{S_x S_y}$$

where

$$S_{xy} = \frac{1}{k-1} \sum_{i=1}^k (x_i - \bar{x})(y_i - \bar{y}), \qquad S_x^2 = \frac{1}{k-1} \sum_{i=1}^k (x_i - \bar{x})^2, \qquad S_y^2 = \frac{1}{k-1} \sum_{i=1}^k (y_i - \bar{y})^2$$
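A sketch checking that correlation is just cosine similarity after centering:

```python
import numpy as np

rng = np.random.default_rng(3)
x, y = rng.standard_normal(50), rng.standard_normal(50)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

xc, yc = x - x.mean(), y - y.mean()
assert np.allclose(cosine(xc, yc), np.corrcoef(x, y)[0, 1])
```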
Correlation & PCA

The matrix $\frac{1}{n-1} X^T X$ was actually the matrix of pair-wise correlations of the features. Why? How? Writing $X$ for the centered and standardized version of the raw data $\tilde{X}$, with $H$ the centering matrix,

$$\frac{1}{n-1}\left(X^T X\right)_{ij} = \frac{1}{n-1}\left(D^{-1/2} \tilde{X}^T H \tilde{X} D^{-1/2}\right)_{ij}$$

The diagonal entries of $D$ are the sample variances of each feature. The inner matrix multiplication computes the pair-wise dot-products of the columns of $H\tilde{X}$:

$$\left(\tilde{X}^T H \tilde{X}\right)_{ij} = \sum_{k=1}^n (\tilde{X}_{ki} - \bar{X}_i)(\tilde{X}_{kj} - \bar{X}_j)$$
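A sketch verifying this numerically (Xraw is an arbitrary example of raw data):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 3
Xraw = rng.standard_normal((n, p)) * np.array([1.0, 2.0, 5.0])

H = np.eye(n) - np.ones((n, n)) / n         # centering matrix
variances = np.var(Xraw, axis=0, ddof=1)    # diagonal entries of D
Dinv_sqrt = np.diag(1 / np.sqrt(variances))

R = Dinv_sqrt @ Xraw.T @ H @ Xraw @ Dinv_sqrt / (n - 1)
assert np.allclose(R, np.corrcoef(Xraw, rowvar=False))
```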
[Figures: scatter plots illustrating high positive correlation, high negative correlation, no correlation, small positive correlation, and small negative correlation]
Combining similarities

In a given data set, each case may have many attributes or features. Example: see the health data set for HW 1. To compute similarities of cases, we must pool similarities across features. In a data set with $M$ different features, we write $x_i = (x_{i1}, \ldots, x_{iM})$, with each $x_{im} \in S_m$.
Combining similarities

Given similarities $s_m$ on each $S_m$, we can define an overall similarity between case $x_i$ and case $x_j$ by

$$s(x_i, x_j) = \sum_{m=1}^M w_m s_m(x_{im}, x_{jm})$$

with optional weights $w_m$ for each feature. Your book modifies this to deal with asymmetric attributes, i.e. attributes for which Jaccard similarity might be used.
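A minimal sketch of the weighted combination (the per-feature similarities and weights here are purely illustrative):

```python
def combined_similarity(xi, xj, sims, weights):
    """Weighted sum of per-feature similarities s_m."""
    return sum(w * s(a, b) for s, w, a, b in zip(sims, weights, xi, xj))

# Hypothetical case with one nominal and one ordinal feature (m = 5 levels)
nominal_sim = lambda a, b: 1.0 if a == b else 0.0
ordinal_sim = lambda a, b: 1.0 - abs(a - b) / 4

print(combined_similarity(("red", 1), ("red", 3),
                          sims=[nominal_sim, ordinal_sim],
                          weights=[0.5, 0.5]))  # 0.5*1 + 0.5*0.5 = 0.75
```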