
Last time: PCA
Statistical Data Mining and Machine Learning, Hilary Term 2016
Dino Sejdinovic, Department of Statistics, Oxford
Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml

PCA: find an orthogonal basis $\{v_1, v_2, \ldots, v_p\}$ for the data space such that:
- The first principal component (PC) $v_1$ is the direction of greatest variance of the data.
- The $j$-th PC $v_j$ is the direction orthogonal to $v_1, v_2, \ldots, v_{j-1}$ of greatest variance, for $j = 2, \ldots, p$.

Eigendecomposition of the sample covariance matrix $S = \frac{1}{n-1} \sum_{i=1}^n x_i x_i^\top$:
$$S = V \Lambda V^\top.$$
- $\Lambda$ is a diagonal matrix with the eigenvalues (variances along each principal component) $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0$.
- $V$ is a $p \times p$ orthogonal matrix whose columns are the $p$ eigenvectors of $S$, i.e. the principal components $v_1, \ldots, v_p$.

Dimensionality reduction by projecting $x_i \in \mathbb{R}^p$ onto the first $k$ principal components:
$$z_i = [v_1^\top x_i, \ldots, v_k^\top x_i]^\top \in \mathbb{R}^k.$$

Eigendecomposition and PCA
$$S = \frac{1}{n-1} \sum_{i=1}^n x_i x_i^\top = \frac{1}{n-1} X^\top X.$$
- $S$ is a real and symmetric matrix, so there exist $p$ eigenvectors $v_1, \ldots, v_p$ that are pairwise orthogonal and $p$ associated eigenvalues $\lambda_1, \ldots, \lambda_p$ which satisfy the eigenvalue equation $S v_i = \lambda_i v_i$. In particular, $V$ is an orthogonal matrix: $V V^\top = V^\top V = I_p$.
- $S$ is a positive-semidefinite matrix, so the eigenvalues are non-negative: $\lambda_i \geq 0$ for all $i$.
Why is $S$ symmetric? Why is $S$ positive-semidefinite?
Reminder: a symmetric $p \times p$ matrix $R$ is said to be positive-semidefinite if for all $a \in \mathbb{R}^p$, $a^\top R a \geq 0$.

Singular Value Decomposition (SVD)
Any real-valued $n \times p$ matrix $X$ can be written as $X = U D V^\top$, where
- $U$ is an $n \times n$ orthogonal matrix: $U U^\top = U^\top U = I_n$;
- $D$ is an $n \times p$ matrix with decreasing non-negative elements on the diagonal (the singular values) and zero off-diagonal elements;
- $V$ is a $p \times p$ orthogonal matrix: $V V^\top = V^\top V = I_p$.
The SVD always exists, even for non-square matrices. Fast and numerically stable algorithms for the SVD are available in most packages. The relevant R command is svd.
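As a quick sanity check of the eigendecomposition/SVD relationship above, here is a short R sketch (an illustration added here, not part of the original slides; it uses the built-in iris measurements purely as example data). It computes the principal components of a centred data matrix both via eigen() on the sample covariance and via svd(), and confirms that the eigenvalues of S equal the squared singular values divided by n - 1.

## Illustrative sketch: PCA via eigendecomposition of S vs. SVD of X
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)  # centred n x p data
n <- nrow(X)

S   <- cov(X)        # sample covariance, (1/(n-1)) X'X
eig <- eigen(S)      # S = V Lambda V'
sv  <- svd(X)        # X = U D V'

round(eig$values - sv$d^2 / (n - 1), 10)   # eigenvalues = squared singular values / (n-1)
round(abs(eig$vectors) - abs(sv$v), 10)    # same eigenvectors, up to sign

k <- 2
Z <- X %*% eig$vectors[, 1:k]              # projection onto the first k PCs
head(Z)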

SVD and PCA
Let $X = U D V^\top$ be the SVD of the $n \times p$ data matrix $X$. Note that
$$(n-1) S = X^\top X = (U D V^\top)^\top (U D V^\top) = V D^\top U^\top U D V^\top = V D^\top D V^\top,$$
using the orthogonality ($U^\top U = I_n$) of $U$. The eigenvalues of $S$ are thus the diagonal entries of $\Lambda = \frac{1}{n-1} D^\top D$.
We also have
$$X X^\top = (U D V^\top)(U D V^\top)^\top = U D V^\top V D^\top U^\top = U D D^\top U^\top,$$
using the orthogonality ($V^\top V = I_p$) of $V$.

Gram matrix
$B = X X^\top$, with $B_{ij} = x_i^\top x_j$, is called the Gram matrix of the dataset $X$. $B$ and $(n-1)S = X^\top X$ have the same non-zero eigenvalues, equal to the non-zero squared singular values of $X$.

> biplot(crabs.pca,scale=1)

[Figure: biplot of the crabs data (scale=1), showing the observations together with the variables FL, RW, CL, CW, BD.]

Biplots
PCA plots show the data items (rows of $X$) in the space spanned by the PCs. Biplots additionally allow us to visualize the original variables $X^{(1)}, \ldots, X^{(p)}$ (corresponding to the columns of $X$) in the same plot.
Recall that $X = [X^{(1)}, \ldots, X^{(p)}]$ and $X = U D V^\top$ is the SVD of the data matrix. The full PC projection of $x_i$ is the $i$-th row of $UD$:
$$z_i = V^\top x_i = D^\top U_i, \qquad \text{equivalently } XV = UD.$$
The $j$-th unit vector $e_j \in \mathbb{R}^p$ points in the direction of the original variable $X^{(j)}$. Its PC projection $\eta_j$ is
$$\eta_j = V^\top e_j = V_j \quad (\text{the } j\text{-th row of } V).$$
The projection of $e_j$ indicates the weighting each PC gives to the original variable $X^{(j)}$.
Dot products between these projections give the entries of the data matrix:
$$x_{ij} = \sum_{k=1}^{\min(n,p)} U_{ik} D_{kk} V_{jk} = \langle D^\top U_i, V_j \rangle = \langle z_i, \eta_j \rangle.$$
Biplots focus on the first two PCs, and their quality depends on the proportion of variance explained by the first two PCs.

Iris Data
50 samples from each of the 3 species of iris: setosa, versicolor, and virginica. Each sample records the lengths and widths of both sepals and petals. Collected by E. Anderson (1935) and analysed by R.A. Fisher (1936).
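To make the Gram-matrix duality concrete, here is a small illustrative check (not from the slides; it again uses the iris measurements as example data) that $B = XX^\top$ and $X^\top X = (n-1)S$ share their non-zero eigenvalues, and that the full PC scores are the rows of $UD$:

## Illustrative sketch: Gram matrix eigenvalues and PC scores from the SVD
X  <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
sv <- svd(X)

B <- X %*% t(X)            # n x n Gram matrix
C <- t(X) %*% X            # p x p matrix, equals (n-1) S

evB <- eigen(B, symmetric = TRUE)$values
evC <- eigen(C, symmetric = TRUE)$values

round(head(evB, ncol(X)) - evC, 8)      # same non-zero eigenvalues ...
round(head(evB, ncol(X)) - sv$d^2, 8)   # ... equal to the squared singular values

Z1 <- sv$u %*% diag(sv$d)               # rows of U D
Z2 <- X %*% sv$v                        # PC scores X V
max(abs(Z1 - Z2))                       # identical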

Iris Data

> data(iris)
> iris[sample(150,20),]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
54           5.5         2.3          4.0         1.3 versicolor
33           5.2         4.1          1.5         0.1     setosa
30           4.7         3.2          1.6         0.2     setosa
73           6.3         2.5          4.9         1.5 versicolor
107          4.9         2.5          4.5         1.7  virginica
4            4.6         3.1          1.5         0.2     setosa
90           5.5         2.5          4.0         1.3 versicolor
83           5.8         2.7          3.9         1.2 versicolor
50           5.0         3.3          1.4         0.2     setosa
92           6.1         3.0          4.6         1.4 versicolor
128          6.1         3.0          4.9         1.8  virginica
57           6.3         3.3          4.7         1.6 versicolor
9            4.4         2.9          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
86           6.0         3.4          4.5         1.6 versicolor
66           6.7         3.1          4.4         1.4 versicolor
85           5.4         3.0          4.5         1.5 versicolor
147          6.3         2.5          5.0         1.9  virginica
8            5.0         3.4          1.5         0.2     setosa
41           5.0         3.5          1.3         0.3     setosa

Iris data biplot

> iris.pca <- princomp(iris[,-5], cor=TRUE)
> loadings(iris.pca)
             Comp.1 Comp.2 Comp.3 Comp.4
Sepal.Length  0.521 -0.377  0.720  0.261
Sepal.Width  -0.269 -0.923 -0.244 -0.124
Petal.Length  0.580         -0.142 -0.801
Petal.Width   0.565         -0.634  0.524

> biplot(iris.pca,scale=0)

[Figure: biplot of the iris data (scale=0), showing the observations and the variables Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.]

Iris Data biplot - scaled
There are other projections we can consider for biplots (assuming $p < n$ to simplify notation):
$$x_{ij} = \sum_{k=1}^{p} U_{ik} D_{kk} V_{jk} = \langle D_{1:p,1:p} U_{i,1:p}, V_j \rangle = \langle D^{1-\alpha}_{1:p,1:p} U_{i,1:p},\; D^{\alpha}_{1:p,1:p} V_j \rangle,$$
where $0 \leq \alpha \leq 1$, i.e. we change the representation to
$$z_i = D^{1-\alpha}_{1:p,1:p} U_{i,1:p}, \qquad \eta_j = D^{\alpha}_{1:p,1:p} V_j.$$
Case $\alpha = 1$: the sample covariance of the projected points is
$$\widehat{\mathrm{Cov}}(Z) = \frac{1}{n-1} U_{1:n,1:p}^\top U_{1:n,1:p} = \frac{1}{n-1} I_p.$$
The projected points are uncorrelated and the dimensions have equal variance. The sample covariance between $X^{(i)}$ and $X^{(j)}$ is
$$\widehat{\mathbb{E}}\left(X^{(i)} X^{(j)}\right) = \frac{1}{n-1} \left( V D^\top D V^\top \right)_{i,j} = \frac{1}{n-1} \langle D_{1:p,1:p} V_i,\; D_{1:p,1:p} V_j \rangle,$$
so the angle between the projected variables corresponds to their correlation.

> ?biplot
... scale: The variables are scaled by lambda ^ scale and the observations are scaled by lambda ^ (1 - scale) where lambda are the singular values as computed by princomp. (default = 1) ...

> biplot(iris.pca,scale=1)

[Figure: the same iris biplot with the scaled (alpha = 1) representation.]
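The alpha-scaled coordinates can be built directly from the SVD. The sketch below is an illustration added here (the helper name coords is ours, not from the slides); it standardises the iris measurements, mirroring cor=TRUE above, and verifies that for alpha = 1 the projected points are uncorrelated with common variance 1/(n-1), while alpha = 0 gives the usual PC scores XV.

## Illustrative sketch: alpha-scaled biplot coordinates from the SVD
X  <- scale(as.matrix(iris[, 1:4]))     # standardised, as with cor=TRUE
n  <- nrow(X)
sv <- svd(X)

coords <- function(alpha, k = 2) {
  d <- sv$d[1:k]
  list(z   = sv$u[, 1:k] %*% diag(d^(1 - alpha)),   # observation coordinates
       eta = sv$v[, 1:k] %*% diag(d^alpha))         # variable coordinates
}

z1 <- coords(alpha = 1)$z
round(cov(z1), 6)                       # approximately (1/(n-1)) * I_2
1 / (n - 1)

z0 <- coords(alpha = 0)$z
max(abs(z0 - (X %*% sv$v)[, 1:2]))      # alpha = 0 recovers the PC scores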

Crabs Data biplots

> biplot(crabs.pca,scale=0)
> biplot(crabs.pca,scale=1)

[Figure: crabs data biplots with scale=0 (left) and scale=1 (right), showing the observations and the variables FL, RW, CL, CW, BD.]

US Arrests Data
This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.

pairs(USArrests)
usarrests.pca <- princomp(USArrests, cor=TRUE)
plot(usarrests.pca)
pairs(predict(usarrests.pca))
biplot(usarrests.pca)

US Arrests Data Pairs Plot
> pairs(USArrests)

[Figure: pairs plot of Murder, Assault, UrbanPop and Rape across the 50 states.]

US Arrests Data Biplot
> biplot(usarrests.pca)

[Figure: biplot of the 50 US states against the variables Murder, Assault, UrbanPop and Rape.]
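Before reading the biplot it is worth checking how much variance the first two components actually capture. A brief illustrative addition (not in the original slides):

## Illustrative addition: variance explained by the US Arrests principal components
usarrests.pca <- princomp(USArrests, cor = TRUE)
summary(usarrests.pca)                                    # sdevs and cumulative proportions
cumsum(usarrests.pca$sdev^2) / sum(usarrests.pca$sdev^2)  # proportion of variance explained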

Multidimensional Scaling (MDS)
Suppose there are $n$ points $X$ in $\mathbb{R}^p$, but we are only given the $n \times n$ matrix $D$ of inter-point distances. Can we reconstruct $X$?
Rigid transformations (translations, rotations and reflections) do not change inter-point distances, so we cannot recover $X$ exactly. However, $X$ can be recovered up to these transformations!
Let $d_{ij} = \|x_i - x_j\|_2$ be the distance between points $x_i$ and $x_j$. Then
$$d_{ij}^2 = \|x_i - x_j\|_2^2 = (x_i - x_j)^\top (x_i - x_j) = x_i^\top x_i + x_j^\top x_j - 2 x_i^\top x_j.$$
Let $B = X X^\top$ be the $n \times n$ matrix of dot products, $b_{ij} = x_i^\top x_j$. The above shows that $D$ can be computed from $B$. Some algebraic exercise shows that $B$ can be recovered from $D$ if we assume $\sum_{i=1}^n x_i = 0$.

If we knew $X$, then the SVD would give $X = U D V^\top$. As $X$ has rank at most $r = \min(n, p)$, there are at most $r$ non-zero singular values in $D$ and we can assume $U \in \mathbb{R}^{n \times r}$, $D \in \mathbb{R}^{r \times r}$ and $V \in \mathbb{R}^{p \times r}$. The eigendecomposition of $B$ is then
$$B = X X^\top = U D^2 U^\top = U \Lambda U^\top.$$
This eigendecomposition can be obtained from $B$ without knowledge of $X$!
Let $\tilde{x}_i = U_i \Lambda^{\frac{1}{2}} \in \mathbb{R}^r$ (writing $U_i$ for the $i$-th row of $U$). If $r < p$, pad $\tilde{x}_i$ with 0s so that it has length $p$. Then
$$\tilde{x}_i \tilde{x}_j^\top = U_i \Lambda U_j^\top = b_{ij} = x_i^\top x_j,$$
and we have found a set of vectors with dot products given by $B$, as desired. The vectors $\tilde{x}_i$ differ from $x_i$ only via the orthogonal matrix $V$ (recall that $x_i^\top = U_i D V^\top = \tilde{x}_i V^\top$), so they are equivalent up to rotations and reflections.

US City Flight Distances
We present a table of flying mileages between 10 American cities, distances calculated from our 2-dimensional world. Using $D$ as the starting point, metric MDS finds a configuration with the same distance matrix.

       ATLA  CHIG  DENV  HOUS    LA  MIAM    NY    SF  SEAT    DC
ATLA      0   587  1212   701  1936   604   748  2139  2182   543
CHIG    587     0   920   940  1745  1188   713  1858  1737   597
DENV   1212   920     0   879   831  1726  1631   949  1021  1494
HOUS    701   940   879     0  1374   968  1420  1645  1891  1220
LA     1936  1745   831  1374     0  2339  2451   347   959  2300
MIAM    604  1188  1726   968  2339     0  1092  2594  2734   923
NY      748   713  1631  1420  2451  1092     0  2571  2408   205
SF     2139  1858   949  1645   347  2594  2571     0   678  2442
SEAT   2182  1737  1021  1891   959  2734  2408   678     0  2329
DC      543   597  1494  1220  2300   923   205  2442  2329     0
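The "algebraic exercise" of recovering $B$ from $D$ amounts to double-centering the matrix of squared distances, $B = -\frac{1}{2} J D^{(2)} J$ with $J = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, where $D^{(2)}$ holds the $d_{ij}^2$. A minimal illustrative sketch follows (the function name classical.mds is ours, and the crabs data from MASS is used only as an example), together with a check against cmdscale:

## Illustrative sketch: classical MDS by double-centering the squared distances
classical.mds <- function(D, k = 2) {
  D2 <- as.matrix(D)^2
  n  <- nrow(D2)
  J  <- diag(n) - matrix(1/n, n, n)      # centering matrix
  B  <- -0.5 * J %*% D2 %*% J            # recovered Gram matrix (assumes sum_i x_i = 0)
  e  <- eigen(B, symmetric = TRUE)
  ev <- pmax(e$values[1:k], 0)           # guard against tiny negative eigenvalues
  e$vectors[, 1:k] %*% diag(sqrt(ev))    # x_i = U_i Lambda^(1/2)
}

library(MASS)
D  <- dist(crabs[, c("FL", "RW", "CL", "CW", "BD")])
Z1 <- classical.mds(D, k = 2)
Z2 <- cmdscale(D, k = 2)
max(abs(abs(Z1) - abs(Z2)))              # agreement up to the sign of each axis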

US City Flight Distances

library(MASS)
us <- read.csv("http://www.stats.ox.ac.uk/~sejdinov/sdmml/data/uscities.csv")
## use classical MDS to find lower dimensional views of the data
## recover X in 2 dimensions
us.classical <- cmdscale(d=us, k=2)
plot(us.classical)
text(us.classical, labels=names(us))

[Figure: the 2-dimensional configuration recovered by classical MDS, with the 10 cities (Seattle, SF, LA, Denver, Houston, Chicago, Atlanta, Miami, DC, NY) laid out roughly as on a map.]

Lower-dimensional Reconstructions
In the classical MDS derivation, we used all eigenvalues in the eigendecomposition of $B$ to reconstruct $\tilde{x}_i = U_i \Lambda^{\frac{1}{2}}$. We can instead use only the largest $k < \min(n, p)$ eigenvalues and eigenvectors in the reconstruction, giving the best $k$-dimensional view of the data. This is analogous to PCA, where only the largest eigenvalues of $X^\top X$ are used and the smallest ones are effectively suppressed. Indeed, PCA and classical MDS are duals and yield effectively the same result.

Crabs Data

library(MASS)
crabs$spsex <- paste(crabs$sp, crabs$sex, sep="")
varnames <- c("FL", "RW", "CL", "CW", "BD")
Crabs <- crabs[, varnames]
Crabs.class <- factor(crabs$spsex)
crabsmds <- cmdscale(d=dist(Crabs), k=2)
plot(crabsmds, pch=20, cex=2, col=unclass(Crabs.class))

[Figure: classical MDS of the crabs data, MDS 2 against MDS 1, points coloured by species/sex class.]
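The duality with PCA can also be checked numerically: classical MDS on Euclidean distances of the crabs measurements reproduces the first two PC scores up to sign. An illustrative check (added here, not in the slides):

## Illustrative check: classical MDS (Euclidean distances) vs. PCA scores on the crabs data
library(MASS)
Crabs <- crabs[, c("FL", "RW", "CL", "CW", "BD")]

mds.scores <- cmdscale(dist(Crabs), k = 2)
pc.scores  <- prcomp(Crabs)$x[, 1:2]          # PCA on the (unscaled) covariance matrix

max(abs(abs(mds.scores) - abs(pc.scores)))    # near zero: same solution up to sign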

Crabs Data
Compare with the previous PCA analysis: the classical MDS solution corresponds to the first 2 PCs.

[Figure: comparison of the classical MDS configuration of the crabs data with the principal component scores (Comp.1 through Comp.5); the MDS coordinates match the first two PCs.]

Varieties of MDS
Generally, MDS is a class of dimensionality reduction techniques which represent data points $x_1, \ldots, x_n \in \mathbb{R}^p$ in a lower-dimensional space $z_1, \ldots, z_n \in \mathbb{R}^k$ while trying to preserve inter-point (dis)similarities.
- MDS requires only the matrix $D$ of pairwise dissimilarities $d_{ij} = d(x_i, x_j)$. For example, we can use the Euclidean distance $d_{ij} = \|x_i - x_j\|_2$, but other dissimilarities are possible.
- MDS finds representations $z_1, \ldots, z_n \in \mathbb{R}^k$ such that $\|z_i - z_j\|_2 \approx d(x_i, x_j) = d_{ij}$, where differences in dissimilarities are measured by an appropriate loss $\ell(d_{ij}, \|z_i - z_j\|_2)$.
Goal: find $Z$ which minimizes the stress function
$$S(Z) = \sum_{i \neq j} \ell\left(d_{ij}, \|z_i - z_j\|_2\right).$$

Varieties of MDS
Choices of (dis)similarities and stress functions lead to different algorithms.
- Classical/Torgerson: preserves inner products instead, via the strain function (cmdscale):
$$S(Z) = \sum_{i \neq j} \left( b_{ij} - \langle z_i - \bar{z},\, z_j - \bar{z} \rangle \right)^2$$
- Metric Shepard-Kruskal: preserves distances with respect to the squared stress:
$$S(Z) = \sum_{i \neq j} \left( d_{ij} - \|z_i - z_j\|_2 \right)^2$$
- Sammon: preserves shorter distances more (sammon):
$$S(Z) = \sum_{i \neq j} \frac{\left( d_{ij} - \|z_i - z_j\|_2 \right)^2}{d_{ij}}$$
- Non-metric Shepard-Kruskal: ignores the actual distance values and preserves only their ranks (isoMDS):
$$S(Z) = \min_{g \text{ increasing}} \frac{\sum_{i \neq j} \left( g(d_{ij}) - \|z_i - z_j\|_2 \right)^2}{\sum_{i \neq j} \|z_i - z_j\|_2^2}$$
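The last two variants are implemented in MASS. As a brief illustrative sketch (added here; it uses the built-in eurodist road-distance matrix, not a dataset from the slides), the following runs a Sammon mapping and a non-metric (Kruskal) MDS and plots the two configurations side by side:

## Illustrative sketch: Sammon mapping and non-metric MDS from MASS
library(MASS)

euro.sammon <- sammon(eurodist, k = 2)    # emphasises preserving short distances
euro.isomds <- isoMDS(eurodist, k = 2)    # non-metric: only the rank order of d_ij matters

cities <- attr(eurodist, "Labels")
op <- par(mfrow = c(1, 2))
plot(euro.sammon$points, type = "n", main = "sammon", xlab = "", ylab = "")
text(euro.sammon$points, labels = cities, cex = 0.7)
plot(euro.isomds$points, type = "n", main = "isoMDS", xlab = "", ylab = "")
text(euro.isomds$points, labels = cities, cex = 0.7)
par(op)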