Dimension reduction, PCA & eigenanalysis
Based in part on slides from the textbook and slides of Susan Holmes
October 3, 2012
Combinations of features

Given a data matrix $X_{n \times p}$ with $p$ fairly large, it can be difficult to visualize structure. Often useful to look at linear combinations of the features. Each $\beta \in \mathbb{R}^p$ determines a linear rule $f_\beta(x) = x^T\beta$. Evaluating this on each row $X_i$ of $X$ yields a vector $(f_\beta(X_i))_{1 \le i \le n} = X\beta$.
Dimension reduction

By choosing $\beta$ appropriately, we may find interesting new features. Suppose we take $k$ (much smaller than $p$) vectors $\beta$, which we write as a matrix $B_{p \times k}$. The new data matrix $XB$ has fewer dimensions than $X$. This is dimension reduction...
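A minimal NumPy sketch (not from the slides; the sizes, the random data, and the matrix $B$ are all hypothetical) of reducing an $n \times p$ data matrix to $k$ derived features:

```python
# Minimal sketch of dimension reduction via linear combinations of features.
# X and B are synthetic; in practice B would be chosen to be "interesting".
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 10, 2            # hypothetical sizes with k much smaller than p
X = rng.normal(size=(n, p))     # hypothetical data matrix
B = rng.normal(size=(p, k))     # one candidate direction beta per column

XB = X @ B                      # new data matrix with only k columns
print(XB.shape)                 # (100, 2): fewer dimensions than X
```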
Principal Components

This method of dimension reduction seeks features that maximize sample variance. Specifically, for $\beta \in \mathbb{R}^p$, define the sample variance of $V_\beta = X\beta \in \mathbb{R}^n$:
$$\mathrm{Var}(V_\beta) = \frac{1}{n-1}\sum_{i=1}^n (V_{\beta,i} - \bar V_\beta)^2, \qquad \bar V_\beta = \frac{1}{n}\sum_{i=1}^n V_{\beta,i}.$$
Note that $\mathrm{Var}(V_{C\beta}) = C^2\,\mathrm{Var}(V_\beta)$, so we usually standardize $\|\beta\|^2 = \sum_{i=1}^p \beta_i^2 = 1$.
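A small sketch, on synthetic data, of the sample-variance definition and the scaling property $\mathrm{Var}(V_{C\beta}) = C^2\,\mathrm{Var}(V_\beta)$:

```python
# Sample variance of V_beta = X beta, and the effect of rescaling beta.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
beta /= np.linalg.norm(beta)                    # standardize so ||beta||^2 = 1

V = X @ beta
var_V = np.sum((V - V.mean()) ** 2) / (n - 1)   # sample variance of V_beta

C = 3.0
var_scaled = (X @ (C * beta)).var(ddof=1)       # variance of V_{C beta}
print(np.isclose(var_scaled, C ** 2 * var_V))   # True: Var scales as C^2
```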
Principal Components

Define the $n \times n$ matrix $H = I_{n\times n} - \frac{1}{n}\mathbf{1}\mathbf{1}^T$. This matrix removes means: $(Hv)_i = v_i - \bar v$. It is also a projection matrix: $H^T = H$ and $H^2 = H$.
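A quick numerical check (synthetic vector, small $n$) of the three stated properties of the centering matrix $H$:

```python
# The centering matrix H = I - (1/n) 1 1^T removes means and is a projection.
import numpy as np

n = 6
H = np.eye(n) - np.ones((n, n)) / n

v = np.arange(n, dtype=float)
print(np.allclose(H @ v, v - v.mean()))   # (Hv)_i = v_i - vbar
print(np.allclose(H, H.T))                # symmetric: H^T = H
print(np.allclose(H @ H, H))              # idempotent: H^2 = H
```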
Principal Components with Matrices

With this matrix,
$$\mathrm{Var}(V_\beta) = \frac{1}{n-1}\,\beta^T X^T H X \beta.$$
So, maximizing sample variance with $\|\beta\|_2 = 1$ is
$$\text{maximize}_{\|\beta\|_2 = 1}\ \beta^T X^T H X \beta. \qquad (\ast)$$
This boils down to an eigenvalue problem...
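A sketch verifying, on synthetic data, that the quadratic form $\frac{1}{n-1}\beta^T X^T H X \beta$ agrees with the sample variance of $X\beta$:

```python
# The matrix form of the sample variance of X beta.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
beta /= np.linalg.norm(beta)

H = np.eye(n) - np.ones((n, n)) / n
quad_form = beta @ X.T @ H @ X @ beta / (n - 1)
print(np.isclose(quad_form, (X @ beta).var(ddof=1)))   # True
```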
Eigenanalysis

The matrix $X^T H X$ is symmetric, so it can be written as $X^T H X = V D V^T$, where $D_{k\times k} = \mathrm{diag}(d_1, \ldots, d_k)$, $k = \mathrm{rank}(X^T H X)$, and $V_{p\times k}$ has orthonormal columns, i.e. $V^T V = I_{k\times k}$. We always have $d_i \ge 0$, and we take $d_1 \ge d_2 \ge \cdots \ge d_k$.
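A sketch of this eigenanalysis with NumPy on synthetic data; `numpy.linalg.eigh` returns eigenvalues in ascending order, so they are flipped to get $d_1 \ge \cdots \ge d_k$:

```python
# Eigenanalysis of the symmetric matrix X^T H X.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
H = np.eye(n) - np.ones((n, n)) / n

d, V = np.linalg.eigh(X.T @ H @ X)      # symmetric eigendecomposition
d, V = d[::-1], V[:, ::-1]              # reorder so d_1 >= d_2 >= ... >= d_k
print(np.all(d >= -1e-10))                             # eigenvalues are >= 0
print(np.allclose(V.T @ V, np.eye(p)))                 # orthonormal columns
print(np.allclose(X.T @ H @ X, V @ np.diag(d) @ V.T))  # X^T H X = V D V^T
```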
Eigenanalysis & PCA

Suppose now that $\beta = \sum_{j=1}^k a_j v_j$ with $v_j$ the columns of $V$. Then
$$\|\beta\|^2 = \sum_{i=1}^k a_i^2, \qquad \beta^T X^T H X \beta = \sum_{i=1}^k a_i^2 d_i.$$
Choosing $a_1 = 1$ and $a_j = 0$ for $j \ge 2$ maximizes this quantity.
Eigenanalysis & PCA

Therefore, $\beta_1 = v_1$, the first column of $V$, solves
$$\text{maximize}_{\|\beta\|_2 = 1}\ \beta^T X^T H X \beta. \qquad (\ast)$$
This yields scores $H X \beta_1 \in \mathbb{R}^n$. These are the 1st principal component scores.
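A sketch, on synthetic data, of extracting $\beta_1 = v_1$ and computing the first principal component scores $HX\beta_1$:

```python
# First principal component: the leading eigenvector of X^T H X.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
H = np.eye(n) - np.ones((n, n)) / n

d, V = np.linalg.eigh(X.T @ H @ X)
beta1 = V[:, -1]                  # eigenvector for the largest eigenvalue d_1
scores1 = H @ X @ beta1           # 1st principal component scores, in R^n
print(np.isclose(scores1.var(ddof=1), d[-1] / (n - 1)))  # maximal sample variance
```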
Higher order components

Having found the direction with maximal sample variance, we might look for the second most variable direction by solving
$$\text{maximize}_{\|\beta\|_2 = 1,\ \beta^T v_1 = 0}\ \beta^T X^T H X \beta.$$
Note we restricted our search so we would not just recover $v_1$ again. Not hard to see that if $\beta = \sum_{j=1}^k a_j v_j$, this is solved by taking $a_2 = 1$ and $a_j = 0$ for $j \neq 2$.
Higher order components

In matrix terms, all the principal component scores are $(H X V)_{n\times k}$ and the loadings are the columns of $V$. This information can be summarized in a biplot. The loadings describe how each feature contributes to each principal component score.
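A sketch computing all scores $(HXV)_{n\times k}$ and loadings $V$ on synthetic data; the score columns come out orthogonal, with variances $d_j/(n-1)$:

```python
# All principal component scores and loadings from the eigenanalysis.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
H = np.eye(n) - np.ones((n, n)) / n

d, V = np.linalg.eigh(X.T @ H @ X)
d, V = d[::-1], V[:, ::-1]             # decreasing eigenvalue order

scores = H @ X @ V                     # n x k matrix of PC scores
loadings = V                           # p x k matrix of loadings
print(np.allclose(scores.T @ scores, np.diag(d)))        # score columns orthogonal
print(np.allclose(scores.var(axis=0, ddof=1), d / (n - 1)))
```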
Olympic data [figure]
Olympic data: screeplot [figure]
The importance of scale

The PCA scores are not invariant to scaling of the features: for $Q_{p\times p}$ diagonal, the PCA scores of $XQ$ are not the same as those of $X$. Common to convert all variables to the same scale before applying PCA. Define the scalings to be the sample standard deviations of each feature; in matrix form, let
$$S^2 = \frac{1}{n-1}\,\mathrm{diag}\!\left(X^T H X\right).$$
Define $\tilde X = H X S^{-1}$. The normalized PCA loadings are given by an eigenanalysis of $\tilde X^T \tilde X$.
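A sketch of the scaling step on synthetic data whose columns have very different scales; after forming $\tilde X = HXS^{-1}$, every column has unit sample variance:

```python
# Center and scale: X_tilde = H X S^{-1}, with S the diagonal matrix of
# sample standard deviations.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p)) * np.array([1.0, 10.0, 100.0, 0.1, 1.0])
H = np.eye(n) - np.ones((n, n)) / n

S2 = np.diag(X.T @ H @ X) / (n - 1)    # per-feature sample variances
S_inv = np.diag(1.0 / np.sqrt(S2))     # S^{-1}: reciprocal standard deviations
X_tilde = H @ X @ S_inv                # centered, scaled data matrix
print(np.allclose(X_tilde.var(axis=0, ddof=1), 1.0))   # unit-variance columns
```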
Olympic data [figure]
Olympic data: screeplot [figure]
PCA and the SVD

The singular value decomposition of a matrix tells us we can write $\tilde X = U \Delta V^T$ with $\Delta_{k\times k} = \mathrm{diag}(\delta_1, \ldots, \delta_k)$, $\delta_j \ge 0$, $k = \mathrm{rank}(\tilde X)$, and $U^T U = V^T V = I_{k\times k}$. Recall that the scores were $\tilde X V = (U\Delta V^T)V = U\Delta$. Also, $\tilde X^T \tilde X = V \Delta^2 V^T$, so $D = \Delta^2$.
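A sketch, assuming a synthetic standardized matrix, checking that the SVD reproduces the scores as $U\Delta$ and the eigenvalues as $\Delta^2$:

```python
# PCA via the SVD of the centered, scaled data matrix X_tilde.
import numpy as np

rng = np.random.default_rng(0)
X_tilde = rng.normal(size=(100, 5))
X_tilde -= X_tilde.mean(axis=0)                  # center
X_tilde /= X_tilde.std(axis=0, ddof=1)           # scale

U, delta, Vt = np.linalg.svd(X_tilde, full_matrices=False)
print(np.allclose(X_tilde @ Vt.T, U * delta))    # scores: X_tilde V = U Delta
d = np.linalg.eigvalsh(X_tilde.T @ X_tilde)      # ascending eigenvalues
print(np.allclose(np.sort(delta ** 2), d))       # D = Delta^2
```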
PCA and the SVD

Recall $\tilde X = H X S^{-1}$. We saw that the normalized PCA loadings are given by an eigenanalysis of $\tilde X^T \tilde X$. It turns out that, via the SVD, the normalized PCA scores are given by an eigenanalysis of $\tilde X \tilde X^T$, because
$$(\tilde X V)_{n\times k} = U\Delta,$$
where the columns of $U$ are eigenvectors of $\tilde X \tilde X^T = U \Delta^2 U^T$.
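A sketch (synthetic data again) of this dual view: the leading eigenvector of the $n\times n$ matrix $\tilde X\tilde X^T$ recovers the first column of $U$, up to a sign flip:

```python
# Scores from an eigenanalysis of the n x n matrix X_tilde X_tilde^T.
import numpy as np

rng = np.random.default_rng(0)
X_tilde = rng.normal(size=(50, 4))
X_tilde -= X_tilde.mean(axis=0)
X_tilde /= X_tilde.std(axis=0, ddof=1)

U, delta, Vt = np.linalg.svd(X_tilde, full_matrices=False)
evals, evecs = np.linalg.eigh(X_tilde @ X_tilde.T)   # = U Delta^2 U^T
# The top eigenvector of X_tilde X_tilde^T matches the first column of U
# up to an arbitrary sign.
print(np.allclose(np.abs(evecs[:, -1]), np.abs(U[:, 0])))
```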
Another characterization of SVD

Given a data matrix $X$ (or its scaled, centered version $\tilde X$) we might try solving
$$\text{minimize}_{Z:\ \mathrm{rank}(Z)=k}\ \|\tilde X - Z\|_F^2,$$
where $\|\cdot\|_F$ stands for the Frobenius norm on matrices: $\|A\|_F^2 = \sum_{i=1}^n \sum_{j=1}^p A_{ij}^2$.
Another characterization of SVD

It can be proven that $Z = S_{n\times k}\,(V^T)_{k\times p}$, where $S_{n\times k}$ is the matrix of the first $k$ PCA scores and the columns of $V$ are the first $k$ PCA loadings. This approximation is related to the screeplot: the height of each bar describes the additional drop in Frobenius norm as the rank of the approximation increases.
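A sketch of the rank-$k$ approximation on synthetic data: the residual Frobenius norm after keeping $k$ components equals the sum of the discarded $\delta_j^2$, which is what the screeplot bars track:

```python
# Best rank-k approximation via the truncated SVD (scores times loadings).
import numpy as np

rng = np.random.default_rng(0)
X_tilde = rng.normal(size=(100, 5))
X_tilde -= X_tilde.mean(axis=0)
X_tilde /= X_tilde.std(axis=0, ddof=1)

U, delta, Vt = np.linalg.svd(X_tilde, full_matrices=False)
for k in range(1, 6):
    Z = (U[:, :k] * delta[:k]) @ Vt[:k, :]           # first k scores x loadings
    resid = np.linalg.norm(X_tilde - Z, "fro") ** 2  # leftover Frobenius norm
    print(k, np.isclose(resid, np.sum(delta[k:] ** 2)))   # True for each k
```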
Olympic data: screeplot [figure]
Other types of dimension reduction

Instead of maximizing sample variance, we might try maximizing some other quantity... Independent Component Analysis tries to maximize the non-Gaussianity of $V_\beta$; in practice, it uses skewness, kurtosis, or other moments to quantify non-Gaussianity. These are both unsupervised approaches. Often, these are combined with supervised approaches into an algorithm like the following (a code sketch follows below):

Feature creation: Build some set of features using PCA.
Validate: Use the derived features to see if they are helpful in the supervised problem.
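A hedged sketch of this two-step recipe, assuming scikit-learn is available; the data, the response, and the choice of two components are all hypothetical:

```python
# Feature creation with PCA, then validation of the derived features in a
# supervised problem via cross-validation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)      # synthetic response

pipe = make_pipeline(PCA(n_components=2), LinearRegression())
print(cross_val_score(pipe, X, y, cv=5).mean())       # do PCA features help?
```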
Olympic data: 1st component vs. total score [figure]