Lecture 4: Principal Component Analysis and Linear Dimension Reduction


1 Lecture 4: Principal Component Analysis and Linear Dimension Reduction
Advanced Applied Multivariate Analysis, STAT 2221, Fall 2013
Sungkyu Jung, Department of Statistics, University of Pittsburgh

2 Linear dimension reduction
For a random vector $X \in \mathbb{R}^p$, consider reducing the dimension from $p$ to $d$, i.e., from the $p$ variables $(X_1, \ldots, X_p)$ to a set of the $d$ most interesting variables. Here, $1 \le d \le p$.
Best subset?
Linear dimension reduction: construct $d$ variables $Z_1, \ldots, Z_d$ as linear combinations of $X_1, \ldots, X_p$, i.e.,
$$Z_i = a_{i1} X_1 + \cdots + a_{ip} X_p = a_i^\top X \quad (i = 1, \ldots, d).$$
Linear dimension reduction seeks a sequence of such $Z_i$, or equivalently a sequence of $a_i$, for which the random variables $Z_i$ are the most important among all possible choices.

3 Principal Component Analysis
In linear dimension reduction, we require $\|a_i\| = 1$ and $\langle a_i, a_j \rangle = 0$ for $i \neq j$. Thus the problem is to find an interesting set of direction vectors $\{a_i : i = 1, \ldots, p\}$, where the projection scores onto $a_i$ are useful.
Principal Component Analysis (PCA) is a linear dimension reduction technique that gives a set of direction vectors of maximal (projected) variance.
Take $d = 1$. PCA for the distribution of $X$ finds $a_1$ such that
$$a_1 = \operatorname{argmax}_{a \in \mathbb{R}^p,\ \|a\| = 1} \mathrm{Var}(Z_1(a)),$$
where, for $a = (a_1, \ldots, a_p)^\top$, $Z_1(a) = a_1 X_1 + \cdots + a_p X_p = a^\top X$.

4 Geometric understanding of PCA for point cloud
PCA is best understood with a point cloud. Take a look at a 2D example.
[Figure: scatter plot of n = 200 data points in 2D; axes v_1 and v_2.]

5 Geometric understanding of PCA for point cloud
Take $a = (1, 0)$.
[Figure: left, the n = 200 data points in 2D (axes v_1, v_2); right, density of the projection scores, variance 0.79, N = 200.]

6 Geometric understanding of PCA for point cloud
Take $a = (1, 0)$, $(0, 1)$, $(1/\sqrt{2}, 1/\sqrt{2})$. Which one is better?
[Figure: the n = 200 data points in 2D (axes v_1, v_2) and three density panels of projection scores, reporting variances 0.79, 1.32 and 1.04 (N = 200 each).]
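The comparison of candidate directions can be tried numerically. The following is a minimal sketch on a simulated 2D point cloud (the data, seed and mixing matrix are assumptions for illustration, not the lecture's data). It computes the variance of the projection scores for the three directions above and checks it against the quadratic form $a^\top S a$.

```r
set.seed(1)
n <- 200
# Hypothetical 2D point cloud (not the data shown on the slide)
X <- cbind(rnorm(n), rnorm(n)) %*% matrix(c(1, 0.4, 0.4, 1.2), 2, 2)

dirs <- list(c(1, 0), c(0, 1), c(1, 1) / sqrt(2))

# Sample variance of the projection scores X a for each unit vector a
proj_var <- sapply(dirs, function(a) var(as.vector(X %*% a)))

# The same quantity written as the quadratic form a' S a
S <- cov(X)
quad_form <- sapply(dirs, function(a) as.numeric(t(a) %*% S %*% a))

# No unit direction can exceed the largest eigenvalue of S
best <- eigen(S, symmetric = TRUE)$values[1]
```

The direction with the largest entry of `proj_var` is the better of the three, and `best` is the variance attained by the optimal direction.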

7 Geometric understanding of PCA for 3D point cloud
A 3D point cloud.
Mean.
The 1st PC direction (maximizing the variance of projections) explains the cloud best.
The 1st and 2nd directions form a plane.

8 Formulation of population PCA 1
Suppose a random vector $X$ has mean $\mu$ and covariance $\Sigma$ (not necessarily normal). The first principal component (PC) direction vector is the unit vector $u_1 \in \mathbb{R}^p$ that maximizes the variance of $u_1^\top X$ when compared to all other unit vectors, i.e.,
$$u_1 = \operatorname{argmax}_{u \in \mathbb{R}^p,\ \|u\| = 1} \mathrm{Var}(u^\top X).$$
- $u_1 = (u_{11}, \ldots, u_{1p})^\top$ is the first PC direction vector, sometimes called the loading vector.
- $(u_{11}, \ldots, u_{1p})$ are the loadings of the 1st PC.
- $Z_1 = u_{11} X_1 + \cdots + u_{1p} X_p = u_1^\top X$ is the first PC score, or the first principal component (a random variable).
- $\lambda_1 = \mathrm{Var}(Z_1)$ is the variance explained by the first PC score.

9 Formulation of population PCA 2
The second PC direction is the unit vector $u_2 \in \mathbb{R}^p$ that maximizes the variance of $u_2^\top X$ and is orthogonal to the first PC direction $u_1$. That is,
$$u_2 = \operatorname{argmax}_{u \in \mathbb{R}^p,\ \|u\| = 1,\ u^\top u_1 = 0} \mathrm{Var}(u^\top X).$$
- $u_2 = (u_{21}, \ldots, u_{2p})^\top$ is the second PC direction vector, and is the vector of the 2nd loadings.
- $Z_2 = u_2^\top X$ is the second principal component.
- $\lambda_2 = \mathrm{Var}(Z_2)$ is the variance explained by the second PC score, and $\lambda_1 \ge \lambda_2$.
- $\mathrm{Corr}(Z_1, Z_2) = 0$.

10 Formulation of population PCA (3, 4, ..., p)
Given the first $k-1$ PC directions $u_1, \ldots, u_{k-1}$, the kth PC direction is the unit vector $u_k \in \mathbb{R}^p$ that maximizes the variance of $u_k^\top X$ and is orthogonal to the first through $(k-1)$th PC directions $u_j$. That is,
$$u_k = \operatorname{argmax}_{u \in \mathbb{R}^p,\ \|u\| = 1,\ u^\top u_j = 0,\ j = 1, \ldots, k-1} \mathrm{Var}(u^\top X).$$
- $u_k = (u_{k1}, \ldots, u_{kp})^\top$ is the kth PC direction vector, and is the vector of the kth loadings.
- $Z_k = u_k^\top X$ is the kth principal component.
- $\lambda_k = \mathrm{Var}(Z_k)$ is the variance explained by the kth PC score, and $\lambda_1 \ge \cdots \ge \lambda_{k-1} \ge \lambda_k$.
- $\mathrm{Corr}(Z_i, Z_j) = 0$ for all $i \neq j \le k$.

11 Relation to eigen-decomposition of Σ
Recall the eigen-decomposition of the symmetric positive definite $\Sigma = U \Lambda U^\top$, with $U = [u_1, \ldots, u_p]$ orthogonal, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ with $\lambda_1 \ge \cdots \ge \lambda_p$, and $\Sigma u_i = \lambda_i u_i$.
In the next two slides we show that:
1. The kth eigenvector $u_k$ is the kth PC direction vector.
2. The kth eigenvalue $\lambda_k$ is the variance explained by the kth principal component.
3. PC directions are orthogonal, $u_i^\top u_j = 0$ $(i \neq j)$, and also $\Sigma$-orthogonal, $u_i^\top \Sigma u_j = 0$, which is equivalent to $\mathrm{Cov}(Z_i, Z_j) = 0$ $(i \neq j)$.

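12 Relation to eigen-decomposition of Σ
The first PC direction problem is to maximize $\mathrm{Var}(u^\top X)$ under the constraint $u^\top u = 1$. Using a Lagrange multiplier $\lambda$, the maximization problem is the same as finding a stationary point of
$$\Phi(u, \lambda) = \mathrm{Var}(u^\top X) - \lambda(u^\top u - 1) = u^\top \Sigma u - \lambda(u^\top u - 1).$$
The stationary point solves
$$\frac{1}{2} \frac{\partial}{\partial u} \Phi(u, \lambda) = \Sigma u - \lambda u = 0,$$
which leads (premultiplying by $u^\top$ and using $u^\top u = 1$) to
$$\lambda = u^\top \Sigma u = \mathrm{Var}(u^\top X), \qquad \Sigma u = \lambda u. \quad (1)$$
The eigenvector-eigenvalue pairs $(u_i, \lambda_i)$, $(i = 1, \ldots, p)$, satisfy (1). It is clear that the first PC direction is the first eigenvector $u_1$, as it gives the largest variance $\lambda_1 = u_1^\top \Sigma u_1 = \mathrm{Var}(u_1^\top X) \ge \lambda_j$ $(j > 1)$.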
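The claim that the first eigenvector attains the largest variance can be checked numerically. The sketch below assumes a randomly generated covariance matrix (hypothetical, for illustration only) and compares $u^\top \Sigma u$ over many random unit vectors with the top eigenvalue.

```r
set.seed(2)
p <- 4
A <- matrix(rnorm(p * p), p, p)
Sigma <- crossprod(A)                 # hypothetical symmetric PSD covariance matrix

eig <- eigen(Sigma, symmetric = TRUE)
u1 <- eig$vectors[, 1]                # first eigenvector
lambda1 <- eig$values[1]              # largest eigenvalue

# Variance u' Sigma u along 1000 random unit vectors
U <- matrix(rnorm(p * 1000), p, 1000)
U <- sweep(U, 2, sqrt(colSums(U^2)), "/")   # normalize each column to unit length
rand_var <- colSums(U * (Sigma %*% U))

var_u1 <- as.numeric(t(u1) %*% Sigma %*% u1)
```

Here `var_u1` equals `lambda1`, and no entry of `rand_var` exceeds it.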

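13 Relation to eigen-decomposition of Σ
For the kth PC direction, we form the Lagrangian function
$$\Phi(u, \lambda, \gamma_1^{k-1}) = u^\top \Sigma u - \lambda(u^\top u - 1) - \sum_{j=1}^{k-1} 2 \gamma_j u_j^\top u,$$
given the first $k-1$ PC directions, where $\gamma_1^{k-1} = (\gamma_1, \ldots, \gamma_{k-1})$. The derivatives of $\Phi$, equated to zero, are then
$$\frac{1}{2} \frac{\partial}{\partial u} \Phi(u, \lambda, \gamma_1^{k-1}) = \Sigma u - \lambda u - \sum_{j=1}^{k-1} \gamma_j u_j = 0, \qquad \frac{\partial}{\partial \gamma_j} \Phi(u, \lambda, \gamma_1^{k-1}) = -2 u_j^\top u = 0. \quad (2)$$
Premultiplying the first equation by $u_j^\top$, we have $\gamma_j = u_j^\top \Sigma u = 0$ (since $\Sigma u_j = \lambda_j u_j$), thus
$$\lambda = u^\top \Sigma u = \mathrm{Var}(u^\top X), \qquad \Sigma u = \lambda u. \quad (3)$$
The kth to last eigen-pairs $(u_i, \lambda_i)$, $(i = k, \ldots, p)$, satisfy both (3) and (2). Thus, the kth PC direction is $u_k$, as it gives the largest variance $\lambda_k = u_k^\top \Sigma u_k$.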

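14 Example: Principal components of bivariate normal distribution
Consider $N_2(0, \Sigma)$ with
$$\Sigma = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} \cos(\pi/4) & -\sin(\pi/4) \\ \sin(\pi/4) & \cos(\pi/4) \end{bmatrix} \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \cos(\pi/4) & \sin(\pi/4) \\ -\sin(\pi/4) & \cos(\pi/4) \end{bmatrix}.$$
The PC directions $u_i$ are the principal axes of the ellipsoids representing the density of the MVN, with lengths proportional to $\sqrt{\lambda_i}$.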
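The decomposition in this example can be verified directly with base R's `eigen`. This is a small check and not part of the original slides.

```r
Sigma <- matrix(c(2, 1, 1, 2), 2, 2)
eig <- eigen(Sigma, symmetric = TRUE)

lambda <- eig$values          # PC variances
U <- eig$vectors              # PC directions (columns), determined up to sign

# Angle of the first PC direction from the horizontal axis
angle1 <- atan2(abs(U[2, 1]), abs(U[1, 1]))

# Rebuild Sigma from its eigen-decomposition U Lambda U'
Sigma_rebuilt <- U %*% diag(lambda) %*% t(U)
```

The eigenvalues are 3 and 1, and the first direction lies at an angle of $\pi/4$, matching the rotation matrices above.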

15 Sample PCA
Given multivariate data $X = [x_1, \ldots, x_n]_{p \times n}$, the sample PCA sequentially finds orthogonal directions of maximal (projected) sample variance.
1. For $\tilde{x}_i = x_i - \bar{x}$, the centered data matrix is $\tilde{X} = [\tilde{x}_1, \ldots, \tilde{x}_n] = X(I_n - \frac{1}{n} 1_n 1_n^\top)$.
2. The sample variance matrix is then $S = \frac{1}{n-1} \tilde{X} \tilde{X}^\top$.
3. The eigen-decomposition $S = \hat{U} \hat{\Lambda} \hat{U}^\top$ leads to the kth sample PC direction ($\hat{u}_k$) and the variance of the kth sample score ($\hat{\lambda}_k$).
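The three steps can be sketched in R as follows, keeping the slides' $p \times n$ layout (variables in rows). The data matrix is simulated and purely illustrative.

```r
set.seed(3)
p <- 3; n <- 50
X <- matrix(rnorm(p * n), p, n)         # hypothetical p x n data matrix

# Step 1: centering matrix I_n - (1/n) 1 1' and centered data
C <- diag(n) - matrix(1 / n, n, n)
Xc <- X %*% C

# Step 2: sample variance matrix
S <- Xc %*% t(Xc) / (n - 1)

# Step 3: eigen-decomposition gives sample PC directions and variances
eig <- eigen(S, symmetric = TRUE)
U_hat <- eig$vectors
lambda_hat <- eig$values
```

Note that R's `cov` expects observations in rows, so `S` here corresponds to `cov(t(X))`.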

16 Sample PCA
It can be checked that the eigenvector $\hat{u}_k$ of $S$ satisfies
$$\hat{u}_k = \operatorname{argmax}_{u \in \mathbb{R}^p,\ \|u\| = 1,\ u^\top \hat{u}_j = 0,\ j = 1, \ldots, k-1} \mathrm{Var}(\{z_{(k)i}(u) : i = 1, \ldots, n\}), \qquad z_{(k)i}(u) = u^\top \tilde{x}_i \quad (i = 1, \ldots, n).$$
- $\hat{u}_k = (\hat{u}_{k1}, \ldots, \hat{u}_{kp})^\top$ is the kth sample PC direction vector, and is the vector of the kth loadings.
- $z_{(k)} = (z_{(k)1}, \ldots, z_{(k)n})_{1 \times n} = \hat{u}_k^\top \tilde{X}$ is the kth score vector, where $z_{(k)i} = \hat{u}_k^\top \tilde{x}_i$.
- $\hat{\lambda}_k$ is the sample variance of $\{z_{(k)i} : i = 1, \ldots, n\}$.
The information in the first two principal components is all contained in the score vectors $(z_{(1)}, z_{(2)})$, whose 2D scatter plot is thus the most informative, having the largest total variance.

17 Example: Swiss Bank Note Data
Swiss Bank Note data from HS. $p = 6$, $n = 200$.
[Figure: scatter plot matrix of the original measurements V1 to V6; genuine (blue) and fake (red) samples.]

18 Example: Swiss Bank Note Data
Scatters of principal components.
[Figure: scatter plot matrix of the scores Comp.1 to Comp.6.]

19 How many components to keep? (1)
The criterion for PCA is a high variance in the principal components. The question is how much of the whole data the PCs explain, in terms of variance.
1. The total variance in $X$ is the sum of all marginal variances:
$$\sum_{k=1}^p \mathrm{Var}(\{x_{ki} : i = 1, \ldots, n\}) = \operatorname{trace}(S) = \sum_{k=1}^p \mathrm{Var}(\{z_{(k)i} : i = 1, \ldots, n\}) = \sum_{k=1}^p \lambda_k.$$
2. Variance of the kth scores: $\mathrm{Var}(\{z_{(k)i} : i = 1, \ldots, n\}) = \lambda_k$.
3. Total variance in the 1st through kth scores: $\lambda_1 + \cdots + \lambda_k$.
For heuristics made from these quantities, see the next slide.

20 Example: Swiss Bank Note Data scree plot
In the scree plot $(k, \lambda_k)$, we look for an elbow. In the cumulative scree plot of the proportion of variance explained, $\left(k, \sum_{j=1}^k \lambda_j / \sum_{j=1}^p \lambda_j\right)$, use 90% as a cutoff.
[Figure: scree plot (lambda vs. component) and cumulative scree plot (cumulative proportion (%) vs. component).]
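The quantities behind the two scree plots are simple to compute. The sketch below uses R's built-in `USArrests` data as a stand-in (an assumption; the Swiss Bank Note data are not bundled with base R) and applies the 90% cutoff.

```r
x <- as.matrix(USArrests)             # stand-in data set, n x p
lambda <- prcomp(x)$sdev^2            # variances of the PC scores

prop <- lambda / sum(lambda)          # proportion of variance per component
cum_prop <- cumsum(prop)              # cumulative proportion of variance

# Smallest k whose cumulative proportion reaches the 90% cutoff
k90 <- which(cum_prop >= 0.90)[1]

# Total variance equals the sum of the marginal variances, trace(S)
total_var <- sum(diag(cov(x)))

# plot(lambda, type = "b")            # scree plot
# plot(100 * cum_prop, type = "b")    # cumulative scree plot (%)
```

The plotting calls are left commented out so the block runs without a graphics device.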

21 How many components to keep? (2)
1. Kaiser's rule (of thumb): retain PCs $1, \ldots, k$ satisfying
$$\lambda_k > \bar{\lambda} = \frac{1}{p} \sum_{j=1}^p \lambda_j.$$
Tends to choose fewer components.
2. Likelihood ratio testing of the null hypothesis $H_0(k) : \lambda_{k+1} = \cdots = \lambda_p$, where $k$ components are used if $H_0(k)$ is not rejected at a specified level.
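Kaiser's rule amounts to a one-line comparison against the average eigenvalue. A minimal sketch, again assuming the built-in `USArrests` data as a stand-in:

```r
x <- as.matrix(USArrests)                            # stand-in data set
lambda <- eigen(cov(x), symmetric = TRUE)$values     # sorted in decreasing order

lambda_bar <- mean(lambda)                           # average eigenvalue
keep <- which(lambda > lambda_bar)                   # indices of retained PCs
k_kaiser <- length(keep)
```

Because the eigenvalues are sorted in decreasing order, `keep` is always the leading block `1, ..., k_kaiser`.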

22 Which variables are most responsible for the principal components?
Loadings of the principal component directions.
Biplot: a scatterplot of PC1 and PC2 scores, overlaid with $p$ vectors, each representing the loadings of the first two PC directions.
In the Swiss Bank Note Data, the loadings are:
[Table: loadings, rows V1 to V6, columns Comp.1 to Comp.6.]

23 Example: Swiss Bank Note Data biplot
Which measurements are most responsible for the first two principal components?
[Figure: PC1 loadings and PC2 loadings panels; biplot with axes Comp.1 and Comp.2 and variable vectors V1 to V6.]

24 Computation of PCA
PCA is computed either using the eigenvalue decomposition of $S = \frac{1}{n-1} \tilde{X} \tilde{X}^\top$ or using the singular value decomposition of $\tilde{X}$.
Eigen-decomposition of S. For $S = U \Lambda U^\top$:
1. PC directions: $u_k$ (eigenvectors).
2. Variance of PC scores: $\lambda_k$ (eigenvalues).
3. Matrix of principal component scores:
$$Z = U^\top \tilde{X} = \begin{bmatrix} z_{(1)} \\ \vdots \\ z_{(p)} \end{bmatrix}.$$

25 Computation of PCA
SVD of X. The singular value decomposition (SVD) of the $p \times n$ matrix $\tilde{X}$ has the form $\tilde{X} = U D V^\top$.
The left singular vectors $U = [u_1, \ldots, u_p]_{p \times p}$ and the right singular vectors $V = [v_1, \ldots, v_p]_{n \times p}$ are orthogonal ($U^\top U = I_p$, $V^\top V = I_p$).
The columns of $U$ span the column space of $\tilde{X}$; the columns of $V$ span the row space.
$D = \mathrm{diag}(d_1, \ldots, d_p)$, where $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$ are the singular values of $\tilde{X}$.

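26 Computation of PCA
SVD of X. With the SVD $\tilde{X} = U D V^\top$,
$$S = \frac{1}{n-1} \tilde{X} \tilde{X}^\top = \frac{1}{n-1} U D V^\top V D U^\top = U \,\mathrm{diag}\!\left(\frac{1}{n-1} d_j^2\right) U^\top.$$
1. PC directions: $u_k$ (left singular vectors).
2. Variance of PC scores: $\frac{1}{n-1} d_k^2$ (squared singular values).
3. Matrix of principal component scores (right singular vectors):
$$Z = U^\top \tilde{X} = D V^\top, \qquad \begin{bmatrix} z_{(1)} \\ \vdots \\ z_{(p)} \end{bmatrix} = \begin{bmatrix} d_1 v_1^\top \\ \vdots \\ d_p v_p^\top \end{bmatrix}.$$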
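The equivalence of the two computations can be confirmed numerically. The sketch below assumes the built-in `USArrests` data, transposed to the slides' $p \times n$ layout, as a stand-in.

```r
X <- unname(t(as.matrix(USArrests)))    # p x n, variables in rows (stand-in data)
n <- ncol(X)
Xc <- X - rowMeans(X)                   # center each variable (row)

sv <- svd(Xc)                           # Xc = U D V'
lambda_svd <- sv$d^2 / (n - 1)          # PC variances from singular values

S <- Xc %*% t(Xc) / (n - 1)
lambda_eig <- eigen(S, symmetric = TRUE)$values   # PC variances from eigenvalues

Z <- t(sv$u) %*% Xc                     # score matrix U' Xc
Z_from_V <- diag(sv$d) %*% t(sv$v)      # the same scores written as D V'
```

Both routes give the same variances, and the rows of `Z` are the scaled right singular vectors $d_k v_k^\top$.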

27 PCA in R
The standard data format is the $n \times p$ data frame or matrix x.
To perform PCA by eigen-decomposition:
    spr <- princomp(x)
    U <- spr$loadings
    L <- (spr$sdev)^2
    Z <- spr$scores
To perform PCA by singular value decomposition:
    gpr <- prcomp(x)
    U <- gpr$rotation
    L <- (gpr$sdev)^2
    Z <- gpr$x
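One practical difference between the two functions is worth knowing: by default `princomp` computes the covariance with divisor $n$, while `prcomp` uses $n - 1$, so the reported variances differ by the factor $n/(n-1)$. The sketch below illustrates this on the built-in `USArrests` data (a stand-in chosen for illustration).

```r
x <- as.matrix(USArrests)          # n x p stand-in data
n <- nrow(x)

spr <- princomp(x)                 # eigen-decomposition, divisor n
gpr <- prcomp(x)                   # SVD of centered data, divisor n - 1

L_eig <- unname(spr$sdev^2)
L_svd <- gpr$sdev^2
ratio <- L_svd / L_eig             # each entry is n / (n - 1)

# Directions agree up to the sign of each column
max_dir_diff <- max(abs(abs(unclass(spr$loadings)) - abs(gpr$rotation)))
```

For large $n$ the difference is negligible, but it matters when comparing output from the two functions on small samples.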

28 Correlation PCA
PCA is not scale invariant.
The correlation matrix of a random vector $X$ is given by
$$R = D_\Sigma^{-1/2} \Sigma D_\Sigma^{-1/2},$$
where $D_\Sigma$ is the $p \times p$ diagonal matrix consisting of the diagonal elements of $\Sigma$.
Correlation PCA: the PC directions and variances of PC scores are obtained by the eigen-decomposition $R = U_R \Lambda_R U_R^\top$.
Preferred if measurements are not commensurate (e.g., $X_1$ = household income, $X_2$ = years in school).
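Correlation PCA can be sketched in R either by forming $R$ explicitly or by asking `prcomp` to standardize the variables. The example assumes the built-in `USArrests` data as a stand-in, whose variables are on quite different scales.

```r
x <- as.matrix(USArrests)                      # stand-in data with mixed scales
S <- cov(x)

# R = D^{-1/2} S D^{-1/2}, with D the diagonal of S
D_inv_sqrt <- diag(1 / sqrt(diag(S)))
R <- D_inv_sqrt %*% S %*% D_inv_sqrt

lambda_R <- eigen(R, symmetric = TRUE)$values  # correlation-PCA variances

# Equivalent route: PCA on standardized variables
lambda_scaled <- prcomp(x, scale. = TRUE)$sdev^2
```

Since every standardized variable has unit variance, the eigenvalues of `R` sum to $p$.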

29 Olivetti Faces data
PCA for Olivetti Faces data.
Obtained from http://www.cs.nyu.edu/~roweis/data.html.
Grayscale faces, 8 bit [0-255], a few (10) images of several (40) different people. 400 total images, 64x64 size. From the Olivetti database at AT&T.
[Figure: 4 samples, 64 x 64 pixels; each pixel has a value in 0-255 (darker to lighter).]

30 Images as data
An image is a matrix-valued datum. In the Olivetti Faces data, the matrix is of size $64 \times 64$, with each pixel taking a value in [0, 255].
The matrix corresponding to one observation is vectorized (vec'd) by stacking each column into one long vector of size $d = 4096 = 64 \times 64$.
So $x_1$ is a $d \times 1$ vector corresponding to sample 1.
PCA is applied to the data matrix $X = [x_1, \ldots, x_n]$.
[Figure: sample 1, 64 x 64 pixels; each pixel has a value in 0-255 (darker to lighter).]

31 Olivetti Faces data: Major components
[Figure: scatter plot matrix of the scores along the PC1, PC2, PC3 and PC4 directions.]

32 Olivetti Faces data: Scree plots
[Figure: scree plots (proportion of variance vs. component) and cumulative scree plots (cumulative proportion of variance vs. component).]

33 Olivetti Faces data: Interpretation
Examine the mode of variation by walking along the PC direction from the mean. Here, at ±1 and ±2 standard deviations of $Z_{(k)}$.
[Figure: faces along individual PC directions.]

34 Olivetti Faces data: Interpretation (Eigenfaces)
PC walk, ±2 std along the PC direction (top: PC 1, middle: PC 2, bottom: PC 3).
- PC1: darker to lighter face
- PC2: feminine to masculine face
- PC3: oval to rectangular face

35 Olivetti Faces data: Interpretation (Eigenfaces Computation)
Vectorized data in $X = [x_1, \ldots, x_n]$, with mean $\bar{x}$. Denote by $(u_j, \lambda_j)$ the jth PC direction and PC variance.
Walk along the jth PC direction and examine positions $s = \pm 2, \pm 1, 0$ by:
1. Reconstruction at $s$: $w_s = \bar{x} + s \sqrt{\lambda_j}\, u_j$.
2. Conversion to an image by reshaping the vector $w_s$ into a matrix $W_s$.
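The PC walk can be sketched in R as follows. The "images" here are random 8 x 8 pixel arrays (a hypothetical stand-in, since the Olivetti data are not bundled with R), and the variable names are illustrative only.

```r
set.seed(4)
side <- 8; d <- side^2; n <- 30
X <- matrix(runif(d * n, 0, 255), d, n)   # hypothetical d x n matrix of vectorized images

xbar <- rowMeans(X)                        # mean image, as a vector
pc <- prcomp(t(X))                         # prcomp expects observations in rows

j <- 1                                     # walk along the first PC direction
u_j <- pc$rotation[, j]
sd_j <- pc$sdev[j]                         # sqrt(lambda_j)

s_grid <- c(-2, -1, 0, 1, 2)
# Step 1: w_s = xbar + s * sqrt(lambda_j) * u_j; step 2: reshape into a side x side matrix
walk <- lapply(s_grid, function(s) matrix(xbar + s * sd_j * u_j, side, side))

# image(walk[[1]], col = gray.colors(256))   # display one frame of the walk
```

`matrix()` fills column by column, which undoes the column-stacking vectorization described earlier.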

36 Olivetti Faces data: Reconstruction of original data
Representation of the original data using all $p$ principal components:
$$x_i = \bar{x} + \sum_{j=1}^p z_{(j)i} u_j, \quad (i = 1, \ldots, n).$$
Approximation of the original observation $x_i$ by the first $m < p$ principal components:
$$\hat{x}_i = \bar{x} + \sum_{j=1}^m z_{(j)i} u_j.$$
The larger $m$, the better the approximation by $\hat{x}_i$.
The smaller $m$, the more succinct the dimension reduction of $X$.
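The trade-off between $m$ and approximation quality can be seen with a short R sketch. It uses the built-in `USArrests` data as a stand-in (an assumption), and the helper `reconstruct` is a hypothetical name introduced here.

```r
x <- as.matrix(USArrests)            # n x p stand-in data, observations in rows
pc <- prcomp(x)
xbar <- pc$center                    # mean vector

# Hypothetical helper: rebuild the data from the first m principal components
reconstruct <- function(m) {
  Zm <- pc$x[, 1:m, drop = FALSE]           # scores on the first m PCs
  Um <- pc$rotation[, 1:m, drop = FALSE]    # first m PC directions
  sweep(Zm %*% t(Um), 2, xbar, "+")         # add the mean back to every row
}

# Total squared reconstruction error for m = 1, ..., p
err <- sapply(1:ncol(x), function(m) sum((x - reconstruct(m))^2))
```

The error never increases with $m$ and is numerically zero at $m = p$, where the representation is exact.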

37 Reconstruction of original face
Observation index i = 5 (5th face).
[Figure: reconstructions, from top left to bottom right, using (mean, 1, 5, 10) and (20, 50, 100, 400) PCs.]

38 Reconstruction of original face
Observation index i = 19 (19th face).
[Figure: reconstructions, from top left to bottom right, using (mean, 1, 5, 10) and (20, 50, 100, 400) PCs.]

39 Reconstruction of original face
Observation index i = 100 (100th face).
[Figure: reconstructions, from top left to bottom right, using (mean, 1, 5, 10) and (20, 50, 100, 400) PCs.]

40 Olivetti Faces data: Reconstruction of original face
Human eyes require more than 50 principal components to see the resemblance between $\hat{x}_i$ and $x_i$. This corresponds to about 90% of the variance explained by the PCs.
The decision on how many components to use is subjective and heuristic.
Reconstruction by PCA is most useful when
- each datum is visually represented (rather than being just numbers);
- for example, the data objects are images, functions, or shapes.

41 Dimension reduction for two sets of random variables
When the distinction between explanatory and response variables is not so clear, an analysis dealing with the two sets of variables in a symmetric manner is desired.
Examples:
1. Relationship between gene expressions and biological variables;
2. Relation between two sets of psychological tests, each with multidimensional measurements.

42 Example: nutrimouse dataset
A nutrition study in the mouse.
Reference: pharmaco-moleculaire/acceuil.html, Pascal Martin from the Toxicology and Pharmacology Laboratory (French National Institute for Agronomic Research). Obtained from the R package CCA.
$n = 40$ mice, each with $r = 120$ gene expression levels associated with the nutrition problem, and $s = 21$ measurements of concentrations of 21 hepatic fatty acids.
$X$ is $120 \times 40$, $Y$ is $21 \times 40$.

43 Example: nutrimouse dataset, Objective
Focus on the first 10 gene expression levels (the dimension $r$ of the first set of variables is now 10), so that $X$ is $10 \times 40$ and $Y$ is $21 \times 40$.
[Figure: XY correlation.]

44 Example: nutrimouse dataset, Objective
We are interested in dimension reduction of $X$ and $Y$ while keeping the important association between $X$ and $Y$.
Find a linear dimension reduction of $X$ and $Y$ using the cross-correlation matrix.
[Figure: X correlation, Y correlation, and cross-correlation panels.]

45 Canonical Correlation Analysis (CCA)
CCA seeks to identify and quantify the associations between two sets of variables.
Given two random vectors $X \in \mathbb{R}^r$ and $Y \in \mathbb{R}^s$ ($r = s$ or $r \neq s$), consider linear dimension reduction of each of the two random vectors:
$$\xi = g^\top X = g_1 X_1 + \cdots + g_r X_r, \qquad \omega = h^\top Y = h_1 Y_1 + \cdots + h_s Y_s.$$
CCA finds the random variables $(\xi, \omega)$, or the projection vectors $(g, h)$, that give maximal correlation between $\xi$ and $\omega$:
$$\mathrm{Corr}(\xi, \omega) = \frac{\mathrm{Cov}(\xi, \omega)}{\sqrt{\mathrm{Var}(\xi)\,\mathrm{Var}(\omega)}}.$$

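46 Population CCA 1
Denote $\Sigma_{11} = \mathrm{Cov}(X)$, $\Sigma_{22} = \mathrm{Cov}(Y)$, $\Sigma_{12} = \mathrm{Cov}(X, Y)$, $\Sigma_{21} = \mathrm{Cov}(Y, X)$.
The first set of projection vectors is
$$(g_1, h_1) = \operatorname{argmax}_{g \in \mathbb{R}^r,\ h \in \mathbb{R}^s} \mathrm{Corr}(g^\top X, h^\top Y) \quad (4)$$
$$= \operatorname{argmax}_{g \in \mathbb{R}^r,\ h \in \mathbb{R}^s} \frac{g^\top \Sigma_{12} h}{\sqrt{g^\top \Sigma_{11} g \; h^\top \Sigma_{22} h}}. \quad (5)$$
- $(g_1, h_1)$ are called the canonical correlation vectors.
- $(\xi_1 = g_1^\top X,\ \omega_1 = h_1^\top Y)$ are called the canonical variates.
- $\rho_1 = \mathrm{Corr}(\xi_1, \omega_1)$ is the largest canonical correlation.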

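47 Population CCA 2, 3, ...
The kth set of projection vectors, given $(g_1, \ldots, g_{k-1})$ and $(h_1, \ldots, h_{k-1})$, is
$$(g_k, h_k) = \operatorname{argmax}_{\substack{g \in \mathbb{R}^r,\ h \in \mathbb{R}^s \\ g^\top \Sigma_{11} g_j = 0,\ h^\top \Sigma_{22} h_j = 0,\ j = 1, \ldots, k-1}} \frac{g^\top \Sigma_{12} h}{\sqrt{g^\top \Sigma_{11} g \; h^\top \Sigma_{22} h}}. \quad (6)$$
- $(g_k, h_k)$ are the kth canonical correlation vectors.
- $(\xi_k = g_k^\top X,\ \omega_k = h_k^\top Y)$ are the kth canonical variates.
- $\rho_k = \mathrm{Corr}(\xi_k, \omega_k) \le \rho_j$, $j = 1, \ldots, k-1$.
- $\mathrm{Corr}(\xi_k, \xi_j) = 0$, $\mathrm{Corr}(\omega_k, \omega_j) = 0$, $j = 1, \ldots, k-1$.
- Generally, $g_i^\top g_j = 0$ is NOT true.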

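48 Sample CCA
For a sample of $n$ pairs $(x_i, y_i)$, $(i = 1, \ldots, n)$, denote $S_{11} = \widehat{\mathrm{Cov}}(X)$, $S_{22} = \widehat{\mathrm{Cov}}(Y)$, $S_{12} = \widehat{\mathrm{Cov}}(X, Y)$, $S_{21} = \widehat{\mathrm{Cov}}(Y, X)$.
The sample CCA is
$$(g_k, h_k) = \operatorname{argmax}_{\substack{g \in \mathbb{R}^r,\ h \in \mathbb{R}^s \\ g^\top S_{11} g_j = 0,\ h^\top S_{22} h_j = 0,\ j = 1, \ldots, k-1}} \frac{g^\top S_{12} h}{\sqrt{g^\top S_{11} g \; h^\top S_{22} h}}. \quad (7)$$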

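49 Finding CCA solution
First CC vectors: maximize
$$\frac{g^\top S_{12} h}{\sqrt{g^\top S_{11} g \; h^\top S_{22} h}}.$$
Change of variable: $a = S_{11}^{1/2} g$, $b = S_{22}^{1/2} h$. The problem is now to maximize
$$\frac{a^\top S_{11}^{-1/2} S_{12} S_{22}^{-1/2} b}{\sqrt{a^\top a \; b^\top b}}$$
with respect to $a \in \mathbb{R}^r$, $b \in \mathbb{R}^s$, and the solution is given by
$$a_1 = v_1(S_{11}^{-1/2} S_{12} S_{22}^{-1} S_{21} S_{11}^{-1/2}), \qquad b_1 = v_1(S_{22}^{-1/2} S_{21} S_{11}^{-1} S_{12} S_{22}^{-1/2}),$$
where $v_i(M)$ is the ith eigenvector (corresponding to the ith largest eigenvalue) of the symmetric matrix $M$.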

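50 Finding CCA solution
For $j = 1, \ldots, \min(s, r)$, we have
$$a_j = v_j(S_{11}^{-1/2} S_{12} S_{22}^{-1} S_{21} S_{11}^{-1/2}), \qquad b_j = v_j(S_{22}^{-1/2} S_{21} S_{11}^{-1} S_{12} S_{22}^{-1/2}),$$
$$g_j = S_{11}^{-1/2} a_j, \qquad h_j = S_{22}^{-1/2} b_j,$$
and thus the canonical variate scores are $\xi_j = g_j^\top X = (g_j^\top x_1, \ldots, g_j^\top x_n)$ and $\omega_j = h_j^\top Y = (h_j^\top y_1, \ldots, h_j^\top y_n)$.
Note that, for $j \neq k$, $g_j^\top g_k = a_j^\top S_{11}^{-1} a_k \neq 0$ (not orthogonal), while $\mathrm{Cov}(\{\xi_{j(i)}\}, \{\xi_{k(i)}\}) = g_j^\top S_{11} g_k = 0$ (uncorrelated).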
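The eigenvector recipe can be checked against base R's `cancor`. The sketch below runs on simulated data (hypothetical, generated so that the two sets are associated); the helper `inv_sqrt` is an illustrative name introduced here. It relies on the standard fact that the eigenvalues of the symmetric matrix above are the squared canonical correlations.

```r
set.seed(5)
n <- 100; r <- 3; s <- 2
x <- matrix(rnorm(n * r), n, r)                              # n x r, observations in rows
y <- cbind(x[, 1] + rnorm(n), x[, 2] - x[, 3] + rnorm(n))    # n x s, associated with x

S11 <- cov(x); S22 <- cov(y); S12 <- cov(x, y); S21 <- t(S12)

# Hypothetical helper: symmetric inverse square root via the eigen-decomposition
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
}

S11_is <- inv_sqrt(S11)
M <- S11_is %*% S12 %*% solve(S22) %*% S21 %*% S11_is
eM <- eigen(M, symmetric = TRUE)

k <- min(r, s)
rho <- sqrt(eM$values[1:k])              # sample canonical correlations
G <- S11_is %*% eM$vectors[, 1:k]        # g_j = S11^{-1/2} a_j, as columns

rho_cancor <- cancor(x, y)$cor           # reference values from base R
```

The columns of `G` are not orthogonal in general, but they are $S_{11}$-orthonormal, which is the uncorrelatedness noted above.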

51 nutrimouse data: Canonical correlation coefficients
Reduced dataset: $r = 10$ and $s = 21$ variables with $n = 40$ samples.
$\min(s, r) = 10$ pairs of canonical variates, with decreasing canonical correlation coefficients.
[Output: res.cc$cor, the 10 canonical correlation coefficients.]

52 nutrimouse data: Canonical variate scores plot
First four pairs of canonical variates $\{(\xi_{(j)i}, \omega_{(j)i}) : i = 1, \ldots, n\}$.
[Figure: scatter plots of canonical variate scores.]

53 nutrimouse data with r = 120
Original data, with $r > n$. Problem?
[Figure: XY correlation.]

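54 Regularized CCA
When $r > n$ and $s > n$, there always exist
$$(g_1, h_1) = \operatorname{argmax}_{g \in \mathbb{R}^r,\ h \in \mathbb{R}^s} \frac{g^\top S_{12} h}{\sqrt{g^\top S_{11} g \; h^\top S_{22} h}}$$
satisfying $\mathrm{Corr}(\xi_j, \omega_k) = \pm 1$ (perfect correlation!).
This is an artifact of $S_{11}$ and $S_{22}$ not being invertible (the ranks of $\tilde{X}$ and $\tilde{Y}$ are at most $n - 1$).
To circumvent this issue and obtain a meaningful estimate of the population canonical correlation coefficients, regularized CCA is often used:
$$(g_k, h_k) = \operatorname{argmax}_{\substack{g \in \mathbb{R}^r,\ h \in \mathbb{R}^s \\ g^\top S_{11} g_j = 0,\ h^\top S_{22} h_j = 0,\ j = 1, \ldots, k-1}} \frac{g^\top S_{12} h}{\sqrt{g^\top (S_{11} + \lambda_1 I_r) g \; h^\top (S_{22} + \lambda_2 I_s) h}}, \quad (8)$$
for $\lambda_1, \lambda_2 > 0$.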
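The artifact and its remedy can be illustrated on simulated data. In the sketch below, `x` and `y` are generated independently (so the population canonical correlations are zero) with more x-variables than samples. The helper `rcc_rho1` is a hypothetical name that evaluates the first regularized canonical correlation from the ridge-adjusted covariance matrices; it is not the implementation used by the CCA package.

```r
set.seed(6)
n <- 20; r <- 25; s <- 3
x <- matrix(rnorm(n * r), n, r)      # hypothetical data: r > n
y <- matrix(rnorm(n * s), n, s)      # generated independently of x

xc <- scale(x, scale = FALSE)        # centered data
yc <- scale(y, scale = FALSE)

# Unregularized: any projection of y is reproduced exactly by a projection of x
h <- c(1, 0, 0)
omega <- as.vector(yc %*% h)
xi <- qr.fitted(qr(xc), omega)       # least-squares projection onto the columns of xc
rho_unreg <- cor(xi, omega)          # numerically equal to 1

# Hypothetical helper: first canonical correlation with ridge terms added
rcc_rho1 <- function(lambda1, lambda2) {
  S11 <- cov(x) + lambda1 * diag(r)
  S22 <- cov(y) + lambda2 * diag(s)
  S12 <- cov(x, y)
  # s x s matrix with the same nonzero eigenvalues as the symmetric form
  M <- solve(S22, t(S12)) %*% solve(S11, S12)
  sqrt(max(Re(eigen(M)$values)))
}

rho_small <- rcc_rho1(0.01, 0.01)
rho_large <- rcc_rho1(1, 1)
```

Larger regularization parameters shrink the estimate further away from the spurious value of 1, which is why the choice of $\lambda_1, \lambda_2$ matters.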

55 nutrimouse data: Regularized CCA
1. The sample canonical correlation coefficients $\rho_j$ and
2. the corresponding projection vectors $h_j$ and $g_j$
vary depending on the choice of the regularization parameters $\lambda_1 = \lambda_2$ (the values shown include 0.01).

56 CCA in R
The data are an $n \times r$ data frame or matrix x and an $n \times s$ data frame or matrix y.
To perform CCA:
    cc <- cancor(x, y)
    rho <- cc$cor
    g <- cc$xcoef
    h <- cc$ycoef
To perform regularized CCA, use the package CCA:
    library(CCA)
    cc <- rcc(x, y, lambda1, lambda2)
    rho <- cc$cor
    g <- cc$xcoef
    h <- cc$ycoef
