Principal Component Analysis

Anders Øland
David Christiansen

1 Introduction

Principal Component Analysis, or PCA, is a commonly used multi-purpose technique in data analysis. It can be used for feature extraction, compression, classification, dimension reduction, and more. There are various ways of approaching and implementing PCA. The two most standard ways of viewing it are:

1. variance maximization
2. minimum mean-square error compression

In the following we discuss PCA from the view of variance maximization. Although other interesting variants exist, such as probabilistic PCA (PPCA), we shall focus only on classic PCA.¹

PCA can be described as finding a new basis for some matrix A such that each vector in the basis maximizes the variance of A with respect to itself. In other words, the first vector in the new basis is the dimension along which the data varies the most, the next is that along which it varies next-most, and so forth. The intuition is that the principal component along which there is the most variance is the one that is most important in the data. Hopefully, the majority of the variance will be accounted for by fewer principal components than the dimensionality of the original data.

PCA is deeply connected to the Singular Value Decomposition (SVD), which decomposes any matrix A with rank r into UΣV^T, with orthogonal matrices U and V and diagonal Σ. The non-zero values along the diagonal of Σ, called σ_1, ..., σ_r, are positive and satisfy σ_n ≥ σ_{n+1}.

The first step in PCA is to center the data around its mean: simply subtract the mean of each dimension. If this were not done, then data in a dimension clustered around some point far from 0 would appear to be much more important than other dimensions whose data were clustered around 0.

¹ We find that discussing PPCA would be out of scope for this report.
2 Variance and Covariance

The variance of a data set is a measure of how spread out the data are: data have high variance if they lie far from their mean. In this section, assume that all data have a mean of 0, that is, the mean has already been subtracted. For some data vector a = (a_1, a_2, ..., a_n), the variance σ_a^2 is defined as

    σ_a^2 = (1/n) Σ_i a_i^2

The covariance of two data vectors a and b with an equal number of elements n is defined by using the products of the corresponding elements instead of the squares of individual values, that is,

    σ_ab = (1/n) Σ_i a_i b_i

To the extent that the values in b are correlated with the corresponding values in a, σ_ab will be large. If they are negatively correlated, σ_ab is less than zero. If they are completely uncorrelated, σ_ab is equal to zero.

Covariance is easily generalized to matrices consisting of a number of data vectors. For an m × n matrix A, the covariance matrix is

    C_A = (1/n) A A^T

C_A is an m × m symmetric matrix. The variances of the individual vectors are found on the diagonal, while the covariance of two vectors from A is found at the corresponding off-diagonal location in C_A.

3 PCA and Covariance Matrices

As we are attempting to find a new orthonormal basis in which some matrix A has maximal variance along the first vector, next-maximal variance along the second, and so forth, the covariance matrix C_A is a good starting point. If Y is A expressed in this new basis, then C_Y will be diagonal, and for all 1 ≤ i ≤ n - 1, the i-th element of the diagonal is greater than or equal to the (i+1)-th element. The covariance matrix will be diagonal because each component of the new basis must be orthogonal to the others, so the off-diagonal covariances vanish; we order the values decreasingly so that the vectors that contribute the most come first. We can diagonalize the covariance matrix by finding its eigenvectors and eigenvalues.
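The computation just described can be checked numerically. The sketch below (NumPy, not the report's own Matlab code) builds the covariance matrix C_A = (1/n) A A^T for a small mean-centered data set and diagonalizes it; the particular data are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated data vectors as the rows of A (m = 2 dimensions, n = 500 samples).
x = rng.normal(size=500)
A = np.vstack([x, 2.0 * x + 0.1 * rng.normal(size=500)])

# Center each dimension around its mean.
A = A - A.mean(axis=1, keepdims=True)

# Covariance matrix C_A = (1/n) A A^T, an m x m symmetric matrix.
n = A.shape[1]
C_A = (A @ A.T) / n

# Diagonalize: the eigenvectors are the principal directions,
# the eigenvalues are the variances along them.
eigvals, V = np.linalg.eigh(C_A)

# Expressed in the new basis, the data have a diagonal covariance matrix
# with the eigenvalues on the diagonal.
Y = V.T @ A
C_Y = (Y @ Y.T) / n
```

Note that `np.linalg.eigh` is appropriate here because C_A is symmetric; it returns the eigenvalues in ascending order, so the last one corresponds to the first principal component.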
Therefore, the principal components of A are the eigenvectors of A's covariance matrix, and the corresponding eigenvalue is the variance of A along the associated vector in the basis.

4 Recovering Principal Components from SVD

Keep in mind that, for orthogonal matrices Q, Q^{-1} = Q^T. Therefore, we can derive AV = UΣ from the SVD A = UΣV^T. From this we get, for each 1 ≤ i ≤ r,

    A v_i = σ_i u_i

Because we know that each vector in U is a unit
vector and that the σs are in decreasing order of size, we know that each successive u contributes less to the final result than the previous one. UΣ represents the data in the new basis, while V is the matrix that transforms A into that basis. PCA, then, can be implemented by applying the following to a matrix A that has been centered around its means:

1. Find the SVD of A/√n = UΣV^T. The division by √n is needed because (A/√n)^T (A/√n) = (1/n) A^T A, which is the covariance matrix for A.
2. We then know that (A/√n) V = UΣ. Because U and V are orthogonal matrices and Σ is diagonal with decreasing values on the diagonal, the requirements for PCA are satisfied.

5 Choosing the Number of Principal Components

Each principal component has a corresponding eigenvalue (or σ from the SVD) that indicates the extent to which it contributes to the final reconstruction of the data. PCA is useful to the extent that these coefficients are not equal: if they are all equal, then no component is more important than the others. When using PCA for data compression, the eigenvalues give a measure of how much information is lost by eliminating each vector. It is expected that the first few principal components will account for the majority of the original data, and that at some point there will be a sharp fall in the eigenvalues, indicating that the useful threshold has been reached.

6 Implementation and Results

6.1 Multivariate Gaussian Data - Two Dimensions

Perhaps the most immediately understandable illustration of the function of PCA is to apply it to a bivariate Gaussian distribution that is already centered around the origin, plotted in a two-dimensional plane. The bases found by PCA, scaled by the projection of the data onto each basis vector, can then be plotted as vectors superimposed on the plot of the points. The bivariate Gaussian data form a roughly ellipse-shaped blob on the graph. This ellipse appears rotated, so that the two perpendicular axes
[Figure 1: Principal components of bivariate Gaussian data]
of the ellipse do not necessarily coincide with the x and y axes of the plane. PCA recovers vectors that match the axes of the elliptical region in which the data are found; the first principal component matches the longer of the axes. The components can be seen in Figure 1.

6.2 Image Compression

Here, we demonstrate the use of PCA to determine the most important components of an image. The test images used can be seen in Figure 2. All images are in grayscale. While the smiley face has only black and white pixels, the letters have fuzzy edges.

[Figure 2: Test images for PCA image compression: (a) Dan Witzner Hansen, (b) smiley face, (c) the letter P, (d) the letter C, (e) the letter A]

Each image was loaded into a matrix in Matlab whose dimensions correspond to the pixel dimensions of the image and in which each pixel's grayscale value is represented by an integer from 0 to 255. Next, the principal components of each image were determined, and new images were generated by keeping only the most important components. These images are presented in Figures 3 and 4.
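The compression step itself is small. The following NumPy sketch is a stand-in for the Matlab pipeline in Appendix A (the synthetic image below is invented for illustration, since the test images are not reproduced here); it reconstructs a grayscale image from its first k principal components via the SVD:

```python
import numpy as np

def pca_compress(img, k):
    """Reconstruct a grayscale image from its first k principal components."""
    img = img.astype(float)
    mean = img.mean(axis=0)           # per-column mean
    centered = img - mean             # center the data first
    # SVD of the centered matrix; singular values arrive in decreasing order.
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    # Keep only the k most important components.
    approx = (U[:, :k] * s[:k]) @ Vt[:k, :] + mean
    return np.clip(approx, 0, 255)    # stay in the valid grayscale range

# Synthetic stand-in for a 64 x 64 grayscale test image (values roughly 0..255).
rng = np.random.default_rng(1)
img = np.outer(np.linspace(0, 250, 64), np.linspace(1.0, 0.5, 64))
img += rng.normal(0, 2, size=img.shape)

err_2 = np.linalg.norm(pca_compress(img, 2) - img)
err_20 = np.linalg.norm(pca_compress(img, 20) - img)
# Reconstruction error shrinks as components are added back.
```

Keeping all 64 components reproduces the image essentially exactly, mirroring the behavior described for the test images below.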
[Figure 3: Principal components of Dan Witzner Hansen: the original image and reconstructions from 1, 2, 3, 5, 7, 10, 20, and 30 components]
[Figure 4: Principal components of the simpler images: the originals and reconstructions from 1, 2, 3, 4, 5, 7, and 15 components]
[Figure 5: The letter P, reconstructed from 1, 5, 10, and 14 through 20 components]

In the case of Dr. Witzner Hansen, the image begins to be recognizable at around seven components, and the difference in visual quality between 1 and 10 components is much more noticeable than the difference between 10 and 20. Likewise, the 10 components added to get from 20 to 30 components lead to only a very slight improvement in visual quality. Approximately 10 percent of the original data yields a quite recognizable image.

The relatively simple line drawings in Figure 4 are essentially identical to their uncompressed forms with only 15 components, with the exception of the letter P, for which a more detailed picture can be seen in Figure 5. Interestingly, it shows very little change between one and fourteen components, while drastic changes are evident from fifteen through nineteen components. From twenty components onwards, the picture is essentially identical to its original uncompressed form. This indicates that the data in the original picture may be much less correlated than the data of the other pictures.

7 Uses and Limitations of PCA

PCA is useful for recovering from measurement error, where the most important principal component or components are considered to be the signal and the remaining components the noise. Additionally, it can be used for lossy compression, by discarding the least important components of the data.

PCA is useful for finding the axes along which Gaussian data are distributed. It is not particularly useful for data described by non-linear relationships or for non-Gaussian data. For example, two multivariate Gaussian distributions in one data set would not be recovered. Generally speaking, PCA assumes that the relationships between the variables in the data are linear. If those relationships are instead non-linear, the principal components (or axes) do not constitute a proper representation of the data. Bishop [1] gives a good example of when PCA fails: data consisting of measurements of the coordinates of the position of a person riding on a Ferris wheel. In that case, and in general, it is a good idea to look for higher-order dependencies in the data before applying PCA. If such dependencies exist, they may be removed by representing the data in a different way; in the Ferris-wheel example, by using polar coordinates instead of Cartesian coordinates.
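Bishop's Ferris-wheel example can be simulated directly. For points on a circle, the covariance matrix has two (nearly) equal eigenvalues, so PCA finds no preferred direction even though the data are intrinsically one-dimensional in the angle. The NumPy sketch below, with invented numbers, is only meant to illustrate this failure mode:

```python
import numpy as np

# Cartesian positions of a rider on a Ferris wheel of radius 10:
# a non-linear, intrinsically one-dimensional curve.
theta = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
r = 10.0
A = np.vstack([r * np.cos(theta), r * np.sin(theta)])

# Center and form the covariance matrix C = (1/n) A A^T.
A = A - A.mean(axis=1, keepdims=True)
C = (A @ A.T) / A.shape[1]
eigvals = np.linalg.eigvalsh(C)

# Both variances come out near r^2 / 2: neither principal component
# dominates, so discarding either one loses half of the variance.

# In polar coordinates the non-linear dependency disappears:
# the radius is constant, and all variation lives in the angle.
radius = np.hypot(A[0], A[1])
```

Representing the data as (radius, angle) instead of (x, y) removes the dependency, after which the angle is identified as the only varying coordinate.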
References

[1] Bishop, Christopher. Pattern Recognition and Machine Learning, Chapter 12. Springer, 2006.
[2] Hyvärinen, Aapo, et al. Independent Component Analysis, Chapter 6. John Wiley & Sons, USA, 2001.
[3] Nabney, Ian. Netlab: Algorithms for Pattern Recognition, Section 7.1. Springer, 2002.
[4] Shlens, Jonathon. A Tutorial on Principal Component Analysis. Online, http://www.snl.salk.edu/~shlens/pca.pdf, accessed 1 Aug. 2011.
[5] Strang, Gilbert. Introduction to Linear Algebra, Fourth Edition, p. 457. Wellesley-Cambridge Press, USA, 2009.
A Matlab Source

function [ V, D ] = pca( data )
% PCA implementation
%
% INPUT:
%   data - Data to be analyzed (row vectors)
%
% OUTPUT:
%   V - Eigenvectors of the covariance matrix
%   D - Diagonal matrix with the eigenvalues of the covariance matrix
covariance = cov(double(data), 1);
[V, D] = eig(covariance);
end

function pcaplot( count, covar )
hold on; clf;
% Generate data points
data = mvnrnd(zeros(count, 2), covar);
% Plot data points
minx = min(data(:,1)) - 1;
miny = min(data(:,2)) - 1;
maxx = max(data(:,1)) + 1;
maxy = max(data(:,2)) + 1;
sides = max([abs(minx) abs(miny) abs(maxx) abs(maxy)]);
axis([-sides sides -sides sides]);
scatter(data(:,1), data(:,2), 4, [0.5 0.5 0.5]);
% Find PCs
[V, D] = pca(data);
[V, D] = pceigsort(V, D);
% Scale each principal direction by the standard deviation along it
summ = V * sqrt(D);
hold on;
plot([0 summ(1,1)], [0 summ(2,1)], 'color', 'black', 'markersize', 8, 'linewidth', 3)
plot([0 summ(1,2)], [0 summ(2,2)], 'color', 'black', 'markersize', 8, 'linewidth', 3)
axis([-sides sides -sides sides]);
hold off;
end

function [] = pcatest( filename )
% Load image, convert to greyscale
I = imread(filename);
data = double(I);
% Get principal components
[V, D] = pca(data);
% Sort eigenvalues in descending order
% and permute V & D accordingly
[V, D] = pceigsort(V, D);
[~, x] = size(data);
for pccount = 1:x,
    R = uint8(pcreduct(data, V, pccount));
    basename = strsplit(filename, '.');
    basename = strjoin(basename(1:length(basename)-1), '_');
    outputname = strjoin(strsplit(basename, '/'), '_');
    imwrite(R, strcat('output/', outputname, int2str(pccount), '.png'))
    % figure, imshow(R);
end
end

function [ V, D ] = pceigsort( V, D )
% Principal Components Eigen Sort
% Useful when using the maximum variance method for PCA
% Sort eigenvalues in descending order and permute the
% eigenvectors in V & eigenvalues in D accordingly
[~, permutation] = sort(diag(D), 'descend');
V = V(:, permutation);
D = D(permutation, permutation);
end

function [ R ] = pcreduct( data, V, numOfPC )
% Principal Components Reduction
% Reduce the dimensionality of the data
% using the principal components

% Project the data onto the first numOfPC components,
% then project back into the original space
dmean = ones(size(data, 1), 1) * mean(data);
proj = (data - dmean) * V(:, 1:numOfPC);
R = proj * V(:, 1:numOfPC)' + dmean;
end