Principal Component Analysis


Anders Øland and David Christiansen

1 Introduction

Principal Component Analysis, or PCA, is a commonly used multi-purpose technique in data analysis. It can be used for feature extraction, compression, classification, dimension reduction, and so on. There are various ways of approaching and implementing PCA. The two most standard ways of viewing it are:

1. variance maximization
2. minimum mean-square error compression

In the following we discuss PCA from the viewpoint of variance maximization. Although other interesting variants exist, such as probabilistic PCA (PPCA), we focus only on classic PCA; we find that discussing PPCA would be out of scope for this report.

PCA can be described as finding a new basis for some matrix A such that each vector in the basis maximizes the variance of A with respect to itself. In other words, the first vector in the new basis is the direction along which the data vary the most, the next is the direction along which they vary next-most, and so forth. The intuition is that the principal component along which there is the most variance is the one that is most important in the data. Hopefully, the majority of the variance will be accounted for by fewer principal components than the dimensionality of the original data.

PCA is deeply connected to the Singular Value Decomposition (SVD), which decomposes any matrix A with rank r into A = UΣV^T, with orthogonal matrices U and V and diagonal Σ. The non-zero values along the diagonal of Σ, called σ_1, ..., σ_r, are positive and ordered so that σ_n ≥ σ_{n+1}.

The first step in PCA is to center the data around its mean: simply subtract the mean of each dimension. If this were not done, then data in a dimension clustered around some point far from 0 would appear to be much more important relative to other dimensions whose data were clustered around 0.
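As a small illustration of this centering step (a toy example of our own, separate from the Matlab code in Appendix A), consider a data matrix whose rows are observations and whose columns are dimensions:

    % Toy data: 3 observations in 2 dimensions (made-up values)
    X = [2 10; 4 12; 6 14];
    % Subtract the mean of each dimension (column)
    Xc = X - ones(size(X, 1), 1) * mean(X);
    % Every column of Xc now has mean 0, as required before computing variances

Here mean(X) is a row vector of per-dimension means, and the ones(...) factor simply copies it to every row before subtracting.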

2 Variance and Covariance

The variance of a data set is a measure of how much the data are spread out: data have high variance if they lie far from their mean. In this section, assume that all data have a mean of 0, that is, the mean has already been subtracted. For a data vector a = (a_1, a_2, ..., a_n), the variance σ_a^2 is defined as

    σ_a^2 = (1/n) Σ_i a_i^2.

The covariance of two data vectors a and b with an equal number of elements n is defined by using the products of corresponding elements instead of the squares of individual values, that is,

    σ_ab = (1/n) Σ_i a_i b_i.

To the extent that the values in b are correlated with the corresponding values in a, σ_ab will be large. If they are negatively correlated, σ_ab is less than zero; if they are completely uncorrelated, σ_ab is equal to zero.

Covariance generalizes easily to matrices that consist of a number of data vectors. For an m × n matrix A whose rows are the data vectors, the covariance matrix is

    C_A = (1/n) A A^T.

C_A is an m × m symmetric matrix. The variance of each data vector is found on the diagonal, while the covariance of two vectors from A is found at the corresponding off-diagonal location in C_A.

3 PCA and Covariance Matrices

As we are attempting to find a new orthonormal basis in which some matrix A has maximal variance along the first vector, next-maximal variance along the second, and so forth, the covariance matrix C_A is a good starting point. If Y is A expressed in this new basis, then C_Y will be diagonal, and for all 1 ≤ i ≤ m − 1 the i-th element of the diagonal is greater than or equal to the (i+1)-th element. C_Y is diagonal because each component of the data in the new basis must be uncorrelated with the others, and we order the values in decreasing order so that the vectors that contribute the most come first. We can diagonalize the covariance matrix by finding its eigenvectors and eigenvalues. Therefore, the principal components of A are the eigenvectors of A's covariance matrix, and each corresponding eigenvalue is the variance of A along the associated basis vector.
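As a concrete sketch of Sections 2 and 3 (again a toy example of our own, not part of Appendix A), we can build two correlated zero-mean data vectors, form C_A = (1/n) A A^T, and diagonalize it:

    n = 1000;
    a = randn(1, n);                   % first data vector, approximately zero mean
    b = 0.8 * a + 0.2 * randn(1, n);   % second vector, positively correlated with a
    A = [a; b];                        % m x n matrix, one data vector per row
    A = A - mean(A, 2) * ones(1, n);   % force each row to have mean exactly 0
    C = (1 / n) * (A * A');            % m x m covariance matrix C_A
    [V, D] = eig(C);                   % columns of V: the principal components
                                       % diagonal of D: variances along them

The eigenvector in V whose eigenvalue in D is largest points in the direction along which a and b vary together the most.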

4 Recovering Principal Components from SVD

Keep in mind that, for orthogonal matrices Q, Q^{-1} = Q^T. Therefore, we can derive AV = UΣ from the SVD A = UΣV^T. From this we get A v_i = σ_i u_i for each 1 ≤ i ≤ r. Because each vector in U is a unit vector and the σs are in decreasing order of size, each u_i contributes more to the final result than the next. UΣ represents the data in the new basis, while V is the matrix that transforms A to that basis. Here, unlike in Section 2, we take the rows of A to be the observations and its columns to be the variables, so the covariance matrix of A is (1/n) A^T A.

PCA, then, can be implemented by doing the following to some matrix A that has been centered around the means:

1. Find the SVD of A/√n = UΣV^T. The division by √n is needed because (A/√n)^T (A/√n) = (1/n) A^T A, which is the covariance matrix of A.
2. We then know that (A/√n) V = UΣ. Because U and V are orthogonal matrices and Σ is diagonal with decreasing values on its diagonal, this satisfies the requirements for PCA: the columns of V are the principal components, and the squared singular values σ_i^2 are the corresponding variances.

5 Choosing the Number of Principal Components

Each principal component has a corresponding eigenvalue (or σ from the SVD) that indicates the extent to which it contributes to the final reconstruction of the data. PCA is useful to the extent that these values are not all equal; if they were all equal, no component would be more important than the others. When using PCA for data compression, the eigenvalues give a measure of how much information is lost by eliminating each vector. It is expected that the first few principal components will account for the majority of the original data, and that at some point there will be a sharp fall in the eigenvalues, indicating that the threshold has been reached.

6 Implementation and Results

6.1 Multivariate Gaussian Data - Two Dimensions

Perhaps the most immediately understandable illustration of PCA is to apply it to a bivariate Gaussian distribution that is already centered around the origin and plotted in a two-dimensional plane. Each basis vector found by PCA, scaled by the standard deviation of the data along it (the square root of the corresponding eigenvalue, as in the appendix function pcaplot), can then be plotted as a vector superimposed on the scatter plot of the points. The bivariate Gaussian data form a roughly ellipse-shaped blob on the graph. This ellipse appears rotated, so that the two perpendicular axes of the ellipse do not necessarily coincide with the x and y axes of the plane. PCA recovers vectors that match the axes of the elliptical area in which the data are found; the first principal component matches the longer of the axes. The components can be seen in Figure 1.
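The experiment can be reproduced with a short sketch of our own that follows the SVD recipe of Section 4 rather than the eig-based code in Appendix A (mvnrnd requires the Statistics Toolbox, and the covariance values are made up):

    covar = [9 5; 5 4];                      % hypothetical population covariance
    A = mvnrnd([0 0], covar, 1000);          % rows are observations
    A = A - ones(size(A, 1), 1) * mean(A);   % center each dimension
    n = size(A, 1);
    [U, S, V] = svd(A / sqrt(n), 'econ');    % columns of V: principal components
    variances = diag(S).^2;                  % variance of the data along each one
    explained = cumsum(variances) / sum(variances);  % cumulative share, cf. Section 5

The first column of V points along the long axis of the elliptical point cloud, and explained(1) reports how much of the total variance that single component accounts for.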

[Figure 1: Principal components of bivariate Gaussian data. Scatter plots of the centered data with the scaled principal component vectors drawn from the origin.]

[Figure 2: Test images for PCA image compression: (a) Dan Witzner Hansen, (b) smiley face, (c) the letter P, (d) the letter C, (e) the letter A.]

6.2 Image Compression

Here we demonstrate the use of PCA to determine the most important components of an image. The test images can be seen in Figure 2. All images are greyscale: the smiley face has only black and white pixels, while the letters have fuzzy edges. Each image was loaded into a Matlab matrix whose dimensions correspond to the pixel dimensions of the image and whose entries are the pixels' greyscale values, integers from 0 to 255. Next, the principal components of each image were determined, and new images were generated by keeping only the most important of the components. These images are presented in Figures 3 and 4.
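In condensed form, the reconstruction from the k strongest components amounts to the following sketch (a compressed version of the appendix functions pcatest and pcreduct; the file name is a placeholder and the image is assumed to be a single-channel greyscale file):

    I = double(imread('witzner.png'));         % image as a matrix of grey values
    [V, D] = eig(cov(I, 1));                   % principal components of the image
    [~, order] = sort(diag(D), 'descend');     % strongest components first
    V = V(:, order);
    k = 10;                                    % number of components to keep
    m = ones(size(I, 1), 1) * mean(I);         % per-column means
    R = (I - m) * V(:, 1:k) * V(:, 1:k)' + m;  % rank-k reconstruction
    imwrite(uint8(R), 'witzner_10.png');

Increasing k adds the weaker components back in, trading storage for fidelity.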

[Figure 3: Principal components of Dan Witzner Hansen: (a) the original image and reconstructions from (b) 1, (c) 2, (d) 3, (e) 5, (f) 7, (g) 10, (h) 20, and (i) 30 components.]

[Figure 4: Principal Components of Simpler Images. Columns: original, then reconstructions from 1, 2, 3, 4, 5, 7, and 15 components.]

[Figure 5: The letter P reconstructed from (a) 1, (b) 5, (c) 10, (d) 14, (e) 15, (f) 16, (g) 17, (h) 18, (i) 19, and (j) 20 components.]

In the case of Dr. Witzner Hansen, the image begins to be recognizable at around seven components, and the difference in visual quality between 1 and 10 components is much more noticeable than the difference between 10 and 20. Likewise, the 10 components added to get from 20 to 30 lead to only a very slight improvement in visual quality. Approximately 10 percent of the original data yields a quite recognizable image.

The relatively simple line drawings in Figure 4 are essentially identical to their initial uncompressed forms with only 15 components, with the exception of the letter P, for which a more detailed picture can be seen in Figure 5. Interestingly, it shows very little change between one and fourteen components, while drastic changes are evident from fifteen through nineteen components. From twenty onwards, the picture is basically identical to its original uncompressed form. This indicates that the data in the original picture may be much less correlated than the data of the other pictures.

7 Uses and Limitations of PCA

PCA is useful for recovering from measurement error, where the more important principal component or components are considered to be the signal and the remaining components the noise. It can also be used for lossy compression, by throwing out the least important components of the data.

PCA is useful for finding the axes along which Gaussian data are distributed. It is not particularly useful for data described by a non-linear variable or for non-Gaussian data; for example, two multivariate Gaussian distributions mixed in one data set would not be recovered. Generally speaking, PCA assumes that the relationships between the variables in the data are linear. If those relationships are instead non-linear, the principal components (or axes) do not constitute a proper representation of the data. Bishop [1] has a good example of when PCA would fail: data consisting of measurements of the coordinates of a person riding on a Ferris wheel. In that case, and in general, it is a good idea to look for higher-order dependencies in the data before applying PCA. If such dependencies exist, they may be removed by representing the data in a different way, in the current example by using polar coordinates instead of Cartesian coordinates.
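A minimal sketch of that last remedy (our own synthetic example; cart2pol is a standard Matlab function) shows how the change of representation removes the circular dependency before PCA is applied:

    theta = 2 * pi * rand(500, 1);             % angular positions on the wheel
    r = 10 + 0.1 * randn(500, 1);              % radius, nearly constant
    X = [r .* cos(theta), r .* sin(theta)];    % Cartesian data lie on a circle
    [th, rad] = cart2pol(X(:, 1), X(:, 2));    % re-represent in polar coordinates
    P = [th, rad];                             % after centering, PCA on P finds
                                               % nearly all variance in the angle

In Cartesian coordinates the two variables are non-linearly related, so no single linear component captures the structure; in polar coordinates the usual linear analysis of Sections 3 and 4 applies directly.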

References

[1] Bishop, Christopher. Pattern Recognition and Machine Learning, Chapter 12. Springer, 2006.
[2] Hyvärinen, Aapo, et al. Independent Component Analysis, Chapter 6. John Wiley & Sons, USA, 2001.
[3] Nabney, Ian. Netlab: Algorithms for Pattern Recognition, Section 7.1. Springer, 2002.
[4] Shlens, Jonathon. A Tutorial on Principal Component Analysis. Online: http://www.snl.salk.edu/~shlens/pca.pdf, accessed 1 Aug. 2011.
[5] Strang, Gilbert. Introduction to Linear Algebra, Fourth Edition, p. 457. Wellesley-Cambridge Press, USA, 2009.

A Matlab Source

function [V, D] = pca(data)
% PCA implementation
%
% INPUT:
%   data   Data to be analyzed (row vectors)
%
% OUTPUT:
%   V      Eigenvectors of the covariance matrix
%   D      Diagonal matrix with the eigenvalues of the covariance matrix
covariance = cov(double(data), 1);
[V, D] = eig(covariance);
end

function pcaplot(count, covar)
hold on;
clf;

% Generate data points
data = mvnrnd(zeros(count, 2), covar);

% Plot data points
minx = min(data(:, 1)) - 1;
miny = min(data(:, 2)) - 1;
maxx = max(data(:, 1)) + 1;
maxy = max(data(:, 2)) + 1;
sides = max([abs(minx) abs(miny) abs(maxx) abs(maxy)]);
axis([-sides sides -sides sides]);
scatter(data(:, 1), data(:, 2), 4, [0.5 0.5 0.5]);

% Find PCs
[V, D] = pca(data);
[V, D] = pceigsort(V, D);
summ = V * sqrt(D);

hold on;
plot([0 summ(1, 1)], [0 summ(2, 1)], 'color', 'black', 'markersize', 8, 'linewidth', 3)
plot([0 summ(1, 2)], [0 summ(2, 2)], 'color', 'black', 'markersize', 8, 'linewidth', 3)
axis([-sides sides -sides sides]);
hold off;
end

function [] = pcatest(filename)

% Load image, convert to greyscale
I = imread(filename);
data = double(I);

% Get principal components
[V, D] = pca(data);

% Sort eigenvalues in descending order
% and permute V & D accordingly
[V, D] = pceigsort(V, D);

[~, x] = size(data);
for pccount = 1:x
    R = uint8(pcreduct(data, V, pccount));
    basename = strsplit(filename, '.');
    basename = strjoin(basename(1:length(basename)-1), '_');
    outputname = strjoin(strsplit(basename, '/'), '_');
    imwrite(R, strcat('output/', outputname, int2str(pccount), '.png'))
    % figure, imshow(R);
end
end

function [V, D] = pceigsort(V, D)
% Principal Components Eigen Sort
% Useful when using the maximum variance method for PCA

% Sort eigenvalues in descending order
[~, permutation] = sort(diag(D), 'descend');

% Permute the eigenvectors in V & eigenvalues in D accordingly
V = V(:, permutation);
D = D(permutation, permutation);
end

function [R] = pcreduct(data, V, numofpc)
% Principal Components Reduction
% Reduce the dimensionality of the data
% using the principal components

% Project the data
dmean = ones(size(data, 1), 1) * mean(data);
proj = (data - dmean) * V(:, 1:numofpc);
R = proj * V(:, 1:numofpc)' + dmean;
end
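Assuming the functions above are on the Matlab path, that an output/ directory exists for pcatest to write into, and that a suitable greyscale image is available (the file name and covariance values below are placeholders), the experiments of Section 6 can be reproduced with calls such as:

    pcaplot(500, [9 5; 5 4]);       % bivariate Gaussian data with principal axes, as in Figure 1
    pcatest('images/witzner.png');  % reconstructions from 1, 2, ... components, as in Figures 3-5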