PRINCIPAL COMPONENTS ANALYSIS

CHAPTER 11: PRINCIPAL COMPONENTS ANALYSIS

We now have the tools necessary to discuss one of the most important concepts in mathematical statistics: Principal Components Analysis (PCA). PCA involves the analysis of eigenvalues and eigenvectors of the covariance or correlation matrix. Its development relies on the following important facts:

Theorem 11.0.1: Diagonalization of Symmetric Matrices

All $n \times n$ real-valued symmetric matrices (like the covariance and correlation matrix) have two very important properties:

1. They have a complete set of $n$ linearly independent eigenvectors, $\{\mathbf{v}_1, \dots, \mathbf{v}_n\}$, corresponding to eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$.
2. Furthermore, these eigenvectors can be chosen to be orthonormal, so that if $V = [\mathbf{v}_1 \ \cdots \ \mathbf{v}_n]$ then $V^T V = I$, or equivalently, $V^{-1} = V^T$.

Letting $D$ be the diagonal matrix with $D_{ii} = \lambda_i$, the definition of eigenvalues and eigenvectors gives, for any symmetric matrix $S$,
$$SV = VD.$$
Thus, any symmetric matrix $S$ can be diagonalized in the following way:
$$V^T S V = D.$$
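The claims in Theorem 11.0.1 are easy to check numerically. Below is a minimal NumPy sketch (our own illustration, with a small symmetric matrix chosen arbitrarily) that verifies $V^T V = I$, $SV = VD$, and $V^T S V = D$:

import numpy as np

# Any real symmetric matrix will do for the check.
S = np.array([[2.0, 1.0],
              [1.0, 3.0]])

lam, V = np.linalg.eigh(S)   # eigenvalues (ascending) and orthonormal eigenvectors
D = np.diag(lam)

print(np.allclose(V.T @ V, np.eye(2)))   # True: V^T V = I, so V^{-1} = V^T
print(np.allclose(S @ V, V @ D))         # True: S V = V D
print(np.allclose(V.T @ S @ V, D))       # True: V^T S V = D

Note that eigh returns the eigenvalues in ascending order, whereas in the text we order them from largest to smallest.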

Covariance and correlation matrices (when there is no perfect multicollinearity in the variables) have the additional property that all of their eigenvalues are positive (nonzero): they are positive definite matrices. Now that we know we have a complete set of eigenvectors, it is common to order them according to the magnitude of their corresponding eigenvalues. From here on out, we will use $(\lambda_1, \mathbf{v}_1)$ to represent the largest eigenvalue of a matrix and its corresponding eigenvector. When working with a covariance or correlation matrix, the eigenvector associated with the largest eigenvalue is called the first principal component, and it points in the direction in which the variance of the data is maximal. Example 11.0.1 illustrates this point.

Example 11.0.1: Eigenvectors of the Covariance Matrix

Suppose we have a matrix of data for 10 individuals on 2 variables, $x_1$ and $x_2$. Plotted in the plane, the data appear as follows:

[Figure: scatter plot of the 10 data points.]

Our data matrix for these points is
$$X = \begin{pmatrix} 1 & 1 \\ 2 & 1 \\ 2 & 4 \\ 3 & 1 \\ 4 & 4 \\ 5 & 2 \\ 6 & 4 \\ 6 & 6 \\ 7 & 6 \\ 8 & 8 \end{pmatrix},$$
and the means of the variables in $X$ are
$$\bar{\mathbf{x}} = \begin{pmatrix} 4.4 \\ 3.7 \end{pmatrix}.$$
When thinking about variance directions, our first step should be to center the data so that it has mean zero. Eigenvectors measure the spread of data around the origin, whereas variance measures the spread of data around the mean. Thus, we need to equate the mean with the origin. To center the data, we simply compute
$$X_c = X - \mathbf{e}\bar{\mathbf{x}}^T = \begin{pmatrix} -3.4 & -2.7 \\ -2.4 & -2.7 \\ -2.4 & 0.3 \\ -1.4 & -2.7 \\ -0.4 & 0.3 \\ 0.6 & -1.7 \\ 1.6 & 0.3 \\ 1.6 & 2.3 \\ 2.6 & 2.3 \\ 3.6 & 4.3 \end{pmatrix},$$
where $\mathbf{e}$ is the vector of all ones. Examining the new centered data, we find that we've only translated our data in the plane; we haven't distorted it in any fashion.

Thus the covariance matrix is
$$S = \frac{1}{9}(X_c^T X_c) = \begin{pmatrix} 5.6 & 4.8 \\ 4.8 & 6.0111 \end{pmatrix}.$$
The eigenvalue and eigenvector pairs of $S$ are (with the eigenvectors rounded to 2 decimal places)
$$(\lambda_1, \mathbf{v}_1) = \left(10.6100,\ \begin{pmatrix} 0.69 \\ 0.72 \end{pmatrix}\right) \quad\text{and}\quad (\lambda_2, \mathbf{v}_2) = \left(1.0012,\ \begin{pmatrix} -0.72 \\ 0.69 \end{pmatrix}\right).$$
Let's plot the eigenvector directions on the same graph:

[Figure: the centered data with the directions $\mathbf{v}_1$ and $\mathbf{v}_2$ drawn through the origin.]
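These numbers can be reproduced in a few lines. The following NumPy sketch (variable names are ours; any linear algebra library would do) recomputes $S$ and its eigenpairs from the raw data:

import numpy as np

X = np.array([[1, 1], [2, 1], [2, 4], [3, 1], [4, 4],
              [5, 2], [6, 4], [6, 6], [7, 6], [8, 8]], dtype=float)

Xc = X - X.mean(axis=0)            # center: subtract the column means (4.4, 3.7)
S = Xc.T @ Xc / (len(X) - 1)       # covariance matrix, divided by n - 1 = 9

lam, V = np.linalg.eigh(S)         # eigh returns eigenvalues in ascending order
order = np.argsort(lam)[::-1]      # reorder so that lambda_1 >= lambda_2
lam, V = lam[order], V[:, order]

print(S)          # [[5.6     4.8   ]
                  #  [4.8     6.0111]]
print(lam)        # approximately [10.61, 1.0012]
print(V[:, 0])    # approximately +/-[0.69, 0.72] (eigenvectors are defined only up to sign)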

The eigenvector $\mathbf{v}_1$ is called the first principal component. It is the direction along which the variance of the data is maximal. The eigenvector $\mathbf{v}_2$ is the second principal component. In general, the second principal component is the direction, orthogonal to the first, along which the variance of the data is maximal (in two dimensions, there is only one such direction possible).

Why is this important? Let's consider what we've just done. We started with two variables, $x_1$ and $x_2$, which appeared to be correlated. We then derived new variables, $\mathbf{v}_1$ and $\mathbf{v}_2$, which are linear combinations of the original variables:
$$\mathbf{v}_1 = 0.69\,x_1 + 0.72\,x_2 \qquad (11.1)$$
$$\mathbf{v}_2 = -0.72\,x_1 + 0.69\,x_2 \qquad (11.2)$$
These new variables are completely uncorrelated. To see this, let's represent our data according to the new variables, i.e. let's change the basis from $B_1 = [x_1, x_2]$ to $B_2 = [\mathbf{v}_1, \mathbf{v}_2]$.

Example 11.0.2: The Principal Component Basis

Let's express our data in the basis defined by the principal components. We want to find coordinates (in a $10 \times 2$ matrix $A$) such that our original (centered) data can be expressed in terms of the principal components. This is done by solving for $A$ in the following equation (see Chapter 9, and note that the rows of $X$ define the points rather than the columns):
$$X_c = AV^T \qquad (11.3)$$
$$\begin{pmatrix} -3.4 & -2.7 \\ -2.4 & -2.7 \\ -2.4 & 0.3 \\ -1.4 & -2.7 \\ -0.4 & 0.3 \\ 0.6 & -1.7 \\ 1.6 & 0.3 \\ 1.6 & 2.3 \\ 2.6 & 2.3 \\ 3.6 & 4.3 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \\ a_{41} & a_{42} \\ a_{51} & a_{52} \\ a_{61} & a_{62} \\ a_{71} & a_{72} \\ a_{81} & a_{82} \\ a_{91} & a_{92} \\ a_{10,1} & a_{10,2} \end{pmatrix} \begin{pmatrix} \mathbf{v}_1^T \\ \mathbf{v}_2^T \end{pmatrix} \qquad (11.4)$$
Conveniently, our new basis is orthonormal, meaning that $V$ is an orthogonal matrix, so $A = X_c V$.

The new data coordinates reflect a simple rotation of the data around the origin:

[Figure: the data plotted in the principal component coordinates $\mathbf{v}_1$ and $\mathbf{v}_2$.]

Visually, we can see that the new variables are uncorrelated. You may wish to confirm this by calculating the covariance. In fact, we can do this in a general sense. If $A = X_c V$ is our new data, then its covariance matrix is diagonal:
$$S_A = \frac{1}{n-1}A^T A = \frac{1}{n-1}(X_c V)^T (X_c V) = \frac{1}{n-1}V^T (X_c^T X_c) V$$
$$= \frac{1}{n-1}V^T \big((n-1) S_X\big) V = V^T S_X V = V^T (V D V^T) V = D,$$
where $S_X = VDV^T$ comes from the diagonalization in Theorem 11.0.1. By changing our variables to principal components, we have managed to hide the correlation between $x_1$ and $x_2$ while keeping the spatial relationships between the data points intact. Transformation back to the variables $x_1$ and $x_2$ is easily done by using the linear relationships in Equations 11.1 and 11.2.
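The derivation can also be confirmed numerically with a short sketch (again our own, using the data and notation of the running example):

import numpy as np

X = np.array([[1, 1], [2, 1], [2, 4], [3, 1], [4, 4],
              [5, 2], [6, 4], [6, 6], [7, 6], [8, 8]], dtype=float)
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / 9

lam, V = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

A = Xc @ V               # coordinates of the data in the principal component basis
S_A = A.T @ A / 9        # covariance matrix of the new coordinates

# S_A is (numerically) diag(10.61, 1.0012): the new variables are uncorrelated.
print(np.round(S_A, 4))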

11.1 Comparison with Least Squares

In least squares regression, our objective is to maximize the amount of variance explained in our target variable. It may look as though the first principal component from Example 11.0.1 points in the direction of the regression line. This is not the case, however. The first principal component points in the direction of a line which minimizes the sum of squared orthogonal distances between the points and the line. Regressing $x_2$ on $x_1$, on the other hand, provides a line which minimizes the sum of squared vertical distances between the points and the line. This is illustrated in Figure 11.1.

[Figure 11.1: Principal Components vs. Regression Lines. The first principal component direction is shown alongside the least squares regression line.]

The first principal component about the mean of a set of points can be represented by the line which most closely approaches the data points, measured by squared orthogonal distance. In contrast, linear least squares tries to minimize the distance in the $y$ direction only. Thus, although the two use a similar error metric, linear least squares is a method that treats one dimension of the data preferentially, while PCA treats all dimensions equally.
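To make the contrast concrete, here is a small sketch of our own (using the data from Example 11.0.1) comparing the slope of the least squares line for $x_2$ on $x_1$ with the slope of the first principal component direction; the two do not coincide:

import numpy as np

X = np.array([[1, 1], [2, 1], [2, 4], [3, 1], [4, 4],
              [5, 2], [6, 4], [6, 6], [7, 6], [8, 8]], dtype=float)
Xc = X - X.mean(axis=0)
x1, x2 = Xc[:, 0], Xc[:, 1]

# Least squares slope for regressing x2 on x1: minimizes squared vertical distances.
ols_slope = (x1 @ x2) / (x1 @ x1)          # = 4.8 / 5.6, about 0.857

# First principal component: minimizes squared orthogonal distances.
lam, V = np.linalg.eigh(Xc.T @ Xc / 9)
v1 = V[:, np.argmax(lam)]
pca_slope = v1[1] / v1[0]                  # about 1.044, a steeper line

print(ols_slope, pca_slope)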

11.2 Covariance or Correlation Matrix?

Principal components analysis can involve eigenvectors of either the covariance matrix or the correlation matrix. When we perform this analysis on the covariance matrix, the geometric interpretation is simply centering the data and then determining the direction of maximal variance. When we perform this analysis on the correlation matrix, the interpretation is standardizing the data and then determining the direction of maximal variance. The correlation matrix is simply a scaled form of the covariance matrix. In general, these two methods give different results, especially when the scales of the variables are different.

The covariance matrix is the default for R; the correlation matrix is the default in SAS. In SAS, the covariance matrix method is invoked by the COV option:

proc princomp data=x cov;
    var x1 x2;
run;

Choosing between the covariance and correlation matrix can sometimes pose problems. The rule of thumb is that the correlation matrix should be used when the scales of the variables vary greatly; otherwise, the variables with the highest variance will dominate the first principal component. The argument against automatically using correlation matrices is that it is quite a brutal way of standardizing your data.

11.3 Applications of Principal Components

Principal components have a number of applications across many areas of statistics. In the next sections, we will explore their usefulness in the context of dimension reduction. In Chapter 14 we will look at how PCA is used to solve the issue of multicollinearity in biased regression.

11.3.1 PCA for dimension reduction

It is quite common for an analyst to have too many variables. There are two different solutions to this problem:

1. Feature Selection: Choose a subset of existing variables to be used in a model.
2. Feature Extraction: Create a new set of features which are combinations of the original variables.

Feature Selection

Let's think for a minute about feature selection. What are we really doing when we consider a subset of our existing variables? Take the two-dimensional data in Example 11.0.2 (while two dimensions rarely necessitate dimension reduction, the geometrical interpretation extends to higher dimensions as usual!). The centered data appears as follows:

[Figure: scatter plot of the centered data from Example 11.0.1.]

Now say we perform some kind of feature selection (there are a number of ways to do this, chi-square tests for instance) and we determine that the variable $x_2$ is more important than $x_1$. So we throw out $x_1$, and we've reduced the dimensions from $p = 2$ to $k = 1$. Geometrically, what does our new data look like? By dropping $x_1$ we set all of those horizontal coordinates to zero. In other words, we project the data orthogonally onto the $x_2$ axis:

[Figure 11.2: Geometrical Interpretation of Feature Selection. (a) Projecting the data orthogonally onto the $x_2$ axis; (b) the new one-dimensional data.]

Now, how much information (variance) did we lose with this projection? The total variance in the original data is
$$\|\mathbf{x}_1\|^2 + \|\mathbf{x}_2\|^2.$$
The variance of our data reduction is
$$\|\mathbf{x}_2\|^2.$$

Thus, the proportion of the total information (variance) we've kept is
$$\frac{\|\mathbf{x}_2\|^2}{\|\mathbf{x}_1\|^2 + \|\mathbf{x}_2\|^2} = \frac{6.01}{5.6 + 6.01} \approx 51.7\%,$$
where the common factor $1/(n-1)$ cancels, so we may work directly with the variances 5.6 and 6.01. Our reduced dimensional data contains only 51.7% of the variance of the original data. We've lost a lot of information! (A quick numerical check of this figure appears in the sketch after the summary below.)

The fact that feature selection omits variance in our predictor variables does not make it a bad thing! Obviously, getting rid of variables which have no relationship to a target variable (in the case of supervised modeling like prediction and classification) is a good thing. But in the case of unsupervised learning techniques, where there is no target variable involved, we must be extra careful when it comes to feature selection. In summary:

- Feature Selection is important. Examples include:
  - Removing variables which have little to no impact on a target variable in supervised modeling (forward/backward/stepwise selection).
  - Removing variables which have an obvious strong correlation with other predictors.
  - Removing variables that are not interesting in unsupervised learning (for example, you may not want to use the words "the" and "of" when clustering text).
- Feature Selection is an orthogonal projection of the original data onto the span of the variables you choose to keep.
- Feature selection should always be done with care and justification: in regression it could create problems of endogeneity (errors correlated with predictors, that is, omitted variable bias), and in unsupervised modelling it could lose important information.
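The 51.7% figure is easy to verify directly; here is a minimal sketch using the same data (variable names are ours):

import numpy as np

X = np.array([[1, 1], [2, 1], [2, 4], [3, 1], [4, 4],
              [5, 2], [6, 4], [6, 6], [7, 6], [8, 8]], dtype=float)
Xc = X - X.mean(axis=0)

var_x1 = Xc[:, 0] @ Xc[:, 0] / 9      # 5.6
var_x2 = Xc[:, 1] @ Xc[:, 1] / 9      # about 6.01

# Keeping only x2 keeps roughly 51.7% of the total variance.
print(var_x2 / (var_x1 + var_x2))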

Feature Extraction

PCA is the most common form of feature extraction. The rotation of the space shown in Example 11.0.2 represents the creation of new features which are linear combinations of the original features. If we have $p$ potential variables for a model and want to reduce that number to $k$, then the first $k$ principal components combine the individual variables in such a way that is guaranteed to capture as much information (variance) as possible.

Again, take our two-dimensional data as an example. When we reduce our data down to one dimension using principal components, we essentially do the same orthogonal projection that we did in Feature Selection, only in this case we conduct that projection in the new basis of principal components. Recall that for this data, our first principal component $\mathbf{v}_1$ was
$$\mathbf{v}_1 = \begin{pmatrix} 0.69 \\ 0.72 \end{pmatrix}.$$
Projecting the data onto the first principal component is illustrated in Figure 11.3.

[Figure 11.3: Illustration of Feature Extraction via PCA. (a) Projecting the data orthogonally onto $\mathbf{v}_1$; (b) the new one-dimensional data.]

How much variance do we keep with $k$ principal components? The proportion of variance explained by each principal component is the ratio of the corresponding eigenvalue to the sum of the eigenvalues (which gives the total amount of variance in the data).

Theorem 11.3.1: Proportion of Variance Explained

The proportion of variance explained by the projection of the data onto principal component $\mathbf{v}_i$ is
$$\frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}.$$
Similarly, the proportion of variance explained by the projection of the data onto the first $k$ principal components ($k < p$) is
$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{p} \lambda_j}.$$

In our simple 2-dimensional example we were able to keep
$$\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{10.61}{10.61 + 1.00} \approx 91.38\%$$
of our variance in one dimension.
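For comparison with the 51.7% retained by feature selection, the share of variance captured by the first principal component can be checked the same way (again a sketch with our own variable names):

import numpy as np

X = np.array([[1, 1], [2, 1], [2, 4], [3, 1], [4, 4],
              [5, 2], [6, 4], [6, 6], [7, 6], [8, 8]], dtype=float)
Xc = X - X.mean(axis=0)

lam = np.linalg.eigvalsh(Xc.T @ Xc / 9)   # eigenvalues, ascending: about [1.0012, 10.61]

# The first principal component alone keeps about 91.4% of the total variance.
print(lam.max() / lam.sum())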