What is Principal Component Analysis?
Principal component analysis (PCA):
- Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set
- Retains most of the sample's information
- Useful for the compression and classification of data
By "information" we mean the variation present in the sample, as given by the correlations between the original variables.
Principal Component Analysis
Most common form of factor analysis. The new variables/dimensions:
- are linear combinations of the original ones
- are uncorrelated with one another (orthogonal in the original dimension space)
- capture as much of the original variance in the data as possible
- are called Principal Components
- are ordered by the fraction of the total information each retains
Some Simple Demos
http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html
What are the new axes?
[Figure: scatter of the data over Original Variable A vs. Original Variable B, with the new axes PC 1 and PC 2 overlaid]
- Orthogonal directions of greatest variance in the data
- Projections along PC 1 discriminate the data most along any one axis
Principal Components
- The first principal component is the direction of greatest variability (variance) in the data
- The second is the next orthogonal (uncorrelated) direction of greatest variability
  - So first remove all the variability along the first component, then find the next direction of greatest variability
- And so on (see the sketch below)
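A minimal numpy sketch of this "find, remove, repeat" idea, using power iteration for the leading eigenvector; the helper first_pc and all data values are illustrative choices, not from the slides:

```python
import numpy as np

def first_pc(C, iters=200):
    # Power iteration: repeatedly multiply by C and renormalize; this
    # converges to the eigenvector with the largest eigenvalue.
    u = np.ones(C.shape[0]) / np.sqrt(C.shape[0])
    for _ in range(iters):
        u = C @ u
        u /= np.linalg.norm(u)
    return u

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3)) * np.array([3.0, 1.0, 0.3])  # toy n x p data
C = np.cov(X, rowvar=False)                 # covariance matrix of the variables

u1 = first_pc(C)                            # first PC: greatest variability
lam1 = u1 @ C @ u1                          # variance captured along u1
C2 = C - lam1 * np.outer(u1, u1)            # remove that variability (deflation)
u2 = first_pc(C2)                           # next direction of greatest variability
print(abs(u1 @ u2))                         # ~0: the two directions are orthogonal
```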
Principal Components Analysis (PCA)
Principle:
- Linear projection method to reduce the number of parameters
- Transforms a set of correlated variables into a new set of uncorrelated variables
- Maps the data into a space of lower dimensionality
- A form of unsupervised learning
Properties:
- It can be viewed as a rotation of the existing axes to new positions in the space defined by the original variables
- The new axes are orthogonal and represent the directions of maximum variability
Computing the Components
- Data points are vectors in a multidimensional space
- The projection of a vector x onto an axis (dimension) u is u^T x
- The direction of greatest variability is the one in which the average square of the projection is greatest, i.e. the u such that E((u^T x)^2) over all x is maximized
- (For simplicity, we subtract the mean along each dimension, centering the original axis system at the centroid of all data points)
- This u is the direction of the first Principal Component (see the sketch below)
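A small numpy illustration of this criterion; the toy data and the candidate direction u are made up for the example:

```python
import numpy as np

# Toy data: 200 samples of 2 correlated variables (illustrative values).
rng = np.random.default_rng(0)
x = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.0], [1.0, 1.0]], size=200)

x = x - x.mean(axis=0)        # subtract the mean along each dimension
u = np.array([1.0, 0.0])      # a candidate unit-length axis u

proj = x @ u                  # projection u^T x of every point onto u
print((proj ** 2).mean())     # E((u^T x)^2): the variance of the data along u
```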
Computing the Components E((u T x) 2 ) = E ((u T x) (u T x)) = E (u T x.x T u) The matrix C = x.x T contains the correlations (similarities) of the original axes based on how the data values project onto them So we are looking for w that maximizes u T Cu, subject to u being unit-length It is maximized when w is the principal eigenvector of the matrix C, in which case u T Cu = u T lu = l if u is unit-length, where l is the principal eigenvalue of the covariance matrix C The eigenvalue denotes the amount of variability captured along that dimension 25
Why the Eigenvectors?
- Maximise u^T x x^T u  s.t.  u^T u = 1
- Construct the Lagrangian: u^T x x^T u − λ u^T u
- Set the vector of partial derivatives to zero: x x^T u − λu = (x x^T − λI) u = 0
- Since u ≠ 0, u must be an eigenvector of x x^T with eigenvalue λ
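The same derivation with the differentiation step written out explicitly (a standard Lagrange-multiplier calculation, in the slide's notation):

```latex
\max_{u}\; u^{\top} x x^{\top} u \quad \text{s.t.} \quad u^{\top} u = 1
% Lagrangian with multiplier \lambda:
L(u, \lambda) = u^{\top} x x^{\top} u - \lambda\,(u^{\top} u - 1)
% xx^T is symmetric, so \nabla_u (u^\top A u) = 2 A u:
\nabla_{u} L = 2\, x x^{\top} u - 2\lambda u = 0
\quad\Longrightarrow\quad x x^{\top} u = \lambda u
```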
Singular Value Decomposition
- The first root is called the principal eigenvalue; it has an associated orthonormal (u^T u = 1) eigenvector u
- Subsequent roots are ordered such that λ1 > λ2 > … > λM, with rank(d) non-zero values
- The eigenvectors form an orthonormal basis, i.e. u_i^T u_j = δ_ij
- The eigenvalue decomposition: C = x x^T = UΣU^T, where U = [u1, u2, …, uM] and Σ = diag[λ1, λ2, …, λM]
- Similarly, the eigenvalue decomposition of x^T x = VΣV^T
- The SVD is closely related to the above: x = UΣ^(1/2)V^T
- U holds the left eigenvectors, V the right eigenvectors, and the singular values are the square roots of the eigenvalues
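A quick numpy check of this relationship (shapes and data invented for the example): the squared singular values of the centered data matrix equal the eigenvalues of C = x x^T.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 80))              # p = 5 variables (rows), n = 80 samples
x = x - x.mean(axis=1, keepdims=True)         # center each variable

C = x @ x.T                                   # C = x x^T, a p x p matrix
evals, U = np.linalg.eigh(C)                  # symmetric eigendecomposition C = U Sigma U^T
evals, U = evals[::-1], U[:, ::-1]            # reorder so lambda_1 > lambda_2 > ...

Usvd, s, Vt = np.linalg.svd(x, full_matrices=False)  # x = U Sigma^(1/2) V^T
print(np.allclose(s ** 2, evals))             # True: singular values = sqrt(eigenvalues)
```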
Computing the Components
- Similarly for the next axis, and so on
- So the new axes are the eigenvectors of the covariance matrix of the original variables, which captures the similarities of the original variables based on how the data samples project onto them
- Geometrically: centering followed by rotation, i.e. a linear transformation
PCs, Variance and Least-Squares
- The first PC retains the greatest amount of variation in the sample
- The k-th PC retains the k-th greatest fraction of the variation in the sample
- The k-th largest eigenvalue of the correlation matrix C is the variance in the sample along the k-th PC
- The least-squares view: PCs are a series of linear least-squares fits to a sample, each orthogonal to all previous ones
How Many PCs?
- For n original dimensions, the correlation matrix is n × n and has up to n eigenvectors, so n PCs
- Where does dimensionality reduction come from?
Dimensionality Reduction
- Can ignore the components of lesser significance
[Figure: bar chart of the variance (%) captured by each of PC1 through PC10, decreasing from left to right]
- You do lose some information, but if the eigenvalues are small, you don't lose much:
  - p dimensions in the original data
  - calculate p eigenvectors and eigenvalues
  - choose only the first d eigenvectors, based on their eigenvalues
  - the final data set has only d dimensions
(See the sketch of this selection below.)
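A hedged numpy sketch of that recipe; the toy data and the 95% variance threshold are choices made for the example, not prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))  # toy n x p data

X = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))  # p eigenvectors/eigenvalues
order = np.argsort(evals)[::-1]                         # largest eigenvalue first
evals, evecs = evals[order], evecs[:, order]

frac = np.cumsum(evals) / evals.sum()      # cumulative fraction of total variance
d = int(np.searchsorted(frac, 0.95)) + 1   # smallest d capturing 95% (a chosen threshold)
Y = X @ evecs[:, :d]                       # final data set: n samples, d dimensions
print(d, Y.shape)
```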
Eigenvectors of a Correlation Matrix
Geometric picture of principal components
- z1, the 1st PC, is a minimum-distance fit to a line in X space
- z2, the 2nd PC, is a minimum-distance fit to a line in the plane perpendicular to the 1st PC
- PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones
Optimality property of PCA
- Original data: X ∈ R^(p×n)
- Dimension reduction: Y = G^T X ∈ R^(d×n), where G ∈ R^(p×d)
- Reconstruction: X̂ = G(G^T X) ∈ R^(p×n)
Optimality property of PCA
Main theoretical result: the matrix G consisting of the first d eigenvectors of the covariance matrix S solves the following minimization problem:
  min over G ∈ R^(p×d) of ||X − G(G^T X)||_F^2, subject to G^T G = I_d
where ||X − X̂||_F^2 is the reconstruction error.
The PCA projection minimizes the reconstruction error among all linear projections of size d.
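A small numpy verification of this result on made-up data (d = 2 is an arbitrary choice): the squared Frobenius reconstruction error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 100))             # p = 6 variables, n = 100 samples
X = X - X.mean(axis=1, keepdims=True)         # centered, as assumed throughout

evals, evecs = np.linalg.eigh(X @ X.T)        # eigendecomposition of X X^T (ascending)
G = evecs[:, ::-1][:, :2]                     # first d = 2 eigenvectors; G^T G = I_d

X_hat = G @ (G.T @ X)                         # reconstruction G(G^T X)
err = np.linalg.norm(X - X_hat, 'fro') ** 2   # squared Frobenius reconstruction error
print(err, evals[:-2].sum())                  # equal: the error is the discarded variance
```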
Applications of PCA
- Eigenfaces for recognition. Turk and Pentland. 1991.
- Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo. 2001.
- Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien. 2003.
PCA applications: Eigenfaces
The principal eigenface looks like a bland, androgynous average human face.
http://en.wikipedia.org/wiki/image:eigenfaces.png
Eigenfaces: Face Recognition
- When properly weighted, eigenfaces can be summed together to create an approximate grayscale rendering of a human face
- Remarkably few eigenvector terms are needed to give a fair likeness of most people's faces
- Hence eigenfaces provide a means of applying data compression to faces for identification purposes
- Similarly: Expert Object Recognition in Video
PCA for image compression
[Figure: the same image reconstructed with d = 1, 2, 4, 8, 16, 32, 64, and 100 principal components, alongside the original image]
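A compact numpy sketch of this kind of compression; the synthetic 64 × 64 image stands in for a real one, so only d up to 64 applies here:

```python
import numpy as np

# Synthetic grayscale "image"; any 2-D array (e.g. loaded with imageio) works.
rng = np.random.default_rng(4)
img = np.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64)) + 0.1 * rng.random((64, 64))

mean = img.mean(axis=0)
A = img - mean                                # rows as samples, columns as variables
evals, evecs = np.linalg.eigh(A.T @ A)        # eigenvectors of the column covariance
G = evecs[:, ::-1]                            # largest-eigenvalue directions first

for d in (1, 2, 4, 8, 16, 32, 64):
    approx = (A @ G[:, :d]) @ G[:, :d].T + mean   # keep only d components
    print(f"d={d:3d}  error {np.linalg.norm(img - approx):8.4f}")  # shrinks as d grows
```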