Principal Component Analysis

November 24, 2015

From data to operators

Given is a data set $X$ consisting of $N$ vectors $x_n \in \mathbb{R}^D$. Without loss of generality, assume the mean $\bar{x} = \frac{1}{N}\sum_{n} x_n = 0$ (subtract the mean). Let $P$ be the $D \times N$ matrix whose columns are the data vectors $x_n$, and let $H = \operatorname{span}\{x_n\}_{n=1}^N \subseteq \mathbb{R}^D$. Define
$$L : H \to \mathbb{R}^N, \qquad v \mapsto L(v) = P^* v = \big(\langle v, x_n \rangle\big)_{n=1}^N \in \mathbb{R}^N,$$
and its adjoint
$$L^* : \mathbb{R}^N \to H \subseteq \mathbb{R}^D, \qquad w \mapsto L^*(w) = \sum_{n=1}^N w[n]\, x_n, \qquad w = (w[1], w[2], \ldots, w[N]).$$
$L$ is called the Bessel (analysis) operator, and $L^*$ is called the synthesis operator.
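A minimal NumPy sketch (not part of the original notes) of these two operators on a toy, mean-centered data set; the helper names analysis and synthesis are illustrative, and the adjoint relation $\langle L(v), w\rangle = \langle v, L^*(w)\rangle$ is checked numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 4, 6
P = rng.standard_normal((D, N))      # columns are the data vectors x_n
P -= P.mean(axis=1, keepdims=True)   # subtract the mean so the data are centered

def analysis(v):
    """Bessel (analysis) operator: L(v) = (<v, x_1>, ..., <v, x_N>) = P^T v."""
    return P.T @ v

def synthesis(w):
    """Synthesis operator: L*(w) = sum_n w[n] x_n = P w."""
    return P @ w

# Adjoint check: <L(v), w> = <v, L*(w)> for arbitrary v and w.
v, w = rng.standard_normal(D), rng.standard_normal(N)
print(np.allclose(analysis(v) @ w, v @ synthesis(w)))   # True
```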

Derivation of Covariance

The frame operator $S$ can be written as
$$S : H \to H, \qquad v \mapsto \sum_{n=1}^N \langle v, x_n \rangle\, x_n = (PP^*)\,v,$$
where $PP^*$ is $D \times D$. Hence, up to a scaling factor (of $1/N$) and a translation (mean subtraction), $S$ is the linear operator identified with the $D \times D$ symmetric covariance matrix $C = \frac{1}{N} PP^*$ of the data $X$, i.e.
$$C = \frac{1}{N} \left( \sum_{j=1}^N x_j[m]\, x_j[n] \right)_{m,n=1}^D, \qquad x_j = (x_j[1], \ldots, x_j[D]) \in \mathbb{R}^D.$$
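A small numerical check, assuming the same toy setup as above, that the frame operator $v \mapsto \sum_n \langle v, x_n\rangle x_n$ coincides with $v \mapsto (PP^T)v$ and that $C = \frac{1}{N}PP^T$ matches the entrywise covariance formula.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 3, 5
P = rng.standard_normal((D, N))
P -= P.mean(axis=1, keepdims=True)   # mean-centered columns, as assumed above

v = rng.standard_normal(D)
S_v = sum((v @ P[:, n]) * P[:, n] for n in range(N))   # frame operator applied to v
print(np.allclose(S_v, P @ P.T @ v))                   # True

C = P @ P.T / N                                        # covariance matrix
C_entrywise = np.array([[sum(P[m, j] * P[n, j] for j in range(N)) / N
                         for n in range(D)] for m in range(D)])
print(np.allclose(C, C_entrywise))                     # True
```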

Principal Component Analysis

The covariance matrix $C$ we have just defined is certainly symmetric, and it is also positive semidefinite, since for every vector $y$ we have
$$\langle y, Cy \rangle = \frac{1}{N} \sum_{j=1}^N \langle y, x_j \rangle^2 \;\ge\; 0.$$
Thus, $C$ can be diagonalized, and its eigenvalues are all nonnegative. If $K$ denotes the orthogonal matrix that diagonalizes $C$, then $K^T C K$ is diagonal, and the whole process of analyzing data in the eigenbasis of the covariance matrix is known as Principal Component Analysis (PCA). It is also known as the proper orthogonal decomposition, the Hotelling transform, or the Karhunen-Loève transform. The columns of $K$ are the eigenvectors of $C$. The number of positive eigenvalues is the actual number of uncorrelated parameters, or degrees of freedom, in the original data set $X$. Each eigenvalue represents the variance of its degree of freedom.
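A minimal sketch of this diagonalization step using numpy.linalg.eigh (suited to symmetric matrices); it verifies numerically that $K^T C K$ is diagonal and that the eigenvalues are nonnegative up to round-off.

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 3, 200
P = rng.standard_normal((D, N))
P -= P.mean(axis=1, keepdims=True)
C = P @ P.T / N                          # covariance matrix as defined above

eigvals, K = np.linalg.eigh(C)           # columns of K are eigenvectors of C
print(np.allclose(K.T @ C @ K, np.diag(eigvals)))   # True: K^T C K is diagonal
print(np.all(eigvals >= -1e-12))                     # True: eigenvalues are nonnegative
# Each eigenvalue is the variance along the corresponding eigenvector; the number of
# strictly positive eigenvalues is the number of degrees of freedom in the data.
```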

Illustration of Principal Component Analysis

The following plot contains a collection of points drawn from a multivariate Gaussian distribution centered at $(1, 3)$, with a standard deviation of 3 in the $(0.878, 0.478)$ direction and of 1 in the orthogonal direction. The vectors shown are the eigenvectors of the covariance matrix, scaled by the square root of the corresponding eigenvalue and shifted so their tails are at the mean. Source of imagery: Wikipedia.
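The figure itself is not reproduced here; the sketch below, using the numbers quoted above, generates data of this kind and computes the scaled eigenvectors that the arrows represent (plotting is omitted).

```python
import numpy as np

rng = np.random.default_rng(3)
u = np.array([0.878, 0.478]); u /= np.linalg.norm(u)   # principal direction
u_perp = np.array([-u[1], u[0]])                        # orthogonal direction
A = np.column_stack([3 * u, 1 * u_perp])                # std 3 along u, std 1 along u_perp
Y = np.array([1.0, 3.0]) + rng.standard_normal((5000, 2)) @ A.T   # samples centered at (1, 3)

C = np.cov(Y.T)
eigvals, K = np.linalg.eigh(C)
arrows = K * np.sqrt(eigvals)       # each column: eigenvector scaled by sqrt(eigenvalue)
print(Y.mean(axis=0), arrows)       # arrow tails would be placed at the sample mean
```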

History

K. Pearson, On lines and planes of closest fit to systems of points in space, Philosophical Magazine, vol. 2 (1901), pp. 559-572.
H. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology, vol. 24 (1933), pp. 417-441.
K. Karhunen, Zur Spektraltheorie stochastischer Prozesse, Ann. Acad. Sci. Fennicae, vol. 34 (1946).
M. Loève, Fonctions aléatoires du second ordre, in Processus stochastiques et mouvement brownien, p. 299, Paris (1948).

Data perspective

We shall now present a different, data-inspired model for PCA. Assume we have $D$ observed (measured) variables,
$$y = [y_1, \ldots, y_D]^T.$$
This is our data. Assume we know that our data are obtained by a linear transformation $W$ from $d$ unknown variables $x = [x_1, \ldots, x_d]^T$:
$$y = Wx.$$
Typically we assume $d < D$. Assume, moreover, that the $D \times d$ matrix $W$ is a change of coordinate system, i.e., the columns of $W$ (equivalently, the rows of $W^T$) are orthonormal:
$$W^T W = \mathrm{Id}_d.$$
Note that $WW^T$ need not be an identity matrix.
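A minimal sketch of these assumptions, with a hypothetical $W$ obtained from a QR factorization so that its columns are orthonormal; it checks that $W^T W = \mathrm{Id}_d$ while $WW^T$ is not the identity.

```python
import numpy as np

rng = np.random.default_rng(4)
D, d = 5, 2
W, _ = np.linalg.qr(rng.standard_normal((D, d)))   # D x d matrix with orthonormal columns
x = rng.standard_normal(d)                         # unknown variables
y = W @ x                                          # observed variables y = W x

print(np.allclose(W.T @ W, np.eye(d)))             # True:  W^T W = Id_d
print(np.allclose(W @ W.T, np.eye(D)))             # False: W W^T is not the identity
```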

Given the above assumptions, the problem of PCA can be stated as follows: how can we find the transformation $W$ and the dimension $d$ from a finite number of measurements $y$? We shall need two additional assumptions: we assume that the unknown variables are Gaussian, and we assume that both the unknown variables and the observations have mean zero (this is easily guaranteed by subtracting the mean, or the sample mean).
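The mean-zero assumption is easy to enforce in practice; a one-line sketch, where the columns of a hypothetical matrix Y hold the measurements:

```python
import numpy as np

Y = np.random.default_rng(5).standard_normal((4, 10))   # columns are measurements y(n)
Y_centered = Y - Y.mean(axis=1, keepdims=True)          # subtract the sample mean
print(np.allclose(Y_centered.mean(axis=1), 0))          # True
```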

PCA from minimizing the reconstruction error

For a non-square matrix $W$ with linearly independent columns, the pseudoinverse is defined as
$$W^+ = (W^T W)^{-1} W^T.$$
In our case, $W^+ = W^T$. Thus, if $y = Wx$, we have
$$WW^T y = WW^T W x = W\,\mathrm{Id}_d\, x = Wx = y,$$
or, equivalently, $y - WW^T y = 0$. In the presence of noise we can no longer assume perfect reconstruction; hence, we shall minimize the reconstruction error defined as
$$E_y\big( \| y - WW^T y \|_2^2 \big).$$
It is not difficult to see that
$$E_y\big( \| y - WW^T y \|_2^2 \big) = E_y(y^T y) - E_y(y^T WW^T y).$$
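A short numerical check, under the orthonormal-column assumption on $W$, of the two facts used above: $W^+ = W^T$, and the expansion of the squared reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(6)
D, d = 6, 3
W, _ = np.linalg.qr(rng.standard_normal((D, d)))   # orthonormal columns, so W^+ = W^T
print(np.allclose(np.linalg.pinv(W), W.T))         # True

y = rng.standard_normal(D)
err = y - W @ (W.T @ y)                            # reconstruction error vector
print(np.allclose(err @ err, y @ y - y @ W @ W.T @ y))   # True: ||y - WW^T y||^2 = y^T y - y^T WW^T y
```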

PCA from minimizing the reconstruction error (continued)

Since $E_y(y^T y)$ is constant, minimizing the reconstruction error amounts to maximizing $E_y(y^T WW^T y)$. In reality we know little about $y$, so we have to rely on the measurements $y(n)$, $n = 1, \ldots, N$. Then
$$E_y(y^T WW^T y) \approx \frac{1}{N} \sum_{n=1}^N y(n)^T WW^T y(n) = \frac{1}{N} \operatorname{tr}\!\big(Y^T WW^T Y\big),$$
where $Y$ is the $D \times N$ matrix whose columns are the measurements $y(n)$. Using the Singular Value Decomposition (SVD) of $Y$, $Y = V \Sigma U^T$, we obtain
$$E_y(y^T WW^T y) \approx \frac{1}{N} \operatorname{tr}\!\big(U \Sigma^T V^T WW^T V \Sigma U^T\big).$$
Therefore, after some computations, we obtain
$$\operatorname*{argmax}_W \, E_y(y^T WW^T y) \approx V\, \mathrm{Id}_{D \times d}.$$
Thus $W \approx V\, \mathrm{Id}_{D \times d}$, and so $x \approx \mathrm{Id}_{d \times D}\, V^T y$.
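A minimal sketch, on synthetic noiseless data, of the resulting recipe: take $W$ to be the first $d$ left singular vectors of $Y$ (the columns of $V$ in $Y = V\Sigma U^T$). The names W_true, W_hat, and X_hat are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
D, d, N = 6, 2, 100
W_true, _ = np.linalg.qr(rng.standard_normal((D, d)))
X = rng.standard_normal((d, N))
Y = W_true @ X                                      # noiseless measurements, columns y(n)

V, sigma, Ut = np.linalg.svd(Y, full_matrices=False)   # Y = V diag(sigma) Ut, i.e. Y = V Sigma U^T
W_hat = V[:, :d]                                    # W ~ V Id_{D x d}
X_hat = W_hat.T @ Y                                 # x ~ Id_{d x D} V^T y

# W_hat spans the same subspace as W_true, so the reconstruction is exact here.
print(np.allclose(W_hat @ X_hat, Y))                # True
```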

PCA from maximizing the decorrelation

Another approach to PCA is to assume that the unknown variables are uncorrelated (in a statistical sense). In practice this boils down to the assumption that the covariance matrix $C_x$ of the unknown variables is diagonal. For the observed measurements we may write
$$C_y = E(yy^T) = E(Wxx^T W^T) = W\, E(xx^T)\, W^T = W C_x W^T.$$
Alternatively, because of the orthogonality of $W$,
$$C_x = W^T C_y W.$$
Now we use the eigendecomposition of $C_y$ (which exists, since $C_y$ is symmetric) to write $C_y = V \Lambda V^T$. This leads to
$$C_x = W^T V \Lambda V^T W.$$
This equality can hold, with $C_x$ diagonal, only when $W = V\, \mathrm{Id}_{D \times d}$. Hence, again, $x = \mathrm{Id}_{d \times D}\, V^T y$.
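A minimal sketch of the decorrelation view on synthetic data: with $W$ built from the top eigenvectors of $C_y$, the recovered covariance $W^T C_y W$ is (approximately) diagonal. The component variances 9 and 1 are an arbitrary choice for the example.

```python
import numpy as np

rng = np.random.default_rng(8)
D, d, N = 5, 2, 2000
W_true, _ = np.linalg.qr(rng.standard_normal((D, d)))
X = rng.standard_normal((d, N)) * np.array([[3.0], [1.0]])   # uncorrelated components, variances 9 and 1
Y = W_true @ X

C_y = Y @ Y.T / N                   # sample covariance of the observations
lam, V = np.linalg.eigh(C_y)
V = V[:, ::-1]                      # reorder so the largest eigenvalues come first
W = V[:, :d]                        # W = V Id_{D x d}
C_x = W.T @ C_y @ W
print(np.round(C_x, 2))             # approximately diag(9, 1): off-diagonal entries ~ 0
```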

Summary

To summarize, we have presented two different approaches which lead to the same result. Assume that our data $Y$ consist of $N$ measurements $y(n) \in \mathbb{R}^D$, $n = 1, \ldots, N$, obtained by an orthogonal transformation $W$ from $N$ (a priori) unknown measurements $x(n) \in \mathbb{R}^d$, $n = 1, \ldots, N$, with $d < D$ (represented by a $d \times N$ matrix $X$), so that
$$Y = WX.$$
Then, under some further auxiliary assumptions, we can find the variables $X$, the transformation $W$, and the dimension $d$ from the $N$ measurements $Y$ by letting
$$X = \mathrm{Id}_{d \times D}\, V^T Y,$$
where $V$ comes from the SVD of $Y$ and $d$ is the number of nonzero singular values of $Y$.
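Putting the summary together, a minimal end-to-end sketch on synthetic data: $d$ is read off as the number of (numerically) nonzero singular values, $W$ as the corresponding left singular vectors, and $X = \mathrm{Id}_{d \times D} V^T Y$. Names such as W_true and W_hat are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)
D, d_true, N = 7, 3, 50
W_true, _ = np.linalg.qr(rng.standard_normal((D, d_true)))
Y = W_true @ rng.standard_normal((d_true, N))        # Y = W X, mean assumed zero

V, sigma, Ut = np.linalg.svd(Y, full_matrices=False) # Y = V Sigma U^T
d = int(np.sum(sigma > 1e-10 * sigma[0]))            # number of (numerically) nonzero singular values
W_hat = V[:, :d]                                     # recovered transformation
X_hat = W_hat.T @ Y                                  # X = Id_{d x D} V^T Y

print(d == d_true, np.allclose(W_hat @ X_hat, Y))    # True True
```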