Principal Component Analysis and Linear Discriminant Analysis


Principal Component Analysis and Linear Discriminant Analysis
Ying Wu
Electrical Engineering and Computer Science
Northwestern University, Evanston, IL 60208
http://www.eecs.northwestern.edu/~yingwu
1/29

Decomposition and Components
Decomposition is a great idea: linear decomposition onto a linear basis, e.g., the Fourier transform.
The bases construct (span) the feature space
  - they may or may not be orthogonal
  - they give the directions along which to find the components
  - specified vs. learnt?
The features are the image (or projection) of the original signal in the feature space
  - e.g., the orthogonal projection of the original signal onto the feature space
  - the projection does not have to be orthogonal
Feature extraction
2/29

Outline
  - Principal Component Analysis
  - Linear Discriminant Analysis
  - Comparison between PCA and LDA
3/29

Principal Components and Subspaces
Subspaces preserve part of the information (energy, or uncertainty) in the data.
Principal components
  - are orthogonal bases that preserve a large portion of the information in the data
  - capture the major uncertainties (or variations) of the data
Two views
  - Deterministic: minimizing the distortion of the projection of the data
  - Statistical: maximizing the uncertainty retained in the data
Are they the same? Under what condition are they the same?
4/29

View 1: Minimizing the MSE
x \mapsto (W W^T) x, with x \in \mathbb{R}^n, and assume the data are centered, E\{x\} = 0.
  - m is the dimension of the subspace, m < n
  - orthonormal bases W = [w_1, w_2, \ldots, w_m] with W^T W = I, i.e., a rotation
  - orthogonal projection of x: Px = \sum_{i=1}^{m} (w_i^T x)\, w_i = (W W^T) x
  - it achieves the minimum mean-square error (prove it!)
      e_{\min}^{MSE} = E\{\|x - Px\|^2\} = E\{\|P^{\perp} x\|^2\}, \quad P^{\perp} = I - W W^T
PCA can be posed as finding a subspace that minimizes the MSE:
  \arg\min_W J_{MSE}(W) = E\{\|x - Px\|^2\}, \quad \text{s.t. } W^T W = I
5/29

Let's do it...
It is easy to see that
  J_{MSE}(W) = E\{x^T x\} - E\{x^T P x\}
So, minimizing J_{MSE}(W) is equivalent to maximizing E\{x^T P x\}.
Then we have the following constrained optimization problem:
  \max_W E\{x^T W W^T x\}, \quad \text{s.t. } W^T W = I
The Lagrangian is
  L(W, \lambda) = E\{x^T W W^T x\} + \lambda^T (I - W^T W)
The set of KKT conditions gives:
  \partial L(W, \lambda) / \partial w_i = 2 E\{x x^T\} w_i - 2 \lambda_i w_i = 0, \quad \forall i
6/29
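
A short worked expansion of the "easy to see" step, added here as a sketch under the slide's own assumptions (P = W W^T with W^T W = I, so P is symmetric and idempotent):

```latex
% Sketch: why J_MSE(W) = E{x^T x} - E{x^T P x}, assuming P = W W^T and W^T W = I.
\begin{aligned}
J_{MSE}(W) &= E\{\|x - Px\|^2\} = E\{(x - Px)^T (x - Px)\} \\
           &= E\{x^T x\} - 2E\{x^T P x\} + E\{x^T P^T P x\} \\
           &= E\{x^T x\} - 2E\{x^T P x\} + E\{x^T P x\}
              \qquad (P^T P = W W^T W W^T = W W^T = P) \\
           &= E\{x^T x\} - E\{x^T P x\}.
\end{aligned}
```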

What is it?
Let's denote S = E\{x x^T\} (note: E\{x\} = 0).
The KKT conditions give:
  S w_i = \lambda_i w_i, \quad \forall i
or, in a more concise matrix form:
  S W = W \Lambda, \quad \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)
What is this?
Then, the value of the minimum MSE is
  e_{\min}^{MSE} = \sum_{i=m+1}^{n} \lambda_i
i.e., the sum of the eigenvalues associated with the orthogonal complement of the PCA subspace.
7/29

View 2: Maximizing the Variation
Let's look at it from another perspective.
We have a linear projection of x onto a 1-d subspace: y = w^T x
  - an important note: E\{y\} = 0 since E\{x\} = 0
The first principal component of x is the direction for which the variance of the projection y is maximized
  - of course, we need to constrain w to be a unit vector
So we have the following optimization problem (what is it?):
  \max_w J(w) = E\{y^2\} = E\{(w^T x)^2\}, \quad \text{s.t. } w^T w = 1
  \max_w J(w) = w^T S w, \quad \text{s.t. } w^T w = 1
8/29

Maximizing the Variation (cont.)
The sorted eigenvalues of S are \lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n, with eigenvectors \{e_1, \ldots, e_n\}.
It is clear that the first PC is y_1 = e_1^T x.
This can be generalized to m PCs (where m < n) with one more constraint:
  E\{y_m y_k\} = 0, \quad k < m
i.e., each PC is uncorrelated with all previously found PCs.
The solution is w_k = e_k. Sounds familiar?
9/29
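
As a quick sanity check of this claim, here is a small NumPy sketch on synthetic, centered data (the data, dimensions, and seed are illustrative assumptions, not from the lecture); the projected variance along the top eigenvector of S should dominate the variance along any random unit direction:

```python
# Sketch: verify numerically that the top eigenvector of S maximizes the
# projected variance w^T S w over unit vectors. Synthetic data, illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3)) @ np.diag([3.0, 1.0, 0.3])   # anisotropic samples
X -= X.mean(axis=0)                                              # center: E{x} = 0
S = X.T @ X / len(X)                                             # sample covariance

eigvals, eigvecs = np.linalg.eigh(S)
e1 = eigvecs[:, -1]                                              # eigenvector of the largest eigenvalue
w = rng.standard_normal(3); w /= np.linalg.norm(w)               # a random unit direction

print("variance along e1:", e1 @ S @ e1)   # approximately the largest eigenvalue
print("variance along w :", w @ S @ w)     # never larger than the line above
```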

The Two Views Converge
The two views lead to the same result!
You should prove: uncorrelated components \Leftrightarrow orthonormal projection bases.
What if we are more greedy and ask for independent components?
  - Should we still expect orthonormal bases?
  - In which case do we still have orthonormal bases?
We'll see this in the next lecture.
10/29

The Closed-Form Solution
Learning the principal components from \{x_1, \ldots, x_N\}:
  1. calculate the mean m = \frac{1}{N} \sum_{k=1}^{N} x_k
  2. center the data: A = [x_1 - m, \ldots, x_N - m]
  3. calculate the scatter matrix S = \sum_{k=1}^{N} (x_k - m)(x_k - m)^T = A A^T
  4. eigenvalue decomposition: S = U \Sigma U^T
  5. sort the \lambda_i and e_i
  6. find the bases: W = [e_1, e_2, \ldots, e_m]
Note: The components of x are y = W^T (x - m), where x \in \mathbb{R}^n and y \in \mathbb{R}^m.
11/29
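
The recipe above translates almost line by line into NumPy; the following is a minimal sketch (the function name and input layout are assumptions, not from the lecture):

```python
# A minimal NumPy sketch of the closed-form PCA recipe above (steps 1-6).
# X holds N samples of dimension n as rows; m is the target subspace dimension.
import numpy as np

def pca_closed_form(X, m):
    """X: (N, n) data matrix, rows are samples. Returns (mean, W, Y)."""
    mean = X.mean(axis=0)                       # step 1: sample mean
    A = (X - mean).T                            # step 2: centered data, columns are x_k - m
    S = A @ A.T                                 # step 3: scatter matrix (n x n)
    eigvals, eigvecs = np.linalg.eigh(S)        # step 4: EVD of the symmetric matrix S
    order = np.argsort(eigvals)[::-1]           # step 5: sort eigenvalues in decreasing order
    W = eigvecs[:, order[:m]]                   # step 6: top-m eigenvectors as the bases
    Y = W.T @ A                                 # components y = W^T (x - m) for every sample
    return mean, W, Y

# usage: mean, W, Y = pca_closed_form(np.random.randn(200, 5), m=2)
```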

A First Issue
n is the dimension of the input data, N is the size of the training set. In practice, n >> N.
E.g., in image-based face recognition, if the resolution of a face image is 100 x 100, stacking all the pixels gives n = 10,000.
Note that S is an n x n matrix.
Difficulties:
  - S is ill-conditioned, since in general rank(S) << n
  - the eigenvalue decomposition of S is too demanding
So, what should we do?
12/29

Solution I: First Trick
A is an n x N matrix, so S = A A^T is n x n, but A^T A is only N x N.
Trick: do the eigenvalue decomposition on A^T A:
  A^T A e = \lambda e \implies A A^T (A e) = \lambda (A e)
i.e., if e is an eigenvector of A^T A, then Ae is an eigenvector of A A^T, and the corresponding eigenvalues are the same.
Don't forget to normalize Ae.
Note: This trick does not fully solve the problem, as we still need an eigenvalue decomposition of an N x N matrix, which can be fairly large in practice.
13/29
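
A minimal sketch of this trick, assuming a centered data matrix A whose columns are x_k - m (the helper name and signature are illustrative):

```python
# A minimal sketch of the N x N trick above: take the EVD of A^T A (N x N)
# and map its eigenvectors e to eigenvectors Ae of A A^T (n x n).
import numpy as np

def pca_small_gram(A, m):
    """A: (n, N) centered data with N << n. Returns the top-m bases W (n, m)."""
    gram = A.T @ A                                   # N x N instead of n x n
    eigvals, E = np.linalg.eigh(gram)                # EVD of A^T A
    order = np.argsort(eigvals)[::-1][:m]            # keep the m largest eigenvalues
    W = A @ E[:, order]                              # Ae are eigenvectors of A A^T ...
    W /= np.linalg.norm(W, axis=0, keepdims=True)    # ... but they must be re-normalized
    return W
```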

Solution II: Using SVD
Instead of doing EVD, doing SVD (singular value decomposition) is easier.
For A \in \mathbb{R}^{n \times N}:
  A = U \Sigma V^T
  - U \in \mathbb{R}^{n \times N}, with U^T U = I
  - \Sigma \in \mathbb{R}^{N \times N}, diagonal
  - V \in \mathbb{R}^{N \times N}, with V^T V = V V^T = I
14/29
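
Since A A^T = U \Sigma^2 U^T, the left singular vectors of A are exactly the PCA bases and the squared singular values are the eigenvalues of S; a minimal sketch, with assumed names:

```python
# A minimal sketch relating the (thin) SVD of A to the PCA bases.
import numpy as np

def pca_via_svd(A, m):
    """A: (n, N) centered data. Returns top-m bases W and the eigenvalues of S."""
    U, sing_vals, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD, U is n x N
    W = U[:, :m]                    # singular vectors already sorted by singular value
    eigvals = sing_vals[:m] ** 2    # eigenvalues of S = A A^T
    return W, eigvals
```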

Solution III: Iterative Solution
We can design an iterative procedure for finding W, i.e., W \leftarrow W + \Delta W.
Looking at the MSE-minimization view, our cost function is:
  \|x - \sum_{i=1}^{m} (w_i^T x) w_i\|^2 = \|x - \sum_{i=1}^{m} y_i w_i\|^2 = \|x - (W W^T) x\|^2
We can stop updating when the KKT conditions are met. The per-component update is:
  \Delta w_i = \gamma\, y_i \big[ x - \sum_{j=1}^{m} y_j w_j \big]
Its matrix form is the subspace learning algorithm:
  \Delta W = \gamma (x x^T W - W W^T x x^T W)
Two issues:
  - the orthogonality is not reinforced
  - slow convergence
15/29
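
A minimal sketch of this per-sample update (learning rate, epoch count, and initialization are illustrative assumptions; as noted above, orthogonality is not enforced by the update itself):

```python
# A minimal sketch of the gradient-style subspace learning update above,
# applied one sample at a time to a stream of centered data.
import numpy as np

def subspace_learning(X, m, gamma=0.01, epochs=20, seed=0):
    """X: (N, n) centered samples. Returns an (n, m) basis estimate W."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = np.linalg.qr(rng.standard_normal((n, m)))[0]   # random orthonormal start
    for _ in range(epochs):
        for x in X:
            x = x[:, None]                              # column vector (n, 1)
            W += gamma * (x @ x.T @ W - W @ W.T @ x @ x.T @ W)   # Delta W update
    return W
```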

Solution IV: PAST
To speed up the iteration, we can use recursive least squares (RLS). Consider the following cost function:
  J(t) = \sum_{i=1}^{t} \beta^{t-i} \|x(i) - W(t) y(i)\|^2
where \beta is the forgetting factor. W can be solved recursively by the following PAST algorithm:
  1. y(t) = W^T(t-1)\, x(t)
  2. h(t) = P(t-1)\, y(t)
  3. m(t) = h(t) / (\beta + y^T(t) h(t))
  4. P(t) = \frac{1}{\beta} \mathrm{Tri}[P(t-1) - m(t) h^T(t)]
  5. e(t) = x(t) - W(t-1)\, y(t)
  6. W(t) = W(t-1) + e(t)\, m^T(t)
16/29
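
A minimal sketch of this recursion, where Tri[.] is read as "keep the upper triangle and mirror it to restore symmetry" (one common interpretation); the initialization of W and P is an assumption, and the gain m(t) is renamed g to avoid clashing with the subspace dimension m:

```python
# A minimal sketch of the PAST recursion above (projection approximation
# subspace tracking with recursive least squares).
import numpy as np

def past(X, m, beta=0.99):
    """X: (N, n) stream of centered samples. Returns the tracked basis W (n, m)."""
    n = X.shape[1]
    W = np.eye(n, m)            # assumed initialization
    P = np.eye(m)               # inverse correlation estimate
    for x in X:
        x = x[:, None]
        y = W.T @ x                                   # step 1
        h = P @ y                                     # step 2
        g = h / (beta + (y.T @ h).item())             # step 3: the gain m(t)
        P = (P - g @ h.T) / beta                      # step 4
        P = np.triu(P) + np.triu(P, 1).T              # Tri[.]: keep P symmetric
        e = x - W @ y                                 # step 5
        W = W + e @ g.T                               # step 6
    return W
```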

Outline
  - Principal Component Analysis
  - Linear Discriminant Analysis
  - Comparison between PCA and LDA
17/29

Face Recognition: Does PCA Work Well?
The same face under different illumination conditions.
What does PCA capture? Is this what we really want?
18/29

From Descriptive to Discriminative
PCA extracts features (or components) that describe the pattern well.
Are they necessarily good for distinguishing between classes and separating patterns? Examples?
We need discriminative features. Supervision (or labeled training data) is needed.
The issues are:
  - How do we define the discriminant and the separability between classes?
  - How many features do we need?
  - How do we maximize the separability?
Here, we give an example: linear discriminant analysis.
19/29

Linear Discriminant Analysis
Find an optimal linear projection W that
  - captures the major differences between classes and discounts irrelevant factors
  - clusters the data in the projected discriminative subspace
20/29

Within-class and Between-class Scatters
We have two sets of labeled data: D_1 = \{x_1, \ldots, x_{n_1}\} and D_2 = \{x_1, \ldots, x_{n_2}\}.
Let's define some terms:
  - The centers of the two classes: m_i = \frac{1}{n_i} \sum_{x \in D_i} x
  - Data scatter (by definition): S = \sum_{x \in D} (x - m)(x - m)^T
  - Within-class scatter: S_w = S_1 + S_2
  - Between-class scatter: S_b = (m_1 - m_2)(m_1 - m_2)^T
21/29
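
These definitions translate directly into a short NumPy sketch (the function name and the (n_i, d) input layout are assumptions):

```python
# A minimal NumPy sketch of the two-class scatter matrices defined above.
import numpy as np

def scatters(X1, X2):
    """Return (S_w, S_b) for two classes given as (n_i, d) arrays of samples."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)            # class centers
    S1 = (X1 - m1).T @ (X1 - m1)                          # scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)                          # scatter of class 2
    Sw = S1 + S2                                          # within-class scatter
    Sb = np.outer(m1 - m2, m1 - m2)                       # between-class scatter
    return Sw, Sb
```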

Fisher Linear Discriminant
Input: two sets of labeled data, D_1 = \{x_1, \ldots, x_{n_1}\} and D_2 = \{x_1, \ldots, x_{n_2}\}.
Output: a 1-d linear projection w that maximizes the separability between these two classes.
  - Projected data: Y_1 = w^T D_1 and Y_2 = w^T D_2
  - Projected class centers: \tilde{m}_i = w^T m_i
  - Projected within-class scatter (a scalar in this case): \tilde{S}_w = w^T S_w w (prove it!)
  - Projected between-class scatter (a scalar in this case): \tilde{S}_b = w^T S_b w (prove it!)
Fisher Linear Discriminant:
  J(w) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{S}_1 + \tilde{S}_2} = \frac{\tilde{S}_b}{\tilde{S}_w} = \frac{w^T S_b w}{w^T S_w w}
22/29
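
Given S_w and S_b, evaluating the criterion for a candidate direction is a one-line computation; a small sketch (the function name is an assumption, and the scatter matrices could come from the scatters() sketch above):

```python
# Sketch: evaluate the Fisher criterion J(w) for a candidate direction w,
# given the within- and between-class scatter matrices S_w and S_b.
import numpy as np

def fisher_criterion(w, Sw, Sb):
    """J(w) = (w^T S_b w) / (w^T S_w w); larger means better class separation."""
    return float(w @ Sb @ w) / float(w @ Sw @ w)
```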

Rayleigh Quotient
Theorem: f(\lambda) = \|Ax - \lambda Bx\|_B, where \|z\|_B = z^T B^{-1} z, is minimized by the Rayleigh quotient
  \lambda = \frac{x^T A x}{x^T B x}
Proof. Let z = Ax - \lambda Bx. Then
  \frac{\partial f(\lambda)}{\partial \lambda} = -2 (Bx)^T B^{-1} z = -2\, x^T B^T B^{-1} z = -2\, x^T (Ax - \lambda Bx) = -2\, (x^T A x - \lambda\, x^T B x)
Setting this to zero gives the result.
23/29

Optimizing the Fisher Discriminant
Theorem: J(w) = \frac{w^T S_b w}{w^T S_w w} is maximized when S_b w = \lambda S_w w.
Proof. Let w^T S_w w = c \ne 0. We can construct the Lagrangian
  L(w, \lambda) = w^T S_b w - \lambda (w^T S_w w - c)
Then the KKT condition is
  \frac{\partial L(w, \lambda)}{\partial w} = 2 (S_b w - \lambda S_w w) = 0
It is clear that S_b w = \lambda S_w w.
24/29

An Efficient Solution
A naive solution is
  S_w^{-1} S_b w = \lambda w
Then we can do an EVD on S_w^{-1} S_b, which needs some computation. Is there a more efficient way?
Facts:
  - S_b w is along the direction of m_1 - m_2 (why?)
  - we don't care about \lambda, the scalar factor
So we can easily figure out the direction of w:
  w = S_w^{-1} (m_1 - m_2)
Note: rank(S_b) = 1.
25/29
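
A minimal sketch of this shortcut, with a small ridge term added as an assumption in case S_w is singular (the function name and inputs are illustrative):

```python
# Sketch of the efficient two-class Fisher solution: w = S_w^{-1} (m_1 - m_2).
import numpy as np

def fisher_direction(X1, X2, ridge=1e-8):
    """Return the (unit-norm) Fisher discriminant direction for two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(Sw + ridge * np.eye(Sw.shape[0]), m1 - m2)
    return w / np.linalg.norm(w)                              # the scale of w is irrelevant
```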

Multiple Discriminant Analysis
Now we have c classes:
  - within-class scatter (as before): S_w = \sum_{i=1}^{c} S_i
  - between-class scatter (a bit different from the 2-class case): S_b = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T
  - total scatter: S_t = \sum_{x} (x - m)(x - m)^T = S_w + S_b
MDA is to find a subspace with bases W that maximizes
  J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|}
26/29
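
A minimal sketch of these multi-class scatters, assuming the classes are given as a list of (n_i, d) arrays (an input format chosen for illustration):

```python
# Sketch of the multi-class scatter matrices S_w, S_b and S_t defined above.
import numpy as np

def mda_scatters(classes):
    """classes: list of (n_i, d) arrays, one per class. Returns (S_w, S_b, S_t)."""
    all_x = np.vstack(classes)
    m = all_x.mean(axis=0)                                    # global mean
    d = all_x.shape[1]
    Sw = np.zeros((d, d)); Sb = np.zeros((d, d))
    for Xi in classes:
        mi = Xi.mean(axis=0)
        Sw += (Xi - mi).T @ (Xi - mi)                         # per-class scatter S_i
        Sb += len(Xi) * np.outer(mi - m, mi - m)              # n_i (m_i - m)(m_i - m)^T
    St = (all_x - m).T @ (all_x - m)                          # total scatter = S_w + S_b
    return Sw, Sb, St
```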

The Solution to MDA
The solution is obtained by a generalized EVD (G-EVD):
  S_b w_i = \lambda_i S_w w_i
where each w_i is a generalized eigenvector.
In practice, what we can do is the following:
  - find the eigenvalues as the roots of the characteristic polynomial |S_b - \lambda_i S_w| = 0
  - for each \lambda_i, solve w_i from (S_b - \lambda_i S_w) w_i = 0
Note: W is not unique (only defined up to rotation and scaling).
Note: rank(S_b) \le (c - 1) (why?)
27/29
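
A minimal sketch using scipy.linalg.eigh, which solves the symmetric-definite generalized eigenvalue problem directly (this assumes S_w is positive definite; names are illustrative):

```python
# Sketch: solve the generalized eigenvalue problem S_b w = lambda S_w w and keep
# the top c-1 directions, reflecting rank(S_b) <= c - 1.
import numpy as np
from scipy.linalg import eigh

def mda_bases(Sb, Sw, num_classes):
    """Return up to (c-1) MDA projection directions, sorted by eigenvalue."""
    eigvals, eigvecs = eigh(Sb, Sw)                 # generalized EVD, ascending order
    order = np.argsort(eigvals)[::-1]               # largest generalized eigenvalues first
    W = eigvecs[:, order[:num_classes - 1]]         # only c-1 directions are informative
    return W
```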

Outline
  - Principal Component Analysis
  - Linear Discriminant Analysis
  - Comparison between PCA and LDA
28/29

The Relation between PCA and LDA
[Figure: a 2-d illustration comparing the PCA projection direction with the LDA projection direction.]
29/29