PCA and LDA. Man-Wai MAK

PCA and LDA
Man-Wai MAK, Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University
enmwmak@polyu.edu.hk, http://www.eie.polyu.edu.hk/~mwmak
Reference: S.J.D. Prince, Computer Vision: Models, Learning, and Inference, Cambridge University Press, 2012
April 23, 2016

Overview
1. Dimension Reduction: why dimension reduction; reduction to one dimension
2. Principal Component Analysis: derivation of PCA; PCA on high-dimensional data; eigenfaces
3. Linear Discriminant Analysis: LDA on 2-class problems; LDA on multi-class problems

Why Dimension Reduction
Many applications produce high-dimensional vectors:
In face recognition, an image of size 360 x 260 pixels gives a dimension of 93,600.
In handwritten-digit recognition, a digit occupying 28 x 28 pixels gives a dimension of 784.
In speaker recognition, the dimension can be as high as 61,440 per utterance.
High-dimensional feature vectors can easily cause the curse-of-dimensionality problem.
Redundancy: some elements of the feature vectors are strongly correlated, so knowing one element largely determines another.
Irrelevancy: some elements of the feature vectors are irrelevant to the classification task.

Dimension Reduction
Dimensionality reduction attempts to find a low-dimensional (or hidden) representation h that can approximately explain the data x, so that

    x \approx f(h, \theta)    (1)

where f(·,·) is a function that takes the hidden variable h and a set of parameters θ. Typically, we choose the function family f(·,·) and then learn h and θ from training data.
Least squares criterion: given N training vectors X = {x_1, ..., x_N}, x_i ∈ R^D, we find the parameters θ and latent variables {h_i} that minimize the sum of squared errors:

    \hat{\theta}, \{\hat{h}_i\}_{i=1}^N = \arg\min_{\theta, \{h_i\}_{i=1}^N} \sum_{i=1}^N [x_i - f(h_i, \theta)]^T [x_i - f(h_i, \theta)]    (2)

Dimension Reduction: Reduce to 1-Dim
Approximate vector x_i by a scalar value h_i plus the global mean μ:

    x_i \approx \phi h_i + \mu, \quad \text{where } \mu = \frac{1}{N}\sum_{i=1}^N x_i, \quad \phi \in R^{D \times 1}

Assuming μ = 0, or that the vectors have been mean-subtracted, i.e., x_i ← x_i − μ for all i, we have x_i ≈ φ h_i. The least squares criterion becomes:

    \hat{\phi}, \{\hat{h}_i\}_{i=1}^N = \arg\min_{\phi, \{h_i\}_{i=1}^N} E(\phi, \{h_i\}) = \arg\min_{\phi, \{h_i\}_{i=1}^N} \sum_{i=1}^N [x_i - \phi h_i]^T [x_i - \phi h_i]    (3)

Dimension Reduction: Reduce to 1-Dim
Eq. (3) has a problem in that it does not have a unique solution: if we multiply φ by any constant α and divide the h_i by the same constant, we get the same cost, since (\alpha\phi)(h_i/\alpha) = \phi h_i. We make the solution unique by constraining the length of φ to be 1 using a Lagrange multiplier:

    E(\phi, \{h_i\}) = \sum_{i=1}^N (x_i - \phi h_i)^T (x_i - \phi h_i) + \lambda(\phi^T\phi - 1)
                     = \sum_{i=1}^N \left[ x_i^T x_i - 2 h_i \phi^T x_i + h_i^2 \right] + \lambda(\phi^T\phi - 1)

Dimension Reduction: Reduce to 1-Dim
Setting \partial E/\partial\phi = 0 and \partial E/\partial h_i = 0, we obtain:

    \sum_i x_i \hat{h}_i = \lambda\hat{\phi} \quad \text{and} \quad \hat{h}_i = \hat{\phi}^T x_i = x_i^T \hat{\phi}

Hence,

    \sum_i x_i (x_i^T \hat{\phi}) = \Big(\sum_i x_i x_i^T\Big)\hat{\phi} = \lambda\hat{\phi} \implies S\hat{\phi} = \lambda\hat{\phi}

where S is the covariance matrix of the training data (note that the x_i have been mean-subtracted). Therefore, \hat{\phi} is the first eigenvector of S, i.e., the eigenvector with the largest eigenvalue.
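The 1-D solution can be checked numerically. Below is a minimal sketch (NumPy and synthetic 2-D data are my assumptions, not part of the slides): mean-subtract the data, form S = Σ_i x_i x_i^T, take its leading eigenvector as φ̂, and set ĥ_i = x_i^T φ̂.

```python
import numpy as np

# Synthetic 2-D data with one dominant direction (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])

# Mean-subtract, as assumed in the derivation (x_i <- x_i - mu).
X = X - X.mean(axis=0)

# S = sum_i x_i x_i^T (rows of X are the x_i vectors).
S = X.T @ X

# phi_hat is the eigenvector of S with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)      # eigh: S is symmetric
phi = eigvecs[:, np.argmax(eigvals)]

# h_i = x_i^T phi_hat; reconstruction x_i ~ phi h_i.
h = X @ phi
X_recon = np.outer(h, phi)
print("mean squared reconstruction error:", np.mean((X - X_recon) ** 2))
```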

Dimension Reduction: Reduce to 1-Dim
[Figure 13.19 of Prince (2012), "Image preprocessing and feature extraction": reduction to a single dimension. a) Original data and the direction of maximum variance. b) The data are projected onto φ to produce a one-dimensional representation. c) To reconstruct the data, we re-multiply by φ; most of the original variation is retained.]
PCA extends this model to project high-dimensional data onto the K orthogonal dimensions with the most variance, producing a K-dimensional representation.

Principal Component Analysis
In PCA, the hidden variable h is multi-dimensional and φ becomes a rectangular matrix Φ = [φ_1 φ_2 ... φ_M], where M ≤ D. Each component of h weights one column of Φ so that the data are approximated as

    x_i \approx \Phi h_i, \quad i = 1, \ldots, N

The cost function is

    \hat{\Phi}, \{\hat{h}_i\}_{i=1}^N = \arg\min_{\Phi, \{h_i\}_{i=1}^N} E(\Phi, \{h_i\}_{i=1}^N) = \arg\min_{\Phi, \{h_i\}_{i=1}^N} \sum_{i=1}^N [x_i - \Phi h_i]^T [x_i - \Phi h_i]    (4)

Principal Component Analysis
To solve the non-uniqueness problem in Eq. (4), we enforce φ_d^T φ_d = 1 for all d using a set of Lagrange multipliers {λ_d}_{d=1}^D:

    E(\Phi, \{h_i\}) = \sum_{i=1}^N (x_i - \Phi h_i)^T (x_i - \Phi h_i) + \sum_{d=1}^D \lambda_d(\phi_d^T\phi_d - 1)
                     = \sum_{i=1}^N (x_i - \Phi h_i)^T (x_i - \Phi h_i) + \mathrm{tr}\{\Phi\Lambda\Phi^T - \Lambda\}
                     = \sum_{i=1}^N \left[ x_i^T x_i - 2 h_i^T \Phi^T x_i + h_i^T h_i \right] + \mathrm{tr}\{\Phi\Lambda\Phi^T - \Lambda\}    (5)

where Λ = diag{λ_1, ..., λ_D} and Φ = [φ_1 φ_2 ... φ_D].

Principal Component Analysis
Setting \partial E/\partial\Phi = 0 and \partial E/\partial h_i = 0, we obtain:

    \sum_i x_i \hat{h}_i^T = \hat{\Phi}\Lambda \quad \text{and} \quad \hat{h}_i = \hat{\Phi}^T x_i \implies \hat{h}_i^T = x_i^T\hat{\Phi}

where we have used the matrix-trace derivative \frac{\partial}{\partial X}\mathrm{tr}\{XBX^T\} = XB^T + XB. Therefore,

    \sum_i x_i x_i^T \hat{\Phi} = \hat{\Phi}\Lambda \implies S\hat{\Phi} = \hat{\Phi}\Lambda

So \hat{\Phi} comprises the M eigenvectors of S with the largest eigenvalues.
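A minimal sketch of the general PCA solution (NumPy and toy data are my assumptions): Φ̂ is taken as the M eigenvectors of S with the largest eigenvalues, and each code is ĥ_i = Φ̂^T(x_i − µ).

```python
import numpy as np

def pca(X, M):
    """Return the mean, the D x M matrix Phi of leading eigenvectors of S,
    and the M-dimensional codes h_i for the rows of X (one sample per row)."""
    mu = X.mean(axis=0)
    Xc = X - mu                           # mean-subtracted data
    S = Xc.T @ Xc                         # S = sum_i x_i x_i^T (D x D)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]     # largest eigenvalues first
    Phi = eigvecs[:, order[:M]]           # D x M
    H = Xc @ Phi                          # each row is h_i = Phi^T (x_i - mu)
    return mu, Phi, H

# Usage: reduce random 10-D data to M = 3 dimensions and reconstruct.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
mu, Phi, H = pca(X, M=3)
X_recon = mu + H @ Phi.T                  # x_i ~ mu + Phi h_i
```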

PCA on High-Dimensional Data
When the dimension D of x_i is very high, computing S and its eigenvectors directly is impractical. However, the rank of S is limited by the number of training examples: if there are N training examples, there are at most N − 1 eigenvectors with non-zero eigenvalues. If N ≪ D, the principal components can be computed more easily.
Let X be a data matrix comprising the mean-subtracted x_i in its columns. Then S = XX^T, and the eigen-decomposition of S is given by

    S\phi_i = XX^T\phi_i = \lambda_i\phi_i

Instead of performing the eigen-decomposition of XX^T, we perform the eigen-decomposition of the N × N matrix X^T X:

    X^T X\psi_i = \lambda_i\psi_i    (6)

Principal Component Analysis
Pre-multiplying both sides of Eq. (6) by X, we obtain

    XX^T(X\psi_i) = \lambda_i(X\psi_i)

This means that if ψ_i is an eigenvector of X^T X, then φ_i = Xψ_i is an eigenvector of S = XX^T. So all we need is to compute the N − 1 eigenvectors of X^T X, which has size N × N.
Note that φ_i computed in this way is unnormalized, so we normalize:

    \phi_i = \frac{X\psi_i}{\|X\psi_i\|}, \quad i = 1, \ldots, N-1
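The N × N trick can be illustrated with a short sketch (NumPy and random high-dimensional data are my assumptions): eigen-decompose X^T X, map each ψ_i back via φ_i = Xψ_i, normalize, and verify that the results are eigenvectors of S = XX^T without ever forming the D × D matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 5000, 50                       # dimension much larger than sample size
X = rng.normal(size=(D, N))           # columns are the samples
X = X - X.mean(axis=1, keepdims=True) # mean-subtract each row

# Eigen-decompose the small N x N matrix X^T X instead of the D x D matrix XX^T.
G = X.T @ X                           # N x N
eigvals, Psi = np.linalg.eigh(G)
order = np.argsort(eigvals)[::-1][:N - 1]   # the N-1 non-trivial eigenvectors
Psi = Psi[:, order]

# Map back to the D-dimensional space and normalize: phi_i = X psi_i / ||X psi_i||.
Phi = X @ Psi
Phi = Phi / np.linalg.norm(Phi, axis=0, keepdims=True)

# Check: each phi_i satisfies S phi_i = lambda_i phi_i with the same eigenvalues.
S_Phi = X @ (X.T @ Phi)               # XX^T Phi without forming the D x D matrix S
print(np.allclose(S_Phi, Phi * eigvals[order], atol=1e-6))
```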

Example Application of PCA: Eigenface
One of the most well-known applications of PCA.
[Figure: the mean face μ and the eigenfaces φ_1, φ_2, ..., φ_399; each face is approximated as μ plus a weighted sum h_1 φ_1 + h_2 φ_2 + ... + h_399 φ_399. Original faces are shown alongside faces reconstructed from 399 eigenfaces.]

Example Application of PCA: Eigenface
Faces reconstructed using different numbers of principal components (eigenfaces): original, 1 PC, 20 PCs, 50 PCs, 100 PCs, 200 PCs, 399 PCs.
See Lab 2 of EIE4105 at http://www.eie.polyu.edu.hk/~mwmak/myteaching.htm for an implementation.
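As a sketch of the reconstruction illustrated above (the variables x, mu, and Phi are hypothetical placeholders for a vectorized face, the mean face, and the eigenface matrix), a face is approximated from its first K eigenfaces as x̂ = µ + Φ_K Φ_K^T (x − µ):

```python
import numpy as np

def reconstruct(x, mu, Phi, K):
    """Reconstruct a vectorized face x from its first K principal components.

    mu  : mean face, shape (D,)
    Phi : D x M matrix of eigenfaces, columns sorted by decreasing eigenvalue
    """
    Phi_K = Phi[:, :K]
    h = Phi_K.T @ (x - mu)        # K-dimensional code of the face
    return mu + Phi_K @ h         # x_hat = mu + Phi_K h

# e.g. compare reconstructions with increasing numbers of eigenfaces:
# for K in (1, 20, 50, 100, 200, 399):
#     x_hat = reconstruct(x, mu, Phi, K)
```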

Fisher Discriminant Analysis
FDA is a classification method that separates data into two classes. But FDA can also be considered a supervised dimension-reduction method that reduces the dimension to 1.
[Figure: left, data projected onto the line joining the two class means; right, data projected onto the FDA subspace.]

Fisher Discriminant Analysis
The idea of FDA is to find a 1-D projection so that the projected data give a large separation between the means of the two classes while also giving a small variance within each class, thereby minimizing the class overlap.
Assume that the training data are projected to one dimension using y_n = w^T x_n, n = 1, ..., N. The Fisher criterion is

    J(w) = \frac{\text{Between-class scatter}}{\text{Within-class scatter}} = \frac{w^T S_B w}{w^T S_W w}

where

    S_B = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T \quad \text{and} \quad S_W = \sum_{k=1}^{2}\sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T

are the between-class and within-class scatter matrices, respectively, and μ_1 and μ_2 are the class means.

Fisher Discriminant Analysis
Note that only the direction of w matters; therefore, we can always find a scaling of w such that w^T S_W w = 1. The maximization of J(w) can then be rewritten as:

    \max_w \; w^T S_B w \quad \text{subject to} \quad w^T S_W w = 1

The Lagrangian function is

    L(w, \lambda) = \frac{1}{2} w^T S_B w - \frac{1}{2}\lambda(w^T S_W w - 1)

Setting \partial L/\partial w = 0, we obtain

    S_B w - \lambda S_W w = 0 \implies S_B w = \lambda S_W w \implies (S_W^{-1} S_B) w = \lambda w    (7)

So, w is the first eigenvector of S_W^{-1} S_B.
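A minimal 2-class FDA sketch under the derivation above (NumPy and toy Gaussian classes are my assumptions). Because S_B = (µ_2 − µ_1)(µ_2 − µ_1)^T has rank 1, the leading eigenvector of S_W^{-1} S_B is proportional to S_W^{-1}(µ_2 − µ_1), and the sketch checks that both routes give the same direction:

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.normal(loc=[0, 0], scale=[2.0, 0.5], size=(100, 2))   # class 1
X2 = rng.normal(loc=[3, 1], scale=[2.0, 0.5], size=(100, 2))   # class 2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)    # within-class scatter
S_B = np.outer(mu2 - mu1, mu2 - mu1)                           # between-class scatter

# w = first eigenvector of S_W^{-1} S_B ...
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
w_eig = np.real(eigvecs[:, np.argmax(np.real(eigvals))])

# ... which, for 2 classes, is proportional to S_W^{-1}(mu2 - mu1).
w_dir = np.linalg.solve(S_W, mu2 - mu1)

print(np.allclose(w_eig / np.linalg.norm(w_eig),
                  np.sign(w_eig @ w_dir) * w_dir / np.linalg.norm(w_dir),
                  atol=1e-6))
```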

LDA on Multi-class Problems
For multiple classes (K > 2 and D > K), we can use LDA to project D-dimensional vectors to M-dimensional vectors, where 1 < M < K.
The vector w is extended to a matrix W = [w_1 ... w_M] and the projected scalar y_n is extended to a vector y_n:

    y_n = W^T(x_n - \mu), \quad \text{i.e.,} \quad y_{nj} = w_j^T(x_n - \mu), \quad j = 1, \ldots, M

where μ is the global mean of the training vectors. The between-class and within-class scatter matrices become

    S_B = \sum_{k=1}^K N_k (\mu_k - \mu)(\mu_k - \mu)^T
    S_W = \sum_{k=1}^K \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T

where N_k = |C_k| is the number of samples in class k.

LDA on Multi-class Problems
The LDA criterion function is

    J(W) = \frac{\text{Between-class scatter}}{\text{Within-class scatter}} = \mathrm{Tr}\left\{ (W^T S_W W)^{-1} (W^T S_B W) \right\}

Constrained optimization:

    \max_W \; \mathrm{Tr}\{W^T S_B W\} \quad \text{subject to} \quad W^T S_W W = I

where I is an M × M identity matrix.
Note that, unlike PCA in Eq. (5), because of the matrix S_W in the constraint, we need to find one w_j at a time. Note also that the constraint W^T S_W W = I suggests that the w_j may not be orthogonal to each other.

LDA on Multi-class Problems
To find w_j, we write the Lagrangian function as

    L(w_j, \lambda_j) = w_j^T S_B w_j - \lambda_j(w_j^T S_W w_j - 1)

Using Eq. (7), the optimal solution of w_j satisfies

    (S_W^{-1} S_B) w_j = \lambda_j w_j

Therefore, W comprises the first M eigenvectors of S_W^{-1} S_B. A more formal proof can be found in [1].
As the maximum rank of S_B is K − 1, S_W^{-1} S_B has at most K − 1 non-zero eigenvalues. As a result, M can be at most K − 1.
After the projection, the vectors y_n can be used to train a classifier (e.g., an SVM) for classification.
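A multi-class LDA sketch following the slides (NumPy, a hypothetical N × D data matrix X, and integer labels y are my assumptions): W is formed from the M ≤ K − 1 leading eigenvectors of S_W^{-1} S_B, and the projected vectors y_n = W^T(x_n − µ) can then be fed to any classifier.

```python
import numpy as np

def lda_projection(X, y, M):
    """Return the global mean and the D x M LDA projection matrix W.

    X : N x D data matrix; y : length-N integer class labels; M <= K - 1.
    """
    classes = np.unique(y)
    D = X.shape[1]
    mu = X.mean(axis=0)                                   # global mean
    S_B = np.zeros((D, D))
    S_W = np.zeros((D, D))
    for k in classes:
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        S_B += len(Xk) * np.outer(mu_k - mu, mu_k - mu)   # N_k (mu_k - mu)(mu_k - mu)^T
        S_W += (Xk - mu_k).T @ (Xk - mu_k)                # within-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(np.real(eigvals))[::-1][:M]        # M largest eigenvalues
    W = np.real(eigvecs[:, order])                        # D x M
    return mu, W

# Usage: project 4-class, 10-D data down to M = 3 dimensions.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=k, size=(50, 10)) for k in range(4)])
y = np.repeat(np.arange(4), 50)
mu, W = lda_projection(X, y, M=3)
Y = (X - mu) @ W                                          # y_n = W^T (x_n - mu)
```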

References
[1] Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition (2nd ed.). San Diego, CA, USA: Academic Press.