
PCA and LDA
Man-Wai MAK
Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University
enmwmak@polyu.edu.hk, http://www.eie.polyu.edu.hk/~mwmak
October 26, 2018

References:
S.J.D. Prince, Computer Vision: Models, Learning, and Inference, Cambridge University Press, 2012.
C. Bishop, Pattern Recognition and Machine Learning, Appendix E, Springer, 2006.

Overview

1. Dimension Reduction: Why Dimension Reduction; Dimension Reduction: Reduce to 1-Dim
2. Principal Component Analysis: Derivation of PCA; PCA on High-Dimensional Data; Eigenface
3. Linear Discriminant Analysis: LDA on 2-Class Problems; LDA on Multi-class Problems

Why Dimension Reduction

Many applications produce high-dimensional vectors:
In face recognition, if an image has size 360 × 260 pixels, the dimension is 93,600.
In handwritten-digit recognition, if a digit occupies 28 × 28 pixels, the dimension is 784.
In speaker recognition, the dimension can be as high as 61,440 per utterance.

High-dimensional feature vectors can easily cause the curse-of-dimensionality problem.
Redundancy: some elements of the feature vectors are strongly correlated, so knowing one element also tells us much about some of the others.
Irrelevancy: some elements of the feature vectors are irrelevant to the classification task.

Dimension Reduction

Given a feature vector x ∈ R^D, dimensionality reduction aims to find a low-dimensional representation h ∈ R^M that can approximately explain x:

\[ x \approx f(h, \theta) \quad (1) \]

where f(·, ·) is a function that takes the hidden variable h and a set of parameters θ, and M ≪ D. Typically, we choose the function family f(·, ·) and then learn h and θ from the training data.

Least squares criterion: given N training vectors X = {x_1, ..., x_N}, x_i ∈ R^D, we find the parameters θ and the latent variables h_i that minimize the sum of squared errors:

\[ \hat{\theta}, \{\hat{h}_i\}_{i=1}^N = \operatorname*{argmin}_{\theta, \{h_i\}_{i=1}^N} \sum_{i=1}^N [x_i - f(h_i, \theta)]^T [x_i - f(h_i, \theta)] \quad (2) \]

Dimension Reduction: Reduce to 1-Dim

Approximate vector x_i by a scalar value h_i plus the global mean μ:

\[ x_i \approx \phi h_i + \mu, \quad \text{where} \quad \mu = \frac{1}{N}\sum_{i=1}^N x_i, \quad \phi \in \mathbb{R}^{D \times 1} \]

Assuming μ = 0, or that the vectors have been mean-subtracted, i.e., x_i ← x_i − μ for all i, we have x_i ≈ φ h_i.

The least squares criterion becomes:

\[ \hat{\phi}, \{\hat{h}_i\}_{i=1}^N = \operatorname*{argmin}_{\phi, \{h_i\}_{i=1}^N} E(\phi, \{h_i\}) = \operatorname*{argmin}_{\phi, \{h_i\}_{i=1}^N} \sum_{i=1}^N [x_i - \phi h_i]^T [x_i - \phi h_i] \quad (3) \]

Dimension Reduction: Reduce to 1-Dim

Eq. 3 does not have a unique solution: if we multiply φ by any constant α and divide the h_i's by the same constant, the cost is unchanged, i.e., (αφ)(h_i/α) = φ h_i. We make the solution unique by constraining ||φ||^2 = 1 using a Lagrange multiplier:

\[
\begin{aligned}
L(\phi, \{h_i\}) &= E(\phi, \{h_i\}) + \lambda(\phi^T \phi - 1) \\
&= \sum_{i=1}^N (x_i - \phi h_i)^T (x_i - \phi h_i) + \lambda(\phi^T \phi - 1) \\
&= \sum_{i=1}^N \left( x_i^T x_i - 2 h_i \phi^T x_i + h_i^2 \right) + \lambda(\phi^T \phi - 1)
\end{aligned}
\]

Dimension Reduction: Reduce to 1-Dim

Setting ∂L/∂φ = 0 and ∂L/∂h_i = 0, we obtain:

\[ \sum_i x_i \hat{h}_i = \lambda \hat{\phi} \quad \text{and} \quad \hat{h}_i = \hat{\phi}^T x_i = x_i^T \hat{\phi} \]

Hence,

\[ \sum_i x_i \left( x_i^T \hat{\phi} \right) = \left( \sum_i x_i x_i^T \right) \hat{\phi} = \lambda \hat{\phi} \;\Longrightarrow\; S \hat{\phi} = \lambda \hat{\phi} \]

where S is the covariance matrix of the training data (note that the x_i's have been mean-subtracted). Therefore, φ̂ is the first eigenvector of S.
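As an illustration, here is a minimal numpy sketch of this 1-D reduction (the toy data, the row-major layout, and names such as `phi_hat` are our assumptions, not from the slides):

```python
import numpy as np

# Toy data: N samples of dimension D, one sample per row
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])   # (N, D)

# Mean-subtract the data, as the derivation assumes
mu = X.mean(axis=0)
Xc = X - mu

# Scatter matrix S = sum_i x_i x_i^T  (D x D)
S = Xc.T @ Xc

# First eigenvector of S = direction of maximum variance
eigvals, eigvecs = np.linalg.eigh(S)        # eigenvalues in ascending order
phi_hat = eigvecs[:, -1]                    # eigenvector with the largest eigenvalue

# 1-D codes h_i = phi^T x_i and reconstruction x_i ≈ phi h_i + mu
h = Xc @ phi_hat                            # shape (N,)
X_recon = np.outer(h, phi_hat) + mu
```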

Dimension Reduction: Reduce to 1-Dim

[Figure 13.19 from Prince (Chapter 13, Image preprocessing and feature extraction): reduction to a single dimension. a) Original data and the direction of maximum variance. b) The data are projected onto φ to produce a one-dimensional representation. c) To reconstruct the data, we re-multiply by φ; most of the original variation is retained.]

PCA extends this model to project high-dimensional data onto the K orthogonal dimensions with the most variance, producing a K-dimensional representation.

Dimension Reduction: 3D to 2D

The goal is to find a small number of axes in which the data have the highest variability. These axes may not be parallel to the original axes.

[Figure: example of a projection from a 3-D space with axes x_1, x_2, x_3 onto a 2-D subspace.]

Principal Component Analysis

In PCA, the hidden variables {h_i} are multi-dimensional and φ becomes a rectangular matrix Φ = [φ_1 φ_2 ... φ_M], where M ≪ D. Each component of h_i weights one column of Φ so that the data are approximated as

\[ x_i \approx \Phi h_i, \quad i = 1, \ldots, N \]

The cost function is

\[ \hat{\Phi}, \{\hat{h}_i\}_{i=1}^N = \operatorname*{argmin}_{\Phi, \{h_i\}_{i=1}^N} E\!\left(\Phi, \{h_i\}_{i=1}^N\right) = \operatorname*{argmin}_{\Phi, \{h_i\}_{i=1}^N} \sum_{i=1}^N [x_i - \Phi h_i]^T [x_i - \Phi h_i] \quad (4) \]

(Note that we have defined θ ≡ Φ in Eq. 2.)

Principal Component Analysis

To solve the non-uniqueness problem in Eq. 4, we enforce φ_d^T φ_d = 1, d = 1, ..., M, using a set of Lagrange multipliers {λ_d}_{d=1}^M:

\[
\begin{aligned}
L(\Phi, \{h_i\}) &= \sum_{i=1}^N (x_i - \Phi h_i)^T (x_i - \Phi h_i) + \sum_{d=1}^M \lambda_d (\phi_d^T \phi_d - 1) \\
&= \sum_{i=1}^N (x_i - \Phi h_i)^T (x_i - \Phi h_i) + \mathrm{tr}\{\Phi \Lambda_M \Phi^T - \Lambda\} \\
&= \sum_{i=1}^N \left( x_i^T x_i - 2 h_i^T \Phi^T x_i + h_i^T h_i \right) + \mathrm{tr}\{\Phi \Lambda_M \Phi^T - \Lambda\} \quad (5)
\end{aligned}
\]

where h_i ∈ R^M, Λ = diag{λ_1, ..., λ_M, 0, ..., 0} ∈ R^{D×D}, Λ_M = diag{λ_1, ..., λ_M} ∈ R^{M×M}, and Φ = [φ_1 φ_2 ... φ_M] ∈ R^{D×M}.

Principal Component Analysis

Setting ∂L/∂Φ = 0 and ∂L/∂h_i = 0, we obtain:

\[ \sum_i x_i \hat{h}_i^T = \hat{\Phi} \Lambda_M \quad \text{and} \quad \hat{h}_i = \hat{\Phi}^T x_i \;\Longrightarrow\; \hat{h}_i^T = x_i^T \hat{\Phi} \]

where we have used

\[ \frac{\partial}{\partial X}\,\mathrm{tr}\{X B X^T\} = X B^T + X B \quad \text{and} \quad \frac{\partial\, a^T X^T b}{\partial X} = b a^T. \]

Therefore,

\[ \sum_i x_i x_i^T \hat{\Phi} = \hat{\Phi} \Lambda_M \;\Longrightarrow\; S \hat{\Phi} = \hat{\Phi} \Lambda_M \quad (6) \]

So Φ̂ comprises the M eigenvectors of S with the largest eigenvalues.
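To make the result concrete, here is a minimal numpy sketch of PCA via this eigen-decomposition (the function name `pca` and the row-major data layout are our assumptions):

```python
import numpy as np

def pca(X, M):
    """PCA via eigen-decomposition of the scatter matrix S = sum_i x_i x_i^T.

    X: (N, D) data matrix, one sample per row.
    Returns (Phi, h, mu): projection matrix (D, M), codes (N, M), and mean (D,).
    """
    mu = X.mean(axis=0)
    Xc = X - mu                            # mean-subtract, as assumed in the derivation
    S = Xc.T @ Xc                          # (D, D) scatter matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    Phi = eigvecs[:, ::-1][:, :M]          # M eigenvectors with the largest eigenvalues
    h = Xc @ Phi                           # h_i = Phi^T (x_i - mu), one code per row
    return Phi, h, mu

# Reconstruction: x_i ≈ Phi h_i + mu, i.e., X_recon = h @ Phi.T + mu
```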

Interpretation of Λ_M

Denote X as the D × N centered data matrix whose n-th column is (x_n − (1/N) Σ_{i=1}^N x_i). The projected data matrix is

\[ Y = \hat{\Phi}^T X \]

The covariance matrix of the transformed data is

\[ Y Y^T = \left(\hat{\Phi}^T X\right)\left(\hat{\Phi}^T X\right)^T = \hat{\Phi}^T X X^T \hat{\Phi} = \hat{\Phi}^T \hat{\Phi} \Lambda_M = \Lambda_M \]

(using the eigen-equation in Eq. 6). Therefore, the eigenvalues represent the variances of the projected vectors.
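A quick numerical check of this identity on toy data (a sketch with our own variable names, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # (N, D) toy data
X = (data - data.mean(axis=0)).T                             # D x N centered matrix

S = X @ X.T
eigvals, eigvecs = np.linalg.eigh(S)
Phi = eigvecs[:, ::-1][:, :3]                                # top M = 3 eigenvectors
lam = eigvals[::-1][:3]

Y = Phi.T @ X                                                # M x N projected data
print(np.allclose(Y @ Y.T, np.diag(lam)))                    # True: Y Y^T = Lambda_M
```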

PCA on High-Dimensional Data

When the dimension D of x_i is very high, computing S and its eigenvectors directly is impractical. However, the rank of S is limited by the number of training examples: if there are N training examples, there will be at most N − 1 eigenvectors with non-zero eigenvalues. If N ≪ D, the principal components can be computed more easily.

Let X be a data matrix comprising the mean-subtracted x_i's in its columns. Then S = X X^T, and the eigen-decomposition of S is given by

\[ S \phi_i = X X^T \phi_i = \lambda_i \phi_i \]

Instead of performing the eigen-decomposition of the D × D matrix X X^T, we perform the eigen-decomposition of the N × N matrix X^T X:

\[ X^T X \psi_i = \lambda_i \psi_i \quad (7) \]

Principal Component Analysis

Pre-multiplying both sides of Eq. 7 by X, we obtain

\[ X X^T (X \psi_i) = \lambda_i (X \psi_i) \]

This means that if ψ_i is an eigenvector of X^T X, then φ_i = X ψ_i is an eigenvector of S = X X^T. So all we need is to compute the N − 1 eigenvectors of X^T X, which has size N × N.

Note that φ_i computed in this way is not normalized, so we normalize it by

\[ \phi_i = \frac{X \psi_i}{\| X \psi_i \|}, \quad i = 1, \ldots, N - 1 \]
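A hedged numpy sketch of this N ≪ D trick (the dimensions, names, and the final consistency check are our own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, M = 50, 10000, 10                      # few samples, very high dimension
X = rng.normal(size=(D, N))                  # columns are samples
X = X - X.mean(axis=1, keepdims=True)        # mean-subtract each dimension

# Eigen-decompose the small N x N matrix X^T X instead of the D x D matrix X X^T
G = X.T @ X                                  # (N, N)
eigvals, Psi = np.linalg.eigh(G)
idx = np.argsort(eigvals)[::-1][:M]          # M largest eigenvalues
Psi, lam = Psi[:, idx], eigvals[idx]

# Map back to D-dimensional eigenvectors and normalize: phi_i = X psi_i / ||X psi_i||
Phi = X @ Psi                                # (D, M)
Phi /= np.linalg.norm(Phi, axis=0, keepdims=True)

# Check: the columns of Phi are eigenvectors of S = X X^T with the same eigenvalues
print(np.allclose(X @ (X.T @ Phi), Phi * lam))
```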

Example Application of PCA: Eigenface

Eigenface is one of the most well-known applications of PCA.

[Figure: each face is approximated as the mean face μ plus a weighted sum of eigenfaces, h_1 φ_1 + h_2 φ_2 + ... + h_399 φ_399; original faces are shown alongside faces reconstructed using 399 eigenfaces.]

Example Application of PCA: Eigenface

[Figure: faces reconstructed using different numbers of principal components (eigenfaces): original, 1 PC, 20 PCs, 50 PCs, 100 PCs, 200 PCs, 399 PCs.]

See Lab 2 of EIE4105 at http://www.eie.polyu.edu.hk/~mwmak/myteaching.htm for an implementation.
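A small sketch of this kind of experiment on synthetic data (reconstruction error versus number of principal components; the face images and the lab code itself are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 100)) @ rng.normal(size=(100, 100))   # (N, D) toy "images"
mu = X.mean(axis=0)
Xc = X - mu

# Principal directions = eigenvectors of the scatter matrix, by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
eigvecs = eigvecs[:, ::-1]

for M in (1, 20, 50, 100):
    Phi = eigvecs[:, :M]
    X_recon = (Xc @ Phi) @ Phi.T + mu        # project to M dimensions, then reconstruct
    err = np.mean((X - X_recon) ** 2)
    print(f"{M:3d} PCs: mean squared reconstruction error = {err:.4f}")
```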

Limitations of PCA

PCA will fail if the subspace is non-linear.

[Figure: data lying on a linear subspace (PCA is fine) vs. data lying on a nonlinear subspace (PCA fails); PCA can only find the linear subspace.]

Solution: use a non-linear embedding such as ISOMAP or a DNN.

Fisher Discriminant Analysis

FDA is a classification method that separates data into two classes. But FDA can also be considered a supervised dimension reduction method that reduces the dimension to 1.

[Figure: projecting the data onto the line joining the two class means vs. projecting the data onto the FDA subspace.]

Fisher Discriminant Analysis

The idea of FDA is to find a 1-D line such that the projected data give a large separation between the means of the two classes while also having a small variance within each class, thereby minimizing the class overlap.

Assume that the training data are projected onto a 1-D space via

\[ y_n = w^T x_n, \quad n = 1, \ldots, N. \]

The Fisher criterion is

\[ J(w) = \frac{\text{Between-class scatter}}{\text{Within-class scatter}} = \frac{w^T S_B w}{w^T S_W w} \]

where

\[ S_B = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T \quad \text{and} \quad S_W = \sum_{k=1}^{2} \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T \]

are the between-class and within-class scatter matrices, respectively, and μ_1 and μ_2 are the class means.

Fisher Discriminant Analysis

Note that only the direction of w matters. Therefore, we can always scale w so that w^T S_W w = 1. The maximization of J(w) can then be rewritten as:

\[ \max_w \; w^T S_B w \quad \text{subject to} \quad w^T S_W w = 1 \]

The Lagrangian function is

\[ L(w, \lambda) = \tfrac{1}{2} w^T S_B w - \lambda (w^T S_W w - 1) \]

Setting ∂L/∂w = 0 (and absorbing any constant factor into λ, which does not change the eigenvectors), we obtain

\[ S_B w - \lambda S_W w = 0 \;\Longrightarrow\; S_B w = \lambda S_W w \;\Longrightarrow\; (S_W^{-1} S_B) w = \lambda w \quad (8) \]

So w is the first eigenvector of S_W^{-1} S_B.
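A minimal sketch of 2-class FDA along these lines (assuming numpy, that S_W is invertible, and our own function name `fda_direction`):

```python
import numpy as np

def fda_direction(X1, X2):
    """Fisher discriminant direction for two classes.

    X1, X2: (N1, D) and (N2, D) arrays of training vectors, one sample per row.
    Returns the unit-norm first eigenvector of S_W^{-1} S_B.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum over both classes of (x_n - mu_k)(x_n - mu_k)^T
    Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    # Between-class scatter: (mu2 - mu1)(mu2 - mu1)^T
    d = (mu2 - mu1).reshape(-1, 1)
    Sb = d @ d.T
    # First eigenvector of S_W^{-1} S_B
    # (for two classes this is proportional to S_W^{-1} (mu2 - mu1))
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return w / np.linalg.norm(w)

# 1-D projections y_n = w^T x_n, e.g.:  w = fda_direction(X1, X2); y1, y2 = X1 @ w, X2 @ w
```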

LDA on Multi-class Problems

For multiple classes (K > 2 and D > K), we can use LDA to project D-dimensional vectors to M-dimensional vectors, where 1 < M < K. The vector w is extended to a matrix W = [w_1 ... w_M], and the projected scalar y_n is extended to a vector y_n:

\[ y_n = W^T (x_n - \mu), \quad \text{where} \quad y_{nj} = w_j^T (x_n - \mu), \; j = 1, \ldots, M \]

where μ is the global mean of the training vectors. The between-class and within-class scatter matrices become

\[ S_B = \sum_{k=1}^{K} N_k (\mu_k - \mu)(\mu_k - \mu)^T \quad \text{and} \quad S_W = \sum_{k=1}^{K} \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T \]

where N_k is the number of samples in class k, i.e., N_k = |C_k|.

LDA on Multi-class Problems

The LDA criterion function is

\[ J(W) = \frac{\text{Between-class scatter}}{\text{Within-class scatter}} = \mathrm{Tr}\left\{ \left( W^T S_B W \right) \left( W^T S_W W \right)^{-1} \right\} \]

Constrained optimization:

\[ \max_W \; \mathrm{Tr}\{W^T S_B W\} \quad \text{subject to} \quad W^T S_W W = I \]

where I is an M × M identity matrix.

Note that unlike PCA in Eq. 5, because of the matrix S_W in the constraint, we need to find one w_j at a time. Note also that the constraint W^T S_W W = I implies that the w_j's need not be orthogonal to each other.

LDA on Multi-class Problems

To find w_j, we write the Lagrangian function as

\[ L(w_j, \lambda_j) = w_j^T S_B w_j - \lambda_j (w_j^T S_W w_j - 1) \]

Using Eq. 8, the optimal w_j satisfies

\[ (S_W^{-1} S_B) w_j = \lambda_j w_j \]

Therefore, W comprises the first M eigenvectors of S_W^{-1} S_B. A more formal proof can be found in [1].

As the maximum rank of S_B is K − 1, S_W^{-1} S_B has at most K − 1 non-zero eigenvalues. As a result, M can be at most K − 1.

After the projection, the vectors y_n can be used to train a classifier (e.g., an SVM) for classification.
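A hedged numpy sketch of multi-class LDA as described above (the function name `lda`, the label encoding, and the invertibility of S_W are our assumptions):

```python
import numpy as np

def lda(X, labels, M):
    """Multi-class LDA: project (N, D) data to (N, M) using the top M
    eigenvectors of S_W^{-1} S_B, with M <= K - 1.

    X: (N, D) data, one sample per row; labels: length-N array of class labels.
    Returns (W, Y): projection matrix (D, M) and projected data (N, M).
    """
    labels = np.asarray(labels)
    classes = np.unique(labels)
    mu = X.mean(axis=0)                              # global mean
    D = X.shape[1]
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    for k in classes:
        Xk = X[labels == k]
        mu_k = Xk.mean(axis=0)
        dk = (mu_k - mu).reshape(-1, 1)
        Sb += Xk.shape[0] * (dk @ dk.T)              # N_k (mu_k - mu)(mu_k - mu)^T
        Sw += (Xk - mu_k).T @ (Xk - mu_k)            # within-class scatter
    # Eigenvectors of S_W^{-1} S_B with the largest eigenvalues
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(np.real(eigvals))[::-1][:M]
    W = np.real(eigvecs[:, order])                   # (D, M); columns need not be orthogonal
    Y = (X - mu) @ W                                 # y_n = W^T (x_n - mu)
    return W, Y
```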

PCA vs. LDA

LDA projection, HCI example: project 784-dimensional images onto a 3-dimensional LDA subspace formed by the 3 eigenvectors with the largest eigenvalues, i.e., W is a 784 × 3 matrix (W ∈ R^{784×3}).

[Figure: 784-dimensional vectors derived from 28 × 28 handwritten digits projected to 3-D space, comparing the LDA and PCA projections.]
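For a comparable visualization, a short sketch using scikit-learn (our choice of library; it uses scikit-learn's built-in 8 × 8 digits rather than the 28 × 28 images on the slide):

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (needed on older matplotlib)
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)                    # 64-dim digit images, 10 classes

Y_lda = LinearDiscriminantAnalysis(n_components=3).fit_transform(X, y)
Y_pca = PCA(n_components=3).fit_transform(X)

fig = plt.figure(figsize=(10, 5))
for i, (Y, title) in enumerate([(Y_lda, "LDA"), (Y_pca, "PCA")], start=1):
    ax = fig.add_subplot(1, 2, i, projection="3d")
    ax.scatter(Y[:, 0], Y[:, 1], Y[:, 2], c=y, s=5)    # color points by digit class
    ax.set_title(title)
plt.show()
```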

References

[1] Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition. San Diego, CA, USA: Academic Press.