Fisher Linear Discriminant Analysis

Cheng Li, Bingyu Wang

August 31, 2014

1 What's LDA

Fisher Linear Discriminant Analysis (also called Linear Discriminant Analysis, or LDA) is a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification.

LDA is closely related to PCA, since both are linear (i.e. matrix-multiplication) transformations. In PCA, the transformation is based on minimizing the mean squared error between the original data vectors and the data vectors reconstructed from the reduced-dimensionality representation, and PCA does not take class membership into account. In LDA, the transformation is based on maximizing the ratio of between-class variance to within-class variance, with the goal of reducing the variation within each class while increasing the separation between classes. Figure 1 shows an example.

Figure 1: LDA example. The left plot shows samples from two classes (depicted in red and blue) along with the histograms resulting from projection onto the line joining the class means; note the considerable class overlap in the projected space. The right plot shows the corresponding projection based on the Fisher linear discriminant, with greatly improved class separation.

So our job is to obtain a scalar y by projecting the samples x onto a line:

    y = θ^T x
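As a small aside (not part of the original notes), the following Python sketch illustrates this projection on synthetic data: two Gaussian classes are projected onto the line joining their class means, as in the left panel of Figure 1. All data and variable names here are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two synthetic 2-D classes (hypothetical data, for illustration only).
    X1 = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.8], [0.8, 1.0]], size=100)
    X2 = rng.multivariate_normal(mean=[2, 2], cov=[[1.0, 0.8], [0.8, 1.0]], size=100)

    # A candidate direction theta: here, the line joining the two class means.
    theta = X2.mean(axis=0) - X1.mean(axis=0)
    theta /= np.linalg.norm(theta)

    # Project every sample onto the line: y = theta^T x (one scalar per sample).
    y1 = X1 @ theta
    y2 = X2 @ theta
    print("projected class means:", y1.mean(), y2.mean())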

We then try to find the θ that maximizes the ratio of between-class variance to within-class variance. Next, we show how to state this problem mathematically.

2 Theory and Model

To work out LDA, we first need to translate "between-class variance" and "within-class variance" into mathematical language, and then maximize the ratio between the two. To simplify the problem, we start with the two-class case.

2.1 Two-Class Problem

2.1.1 Set Up the Problem

Assume we have a set of D-dimensional samples X = {x^(1), x^(2), ..., x^(m)}, N_1 of which belong to class C_1 and N_2 of which belong to class C_2. The mean vector of each class in x-space is

    u_k = (1/N_k) Σ_{i ∈ C_k} x^(i),    k = 1, 2,

and in y-space

    û_k = (1/N_k) Σ_{i ∈ C_k} y^(i) = (1/N_k) Σ_{i ∈ C_k} θ^T x^(i) = θ^T u_k,    k = 1, 2.

One way to define a measure of separation between the two classes is the distance between the projected means, i.e. in y-space, so the between-class separation is

    û_2 − û_1 = θ^T (u_2 − u_1).

We also define the within-class variance for each class C_k as

    ŝ_k^2 = Σ_{i ∈ C_k} (y^(i) − û_k)^2,    k = 1, 2.

With the between-class and within-class variances in hand, we can define our objective function J(θ) as

    J(θ) = (û_2 − û_1)^2 / (ŝ_1^2 + ŝ_2^2).

In maximizing the objective function J, we are looking for a projection where examples from the same class are projected very close to each other while, at the same time, the projected means are as far apart as possible.
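Continuing the synthetic two-class example from the sketch above (X1, X2 and theta defined there), here is a minimal way to evaluate this objective numerically:

    # Projected class means û_1, û_2 and within-class scatters ŝ_1^2, ŝ_2^2.
    y1, y2 = X1 @ theta, X2 @ theta
    u1_hat, u2_hat = y1.mean(), y2.mean()
    s1_sq = np.sum((y1 - u1_hat) ** 2)
    s2_sq = np.sum((y2 - u2_hat) ** 2)

    # Fisher objective J(θ) = (û_2 − û_1)^2 / (ŝ_1^2 + ŝ_2^2).
    J = (u2_hat - u1_hat) ** 2 / (s1_sq + s2_sq)
    print("J(theta) =", J)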

2.1.2 Transform the Problem

To find the optimum θ, we must express J(θ) as an explicit function of θ. Before doing so, we need to introduce scatter in place of variance. We define the following measures of scatter.

Scatter in feature space x, for each class:

    S_k = Σ_{i ∈ C_k} (x^(i) − u_k)(x^(i) − u_k)^T

Within-class scatter matrix:

    S_W = S_1 + S_2

Between-class scatter matrix:

    S_B = (u_2 − u_1)(u_2 − u_1)^T

Let us look at J(θ) again:

    J(θ) = (û_2 − û_1)^2 / (ŝ_1^2 + ŝ_2^2).

The scatter of the projection y can then be expressed as a function of the scatter matrices in feature space x:

    ŝ_k^2 = Σ_{i ∈ C_k} (y^(i) − û_k)^2
          = Σ_{i ∈ C_k} (θ^T x^(i) − θ^T u_k)^2
          = Σ_{i ∈ C_k} θ^T (x^(i) − u_k)(x^(i) − u_k)^T θ
          = θ^T S_k θ,

so we get

    ŝ_1^2 + ŝ_2^2 = θ^T S_1 θ + θ^T S_2 θ = θ^T S_W θ.

Similarly, the difference between the projected means can be expressed in terms of the means in the original feature space:

    (û_2 − û_1)^2 = (θ^T u_2 − θ^T u_1)^2 = θ^T (u_2 − u_1)(u_2 − u_1)^T θ = θ^T S_B θ.

We can finally express the Fisher criterion in terms of S_W and S_B as

    J(θ) = (θ^T S_B θ) / (θ^T S_W θ).
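As a quick sanity check (again reusing the synthetic X1, X2, theta and the projected quantities from the previous sketches), the matrix form of the criterion gives the same number as the definition in terms of projected means and variances:

    # Class means in x-space and per-class scatter matrices S_1, S_2.
    u1, u2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - u1).T @ (X1 - u1)
    S2 = (X2 - u2).T @ (X2 - u2)

    S_W = S1 + S2                         # within-class scatter matrix
    S_B = np.outer(u2 - u1, u2 - u1)      # between-class scatter matrix

    # Matches J computed from (û_2 − û_1)^2 / (ŝ_1^2 + ŝ_2^2) above.
    J_matrix = (theta @ S_B @ theta) / (theta @ S_W @ theta)
    print("J(theta) via scatter matrices =", J_matrix)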

Next, we maximize this objective function.

2.1.3 Solve the Problem

The easiest way to maximize the objective function J is to differentiate it with respect to θ and set the derivative to zero. Using the quotient rule (and dropping the positive denominator (θ^T S_W θ)^2, since only the numerator needs to vanish):

    ∂J(θ)/∂θ = ∂/∂θ [ (θ^T S_B θ) / (θ^T S_W θ) ] = 0
    ⟹ (θ^T S_W θ) ∂(θ^T S_B θ)/∂θ − (θ^T S_B θ) ∂(θ^T S_W θ)/∂θ = 0
    ⟹ (θ^T S_W θ) 2 S_B θ − (θ^T S_B θ) 2 S_W θ = 0.

Dividing by θ^T S_W θ:

    ((θ^T S_W θ)/(θ^T S_W θ)) S_B θ − ((θ^T S_B θ)/(θ^T S_W θ)) S_W θ = 0
    ⟹ S_B θ − J S_W θ = 0
    ⟹ S_W^{-1} S_B θ − J θ = 0
    ⟹ J θ = S_W^{-1} S_B θ
    ⟹ J θ = S_W^{-1} (u_2 − u_1)(u_2 − u_1)^T θ
    ⟹ J θ = S_W^{-1} (u_2 − u_1) ((u_2 − u_1)^T θ),    where (u_2 − u_1)^T θ = c is a scalar
    ⟹ J θ = c S_W^{-1} (u_2 − u_1)
    ⟹ θ = (c/J) S_W^{-1} (u_2 − u_1).

Since we only care about the direction of θ, the optimum θ* is

    θ* ∝ S_W^{-1} (u_2 − u_1).

This is known as Fisher's linear discriminant (1936), although it is not a discriminant but rather a specific choice of direction for the projection of the data down to one dimension, y = θ^T x.

2.2 Multi-Class Problem

From the two-class problem we can see that Fisher's LDA generalizes gracefully to multiple classes. Assume we still have a set of D-dimensional samples X = {x^(1), x^(2), ..., x^(m)}, now belonging to C classes in total. Instead of a single projection y as above, we now seek (C − 1) projections [y_1, y_2, ..., y_{C−1}] by means of (C − 1) projection vectors θ_i, arranged by columns into a projection matrix Θ = [θ_1 θ_2 ... θ_{C−1}], where

    y_i = θ_i^T x    ⟹    y = Θ^T x.

2.2.1 Derivation

First we define the scatter matrices in x-space as follows.

Within-class scatter matrix:

    S_W = Σ_{i=1}^{C} S_i,    where S_i = Σ_{x ∈ C_i} (x − u_i)(x − u_i)^T and u_i = (1/N_i) Σ_{x ∈ C_i} x.

Between-class scatter matrix:

    S_B = Σ_{i=1}^{C} N_i (u_i − u)(u_i − u)^T,    where u = (1/m) Σ_{all x} x = (1/m) Σ_{i=1}^{C} N_i u_i.

Total scatter matrix:

    S_T = S_B + S_W.
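These multi-class scatter matrices are straightforward to compute from their definitions. The helper below is an illustrative sketch (the function name and variables are my own, not from the notes); the assertion checks the identity S_T = S_B + S_W against the total scatter computed directly around the overall mean.

    import numpy as np

    def scatter_matrices(X, labels):
        """Return (S_W, S_B, S_T) for data X of shape (m, D) with integer class labels."""
        u = X.mean(axis=0)                              # overall mean u
        S_W = np.zeros((X.shape[1], X.shape[1]))
        S_B = np.zeros_like(S_W)
        for c in np.unique(labels):
            Xc = X[labels == c]
            uc = Xc.mean(axis=0)                        # class mean u_i
            S_W += (Xc - uc).T @ (Xc - uc)              # within-class scatter S_i
            S_B += len(Xc) * np.outer(uc - u, uc - u)   # N_i (u_i − u)(u_i − u)^T
        S_T = S_B + S_W                                 # total scatter
        # Sanity check: S_T equals the scatter of all samples about the overall mean.
        assert np.allclose(S_T, (X - u).T @ (X - u))
        return S_W, S_B, S_T

For the two-class data above, scatter_matrices(np.vstack([X1, X2]), np.repeat([0, 1], 100)) reproduces the earlier S_W; its between-class scatter differs from the earlier two-class S_B only by a constant factor, which does not affect the maximizing direction.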

Before moving on, let us look at a picture of the multi-class case in Figure 2.

Figure 2: LDA multi-class example.

Similarly, we define the mean vectors and scatter matrices for the projected samples as:

    û_i = (1/N_i) Σ_{y ∈ C_i} y        û = (1/m) Σ_{all y} y

    Ŝ_W = Σ_{i=1}^{C} Σ_{y ∈ C_i} (y − û_i)(y − û_i)^T        Ŝ_B = Σ_{i=1}^{C} N_i (û_i − û)(û_i − û)^T

From our derivation for the two-class problem, we get:

    Ŝ_W = Θ^T S_W Θ    (1)
    Ŝ_B = Θ^T S_B Θ    (2)

Recall that we are looking for a projection that maximizes the ratio of between-class to within-class scatter. Since the projection is no longer a scalar (it has C − 1 dimensions), we use the determinants of the scatter matrices to obtain a scalar objective function:

    J(Θ) = |Ŝ_B| / |Ŝ_W| = |Θ^T S_B Θ| / |Θ^T S_W Θ|.

Our job now is to seek the projection matrix Θ that maximizes this ratio.
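As an illustrative sketch of evaluating this determinant-based objective (synthetic 3-class data and the scatter_matrices helper from above; all names and data are my own, not from the notes):

    # Synthetic 3-class data in D = 4 dimensions (hypothetical, for illustration).
    rng = np.random.default_rng(1)
    means = np.array([[0., 0., 0., 0.],
                      [3., 0., 0., 0.],
                      [0., 3., 0., 0.]])
    X = np.vstack([rng.normal(loc=m, scale=1.0, size=(30, 4)) for m in means])
    labels = np.repeat(np.arange(3), 30)
    S_W, S_B, _ = scatter_matrices(X, labels)

    def fisher_criterion(Theta, S_B, S_W):
        """J(Θ) = |Θ^T S_B Θ| / |Θ^T S_W Θ|."""
        return np.linalg.det(Theta.T @ S_B @ Theta) / np.linalg.det(Theta.T @ S_W @ Theta)

    # A random D x (C-1) candidate projection; the optimal Θ* is derived next.
    Theta_random = rng.normal(size=(4, 2))
    print("J(random Theta) =", fisher_criterion(Theta_random, S_B, S_W))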

We will not give the full derivation here, but the optimal projection matrix Θ* is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem:

    Θ* = [θ_1* θ_2* ... θ_{C−1}*] = argmax_Θ |Θ^T S_B Θ| / |Θ^T S_W Θ|    ⟹    (S_B − λ_i S_W) θ_i* = 0.

Thus, if S_W is non-singular and can be inverted, Fisher's criterion is maximized when the projection matrix Θ* is composed of the eigenvectors of

    S_W^{-1} S_B.

Note that there will be at most C − 1 eigenvectors with non-zero real eigenvalues λ_i, because S_B is of rank (C − 1) or less. So LDA can represent a massive reduction in the dimensionality of the problem: in face recognition, for example, there may be several thousand variables but only a few hundred classes.
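A minimal sketch of this final step, continuing the 3-class example above and assuming S_W is invertible (for a library alternative, scikit-learn's LinearDiscriminantAnalysis offers a similar eigenvalue-based solver):

    # Columns of Θ* are the eigenvectors of S_W^{-1} S_B with the largest eigenvalues.
    evals, evecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(evals.real)[::-1]              # sort eigenvalues, descending
    C = len(np.unique(labels))
    Theta_opt = evecs[:, order[:C - 1]].real          # keep at most C − 1 directions

    print("eigenvalues:", np.round(evals.real[order], 4))           # about C − 1 are non-zero
    print("J(Theta_opt) =", fisher_criterion(Theta_opt, S_B, S_W))  # at least as large as for the random Theta
    Y = X @ Theta_opt                                 # projected samples: y = Θ^T x
    print("projected data shape:", Y.shape)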