Fisher's Linear Discriminant Analysis

Fisher's Linear Discriminant Analysis

Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr

FLD or LDA

- Introduced by Fisher (1936).
- One of the most widely used linear discriminant analysis (LDA) methods.
- Motivated by the curse of dimensionality:
  - Linear dimensionality reduction: PCA, ICA, FLD, MDS.
  - Nonlinear dimensionality reduction: Isomap, LLE, Laplacian eigenmap.
- FLD aims at achieving an optimal linear dimensionality reduction for classification.

An Example: Isotropic Case

[Figure: two isotropic classes with means $\mu_1$ and $\mu_2$ in the $(x_1, x_2)$ plane.]

FLD: A Graphical Illustration

[Figure: FLD projection of two classes with means $\mu_1$ and $\mu_2$ in the $(x_1, x_2)$ plane.]

Two Classes

Given a set of data points $\{x \in \mathbb{R}^D\}$, one wishes to find a linear projection of the data onto a 1-dimensional space, $y = w^\top x$.

Sample means for $x$:
$$\mu_i = \frac{1}{N_i} \sum_{x \in C_i} x.$$

Sample means for the projected points:
$$\widetilde{\mu}_i = \frac{1}{N_i} \sum_{y \in Y_i} y = \frac{1}{N_i} \sum_{x \in C_i} w^\top x = w^\top \mu_i.$$

We know that the difference between sample means is not always a good measure of the separation between projected points:
$$\widetilde{\mu}_1 - \widetilde{\mu}_2 = w^\top (\mu_1 - \mu_2).$$
Simply scaling $w$ makes $\widetilde{\mu}_1 - \widetilde{\mu}_2$ arbitrarily large (not desirable!), so the separation has to be judged relative to the spread of the projected samples.

FLD: Two Classes

Define the within-class scatter for the projected samples by $\widetilde{s}_1^2 + \widetilde{s}_2^2$, where
$$\widetilde{s}_i^2 = \sum_{y \in Y_i} (y - \widetilde{\mu}_i)^2 = w^\top \underbrace{\left[ \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^\top \right]}_{S_i} w.$$

FLD finds
$$w^\star = \arg\max_w \frac{(\widetilde{\mu}_1 - \widetilde{\mu}_2)^2}{\widetilde{s}_1^2 + \widetilde{s}_2^2} = \arg\max_w \frac{w^\top S_B w}{w^\top S_W w},$$
where $S_W = S_1 + S_2$ (within-class scatter matrix) and $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^\top$ (between-class scatter matrix).

$\arg\max_w \frac{w^\top S_B w}{w^\top S_W w}$ leads to $S_B w = \lambda S_W w$ (a generalized eigenvalue problem).
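The two-class solution admits the closed form $w \propto S_W^{-1}(\mu_1 - \mu_2)$. Below is a minimal NumPy sketch of that computation; the function name, the synthetic data, and the normalization are illustrative assumptions, not part of the slides.

```python
import numpy as np

def fld_two_class(X1, X2):
    """Two-class Fisher discriminant direction, proportional to S_W^{-1}(mu1 - mu2).

    X1, X2: arrays of shape (N_i, D), one sample per row.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_W = S_1 + S_2 (unnormalized covariance sums).
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    S_W = S1 + S2
    # Since S_B = (mu1 - mu2)(mu1 - mu2)^T has rank one, the generalized
    # eigenproblem S_B w = lambda S_W w is solved by w proportional to S_W^{-1}(mu1 - mu2).
    w = np.linalg.solve(S_W, mu1 - mu2)
    return w / np.linalg.norm(w)

# Illustrative data: two Gaussian clouds in R^2.
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
X2 = rng.normal([3.0, 1.0], 1.0, size=(100, 2))
print("Fisher direction:", fld_two_class(X1, X2))
```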

Multiple Discriminant Functions

For the case of $K$ classes, FLD involves $K-1$ discriminant functions, i.e., the projection is from $\mathbb{R}^D$ to $\mathbb{R}^{K-1}$.

Given a set of data $\{x \in \mathbb{R}^D\}$, one wishes to find a linear lower-dimensional embedding $W \in \mathbb{R}^{(K-1) \times D}$ such that $\{y = W x\}$ are classified as well as possible in the lower-dimensional space:
$$\underbrace{\begin{bmatrix} y_1 \\ \vdots \\ y_{K-1} \end{bmatrix}}_{y} = \underbrace{\begin{bmatrix} w_1^\top \\ \vdots \\ w_{K-1}^\top \end{bmatrix}}_{W} \underbrace{\begin{bmatrix} x_1 \\ \vdots \\ x_D \end{bmatrix}}_{x}.$$

Scatter Matrices

Within-class scatter matrix:
$$S_W = \sum_{i=1}^K \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^\top.$$

Between-class scatter matrix:
$$S_B = \sum_{i=1}^K N_i (\mu_i - \mu)(\mu_i - \mu)^\top.$$

Total scatter matrix:
$$S_T = \sum_{x} (x - \mu)(x - \mu)^\top = S_W + S_B.$$

$\mathrm{Rank}(S_B) \le K - 1$, $\mathrm{Rank}(S_W) \le N - K$, $\mathrm{Rank}(S_T) \le N - 1$.
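A short sketch (with made-up data and labels) that computes the three scatter matrices from these definitions and checks $S_T = S_W + S_B$ and the rank bound on $S_B$:

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-class, between-class, and total scatter of labeled data.

    X: (N, D) array of samples (rows); labels: (N,) integer class labels.
    """
    mu = X.mean(axis=0)
    D = X.shape[1]
    S_W, S_B = np.zeros((D, D)), np.zeros((D, D))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
    S_T = (X - mu).T @ (X - mu)
    return S_W, S_B, S_T

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
labels = rng.integers(0, 3, size=60)            # K = 3 classes
S_W, S_B, S_T = scatter_matrices(X, labels)
print(np.allclose(S_T, S_W + S_B))              # True: S_T = S_W + S_B
print(np.linalg.matrix_rank(S_B))               # at most K - 1 = 2
```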

Total Scatter Matrix

Define $X = [X_1, \ldots, X_K]$, where $X_i$ is a matrix whose columns are the data vectors belonging to $C_i$. With $e_i$ the $1 \times N_i$ row vector of all ones (and $e$ the $1 \times N$ row of ones), define
$$H_W = [X_1 - \mu_1 e_1, \ldots, X_K - \mu_K e_K], \quad H_B = [(\mu_1 - \mu)e_1, \ldots, (\mu_K - \mu)e_K], \quad H_T = [x_1 - \mu, \ldots, x_N - \mu].$$

One can easily see that $H_T = X - \mu e = H_W + H_B$. We also have
$$S_W = H_W H_W^\top, \quad S_B = H_B H_B^\top, \quad S_T = H_T H_T^\top.$$

Since $H_W H_B^\top = 0$, we have $S_T = (H_W + H_B)(H_W + H_B)^\top = S_W + S_B$.

The column vectors of $S_W$ and $S_B$ are linear combinations of centered data samples.
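A small numeric check of this factorization, with columns as data vectors and illustrative class sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
# Three classes, columns as data vectors (D x N_i), as in the slide.
Xs = [rng.normal(size=(4, n)) for n in (10, 15, 20)]
X = np.hstack(Xs)
mu = X.mean(axis=1, keepdims=True)
mus = [Xi.mean(axis=1, keepdims=True) for Xi in Xs]

# H_W stacks the centered classes; H_B repeats (mu_i - mu) N_i times.
H_W = np.hstack([Xi - mi for Xi, mi in zip(Xs, mus)])
H_B = np.hstack([np.tile(mi - mu, (1, Xi.shape[1])) for Xi, mi in zip(Xs, mus)])
H_T = X - mu

print(np.allclose(H_T, H_W + H_B))                           # H_T = H_W + H_B
print(np.allclose(H_W @ H_B.T, 0))                           # cross term vanishes
print(np.allclose(H_T @ H_T.T, H_W @ H_W.T + H_B @ H_B.T))   # S_T = S_W + S_B
```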

FLD: Multiple Classes

Define
$$\widetilde{S}_W = \sum_{i=1}^K \sum_{y \in Y_i} (y - \widetilde{\mu}_i)(y - \widetilde{\mu}_i)^\top, \qquad \widetilde{S}_B = \sum_{i=1}^K N_i (\widetilde{\mu}_i - \widetilde{\mu})(\widetilde{\mu}_i - \widetilde{\mu})^\top.$$

One can easily show that $\widetilde{S}_W = W S_W W^\top$ and $\widetilde{S}_B = W S_B W^\top$.

FLD seeks $K-1$ discriminant functions $W$ such that $y = W x$:
$$W^\star = \arg\max_W J_{\text{FLD}} = \arg\max_W \mathrm{tr}\left\{ \widetilde{S}_W^{-1} \widetilde{S}_B \right\} = \arg\max_W \mathrm{tr}\left\{ \left( W S_W W^\top \right)^{-1} \left( W S_B W^\top \right) \right\},$$
leading to the generalized eigenvalue problem
$$S_B w_i = \lambda_i S_W w_i.$$
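A sketch of multi-class FLD using SciPy's generalized symmetric eigensolver; the small ridge added to $S_W$ (to keep it positive definite) and the helper name are assumptions of this sketch.

```python
import numpy as np
from scipy.linalg import eigh

def fld_multiclass(X, labels):
    """Rows of W are the top K-1 generalized eigenvectors of S_B w = lambda S_W w.

    X: (N, D) array of samples; labels: (N,) integer class labels.
    """
    mu = X.mean(axis=0)
    classes = np.unique(labels)
    D = X.shape[1]
    S_W, S_B = np.zeros((D, D)), np.zeros((D, D))
    for c in classes:
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
    # eigh(A, B) solves A v = lambda B v; eigenvalues come back in ascending order.
    evals, evecs = eigh(S_B, S_W + 1e-8 * np.eye(D))
    k = len(classes) - 1
    return evecs[:, ::-1][:, :k].T      # W with K-1 rows; project with y = W @ x
```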

Rayleigh Quotient

Definition. Let $A \in \mathbb{R}^{m \times m}$ be symmetric. The Rayleigh quotient $R(x, A)$ is defined by
$$R(x, A) = \frac{x^\top A x}{x^\top x}.$$

Theorem. Let $A \in \mathbb{R}^{m \times m}$ be symmetric with eigenvalues $\lambda_1 \ge \cdots \ge \lambda_m$. For $x \ne 0 \in \mathbb{R}^m$, we have
$$\lambda_m \le \frac{x^\top A x}{x^\top x} \le \lambda_1,$$
and in particular,
$$\lambda_m = \min_{x \ne 0} \frac{x^\top A x}{x^\top x}, \qquad \lambda_1 = \max_{x \ne 0} \frac{x^\top A x}{x^\top x}.$$
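A quick numeric illustration of the bounds (the random symmetric matrix and random directions are assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(5, 5))
A = (M + M.T) / 2                           # symmetric test matrix
lam = np.linalg.eigvalsh(A)                 # ascending: lambda_m, ..., lambda_1

# Rayleigh quotients of many random directions stay within [lambda_m, lambda_1].
X = rng.normal(size=(10000, 5))
R = np.einsum('ij,jk,ik->i', X, A, X) / np.einsum('ij,ij->i', X, X)
print(lam[0] <= R.min() and R.max() <= lam[-1])   # True
```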

An Extremal Property of Generalized Eigenvalues

Theorem. Let $A$ and $B$ be $m \times m$ matrices, with $A$ nonnegative definite and $B$ positive definite. For $h = 1, \ldots, m$, define
$$X_h = [x_1, \ldots, x_h], \qquad Y_h = [x_h, \ldots, x_m],$$
where $x_1, \ldots, x_m$ are linearly independent eigenvectors of $B^{-1}A$ corresponding to the eigenvalues
$$\lambda_1\left(B^{-1}A\right) \ge \cdots \ge \lambda_m\left(B^{-1}A\right).$$
Then
$$\lambda_h\left(B^{-1}A\right) = \min_{Y_{h+1}^\top B x = 0} \frac{x^\top A x}{x^\top B x} = \max_{X_{h-1}^\top B x = 0} \frac{x^\top A x}{x^\top B x},$$
where $x = 0$ is excluded.

Relation to Least Squares Regression: Binary Class

Given a training set $\{x_i, y_i\}_{i=1}^N$, where $x_i \in \mathbb{R}^D$ and $y_i \in \{1, -1\}$, consider a linear discriminant function
$$f(x_i) = w^\top x_i + b.$$

Partition the data matrix into two groups, each containing the examples in class 1 or class 2, i.e., $X = [X_1, X_2]$, where $X_1 \in \mathbb{R}^{D \times N_1}$ and $X_2 \in \mathbb{R}^{D \times N_2}$.

With the binary label vector $y \in \mathbb{R}^N$, LS regression is formulated as
$$\arg\min_{w, b} \left\| y - X^\top w - b \mathbf{1}_N \right\|^2,$$
where $\mathbf{1}_N$ is the $N$-dimensional vector of all ones. This can be rewritten as
$$\arg\min_{w, b} \left\| \begin{bmatrix} \mathbf{1}_{N_1} \\ -\mathbf{1}_{N_2} \end{bmatrix} - \begin{bmatrix} X_1^\top & \mathbf{1}_{N_1} \\ X_2^\top & \mathbf{1}_{N_2} \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix} \right\|^2.$$

The solution to this LS problem satisfies the normal equation:
$$\begin{bmatrix} X_1 & X_2 \\ \mathbf{1}_{N_1}^\top & \mathbf{1}_{N_2}^\top \end{bmatrix} \begin{bmatrix} X_1^\top & \mathbf{1}_{N_1} \\ X_2^\top & \mathbf{1}_{N_2} \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix} = \begin{bmatrix} X_1 & X_2 \\ \mathbf{1}_{N_1}^\top & \mathbf{1}_{N_2}^\top \end{bmatrix} \begin{bmatrix} \mathbf{1}_{N_1} \\ -\mathbf{1}_{N_2} \end{bmatrix},$$
which is written as
$$\begin{bmatrix} X_1 X_1^\top + X_2 X_2^\top & X_1 \mathbf{1}_{N_1} + X_2 \mathbf{1}_{N_2} \\ \mathbf{1}_{N_1}^\top X_1^\top + \mathbf{1}_{N_2}^\top X_2^\top & \mathbf{1}_{N_1}^\top \mathbf{1}_{N_1} + \mathbf{1}_{N_2}^\top \mathbf{1}_{N_2} \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix} = \begin{bmatrix} X_1 \mathbf{1}_{N_1} - X_2 \mathbf{1}_{N_2} \\ \mathbf{1}_{N_1}^\top \mathbf{1}_{N_1} - \mathbf{1}_{N_2}^\top \mathbf{1}_{N_2} \end{bmatrix}.$$

Recall
$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^\top,$$
$$S_W = \left(X_1 - \mu_1 \mathbf{1}_{N_1}^\top\right)\left(X_1 - \mu_1 \mathbf{1}_{N_1}^\top\right)^\top + \left(X_2 - \mu_2 \mathbf{1}_{N_2}^\top\right)\left(X_2 - \mu_2 \mathbf{1}_{N_2}^\top\right)^\top = X_1 X_1^\top - N_1 \mu_1 \mu_1^\top + X_2 X_2^\top - N_2 \mu_2 \mu_2^\top.$$

With $S_B$ and $S_W$, the normal equation is written as
$$\begin{bmatrix} S_W + N_1 \mu_1 \mu_1^\top + N_2 \mu_2 \mu_2^\top & N_1 \mu_1 + N_2 \mu_2 \\ (N_1 \mu_1 + N_2 \mu_2)^\top & N_1 + N_2 \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix} = \begin{bmatrix} N_1 \mu_1 - N_2 \mu_2 \\ N_1 - N_2 \end{bmatrix}.$$

Solve the second equation for $b$ to obtain
$$b = \frac{(N_1 - N_2) - (N_1 \mu_1 + N_2 \mu_2)^\top w}{N_1 + N_2}.$$

Substitute this into the first equation to obtain
$$\left[ S_W + \frac{N_1 N_2}{N_1 + N_2} S_B \right] w = \frac{2 N_1 N_2}{N_1 + N_2} (\mu_1 - \mu_2).$$

Note that the vector $S_B w$ is in the direction of $\mu_1 - \mu_2$ for any $w$, since $S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^\top w$. Thus we write
$$\frac{N_1 N_2}{N_1 + N_2} S_B w = \left( \frac{2 N_1 N_2}{N_1 + N_2} - \alpha \right) (\mu_1 - \mu_2)$$
for some scalar $\alpha$. Then we have
$$w = \alpha S_W^{-1} (\mu_1 - \mu_2),$$
which is identical to the FLD solution up to a scaling factor.
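A small check (on synthetic data, an assumption of this sketch) that the least-squares weight vector with $\pm 1$ labels is parallel to the FLD direction $S_W^{-1}(\mu_1 - \mu_2)$:

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.normal([0.0, 0.0, 0.0], 1.0, size=(40, 3))    # class 1
X2 = rng.normal([2.0, 1.0, -1.0], 1.0, size=(60, 3))   # class 2
X = np.vstack([X1, X2])
y = np.concatenate([np.ones(40), -np.ones(60)])        # labels in {+1, -1}

# Least-squares fit of f(x) = w^T x + b.
A = np.hstack([X, np.ones((X.shape[0], 1))])
wb, *_ = np.linalg.lstsq(A, y, rcond=None)
w_ls = wb[:3]

# FLD closed form: w proportional to S_W^{-1}(mu1 - mu2).
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
w_fld = np.linalg.solve(S_W, mu1 - mu2)

# The two directions agree up to a positive scaling factor.
print(np.allclose(w_ls / np.linalg.norm(w_ls), w_fld / np.linalg.norm(w_fld)))
```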

Simultaneous Diagonalization

The goal: given two symmetric matrices $\Sigma_1$ and $\Sigma_2$, find a linear transformation $W$ such that
$$W^\top \Sigma_1 W = I, \qquad W^\top \Sigma_2 W = \Lambda \ \text{(diagonal)}.$$

Methods: it turns out that simultaneous diagonalization involves the generalized eigen-decomposition.
- Two-stage method: (1) whitening, (2) unitary transformation.
- Single-stage method: generalized eigenvalue decomposition.

Simultaneous Diagonalization: Algorithm Outline

1. First, whiten $\Sigma_1$, i.e.,
$$D^{-\frac{1}{2}} U_1^\top \Sigma_1 U_1 D^{-\frac{1}{2}} = I, \qquad D^{-\frac{1}{2}} U_1^\top \Sigma_2 U_1 D^{-\frac{1}{2}} = K \ \text{(not diagonal)},$$
where $\Sigma_1 = U_1 D U_1^\top$.

2. Second, apply a unitary transformation to diagonalize $K$, i.e.,
$$U_2^\top I U_2 = I, \qquad U_2^\top K U_2 = \Lambda,$$
where $K = U_2 \Lambda U_2^\top$.

Then the transformation $W$ which simultaneously diagonalizes $\Sigma_1$ and $\Sigma_2$ is given by
$$W = U_1 D^{-\frac{1}{2}} U_2,$$
such that $W^\top \Sigma_1 W = I$ and $W^\top \Sigma_2 W = \Lambda$.
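A NumPy sketch of the two-stage procedure; the test matrices are arbitrary, and $\Sigma_1$ is assumed positive definite so that $D^{-1/2}$ exists.

```python
import numpy as np

def simultaneous_diagonalize(Sigma1, Sigma2):
    """Two-stage simultaneous diagonalization: whiten Sigma1, then rotate.

    Returns W with W.T @ Sigma1 @ W = I and W.T @ Sigma2 @ W diagonal.
    """
    # Stage 1: whitening transform from Sigma1 = U1 D U1^T.
    d, U1 = np.linalg.eigh(Sigma1)
    Wh = U1 @ np.diag(d ** -0.5)              # U1 D^{-1/2}
    K = Wh.T @ Sigma2 @ Wh                    # whitened Sigma2, generally not diagonal
    # Stage 2: orthogonal rotation diagonalizing K (K = U2 Lambda U2^T).
    _, U2 = np.linalg.eigh(K)
    return Wh @ U2                            # W = U1 D^{-1/2} U2

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 4)); Sigma1 = A @ A.T + 4 * np.eye(4)   # positive definite
B = rng.normal(size=(4, 4)); Sigma2 = (B + B.T) / 2             # symmetric
W = simultaneous_diagonalize(Sigma1, Sigma2)
print(np.allclose(W.T @ Sigma1 @ W, np.eye(4)))
off_diag = W.T @ Sigma2 @ W - np.diag(np.diag(W.T @ Sigma2 @ W))
print(np.allclose(off_diag, 0))
```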

Simultaneous Diagonalization: Generalized Eigen-Decomposition

Alternatively, we can diagonalize two symmetric matrices $\Sigma_1$ and $\Sigma_2$ as
$$W^\top \Sigma_1 W = I, \qquad W^\top \Sigma_2 W = \Lambda \ \text{(diagonal)},$$
where $\Lambda$ and $W$ contain the eigenvalues and eigenvectors of $\Sigma_1^{-1} \Sigma_2$, i.e.,
$$\Sigma_1^{-1} \Sigma_2 W = W \Lambda.$$

Prove it!
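One way to carry out the proof (a sketch, not taken from the slides): let $W = U_1 D^{-1/2} U_2$ from the previous slide, so that $W^\top \Sigma_1 W = I$ and $W^\top \Sigma_2 W = \Lambda$, and note that $W$ is invertible. From $W^\top \Sigma_1 W = I$ we get $\Sigma_1 W = W^{-\top}$, hence $\Sigma_2 W = W^{-\top} \Lambda = \Sigma_1 W \Lambda$, and multiplying on the left by $\Sigma_1^{-1}$ gives $\Sigma_1^{-1} \Sigma_2 W = W \Lambda$.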

Example: Multi-Modal Data

[Figure: multi-modal data example.]

Alternative Expressions of S_W and S_B

Alternatively, $S_W$ and $S_B$ are expressed as
$$S_W = \frac{1}{2} \sum_{i,j} A^W_{ij} (x_i - x_j)(x_i - x_j)^\top, \qquad S_B = \frac{1}{2} \sum_{i,j} A^B_{ij} (x_i - x_j)(x_i - x_j)^\top,$$
where
$$A^W_{ij} = \begin{cases} \frac{1}{N_k} & \text{if } x_i \in C_k \text{ and } x_j \in C_k, \\ 0 & \text{if } x_i \text{ and } x_j \text{ are in different classes}, \end{cases}
\qquad
A^B_{ij} = \begin{cases} \frac{1}{N} - \frac{1}{N_k} & \text{if } x_i \in C_k \text{ and } x_j \in C_k, \\ \frac{1}{N} & \text{if } x_i \text{ and } x_j \text{ are in different classes}. \end{cases}$$

$$\begin{aligned}
S_W &= \sum_{i=1}^K \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^\top \\
&= \sum_{i=1}^K \sum_{j \in C_i} \left( x_j - \frac{1}{N_i} \sum_{u \in C_i} x_u \right) \left( x_j - \frac{1}{N_i} \sum_{v \in C_i} x_v \right)^\top \\
&= \sum_{i=1}^K \left[ \sum_{j \in C_i} x_j x_j^\top - \frac{1}{N_i} \sum_{j \in C_i} \sum_{v \in C_i} x_j x_v^\top - \frac{1}{N_i} \sum_{u \in C_i} \sum_{j \in C_i} x_u x_j^\top + \frac{1}{N_i} \sum_{u \in C_i} \sum_{v \in C_i} x_u x_v^\top \right] \\
&= \sum_{i=1}^K \left[ \sum_{j \in C_i} x_j x_j^\top - \frac{1}{N_i} \sum_{u \in C_i} \sum_{v \in C_i} x_u x_v^\top \right] \\
&= \sum_{i=1}^N \left( \sum_{j=1}^N A^W_{ij} \right) x_i x_i^\top - \sum_{i=1}^N \sum_{j=1}^N A^W_{ij} x_i x_j^\top \\
&= \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N A^W_{ij} \left( x_i x_i^\top + x_j x_j^\top - x_i x_j^\top - x_j x_i^\top \right) \\
&= \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N A^W_{ij} (x_i - x_j)(x_i - x_j)^\top.
\end{aligned}$$

$$\begin{aligned}
S_B &= S_T - S_W = \sum_{i=1}^N (x_i - \mu)(x_i - \mu)^\top - S_W \\
&= \sum_{i=1}^N x_i x_i^\top - \frac{1}{N} \sum_{i=1}^N \sum_{j=1}^N x_i x_j^\top - S_W \\
&= \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \frac{1}{N} \left( x_i x_i^\top + x_j x_j^\top - x_i x_j^\top - x_j x_i^\top \right) - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N A^W_{ij} \left( x_i x_i^\top + x_j x_j^\top - x_i x_j^\top - x_j x_i^\top \right) \\
&= \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \left( \frac{1}{N} - A^W_{ij} \right) (x_i - x_j)(x_i - x_j)^\top.
\end{aligned}$$
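A numeric verification of the pairwise expressions on made-up data (the weights $A^W_{ij}$ and $A^B_{ij}$ follow the definitions above):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))
labels = rng.integers(0, 3, size=30)
N = len(X)
mu = X.mean(axis=0)

# Direct definitions of S_W and S_B.
S_W = sum((X[labels == c] - X[labels == c].mean(0)).T
          @ (X[labels == c] - X[labels == c].mean(0)) for c in np.unique(labels))
S_B = sum((labels == c).sum() * np.outer(X[labels == c].mean(0) - mu,
                                         X[labels == c].mean(0) - mu)
          for c in np.unique(labels))

# Pairwise weights: A^W_ij = 1/N_k for same-class pairs, 0 otherwise; A^B = 1/N - A^W.
same = labels[:, None] == labels[None, :]
Nk = np.array([(labels == c).sum() for c in labels], float)   # class size per sample
A_W = np.where(same, 1.0 / Nk[None, :], 0.0)
A_B = 1.0 / N - A_W

diff = X[:, None, :] - X[None, :, :]                          # pairwise differences
S_W_pair = 0.5 * np.einsum('ij,ijd,ije->de', A_W, diff, diff)
S_B_pair = 0.5 * np.einsum('ij,ijd,ije->de', A_B, diff, diff)
print(np.allclose(S_W, S_W_pair), np.allclose(S_B, S_B_pair))  # True True
```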

Local Within-Class and Between-Class Scatter

Given a weighted adjacency matrix $[A_{ij}]$, introduce the local within-class scatter and local between-class scatter:
$$S_W = \frac{1}{2} \sum_{i,j} A^W_{ij} (x_i - x_j)(x_i - x_j)^\top, \qquad S_B = \frac{1}{2} \sum_{i,j} A^B_{ij} (x_i - x_j)(x_i - x_j)^\top,$$
where
$$A^W_{ij} = \begin{cases} \frac{A_{ij}}{N_k} & \text{if } x_i \in C_k \text{ and } x_j \in C_k, \\ 0 & \text{if } x_i \text{ and } x_j \text{ are in different classes}, \end{cases}
\qquad
A^B_{ij} = \begin{cases} A_{ij}\left( \frac{1}{N} - \frac{1}{N_k} \right) & \text{if } x_i \in C_k \text{ and } x_j \in C_k, \\ \frac{1}{N} & \text{if } x_i \text{ and } x_j \text{ are in different classes}. \end{cases}$$

Local Fisher Discriminant Analysis (LFDA)

Proposed by M. Sugiyama (ICML-2006). LFDA seeks $K-1$ discriminant functions $W$ such that $y = W x$:
$$\arg\max_W \mathrm{tr}\left\{ \left( W S_W W^\top \right)^{-1} \left( W S_B W^\top \right) \right\},$$
where the local within-class scatter matrix and local between-class scatter matrix are
$$S_W = \frac{1}{2} \sum_{i,j} A^W_{ij} (x_i - x_j)(x_i - x_j)^\top, \qquad S_B = \frac{1}{2} \sum_{i,j} A^B_{ij} (x_i - x_j)(x_i - x_j)^\top.$$
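A rough sketch of LFDA along these lines; the Gaussian affinity used for $[A_{ij}]$ and the parameter names are assumptions of this sketch (Sugiyama's paper uses a locally scaled affinity).

```python
import numpy as np
from scipy.linalg import eigh

def lfda(X, labels, dim, sigma=1.0):
    """LFDA sketch: local scatter matrices plus a generalized eigenproblem.

    X: (N, D) samples, labels: (N,) integers, dim: target dimension,
    sigma: bandwidth of the (assumed) Gaussian affinity.
    """
    N, D = X.shape
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2 * sigma ** 2))                 # weighted adjacency matrix
    same = labels[:, None] == labels[None, :]
    Nk = np.array([(labels == c).sum() for c in labels], float)
    A_W = np.where(same, A / Nk[None, :], 0.0)
    A_B = np.where(same, A * (1.0 / N - 1.0 / Nk[None, :]), 1.0 / N)

    diff = X[:, None, :] - X[None, :, :]
    S_W = 0.5 * np.einsum('ij,ijd,ije->de', A_W, diff, diff)
    S_B = 0.5 * np.einsum('ij,ijd,ije->de', A_B, diff, diff)
    # Top generalized eigenvectors of S_B w = lambda S_W w give the embedding.
    _, V = eigh(S_B, S_W + 1e-8 * np.eye(D))
    return V[:, ::-1][:, :dim].T                       # rows of W; y = W @ x
```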