Fisher Linear Discriminant Analysis


Max Welling
Department of Computer Science, University of Toronto
10 King's College Road, Toronto, M5S 3G5, Canada
welling@cs.toronto.edu

Abstract

This is a note to explain Fisher linear discriminant analysis.

1 Fisher LDA

The most famous example of dimensionality reduction is principal components analysis (PCA). This technique searches for the directions in the data that have the largest variance and subsequently projects the data onto them. In this way, we obtain a lower dimensional representation of the data that removes some of the noisy directions. There are many difficult issues concerning how many directions one should choose, but that is beyond the scope of this note.

PCA is an unsupervised technique and as such does not include label information of the data. For instance, imagine two cigar-like clusters in two dimensions, one with label y = 1 and the other with y = -1. The cigars are positioned in parallel and very close together, so that the variance in the total data-set, ignoring the labels, is in the direction of the cigars. For classification this would be a terrible projection, because all labels get evenly mixed and we destroy the useful information. A much more useful projection is orthogonal to the cigars, i.e. in the direction of least overall variance, which would perfectly separate the data-cases (obviously, we would still need to perform classification in this 1-D space).

So the question is: how do we utilize the label information in finding informative projections? To that end, Fisher-LDA considers maximizing the following objective:

    J(w) = \frac{w^T S_B w}{w^T S_W w}    (1)

where S_B is the between classes scatter matrix and S_W is the within classes scatter matrix. Note that, since scatter matrices are proportional to covariance matrices, we could equally have defined J using covariance matrices; the proportionality constant would have no effect on the solution. The definitions of the scatter matrices are:

    S_B = \sum_c N_c (\mu_c - \bar{x})(\mu_c - \bar{x})^T    (2)

    S_W = \sum_c \sum_{i \in c} (x_i - \mu_c)(x_i - \mu_c)^T    (3)

where,

    \mu_c = \frac{1}{N_c} \sum_{i \in c} x_i    (4)

    \bar{x} = \frac{1}{N} \sum_i x_i = \frac{1}{N} \sum_c N_c \mu_c    (5)

and N_c is the number of cases in class c. Oftentimes you will see that, for 2 classes, S_B is defined as S_B' = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T. This is the scatter of class 1 with respect to the scatter of class 2, and one can show that S_B = \frac{N_1 N_2}{N} S_B'; but since this boils down to multiplying the objective by a constant, it makes no difference to the final solution.

Why does this objective make sense? Well, it says that a good solution is one where the class-means are well separated, measured relative to the (sum of the) variances of the data assigned to each particular class. This is precisely what we want, because it implies that the gap between the classes is expected to be big. It is also interesting to observe that, since the total scatter,

    S_T = \sum_i (x_i - \bar{x})(x_i - \bar{x})^T    (6)

is given by S_T = S_W + S_B, the objective can be rewritten as,

    J(w) = \frac{w^T S_T w}{w^T S_W w} - 1    (7)

and hence can be interpreted as maximizing the total scatter of the data while minimizing the within scatter of the classes.

An important property of the objective J is that it is invariant w.r.t. rescalings of the vector, w \rightarrow \alpha w. Hence, we can always choose w such that the denominator is simply w^T S_W w = 1, since it is a scalar. For this reason we can transform the problem of maximizing J into the following constrained optimization problem,

    \min_w \; -\frac{1}{2} w^T S_B w    (8)
    s.t. \; w^T S_W w = 1    (9)

corresponding to the Lagrangian,

    L_P = -\frac{1}{2} w^T S_B w + \frac{1}{2} \lambda (w^T S_W w - 1)    (10)

(the halves are added for convenience). The KKT conditions tell us that the following equation needs to hold at the solution,

    S_B w = \lambda S_W w \;\Rightarrow\; S_W^{-1} S_B w = \lambda w    (11)

This almost looks like an eigenvalue equation; it would be one if the matrix S_W^{-1} S_B were symmetric (in fact, it is called a generalized eigen-problem). However, we can apply the following transformation, using the fact that S_B is symmetric positive definite and can hence be written as S_B = S_B^{1/2} S_B^{1/2}, where S_B^{1/2} is constructed from its eigenvalue decomposition as S_B = U \Lambda U^T \Rightarrow S_B^{1/2} = U \Lambda^{1/2} U^T. Defining v = S_B^{1/2} w we get,

    S_B^{1/2} S_W^{-1} S_B^{1/2} v = \lambda v    (12)

This is a regular eigenvalue problem for the symmetric, positive definite matrix S_B^{1/2} S_W^{-1} S_B^{1/2}, for which we can find solutions \lambda_k and v_k that correspond to solutions w_k = S_B^{-1/2} v_k.
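To make the linear case concrete, here is a minimal sketch (my own NumPy/SciPy code, not part of the original note; the toy data and all names are illustrative). It computes the scatter matrices (2)-(5) for a two-class dataset and solves the generalized eigenproblem (11) directly with scipy.linalg.eigh, keeping the eigenvector of the largest eigenvalue, as justified by equation (13) below.

# Sketch of Fisher LDA via the generalized eigenproblem (eqs. 1-13).
# Assumes numpy and scipy are available; names are illustrative only.
import numpy as np
from scipy.linalg import eigh

def fisher_lda_direction(X, y):
    """X: (N, d) data matrix, y: (N,) class labels. Returns the projection w."""
    classes = np.unique(y)
    x_bar = X.mean(axis=0)                    # overall mean, eq. (5)
    d = X.shape[1]
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)                # class mean, eq. (4)
        diff = (mu_c - x_bar)[:, None]
        S_B += len(Xc) * (diff @ diff.T)      # between-class scatter, eq. (2)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)    # within-class scatter, eq. (3)
    # Generalized eigenproblem S_B w = lambda S_W w, eq. (11); keep the
    # eigenvector of the largest eigenvalue (see eq. (13)).
    eigvals, eigvecs = eigh(S_B, S_W)
    return eigvecs[:, np.argmax(eigvals)]

# Toy usage: two elongated, parallel "cigar" clusters as in the text.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(100, 2)) * np.array([5.0, 0.3]) + np.array([0.0, 1.0])
X2 = rng.normal(size=(100, 2)) * np.array([5.0, 0.3]) + np.array([0.0, -1.0])
X = np.vstack([X1, X2])
y = np.array([1] * 100 + [-1] * 100)
w = fisher_lda_direction(X, y)
print(w)   # points (up to scale) along the direction of least overall variance

Solving the generalized problem directly avoids forming S_W^{-1} S_B, or the symmetrized matrix in (12), explicitly, which tends to be numerically preferable when S_W is ill-conditioned.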

It remains to choose which eigenvalue and eigenvector correspond to the desired solution. Plugging the solution back into the objective J, we find,

    J(w) = \frac{w^T S_B w}{w^T S_W w} = \lambda_k \frac{w_k^T S_W w_k}{w_k^T S_W w_k} = \lambda_k    (13)

from which it immediately follows that we want the largest eigenvalue to maximize the objective. [1]

[1] If you try to find the dual and maximize that, you will get the wrong sign, it seems. My best guess of what goes wrong is that the constraint is not linear, and as a result the problem is not convex; hence we cannot expect the optimal dual solution to be the same as the optimal primal solution.

2 Kernel Fisher LDA

So how do we kernelize this problem? Unlike for SVMs, the dual problem does not seem to reveal the kernelized problem naturally. But, inspired by the SVM case, we make the following key assumption,

    w = \sum_i \alpha_i \Phi(x_i)    (14)

This is a central recurrent equation that keeps popping up in every kernel machine. It says that, although the feature space is very high (or even infinite) dimensional, with a finite number of data-cases the final solution w will not have a component outside the space spanned by the data-cases. It would not make much sense to do this transformation if the number of data-cases were larger than the number of dimensions, but this is typically not the case for kernel methods. So, we argue that although there are possibly infinitely many dimensions available a priori, at most N are occupied by the data, and the solution w must lie in their span. This is a case of the representer theorem, which intuitively reasons as follows. The solution w is the solution to some eigenvalue equation, S_W^{-1} S_B w = \lambda w, where both S_B and S_W (and hence its inverse) lie in the span of the data-cases. Hence, the part of w that is perpendicular to this span is projected to zero, and the equation above puts no constraints on those dimensions. They can be arbitrary and have no impact on the solution. If we now assume a very general form of regularization on the norm of w, then these orthogonal components will be set to zero in the final solution: w_\perp = 0.

In terms of \alpha the objective J(\alpha) becomes,

    J(\alpha) = \frac{\alpha^T S_B^\Phi \alpha}{\alpha^T S_W^\Phi \alpha}    (15)

where it is understood that vector notation now applies to a different space, namely the space spanned by the data-vectors, R^N. The scatter matrices in kernel space can be expressed in terms of the kernel only, as follows (this requires some algebra to verify),

    S_B^\Phi = \sum_c N_c [\kappa_c \kappa_c^T - \kappa \kappa^T]    (16)

    S_W^\Phi = K^2 - \sum_c N_c \kappa_c \kappa_c^T    (17)

    \kappa_c = \frac{1}{N_c} \sum_{j \in c} K_j    (18)

    \kappa = \frac{1}{N} \sum_j K_j    (19)

where K_j denotes the j-th column of the Gram matrix K.
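A corresponding sketch for the kernel case follows (again my own code, not from the note; the RBF kernel and the small ridge beta*I added to S_W^\Phi are assumptions, the latter made purely so that the generalized eigensolver receives a positive definite matrix, anticipating the regularization discussed below). It builds the scatter matrices (16)-(19) from the Gram matrix and solves for \alpha.

# Sketch of kernel Fisher LDA in terms of the Gram matrix only (eqs. 14-19).
# The RBF kernel and the ridge beta*I are my own choices, not the note's.
import numpy as np
from scipy.linalg import eigh

def rbf_kernel(A, B, gamma=1.0):
    # Squared Euclidean distances between all rows of A and all rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def kernel_fisher_alpha(K, y, beta=1e-3):
    """K: (N, N) Gram matrix, y: (N,) labels. Returns the coefficient vector alpha."""
    N = K.shape[0]
    kappa_bar = K.mean(axis=1)                  # eq. (19): (1/N) sum_j K_j
    S_B_phi = np.zeros((N, N))
    S_W_phi = K @ K                             # first term of eq. (17)
    for c in np.unique(y):
        Nc = np.sum(y == c)
        kappa_c = K[:, y == c].mean(axis=1)     # eq. (18): (1/N_c) sum_{j in c} K_j
        S_B_phi += Nc * (np.outer(kappa_c, kappa_c) - np.outer(kappa_bar, kappa_bar))  # eq. (16)
        S_W_phi -= Nc * np.outer(kappa_c, kappa_c)                                     # eq. (17)
    # Generalized eigenproblem in alpha; the ridge keeps the denominator
    # matrix positive definite (cf. the regularization term in Section 2).
    eigvals, eigvecs = eigh(S_B_phi, S_W_phi + beta * np.eye(N))
    return eigvecs[:, np.argmax(eigvals)]

# e.g. with the toy X, y from the earlier sketch:
# alpha = kernel_fisher_alpha(rbf_kernel(X, X), y)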

So, we have managed to express the problem in terms of kernels only, which is what we were after. Note that since the objective in terms of \alpha has exactly the same form as that in terms of w, we can solve it by solving the generalized eigenvalue equation. This scales as N^3, which is certainly expensive for many datasets. More efficient optimization schemes, solving a slightly different problem and based on efficient quadratic programs, exist in the literature.

Projections of new test-points onto the solution space can be computed by,

    w^T \Phi(x) = \sum_i \alpha_i K(x_i, x)    (20)

as usual. In order to classify the test point we still need to divide the space into regions that belong to one class. The easiest possibility is to pick the class with the smallest Mahalanobis distance: d(x, \mu_c^\Phi) = (x^\alpha - \mu_c^\alpha)^2 / (\sigma_c^\alpha)^2, where x^\alpha = w^T \Phi(x) is the projection of x, and \mu_c^\alpha and \sigma_c^\alpha represent the class mean and standard deviation in the 1-d projected space, respectively. Alternatively, one could train any classifier in the 1-d subspace.

One very important issue that we did not pay attention to is regularization. Clearly, as it stands, the kernel machine will overfit. To regularize, we can add a term to the denominator,

    S_W^\Phi \rightarrow S_W^\Phi + \beta I    (21)

Adding a diagonal term to this matrix makes sure that very small eigenvalues are bounded away from zero, which improves numerical stability in computing the inverse. If we write the Lagrangian formulation, where we maximize a constrained quadratic form in \alpha, the extra term appears as a penalty proportional to \|\alpha\|^2, which acts as a weight-decay term, favoring smaller values of \alpha over larger ones. Fortunately, the optimization problem has exactly the same form in the regularized case.
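Continuing the sketch, a hypothetical projection-and-classification step following equation (20) and the distance rule above could look as follows (my own code and naming; it assumes \alpha was obtained as in the previous sketch and that K_train_test holds the kernel values between training and test points).

# Sketch of projecting and classifying test points, following eq. (20) and
# the variance-normalized distance rule in the text. Builds on kernel_fisher_alpha.
import numpy as np

def project(alpha, K_cols):
    """Eq. (20): w^T Phi(x) = sum_i alpha_i K(x_i, x), one value per column of K_cols."""
    return alpha @ K_cols

def classify(alpha, K_train, y_train, K_train_test):
    z_train = project(alpha, K_train)        # 1-d projections of the training data
    z_test = project(alpha, K_train_test)    # 1-d projections of the test data
    labels = np.unique(y_train)
    mu = np.array([z_train[y_train == c].mean() for c in labels])
    sigma = np.array([z_train[y_train == c].std() for c in labels])
    # d(x, mu_c) = (x^alpha - mu_c^alpha)^2 / (sigma_c^alpha)^2
    d = (z_test[:, None] - mu[None, :]) ** 2 / (sigma[None, :] ** 2)
    return labels[np.argmin(d, axis=1)]

# e.g. preds = classify(alpha, K, y, rbf_kernel(X, X_test)) with the kernel above.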

3 A Constrained Convex Programming Formulation of FDA

We will now give a simplified derivation of an equivalent mathematical program derived by Mika and co-workers. We first represent the problem in yet another form as,

    \min_w \; \frac{1}{2} w^T S_W w    (22)
    s.t. \; w^T S_B w = c    (23)

where we have switched the roles of the within and between scatter (and replaced a minus sign with a plus sign). Now we note that by shifting the coordinates, x \rightarrow x + a, we can always arrange for the overall mean of the data to be wherever we like; the solution for w does not depend on it. We also recall that the constraint on S_B can be equivalently written as,

    w^T S_B w = c \;\Leftrightarrow\; \|\mu_1^w - \mu_2^w\|^2 = g    (24)

where \mu_c^w is the class mean in the projected space. Since both g and \bar{x} are at our disposal, we can equivalently pick \mu_1^w and \mu_2^w and let c and \bar{x} be determined by that choice. We choose \mu_1^w = 1 and \mu_2^w = -1, i.e. \mu_c^w = y_c, for convenience. The objective can be expressed as,

    w^T S_W w = \sum_{i: y_i = +1} (w^T x_i - \mu_1^w)^2 + \sum_{i: y_i = -1} (w^T x_i - \mu_2^w)^2    (25)

We can replace \mu_c^w = y_c in the above expression if we explicitly add this constraint. Defining \xi_i = w^T x_i - y_i, we find,

    w^T S_W w = \sum_{i: y_i = +1} \xi_i^2 + \sum_{i: y_i = -1} \xi_i^2 = \sum_i \xi_i^2    (26)

by definition of \xi_i. To express the constraints \mu_c^w = y_c, c = 1, 2, we note that,

    \sum_{i: y_i = +1} \xi_i = \sum_{i: y_i = +1} (w^T x_i - 1) = N_1 (\mu_1^w - 1).    (27)

Hence, by constraining \sum_{i \in c} \xi_i = 0 we enforce the constraint. So finally, the program is,

    \min_{w, \xi} \; \frac{1}{2} \sum_i \xi_i^2    (28)
    s.t. \; \xi_i = w^T x_i - y_i    (29)
    \sum_{i \in c} \xi_i = 0, \quad c = 1, 2    (30)

To move to kernel space, you simply replace w^T x_i \rightarrow \sum_j \alpha_j K(x_j, x_i) in the definition of \xi_i, and you add a regularization term on \alpha to the objective. This is typically of the form \|\alpha\|^2 or \alpha^T K \alpha.

This exercise reveals two important things. Firstly, the end result looks a lot like the program for the SVM and SVR case; in some sense we are regressing on the labels. Secondly, we can change the norms on \xi and \alpha from L_2 to L_1. Changing the norm on \alpha will have the effect of making the solution sparse in \alpha.
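As a final illustration, here is one way (my own formulation; the note does not prescribe a solver) to solve the linear program (28)-(30): with \xi = Xw - y, the constraints (30) are linear in w, so the KKT conditions of this equality-constrained least-squares problem form a single linear system.

# Sketch of the constrained least-squares program (28)-(30) in the linear case.
# Solving the KKT system is my own choice of method, not the note's.
import numpy as np

def fda_least_squares(X, y):
    """Minimize 1/2 * sum_i xi_i^2 with xi = Xw - y and per-class sums of xi equal to 0."""
    N, d = X.shape
    y = y.astype(float)                     # expects labels +1 / -1
    labels = np.unique(y)
    A = np.vstack([(y == c).astype(float) for c in labels])   # (2, N) class indicators
    # KKT system for  min 1/2 ||Xw - y||^2  s.t.  A (Xw - y) = 0
    KKT = np.block([[X.T @ X, X.T @ A.T],
                    [A @ X,   np.zeros((len(labels), len(labels)))]])
    rhs = np.concatenate([X.T @ y, A @ y])
    sol = np.linalg.lstsq(KKT, rhs, rcond=None)[0]   # lstsq tolerates a singular block
    return sol[:d]                          # the multipliers sol[d:] are discarded

Since the note argues that this program is equivalent to Fisher LDA, the returned w should agree, up to scale, with the direction found in the first sketch.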