Fisher Linear Discriminant Analysis

Max Welling
Department of Computer Science
University of Toronto
10 King's College Road, Toronto, M5S 3G5 Canada
welling@cs.toronto.edu

Abstract

This is a note to explain Fisher linear discriminant analysis.

1 Fisher LDA

The most famous example of dimensionality reduction is principal components analysis. This technique searches for directions in the data that have largest variance and subsequently projects the data onto them. In this way, we obtain a lower dimensional representation of the data that removes some of the "noisy" directions. There are many difficult issues with how many directions one needs to choose, but that is beyond the scope of this note.

PCA is an unsupervised technique and as such does not include label information of the data. For instance, if we imagine 2 cigar-like clusters in 2 dimensions, one cigar has y = 1 and the other y = -1. The cigars are positioned in parallel and very closely together, such that the variance in the total data-set, ignoring the labels, is in the direction of the cigars. For classification, this would be a terrible projection, because all labels get evenly mixed and we destroy the useful information. A much more useful projection is orthogonal to the cigars, i.e. in the direction of least overall variance, which would perfectly separate the data-cases (obviously, we would still need to perform classification in this 1-D space).

So the question is: how do we utilize the label information in finding informative projections? To that purpose Fisher-LDA considers maximizing the following objective:

    J(w) = \frac{w^T S_B w}{w^T S_W w}    (1)

where S_B is the "between classes scatter matrix" and S_W is the "within classes scatter matrix". Note that since scatter matrices are proportional to covariance matrices, we could equally have defined J using covariance matrices; the proportionality constant would have no effect on the solution. The definitions of the scatter matrices are:

    S_B = \sum_c N_c (\mu_c - \bar{x})(\mu_c - \bar{x})^T    (2)

    S_W = \sum_c \sum_{i \in c} (x_i - \mu_c)(x_i - \mu_c)^T    (3)
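These definitions translate directly into code. A minimal NumPy sketch (the function name and interface here are illustrative choices, not part of the note):

```python
import numpy as np

def scatter_matrices(X, y):
    """Between-class scatter S_B (Eq. 2) and within-class scatter S_W
    (Eq. 3) for data X of shape (N, d) and class labels y of shape (N,)."""
    x_bar = X.mean(axis=0)                     # overall mean, Eq. (5)
    d = X.shape[1]
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        N_c = Xc.shape[0]
        mu_c = Xc.mean(axis=0)                 # class mean, Eq. (4)
        diff = (mu_c - x_bar)[:, None]
        S_B += N_c * (diff @ diff.T)           # Eq. (2)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)     # Eq. (3)
    return S_B, S_W
```

A quick sanity check is that the two scatters sum to the total scatter of Eq. (6) below, S_T = S_W + S_B.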
where,

    \mu_c = \frac{1}{N_c} \sum_{i \in c} x_i    (4)

    \bar{x} = \frac{1}{N} \sum_i x_i = \frac{1}{N} \sum_c N_c \mu_c    (5)

and N_c is the number of cases in class c. Oftentimes you will see that for 2 classes S_B is defined as S_B' = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T. This is the scatter of class 1 with respect to the scatter of class 2, and you can show that S_B = \frac{N_1 N_2}{N} S_B', but since this boils down to multiplying the objective by a constant it makes no difference to the final solution.

Why does this objective make sense? Well, it says that a good solution is one where the class-means are well separated, measured relative to the (sum of the) variances of the data assigned to a particular class. This is precisely what we want, because it implies that the gap between the classes is expected to be big. It is also interesting to observe that since the total scatter,

    S_T = \sum_i (x_i - \bar{x})(x_i - \bar{x})^T    (6)

is given by S_T = S_W + S_B, the objective can be rewritten as,

    J(w) = \frac{w^T S_T w}{w^T S_W w} - 1    (7)

and hence can be interpreted as maximizing the total scatter of the data while minimizing the within scatter of the classes.

An important property to notice about the objective J is that it is invariant w.r.t. rescalings of the vectors, w -> \alpha w. Hence, we can always choose w such that the denominator is simply w^T S_W w = 1, since it is a scalar itself. For this reason we can transform the problem of maximizing J into the following constrained optimization problem,

    \min_w  -\frac{1}{2} w^T S_B w    (8)

    s.t.  w^T S_W w = 1    (9)

corresponding to the Lagrangian,

    L_P = -\frac{1}{2} w^T S_B w + \frac{1}{2} \lambda (w^T S_W w - 1)    (10)

(the halves are added for convenience). The KKT conditions tell us that the following equation needs to hold at the solution,

    S_B w = \lambda S_W w  ==>  S_W^{-1} S_B w = \lambda w    (11)

This almost looks like an eigenvalue equation; it would be one if the matrix S_W^{-1} S_B were symmetric (in fact, it is called a generalized eigen-problem). However, we can apply the following transformation, using the fact that S_W is symmetric positive definite and can hence be written as S_W = S_W^{1/2} S_W^{1/2}, where S_W^{1/2} is constructed from its eigenvalue decomposition S_W = U \Lambda U^T as S_W^{1/2} = U \Lambda^{1/2} U^T.
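Both the scale invariance of J and the identity of Eq. (7) are easy to verify numerically; a small sketch (the helper name is my own):

```python
import numpy as np

def fisher_objective(w, S_B, S_W):
    """The Fisher criterion of Eq. (1): ratio of between-class to
    within-class scatter along the direction w."""
    return (w @ S_B @ w) / (w @ S_W @ w)
```

By construction, J(alpha * w) == J(w) for any nonzero scalar alpha, and J(w) == w^T S_T w / w^T S_W w - 1 whenever S_T = S_B + S_W.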
Defining v = S_W^{1/2} w we get,

    S_W^{-1/2} S_B S_W^{-1/2} v = \lambda v    (12)

This is a regular eigenvalue problem for the symmetric, positive semi-definite matrix S_W^{-1/2} S_B S_W^{-1/2}, for which we can find solutions \lambda_k and v_k that correspond to solutions w_k = S_W^{-1/2} v_k of the original problem.
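This whitening route can be sketched in NumPy, assuming S_W is strictly positive definite so that S_W^{-1/2} exists (the function name is illustrative):

```python
import numpy as np
from scipy.linalg import eigh

def fisher_direction(S_B, S_W):
    """Solve S_B w = lambda S_W w via the symmetric problem of Eq. (12):
    form S_W^{-1/2} from the eigendecomposition S_W = U Lambda U^T,
    solve the ordinary eigenproblem for S_W^{-1/2} S_B S_W^{-1/2},
    and map the top eigenvector back via w = S_W^{-1/2} v."""
    lam, U = np.linalg.eigh(S_W)                    # S_W = U Lambda U^T
    S_W_inv_sqrt = U @ np.diag(1.0 / np.sqrt(lam)) @ U.T
    M = S_W_inv_sqrt @ S_B @ S_W_inv_sqrt           # symmetric matrix of Eq. (12)
    _, V = np.linalg.eigh(M)                        # eigenvalues in ascending order
    w = S_W_inv_sqrt @ V[:, -1]                     # largest eigenvalue (see Eq. 13)
    return w / np.linalg.norm(w)
```

In practice `scipy.linalg.eigh(S_B, S_W)` solves the generalized problem of Eq. (11) directly; the explicit whitening above just mirrors the derivation in the text.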
It remains to choose which eigenvalue and eigenvector correspond to the desired solution. Plugging the solution back into the objective J, we find,

    J(w) = \frac{w_k^T S_B w_k}{w_k^T S_W w_k} = \lambda_k \frac{w_k^T S_W w_k}{w_k^T S_W w_k} = \lambda_k    (13)

from which it immediately follows that we want the largest eigenvalue to maximize the objective.[1]

2 Kernel Fisher LDA

So how do we kernelize this problem? Unlike with SVMs, the dual problem does not seem to reveal the kernelized problem naturally. But inspired by the SVM case we make the following key assumption,

    w = \sum_i \alpha_i \Phi(x_i)    (14)

This is a central recurrent equation that keeps popping up in every kernel machine. It says that although the feature space is very high (or even infinite) dimensional, with a finite number of data-cases the final solution, w, will not have a component outside the space spanned by the data-cases. It would not make much sense to do this transformation if the number of data-cases were larger than the number of dimensions, but this is typically not the case for kernel-methods. So, we argue that although there are possibly infinitely many dimensions available a priori, at most N are being occupied by the data, and the solution w must lie in their span. This is a case of the representer theorem, which intuitively reasons as follows. The solution w is the solution to some eigenvalue equation, S_W^{-1} S_B w = \lambda w, where both S_B and S_W (and hence its inverse) lie in the span of the data-cases. Hence, the part of w that is perpendicular to this span will be projected to zero, and the equation above puts no constraints on those dimensions. They can be arbitrary and have no impact on the solution. If we now assume a very general form of regularization on the norm of w, then these orthogonal components will be set to zero in the final solution: w_\perp = 0. In terms of \alpha the objective J(\alpha) becomes,

    J(\alpha) = \frac{\alpha^T S_B^\Phi \alpha}{\alpha^T S_W^\Phi \alpha}    (15)

where it is understood that vector notation now applies to a different space, namely the space spanned by the data-vectors, R^N.
The scatter matrices in kernel space can be expressed in terms of the kernel only as follows (this requires some algebra to verify),

    S_B^\Phi = \sum_c N_c [\kappa_c \kappa_c^T - \bar{\kappa} \bar{\kappa}^T]    (16)

    S_W^\Phi = K^2 - \sum_c N_c \kappa_c \kappa_c^T    (17)

    \kappa_c = \frac{1}{N_c} \sum_{j \in c} K_j    (18)

    \bar{\kappa} = \frac{1}{N} \sum_j K_j    (19)

where K_j denotes the j-th column of the kernel matrix K.

---
[1] If you try to find the dual and maximize that, you'll get the wrong sign, it seems. My best guess of what goes wrong is that the constraint is not linear; as a result the problem is not convex, and hence we cannot expect the optimal dual solution to be the same as the optimal primal solution.
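Eqs. (16)-(19) can be sketched directly from the kernel matrix; for a linear kernel K = X X^T, the quadratic forms \alpha^T S_B^\Phi \alpha and \alpha^T S_W^\Phi \alpha must then agree with w^T S_B w and w^T S_W w for w = X^T \alpha, which is a useful check of the algebra (the helper name is my own):

```python
import numpy as np

def kernel_scatter(K, y):
    """Kernel-space scatter matrices of Eqs. (16)-(17), built from the
    N x N kernel matrix K and class labels y only."""
    N = K.shape[0]
    kappa_bar = K.mean(axis=1)                  # Eq. (19): (1/N) sum_j K_j
    S_B = np.zeros((N, N))
    S_W = K @ K                                 # the K^2 term of Eq. (17)
    for c in np.unique(y):
        N_c = np.sum(y == c)
        kappa_c = K[:, y == c].mean(axis=1)     # Eq. (18)
        S_B += N_c * (np.outer(kappa_c, kappa_c) - np.outer(kappa_bar, kappa_bar))
        S_W -= N_c * np.outer(kappa_c, kappa_c)
    return S_B, S_W
```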
So, we have managed to express the problem in terms of kernels only, which is what we were after. Note that since the objective in terms of \alpha has exactly the same form as that in terms of w, we can solve it by solving the generalized eigenvalue equation. This scales as N^3, which is certainly expensive for many datasets. More efficient optimization schemes, solving a slightly different problem and based on efficient quadratic programs, exist in the literature.

Projections of new test-points onto the solution direction can be computed by,

    w^T \Phi(x) = \sum_i \alpha_i K(x_i, x)    (20)

as usual. In order to classify a test point we still need to divide the space into regions which belong to one class. The easiest possibility is to pick the class with smallest Mahalanobis distance, d(x, \mu_c^\Phi) = (x_\alpha - \mu_c^\alpha)^2 / (\sigma_c^\alpha)^2, where \mu_c^\alpha and \sigma_c^\alpha represent the class mean and standard deviation in the 1-d projected space, respectively, and x_\alpha is the projection of x. Alternatively, one could train any classifier in the 1-d subspace.

One very important issue that we did not pay attention to is regularization. Clearly, as it stands the kernel machine will overfit. To regularize, we can add a term to the denominator,

    S_W -> S_W + \beta I    (21)

Adding a diagonal term to this matrix makes sure that very small eigenvalues are bounded away from zero, which improves numerical stability when computing the inverse. If we write the Lagrangian formulation, where we maximize a constrained quadratic form in \alpha, the extra term appears as a penalty proportional to ||\alpha||^2, which acts as a weight decay term, favoring smaller values of \alpha over larger ones. Fortunately, the optimization problem has exactly the same form in the regularized case.

3 A Constrained Convex Programming Formulation of FDA

We will now give a simplified derivation of an equivalent mathematical program derived by Mika and co-workers. We first represent the problem in yet another form as,

    \min_w  \frac{1}{2} w^T S_W w    (22)

    s.t.  w^T S_B w = c    (23)

where we have switched the roles of the within and between scatter (and replaced a minus sign with a plus sign).
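Putting the pieces together, a self-contained sketch of regularized kernel FDA, Eqs. (15), (20) and (21), reconstructing the scatters inline (function names and the default \beta are illustrative choices):

```python
import numpy as np
from scipy.linalg import eigh

def kernel_fda(K, y, beta=1e-3):
    """Regularized kernel FDA: build S_B^Phi and S_W^Phi (Eqs. 16-17),
    add beta*I to S_W^Phi as in Eq. (21), and return the alpha that
    maximizes Eq. (15), i.e. the leading generalized eigenvector."""
    N = K.shape[0]
    kappa_bar = K.mean(axis=1)                   # Eq. (19)
    S_B = np.zeros((N, N))
    S_W = K @ K                                  # K^2 term of Eq. (17)
    for c in np.unique(y):
        N_c = np.sum(y == c)
        kappa_c = K[:, y == c].mean(axis=1)      # Eq. (18)
        S_B += N_c * (np.outer(kappa_c, kappa_c) - np.outer(kappa_bar, kappa_bar))
        S_W -= N_c * np.outer(kappa_c, kappa_c)
    S_W += beta * np.eye(N)                      # Eq. (21): regularization
    _, vecs = eigh(S_B, S_W)                     # generalized eigenproblem, ascending
    return vecs[:, -1]                           # largest eigenvalue maximizes J

def project(alpha, K_new):
    """Eq. (20): w^T Phi(x) = sum_i alpha_i K(x_i, x).  K_new holds
    K(x_i, x) with training points i as rows and new points x as columns."""
    return alpha @ K_new
```

On the 1-d projections returned by `project` one can then apply the Mahalanobis rule from the text, or any other 1-d classifier.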
Now we note that by shifting the coordinates, x -> x + a, we can always arrange for the overall mean of the data to be wherever we like; the solution for w does not depend on it. We also recall that for two classes the constraint on S_B can be equivalently written as,

    w^T S_B w = c  <==>  ||\mu_1^w - \mu_2^w||^2 = g    (24)

where \mu_c^w is the mean of class c in the projected space. Since both g and \bar{x} are at our disposal, we can equivalently pick \mu_1^w and \mu_2^w and let c and \bar{x} be determined by that choice. We choose \mu_1^w = 1 and \mu_2^w = -1, or \mu_c^w = y_c, for convenience. The objective can then be expressed as,

    w^T S_W w = \sum_{i: y_i = +1} (w^T x_i - \mu_1^w)^2 + \sum_{i: y_i = -1} (w^T x_i - \mu_2^w)^2    (25)

We can replace \mu_c^w = y_c in the above expression if we explicitly add this constraint. Defining \xi_i = w^T x_i - y_i we find,

    w^T S_W w = \sum_{i: y_i = +1} (\xi_i)^2 + \sum_{i: y_i = -1} (\xi_i)^2 = ||\xi||^2    (26)
by the definition of \xi. To express the constraints \mu_c^w = y_c, c = 1, 2, we note that,

    \sum_{i: y_i = +1} \xi_i = \sum_{i: y_i = +1} (w^T x_i - 1) = N_1 (\mu_1^w - 1)    (27)

Hence, by constraining \sum_{i \in c} \xi_i = 0 we enforce the constraint. So finally, the program is,

    \min_{w, \xi}  \frac{1}{2} ||\xi||^2    (28)

    s.t.  \xi_i = w^T x_i - y_i    (29)

          \sum_{i \in c} \xi_i = 0,  c = 1, 2    (30)

To move to kernel space you simply replace w^T x_i -> \sum_j \alpha_j K(x_j, x_i) in the definition of \xi_i, and you add a regularization term on \alpha to the objective. This is typically of the form ||\alpha||^2 or \alpha^T K \alpha.

This exercise reveals two important things. Firstly, the end result looks a lot like the programs for the SVM and SVR cases; in some sense we are "regressing on the labels". Secondly, we can change the norms on \xi and \alpha from L_2 to L_1. Changing the norm on \alpha will have the effect of making the solution sparse in \alpha.
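The program (28)-(30) is an equality-constrained least-squares problem and can be solved by writing down its KKT system. The sketch below is my own rendering of it: I add an explicit bias b (so \xi_i = w^T x_i + b - y_i) to play the role of the coordinate shift x -> x + a described above, and all names are illustrative. At the optimum, w should come out parallel to the classical Fisher direction S_W^{-1}(\mu_1 - \mu_2):

```python
import numpy as np

def fda_least_squares(X, y):
    """Solve min (1/2)||xi||^2 with xi_i = w^T x_i + b - y_i (Eqs. 28-29)
    subject to zero per-class residual sums (Eq. 30), via the KKT system
    in (w, b, nu_1, nu_2).  Labels y are expected to be +1 / -1."""
    N, d = X.shape
    ones = np.ones(N)
    classes = np.unique(y)
    assert len(classes) == 2
    A = np.zeros((d + 3, d + 3))
    rhs = np.zeros(d + 3)
    # stationarity w.r.t. w and b (unconstrained least-squares part)
    A[:d, :d] = X.T @ X
    A[:d, d] = X.T @ ones
    A[d, :d] = ones @ X
    A[d, d] = N
    rhs[:d] = X.T @ y
    rhs[d] = ones @ y
    # one constraint row/column per class: sum_{i in c} xi_i = 0
    for k, c in enumerate(classes):
        mask = (y == c)
        N_c = mask.sum()
        mu_c = X[mask].mean(axis=0)
        A[:d, d + 1 + k] = N_c * mu_c          # constraint gradient w.r.t. w
        A[d, d + 1 + k] = N_c                  # constraint gradient w.r.t. b
        A[d + 1 + k, :d] = N_c * mu_c
        A[d + 1 + k, d] = N_c
        rhs[d + 1 + k] = N_c * c               # class target y_c
    sol = np.linalg.solve(A, rhs)
    return sol[:d], sol[d]                     # (w, b)
```

This makes the "regressing on the labels" reading concrete: the solver is ordinary least squares on the labels, with the per-class residual constraints pinning the projected class means to ±1.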