CS281B/Stat241B: Advanced Topics in Learning & Decision Making

Maximal Margin Classifier

Lecturer: Michael Jordan    Scribes: Jana van Greunen

Corrected version - 2/1/2004

1 References/Recommended Reading

1.1 Websites

www.kernel-machines.org

1.2 Books

Learning with Kernels - MIT Press

Shawe-Taylor and Cristianini (2nd edition)

2 Separating Binary Data with a Hyper-plane

2.1 Observations from Lecture 1

From the discussion of the four classifiers in lecture 1, it can be seen that all methods use the inner products between the data vectors, $x_i^T x_j$, instead of the data vectors themselves. This fact is obvious for the Perceptron classifier. In lecture 1 we saw that $\theta$ at each step is a weighted sum of the $x_i$'s. The predictions are determined by taking the inner product $\theta^T x$; this inner product expands into a sum of inner products between the data vectors. The use of inner products instead of the data vectors themselves will yield great computational savings and allow kernelization (see lecture 3 for more details).

2.2 Maximal Margin (cont.)

In this lecture we will consider problems for which the input data does not overlap; in other words, there exists a hyper-plane which separates the data into the two sets corresponding to $y = +1$ and $y = -1$.

Note: In this problem it will be very useful to consider the Lagrangian dual problem. The Lagrangian dual turns out to be an unconstrained quadratic problem that is simpler to solve. The Lagrangian approach also gives insight into the structure of the solution (i.e. the solution is a weighted linear combination of the data elements).

First, consider a vector $w$ that is perpendicular to the separating hyper-plane. Define $\theta = (w, b)^T$, where $b$ is a scalar value.
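The observation in 2.1, that the perceptron touches the data only through inner products, can be made concrete with a small sketch (an illustration added here, not part of the original notes): a "dual-form" perceptron that never forms the weight vector explicitly and instead works entirely off the Gram matrix $K_{ij} = x_i^T x_j$.

```python
import numpy as np

def dual_perceptron(X, y, epochs=20):
    """Perceptron kept in dual form: the weight vector is never built
    explicitly; predictions use only the Gram matrix K[i, j] = x_i . x_j."""
    n = len(y)
    K = X @ X.T                      # all pairwise inner products
    alpha = np.zeros(n)              # alpha[i] counts mistakes made on x_i
    b = 0.0
    for _ in range(epochs):
        for i in range(n):
            # theta^T x_i expands into a weighted sum of inner products
            pred = np.sign(np.sum(alpha * y * K[:, i]) + b)
            if pred != y[i]:
                alpha[i] += 1.0
                b += y[i]
    return alpha, b

# toy linearly separable data, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, b = dual_perceptron(X, y)
preds = np.sign((X @ X.T) @ (alpha * y) + b)
```

Replacing `X @ X.T` by any kernel matrix is exactly the kernelization step previewed above for lecture 3.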
The primal problem can be formulated in the following way:

minimize $\frac{1}{2} w^T w$ subject to $y_i(w^T x_i + b) \geq 1$

The optimization problem is the minimization of the norm of $w$. It is a quadratic program: the constraints are linear while the objective function is quadratic. The fact that minimizing the norm of $w$ solves the problem may be surprising at first, but can be shown with the following algebra.

Pick $z_1$ and $z_{-1}$ to be vectors lying on the boundaries defining the margins (see Figure 1 for a depiction of $z_1$ and $z_{-1}$). (Note: for the maximal margin problem, all data points that do not lie on either boundary can be discarded, because they do not contribute to the margins.) It is easy to see that the separating hyper-plane needs to lie at mid-distance between $z_1$ and $z_{-1}$ (otherwise you could move the plane towards the midpoint to increase the minimal margin). And so the minimal margin will simply be half the perpendicular distance between $z_1$ and $z_{-1}$, that is, half the vector between $z_1$ and $z_{-1}$ projected onto the unit normal to the plane:

margin $= \frac{1}{2} \frac{w^T}{\|w\|}(z_1 - z_{-1})$    (1)

To find the value of this expression, we do the following manipulations. Now:

$w^T z_1 + b = 1$

And:

$w^T z_{-1} + b = -1$

Subtracting:

$(w^T z_1 + b) - (w^T z_{-1} + b) = w^T(z_1 - z_{-1}) = 2$

and so, comparing with (1), we get the expression for the margin:

margin $= \frac{1}{\|w\|}$

Thus, minimizing $\|w\|$ is equivalent to maximizing the minimal margin $1/\|w\|$.

2.3 Using a Lagrangian

A general convex optimization problem has the following form:

minimize $f(x)$ s.t. $g(x) \leq 0$

Now, define a Lagrangian:

$L(x, \lambda) = f(x) + \lambda^T g(x)$

Claim: the original problem can be written as follows:

$\min_x \max_\lambda L(x, \lambda)$ s.t. $\lambda \geq 0$
Figure 1: Plot of data showing the perpendicular vector $w$ and the vectors $z_1$ and $z_{-1}$.

This can be seen by taking the inner term:

$\max_{\lambda \geq 0} L(x, \lambda) = \begin{cases} f(x) & g(x) \leq 0 \\ \infty & g(x) > 0 \end{cases}$

When we are in the non-feasible region the solution is $\infty$, which will never be picked by the outer minimization. When we are in the feasible region, the objective function $f(x)$ is minimized.

2.4 Leap of Intuition (Dual Maximal Margin)

Instead of writing the problem in the following way (primal):

$\min_x \max_{\lambda \geq 0} L(x, \lambda)$

we swap the two operators and solve $\max_{\lambda \geq 0} \min_x L(x, \lambda)$. In general, the two problems do not have the same solution. However, if certain constraint qualifications hold, the solutions are the same. One example of a constraint qualification is Slater's qualification, which requires that the problem be strictly feasible. Now, why does swapping make sense?
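As a one-dimensional sanity check of the swap (a worked example added here, not from the lecture), take: minimize $f(x) = x^2$ subject to $x \geq 1$, i.e. $g(x) = 1 - x \leq 0$. The primal optimum is $x^* = 1$ with value 1. The problem is convex and strictly feasible, so Slater's qualification holds, and maximizing the dual function $\theta(\lambda) = \min_x L(x, \lambda)$ recovers the same value:

```python
import numpy as np

# Primal: minimize f(x) = x^2  subject to  g(x) = 1 - x <= 0.
f = lambda x: x**2
g = lambda x: 1.0 - x

def dual(lam):
    """theta(lam) = min_x L(x, lam), with L = x^2 + lam * (1 - x).
    Setting dL/dx = 2x - lam = 0 gives the minimizer x*(lam) = lam / 2."""
    x_star = lam / 2.0
    return f(x_star) + lam * g(x_star)   # = lam - lam^2 / 4

# maximize theta(lam) over a grid of lam >= 0
lams = np.linspace(0.0, 5.0, 5001)
thetas = np.array([dual(l) for l in lams])
lam_star = lams[np.argmax(thetas)]
dual_value = thetas.max()

primal_value = f(1.0)   # the constraint is active at the optimum x* = 1
```

Here the maximizing multiplier is $\lambda^* = 2$ and the dual value equals the primal value, so swapping the two operators loses nothing.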
Consider a plot of the image of the domain of $x$ under the mapping $(g(x), f(x))$ (see Figure 2). The optimal primal solution lies on the ordinate, on the lower boundary of the image of this mapping. In the dual problem, the Lagrangian $f(x) + \lambda g(x)$ is minimized over $x$. On the graph this is the y-intercept of the line with slope $-\lambda$ passing through the point $(g(x), f(x))$. The minimization finds the smallest such y-intercept, ranging over all $x$; this corresponds to the dual function. The subsequent maximization of the dual function takes the maximum of such y-intercepts over $\lambda \geq 0$. This yields the same point as the primal solution.

Figure 2: Dual and primal solutions for convex data.

In general, the problem can be solved in the following way: start with a $\lambda$, then solve to get a lower bound; adjust $\lambda$ and solve again. Or the problem can be solved by choosing an $x$ as a starting point for the primal problem and a $\lambda$ as a starting point for the dual problem, and then closing the gap between the two solutions. These are called primal-dual algorithms. When the set is not convex there exists a duality gap. This is demonstrated in Figure 3.

2.5 Solving for $w$ and $b$

We can write the Lagrangian as follows:

$L(w, b, \alpha) = \frac{1}{2} w^T w + \sum_{i=1}^n \alpha_i (1 - y_i(w^T x_i + b))$

We first take the derivative with respect to $w$ and set it to zero:

$\frac{\partial L}{\partial w} = 0$
Figure 3: Duality gap for non-convex data.

$\Rightarrow w = \sum_i \alpha_i y_i x_i$

Substituting $w$ back into the Lagrangian we get:

$\sum_i \alpha_i - b \sum_i \alpha_i y_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$

Thus, for $\alpha$ such that $\sum_i \alpha_i y_i = 0$ we get:

$\theta(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$

On the other hand, for $\alpha$ such that $\sum_i \alpha_i y_i \neq 0$, we get $\theta(\alpha) = -\infty$ (by taking $b$ to infinity). When maximizing the dual function, it is clear that points $\alpha$ with $\sum_i \alpha_i y_i \neq 0$ cannot be maxima. Thus we have uncovered an implicit constraint in the problem, namely that $\sum_i \alpha_i y_i = 0$, and the dual problem reduces to maximizing:

$\theta(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$

subject to $\alpha_i \geq 0$ and $\sum_i \alpha_i y_i = 0$.
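To see the dual in action, take the smallest possible dataset: one point per class (a hand-worked illustration added here, not part of the notes). With $y_1 = +1$ and $y_2 = -1$, the constraint $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2 = \alpha$, so $\theta(\alpha) = 2\alpha - \frac{1}{2}\alpha^2 \|x_1 - x_2\|^2$, which is maximized at $\alpha = 2/\|x_1 - x_2\|^2$. The sketch below recovers $w = \sum_i \alpha_i y_i x_i$, sets $b$ from the active constraint $w^T x_1 + b = 1$, and checks that the margin $1/\|w\|$ equals half the distance between the two points:

```python
import numpy as np

# one point per class: y1 = +1, y2 = -1
x1 = np.array([2.0, 2.0])
x2 = np.array([0.0, 0.0])

# sum_i alpha_i y_i = 0 forces alpha1 = alpha2 = alpha, so
# theta(alpha) = 2*alpha - 0.5 * alpha^2 * ||x1 - x2||^2, maximized at:
alpha = 2.0 / np.dot(x1 - x2, x1 - x2)

# w = sum_i alpha_i y_i x_i; b from the active constraint w^T x1 + b = 1
w = alpha * (x1 - x2)
b = 1.0 - w @ x1

margin = 1.0 / np.linalg.norm(w)
half_gap = np.linalg.norm(x1 - x2) / 2.0
```

Both margin constraints come out tight ($w^T x_1 + b = 1$ and $w^T x_2 + b = -1$), and the two margin expressions agree, as the derivation in 2.2 predicts.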