Margin Maximizing Loss Functions


Saharon Rosset, Watson Research Center, IBM, Yorktown, NY
Ji Zhu, Department of Statistics, University of Michigan, Ann Arbor, MI
Trevor Hastie, Department of Statistics, Stanford University, Stanford, CA

Abstract

Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classification problems, and present margin maximizing multi-class versions of logistic regression and support vector machines.

1 Introduction

Assume we have a classification learning sample {x_i, y_i}_{i=1}^n with y_i ∈ {-1, +1}. We wish to build a model F(x) for this data by minimizing (exactly or approximately) a loss criterion Σ_i C(y_i, F(x_i)) = Σ_i C(y_i F(x_i)) which is a function of the margins y_i F(x_i) of this model on this data. Most common classification modeling approaches can be cast in this framework: logistic regression, support vector machines, boosting and more. The model F(x) which these methods actually build is a linear combination of dictionary functions coming from a dictionary H which can be large or even infinite:

F(x) = Σ_{h_j ∈ H} β_j h_j(x)

and our prediction at point x based on this model is sign F(x). When H is large, as is the case in most boosting or kernel SVM applications, some regularization is needed to control the complexity of the model F(x) and the resulting over-fitting. Thus, it is common that the quantity actually minimized on the data is a regularized version of the loss function:

(1)    β̂(λ) = arg min_β Σ_i C(y_i β'h(x_i)) + λ‖β‖_p^p

where the second term penalizes the l_p norm of the coefficient vector β (p ≥ 1 for convexity, and in practice usually p ∈ {1, 2}), and λ ≥ 0 is a tuning regularization parameter. The 1- and 2-norm support vector machine training problems with slack can be cast in this form ([6], chapter 12). In [8] we have shown that boosting approximately follows the path of regularized solutions traced by (1) as the regularization parameter λ varies, with the appropriate loss and an l_1 penalty.
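To make (1) concrete, here is a minimal sketch (not from the paper) of evaluating the regularized objective for a generic margin-based loss. It assumes NumPy is available and represents the dictionary by a fixed feature matrix H whose i'th row is h(x_i); all names are illustrative.

import numpy as np

def regularized_objective(beta, H, y, loss, lam, p=1):
    # sum_i C(y_i * beta'h(x_i))  +  lam * ||beta||_p^p,  as in (1)
    margins = y * (H @ beta)
    return np.sum(loss(margins)) + lam * np.sum(np.abs(beta) ** p)

# Example call with the exponential loss C(m) = exp(-m) and an l1 penalty (p = 1)
H = np.array([[ 1.0,  2.0],
              [ 2.0,  1.0],
              [-1.0, -1.5]])
y = np.array([1, 1, -1])
beta = np.array([0.5, 0.5])
print(regularized_objective(beta, H, y, lambda m: np.exp(-m), lam=0.1, p=1))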

The main question that we answer in this paper is: for what loss functions does the normalized solution β̂(λ)/‖β̂(λ)‖_p converge to an optimal separator as λ → 0? The definition of optimal which we will use depends on the l_p norm used for regularization, and we will term it the l_p-margin maximizing separating hyper-plane. More concisely, we will investigate for which loss functions and under which conditions we have:

(2)    lim_{λ→0} β̂(λ)/‖β̂(λ)‖_p = arg max_{‖β‖_p=1} min_i y_i β'h(x_i)

This margin maximizing property is interesting for three distinct reasons. First, it gives us a geometric interpretation of the limiting model as we relax the regularization. It tells us that this loss seeks to optimally separate the data by maximizing a distance between a separating hyper-plane and the closest points. A theorem by Mangasarian [7] allows us to interpret l_p margin maximization as l_q distance maximization, with 1/p + 1/q = 1, and hence make a clear geometric interpretation. Second, from a learning theory perspective large margins are an important quantity: generalization error bounds that depend on the margins have been generated for support vector machines ([10] using l_2 margins) and boosting ([9] using l_1 margins). Thus, showing that a loss function is margin maximizing in this sense is useful and promising information regarding this loss function's potential for generating good prediction models. Third, practical experience shows that exact or approximate margin maximization (such as non-regularized kernel SVM solutions, or infinite boosting) may actually lead to good classification prediction models. This is certainly not always the case, and we return to this hotly debated issue in our discussion.

Our main result is a sufficient condition on the loss function, which guarantees that (2) holds, if the data is separable, i.e. if the maximum on the RHS of (2) is positive. This condition is presented and proven in section 2. It covers the hinge loss of support vector machines, the logistic log-likelihood loss of logistic regression, and the exponential loss, most notably used in boosting. We discuss these and other examples in section 3. Our result generalizes elegantly to multi-class models and loss functions. We present the resulting margin-maximizing versions of SVMs and logistic regression in section 4.

2 Sufficient condition for margin maximization

The following theorem shows that if the loss function vanishes quickly enough, then it will be margin-maximizing as the regularization vanishes. It provides us with a unified margin-maximization theory, covering SVMs, logistic regression and boosting.

Theorem 2.1  Assume the data {x_i, y_i}_{i=1}^n is separable, i.e. ∃β s.t. min_i y_i β'h(x_i) > 0. Let C(y, f) = C(yf) be a monotone non-increasing loss function depending on the margin only. If ∃ T > 0 (possibly T = ∞) such that:

(3)    lim_{t→T} C(t[1-ε]) / C(t) = ∞,  ∀ε > 0

then C is a margin maximizing loss function in the sense that any convergence point of the normalized solutions β̂(λ)/‖β̂(λ)‖_p to the regularized problems (1) as λ → 0 is an l_p margin-maximizing separating hyper-plane. Consequently, if this margin-maximizing hyper-plane is unique, then the solutions converge to it:

(4)    lim_{λ→0} β̂(λ)/‖β̂(λ)‖_p = arg max_{‖β‖_p=1} min_i y_i β'h(x_i)
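As an informal numerical illustration of condition (3) (not from the paper), the snippet below tracks the ratio C(t[1-ε])/C(t) as t grows: for the exponential loss it equals e^{tε} and diverges, while for a loss that vanishes only polynomially, such as C(m) = 1/m (used here purely for contrast), it stays bounded at 1/(1-ε).

import numpy as np

def condition_ratio(loss, t, eps):
    # The quantity whose divergence as t -> T is required by condition (3)
    return loss(t * (1 - eps)) / loss(t)

exp_loss = lambda m: np.exp(-m)     # T = infinity: ratio = e^(t*eps) -> infinity
slow_loss = lambda m: 1.0 / m       # vanishes too slowly: ratio -> 1/(1-eps)

eps = 0.1
for t in [1.0, 10.0, 100.0, 1000.0]:
    print(f"t={t:7.1f}   exp: {condition_ratio(exp_loss, t, eps):.3e}   1/m: {condition_ratio(slow_loss, t, eps):.4f}")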

Proof  We prove the result separately for T = ∞ and T < ∞.

a. T = ∞:

Lemma 2.2  lim_{λ→0} ‖β̂(λ)‖_p = ∞.

Proof  Since T = ∞ then C(m) > 0 ∀m > 0, and lim_{m→∞} C(m) = 0. Therefore, for loss+penalty to vanish as λ → 0, ‖β̂(λ)‖_p must diverge, to allow the margins to diverge.

Lemma 2.3  Assume β_1, β_2 are two separating models, with ‖β_1‖_p = ‖β_2‖_p = 1, and β_1 separates the data better, i.e.: 0 < m_2 = min_i y_i h(x_i)'β_2 < m_1 = min_i y_i h(x_i)'β_1. Then ∃ U = U(m_1, m_2) such that ∀t > U,

Σ_i C(y_i h(x_i)'(tβ_1)) < Σ_i C(y_i h(x_i)'(tβ_2))

In words, if β_1 separates better than β_2 then scaled-up versions of β_1 will incur smaller loss than scaled-up versions of β_2, if the scaling factor is large enough.

Proof  Since condition (3) holds with T = ∞, there exists U such that ∀t > U, C(tm_2)/C(tm_1) > n. Thus from C being non-increasing we immediately get:

∀t > U,  Σ_i C(y_i h(x_i)'(tβ_1)) ≤ n C(tm_1) < C(tm_2) ≤ Σ_i C(y_i h(x_i)'(tβ_2))

Proof of case a.:  Assume β* is a convergence point of β̂(λ)/‖β̂(λ)‖_p as λ → 0, with ‖β*‖_p = 1. Now assume by contradiction that β̃ has ‖β̃‖_p = 1 and bigger minimal l_p margin. Denote the minimal margins for the two models by m* and m̃, respectively, with m* < m̃. By continuity of the minimal margin in β, there exists some open neighborhood of β* on the l_p sphere:

N_{β*} = {β : ‖β‖_p = 1, ‖β - β*‖ < δ}

and an ε > 0, such that: min_i y_i β'h(x_i) < m̃ - ε, ∀β ∈ N_{β*}. Now by lemma 2.3 we get that there exists U = U(m̃, m̃ - ε) such that tβ̃ incurs smaller loss than tβ for any t > U, β ∈ N_{β*}. Therefore β* cannot be a convergence point of β̂(λ)/‖β̂(λ)‖_p.
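A small numerical illustration of Lemma 2.3 (illustrative only; the margin values below are made up): for the exponential loss, the direction with the larger minimal margin eventually incurs the smaller total loss once both are scaled up, even if it loses at small scales.

import numpy as np

# Margins of two normalized separators on the same four points:
# beta_1 has the larger minimal margin (0.5), beta_2 the smaller one (0.3)
# but much larger margins on the remaining points.
margins_1 = np.array([0.5, 0.6, 0.7, 0.8])
margins_2 = np.array([0.3, 2.0, 2.0, 2.0])

for t in [1, 5, 20, 50]:
    loss_1 = np.exp(-t * margins_1).sum()
    loss_2 = np.exp(-t * margins_2).sum()
    print(f"t={t:3d}   loss(t*beta_1)={loss_1:.3e}   loss(t*beta_2)={loss_2:.3e}")

# At t = 1 the worse-separating beta_2 still has the smaller total loss;
# by t = 5 the larger minimal margin of beta_1 takes over, as the lemma guarantees.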

b. T < ∞:

Lemma 2.4  C(T) = 0 and C(T - δ) > 0, ∀δ > 0.

Proof  From condition (3), C(T - Tε)/C(T) = ∞. Both results follow immediately, with δ = Tε.

Lemma 2.5  lim_{λ→0} min_i y_i β̂(λ)'h(x_i) = T.

Proof  Assume by contradiction that there is a sequence λ_1, λ_2, ... → 0 and ε > 0 s.t. ∀j, min_i y_i β̂(λ_j)'h(x_i) ≤ T - ε. Pick any separating normalized model β̃, i.e. ‖β̃‖_p = 1 and m̃ := min_i y_i β̃'h(x_i) > 0. Then for any λ < (m̃/T)^p C(T - ε) we get:

Σ_i C(y_i (T/m̃) β̃'h(x_i)) + λ‖(T/m̃)β̃‖_p^p < C(T - ε)

since the first term (loss) is 0 and the penalty is smaller than C(T - ε) by the condition on λ. But ∃ j_0 s.t. λ_{j_0} < (m̃/T)^p C(T - ε) and so we get a contradiction to optimality of β̂(λ_{j_0}), since we assumed min_i y_i β̂(λ_{j_0})'h(x_i) ≤ T - ε and thus:

Σ_i C(y_i β̂(λ_{j_0})'h(x_i)) ≥ C(T - ε)

We have thus proven that lim inf_{λ→0} min_i y_i β̂(λ)'h(x_i) ≥ T. It remains to prove equality. Assume by contradiction that for some value of λ we have m := min_i y_i β̂(λ)'h(x_i) > T. Then the re-scaled model (T/m)β̂(λ) has the same zero loss as β̂(λ), but a smaller penalty, since ‖(T/m)β̂(λ)‖_p = (T/m)‖β̂(λ)‖_p < ‖β̂(λ)‖_p. So we get a contradiction to optimality of β̂(λ).

Proof of case b.:  Assume β* is a convergence point of β̂(λ)/‖β̂(λ)‖_p as λ → 0, with ‖β*‖_p = 1. Now assume by contradiction that β̃ has ‖β̃‖_p = 1 and bigger minimal margin. Denote the minimal margins for the two models by m* and m̃, respectively, with m* < m̃. Let λ_1, λ_2, ... → 0 be a sequence along which β̂(λ_j)/‖β̂(λ_j)‖_p → β*. By lemma 2.5 and our assumption, ‖β̂(λ_j)‖_p → T/m* > T/m̃. Thus, ∃ j_0 such that ∀j > j_0, ‖β̂(λ_j)‖_p > T/m̃, and consequently:

Σ_i C(y_i β̂(λ_j)'h(x_i)) + λ_j ‖β̂(λ_j)‖_p^p > λ_j (T/m̃)^p = Σ_i C(y_i (T/m̃) β̃'h(x_i)) + λ_j ‖(T/m̃)β̃‖_p^p

So we get a contradiction to optimality of β̂(λ_j).

Thus we conclude for both cases a. and b. that any convergence point of β̂(λ)/‖β̂(λ)‖_p must maximize the l_p margin. Since ‖β̂(λ)/‖β̂(λ)‖_p‖_p = 1, such convergence points obviously exist. If the l_p-margin-maximizing separating hyper-plane is unique, then we can conclude:

lim_{λ→0} β̂(λ)/‖β̂(λ)‖_p = β̂ := arg max_{‖β‖_p=1} min_i y_i β'h(x_i)

Necessity results

A necessity result for margin maximization on any separable data seems to require either additional assumptions on the loss or a relaxation of condition (3). We conjecture that if we also require that the loss is convex and vanishing (i.e. lim_{m→∞} C(m) = 0) then condition (3) is sufficient and necessary. However this is still a subject for future research.

3 Examples

Support vector machines

Support vector machines (linear or kernel) can be described as a regularized problem:

(5)    min_β Σ_i [1 - y_i β'h(x_i)]_+ + λ‖β‖_p^p

where p = 2 for the standard (2-norm) SVM and p = 1 for the 1-norm SVM. This formulation is equivalent to the better known norm minimization SVM formulation in the sense that they have the same set of solutions as the regularization parameter λ varies in (5) or the slack bound varies in the norm minimization formulation.
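A minimal sketch (assuming NumPy; names are illustrative, not the paper's) of the objective in (5), together with a quick check that the ratio in condition (3) blows up at T = 1 for the hinge loss.

import numpy as np

def svm_objective(beta, H, y, lam, p=2):
    # sum_i [1 - y_i * beta'h(x_i)]_+  +  lam * ||beta||_p^p,  as in (5)
    margins = y * (H @ beta)
    return np.sum(np.maximum(0.0, 1.0 - margins)) + lam * np.sum(np.abs(beta) ** p)

# Condition (3) for the hinge loss holds at T = 1: C(t[1-eps])/C(t) diverges as t -> 1
hinge = lambda m: np.maximum(0.0, 1.0 - m)
eps = 0.1
for t in [0.9, 0.99, 0.999]:
    print(f"t={t}:  ratio = {hinge(t * (1 - eps)) / hinge(t):.1f}")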

The loss in (5) is termed hinge loss since it is linear for margins less than 1, then fixed at 0 (see figure 1). The theorem obviously holds for T = 1, and it verifies our knowledge that the non-regularized SVM solution, which is the limit of the regularized solutions, maximizes the appropriate margin (Euclidean for standard SVM, l_1 for 1-norm SVM). Note that our theorem indicates that the squared hinge loss (AKA truncated squared loss) C(y_i, F(x_i)) = ([1 - y_i F(x_i)]_+)^2 is also a margin-maximizing loss.

Logistic regression and boosting

The two loss functions we consider in this context are:

(6)    Exponential:     C_e(m) = exp(-m)
(7)    Log likelihood:  C_l(m) = log(1 + exp(-m))

These two loss functions are of great interest in the context of two-class classification: C_l is used in logistic regression and more recently for boosting [4], while C_e is the implicit loss function used by AdaBoost - the original and most famous boosting algorithm [3]. In [8] we showed that boosting approximately follows the regularized path of solutions using these loss functions and l_1 regularization. We also proved that the two loss functions are very similar for positive margins, and that their regularized solutions converge to margin-maximizing separators. Theorem 2.1 provides a new proof of this result, since the theorem's condition holds with T = ∞ for both loss functions.

Some interesting non-examples

Commonly used classification loss functions which are not margin-maximizing include polynomial loss functions, such as C(m) = 1 - m or C(m) = -m; these do not guarantee convergence of regularized solutions to margin maximizing solutions. Another interesting method in this context is linear discriminant analysis. Although it does not correspond to the loss+penalty formulation we have described, it does find a decision hyper-plane in the predictor space. For both polynomial loss functions and linear discriminant analysis it is easy to find examples which show that they are not necessarily margin maximizing on separable data.

4 A multi-class generalization

Our main result can be elegantly extended to versions of multi-class logistic regression and support vector machines, as follows. Assume the response is now multi-class, with K possible values, i.e. y_i ∈ {c_1, ..., c_K}. Our model consists of a prediction for each class:

F_k(x) = Σ_{h_j ∈ H} β_j^(k) h_j(x)

with the obvious prediction rule at x being arg max_k F_k(x). This gives rise to a (K-1)-dimensional margin for each observation. For y = c_k, define the margin vector as:

(8)    m(c_k, f_1, ..., f_K) = (f_k - f_1, ..., f_k - f_{k-1}, f_k - f_{k+1}, ..., f_k - f_K)

And our loss is a function of this (K-1)-dimensional margin:

C(y, f_1, ..., f_K) = Σ_k I{y = c_k} C(m(c_k, f_1, ..., f_K))
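A minimal sketch of the margin vector in (8), assuming NumPy; the function name and the example scores are illustrative, not the paper's.

import numpy as np

def margin_vector(f, k):
    # m(c_k, f_1, ..., f_K): f_k minus each of the other K-1 scores, as in (8)
    return np.delete(f[k] - f, k)

# Scores f_1, f_2, f_3 for one observation whose true class is c_1 (index 0)
f = np.array([2.0, 0.5, -1.0])
print(margin_vector(f, 0))   # [1.5  3. ] -- all positive, so class 1 beats classes 2 and 3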

Figure 1: Margin maximizing loss functions for 2-class problems (left: hinge, exponential and logistic losses) and the SVM 3-class loss function of section 4.1 (right).

The l_p-regularized problem is now:

(9)    β̂(λ) = arg min_{β^(1), ..., β^(K)} Σ_i C(y_i, h(x_i)'β^(1), ..., h(x_i)'β^(K)) + λ Σ_k ‖β^(k)‖_p^p

where β̂(λ) = (β̂^(1)(λ), ..., β̂^(K)(λ)) ∈ R^{K·|H|}. In this formulation, the concept of margin maximization corresponds to maximizing the minimal of all n(K-1) normalized l_p-margins generated by the data:

(10)    max_{Σ_k ‖β^(k)‖_p^p = 1} min_i min_{c_k ≠ y_i} h(x_i)'(β^(y_i) - β^(k))

Note that this margin maximization problem still has a natural geometric interpretation, as h(x_i)'(β^(y_i) - β^(k)) > 0, ∀c_k ≠ y_i implies that the hyper-plane h(x)'(β^(j) - β^(k)) = 0 successfully separates classes j and k for any two classes.

Here is a generalization of the optimal separation theorem 2.1 to multi-class models:

Theorem 4.1  Assume C(m) is commutative and decreasing in each coordinate, then if ∃ T > 0 (possibly T = ∞) such that:

(11)    lim_{t→T} C(t[1-ε], tu_1, ..., tu_{K-2}) / C(t, tv_1, ..., tv_{K-2}) = ∞,  ∀ε > 0, u_1 ≥ 1, ..., u_{K-2} ≥ 1, v_1 ≥ 1, ..., v_{K-2} ≥ 1

then C is a margin-maximizing loss function for multi-class models, in the sense that any convergence point of the normalized solutions to (9), β̂(λ)/‖β̂(λ)‖_p, attains the optimal separation as defined in (10).

Idea of proof  The proof is essentially identical to the two-class case, now considering the n(K-1) margins on which the loss depends. The condition (11) implies that as the regularization vanishes the model is determined by the minimal margin, and so an optimal model puts the emphasis on maximizing that margin.
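To illustrate (10), here is a rough sketch (assuming NumPy; the data and the function name are made up for illustration) that computes the smallest of the n(K-1) margins for a given set of class coefficient vectors, normalized so that the penalty term of (9) equals one for p = 1.

import numpy as np

def min_multiclass_margin(B, H, y):
    # Smallest of the n*(K-1) margins h(x_i)'(beta^(y_i) - beta^(k)), k != y_i, as in (10)
    F = H @ B                               # n x K score matrix, F[i, k] = h(x_i)'beta^(k)
    worst = np.inf
    for i in range(len(y)):
        diffs = F[i, y[i]] - np.delete(F[i], y[i])
        worst = min(worst, diffs.min())
    return worst

# Hypothetical 3-class example with h(x) = x; columns of B are beta^(1), beta^(2), beta^(3)
H = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([0, 1, 2])                     # class indices of the three observations
B = np.array([[ 1.0, -0.5, -0.5],
              [-0.5,  1.0, -0.5]])
B = B / np.abs(B).sum()                     # normalize so that sum_k ||beta^(k)||_1 = 1
print(min_multiclass_margin(B, H, y))       # 0.375 for this toy configuration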

Corollary 4.2  In the 2-class case, theorem 4.1 reduces to theorem 2.1.

Proof  The loss depends on β^(1) - β^(2), the penalty on ‖β^(1)‖_p^p + ‖β^(2)‖_p^p. An optimal solution to the regularized problem must thus have β^(1) + β^(2) = 0, since by transforming:

β^(1) → β^(1) - (β^(1) + β^(2))/2,    β^(2) → β^(2) - (β^(1) + β^(2))/2

we are not changing the loss, but reducing the penalty, by Jensen's inequality:

‖β^(1) - (β^(1) + β^(2))/2‖_p^p + ‖β^(2) - (β^(1) + β^(2))/2‖_p^p = 2‖(β^(1) - β^(2))/2‖_p^p ≤ ‖β^(1)‖_p^p + ‖β^(2)‖_p^p

So we can conclude that β̂^(1)(λ) = -β̂^(2)(λ) and consequently that the two margin maximization tasks (2), (10) are equivalent.

4.1 Margin maximization in multi-class SVM and logistic regression

Here we apply theorem 4.1 to versions of multi-class logistic regression and SVM. For logistic regression, we use a slightly different formulation than the standard logistic regression models, which use class K as a reference class, i.e. assume that β^(K) = 0. This is required for non-regularized fitting, since without it the solution is not uniquely defined. However, using regularization as in (9) guarantees that the solution will be unique and consequently we can symmetrize the model, which allows us to apply theorem 4.1. So the loss function we use is (assume y = c_k, i.e. the observation belongs to class k):

(12)    C(y, f_1, ..., f_K) = -log( e^{f_k} / Σ_j e^{f_j} ) = log( e^{f_1 - f_k} + ... + e^{f_{k-1} - f_k} + 1 + e^{f_{k+1} - f_k} + ... + e^{f_K - f_k} )

with the linear model: f_j(x_i) = h(x_i)'β^(j). It is not difficult to verify that condition (11) holds for this loss function with T = ∞, using the fact that log(1 + ε) = ε + O(ε²). The sum of exponentials which results from applying this first-order approximation satisfies (11), and as ε → 0, the second order term can be ignored.

For support vector machines, consider a multi-class loss which is a natural generalization of the two-class loss:

(13)    C(m) = Σ_{j=1}^{K-1} [1 - m_j]_+

where m_j is the j'th component of the multi-margin m as in (8). Figure 1 shows this loss for K = 3 classes as a function of the two margins. The loss+penalty formulation using (13) is equivalent to a standard optimization formulation of multi-class SVM (e.g. [11]):

max c
s.t.  h(x_i)'(β^(y_i) - β^(k)) ≥ c(1 - ξ_{ik}),  ∀i ∈ {1, ..., n}, k ∈ {1, ..., K}, c_k ≠ y_i
      ξ_{ik} ≥ 0,  Σ_{i,k} ξ_{ik} ≤ B,  Σ_k ‖β^(k)‖_p^p = 1

As both theorem 4.1 (using T = 1) and the optimization formulation indicate, the regularized solutions to this problem converge to the l_p margin maximizing multi-class solution.
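A brief sketch (assuming NumPy; names are illustrative) of the two multi-class losses above: (12) evaluated stably as a log-sum-exp of score differences, and (13) as a sum of hinge terms over the margin vector (8).

import numpy as np

def multiclass_logistic_loss(f, k):
    # (12): -log( e^{f_k} / sum_j e^{f_j} ) = log sum_j e^{f_j - f_k}
    return np.logaddexp.reduce(f - f[k])

def multiclass_hinge_loss(f, k):
    # (13): sum over the K-1 components of the margin vector (8) of [1 - m_j]_+
    m = np.delete(f[k] - f, k)
    return np.sum(np.maximum(0.0, 1.0 - m))

f = np.array([2.0, 0.5, -1.0])          # scores for one observation, true class index 0
print(multiclass_logistic_loss(f, 0))   # about 0.241
print(multiclass_hinge_loss(f, 0))      # 0.0: every margin component exceeds 1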

5 Discussion

What are the properties we would like to have in a classification loss function? Recently there has been a lot of interest in Bayes-consistency of loss functions and algorithms ([1] and references therein), as the data size increases. It turns out that practically all reasonable loss functions are consistent in that sense, although convergence rates and other measures of degree of consistency may vary. Margin maximization, on the other hand, is a finite sample optimality property of loss functions, which is potentially of decreasing interest as sample size grows, since the training data-set is less likely to be separable. Note, however, that in very high dimensional predictor spaces, such as those typically used by boosting or kernel SVM, separability of any finite-size data-set is a mild assumption, which is violated only in pathological cases.

We have shown that the margin maximizing property is shared by some popular loss functions used in logistic regression, support vector machines and boosting. Knowing that these algorithms converge, as regularization vanishes, to the same model (provided they use the same regularization) is an interesting insight. So, for example, we can conclude that 1-norm support vector machines, exponential boosting and l_1-regularized logistic regression all facilitate the same non-regularized solution, which is an l_1-margin maximizing separating hyper-plane. From Mangasarian's theorem [7] we know that this hyper-plane maximizes the l_∞ distance from the closest points on either side.

The most interesting statistical question which arises is: are these optimal separating models really good for prediction, or should we expect regularized models to always do better in practice? Statistical intuition supports the latter, as do some margin-maximizing experiments by Breiman [2] and Grove and Schuurmans [5]. However it has also been observed that in many cases margin-maximization leads to reasonable prediction models, and does not necessarily result in over-fitting. We have had similar experience with boosting and kernel SVM. Settling this issue is an intriguing research topic, and one that is critical in determining the practical importance of our results, as well as that of margin-based generalization error bounds.

References

[1] Bartlett, P., Jordan, M. & McAuliffe, J. (2003). Convexity, Classification and Risk Bounds. Technical report, Dept. of Statistics, UC Berkeley.
[2] Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation 11(7): 1493-1517.
[3] Freund, Y. & Schapire, R.E. (1995). A decision theoretic generalization of on-line learning and an application to boosting. Proc. of 2nd European Conf. on Computational Learning Theory.
[4] Friedman, J.H., Hastie, T. & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics 28: 337-407.
[5] Grove, A.J. & Schuurmans, D. (1998). Boosting in the limit: Maximizing the margin of learned ensembles. Proc. of 15th National Conf. on AI.
[6] Hastie, T., Tibshirani, R. & Friedman, J. (2001). Elements of Statistical Learning. Springer-Verlag.
[7] Mangasarian, O.L. (1999). Arbitrary-norm separating plane. Operations Research Letters, Vol. 24(1-2): 15-23.
[8] Rosset, S., Zhu, J. & Hastie, T. (2003). Boosting as a regularized path to a maximum margin classifier. Technical report, Dept. of Statistics, Stanford Univ.
[9] Schapire, R.E., Freund, Y., Bartlett, P. & Lee, W.S. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics 26(5): 1651-1686.
[10] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer.
[11] Weston, J. & Watkins, C. (1998). Multi-class support vector machines. Technical report CSD-TR-98-04, Dept. of CS, Royal Holloway, University of London.
