Boosting as a Regularized Path to a Maximum Margin Classifier


Saharon Rosset, Ji Zhu, Trevor Hastie
Department of Statistics, Stanford University, Stanford, CA
{saharon,jzhu,hastie}@stat.stanford.edu

May 5, 2003

Abstract

In this paper we study boosting methods from a new perspective. We build on recent work by Efron et al. to show that boosting approximately (and in some cases exactly) minimizes its loss criterion with an $l_1$ constraint on the coefficient vector. This helps understand the success of boosting with early stopping as regularized fitting of the loss criterion. For the two most commonly used criteria (exponential and binomial log-likelihood), we further show that as the constraint is relaxed, or equivalently as the boosting iterations proceed, the solution converges (in the separable case) to an $l_1$-optimal separating hyper-plane. We prove that this $l_1$-optimal separating hyper-plane has the property of maximizing the minimal $l_1$-margin of the training data, as defined in the boosting literature. An interesting fundamental similarity between boosting and kernel support vector machines emerges, as both can be described as methods for regularized optimization in high-dimensional predictor space, utilizing a computational trick to make the calculation practical, and converging to margin-maximizing solutions. While this statement describes SVMs exactly, it applies to boosting only approximately.

Key-words: Boosting, regularized optimization, support vector machines, margin maximization

1 Introduction and outline

Boosting is a method for iteratively building an additive model

$$F_T(x) = \sum_{t=1}^{T} \alpha_t h_{j_t}(x), \qquad (1)$$

where $h_{j_t} \in \mathcal{H}$, a large (but we will assume finite) dictionary of candidate predictors or "weak learners"; and $h_{j_t}$ is the basis function selected as the best candidate to modify at stage t. The model $F_T$ can equivalently be represented by assigning a coefficient to each dictionary function $h \in \mathcal{H}$ rather than to the selected $h_{j_t}$'s only:

$$F_T(x) = \sum_{j=1}^{J} h_j(x)\, \beta_j^{(T)}, \qquad (2)$$

where $J = |\mathcal{H}|$ and $\beta_j^{(T)} = \sum_{t: j_t = j} \alpha_t$. The β representation allows us to interpret the coefficient vector $\beta^{(T)}$ as a vector in $R^J$ or, equivalently, as the hyper-plane which has $\beta^{(T)}$ as its normal. This interpretation will play a key role in our exposition.

Some examples of common dictionaries are:

- The training variables themselves, in which case $h_j(x) = x_j$. This leads to our additive model $F_T$ being just a linear model in the original data. The number of dictionary functions will be J = d, the dimension of x.
- Polynomial dictionary of degree p, in which case the number of dictionary functions will be $J = \binom{p+d}{d}$.
- Decision trees with up to k terminal nodes, if we limit the split points to data points (or mid-way between data points as CART does). The number of possible trees is bounded from above (trivially) by $J \leq (np)^k 2^{k^2}$. Note that regression trees do not fit into our framework, since they will give $J = \infty$.

The boosting idea was first introduced by Freund and Schapire [6], with their AdaBoost algorithm. AdaBoost and other boosting algorithms have attracted a lot of attention due to their great success in data modeling tasks, and the mechanism which makes them work has been presented and analyzed from several perspectives. Friedman et al. [8] develop a statistical perspective, which ultimately leads to viewing AdaBoost as a gradient-based incremental search for a good additive model (more specifically, it is a coordinate descent algorithm), using the exponential loss function $C(y, F) = \exp(-yF)$, where $y \in \{-1, 1\}$. The gradient boosting [7] and AnyBoost [11] generic algorithms have used this approach to generalize the boosting idea to wider families of problems and loss functions. In particular, [8] have pointed out that the binomial log-likelihood loss $C(y, F) = \log(1 + \exp(-yF))$ is a more natural loss for classification, and is more robust to outliers and misspecified data.

A different analysis of boosting, originating in the machine learning community, concentrates on the effect of boosting on the margins $y_i F(x_i)$. For example, [16] uses margin-based arguments to prove convergence of boosting to perfect classification performance on the training data under general conditions, and to derive bounds on the generalization error (on future, unseen data).

In this paper we combine the two approaches, to conclude that gradient-based boosting can be described, in the separable case, as an approximate margin-maximizing process. The view we develop of boosting as an approximate path of optimal solutions to regularized problems also justifies early stopping in boosting as specifying a value for the regularization parameter.

We consider the problem of minimizing convex loss functions (in particular the exponential and binomial log-likelihood loss functions) over the training data, with an $l_1$ bound on the model coefficients:

$$\hat{\beta}(c) = \arg\min_{\|\beta\|_1 \leq c} \sum_i C(y_i, h(x_i)^T \beta), \qquad (3)$$

where $h(x_i) = [h_1(x_i), h_2(x_i), \ldots, h_J(x_i)]^T$ and $J = |\mathcal{H}|$.

Hastie et al. ([9], chapter 10) have observed that slow gradient-based boosting (i.e. setting $\alpha_t = \epsilon$ for all t in (1), with ε small) tends to follow the penalized path $\hat{\beta}(c)$ as a function of c, under some mild conditions on this path. In other words, using the notation of (2) and (3), this implies that $\|\beta^{(c/\epsilon)} - \hat{\beta}(c)\|$ vanishes with ε, for all (or a wide range of) values of c. Figure 1 illustrates this equivalence between ε-boosting and the optimal solution of (3) on a real-life data set, using squared error loss as the loss function.

[Figure 1: Exact coefficient paths (left) for $l_1$-constrained squared error regression and boosting coefficient paths (right) on the data from a prostate cancer study.]

In this paper we demonstrate this equivalence further and formally state it as a conjecture. Some progress towards proving this conjecture has been made by Efron et al. [5], who prove a weaker "local" result for the case where C is squared error loss, under some mild conditions on the optimal path. We generalize their result to general convex loss functions. Combining the empirical and theoretical evidence, we conclude that boosting can be viewed as an approximate incremental method for following the $l_1$-regularized path.

We then prove that in the separable case, for both the exponential and logistic log-likelihood loss functions, $\hat{\beta}(c)/c$ converges as $c \to \infty$ to an "optimal" separating hyper-plane $\hat{\beta}$ described by:

$$\hat{\beta} = \arg\max_{\|\beta\|_1 = 1} \min_i y_i \beta^T h(x_i) \qquad (4)$$

In other words, $\hat{\beta}$ maximizes the minimal margin among all vectors with $l_1$-norm equal to 1. This result generalizes easily to other $l_p$-norm constraints. For example, if p = 2, then $\hat{\beta}$ describes the optimal separating hyper-plane in the Euclidean sense, i.e. the same one that a non-regularized support vector machine would find.

Combining our two main results, we get the following characterization of boosting: ε-boosting can be described as a gradient-descent search, approximately following the path of $l_1$-constrained optimal solutions to its loss criterion, and converging, in the separable case, to a "margin maximizer" in the $l_1$ sense. Note that boosting with a large dictionary H (in particular if $n < J = |\mathcal{H}|$) guarantees that the data will be separable (except for pathologies), hence separability is a very mild assumption here.

As in the case of support vector machines in high-dimensional feature spaces, the non-regularized "optimal" separating hyper-plane is usually of theoretical interest only, since it typically represents a highly over-fitted model. Thus, we would want to choose a good regularized model. Our results indicate that boosting gives a natural method for doing that, by "stopping early" in the boosting process. Furthermore, they point out the fundamental similarity between boosting and SVMs: both approaches allow us to fit regularized models in high-dimensional predictor space, using a computational trick. They differ in the regularization approach they take (exact $l_2$ regularization for SVMs, approximate $l_1$ regularization for boosting) and in the computational trick that facilitates fitting (the kernel trick for SVMs, coordinate descent for boosting).

1.1 Related work

Schapire et al. [16] have identified the normalized margins as distance from an $l_1$-normed separating hyper-plane. Their results relate the boosting iterations' success to the minimal margin of the combined model. Raetsch et al. [13] take this further using an asymptotic analysis of AdaBoost. They prove that the normalized minimal margin, $\min_i y_i \sum_t \alpha_t h_t(x_i) / \sum_t |\alpha_t|$, is asymptotically equal for both classes. In other words, they prove that the asymptotic separating hyper-plane is equally far away from the closest points on either side. This is a property of the margin-maximizing separating hyper-plane as we define it. Both papers also illustrate the margin-maximizing effects of AdaBoost through experimentation. However, they both stop short of proving the convergence to optimal (margin-maximizing) solutions.

Motivated by our result, Raetsch and Warmuth [14] have recently asserted the margin-maximizing properties of ε-AdaBoost, using a different approach than the one used in this paper. Their results relate only to the asymptotic convergence of infinitesimal AdaBoost, compared to our analysis of the regularized path traced along the way and of a variety of boosting loss functions, which also leads to a convergence result for the binomial log-likelihood loss.

The convergence of boosting to an "optimal" solution from a loss function perspective has been analyzed in several papers. Raetsch et al. [12] and Collins et al. [3] give results and bounds on the convergence of the training-set loss, $\sum_i C(y_i, \sum_t \alpha_t h_t(x_i))$, to its minimum. However, in the separable case convergence of the loss to 0 is inherently different from convergence of the linear separator to the optimal separator. Any solution which separates the two classes perfectly can drive the exponential (or log-likelihood) loss to 0, simply by scaling coefficients up linearly.

2 Boosting as gradient descent

Generic gradient-based boosting algorithms such as [7, 11] attempt to find a good linear combination of the members of some dictionary of basis functions to optimize a given loss function over a sample. This is done by searching, at each iteration, for the basis function which gives the "steepest descent" in the loss, and changing its coefficient accordingly. In other words, this is a coordinate descent algorithm in $R^J$, where we assign one dimension (or coordinate) for the coefficient of each dictionary function.

Assume we have data $\{x_i, y_i\}_{i=1}^n$, a loss (or cost) function C(y, F), and a set of basis functions $\{h_j(x)\}: R^d \to R$. Then all of these algorithms follow the same essential steps:

Algorithm 1: Generic gradient-based boosting algorithm

1. Set $\beta^{(0)} = 0$.
2. For t = 1 : T,
   (a) Let $F_i = {\beta^{(t-1)}}^T h(x_i)$, i = 1, ..., n (the current fit).
   (b) Set $w_i = -\frac{\partial C(y_i, F_i)}{\partial F_i}$, i = 1, ..., n.
   (c) Identify $j_t = \arg\max_j |\sum_i w_i h_j(x_i)|$.
   (d) Set $\beta_{j_t}^{(t)} = \beta_{j_t}^{(t-1)} + \alpha_t$ and $\beta_k^{(t)} = \beta_k^{(t-1)}$, $k \neq j_t$.

Here $\beta^{(t)}$ is the "current" coefficient vector and $\alpha_t > 0$ is the current step size. Notice that $\sum_i w_i h_{j_t}(x_i) = -\frac{\partial \sum_i C(y_i, F_i)}{\partial \beta_{j_t}}$.

As we mentioned, Algorithm 1 can be interpreted simply as a coordinate descent algorithm in "weak learner" space. Implementation details include the dictionary H of weak learners, the loss function C(y, F), the method of searching for the optimal $j_t$, and the way in which $\alpha_t$ is determined (the sign of $\alpha_t$ will always be $\mathrm{sign}(\sum_i w_i h_{j_t}(x_i))$, since we want the loss to be reduced; in most cases, the dictionary H is negation closed, and so it is assumed without loss of generality that $\alpha_t > 0$).
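To make Algorithm 1 concrete, here is a minimal Python sketch of the coordinate-descent loop for a finite, explicitly enumerated dictionary, using a fixed small step size. The function names and the fixed step `eps` are illustrative choices of ours, not the paper's formulation.

```python
import numpy as np

def generic_boost(H, y, grad_loss, T=1000, eps=0.01):
    """Generic gradient-based boosting (Algorithm 1) over a finite dictionary.

    H         : (n, J) matrix whose columns are the weak learners h_j evaluated on the data
    y         : (n,) labels in {-1, +1}
    grad_loss : function (y, F) -> dC/dF, evaluated pointwise
    """
    n, J = H.shape
    beta = np.zeros(J)
    for t in range(T):
        F = H @ beta                       # (a) current fit
        w = -grad_loss(y, F)               # (b) negative gradient of the loss w.r.t. the fit
        corr = H.T @ w                     # sum_i w_i h_j(x_i), one value per coordinate
        j = np.argmax(np.abs(corr))        # (c) steepest-descent coordinate
        beta[j] += eps * np.sign(corr[j])  # (d) small step in that coordinate
    return beta

# Example with exponential loss C(y, F) = exp(-y F):
#   grad_exp = lambda y, F: -y * np.exp(-y * F)
#   beta = generic_boost(H, y, grad_exp, T=5000, eps=0.01)
```

With the step size fixed at a small constant this is the ε-boosting variant discussed in section 2.4; a line search over the step size at each iteration would recover the AdaBoost-style version.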

For example, the original AdaBoost algorithm uses this scheme with the exponential loss $C(y, F) = \exp(-yF)$, and an implicit line search to find the best $\alpha_t$ once a direction $j_t$ has been chosen (see [9], chapter 10, and [11] for details). The dictionary used by AdaBoost in this formulation would be a set of candidate classifiers, i.e. $h_j(x_i) \in \{-1, +1\}$; usually decision trees are used in practice.

2.1 Practical implementation of boosting

The dictionaries used for boosting are typically very large, practically infinite, and therefore the generic boosting algorithm we have presented cannot be implemented verbatim. In particular, it is not practical to exhaustively search for the maximizer in step 2(c). Instead, an approximate, usually greedy search is conducted to find a good candidate weak learner $h_{j_t}$ which makes the first-order decline in the loss large (even if not maximal among all possible models).

In the common case that the dictionary of weak learners is comprised of decision trees with up to k nodes, the way AdaBoost and other boosting algorithms solve stage 2(c) is by building a decision tree to a re-weighted version of the data, with the weights $|w_i|$ ($w_i$ as defined above). Thus they first replace step 2(c) with minimization of:

$$\sum_i |w_i|\, 1\{y_i \neq h_{j_t}(x_i)\}$$

which is easily shown to be equivalent to the original step 2(c). They then use a greedy decision-tree building algorithm such as CART or C5 to build a decision tree which minimizes this quantity, i.e. achieves low "weighted misclassification error" on the weighted data. Since the tree is built greedily, one split at a time, it will not be the global minimizer of weighted misclassification error among all k-node decision trees. However, it will be a good fit for the re-weighted data, and can be considered an approximation to the optimal tree.

This use of approximate optimization techniques is critical, since much of the strength of the boosting approach comes from its ability to build additive models in very high-dimensional predictor spaces. In such spaces, standard exact optimization techniques are impractical: any approach which requires calculation and inversion of Hessian matrices is completely out of the question, and even approaches which require only first derivatives, such as coordinate descent, can only be implemented approximately.
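As a concrete illustration of this re-weighting device, the sketch below fits a small CART-style tree to the data with case weights $|w_i|$, using scikit-learn's tree learner as a stand-in for CART/C5; treating the fitted tree's predictions as the selected weak learner, and the values of `k` and `eps`, are assumptions of this sketch rather than a transcription of any particular boosting implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_weak_learner(X, y, w, k=8):
    """Approximate step 2(c): greedily fit a tree with at most k leaves to the data
    re-weighted by |w_i|, so it has low weighted misclassification error."""
    tree = DecisionTreeClassifier(max_leaf_nodes=k)
    tree.fit(X, y, sample_weight=np.abs(w))
    # The fitted tree plays the role of the selected weak learner h_{j_t};
    # its +/-1 predictions give the "column" whose coefficient gets updated.
    return tree

# Usage inside a boosting loop with exponential loss (F is the current fit):
#   w = y * np.exp(-y * F)                  # |w_i| = exp(-y_i F_i), the familiar AdaBoost weights
#   h = fit_weak_learner(X, y, w)
#   F = F + eps * h.predict(X).astype(float)
```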

2.2 Gradient-based boosting as a generic modeling tool

As [7, 11] mention, this view of boosting as gradient descent allows us to devise boosting algorithms for any function estimation problem: all we need is an appropriate loss and an appropriate dictionary of weak learners. For example, [8] suggested using the binomial log-likelihood loss instead of the exponential loss of AdaBoost for binary classification, resulting in the LogitBoost algorithm. However, there is no need to limit boosting algorithms to classification: [7] applied this methodology to regression estimation, using squared error loss and regression trees, and [15] applied it to density estimation, using the log-likelihood criterion and Bayesian networks as weak learners. Their experiments and those of others illustrate that the practical usefulness of this approach (coordinate descent in high-dimensional predictor space) carries beyond classification, and even beyond supervised learning.

The view we present in this paper, of coordinate-descent boosting as approximate $l_1$-regularized fitting, offers some insight into why this approach would be good in general: it allows us to fit regularized models directly in high-dimensional predictor space. In this it bears a conceptual similarity to support vector machines, which exactly fit an $l_2$-regularized model in high-dimensional (RKHS) predictor space.

2.3 Loss functions

The two most commonly used loss functions for boosting classification models are the exponential and the (minus) binomial log-likelihood:

Exponential: $C_e(y, F) = \exp(-yF)$
Log-likelihood: $C_l(y, F) = \log(1 + \exp(-yF))$

[Figure 2: The two classification loss functions.]

These two loss functions bear some important similarities to each other. As [8] shows, the population minimizer of expected loss at point x is similar for both loss functions and is given by:

$$\hat{F}(x) = c \log \left[ \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)} \right]$$

where $c_e = 1/2$ for exponential loss and $c_l = 1$ for binomial loss.
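For reference, a minimal sketch of the two losses as Python functions, evaluated on a few margin values; the numerical comparison simply previews the behavior discussed next (near-agreement for positive margins, sharply different weighting of negative margins).

```python
import numpy as np

def exp_loss(y, F):
    """Exponential loss C_e(y, F) = exp(-yF), as used by AdaBoost."""
    return np.exp(-y * F)

def binomial_loss(y, F):
    """(Minus) binomial log-likelihood C_l(y, F) = log(1 + exp(-yF))."""
    return np.log1p(np.exp(-y * F))

margins = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])   # values of yF
print(exp_loss(1, margins))       # grows exponentially for negative margins
print(binomial_loss(1, margins))  # roughly linear for large negative margins,
                                  # close to the exponential loss for yF >= 0
```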

More importantly for our purpose, we have the following simple proposition, which illustrates the strong similarity between the two loss functions for positive margins (i.e. correct classifications):

Proposition 1. If $yF \geq 0$ then
$$0.5\, C_e(y, F) \leq C_l(y, F) \leq C_e(y, F) \qquad (5)$$

In other words, the two losses become similar if the margins are positive, and both behave like exponentials.

Proof: Consider the functions $f_1(z) = z$ and $f_2(z) = \log(1 + z)$ for $z \in [0, 1]$. Then $f_1(0) = f_2(0) = 0$, and:
$$\frac{\partial f_1(z)}{\partial z} \equiv 1, \qquad \frac{1}{2} \leq \frac{\partial f_2(z)}{\partial z} = \frac{1}{1+z} \leq 1$$
Thus we can conclude $0.5 f_1(z) \leq f_2(z) \leq f_1(z)$. Now set $z = \exp(-yF)$ and we get the desired result.

For negative margins the behaviors of $C_e$ and $C_l$ are very different, as [8] have noted. In particular, $C_l$ is more robust against outliers and misspecified data.

2.4 Line-search boosting vs. ε-boosting

As mentioned above, AdaBoost determines $\alpha_t$ using a line search. In our notation for Algorithm 1 this would be:
$$\alpha_t = \arg\min_\alpha \sum_i C(y_i, F_i + \alpha h_{j_t}(x_i))$$
The alternative approach, suggested by Friedman [7, 9], is to "shrink" all $\alpha_t$ to a single small value ε. This may slow down learning considerably (depending on how small ε is), but is attractive theoretically: the first-order theory underlying gradient boosting implies that the weak learner chosen is the best increment only "locally". It can also be argued that this approach is "stronger" than line search, as we can keep selecting the same $h_{j_t}$ repeatedly if it remains optimal, and so ε-boosting "dominates" line-search boosting in terms of training error. In practice, this approach of "slowing the learning rate" usually performs better than line search in terms of prediction error as well (see [7]). For our purposes, we will mostly assume ε is infinitesimally small, so the theoretical boosting algorithm which results is the limit of a series of boosting algorithms with shrinking ε.

In regression terminology, the line-search version is equivalent to forward stage-wise modeling, infamous in the statistics literature for being too greedy and highly unstable (see [7]). This is intuitively obvious, since by increasing the coefficient until it "saturates" we are destroying signal which may help us select other good predictors.
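The difference between the two step-size rules can be written down directly; the sketch below uses scipy's scalar minimizer for the line search and a fixed ε for the slow-learning alternative. The function names and the search interval (0, 10) are our own illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def line_search_step(y, F, h_vals, loss):
    """AdaBoost-style step: alpha_t = argmin_a sum_i C(y_i, F_i + a * h_jt(x_i))."""
    obj = lambda a: loss(y, F + a * h_vals).sum()
    return minimize_scalar(obj, bounds=(0.0, 10.0), method='bounded').x

def epsilon_step(eps=0.01):
    """Friedman's alternative: shrink every step to the same small value eps."""
    return eps

# loss = lambda y, F: np.exp(-y * F)                    # e.g. the exponential loss
# alpha = line_search_step(y, F, h.predict(X), loss)    # greedy; saturates the chosen coordinate
# alpha = epsilon_step(0.01)                            # slow learning, i.e. eps-boosting
```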

[Figure 3: A simple data example, with two observations from class "O" and two observations from class "X". The full line is the Euclidean margin-maximizing separating hyper-plane.]

3 $l_p$ margins, support vector machines and boosting

We now introduce the concept of margins as a geometric interpretation of a binary classification model. In the context of boosting, this view offers a different understanding of AdaBoost from the gradient descent view presented above. In the following sections we connect the two views.

3.1 The Euclidean margin and the support vector machine

Consider a classification model in high-dimensional predictor space: $F(x) = \sum_j h_j(x)\beta_j$. We say that the model separates the training data $\{x_i, y_i\}_{i=1}^n$ if $\mathrm{sign}(F(x_i)) = y_i$ for all i. From a geometrical perspective this means that the hyper-plane defined by $F(x) = 0$ is a separating hyper-plane for this data, and we define its (Euclidean) margin as:

$$m_2(\beta) = \min_i \frac{y_i F(x_i)}{\|\beta\|_2} \qquad (6)$$

The margin-maximizing separating hyper-plane for this data would be defined by the β which maximizes $m_2(\beta)$. Figure 3 shows a simple example of separable data in 2 dimensions, with its margin-maximizing separating hyper-plane. The Euclidean margin-maximizing separating hyper-plane is the (non-regularized) support vector machine solution. Its margin-maximizing properties play a central role in deriving generalization error bounds for these models, and form the basis for a rich literature.

[Figure 4: $l_1$-margin-maximizing separating hyper-plane for the same data set as Figure 3. The difference between the diagonal Euclidean optimal separator and the vertical $l_1$ optimal separator illustrates the "sparsity" effect of optimal $l_1$ separation.]

3.2 The $l_1$ margin and its relation to boosting

Instead of considering the Euclidean margin as in (6) we can define an $l_p$ margin concept as

$$m_p(\beta) = \min_i \frac{y_i F(x_i)}{\|\beta\|_p} \qquad (7)$$

Of particular interest to us is the case p = 1. Figure 4 shows the $l_1$-margin-maximizing separating hyper-plane for the same simple example as Figure 3. Note the fundamental difference between the two solutions: the $l_2$-optimal separator is diagonal, while the $l_1$-optimal one is vertical. To understand why this is so, we can relate the two margin definitions to each other as:

$$\frac{yF(x)}{\|\beta\|_1} = \frac{yF(x)}{\|\beta\|_2} \cdot \frac{\|\beta\|_2}{\|\beta\|_1} \qquad (8)$$

From this representation we can observe that the $l_1$ margin will tend to be big if the ratio $\|\beta\|_2 / \|\beta\|_1$ is big. This ratio will generally be big if β is sparse. To see this, consider fixing the $l_1$ norm of the vector and then comparing the $l_2$ norm of two candidates: one with many small components, and the other a sparse one with a few large components and many zero components. It is easy to see that the second vector will have bigger $l_2$ norm, and hence (if the $l_2$ margin for both vectors is equal) a bigger $l_1$ margin.

A different perspective on the difference between the optimal solutions is given by a theorem due to Mangasarian [10], which states that the $l_p$-margin-maximizing separating hyper-plane maximizes the $l_q$ distance from the closest points to the separating hyper-plane, with $\frac{1}{p} + \frac{1}{q} = 1$. Thus the Euclidean optimal separator (p = 2) also maximizes Euclidean distance between the points and the hyper-plane, while the $l_1$ optimal separator maximizes $l_\infty$ distance. This interesting result gives another intuition why $l_1$-optimal separating hyper-planes tend to be coordinate-oriented (i.e. have sparse representations): since $l_\infty$ projection considers only the largest coordinate distance, some coordinate distances may be 0 at no cost of increased $l_\infty$ distance.
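A small numerical sketch of relation (8): for two coefficient vectors with the same $l_1$ norm, the sparse one has the larger $\|\beta\|_2/\|\beta\|_1$ ratio, and on data whose signal lives in one coordinate it also attains the larger $l_1$ margin. The toy data below is our own construction in the spirit of Figures 3 and 4.

```python
import numpy as np

def lp_margin(beta, X, y, p):
    """l_p margin (7): min_i y_i F(x_i) / ||beta||_p."""
    return np.min(y * (X @ beta)) / np.linalg.norm(beta, ord=p)

# Four points where the first coordinate carries the signal and the second is noise.
X = np.array([[1.0, 0.2], [2.0, -0.3], [-1.0, -0.2], [-2.0, 0.3]])
y = np.array([1, 1, -1, -1])

sparse = np.array([1.0, 0.0])   # "vertical" separator: all weight on one coordinate
dense  = np.array([0.5, 0.5])   # same l1 norm, spread over both coordinates

for name, beta in [("sparse", sparse), ("dense", dense)]:
    ratio = np.linalg.norm(beta, 2) / np.linalg.norm(beta, 1)   # the ratio in (8)
    print(name, ratio, lp_margin(beta, X, y, 1), lp_margin(beta, X, y, 2))
# The sparse vector has the larger norm ratio and, on this data, the larger l1 margin.
```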

[16] have pointed out the relation between AdaBoost and the $l_1$ margin. They prove that, in the case of separable data, the boosting iterations increase the "boosting" margin of the model, defined as:

$$\min_i \frac{y_i F(x_i)}{\|\alpha\|_1} \qquad (9)$$

In other words, this is the $l_1$ margin of the model, except that it uses the α incremental representation rather than the β "geometric" representation for the model. The two representations give the same $l_1$ norm if there is "sign consistency", or monotonicity, in the coefficient paths traced by the model, i.e. if at every iteration t of the boosting algorithm:

$$\beta_{j_t} \neq 0 \;\Rightarrow\; \mathrm{sign}(\alpha_t) = \mathrm{sign}(\beta_{j_t}) \qquad (10)$$

As we will see later, this monotonicity condition will play an important role in the equivalence between boosting and $l_1$ regularization.

The $l_1$-margin-maximization view of AdaBoost presented by [16], and a whole plethora of papers that followed, is important for the analysis of boosting algorithms for two distinct reasons:

- It gives an intuitive, geometric interpretation of the model that AdaBoost is looking for: a model which separates the data well in this $l_1$-margin sense. Note that the view of boosting as gradient descent in a loss criterion doesn't really give the same kind of intuition: if the data is separable, then any model which separates the training data will drive the exponential or binomial loss to 0 when scaled up:
$$m_1(\beta) > 0 \;\Rightarrow\; \sum_i C(y_i, d\,\beta^T h(x_i)) \to 0 \text{ as } d \to \infty$$
- The $l_1$-margin behavior of a classification model on its training data facilitates generation of generalization (or prediction) error bounds, similar to those that exist for support vector machines.

From a statistical perspective, however, we should be suspicious of margin maximization as a method for building good prediction models in high-dimensional predictor space. Margin maximization is by nature a non-regularized objective, and solving it in high-dimensional space is likely to lead to over-fitting and bad prediction performance. This has been observed in practice by many authors, in particular Breiman [2]. In section 7 we return to discuss these issues in more detail.

4 Boosting as approximate incremental $l_1$-constrained fitting

In this section we introduce an interpretation of the generic coordinate-descent boosting algorithm as "tracking" a path of approximate solutions to $l_1$-constrained (or equivalently, regularized) versions of its loss criterion. This view serves our understanding of what boosting does, in particular the connection between early stopping in boosting and regularization. We will also use this view to get a result about the asymptotic margin maximization of regularized classification models, and by analogy of classification boosting. We build on ideas first presented by [9] (chapter 10) and [5].

Given a loss criterion $C(\cdot, \cdot)$, consider the 1-dimensional path of optimal solutions to $l_1$-constrained optimization problems over the training data:

$$\hat{\beta}(c) = \arg\min_{\|\beta\|_1 \leq c} \sum_i C(y_i, h(x_i)^T \beta). \qquad (11)$$

As c varies, $\hat{\beta}(c)$ traces a 1-dimensional "optimal curve" through $R^J$. If an optimal solution for the non-constrained problem exists and has finite $l_1$ norm $c_0$, then obviously $\hat{\beta}(c) = \hat{\beta}(c_0) = \hat{\beta}$ for all $c > c_0$. Note that in the case of separable 2-class data, using either $C_e$ or $C_l$, there is no finite-norm optimal solution. Rather, the constrained solution will always have $\|\hat{\beta}(c)\|_1 = c$.

A different way of building a solution which has $l_1$ norm c is to run our ε-boosting algorithm for $c/\epsilon$ iterations. This will give an $\alpha^{(c/\epsilon)}$ vector which has $l_1$ norm exactly c. For the norm of the "geometric" representation $\beta^{(c/\epsilon)}$ to also be equal to c, we need the monotonicity condition (10) to hold as well. This condition will play a key role in our exposition.

We are going to argue that the two solution paths $\hat{\beta}(c)$ and $\beta^{(c/\epsilon)}$ are very similar for ε "small". Let us start by observing this similarity in practice. Figure 1 in the introduction shows an example of this similarity for squared error loss fitting with an $l_1$ (lasso) penalty. Figure 5 shows another example in the same mold, taken from [5]. The data is a diabetes study and the dictionary used is just the original 10 variables. The panel on the left shows the path of optimal $l_1$-constrained solutions $\hat{\beta}(c)$ and the panel on the right shows the ε-boosting path with the 10-dimensional dictionary (the total number of boosting iterations is about 6000). The 1-dimensional path through $R^{10}$ is described by 10 coordinate curves, corresponding to each one of the variables.

The interesting phenomenon we observe is that the two coefficient traces are not completely identical. Rather, they agree up to the point where the coefficient path of variable 7 becomes non-monotone, i.e. it violates (10) (this point is where variable 8 comes into the model; see the arrow on the right panel). This example illustrates that the monotonicity condition, and its implication that $\|\alpha\|_1 = \|\beta\|_1$, is critical for the equivalence between ε-boosting and $l_1$-constrained optimization.

The two examples we have seen so far have used squared error loss, and we should ask ourselves whether this equivalence stretches beyond this loss. Figure 6 shows a similar result, but this time for the binomial log-likelihood loss, $C_l$.

[Figure 5: Another example of the equivalence between the Lasso optimal solution path (left) and ε-boosting with squared error loss (right). Note that the equivalence breaks down when the path of variable 7 becomes non-monotone.]

We used the "spam" dataset, taken from the UCI repository [1]. We chose only 5 predictors of the 57 to make the plots more interpretable and the computations more accommodating. We see that there is a perfect equivalence between the exact constrained solution (i.e. regularized logistic regression) and ε-boosting in this case, since the paths are fully monotone.

To justify why this observed equivalence is not surprising, let us consider the following $l_1$-locally-optimal monotone direction problem of finding the best monotone ε increment to a given model $\beta_0$:

$$\min_\beta C(\beta) \quad \text{s.t.} \quad \|\beta\|_1 - \|\beta_0\|_1 \leq \epsilon, \quad |\beta_j| \geq |\beta_{0,j}| \;\forall j \qquad (12)$$

Here we use $C(\beta)$ as shorthand for $\sum_i C(y_i, h(x_i)^T \beta)$. A first-order Taylor expansion gives us:

$$C(\beta) = C(\beta_0) + \nabla C(\beta_0)^T (\beta - \beta_0) + O(\epsilon^2)$$

And given the $l_1$ constraint on the increase in $\|\beta\|_1$, it is easy to see that a first-order optimal solution (and therefore an optimal solution as $\epsilon \to 0$) will make a "coordinate descent" step, i.e.:

$$\beta_j \neq \beta_{0,j} \;\Rightarrow\; |\nabla C(\beta_0)_j| = \max_k |\nabla C(\beta_0)_k|$$

assuming the signs match (i.e. $\mathrm{sign}(\beta_{0,j}) = -\mathrm{sign}(\nabla C(\beta_0)_j)$).

[Figure 6: Exact coefficient paths (left) for $l_1$-constrained logistic regression and boosting coefficient paths (right) with binomial log-likelihood loss on five variables from the "spam" dataset. The boosting path was generated using a small ε and 7000 iterations.]

So we get that if the optimal solution to (12) without the monotonicity constraint happens to be monotone, then it is equivalent to a coordinate descent step. And so it is reasonable to expect that if the optimal $l_1$-regularized path is monotone (as it indeed is in Figures 1 and 6), then an "infinitesimal" ε-boosting algorithm would follow the same path of solutions. Furthermore, even if the optimal path is not monotone, we can still use the formulation (12) to argue that ε-boosting would tend to follow an approximate $l_1$-regularized path. The main difference between the ε-boosting path and the true optimal path is that it will tend to "delay" becoming non-monotone, as we observe for variable 7 in Figure 5. To understand this specific phenomenon would require analysis of the true optimal path, which falls outside the scope of our discussion; [5] cover the subject for squared error loss, and their discussion applies to any continuously differentiable convex loss, using second-order approximations.

We can utilize this understanding of the relationship between boosting and $l_1$ regularization to construct $l_p$ boosting algorithms by changing the coordinate-selection criterion in the coordinate descent algorithm. We will get back to this point in section 7, where we design an $l_2$ boosting algorithm.

The experimental evidence and heuristic discussion we have presented lead us to the following conjecture which connects "slow boosting" and $l_1$-regularized optimization:

Conjecture 2. Consider applying the ε-boosting algorithm to any convex loss function, generating a path of solutions $\beta^{(\epsilon)}(t)$. Then if the optimal coefficient paths are monotone for all $c < c_0$, i.e. if for all j, $|\hat{\beta}(c)_j|$ is non-decreasing in the range $c < c_0$, then:

$$\lim_{\epsilon \to 0} \beta^{(\epsilon)}(c_0/\epsilon) = \hat{\beta}(c_0)$$
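For squared error loss the conjecture can be probed numerically with off-the-shelf tools; the sketch below compares the exact lasso coefficient path with an ε-forward-stagewise path, in the spirit of Figures 1 and 5. The simulated data and step size are arbitrary illustrative choices of ours.

```python
import numpy as np
from sklearn.linear_model import lasso_path

def forward_stagewise(X, y, n_steps=5000, eps=0.01):
    """eps-boosting with squared error loss: repeatedly take a small step in the
    coordinate most correlated with the current residual."""
    n, J = X.shape
    beta = np.zeros(J)
    path = []
    for _ in range(n_steps):
        resid = y - X @ beta
        corr = X.T @ resid
        j = np.argmax(np.abs(corr))
        beta[j] += eps * np.sign(corr[j])
        path.append(beta.copy())
    return np.array(path)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=100)

stage_path = forward_stagewise(X, y)      # indexed by total l1 norm = eps * t
_, lasso_coefs, _ = lasso_path(X, y)      # exact l1-regularized solutions
# Plotting both sets of coefficient traces against the l1 norm of the coefficient
# vector reproduces the kind of agreement shown in Figures 1 and 5 whenever the
# optimal paths are monotone.
```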

Efron et al. [5] (theorem 2) prove a weaker "local" result for the case of squared error loss only. We generalize their result to any convex loss. However, this result still does not prove the "global" convergence which the conjecture claims and the empirical evidence implies. For the sake of brevity and readability, we defer this proof, together with a concise mathematical definition of the different types of convergence, to Appendix A.

In the context of "real-life" boosting, where the number of basis functions is usually very large, and making ε small enough for the theory to apply would require running the algorithm forever, these results should not be considered directly applicable. Instead, they should be taken as an intuitive indication that boosting, especially the ε version, is indeed approximating optimal solutions to the constrained problems it encounters along the way.

5 $l_p$-constrained classification loss functions

Having established the relation between boosting and $l_1$ regularization, we are going to turn our attention to the regularized optimization problem. By analogy, our results will apply to boosting as well. We concentrate on $C_e$ and $C_l$, the two classification losses defined above, and the solution paths of their $l_p$-constrained versions:

$$\hat{\beta}^{(p)}(c) = \arg\min_{\|\beta\|_p \leq c} \sum_i C(y_i, \beta^T h(x_i)) \qquad (13)$$

where C is either $C_e$ or $C_l$. As we discussed below equation (11), if the training data is separable in span(H), then we have $\|\hat{\beta}^{(p)}(c)\|_p = c$ for all values of c. Consequently:

$$\left\| \frac{\hat{\beta}^{(p)}(c)}{c} \right\|_p = 1 \quad \forall c$$

We may ask what are the convergence points of this sequence as $c \to \infty$. The following theorem shows that these convergence points describe $l_p$-margin-maximizing separating hyper-planes.

Theorem 3. Assume the data is separable, i.e. there exists β such that $y_i \beta^T h(x_i) > 0$ for all i. Then for both $C_e$ and $C_l$, $\frac{\hat{\beta}^{(p)}(c)}{c}$ converges to the $l_p$-margin-maximizing separating hyper-plane (if it is unique), in the following sense:

$$\hat{\beta}^{(p)} := \lim_{c \to \infty} \frac{\hat{\beta}^{(p)}(c)}{c} = \arg\max_{\|\beta\|_p = 1} \min_i y_i \beta^T h(x_i) \qquad (14)$$

If the $l_p$-margin-maximizing separating hyper-plane is not unique, then $\frac{\hat{\beta}^{(p)}(c)}{c}$ may have multiple convergence points, but they will all represent $l_p$-margin-maximizing separating hyper-planes.

Proof: This proof applies for both $C_e$ and $C_l$, given the property in (5).

Consider two separating candidates $\beta_1$ and $\beta_2$ such that $\|\beta_1\|_p = \|\beta_2\|_p = 1$. Assume that $\beta_1$ separates better, i.e.:

$$m_1 := \min_i y_i \beta_1^T h(x_i) \;>\; m_2 := \min_i y_i \beta_2^T h(x_i) \;>\; 0$$

Then we have the following simple lemma:

Lemma 4. There exists some $D = D(m_1, m_2)$ such that for all $d > D$, $d\beta_1$ incurs smaller loss than $d\beta_2$; in other words:

$$\sum_i C(y_i, d\beta_1^T h(x_i)) < \sum_i C(y_i, d\beta_2^T h(x_i))$$

Given this lemma, we can now prove that any convergence point of $\frac{\hat{\beta}^{(p)}(c)}{c}$ must be an $l_p$-margin-maximizing separator. Assume $\beta^*$ is a convergence point of $\frac{\hat{\beta}^{(p)}(c)}{c}$. Denote its minimal margin on the data by $m^*$. If the data is separable, clearly $m^* > 0$ (since otherwise the loss of $d\beta^*$ does not even converge to 0 as $d \to \infty$). Now, assume some $\tilde{\beta}$ with $\|\tilde{\beta}\|_p = 1$ has bigger minimal margin $\tilde{m} > m^*$. By continuity of the minimal margin in β, there exists some open neighborhood of $\beta^*$:

$$N_{\beta^*} = \{\beta : \|\beta - \beta^*\|_2 < \delta\}$$

and an $\epsilon > 0$, such that:

$$\min_i y_i \beta^T h(x_i) < \tilde{m} - \epsilon, \quad \forall \beta \in N_{\beta^*}$$

Now by the lemma we get that there exists some $D = D(\tilde{m}, \tilde{m} - \epsilon)$ such that $d\tilde{\beta}$ incurs smaller loss than $d\beta$ for any $d > D$, $\beta \in N_{\beta^*}$. Therefore $\beta^*$ cannot be a convergence point of $\frac{\hat{\beta}^{(p)}(c)}{c}$.

We conclude that any convergence point of the sequence $\frac{\hat{\beta}^{(p)}(c)}{c}$ must be an $l_p$-margin-maximizing separator. If the margin-maximizing separator is unique then it is the only possible convergence point, and therefore:

$$\hat{\beta}^{(p)} := \lim_{c \to \infty} \frac{\hat{\beta}^{(p)}(c)}{c} = \arg\max_{\|\beta\|_p = 1} \min_i y_i \beta^T h(x_i)$$

Proof of Lemma: Using (5) and the definition of $C_e$, we get for both loss functions:

$$\sum_i C(y_i, d\beta_1^T h(x_i)) \leq n \exp(-d\, m_1)$$

Now, since $\beta_1$ separates better, we can find our desired

$$D = D(m_1, m_2) = \frac{\log n + \log 2}{m_1 - m_2}$$

such that:

$$\forall d > D, \quad n \exp(-d\, m_1) < 0.5 \exp(-d\, m_2)$$

And using (5) and the definition of $C_e$ again we can write:

$$0.5 \exp(-d\, m_2) \leq \sum_i C(y_i, d\beta_2^T h(x_i))$$

Combining these three inequalities we get our desired result:

$$\forall d > D, \quad \sum_i C(y_i, d\beta_1^T h(x_i)) < \sum_i C(y_i, d\beta_2^T h(x_i))$$

We thus conclude that if the $l_p$-margin-maximizing separating hyper-plane is unique, the normalized constrained solution converges to it. In the case that the margin-maximizing separating hyper-plane is not unique, this theorem can easily be generalized to characterize a unique solution by defining "tie-breakers": if the minimal margin is the same, then the second minimal margin determines which model separates better, and so on. Only in the case that the whole order statistics of the $l_p$ margins is common to many solutions can there really be more than one convergence point for $\frac{\hat{\beta}^{(p)}(c)}{c}$.

5.1 Implications of Theorem 3

Boosting implications

Combined with our results from section 4, Theorem 3 indicates that the normalized boosting path $\frac{\beta^{(t)}}{\sum_{u \leq t} \alpha_u}$, with either $C_e$ or $C_l$ used as loss, approximately converges to a separating hyper-plane $\hat{\beta}$ which attains:

$$\max_{\|\beta\|_1 = 1} \min_i y_i \beta^T h(x_i) = \max_{\|\beta\|_1 = 1} \min_i y_i\, d_i\, \|\beta\|_2, \qquad (15)$$

where $d_i$ is the Euclidean distance from the training point to the separating hyper-plane. In other words, it maximizes Euclidean distance scaled by an $l_2$ norm. As we have mentioned already, this implies that the asymptotic boosting solution will tend to be sparse in representation, due to the fact that for fixed $l_1$ norm, the $l_2$ norm of vectors that have many 0 entries will generally be larger. We conjecture that this asymptotic solution $\hat{\beta} = \lim_{c \to \infty} \hat{\beta}^{(1)}(c)/c$ will have at most n (the number of observations) non-zero coefficients. This in fact holds for squared error loss, where there always exists a finite optimal solution $\hat{\beta}^{(1)}$ with at most n non-zero coefficients (see Efron et al. [5]).

Logistic regression implications

Recall that the logistic regression (maximum likelihood) solution is undefined if the data is separable in the Euclidean space spanned by the predictors. Theorem 3 allows us to define a logistic regression solution for separable data, as follows:

1. Set a high constraint value $c_{max}$.
2. Find $\hat{\beta}^{(p)}(c_{max})$, the solution to the logistic regression problem subject to the constraint $\|\beta\|_p \leq c_{max}$. The problem is convex for any $p \geq 1$ and differentiable for any $p > 1$, so interior point methods can be used to solve it.
3. Now you have (approximately) the $l_p$-margin-maximizing solution for this data, described by $\frac{\hat{\beta}^{(p)}(c_{max})}{c_{max}}$.

This is a solution to the original problem in the sense that it is, approximately, the convergence point of the normalized $l_p$-constrained solutions, as the constraint is relaxed. Of course, with our result from Theorem 3 it would probably make more sense to simply find the optimal separating hyper-plane directly: this is a linear programming problem for $l_1$ separation and a quadratic programming problem for $l_2$ separation. We can then consider this optimal separator as a logistic regression solution for the separable data.
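As a sketch of the linear-programming route for $l_1$ separation, the following sets up $\max_{\|\beta\|_1 = 1} \min_i y_i \beta^T h(x_i)$ with scipy, splitting β into positive and negative parts; this is our own small formulation of the LP mentioned above, and it assumes the data passed in are separable.

```python
import numpy as np
from scipy.optimize import linprog

def l1_margin_maximizer(H, y):
    """Find beta maximizing min_i y_i * H[i] @ beta subject to ||beta||_1 = 1.

    Writes beta = bp - bm with bp, bm >= 0 and solves the LP
        max m  s.t.  y_i * H[i] @ (bp - bm) >= m,  sum(bp + bm) = 1.
    Assumes (H, y) are linearly separable in the given dictionary.
    """
    n, J = H.shape
    c = np.concatenate([np.zeros(2 * J), [-1.0]])             # minimize -m
    A_ub = np.hstack([-y[:, None] * H, y[:, None] * H, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(2 * J), [0.0]])[None, :]   # ||beta||_1 = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * (2 * J) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    beta = res.x[:J] - res.x[J:2 * J]
    return beta, -res.fun                                     # separator and its l1 margin
```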

6 Examples

6.1 Spam dataset

We now know that if the data are separable and we let boosting run forever, we will approach the same "optimal" separator for both $C_e$ and $C_l$. However, if we stop early, or if the data is not separable, the behavior of the two loss functions may differ significantly, since $C_e$ weighs negative margins exponentially, while $C_l$ is approximately linear in the margin for large negative margins (see [8] for detailed discussion). Consequently, we can expect $C_e$ to concentrate more on the "hard" training data, in particular in the non-separable case.

Figure 7 illustrates the behavior of ε-boosting with both loss functions, as well as that of AdaBoost, on the spam dataset (57 predictors, binary response). We used 10-node trees and ε = 0.1. The left plot shows the minimal margin as a function of the $l_1$ norm of the coefficient vector $\|\beta\|_1$. Binomial loss creates a bigger minimal margin initially, but the minimal margins for both loss functions converge asymptotically. AdaBoost initially lags behind but catches up nicely and reaches the same minimal margin asymptotically. The right plot shows the test error as the iterations proceed, illustrating that both ε-methods indeed seem to over-fit eventually, even as their "separation" (minimal margin) is still improving. AdaBoost did not significantly over-fit in the 1000 iterations it was allowed to run, but it obviously would have if it were allowed to run on.

[Figure 7: Behavior of boosting with the two loss functions on the spam dataset: minimal margins (left) and test error (right) for exponential loss, logistic loss and AdaBoost.]

6.2 Simulated data

To make a more educated comparison and more compelling visualization, we have constructed an example of separation of 2-dimensional data using an 8-th degree polynomial dictionary (45 functions). The data consists of 50 observations of each class, drawn from a mixture of Gaussians, and presented in Figure 8. Also presented, in the solid line, is the optimal $l_1$ separator for this data in this dictionary (easily calculated as a linear programming problem; note the difference from the $l_2$-optimal decision boundary, presented in section 7.2, Figure 11). The optimal $l_1$ separator has only 12 non-zero coefficients out of 45.

We ran an ε-boosting algorithm on this dataset, using the logistic log-likelihood loss $C_l$, with ε = 0.001, and Figure 8 shows two of the models generated after $10^5$ and $3 \cdot 10^6$ iterations. We see that the models seem to converge to the optimal separator. A different view of this convergence is given in Figure 9, where we see two measures of convergence: the minimal margin (left) and the $l_1$-norm distance between the normalized models (right), given by:

$$\sum_j \left| \hat{\beta}_j - \frac{\beta_j^{(t)}}{\|\beta^{(t)}\|_1} \right|$$

where $\hat{\beta}$ is the optimal separator with $l_1$ norm 1 and $\beta^{(t)}$ is the boosting model after t iterations. We can conclude that on this simple artificial example we get nice convergence of the logistic-boosting model path to the $l_1$-margin-maximizing separating hyper-plane.

We can also use this example to illustrate the similarity between the boosted path and the path of $l_1$-optimal solutions, as we have discussed in section 4. Figure 10 shows the class decision boundaries for 4 models generated along the boosting path, compared to the optimal solutions to the constrained logistic regression problem with the same bound on the $l_1$ norm of the coefficient vector.
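The dictionary size quoted in this example matches the count formula from section 1; a small sketch, using scikit-learn's polynomial feature expansion as a stand-in for the paper's polynomial dictionary:

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

d, p = 2, 8
X = np.random.default_rng(2).normal(size=(100, d))      # 2-dimensional inputs
dictionary = PolynomialFeatures(degree=p).fit_transform(X)

print(dictionary.shape[1], comb(p + d, d))   # both 45: J = C(p + d, d) for degree 8, d = 2
```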

[Figure 8: Artificial data set with the $l_1$-margin-maximizing separator (solid), and boosting models after $10^5$ iterations (dashed) and $3 \cdot 10^6$ iterations (dotted) using ε = 0.001. We observe the convergence of the boosting separator to the optimal separator.]

We observe the clear similarities in the way the solutions evolve and converge to the optimal $l_1$ separator. The fact that they differ (in some cases significantly) is not surprising if we recall the monotonicity condition presented in section 4 for exact correspondence between the two model paths. In this case, if we look at the coefficient paths (not shown), we observe that the monotonicity condition is consistently violated in the low norm ranges, and hence we can expect the paths to be similar in spirit but not identical.

7 Discussion

7.1 Regularized and non-regularized behavior of the loss functions

We can now summarize what we have learned about boosting from the previous sections:

- Boosting approximately follows the path of $l_1$-regularized models for its loss criterion.
- If the loss criterion is the exponential loss of AdaBoost or the binomial log-likelihood loss of logistic regression, then the $l_1$-regularized model converges to an $l_1$-margin-maximizing separating hyper-plane, if the data are separable in the span of the weak learners.

[Figure 9: Two measures of convergence of the boosting model path to the optimal $l_1$ separator: minimal margin (left) and $l_1$ distance between the normalized boosting coefficient vector and the optimal model (right).]

We may ask which of these two points is the key to the success of boosting approaches. One empirical clue to answering this question can be found in the work of Leo Breiman [2], who programmed an algorithm to directly maximize the margins. His results were that his algorithm consistently got significantly higher minimal margins than AdaBoost on many data sets, but had slightly worse prediction performance. His conclusion was that margin maximization is not the key to AdaBoost's success.

From a statistical perspective we can embrace this conclusion, as reflecting the notion that non-regularized models in high-dimensional predictor space are bound to be over-fitted. Margin maximization is a non-regularized objective, both intuitively and more rigorously by our results from the previous section. Thus we would expect the margin-maximizing solutions to perform worse than regularized models; in the case of boosting, regularization would correspond to early stopping of the boosting algorithm.

7.2 Boosting and SVMs as regularized optimization in high-dimensional predictor spaces

Our exposition has led us to view boosting as an approximate way to solve the regularized optimization problem:

$$\min_\beta \sum_i C(y_i, \beta^T h(x_i)) + \lambda \|\beta\|_1 \qquad (16)$$

which converges as $\lambda \to 0$ to $\hat{\beta}^{(1)}$, if our loss is $C_e$ or $C_l$. In general, the loss C can be any convex differentiable loss and should be defined to match the problem domain.
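In low dimensions, problem (16) with the binomial loss can of course be solved directly rather than approximated by boosting; below is a minimal sketch with scikit-learn, whose `C` parameter plays the role of 1/λ. The function name and the λ grid are illustrative choices of ours, not anything used in the paper.

```python
from sklearn.linear_model import LogisticRegression

# l1-regularized logistic regression:
#   min_beta  sum_i C_l(y_i, beta' h(x_i)) + lambda * ||beta||_1
def l1_logistic_path(H, y, lambdas=(10.0, 1.0, 0.1, 0.01)):
    coefs = []
    for lam in lambdas:
        model = LogisticRegression(penalty='l1', C=1.0 / lam,
                                   solver='liblinear', fit_intercept=False)
        coefs.append(model.fit(H, y).coef_.ravel())
    return coefs   # as lambda decreases, ||beta||_1 grows along the regularized path
```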

[Figure 10: Comparison of decision boundaries of boosting models (broken) and of optimal constrained solutions with the same norm (full), at $l_1$ norms 20, 350, 2701 and 5401.]

Support vector machines can be described as solving the regularized optimization problem (see for example [8], chapter 12):

$$\min_\beta \sum_i (1 - y_i \beta^T h(x_i))_+ + \lambda \|\beta\|_2^2 \qquad (17)$$

which converges as $\lambda \to 0$ to the non-regularized support vector machine solution, i.e. the optimal Euclidean separator, which we denoted by $\hat{\beta}^{(2)}$.

An interesting connection exists between these two approaches, in that they allow us to solve the regularized optimization problem in high-dimensional predictor space:

- We are able to solve the $l_1$-regularized problem approximately in very high dimension via boosting, by applying the approximate coordinate descent "trick" of building a decision tree (or otherwise greedily selecting a weak learner) based on re-weighted versions of the data.
- Support vector machines facilitate a different trick for solving the regularized optimization problem in high-dimensional predictor space: the "kernel trick". If our dictionary H spans a Reproducing Kernel Hilbert Space, then RKHS theory tells us we can find the regularized solutions by solving an n-dimensional problem, in the space spanned by the kernel representers $\{K(x_i, x)\}$. This fact is by no means limited to the hinge loss of (17), and applies to any convex loss. We concentrate our discussion on SVMs (and hence hinge loss) only since it is by far the most common and well-known application of this result.

So we can view both boosting and SVMs as methods that allow us to fit regularized models in high-dimensional predictor space using a computational shortcut. The complexity of the model built is controlled by the regularization. In that, they are distinctly different from traditional statistical approaches for building models in high dimension, which start by reducing the dimensionality of the problem so that standard tools (e.g. Newton's method) can be applied to it, and also to make over-fitting less of a concern. While the merits of regularization without dimensionality reduction, like ridge regression or the lasso, are well documented in statistics, computational issues make it impractical for the size of problems typically solved via boosting or SVMs, without computational tricks. We believe that this difference may be a significant reason for the enduring success of boosting and SVMs in data modeling, i.e.: working in high dimension and regularizing is statistically preferable to a two-step procedure of first reducing the dimension, then fitting a model in the reduced space.

It is also interesting to consider the differences between the two approaches, in the loss (flexible vs. hinge loss), the penalty ($l_1$ vs. $l_2$), and the type of dictionary used (usually trees vs. RKHS). These differences indicate that the two approaches will be useful for different situations. For example, if the true model has a sparse representation in the chosen dictionary, then $l_1$ regularization may be warranted; if the form of the true model facilitates description of the class probabilities via a logistic-linear model, then the logistic loss $C_l$ is the best loss to use; and so on.

The computational tricks for both SVMs and boosting limit the kind of regularization that can be used for fitting in high-dimensional space. However, the problems can still be formulated and solved for different regularization approaches, as long as the dimensionality is low enough:

- Support vector machines can be fitted with an $l_1$ penalty, by solving the 1-norm version of the SVM problem, equivalent to replacing the $l_2$ penalty in (17) with an $l_1$ penalty. In fact, the 1-norm SVM is used quite widely, because it is more easily solved in the linear, non-RKHS, situation (as a linear program, compared to the standard SVM which is a quadratic program) and tends to give sparser solutions in the primal domain.
- Similarly, we describe below an approach for developing a boosting algorithm for fitting approximate $l_2$-regularized models.

Both of these methods are interesting and potentially useful. However, they lack what is arguably the most attractive property of the standard boosting and SVM algorithms: a computational trick to allow fitting in high dimensions.

An $l_2$ boosting algorithm

We can use our understanding of the relation of boosting to regularization, and Theorem 3, to formulate $l_p$-boosting algorithms, which will approximately follow the path of $l_p$-regularized solutions and converge to the corresponding $l_p$-margin-maximizing separating hyper-planes. Of particular interest is the $l_2$ case, since Theorem 3 implies that $l_2$-constrained fitting using $C_l$ or $C_e$ will build a regularized path to the optimal separating hyper-plane in the Euclidean (or SVM) sense.

To construct an $l_2$ boosting algorithm, consider the equivalent optimization problem (12), and change the step-size constraint to an $l_2$ constraint:

$$\|\beta\|_2 - \|\beta_0\|_2 \leq \epsilon$$

It is easy to see that the first-order solution entails selecting for modification the coordinate which maximizes:

$$\left| \frac{\nabla C(\beta_0)_k}{\beta_{0,k}} \right|$$

and that, subject to monotonicity, this will lead to a correspondence to the locally $l_2$-optimal direction. Following this intuition, we can construct an $l_2$ boosting algorithm by changing only step 2(c) of our generic boosting algorithm of section 2 to:

2(c)* Identify $j_t$ which maximizes $\frac{|\sum_i w_i h_{j_t}(x_i)|}{|\beta_{j_t}|}$.

Note that the need to consider the current coefficient (in the denominator) makes the $l_2$ algorithm appropriate for toy examples only. In situations where the dictionary of weak learners is prohibitively large, we will need to figure out a trick like the one we presented in section 2.1, to allow us to make an approximate search for the optimizer of step 2(c)*. Another problem in applying this algorithm to large problems is that we never choose the same dictionary function twice, until all have non-zero coefficients. This is due to the use of the $l_2$ penalty, which implies that the current coefficient value affects the rate at which the penalty term is increasing. In particular, if $\beta_j = 0$ then increasing it causes the penalty term $\|\beta\|_2$ to increase at rate 0, to first order (which is all the algorithm is considering).

The convergence of our $l_2$ boosting algorithm on the artificial dataset of section 6.2 is illustrated in Figure 11. We observe that the $l_2$ boosting models do indeed approach the optimal $l_2$ separator. It is interesting to note the significant difference between the optimal $l_2$ separator as presented in Figure 11 and the optimal $l_1$ separator presented in section 6.2 (Figure 8).
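The only change relative to the generic algorithm sketched in section 2 is the coordinate-selection rule; a minimal version is below, with a small constant guarding the zero-coefficient case (our own device, reflecting the remark that zero coefficients are preferred to first order).

```python
import numpy as np

def select_coordinate_l2(H, w, beta, tiny=1e-12):
    """Step 2(c)*: pick the coordinate maximizing |sum_i w_i h_j(x_i)| / |beta_j|.

    Coordinates with beta_j = 0 increase ||beta||_2 at rate 0 to first order,
    so they dominate the ratio; `tiny` stands in for that infinite preference.
    """
    corr = H.T @ w                                   # gradient-based correlations, as in step 2(c)
    ratio = np.abs(corr) / np.maximum(np.abs(beta), tiny)
    return np.argmax(ratio)

# Inside the generic loop, replace
#     j = np.argmax(np.abs(H.T @ w))         # l1 (standard boosting) selection
# with
#     j = select_coordinate_l2(H, w, beta)   # l2 boosting selection
```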

[Figure 11: Artificial data set with the $l_2$-margin-maximizing separator (solid), and $l_2$-boosting models after $5 \cdot 10^6$ iterations (dashed) and $10^8$ iterations (dotted). We observe the convergence of the boosting separator to the optimal separator.]

8 Summary and future work

In this paper we have introduced a new view of boosting in general, and two-class boosting in particular, comprised of two main points:

- Following [5, 9], we have described boosting as approximate $l_1$-regularized optimization.
- We have shown that the exact $l_1$-regularized solutions converge to an $l_1$-margin-maximizing separating hyper-plane.

We hope our results will help in better understanding how and why boosting works. It is an interesting and challenging task to separate the effects of the different components of a boosting algorithm:

- Loss criterion
- Dictionary and greedy learning method
- Line search / slow learning

and relate them to its success in different scenarios. The implicit $l_1$ regularization in boosting may also contribute to its success: it has been shown that in some situations $l_1$ regularization is inherently superior to others (see [4]).

An important issue when analyzing boosting is over-fitting in the noisy data case. To deal with over-fitting, Raetsch et al. [13] propose several regularization methods and generalizations of the original AdaBoost algorithm to achieve a soft margin by introducing slack variables. Our results indicate that the models along the boosting path can be regarded as $l_1$-regularized versions of the optimal separator, hence regularization can be done more directly and naturally by stopping the boosting iterations early. It is essentially a choice of the constraint parameter c.

Many other questions arise from our view of boosting. Among the issues to be considered:

- Is there a similar "separator" view of multi-class boosting? We have some tentative results to indicate that this might be the case if the boosting problem is formulated properly.
- Can the constrained optimization view of boosting help in producing generalization error bounds for boosting that would be more tight than the current existing ones?

Acknowledgments

We thank Stephen Boyd, Brad Efron, Jerry Friedman, Robert Schapire and Rob Tibshirani for helpful discussions. This work is partially supported by a Stanford graduate fellowship, grant DMS from the National Science Foundation, and grant ROI-CA from the National Institutes of Health.

References

[1] Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.
[2] Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation 7.
[3] Collins, M., Schapire, R.E. & Singer, Y. (2000). Logistic regression, AdaBoost and Bregman distances. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory.
[4] Donoho, D., Johnstone, I., Kerkyacharian, G. & Picard, D. (1995). Wavelet shrinkage: asymptopia? (with discussion). J. Royal Statist. Soc. 57.
[5] Efron, B., Hastie, T., Johnstone, I.M. & Tibshirani, R. (2002). Least Angle Regression. Technical report, Department of Statistics, Stanford University.
[6] Freund, Y. & Schapire, R.E. (1995). A decision theoretic generalization of on-line learning and an application to boosting. Proceedings of the 2nd European Conference on Computational Learning Theory.
[7] Friedman, J.H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, Vol. 29, No. 5.
[8] Friedman, J.H., Hastie, T. & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics 28.
[9] Hastie, T., Tibshirani, R. & Friedman, J.H. (2001). Elements of Statistical Learning. Springer-Verlag, New York.
[10] Mangasarian, O.L. (1999). Arbitrary-norm separating plane. Operations Research Letters, pp. 15-23.
[11] Mason, L., Baxter, J., Bartlett, P. & Frean, M. (1999). Boosting algorithms as gradient descent in function space. Neural Information Processing Systems 12.


More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

Natural Language Processing and Information Retrieval

Natural Language Processing and Information Retrieval Natural Language Processng and Informaton Retreval Support Vector Machnes Alessandro Moschtt Department of nformaton and communcaton technology Unversty of Trento Emal: moschtt@ds.untn.t Summary Support

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix Lectures - Week 4 Matrx norms, Condtonng, Vector Spaces, Lnear Independence, Spannng sets and Bass, Null space and Range of a Matrx Matrx Norms Now we turn to assocatng a number to each matrx. We could

More information

Advanced Introduction to Machine Learning

Advanced Introduction to Machine Learning Advanced Introducton to Machne Learnng 10715, Fall 2014 The Kernel Trck, Reproducng Kernel Hlbert Space, and the Representer Theorem Erc Xng Lecture 6, September 24, 2014 Readng: Erc Xng @ CMU, 2014 1

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

Numerical Heat and Mass Transfer

Numerical Heat and Mass Transfer Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

Homework Assignment 3 Due in class, Thursday October 15

Homework Assignment 3 Due in class, Thursday October 15 Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.

More information

Linear Classification, SVMs and Nearest Neighbors

Linear Classification, SVMs and Nearest Neighbors 1 CSE 473 Lecture 25 (Chapter 18) Lnear Classfcaton, SVMs and Nearest Neghbors CSE AI faculty + Chrs Bshop, Dan Klen, Stuart Russell, Andrew Moore Motvaton: Face Detecton How do we buld a classfer to dstngush

More information

A Robust Method for Calculating the Correlation Coefficient

A Robust Method for Calculating the Correlation Coefficient A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal

More information

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z )

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z ) C4B Machne Learnng Answers II.(a) Show that for the logstc sgmod functon dσ(z) dz = σ(z) ( σ(z)) A. Zsserman, Hlary Term 20 Start from the defnton of σ(z) Note that Then σ(z) = σ = dσ(z) dz = + e z e z

More information

CSE 252C: Computer Vision III

CSE 252C: Computer Vision III CSE 252C: Computer Vson III Lecturer: Serge Belonge Scrbe: Catherne Wah LECTURE 15 Kernel Machnes 15.1. Kernels We wll study two methods based on a specal knd of functon k(x, y) called a kernel: Kernel

More information

CSE 546 Midterm Exam, Fall 2014(with Solution)

CSE 546 Midterm Exam, Fall 2014(with Solution) CSE 546 Mdterm Exam, Fall 014(wth Soluton) 1. Personal nfo: Name: UW NetID: Student ID:. There should be 14 numbered pages n ths exam (ncludng ths cover sheet). 3. You can use any materal you brought:

More information

Chapter 13: Multiple Regression

Chapter 13: Multiple Regression Chapter 13: Multple Regresson 13.1 Developng the multple-regresson Model The general model can be descrbed as: It smplfes for two ndependent varables: The sample ft parameter b 0, b 1, and b are used to

More information

Module 9. Lecture 6. Duality in Assignment Problems

Module 9. Lecture 6. Duality in Assignment Problems Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Learning Theory: Lecture Notes

Learning Theory: Lecture Notes Learnng Theory: Lecture Notes Lecturer: Kamalka Chaudhur Scrbe: Qush Wang October 27, 2012 1 The Agnostc PAC Model Recall that one of the constrants of the PAC model s that the data dstrbuton has to be

More information

Supporting Information

Supporting Information Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they

More information

MMA and GCMMA two methods for nonlinear optimization

MMA and GCMMA two methods for nonlinear optimization MMA and GCMMA two methods for nonlnear optmzaton Krster Svanberg Optmzaton and Systems Theory, KTH, Stockholm, Sweden. krlle@math.kth.se Ths note descrbes the algorthms used n the author s 2007 mplementatons

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Maximal Margin Classifier

Maximal Margin Classifier CS81B/Stat41B: Advanced Topcs n Learnng & Decson Makng Mamal Margn Classfer Lecturer: Mchael Jordan Scrbes: Jana van Greunen Corrected verson - /1/004 1 References/Recommended Readng 1.1 Webstes www.kernel-machnes.org

More information

Kristin P. Bennett. Rensselaer Polytechnic Institute

Kristin P. Bennett. Rensselaer Polytechnic Institute Support Vector Machnes and Other Kernel Methods Krstn P. Bennett Mathematcal Scences Department Rensselaer Polytechnc Insttute Support Vector Machnes (SVM) A methodology for nference based on Statstcal

More information

Lagrange Multipliers Kernel Trick

Lagrange Multipliers Kernel Trick Lagrange Multplers Kernel Trck Ncholas Ruozz Unversty of Texas at Dallas Based roughly on the sldes of Davd Sontag General Optmzaton A mathematcal detour, we ll come back to SVMs soon! subject to: f x

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

Semi-supervised Classification with Active Query Selection

Semi-supervised Classification with Active Query Selection Sem-supervsed Classfcaton wth Actve Query Selecton Jao Wang and Swe Luo School of Computer and Informaton Technology, Beng Jaotong Unversty, Beng 00044, Chna Wangjao088@63.com Abstract. Labeled samples

More information

Linear Feature Engineering 11

Linear Feature Engineering 11 Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19

More information

Structure and Drive Paul A. Jensen Copyright July 20, 2003

Structure and Drive Paul A. Jensen Copyright July 20, 2003 Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.

More information

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018 INF 5860 Machne learnng for mage classfcaton Lecture 3 : Image classfcaton and regresson part II Anne Solberg January 3, 08 Today s topcs Multclass logstc regresson and softma Regularzaton Image classfcaton

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

Lecture 12: Classification

Lecture 12: Classification Lecture : Classfcaton g Dscrmnant functons g The optmal Bayes classfer g Quadratc classfers g Eucldean and Mahalanobs metrcs g K Nearest Neghbor Classfers Intellgent Sensor Systems Rcardo Guterrez-Osuna

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012 MLE and Bayesan Estmaton Je Tang Department of Computer Scence & Technology Tsnghua Unversty 01 1 Lnear Regresson? As the frst step, we need to decde how we re gong to represent the functon f. One example:

More information

Neural networks. Nuno Vasconcelos ECE Department, UCSD

Neural networks. Nuno Vasconcelos ECE Department, UCSD Neural networs Nuno Vasconcelos ECE Department, UCSD Classfcaton a classfcaton problem has two types of varables e.g. X - vector of observatons (features) n the world Y - state (class) of the world x X

More information

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #16 Scribe: Yannan Wang April 3, 2014

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #16 Scribe: Yannan Wang April 3, 2014 COS 511: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture #16 Scrbe: Yannan Wang Aprl 3, 014 1 Introducton The goal of our onlne learnng scenaro from last class s C comparng wth best expert and

More information

MA 323 Geometric Modelling Course Notes: Day 13 Bezier Curves & Bernstein Polynomials

MA 323 Geometric Modelling Course Notes: Day 13 Bezier Curves & Bernstein Polynomials MA 323 Geometrc Modellng Course Notes: Day 13 Bezer Curves & Bernsten Polynomals Davd L. Fnn Over the past few days, we have looked at de Casteljau s algorthm for generatng a polynomal curve, and we have

More information

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan Kernels n Support Vector Machnes Based on lectures of Martn Law, Unversty of Mchgan Non Lnear separable problems AND OR NOT() The XOR problem cannot be solved wth a perceptron. XOR Per Lug Martell - Systems

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

SDMML HT MSc Problem Sheet 4

SDMML HT MSc Problem Sheet 4 SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be

More information

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal Inner Product Defnton 1 () A Eucldean space s a fnte-dmensonal vector space over the reals R, wth an nner product,. Defnton 2 (Inner Product) An nner product, on a real vector space X s a symmetrc, blnear,

More information

Supplement: Proofs and Technical Details for The Solution Path of the Generalized Lasso

Supplement: Proofs and Technical Details for The Solution Path of the Generalized Lasso Supplement: Proofs and Techncal Detals for The Soluton Path of the Generalzed Lasso Ryan J. Tbshran Jonathan Taylor In ths document we gve supplementary detals to the paper The Soluton Path of the Generalzed

More information

On the Multicriteria Integer Network Flow Problem

On the Multicriteria Integer Network Flow Problem BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 5, No 2 Sofa 2005 On the Multcrtera Integer Network Flow Problem Vassl Vasslev, Marana Nkolova, Maryana Vassleva Insttute of

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

Support Vector Machines

Support Vector Machines CS 2750: Machne Learnng Support Vector Machnes Prof. Adrana Kovashka Unversty of Pttsburgh February 17, 2016 Announcement Homework 2 deadlne s now 2/29 We ll have covered everythng you need today or at

More information

NUMERICAL DIFFERENTIATION

NUMERICAL DIFFERENTIATION NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

Structural Extensions of Support Vector Machines. Mark Schmidt March 30, 2009

Structural Extensions of Support Vector Machines. Mark Schmidt March 30, 2009 Structural Extensons of Support Vector Machnes Mark Schmdt March 30, 2009 Formulaton: Bnary SVMs Multclass SVMs Structural SVMs Tranng: Subgradents Cuttng Planes Margnal Formulatons Mn-Max Formulatons

More information

1 The Mistake Bound Model

1 The Mistake Bound Model 5-850: Advanced Algorthms CMU, Sprng 07 Lecture #: Onlne Learnng and Multplcatve Weghts February 7, 07 Lecturer: Anupam Gupta Scrbe: Bryan Lee,Albert Gu, Eugene Cho he Mstake Bound Model Suppose there

More information

Linear Regression Analysis: Terminology and Notation

Linear Regression Analysis: Terminology and Notation ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented

More information

Explaining the Stein Paradox

Explaining the Stein Paradox Explanng the Sten Paradox Kwong Hu Yung 1999/06/10 Abstract Ths report offers several ratonale for the Sten paradox. Sectons 1 and defnes the multvarate normal mean estmaton problem and ntroduces Sten

More information

Multilayer Perceptron (MLP)

Multilayer Perceptron (MLP) Multlayer Perceptron (MLP) Seungjn Cho Department of Computer Scence and Engneerng Pohang Unversty of Scence and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjn@postech.ac.kr 1 / 20 Outlne

More information

Simulated Power of the Discrete Cramér-von Mises Goodness-of-Fit Tests

Simulated Power of the Discrete Cramér-von Mises Goodness-of-Fit Tests Smulated of the Cramér-von Mses Goodness-of-Ft Tests Steele, M., Chaselng, J. and 3 Hurst, C. School of Mathematcal and Physcal Scences, James Cook Unversty, Australan School of Envronmental Studes, Grffth

More information

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.

More information

The Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD

The Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD he Gaussan classfer Nuno Vasconcelos ECE Department, UCSD Bayesan decson theory recall that we have state of the world X observatons g decson functon L[g,y] loss of predctng y wth g Bayes decson rule s

More information

Classification as a Regression Problem

Classification as a Regression Problem Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mnng Massve Datasets Jure Leskovec, Stanford Unversty http://cs246.stanford.edu 2/19/18 Jure Leskovec, Stanford CS246: Mnng Massve Datasets, http://cs246.stanford.edu 2 Hgh dm. data Graph data Infnte

More information

Exercises. 18 Algorithms

Exercises. 18 Algorithms 18 Algorthms Exercses 0.1. In each of the followng stuatons, ndcate whether f = O(g), or f = Ω(g), or both (n whch case f = Θ(g)). f(n) g(n) (a) n 100 n 200 (b) n 1/2 n 2/3 (c) 100n + log n n + (log n)

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016 U.C. Berkeley CS94: Spectral Methods and Expanders Handout 8 Luca Trevsan February 7, 06 Lecture 8: Spectral Algorthms Wrap-up In whch we talk about even more generalzatons of Cheeger s nequaltes, and

More information

Lecture 20: Lift and Project, SDP Duality. Today we will study the Lift and Project method. Then we will prove the SDP duality theorem.

Lecture 20: Lift and Project, SDP Duality. Today we will study the Lift and Project method. Then we will prove the SDP duality theorem. prnceton u. sp 02 cos 598B: algorthms and complexty Lecture 20: Lft and Project, SDP Dualty Lecturer: Sanjeev Arora Scrbe:Yury Makarychev Today we wll study the Lft and Project method. Then we wll prove

More information

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) Maxmum Lkelhood Estmaton (MLE) Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Wnter 01 UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y

More information

Foundations of Arithmetic

Foundations of Arithmetic Foundatons of Arthmetc Notaton We shall denote the sum and product of numbers n the usual notaton as a 2 + a 2 + a 3 + + a = a, a 1 a 2 a 3 a = a The notaton a b means a dvdes b,.e. ac = b where c s an

More information

Chapter Newton s Method

Chapter Newton s Method Chapter 9. Newton s Method After readng ths chapter, you should be able to:. Understand how Newton s method s dfferent from the Golden Secton Search method. Understand how Newton s method works 3. Solve

More information

Lecture 17: Lee-Sidford Barrier

Lecture 17: Lee-Sidford Barrier CSE 599: Interplay between Convex Optmzaton and Geometry Wnter 2018 Lecturer: Yn Tat Lee Lecture 17: Lee-Sdford Barrer Dsclamer: Please tell me any mstake you notced. In ths lecture, we talk about the

More information

Lecture 3: Dual problems and Kernels

Lecture 3: Dual problems and Kernels Lecture 3: Dual problems and Kernels C4B Machne Learnng Hlary 211 A. Zsserman Prmal and dual forms Lnear separablty revsted Feature mappng Kernels for SVMs Kernel trck requrements radal bass functons SVM

More information

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also

More information

Section 8.3 Polar Form of Complex Numbers

Section 8.3 Polar Form of Complex Numbers 80 Chapter 8 Secton 8 Polar Form of Complex Numbers From prevous classes, you may have encountered magnary numbers the square roots of negatve numbers and, more generally, complex numbers whch are the

More information

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4) I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes

More information