Boosting as a Regularized Path to a Maximum Margin Classifier


Saharon Rosset, Ji Zhu, Trevor Hastie
Department of Statistics, Stanford University, Stanford, CA
{saharon,jzhu,hastie}@stat.stanford.edu

May 5, 2003

Abstract

In this paper we study boosting methods from a new perspective. We build on recent work by Efron et al. to show that boosting approximately (and in some cases exactly) minimizes its loss criterion with an $l_1$ constraint on the coefficient vector. This helps understand the success of boosting with early stopping as regularized fitting of the loss criterion. For the two most commonly used criteria (exponential and binomial log-likelihood), we further show that as the constraint is relaxed, or equivalently as the boosting iterations proceed, the solution converges (in the separable case) to an $l_1$-optimal separating hyper-plane. We prove that this $l_1$-optimal separating hyper-plane has the property of maximizing the minimal $l_1$-margin of the training data, as defined in the boosting literature. An interesting fundamental similarity between boosting and kernel support vector machines emerges, as both can be described as methods for regularized optimization in high-dimensional predictor space, utilizing a computational trick to make the calculation practical, and converging to margin-maximizing solutions. While this statement describes SVMs exactly, it applies to boosting only approximately.

Key-words: Boosting, regularized optimization, support vector machines, margin maximization

1 Introduction and outline

Boosting is a method for iteratively building an additive model

$$F_T(x) = \sum_{t=1}^{T} \alpha_t h_{j_t}(x), \qquad (1)$$

where $h_{j_t} \in \mathcal{H}$, a large (but we will assume finite) dictionary of candidate predictors or "weak learners"; and $h_{j_t}$ is the basis function selected as the best candidate to modify at stage t. The model $F_T$ can equivalently be represented by assigning a coefficient to each dictionary function $h \in \mathcal{H}$ rather than to the selected $h_{j_t}$'s only:

$$F_T(x) = \sum_{j=1}^{J} h_j(x)\, \beta_j^{(T)}, \qquad (2)$$

where $J = |\mathcal{H}|$ and $\beta_j^{(T)} = \sum_{t: j_t = j} \alpha_t$. The β representation allows us to interpret the coefficient vector $\beta^{(T)}$ as a vector in $R^J$ or, equivalently, as the hyper-plane which has $\beta^{(T)}$ as its normal. This interpretation will play a key role in our exposition.

Some examples of common dictionaries are:

- The training variables themselves, in which case $h_j(x) = x_j$. This leads to our additive model $F_T$ being just a linear model in the original data. The number of dictionary functions will be J = d, the dimension of x.
- Polynomial dictionary of degree p, in which case the number of dictionary functions will be $J = \binom{p+d}{d}$.
- Decision trees with up to k terminal nodes, if we limit the split points to data points (or mid-way between data points as CART does). The number of possible trees is bounded from above (trivially) by $J \leq (np)^k 2^{k^2}$. Note that regression trees do not fit into our framework, since they will give $J = \infty$.

The boosting idea was first introduced by Freund and Schapire [6], with their AdaBoost algorithm. AdaBoost and other boosting algorithms have attracted a lot of attention due to their great success in data modeling tasks, and the mechanism which makes them work has been presented and analyzed from several perspectives. Friedman et al. [8] develop a statistical perspective, which ultimately leads to viewing AdaBoost as a gradient-based incremental search for a good additive model (more specifically, it is a coordinate descent algorithm), using the exponential loss function $C(y, F) = \exp(-yF)$, where $y \in \{-1, 1\}$. The gradient boosting [7] and AnyBoost [11] generic algorithms have used this approach to generalize the boosting idea to wider families of problems and loss functions. In particular, [8] have pointed out that the binomial log-likelihood loss $C(y, F) = \log(1 + \exp(-yF))$ is a more natural loss for classification, and is more robust to outliers and misspecified data.

A different analysis of boosting, originating in the machine learning community, concentrates on the effect of boosting on the margins $y_i F(x_i)$. For example, [16] uses margin-based arguments to prove convergence of boosting to perfect classification performance on the training data under general conditions, and to derive bounds on the generalization error (on future, unseen data).

In this paper we combine the two approaches, to conclude that gradient-based boosting can be described, in the separable case, as an approximate margin-maximizing process. The view we develop of boosting as an approximate path of optimal solutions to regularized problems also justifies early stopping in boosting as specifying a value for the regularization parameter.

We consider the problem of minimizing convex loss functions (in particular the exponential and binomial log-likelihood loss functions) over the training data, with an $l_1$ bound on the model coefficients:

$$\hat{\beta}(c) = \arg\min_{\|\beta\|_1 \leq c} \sum_i C(y_i, h(x_i)^T \beta), \qquad (3)$$

where $h(x_i) = [h_1(x_i), h_2(x_i), \ldots, h_J(x_i)]^T$ and $J = |\mathcal{H}|$.

Hastie et al. ([9], chapter 10) have observed that slow gradient-based boosting (i.e. setting $\alpha_t = \epsilon$ for all t in (1), with ε small) tends to follow the penalized path $\hat{\beta}(c)$ as a function of c, under some mild conditions on this path. In other words, using the notation of (2) and (3), this implies that $\|\beta^{(c/\epsilon)} - \hat{\beta}(c)\|$ vanishes with ε, for all (or a wide range of) values of c. Figure 1 illustrates this equivalence between ε-boosting and the optimal solution of (3) on a real-life data set, using squared error loss as the loss function.

[Figure 1: Exact coefficient paths (left) for $l_1$-constrained squared error regression and boosting coefficient paths (right) on the data from a prostate cancer study.]

In this paper we demonstrate this equivalence further and formally state it as a conjecture. Some progress towards proving this conjecture has been made by Efron et al. [5], who prove a weaker "local" result for the case where C is squared error loss, under some mild conditions on the optimal path. We generalize their result to general convex loss functions. Combining the empirical and theoretical evidence, we conclude that boosting can be viewed as an approximate incremental method for following the $l_1$-regularized path.

We then prove that in the separable case, for both the exponential and logistic log-likelihood loss functions, $\hat{\beta}(c)/c$ converges as $c \to \infty$ to an "optimal" separating hyper-plane $\hat{\beta}$ described by:

$$\hat{\beta} = \arg\max_{\|\beta\|_1 = 1} \min_i y_i \beta^T h(x_i) \qquad (4)$$

In other words, $\hat{\beta}$ maximizes the minimal margin among all vectors with $l_1$-norm equal to 1. This result generalizes easily to other $l_p$-norm constraints. For example, if p = 2, then $\hat{\beta}$ describes the optimal separating hyper-plane in the Euclidean sense, i.e. the same one that a non-regularized support vector machine would find.

Combining our two main results, we get the following characterization of boosting: ε-boosting can be described as a gradient-descent search, approximately following the path of $l_1$-constrained optimal solutions to its loss criterion, and converging, in the separable case, to a "margin maximizer" in the $l_1$ sense. Note that boosting with a large dictionary H (in particular if $n < J = |\mathcal{H}|$) guarantees that the data will be separable (except for pathologies), hence separability is a very mild assumption here.

As in the case of support vector machines in high-dimensional feature spaces, the non-regularized "optimal" separating hyper-plane is usually of theoretical interest only, since it typically represents a highly over-fitted model. Thus, we would want to choose a good regularized model. Our results indicate that boosting gives a natural method for doing that, by "stopping early" in the boosting process. Furthermore, they point out the fundamental similarity between boosting and SVMs: both approaches allow us to fit regularized models in high-dimensional predictor space, using a computational trick. They differ in the regularization approach they take (exact $l_2$ regularization for SVMs, approximate $l_1$ regularization for boosting) and in the computational trick that facilitates fitting (the kernel trick for SVMs, coordinate descent for boosting).

1.1 Related work

Schapire et al. [16] have identified the normalized margins as distance from an $l_1$-normed separating hyper-plane. Their results relate the boosting iterations' success to the minimal margin of the combined model. Raetsch et al. [13] take this further using an asymptotic analysis of AdaBoost. They prove that the normalized minimal margin, $\min_i y_i \sum_t \alpha_t h_t(x_i) / \sum_t |\alpha_t|$, is asymptotically equal for both classes. In other words, they prove that the asymptotic separating hyper-plane is equally far away from the closest points on either side. This is a property of the margin-maximizing separating hyper-plane as we define it. Both papers also illustrate the margin-maximizing effects of AdaBoost through experimentation. However, they both stop short of proving the convergence to optimal (margin-maximizing) solutions.

Motivated by our result, Raetsch and Warmuth [14] have recently asserted the margin-maximizing properties of ε-AdaBoost, using a different approach than the one used in this paper. Their results relate only to the asymptotic convergence of infinitesimal AdaBoost, compared to our analysis of the regularized path traced along the way and of a variety of boosting loss functions, which also leads to a convergence result for the binomial log-likelihood loss.

The convergence of boosting to an "optimal" solution from a loss function perspective has been analyzed in several papers. Raetsch et al. [12] and Collins et al. [3] give results and bounds on the convergence of the training-set loss, $\sum_i C(y_i, \sum_t \alpha_t h_t(x_i))$, to its minimum. However, in the separable case convergence of the loss to 0 is inherently different from convergence of the linear separator to the optimal separator. Any solution which separates the two classes perfectly can drive the exponential (or log-likelihood) loss to 0, simply by scaling coefficients up linearly.

2 Boosting as gradient descent

Generic gradient-based boosting algorithms such as [7, 11] attempt to find a good linear combination of the members of some dictionary of basis functions to optimize a given loss function over a sample. This is done by searching, at each iteration, for the basis function which gives the "steepest descent" in the loss, and changing its coefficient accordingly. In other words, this is a coordinate descent algorithm in $R^J$, where we assign one dimension (or coordinate) for the coefficient of each dictionary function.

Assume we have data $\{x_i, y_i\}_{i=1}^n$, a loss (or cost) function C(y, F), and a set of basis functions $\{h_j(x)\}: R^d \to R$. Then all of these algorithms follow the same essential steps:

Algorithm 1: Generic gradient-based boosting algorithm

1. Set $\beta^{(0)} = 0$.
2. For t = 1 : T,
   (a) Let $F_i = {\beta^{(t-1)}}^T h(x_i)$, i = 1, ..., n (the current fit).
   (b) Set $w_i = -\frac{\partial C(y_i, F_i)}{\partial F_i}$, i = 1, ..., n.
   (c) Identify $j_t = \arg\max_j |\sum_i w_i h_j(x_i)|$.
   (d) Set $\beta_{j_t}^{(t)} = \beta_{j_t}^{(t-1)} + \alpha_t$ and $\beta_k^{(t)} = \beta_k^{(t-1)}$, $k \neq j_t$.

Here $\beta^{(t)}$ is the "current" coefficient vector and $\alpha_t > 0$ is the current step size. Notice that $\sum_i w_i h_{j_t}(x_i) = -\frac{\partial \sum_i C(y_i, F_i)}{\partial \beta_{j_t}}$.

As we mentioned, Algorithm 1 can be interpreted simply as a coordinate descent algorithm in "weak learner" space. Implementation details include the dictionary H of weak learners, the loss function C(y, F), the method of searching for the optimal $j_t$, and the way in which $\alpha_t$ is determined (the sign of $\alpha_t$ will always be $\mathrm{sign}(\sum_i w_i h_{j_t}(x_i))$, since we want the loss to be reduced; in most cases, the dictionary H is negation closed, and so it is assumed without loss of generality that $\alpha_t > 0$).
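To make Algorithm 1 concrete, here is a minimal Python sketch of the coordinate-descent loop for a finite, explicitly enumerated dictionary, using a fixed small step size. The function names and the fixed step `eps` are illustrative choices of ours, not the paper's formulation.

```python
import numpy as np

def generic_boost(H, y, grad_loss, T=1000, eps=0.01):
    """Generic gradient-based boosting (Algorithm 1) over a finite dictionary.

    H         : (n, J) matrix whose columns are the weak learners h_j evaluated on the data
    y         : (n,) labels in {-1, +1}
    grad_loss : function (y, F) -> dC/dF, evaluated pointwise
    """
    n, J = H.shape
    beta = np.zeros(J)
    for t in range(T):
        F = H @ beta                       # (a) current fit
        w = -grad_loss(y, F)               # (b) negative gradient of the loss w.r.t. the fit
        corr = H.T @ w                     # sum_i w_i h_j(x_i), one value per coordinate
        j = np.argmax(np.abs(corr))        # (c) steepest-descent coordinate
        beta[j] += eps * np.sign(corr[j])  # (d) small step in that coordinate
    return beta

# Example with exponential loss C(y, F) = exp(-y F):
#   grad_exp = lambda y, F: -y * np.exp(-y * F)
#   beta = generic_boost(H, y, grad_exp, T=5000, eps=0.01)
```

With the step size fixed at a small constant this is the ε-boosting variant discussed in section 2.4; a line search over the step size at each iteration would recover the AdaBoost-style version.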

For example, the original AdaBoost algorithm uses this scheme with the exponential loss $C(y, F) = \exp(-yF)$, and an implicit line search to find the best $\alpha_t$ once a direction $j_t$ has been chosen (see [9], chapter 10, and [11] for details). The dictionary used by AdaBoost in this formulation would be a set of candidate classifiers, i.e. $h_j(x_i) \in \{-1, +1\}$; usually decision trees are used in practice.

2.1 Practical implementation of boosting

The dictionaries used for boosting are typically very large, practically infinite, and therefore the generic boosting algorithm we have presented cannot be implemented verbatim. In particular, it is not practical to exhaustively search for the maximizer in step 2(c). Instead, an approximate, usually greedy search is conducted to find a good candidate weak learner $h_{j_t}$ which makes the first-order decline in the loss large (even if not maximal among all possible models).

In the common case that the dictionary of weak learners is comprised of decision trees with up to k nodes, the way AdaBoost and other boosting algorithms solve stage 2(c) is by building a decision tree to a re-weighted version of the data, with the weights $|w_i|$ ($w_i$ as defined above). Thus they first replace step 2(c) with minimization of:

$$\sum_i |w_i|\, 1\{y_i \neq h_{j_t}(x_i)\}$$

which is easily shown to be equivalent to the original step 2(c). They then use a greedy decision-tree building algorithm such as CART or C5 to build a decision tree which minimizes this quantity, i.e. achieves low "weighted misclassification error" on the weighted data. Since the tree is built greedily, one split at a time, it will not be the global minimizer of weighted misclassification error among all k-node decision trees. However, it will be a good fit for the re-weighted data, and can be considered an approximation to the optimal tree.

This use of approximate optimization techniques is critical, since much of the strength of the boosting approach comes from its ability to build additive models in very high-dimensional predictor spaces. In such spaces, standard exact optimization techniques are impractical: any approach which requires calculation and inversion of Hessian matrices is completely out of the question, and even approaches which require only first derivatives, such as coordinate descent, can only be implemented approximately.
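As a concrete illustration of this re-weighting device, the sketch below fits a small CART-style tree to the data with case weights $|w_i|$, using scikit-learn's tree learner as a stand-in for CART/C5; treating the fitted tree's predictions as the selected weak learner, and the values of `k` and `eps`, are assumptions of this sketch rather than a transcription of any particular boosting implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_weak_learner(X, y, w, k=8):
    """Approximate step 2(c): greedily fit a tree with at most k leaves to the data
    re-weighted by |w_i|, so it has low weighted misclassification error."""
    tree = DecisionTreeClassifier(max_leaf_nodes=k)
    tree.fit(X, y, sample_weight=np.abs(w))
    # The fitted tree plays the role of the selected weak learner h_{j_t};
    # its +/-1 predictions give the "column" whose coefficient gets updated.
    return tree

# Usage inside a boosting loop with exponential loss (F is the current fit):
#   w = y * np.exp(-y * F)                  # |w_i| = exp(-y_i F_i), the familiar AdaBoost weights
#   h = fit_weak_learner(X, y, w)
#   F = F + eps * h.predict(X).astype(float)
```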

2.2 Gradient-based boosting as a generic modeling tool

As [7, 11] mention, this view of boosting as gradient descent allows us to devise boosting algorithms for any function estimation problem: all we need is an appropriate loss and an appropriate dictionary of weak learners. For example, [8] suggested using the binomial log-likelihood loss instead of the exponential loss of AdaBoost for binary classification, resulting in the LogitBoost algorithm. However, there is no need to limit boosting algorithms to classification: [7] applied this methodology to regression estimation, using squared error loss and regression trees, and [15] applied it to density estimation, using the log-likelihood criterion and Bayesian networks as weak learners. Their experiments and those of others illustrate that the practical usefulness of this approach (coordinate descent in high-dimensional predictor space) carries beyond classification, and even beyond supervised learning.

The view we present in this paper, of coordinate-descent boosting as approximate $l_1$-regularized fitting, offers some insight into why this approach would be good in general: it allows us to fit regularized models directly in high-dimensional predictor space. In this it bears a conceptual similarity to support vector machines, which exactly fit an $l_2$-regularized model in high-dimensional (RKHS) predictor space.

2.3 Loss functions

The two most commonly used loss functions for boosting classification models are the exponential and the (minus) binomial log-likelihood:

Exponential: $C_e(y, F) = \exp(-yF)$
Log-likelihood: $C_l(y, F) = \log(1 + \exp(-yF))$

[Figure 2: The two classification loss functions.]

These two loss functions bear some important similarities to each other. As [8] shows, the population minimizer of expected loss at point x is similar for both loss functions and is given by:

$$\hat{F}(x) = c \log \left[ \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)} \right]$$

where $c_e = 1/2$ for exponential loss and $c_l = 1$ for binomial loss.
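For reference, a minimal sketch of the two losses as Python functions, evaluated on a few margin values; the numerical comparison simply previews the behavior discussed next (near-agreement for positive margins, sharply different weighting of negative margins).

```python
import numpy as np

def exp_loss(y, F):
    """Exponential loss C_e(y, F) = exp(-yF), as used by AdaBoost."""
    return np.exp(-y * F)

def binomial_loss(y, F):
    """(Minus) binomial log-likelihood C_l(y, F) = log(1 + exp(-yF))."""
    return np.log1p(np.exp(-y * F))

margins = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])   # values of yF
print(exp_loss(1, margins))       # grows exponentially for negative margins
print(binomial_loss(1, margins))  # roughly linear for large negative margins,
                                  # close to the exponential loss for yF >= 0
```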

More importantly for our purpose, we have the following simple proposition, which illustrates the strong similarity between the two loss functions for positive margins (i.e. correct classifications):

Proposition 1. If $yF \geq 0$ then
$$0.5\, C_e(y, F) \leq C_l(y, F) \leq C_e(y, F) \qquad (5)$$

In other words, the two losses become similar if the margins are positive, and both behave like exponentials.

Proof: Consider the functions $f_1(z) = z$ and $f_2(z) = \log(1 + z)$ for $z \in [0, 1]$. Then $f_1(0) = f_2(0) = 0$, and:
$$\frac{\partial f_1(z)}{\partial z} \equiv 1, \qquad \frac{1}{2} \leq \frac{\partial f_2(z)}{\partial z} = \frac{1}{1+z} \leq 1$$
Thus we can conclude $0.5 f_1(z) \leq f_2(z) \leq f_1(z)$. Now set $z = \exp(-yF)$ and we get the desired result.

For negative margins the behaviors of $C_e$ and $C_l$ are very different, as [8] have noted. In particular, $C_l$ is more robust against outliers and misspecified data.

2.4 Line-search boosting vs. ε-boosting

As mentioned above, AdaBoost determines $\alpha_t$ using a line search. In our notation for Algorithm 1 this would be:
$$\alpha_t = \arg\min_\alpha \sum_i C(y_i, F_i + \alpha h_{j_t}(x_i))$$
The alternative approach, suggested by Friedman [7, 9], is to "shrink" all $\alpha_t$ to a single small value ε. This may slow down learning considerably (depending on how small ε is), but is attractive theoretically: the first-order theory underlying gradient boosting implies that the weak learner chosen is the best increment only "locally". It can also be argued that this approach is "stronger" than line search, as we can keep selecting the same $h_{j_t}$ repeatedly if it remains optimal, and so ε-boosting "dominates" line-search boosting in terms of training error. In practice, this approach of "slowing the learning rate" usually performs better than line search in terms of prediction error as well (see [7]). For our purposes, we will mostly assume ε is infinitesimally small, so the theoretical boosting algorithm which results is the limit of a series of boosting algorithms with shrinking ε.

In regression terminology, the line-search version is equivalent to forward stage-wise modeling, infamous in the statistics literature for being too greedy and highly unstable (see [7]). This is intuitively obvious, since by increasing the coefficient until it "saturates" we are destroying signal which may help us select other good predictors.
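The difference between the two step-size rules can be written down directly; the sketch below uses scipy's scalar minimizer for the line search and a fixed ε for the slow-learning alternative. The function names and the search interval (0, 10) are our own illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def line_search_step(y, F, h_vals, loss):
    """AdaBoost-style step: alpha_t = argmin_a sum_i C(y_i, F_i + a * h_jt(x_i))."""
    obj = lambda a: loss(y, F + a * h_vals).sum()
    return minimize_scalar(obj, bounds=(0.0, 10.0), method='bounded').x

def epsilon_step(eps=0.01):
    """Friedman's alternative: shrink every step to the same small value eps."""
    return eps

# loss = lambda y, F: np.exp(-y * F)                    # e.g. the exponential loss
# alpha = line_search_step(y, F, h.predict(X), loss)    # greedy; saturates the chosen coordinate
# alpha = epsilon_step(0.01)                            # slow learning, i.e. eps-boosting
```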

[Figure 3: A simple data example, with two observations from class "O" and two observations from class "X". The full line is the Euclidean margin-maximizing separating hyper-plane.]

3 $l_p$ margins, support vector machines and boosting

We now introduce the concept of margins as a geometric interpretation of a binary classification model. In the context of boosting, this view offers a different understanding of AdaBoost from the gradient descent view presented above. In the following sections we connect the two views.

3.1 The Euclidean margin and the support vector machine

Consider a classification model in high-dimensional predictor space: $F(x) = \sum_j h_j(x)\beta_j$. We say that the model separates the training data $\{x_i, y_i\}_{i=1}^n$ if $\mathrm{sign}(F(x_i)) = y_i$ for all i. From a geometrical perspective this means that the hyper-plane defined by $F(x) = 0$ is a separating hyper-plane for this data, and we define its (Euclidean) margin as:

$$m_2(\beta) = \min_i \frac{y_i F(x_i)}{\|\beta\|_2} \qquad (6)$$

The margin-maximizing separating hyper-plane for this data would be defined by the β which maximizes $m_2(\beta)$. Figure 3 shows a simple example of separable data in 2 dimensions, with its margin-maximizing separating hyper-plane. The Euclidean margin-maximizing separating hyper-plane is the (non-regularized) support vector machine solution. Its margin-maximizing properties play a central role in deriving generalization error bounds for these models, and form the basis for a rich literature.

[Figure 4: $l_1$-margin-maximizing separating hyper-plane for the same data set as Figure 3. The difference between the diagonal Euclidean optimal separator and the vertical $l_1$ optimal separator illustrates the "sparsity" effect of optimal $l_1$ separation.]

3.2 The $l_1$ margin and its relation to boosting

Instead of considering the Euclidean margin as in (6) we can define an $l_p$ margin concept as

$$m_p(\beta) = \min_i \frac{y_i F(x_i)}{\|\beta\|_p} \qquad (7)$$

Of particular interest to us is the case p = 1. Figure 4 shows the $l_1$-margin-maximizing separating hyper-plane for the same simple example as Figure 3. Note the fundamental difference between the two solutions: the $l_2$-optimal separator is diagonal, while the $l_1$-optimal one is vertical. To understand why this is so, we can relate the two margin definitions to each other as:

$$\frac{yF(x)}{\|\beta\|_1} = \frac{yF(x)}{\|\beta\|_2} \cdot \frac{\|\beta\|_2}{\|\beta\|_1} \qquad (8)$$

From this representation we can observe that the $l_1$ margin will tend to be big if the ratio $\|\beta\|_2 / \|\beta\|_1$ is big. This ratio will generally be big if β is sparse. To see this, consider fixing the $l_1$ norm of the vector and then comparing the $l_2$ norm of two candidates: one with many small components, and the other a sparse one with a few large components and many zero components. It is easy to see that the second vector will have bigger $l_2$ norm, and hence (if the $l_2$ margin for both vectors is equal) a bigger $l_1$ margin.

A different perspective on the difference between the optimal solutions is given by a theorem due to Mangasarian [10], which states that the $l_p$-margin-maximizing separating hyper-plane maximizes the $l_q$ distance from the closest points to the separating hyper-plane, with $\frac{1}{p} + \frac{1}{q} = 1$. Thus the Euclidean optimal separator (p = 2) also maximizes Euclidean distance between the points and the hyper-plane, while the $l_1$ optimal separator maximizes $l_\infty$ distance. This interesting result gives another intuition why $l_1$-optimal separating hyper-planes tend to be coordinate-oriented (i.e. have sparse representations): since $l_\infty$ projection considers only the largest coordinate distance, some coordinate distances may be 0 at no cost of increased $l_\infty$ distance.
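A small numerical sketch of relation (8): for two coefficient vectors with the same $l_1$ norm, the sparse one has the larger $\|\beta\|_2/\|\beta\|_1$ ratio, and on data whose signal lives in one coordinate it also attains the larger $l_1$ margin. The toy data below is our own construction in the spirit of Figures 3 and 4.

```python
import numpy as np

def lp_margin(beta, X, y, p):
    """l_p margin (7): min_i y_i F(x_i) / ||beta||_p."""
    return np.min(y * (X @ beta)) / np.linalg.norm(beta, ord=p)

# Four points where the first coordinate carries the signal and the second is noise.
X = np.array([[1.0, 0.2], [2.0, -0.3], [-1.0, -0.2], [-2.0, 0.3]])
y = np.array([1, 1, -1, -1])

sparse = np.array([1.0, 0.0])   # "vertical" separator: all weight on one coordinate
dense  = np.array([0.5, 0.5])   # same l1 norm, spread over both coordinates

for name, beta in [("sparse", sparse), ("dense", dense)]:
    ratio = np.linalg.norm(beta, 2) / np.linalg.norm(beta, 1)   # the ratio in (8)
    print(name, ratio, lp_margin(beta, X, y, 1), lp_margin(beta, X, y, 2))
# The sparse vector has the larger norm ratio and, on this data, the larger l1 margin.
```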

[16] have pointed out the relation between AdaBoost and the $l_1$ margin. They prove that, in the case of separable data, the boosting iterations increase the "boosting" margin of the model, defined as:

$$\min_i \frac{y_i F(x_i)}{\|\alpha\|_1} \qquad (9)$$

In other words, this is the $l_1$ margin of the model, except that it uses the α incremental representation rather than the β "geometric" representation for the model. The two representations give the same $l_1$ norm if there is "sign consistency", or monotonicity, in the coefficient paths traced by the model, i.e. if at every iteration t of the boosting algorithm:

$$\beta_{j_t} \neq 0 \;\Rightarrow\; \mathrm{sign}(\alpha_t) = \mathrm{sign}(\beta_{j_t}) \qquad (10)$$

As we will see later, this monotonicity condition will play an important role in the equivalence between boosting and $l_1$ regularization.

The $l_1$-margin-maximization view of AdaBoost presented by [16], and a whole plethora of papers that followed, is important for the analysis of boosting algorithms for two distinct reasons:

- It gives an intuitive, geometric interpretation of the model that AdaBoost is looking for: a model which separates the data well in this $l_1$-margin sense. Note that the view of boosting as gradient descent in a loss criterion doesn't really give the same kind of intuition: if the data is separable, then any model which separates the training data will drive the exponential or binomial loss to 0 when scaled up:
$$m_1(\beta) > 0 \;\Rightarrow\; \sum_i C(y_i, d\,\beta^T h(x_i)) \to 0 \text{ as } d \to \infty$$
- The $l_1$-margin behavior of a classification model on its training data facilitates generation of generalization (or prediction) error bounds, similar to those that exist for support vector machines.

From a statistical perspective, however, we should be suspicious of margin maximization as a method for building good prediction models in high-dimensional predictor space. Margin maximization is by nature a non-regularized objective, and solving it in high-dimensional space is likely to lead to over-fitting and bad prediction performance. This has been observed in practice by many authors, in particular Breiman [2]. In section 7 we return to discuss these issues in more detail.

4 Boosting as approximate incremental $l_1$-constrained fitting

In this section we introduce an interpretation of the generic coordinate-descent boosting algorithm as "tracking" a path of approximate solutions to $l_1$-constrained (or equivalently, regularized) versions of its loss criterion. This view serves our understanding of what boosting does, in particular the connection between early stopping in boosting and regularization. We will also use this view to get a result about the asymptotic margin maximization of regularized classification models, and by analogy of classification boosting. We build on ideas first presented by [9] (chapter 10) and [5].

Given a loss criterion $C(\cdot, \cdot)$, consider the 1-dimensional path of optimal solutions to $l_1$-constrained optimization problems over the training data:

$$\hat{\beta}(c) = \arg\min_{\|\beta\|_1 \leq c} \sum_i C(y_i, h(x_i)^T \beta). \qquad (11)$$

As c varies, $\hat{\beta}(c)$ traces a 1-dimensional "optimal curve" through $R^J$. If an optimal solution for the non-constrained problem exists and has finite $l_1$ norm $c_0$, then obviously $\hat{\beta}(c) = \hat{\beta}(c_0) = \hat{\beta}$ for all $c > c_0$. Note that in the case of separable 2-class data, using either $C_e$ or $C_l$, there is no finite-norm optimal solution. Rather, the constrained solution will always have $\|\hat{\beta}(c)\|_1 = c$.

A different way of building a solution which has $l_1$ norm c is to run our ε-boosting algorithm for $c/\epsilon$ iterations. This will give an $\alpha^{(c/\epsilon)}$ vector which has $l_1$ norm exactly c. For the norm of the "geometric" representation $\beta^{(c/\epsilon)}$ to also be equal to c, we need the monotonicity condition (10) to hold as well. This condition will play a key role in our exposition.

We are going to argue that the two solution paths $\hat{\beta}(c)$ and $\beta^{(c/\epsilon)}$ are very similar for ε "small". Let us start by observing this similarity in practice. Figure 1 in the introduction shows an example of this similarity for squared error loss fitting with an $l_1$ (lasso) penalty. Figure 5 shows another example in the same mold, taken from [5]. The data is a diabetes study and the dictionary used is just the original 10 variables. The panel on the left shows the path of optimal $l_1$-constrained solutions $\hat{\beta}(c)$ and the panel on the right shows the ε-boosting path with the 10-dimensional dictionary (the total number of boosting iterations is about 6000). The 1-dimensional path through $R^{10}$ is described by 10 coordinate curves, corresponding to each one of the variables.

The interesting phenomenon we observe is that the two coefficient traces are not completely identical. Rather, they agree up to the point where the coefficient path of variable 7 becomes non-monotone, i.e. it violates (10) (this point is where variable 8 comes into the model; see the arrow on the right panel). This example illustrates that the monotonicity condition, and its implication that $\|\alpha\|_1 = \|\beta\|_1$, is critical for the equivalence between ε-boosting and $l_1$-constrained optimization.

The two examples we have seen so far have used squared error loss, and we should ask ourselves whether this equivalence stretches beyond this loss. Figure 6 shows a similar result, but this time for the binomial log-likelihood loss, $C_l$.

[Figure 5: Another example of the equivalence between the Lasso optimal solution path (left) and ε-boosting with squared error loss (right). Note that the equivalence breaks down when the path of variable 7 becomes non-monotone.]

We used the "spam" dataset, taken from the UCI repository [1]. We chose only 5 predictors of the 57 to make the plots more interpretable and the computations more accommodating. We see that there is a perfect equivalence between the exact constrained solution (i.e. regularized logistic regression) and ε-boosting in this case, since the paths are fully monotone.

To justify why this observed equivalence is not surprising, let us consider the following $l_1$-locally-optimal monotone direction problem of finding the best monotone ε increment to a given model $\beta_0$:

$$\min_\beta C(\beta) \quad \text{s.t.} \quad \|\beta\|_1 - \|\beta_0\|_1 \leq \epsilon, \quad |\beta_j| \geq |\beta_{0,j}| \;\forall j \qquad (12)$$

Here we use $C(\beta)$ as shorthand for $\sum_i C(y_i, h(x_i)^T \beta)$. A first-order Taylor expansion gives us:

$$C(\beta) = C(\beta_0) + \nabla C(\beta_0)^T (\beta - \beta_0) + O(\epsilon^2)$$

And given the $l_1$ constraint on the increase in $\|\beta\|_1$, it is easy to see that a first-order optimal solution (and therefore an optimal solution as $\epsilon \to 0$) will make a "coordinate descent" step, i.e.:

$$\beta_j \neq \beta_{0,j} \;\Rightarrow\; |\nabla C(\beta_0)_j| = \max_k |\nabla C(\beta_0)_k|$$

assuming the signs match (i.e. $\mathrm{sign}(\beta_{0,j}) = -\mathrm{sign}(\nabla C(\beta_0)_j)$).

[Figure 6: Exact coefficient paths (left) for $l_1$-constrained logistic regression and boosting coefficient paths (right) with binomial log-likelihood loss on five variables from the "spam" dataset. The boosting path was generated using a small ε and 7000 iterations.]

So we get that if the optimal solution to (12) without the monotonicity constraint happens to be monotone, then it is equivalent to a coordinate descent step. And so it is reasonable to expect that if the optimal $l_1$-regularized path is monotone (as it indeed is in Figures 1 and 6), then an "infinitesimal" ε-boosting algorithm would follow the same path of solutions. Furthermore, even if the optimal path is not monotone, we can still use the formulation (12) to argue that ε-boosting would tend to follow an approximate $l_1$-regularized path. The main difference between the ε-boosting path and the true optimal path is that it will tend to "delay" becoming non-monotone, as we observe for variable 7 in Figure 5. To understand this specific phenomenon would require analysis of the true optimal path, which falls outside the scope of our discussion; [5] cover the subject for squared error loss, and their discussion applies to any continuously differentiable convex loss, using second-order approximations.

We can utilize this understanding of the relationship between boosting and $l_1$ regularization to construct $l_p$ boosting algorithms by changing the coordinate-selection criterion in the coordinate descent algorithm. We will get back to this point in section 7, where we design an $l_2$ boosting algorithm.

The experimental evidence and heuristic discussion we have presented lead us to the following conjecture which connects "slow boosting" and $l_1$-regularized optimization:

Conjecture 2. Consider applying the ε-boosting algorithm to any convex loss function, generating a path of solutions $\beta^{(\epsilon)}(t)$. Then if the optimal coefficient paths are monotone for all $c < c_0$, i.e. if for all j, $|\hat{\beta}(c)_j|$ is non-decreasing in the range $c < c_0$, then:

$$\lim_{\epsilon \to 0} \beta^{(\epsilon)}(c_0/\epsilon) = \hat{\beta}(c_0)$$
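For squared error loss the conjecture can be probed numerically with off-the-shelf tools; the sketch below compares the exact lasso coefficient path with an ε-forward-stagewise path, in the spirit of Figures 1 and 5. The simulated data and step size are arbitrary illustrative choices of ours.

```python
import numpy as np
from sklearn.linear_model import lasso_path

def forward_stagewise(X, y, n_steps=5000, eps=0.01):
    """eps-boosting with squared error loss: repeatedly take a small step in the
    coordinate most correlated with the current residual."""
    n, J = X.shape
    beta = np.zeros(J)
    path = []
    for _ in range(n_steps):
        resid = y - X @ beta
        corr = X.T @ resid
        j = np.argmax(np.abs(corr))
        beta[j] += eps * np.sign(corr[j])
        path.append(beta.copy())
    return np.array(path)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=100)

stage_path = forward_stagewise(X, y)      # indexed by total l1 norm = eps * t
_, lasso_coefs, _ = lasso_path(X, y)      # exact l1-regularized solutions
# Plotting both sets of coefficient traces against the l1 norm of the coefficient
# vector reproduces the kind of agreement shown in Figures 1 and 5 whenever the
# optimal paths are monotone.
```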

Efron et al. [5] (theorem 2) prove a weaker "local" result for the case of squared error loss only. We generalize their result to any convex loss. However, this result still does not prove the "global" convergence which the conjecture claims and the empirical evidence implies. For the sake of brevity and readability, we defer this proof, together with a concise mathematical definition of the different types of convergence, to Appendix A.

In the context of "real-life" boosting, where the number of basis functions is usually very large, and making ε small enough for the theory to apply would require running the algorithm forever, these results should not be considered directly applicable. Instead, they should be taken as an intuitive indication that boosting, especially the ε version, is indeed approximating optimal solutions to the constrained problems it encounters along the way.

5 $l_p$-constrained classification loss functions

Having established the relation between boosting and $l_1$ regularization, we are going to turn our attention to the regularized optimization problem. By analogy, our results will apply to boosting as well. We concentrate on $C_e$ and $C_l$, the two classification losses defined above, and the solution paths of their $l_p$-constrained versions:

$$\hat{\beta}^{(p)}(c) = \arg\min_{\|\beta\|_p \leq c} \sum_i C(y_i, \beta^T h(x_i)) \qquad (13)$$

where C is either $C_e$ or $C_l$. As we discussed below equation (11), if the training data is separable in span(H), then we have $\|\hat{\beta}^{(p)}(c)\|_p = c$ for all values of c. Consequently:

$$\left\| \frac{\hat{\beta}^{(p)}(c)}{c} \right\|_p = 1 \quad \forall c$$

We may ask what are the convergence points of this sequence as $c \to \infty$. The following theorem shows that these convergence points describe $l_p$-margin-maximizing separating hyper-planes.

Theorem 3. Assume the data is separable, i.e. there exists β such that $y_i \beta^T h(x_i) > 0$ for all i. Then for both $C_e$ and $C_l$, $\frac{\hat{\beta}^{(p)}(c)}{c}$ converges to the $l_p$-margin-maximizing separating hyper-plane (if it is unique), in the following sense:

$$\hat{\beta}^{(p)} := \lim_{c \to \infty} \frac{\hat{\beta}^{(p)}(c)}{c} = \arg\max_{\|\beta\|_p = 1} \min_i y_i \beta^T h(x_i) \qquad (14)$$

If the $l_p$-margin-maximizing separating hyper-plane is not unique, then $\frac{\hat{\beta}^{(p)}(c)}{c}$ may have multiple convergence points, but they will all represent $l_p$-margin-maximizing separating hyper-planes.

Proof: This proof applies for both $C_e$ and $C_l$, given the property in (5).

Consider two separating candidates $\beta_1$ and $\beta_2$ such that $\|\beta_1\|_p = \|\beta_2\|_p = 1$. Assume that $\beta_1$ separates better, i.e.:

$$m_1 := \min_i y_i \beta_1^T h(x_i) \;>\; m_2 := \min_i y_i \beta_2^T h(x_i) \;>\; 0$$

Then we have the following simple lemma:

Lemma 4. There exists some $D = D(m_1, m_2)$ such that for all $d > D$, $d\beta_1$ incurs smaller loss than $d\beta_2$; in other words:

$$\sum_i C(y_i, d\beta_1^T h(x_i)) < \sum_i C(y_i, d\beta_2^T h(x_i))$$

Given this lemma, we can now prove that any convergence point of $\frac{\hat{\beta}^{(p)}(c)}{c}$ must be an $l_p$-margin-maximizing separator. Assume $\beta^*$ is a convergence point of $\frac{\hat{\beta}^{(p)}(c)}{c}$. Denote its minimal margin on the data by $m^*$. If the data is separable, clearly $m^* > 0$ (since otherwise the loss of $d\beta^*$ does not even converge to 0 as $d \to \infty$). Now, assume some $\tilde{\beta}$ with $\|\tilde{\beta}\|_p = 1$ has bigger minimal margin $\tilde{m} > m^*$. By continuity of the minimal margin in β, there exists some open neighborhood of $\beta^*$:

$$N_{\beta^*} = \{\beta : \|\beta - \beta^*\|_2 < \delta\}$$

and an $\epsilon > 0$, such that:

$$\min_i y_i \beta^T h(x_i) < \tilde{m} - \epsilon, \quad \forall \beta \in N_{\beta^*}$$

Now by the lemma we get that there exists some $D = D(\tilde{m}, \tilde{m} - \epsilon)$ such that $d\tilde{\beta}$ incurs smaller loss than $d\beta$ for any $d > D$, $\beta \in N_{\beta^*}$. Therefore $\beta^*$ cannot be a convergence point of $\frac{\hat{\beta}^{(p)}(c)}{c}$.

We conclude that any convergence point of the sequence $\frac{\hat{\beta}^{(p)}(c)}{c}$ must be an $l_p$-margin-maximizing separator. If the margin-maximizing separator is unique then it is the only possible convergence point, and therefore:

$$\hat{\beta}^{(p)} := \lim_{c \to \infty} \frac{\hat{\beta}^{(p)}(c)}{c} = \arg\max_{\|\beta\|_p = 1} \min_i y_i \beta^T h(x_i)$$

Proof of Lemma: Using (5) and the definition of $C_e$, we get for both loss functions:

$$\sum_i C(y_i, d\beta_1^T h(x_i)) \leq n \exp(-d\, m_1)$$

Now, since $\beta_1$ separates better, we can find our desired

$$D = D(m_1, m_2) = \frac{\log n + \log 2}{m_1 - m_2}$$

such that:

$$\forall d > D, \quad n \exp(-d\, m_1) < 0.5 \exp(-d\, m_2)$$

And using (5) and the definition of $C_e$ again we can write:

$$0.5 \exp(-d\, m_2) \leq \sum_i C(y_i, d\beta_2^T h(x_i))$$

Combining these three inequalities we get our desired result:

$$\forall d > D, \quad \sum_i C(y_i, d\beta_1^T h(x_i)) < \sum_i C(y_i, d\beta_2^T h(x_i))$$

We thus conclude that if the $l_p$-margin-maximizing separating hyper-plane is unique, the normalized constrained solution converges to it. In the case that the margin-maximizing separating hyper-plane is not unique, this theorem can easily be generalized to characterize a unique solution by defining "tie-breakers": if the minimal margin is the same, then the second minimal margin determines which model separates better, and so on. Only in the case that the whole order statistics of the $l_p$ margins is common to many solutions can there really be more than one convergence point for $\frac{\hat{\beta}^{(p)}(c)}{c}$.

5.1 Implications of Theorem 3

Boosting implications

Combined with our results from section 4, Theorem 3 indicates that the normalized boosting path $\frac{\beta^{(t)}}{\sum_{u \leq t} \alpha_u}$, with either $C_e$ or $C_l$ used as loss, approximately converges to a separating hyper-plane $\hat{\beta}$ which attains:

$$\max_{\|\beta\|_1 = 1} \min_i y_i \beta^T h(x_i) = \max_{\|\beta\|_1 = 1} \min_i y_i\, d_i\, \|\beta\|_2, \qquad (15)$$

where $d_i$ is the Euclidean distance from the training point to the separating hyper-plane. In other words, it maximizes Euclidean distance scaled by an $l_2$ norm. As we have mentioned already, this implies that the asymptotic boosting solution will tend to be sparse in representation, due to the fact that for fixed $l_1$ norm, the $l_2$ norm of vectors that have many 0 entries will generally be larger. We conjecture that this asymptotic solution $\hat{\beta} = \lim_{c \to \infty} \hat{\beta}^{(1)}(c)/c$ will have at most n (the number of observations) non-zero coefficients. This in fact holds for squared error loss, where there always exists a finite optimal solution $\hat{\beta}^{(1)}$ with at most n non-zero coefficients (see Efron et al. [5]).

Logistic regression implications

Recall that the logistic regression (maximum likelihood) solution is undefined if the data is separable in the Euclidean space spanned by the predictors. Theorem 3 allows us to define a logistic regression solution for separable data, as follows:

1. Set a high constraint value $c_{max}$.
2. Find $\hat{\beta}^{(p)}(c_{max})$, the solution to the logistic regression problem subject to the constraint $\|\beta\|_p \leq c_{max}$. The problem is convex for any $p \geq 1$ and differentiable for any $p > 1$, so interior point methods can be used to solve it.
3. Now you have (approximately) the $l_p$-margin-maximizing solution for this data, described by $\frac{\hat{\beta}^{(p)}(c_{max})}{c_{max}}$.

This is a solution to the original problem in the sense that it is, approximately, the convergence point of the normalized $l_p$-constrained solutions, as the constraint is relaxed. Of course, with our result from Theorem 3 it would probably make more sense to simply find the optimal separating hyper-plane directly: this is a linear programming problem for $l_1$ separation and a quadratic programming problem for $l_2$ separation. We can then consider this optimal separator as a logistic regression solution for the separable data.
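As a sketch of the linear-programming route for $l_1$ separation, the following sets up $\max_{\|\beta\|_1 = 1} \min_i y_i \beta^T h(x_i)$ with scipy, splitting β into positive and negative parts; this is our own small formulation of the LP mentioned above, and it assumes the data passed in are separable.

```python
import numpy as np
from scipy.optimize import linprog

def l1_margin_maximizer(H, y):
    """Find beta maximizing min_i y_i * H[i] @ beta subject to ||beta||_1 = 1.

    Writes beta = bp - bm with bp, bm >= 0 and solves the LP
        max m  s.t.  y_i * H[i] @ (bp - bm) >= m,  sum(bp + bm) = 1.
    Assumes (H, y) are linearly separable in the given dictionary.
    """
    n, J = H.shape
    c = np.concatenate([np.zeros(2 * J), [-1.0]])             # minimize -m
    A_ub = np.hstack([-y[:, None] * H, y[:, None] * H, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(2 * J), [0.0]])[None, :]   # ||beta||_1 = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * (2 * J) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    beta = res.x[:J] - res.x[J:2 * J]
    return beta, -res.fun                                     # separator and its l1 margin
```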

6 Examples

6.1 Spam dataset

We now know that if the data are separable and we let boosting run forever, we will approach the same "optimal" separator for both $C_e$ and $C_l$. However, if we stop early, or if the data is not separable, the behavior of the two loss functions may differ significantly, since $C_e$ weighs negative margins exponentially, while $C_l$ is approximately linear in the margin for large negative margins (see [8] for detailed discussion). Consequently, we can expect $C_e$ to concentrate more on the "hard" training data, in particular in the non-separable case.

Figure 7 illustrates the behavior of ε-boosting with both loss functions, as well as that of AdaBoost, on the spam dataset (57 predictors, binary response). We used 10-node trees and ε = 0.1. The left plot shows the minimal margin as a function of the $l_1$ norm of the coefficient vector $\|\beta\|_1$. Binomial loss creates a bigger minimal margin initially, but the minimal margins for both loss functions converge asymptotically. AdaBoost initially lags behind but catches up nicely and reaches the same minimal margin asymptotically. The right plot shows the test error as the iterations proceed, illustrating that both ε-methods indeed seem to over-fit eventually, even as their "separation" (minimal margin) is still improving. AdaBoost did not significantly over-fit in the 1000 iterations it was allowed to run, but it obviously would have if it were allowed to run on.

[Figure 7: Behavior of boosting with the two loss functions on the spam dataset: minimal margins (left) and test error (right) for exponential loss, logistic loss and AdaBoost.]

6.2 Simulated data

To make a more educated comparison and more compelling visualization, we have constructed an example of separation of 2-dimensional data using an 8-th degree polynomial dictionary (45 functions). The data consists of 50 observations of each class, drawn from a mixture of Gaussians, and presented in Figure 8. Also presented, in the solid line, is the optimal $l_1$ separator for this data in this dictionary (easily calculated as a linear programming problem; note the difference from the $l_2$-optimal decision boundary, presented in section 7.2, Figure 11). The optimal $l_1$ separator has only 12 non-zero coefficients out of 45.

We ran an ε-boosting algorithm on this dataset, using the logistic log-likelihood loss $C_l$, with ε = 0.001, and Figure 8 shows two of the models generated after $10^5$ and $3 \cdot 10^6$ iterations. We see that the models seem to converge to the optimal separator. A different view of this convergence is given in Figure 9, where we see two measures of convergence: the minimal margin (left) and the $l_1$-norm distance between the normalized models (right), given by:

$$\sum_j \left| \hat{\beta}_j - \frac{\beta_j^{(t)}}{\|\beta^{(t)}\|_1} \right|$$

where $\hat{\beta}$ is the optimal separator with $l_1$ norm 1 and $\beta^{(t)}$ is the boosting model after t iterations. We can conclude that on this simple artificial example we get nice convergence of the logistic-boosting model path to the $l_1$-margin-maximizing separating hyper-plane.

We can also use this example to illustrate the similarity between the boosted path and the path of $l_1$-optimal solutions, as we have discussed in section 4. Figure 10 shows the class decision boundaries for 4 models generated along the boosting path, compared to the optimal solutions to the constrained logistic regression problem with the same bound on the $l_1$ norm of the coefficient vector.
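The dictionary size quoted in this example matches the count formula from section 1; a small sketch, using scikit-learn's polynomial feature expansion as a stand-in for the paper's polynomial dictionary:

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

d, p = 2, 8
X = np.random.default_rng(2).normal(size=(100, d))      # 2-dimensional inputs
dictionary = PolynomialFeatures(degree=p).fit_transform(X)

print(dictionary.shape[1], comb(p + d, d))   # both 45: J = C(p + d, d) for degree 8, d = 2
```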

[Figure 8: Artificial data set with the $l_1$-margin-maximizing separator (solid), and boosting models after $10^5$ iterations (dashed) and $3 \cdot 10^6$ iterations (dotted) using ε = 0.001. We observe the convergence of the boosting separator to the optimal separator.]

We observe the clear similarities in the way the solutions evolve and converge to the optimal $l_1$ separator. The fact that they differ (in some cases significantly) is not surprising if we recall the monotonicity condition presented in section 4 for exact correspondence between the two model paths. In this case, if we look at the coefficient paths (not shown), we observe that the monotonicity condition is consistently violated in the low norm ranges, and hence we can expect the paths to be similar in spirit but not identical.

7 Discussion

7.1 Regularized and non-regularized behavior of the loss functions

We can now summarize what we have learned about boosting from the previous sections:

- Boosting approximately follows the path of $l_1$-regularized models for its loss criterion.
- If the loss criterion is the exponential loss of AdaBoost or the binomial log-likelihood loss of logistic regression, then the $l_1$-regularized model converges to an $l_1$-margin-maximizing separating hyper-plane, if the data are separable in the span of the weak learners.

[Figure 9: Two measures of convergence of the boosting model path to the optimal $l_1$ separator: minimal margin (left) and $l_1$ distance between the normalized boosting coefficient vector and the optimal model (right).]

We may ask which of these two points is the key to the success of boosting approaches. One empirical clue to answering this question can be found in the work of Leo Breiman [2], who programmed an algorithm to directly maximize the margins. His results were that his algorithm consistently got significantly higher minimal margins than AdaBoost on many data sets, but had slightly worse prediction performance. His conclusion was that margin maximization is not the key to AdaBoost's success.

From a statistical perspective we can embrace this conclusion, as reflecting the notion that non-regularized models in high-dimensional predictor space are bound to be over-fitted. Margin maximization is a non-regularized objective, both intuitively and more rigorously by our results from the previous section. Thus we would expect the margin-maximizing solutions to perform worse than regularized models; in the case of boosting, regularization would correspond to early stopping of the boosting algorithm.

7.2 Boosting and SVMs as regularized optimization in high-dimensional predictor spaces

Our exposition has led us to view boosting as an approximate way to solve the regularized optimization problem:

$$\min_\beta \sum_i C(y_i, \beta^T h(x_i)) + \lambda \|\beta\|_1 \qquad (16)$$

which converges as $\lambda \to 0$ to $\hat{\beta}^{(1)}$, if our loss is $C_e$ or $C_l$. In general, the loss C can be any convex differentiable loss and should be defined to match the problem domain.
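In low dimensions, problem (16) with the binomial loss can of course be solved directly rather than approximated by boosting; below is a minimal sketch with scikit-learn, whose `C` parameter plays the role of 1/λ. The function name and the λ grid are illustrative choices of ours, not anything used in the paper.

```python
from sklearn.linear_model import LogisticRegression

# l1-regularized logistic regression:
#   min_beta  sum_i C_l(y_i, beta' h(x_i)) + lambda * ||beta||_1
def l1_logistic_path(H, y, lambdas=(10.0, 1.0, 0.1, 0.01)):
    coefs = []
    for lam in lambdas:
        model = LogisticRegression(penalty='l1', C=1.0 / lam,
                                   solver='liblinear', fit_intercept=False)
        coefs.append(model.fit(H, y).coef_.ravel())
    return coefs   # as lambda decreases, ||beta||_1 grows along the regularized path
```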

[Figure 10: Comparison of decision boundaries of boosting models (broken) and of optimal constrained solutions with the same norm (full), at $l_1$ norms 20, 350, 2701 and 5401.]

Support vector machines can be described as solving the regularized optimization problem (see for example [8], chapter 12):

$$\min_\beta \sum_i (1 - y_i \beta^T h(x_i))_+ + \lambda \|\beta\|_2^2 \qquad (17)$$

which converges as $\lambda \to 0$ to the non-regularized support vector machine solution, i.e. the optimal Euclidean separator, which we denoted by $\hat{\beta}^{(2)}$.

An interesting connection exists between these two approaches, in that they allow us to solve the regularized optimization problem in high-dimensional predictor space:

- We are able to solve the $l_1$-regularized problem approximately in very high dimension via boosting, by applying the approximate coordinate descent "trick" of building a decision tree (or otherwise greedily selecting a weak learner) based on re-weighted versions of the data.
- Support vector machines facilitate a different trick for solving the regularized optimization problem in high-dimensional predictor space: the "kernel trick". If our dictionary H spans a Reproducing Kernel Hilbert Space, then RKHS theory tells us we can find the regularized solutions by solving an n-dimensional problem, in the space spanned by the kernel representers $\{K(x_i, x)\}$. This fact is by no means limited to the hinge loss of (17), and applies to any convex loss. We concentrate our discussion on SVMs (and hence hinge loss) only since it is by far the most common and well-known application of this result.

So we can view both boosting and SVMs as methods that allow us to fit regularized models in high-dimensional predictor space using a computational shortcut. The complexity of the model built is controlled by the regularization. In that, they are distinctly different from traditional statistical approaches for building models in high dimension, which start by reducing the dimensionality of the problem so that standard tools (e.g. Newton's method) can be applied to it, and also to make over-fitting less of a concern. While the merits of regularization without dimensionality reduction, like ridge regression or the lasso, are well documented in statistics, computational issues make it impractical for the size of problems typically solved via boosting or SVMs, without computational tricks. We believe that this difference may be a significant reason for the enduring success of boosting and SVMs in data modeling, i.e.: working in high dimension and regularizing is statistically preferable to a two-step procedure of first reducing the dimension, then fitting a model in the reduced space.

It is also interesting to consider the differences between the two approaches, in the loss (flexible vs. hinge loss), the penalty ($l_1$ vs. $l_2$), and the type of dictionary used (usually trees vs. RKHS). These differences indicate that the two approaches will be useful for different situations. For example, if the true model has a sparse representation in the chosen dictionary, then $l_1$ regularization may be warranted; if the form of the true model facilitates description of the class probabilities via a logistic-linear model, then the logistic loss $C_l$ is the best loss to use; and so on.

The computational tricks for both SVMs and boosting limit the kind of regularization that can be used for fitting in high-dimensional space. However, the problems can still be formulated and solved for different regularization approaches, as long as the dimensionality is low enough:

- Support vector machines can be fitted with an $l_1$ penalty, by solving the 1-norm version of the SVM problem, equivalent to replacing the $l_2$ penalty in (17) with an $l_1$ penalty. In fact, the 1-norm SVM is used quite widely, because it is more easily solved in the linear, non-RKHS, situation (as a linear program, compared to the standard SVM which is a quadratic program) and tends to give sparser solutions in the primal domain.
- Similarly, we describe below an approach for developing a boosting algorithm for fitting approximate $l_2$-regularized models.

Both of these methods are interesting and potentially useful. However, they lack what is arguably the most attractive property of the standard boosting and SVM algorithms: a computational trick to allow fitting in high dimensions.

An $l_2$ boosting algorithm

We can use our understanding of the relation of boosting to regularization, and Theorem 3, to formulate $l_p$-boosting algorithms, which will approximately follow the path of $l_p$-regularized solutions and converge to the corresponding $l_p$-margin-maximizing separating hyper-planes. Of particular interest is the $l_2$ case, since Theorem 3 implies that $l_2$-constrained fitting using $C_l$ or $C_e$ will build a regularized path to the optimal separating hyper-plane in the Euclidean (or SVM) sense.

To construct an $l_2$ boosting algorithm, consider the equivalent optimization problem (12), and change the step-size constraint to an $l_2$ constraint:

$$\|\beta\|_2 - \|\beta_0\|_2 \leq \epsilon$$

It is easy to see that the first-order solution entails selecting for modification the coordinate which maximizes:

$$\left| \frac{\nabla C(\beta_0)_k}{\beta_{0,k}} \right|$$

and that, subject to monotonicity, this will lead to a correspondence to the locally $l_2$-optimal direction. Following this intuition, we can construct an $l_2$ boosting algorithm by changing only step 2(c) of our generic boosting algorithm of section 2 to:

2(c)* Identify $j_t$ which maximizes $\frac{|\sum_i w_i h_{j_t}(x_i)|}{|\beta_{j_t}|}$.

Note that the need to consider the current coefficient (in the denominator) makes the $l_2$ algorithm appropriate for toy examples only. In situations where the dictionary of weak learners is prohibitively large, we will need to figure out a trick like the one we presented in section 2.1, to allow us to make an approximate search for the optimizer of step 2(c)*. Another problem in applying this algorithm to large problems is that we never choose the same dictionary function twice, until all have non-zero coefficients. This is due to the use of the $l_2$ penalty, which implies that the current coefficient value affects the rate at which the penalty term is increasing. In particular, if $\beta_j = 0$ then increasing it causes the penalty term $\|\beta\|_2$ to increase at rate 0, to first order (which is all the algorithm is considering).

The convergence of our $l_2$ boosting algorithm on the artificial dataset of section 6.2 is illustrated in Figure 11. We observe that the $l_2$ boosting models do indeed approach the optimal $l_2$ separator. It is interesting to note the significant difference between the optimal $l_2$ separator as presented in Figure 11 and the optimal $l_1$ separator presented in section 6.2 (Figure 8).
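The only change relative to the generic algorithm sketched in section 2 is the coordinate-selection rule; a minimal version is below, with a small constant guarding the zero-coefficient case (our own device, reflecting the remark that zero coefficients are preferred to first order).

```python
import numpy as np

def select_coordinate_l2(H, w, beta, tiny=1e-12):
    """Step 2(c)*: pick the coordinate maximizing |sum_i w_i h_j(x_i)| / |beta_j|.

    Coordinates with beta_j = 0 increase ||beta||_2 at rate 0 to first order,
    so they dominate the ratio; `tiny` stands in for that infinite preference.
    """
    corr = H.T @ w                                   # gradient-based correlations, as in step 2(c)
    ratio = np.abs(corr) / np.maximum(np.abs(beta), tiny)
    return np.argmax(ratio)

# Inside the generic loop, replace
#     j = np.argmax(np.abs(H.T @ w))         # l1 (standard boosting) selection
# with
#     j = select_coordinate_l2(H, w, beta)   # l2 boosting selection
```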

[Figure 11: Artificial data set with the $l_2$-margin-maximizing separator (solid), and $l_2$-boosting models after $5 \cdot 10^6$ iterations (dashed) and $10^8$ iterations (dotted). We observe the convergence of the boosting separator to the optimal separator.]

8 Summary and future work

In this paper we have introduced a new view of boosting in general, and two-class boosting in particular, comprised of two main points:

- Following [5, 9], we have described boosting as approximate $l_1$-regularized optimization.
- We have shown that the exact $l_1$-regularized solutions converge to an $l_1$-margin-maximizing separating hyper-plane.

We hope our results will help in better understanding how and why boosting works. It is an interesting and challenging task to separate the effects of the different components of a boosting algorithm:

- Loss criterion
- Dictionary and greedy learning method
- Line search / slow learning

and relate them to its success in different scenarios. The implicit $l_1$ regularization in boosting may also contribute to its success: it has been shown that in some situations $l_1$ regularization is inherently superior to others (see [4]).

An important issue when analyzing boosting is over-fitting in the noisy data case. To deal with over-fitting, Raetsch et al. [13] propose several regularization methods and generalizations of the original AdaBoost algorithm to achieve a soft margin by introducing slack variables. Our results indicate that the models along the boosting path can be regarded as $l_1$-regularized versions of the optimal separator, hence regularization can be done more directly and naturally by stopping the boosting iterations early. It is essentially a choice of the constraint parameter c.

Many other questions arise from our view of boosting. Among the issues to be considered:

- Is there a similar "separator" view of multi-class boosting? We have some tentative results to indicate that this might be the case if the boosting problem is formulated properly.
- Can the constrained optimization view of boosting help in producing generalization error bounds for boosting that would be more tight than the current existing ones?

Acknowledgments

We thank Stephen Boyd, Brad Efron, Jerry Friedman, Robert Schapire and Rob Tibshirani for helpful discussions. This work is partially supported by a Stanford graduate fellowship, grant DMS from the National Science Foundation, and grant ROI-CA from the National Institutes of Health.

References

[1] Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.
[2] Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation 7.
[3] Collins, M., Schapire, R.E. & Singer, Y. (2000). Logistic regression, AdaBoost and Bregman distances. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory.
[4] Donoho, D., Johnstone, I., Kerkyacharian, G. & Picard, D. (1995). Wavelet shrinkage: asymptopia? (with discussion). J. Royal Statist. Soc. 57.
[5] Efron, B., Hastie, T., Johnstone, I.M. & Tibshirani, R. (2002). Least Angle Regression. Technical report, Department of Statistics, Stanford University.
[6] Freund, Y. & Schapire, R.E. (1995). A decision theoretic generalization of on-line learning and an application to boosting. Proceedings of the 2nd European Conference on Computational Learning Theory.
[7] Friedman, J.H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, Vol. 29, No. 5.
[8] Friedman, J.H., Hastie, T. & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics 28.
[9] Hastie, T., Tibshirani, R. & Friedman, J.H. (2001). Elements of Statistical Learning. Springer-Verlag, New York.
[10] Mangasarian, O.L. (1999). Arbitrary-norm separating plane. Operations Research Letters, pp. 15-23.
[11] Mason, L., Baxter, J., Bartlett, P. & Frean, M. (1999). Boosting algorithms as gradient descent in function space. Neural Information Processing Systems 12.


More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

Natural Language Processing and Information Retrieval

Natural Language Processing and Information Retrieval Natural Language Processng and Informaton Retreval Support Vector Machnes Alessandro Moschtt Department of nformaton and communcaton technology Unversty of Trento Emal: moschtt@ds.untn.t Summary Support

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix Lectures - Week 4 Matrx norms, Condtonng, Vector Spaces, Lnear Independence, Spannng sets and Bass, Null space and Range of a Matrx Matrx Norms Now we turn to assocatng a number to each matrx. We could

More information

Advanced Introduction to Machine Learning

Advanced Introduction to Machine Learning Advanced Introducton to Machne Learnng 10715, Fall 2014 The Kernel Trck, Reproducng Kernel Hlbert Space, and the Representer Theorem Erc Xng Lecture 6, September 24, 2014 Readng: Erc Xng @ CMU, 2014 1

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

Numerical Heat and Mass Transfer

Numerical Heat and Mass Transfer Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

Homework Assignment 3 Due in class, Thursday October 15

Homework Assignment 3 Due in class, Thursday October 15 Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.

More information

Linear Classification, SVMs and Nearest Neighbors

Linear Classification, SVMs and Nearest Neighbors 1 CSE 473 Lecture 25 (Chapter 18) Lnear Classfcaton, SVMs and Nearest Neghbors CSE AI faculty + Chrs Bshop, Dan Klen, Stuart Russell, Andrew Moore Motvaton: Face Detecton How do we buld a classfer to dstngush

More information

A Robust Method for Calculating the Correlation Coefficient

A Robust Method for Calculating the Correlation Coefficient A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal

More information

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z )

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z ) C4B Machne Learnng Answers II.(a) Show that for the logstc sgmod functon dσ(z) dz = σ(z) ( σ(z)) A. Zsserman, Hlary Term 20 Start from the defnton of σ(z) Note that Then σ(z) = σ = dσ(z) dz = + e z e z

More information

CSE 252C: Computer Vision III

CSE 252C: Computer Vision III CSE 252C: Computer Vson III Lecturer: Serge Belonge Scrbe: Catherne Wah LECTURE 15 Kernel Machnes 15.1. Kernels We wll study two methods based on a specal knd of functon k(x, y) called a kernel: Kernel

More information

CSE 546 Midterm Exam, Fall 2014(with Solution)

CSE 546 Midterm Exam, Fall 2014(with Solution) CSE 546 Mdterm Exam, Fall 014(wth Soluton) 1. Personal nfo: Name: UW NetID: Student ID:. There should be 14 numbered pages n ths exam (ncludng ths cover sheet). 3. You can use any materal you brought:

More information

Chapter 13: Multiple Regression

Chapter 13: Multiple Regression Chapter 13: Multple Regresson 13.1 Developng the multple-regresson Model The general model can be descrbed as: It smplfes for two ndependent varables: The sample ft parameter b 0, b 1, and b are used to

More information

Module 9. Lecture 6. Duality in Assignment Problems

Module 9. Lecture 6. Duality in Assignment Problems Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Learning Theory: Lecture Notes

Learning Theory: Lecture Notes Learnng Theory: Lecture Notes Lecturer: Kamalka Chaudhur Scrbe: Qush Wang October 27, 2012 1 The Agnostc PAC Model Recall that one of the constrants of the PAC model s that the data dstrbuton has to be

More information

Supporting Information

Supporting Information Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they

More information

MMA and GCMMA two methods for nonlinear optimization

MMA and GCMMA two methods for nonlinear optimization MMA and GCMMA two methods for nonlnear optmzaton Krster Svanberg Optmzaton and Systems Theory, KTH, Stockholm, Sweden. krlle@math.kth.se Ths note descrbes the algorthms used n the author s 2007 mplementatons

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Maximal Margin Classifier

Maximal Margin Classifier CS81B/Stat41B: Advanced Topcs n Learnng & Decson Makng Mamal Margn Classfer Lecturer: Mchael Jordan Scrbes: Jana van Greunen Corrected verson - /1/004 1 References/Recommended Readng 1.1 Webstes www.kernel-machnes.org

More information

Kristin P. Bennett. Rensselaer Polytechnic Institute

Kristin P. Bennett. Rensselaer Polytechnic Institute Support Vector Machnes and Other Kernel Methods Krstn P. Bennett Mathematcal Scences Department Rensselaer Polytechnc Insttute Support Vector Machnes (SVM) A methodology for nference based on Statstcal

More information

Lagrange Multipliers Kernel Trick

Lagrange Multipliers Kernel Trick Lagrange Multplers Kernel Trck Ncholas Ruozz Unversty of Texas at Dallas Based roughly on the sldes of Davd Sontag General Optmzaton A mathematcal detour, we ll come back to SVMs soon! subject to: f x

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

Semi-supervised Classification with Active Query Selection

Semi-supervised Classification with Active Query Selection Sem-supervsed Classfcaton wth Actve Query Selecton Jao Wang and Swe Luo School of Computer and Informaton Technology, Beng Jaotong Unversty, Beng 00044, Chna Wangjao088@63.com Abstract. Labeled samples

More information

Linear Feature Engineering 11

Linear Feature Engineering 11 Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19

More information

Structure and Drive Paul A. Jensen Copyright July 20, 2003

Structure and Drive Paul A. Jensen Copyright July 20, 2003 Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.

More information

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018 INF 5860 Machne learnng for mage classfcaton Lecture 3 : Image classfcaton and regresson part II Anne Solberg January 3, 08 Today s topcs Multclass logstc regresson and softma Regularzaton Image classfcaton

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

Lecture 12: Classification

Lecture 12: Classification Lecture : Classfcaton g Dscrmnant functons g The optmal Bayes classfer g Quadratc classfers g Eucldean and Mahalanobs metrcs g K Nearest Neghbor Classfers Intellgent Sensor Systems Rcardo Guterrez-Osuna

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012 MLE and Bayesan Estmaton Je Tang Department of Computer Scence & Technology Tsnghua Unversty 01 1 Lnear Regresson? As the frst step, we need to decde how we re gong to represent the functon f. One example:

More information

Neural networks. Nuno Vasconcelos ECE Department, UCSD

Neural networks. Nuno Vasconcelos ECE Department, UCSD Neural networs Nuno Vasconcelos ECE Department, UCSD Classfcaton a classfcaton problem has two types of varables e.g. X - vector of observatons (features) n the world Y - state (class) of the world x X

More information

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #16 Scribe: Yannan Wang April 3, 2014

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #16 Scribe: Yannan Wang April 3, 2014 COS 511: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture #16 Scrbe: Yannan Wang Aprl 3, 014 1 Introducton The goal of our onlne learnng scenaro from last class s C comparng wth best expert and

More information

MA 323 Geometric Modelling Course Notes: Day 13 Bezier Curves & Bernstein Polynomials

MA 323 Geometric Modelling Course Notes: Day 13 Bezier Curves & Bernstein Polynomials MA 323 Geometrc Modellng Course Notes: Day 13 Bezer Curves & Bernsten Polynomals Davd L. Fnn Over the past few days, we have looked at de Casteljau s algorthm for generatng a polynomal curve, and we have

More information

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan Kernels n Support Vector Machnes Based on lectures of Martn Law, Unversty of Mchgan Non Lnear separable problems AND OR NOT() The XOR problem cannot be solved wth a perceptron. XOR Per Lug Martell - Systems

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

SDMML HT MSc Problem Sheet 4

SDMML HT MSc Problem Sheet 4 SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be

More information

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal Inner Product Defnton 1 () A Eucldean space s a fnte-dmensonal vector space over the reals R, wth an nner product,. Defnton 2 (Inner Product) An nner product, on a real vector space X s a symmetrc, blnear,

More information

Supplement: Proofs and Technical Details for The Solution Path of the Generalized Lasso

Supplement: Proofs and Technical Details for The Solution Path of the Generalized Lasso Supplement: Proofs and Techncal Detals for The Soluton Path of the Generalzed Lasso Ryan J. Tbshran Jonathan Taylor In ths document we gve supplementary detals to the paper The Soluton Path of the Generalzed

More information

On the Multicriteria Integer Network Flow Problem

On the Multicriteria Integer Network Flow Problem BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 5, No 2 Sofa 2005 On the Multcrtera Integer Network Flow Problem Vassl Vasslev, Marana Nkolova, Maryana Vassleva Insttute of

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

Support Vector Machines

Support Vector Machines CS 2750: Machne Learnng Support Vector Machnes Prof. Adrana Kovashka Unversty of Pttsburgh February 17, 2016 Announcement Homework 2 deadlne s now 2/29 We ll have covered everythng you need today or at

More information

NUMERICAL DIFFERENTIATION

NUMERICAL DIFFERENTIATION NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

Structural Extensions of Support Vector Machines. Mark Schmidt March 30, 2009

Structural Extensions of Support Vector Machines. Mark Schmidt March 30, 2009 Structural Extensons of Support Vector Machnes Mark Schmdt March 30, 2009 Formulaton: Bnary SVMs Multclass SVMs Structural SVMs Tranng: Subgradents Cuttng Planes Margnal Formulatons Mn-Max Formulatons

More information

1 The Mistake Bound Model

1 The Mistake Bound Model 5-850: Advanced Algorthms CMU, Sprng 07 Lecture #: Onlne Learnng and Multplcatve Weghts February 7, 07 Lecturer: Anupam Gupta Scrbe: Bryan Lee,Albert Gu, Eugene Cho he Mstake Bound Model Suppose there

More information

Linear Regression Analysis: Terminology and Notation

Linear Regression Analysis: Terminology and Notation ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented

More information

Explaining the Stein Paradox

Explaining the Stein Paradox Explanng the Sten Paradox Kwong Hu Yung 1999/06/10 Abstract Ths report offers several ratonale for the Sten paradox. Sectons 1 and defnes the multvarate normal mean estmaton problem and ntroduces Sten

More information

Multilayer Perceptron (MLP)

Multilayer Perceptron (MLP) Multlayer Perceptron (MLP) Seungjn Cho Department of Computer Scence and Engneerng Pohang Unversty of Scence and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjn@postech.ac.kr 1 / 20 Outlne

More information

Simulated Power of the Discrete Cramér-von Mises Goodness-of-Fit Tests

Simulated Power of the Discrete Cramér-von Mises Goodness-of-Fit Tests Smulated of the Cramér-von Mses Goodness-of-Ft Tests Steele, M., Chaselng, J. and 3 Hurst, C. School of Mathematcal and Physcal Scences, James Cook Unversty, Australan School of Envronmental Studes, Grffth

More information

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.

More information

The Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD

The Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD he Gaussan classfer Nuno Vasconcelos ECE Department, UCSD Bayesan decson theory recall that we have state of the world X observatons g decson functon L[g,y] loss of predctng y wth g Bayes decson rule s

More information

Classification as a Regression Problem

Classification as a Regression Problem Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mnng Massve Datasets Jure Leskovec, Stanford Unversty http://cs246.stanford.edu 2/19/18 Jure Leskovec, Stanford CS246: Mnng Massve Datasets, http://cs246.stanford.edu 2 Hgh dm. data Graph data Infnte

More information

Exercises. 18 Algorithms

Exercises. 18 Algorithms 18 Algorthms Exercses 0.1. In each of the followng stuatons, ndcate whether f = O(g), or f = Ω(g), or both (n whch case f = Θ(g)). f(n) g(n) (a) n 100 n 200 (b) n 1/2 n 2/3 (c) 100n + log n n + (log n)

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016 U.C. Berkeley CS94: Spectral Methods and Expanders Handout 8 Luca Trevsan February 7, 06 Lecture 8: Spectral Algorthms Wrap-up In whch we talk about even more generalzatons of Cheeger s nequaltes, and

More information

Lecture 20: Lift and Project, SDP Duality. Today we will study the Lift and Project method. Then we will prove the SDP duality theorem.

Lecture 20: Lift and Project, SDP Duality. Today we will study the Lift and Project method. Then we will prove the SDP duality theorem. prnceton u. sp 02 cos 598B: algorthms and complexty Lecture 20: Lft and Project, SDP Dualty Lecturer: Sanjeev Arora Scrbe:Yury Makarychev Today we wll study the Lft and Project method. Then we wll prove

More information

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) Maxmum Lkelhood Estmaton (MLE) Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Wnter 01 UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y

More information

Foundations of Arithmetic

Foundations of Arithmetic Foundatons of Arthmetc Notaton We shall denote the sum and product of numbers n the usual notaton as a 2 + a 2 + a 3 + + a = a, a 1 a 2 a 3 a = a The notaton a b means a dvdes b,.e. ac = b where c s an

More information

Chapter Newton s Method

Chapter Newton s Method Chapter 9. Newton s Method After readng ths chapter, you should be able to:. Understand how Newton s method s dfferent from the Golden Secton Search method. Understand how Newton s method works 3. Solve

More information

Lecture 17: Lee-Sidford Barrier

Lecture 17: Lee-Sidford Barrier CSE 599: Interplay between Convex Optmzaton and Geometry Wnter 2018 Lecturer: Yn Tat Lee Lecture 17: Lee-Sdford Barrer Dsclamer: Please tell me any mstake you notced. In ths lecture, we talk about the

More information

Lecture 3: Dual problems and Kernels

Lecture 3: Dual problems and Kernels Lecture 3: Dual problems and Kernels C4B Machne Learnng Hlary 211 A. Zsserman Prmal and dual forms Lnear separablty revsted Feature mappng Kernels for SVMs Kernel trck requrements radal bass functons SVM

More information

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also

More information

Section 8.3 Polar Form of Complex Numbers

Section 8.3 Polar Form of Complex Numbers 80 Chapter 8 Secton 8 Polar Form of Complex Numbers From prevous classes, you may have encountered magnary numbers the square roots of negatve numbers and, more generally, complex numbers whch are the

More information

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4) I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes

More information