Journal of Machine Learning Research X (2008) 1-34    Submitted 1/08; Revised 8/08; Published XX/XX

SimpleMKL

Alain Rakotomamonjy (alain.rakotomamonjy@insa-rouen.fr), LITIS EA 4108, Université de Rouen, 76800 Saint Etienne du Rouvray, France

Francis R. Bach (francis.bach@mines.org), INRIA - WILLOW Project-Team, Laboratoire d'Informatique de l'Ecole Normale Supérieure (CNRS/ENS/INRIA UMR 8548), 45, Rue d'Ulm, 75230 Paris, France

Stéphane Canu (stephane.canu@insa-rouen.fr), LITIS EA 4108, INSA de Rouen, 76801 Saint Etienne du Rouvray, France

Yves Grandvalet (yves.grandvalet@utc.fr), Idiap Research Institute, Centre du Parc, 1920 Martigny, Switzerland; Heudiasyc, CNRS/Université de Technologie de Compiègne (UMR 6599), 60205 Compiègne, France

Editor: Nello Cristianini

Abstract

Multiple kernel learning (MKL) aims at simultaneously learning a kernel and the associated predictor in supervised learning settings. For the support vector machine, an efficient and general multiple kernel learning algorithm, based on semi-infinite linear programming, has recently been proposed. This approach has opened new perspectives, since it makes MKL tractable for large-scale problems by iteratively using existing support vector machine code. However, it turns out that this iterative algorithm needs numerous iterations to converge towards a reasonable solution. In this paper, we address the MKL problem through a weighted 2-norm regularization formulation with an additional constraint on the weights that encourages sparse kernel combinations. Apart from learning the combination, we solve a standard SVM optimization problem, where the kernel is defined as a linear combination of multiple kernels. We propose an algorithm, named SimpleMKL, for solving this MKL problem and provide a new insight on MKL algorithms based on mixed-norm regularization by showing that the two approaches are equivalent. We show how SimpleMKL can be applied beyond binary classification, to problems like regression, clustering (one-class classification) or multiclass classification. Experimental results show that the proposed algorithm converges rapidly and that its efficiency compares favorably to other MKL algorithms. Finally, we illustrate the usefulness of MKL for some regressors based on wavelet kernels and on some model selection problems related to multiclass classification.

© 2008 Rakotomamonjy et al.

1. Introduction

During the last few years, kernel methods such as support vector machines (SVM) have proved to be efficient tools for solving learning problems like classification or regression (Schölkopf and Smola, 2001). For such tasks, the performance of the learning algorithm strongly depends on the data representation. In kernel methods, the data representation is implicitly chosen through the so-called kernel K(x, x'). This kernel actually plays two roles: it defines the similarity between two examples x and x', while defining an appropriate regularization term for the learning problem.

Let {x_i, y_i}_{i=1}^{l} be the learning set, where x_i belongs to some input space X and y_i is the target value for pattern x_i. For kernel algorithms, the solution of the learning problem is of the form

    f(x) = \sum_{i=1}^{l} \alpha_i^\star K(x, x_i) + b^\star,        (1)

where \alpha_i^\star and b^\star are coefficients to be learned from examples, while K(·, ·) is a given positive definite kernel associated with a reproducing kernel Hilbert space (RKHS) H.

In some situations, a machine learning practitioner may be interested in more flexible models. Recent applications have shown that using multiple kernels instead of a single one can enhance the interpretability of the decision function and improve performances (Lanckriet et al., 2004a). In such cases, a convenient approach is to consider that the kernel K(x, x') is actually a convex combination of basis kernels:

    K(x, x') = \sum_{m=1}^{M} d_m K_m(x, x'),   with   d_m \geq 0,  \sum_{m=1}^{M} d_m = 1,

where M is the total number of kernels. Each basis kernel K_m may either use the full set of variables describing x or subsets of variables stemming from different data sources (Lanckriet et al., 2004a). Alternatively, the kernels K_m can simply be classical kernels (such as Gaussian kernels) with different parameters. Within this framework, the problem of data representation through the kernel is then transferred to the choice of the weights d_m.

Learning both the coefficients \alpha_i and the weights d_m in a single optimization problem is known as the multiple kernel learning (MKL) problem. For binary classification, the MKL problem has been introduced by Lanckriet et al. (2004b), resulting in a quadratically constrained quadratic programming problem that rapidly becomes intractable as the number of learning examples or kernels becomes large. What makes this problem difficult is that it is actually a convex but non-smooth minimization problem. Indeed, Bach et al. (2004a) have shown that the MKL formulation of Lanckriet et al. (2004b) is actually the dual of an SVM problem in which the weight vector has been regularized according to a mixed (l2, l1)-norm instead of the classical squared l2-norm. Bach et al. (2004a) have considered a smoothed version of the problem, for which they proposed an SMO-like algorithm that makes it possible to tackle medium-scale problems. Sonnenburg et al. (2006) reformulate the MKL problem of Lanckriet et al. (2004b) as a semi-infinite linear program (SILP). The advantage of the latter formulation is that the algorithm addresses the problem by iteratively solving a classical SVM problem with a single kernel, for which many efficient toolboxes exist (Vishwanathan et al., 2003; Loosli et al., 2005; Chang and Lin, 2001), and a linear program whose number of constraints increases along with the iterations.

A very nice feature of this algorithm is that it can be extended to a large class of convex loss functions. For instance, Zien and Ong (2007) have proposed a multiclass MKL algorithm based on similar ideas.

In this paper, we present another formulation of the multiple kernel learning problem. We first depart from the primal formulation proposed by Bach et al. (2004a) and further used by Bach et al. (2004b) and Sonnenburg et al. (2006). Indeed, we replace the mixed-norm regularization by a weighted l2-norm regularization, where the sparsity of the linear combination of kernels is controlled by an l1-norm constraint on the kernel weights. This new formulation of MKL leads to a smooth and convex optimization problem. By using a variational formulation of the mixed-norm regularization, we show that our formulation is equivalent to the ones of Lanckriet et al. (2004b), Bach et al. (2004a) and Sonnenburg et al. (2006).

The main contribution of this paper is to propose an efficient algorithm, named SimpleMKL, for solving the MKL problem through a primal formulation involving a weighted l2-norm regularization. Our algorithm is simple, being essentially based on a gradient descent on the SVM objective value. We iteratively determine the combination of kernels by a gradient descent wrapping a standard SVM solver, which is SimpleSVM in our case. Our scheme is similar to the one of Sonnenburg et al. (2006), and both algorithms minimize the same objective function. However, they differ in that we use reduced gradient descent in the primal, whereas Sonnenburg et al.'s SILP relies on cutting planes. We will empirically show that our optimization strategy is more efficient, with new evidence confirming the preliminary results reported in Rakotomamonjy et al. (2007). Then, extensions of SimpleMKL to other supervised learning problems such as regression SVM, one-class SVM or multiclass SVM problems based on pairwise coupling are proposed. Although it is not the main purpose of the paper, we will also discuss the applicability of our approach to general convex loss functions.

This paper also presents several illustrations of the usefulness of our algorithm. For instance, in addition to the empirical efficiency comparison, we show, on an SVM regression problem involving wavelet kernels, that automatic learning of the kernels leads to far better performances. Then we depict how our MKL algorithm behaves on some multiclass problems.

The paper is organized as follows. Section 2 presents the functional setting of our MKL problem and its formulation. Details on the algorithm and a discussion of convergence and computational complexity are given in Section 3. Extensions of our algorithm to other SVM problems are discussed in Section 4, while experimental results dealing with computational complexity or with comparison with other model selection methods are presented in Section 5. A SimpleMKL toolbox based on Matlab code is available online; this toolbox is an extension of our SVM-KM toolbox (Canu et al., 2003).

2. Multiple Kernel Learning Framework

In this section, we present our MKL formulation and derive its dual. In the sequel, i and j are indices on examples, whereas m is the kernel index. In order to lighten notations, we omit to specify that summations on i and j go from 1 to l, and that summations on m go from 1 to M.

2.1 Functional framework

Before entering into the details of the MKL optimization problem, we first present the functional framework adopted for multiple kernel learning. Assume K_m, m = 1, ..., M, are M positive definite kernels on the same input space X, each of them being associated with an RKHS H_m endowed with an inner product ⟨·,·⟩_m. For any m, let d_m be a non-negative coefficient and H_m' be the Hilbert space derived from H_m as follows:

    H_m' = { f | f \in H_m :  \|f\|_{H_m} / d_m < \infty },

endowed with the inner product

    ⟨f, g⟩_{H_m'} = \frac{1}{d_m} ⟨f, g⟩_m.

In this paper, we use the convention that x/0 = 0 if x = 0, and \infty otherwise. This means that, if d_m = 0, then a function f belongs to the Hilbert space H_m' only if f = 0 in H_m. In such a case, H_m' is restricted to the null element of H_m.

Within this framework, H_m' is an RKHS with kernel K(x, x') = d_m K_m(x, x'), since for any f \in H_m',

    f(x) = ⟨f(·), K_m(x, ·)⟩_m = \frac{1}{d_m} ⟨f(·), d_m K_m(x, ·)⟩_m = ⟨f(·), d_m K_m(x, ·)⟩_{H_m'}.

Now, if we define H as the direct sum of the spaces H_m', that is, H = \bigoplus_{m=1}^{M} H_m', then a classical result on RKHS (Aronszajn, 1950) says that H is an RKHS with kernel

    K(x, x') = \sum_{m=1}^{M} d_m K_m(x, x').

Owing to this simple construction, we have built an RKHS H in which any function is a sum of functions belonging to the spaces H_m'. In our framework, MKL aims at determining the set of coefficients {d_m} within the learning process of the decision function. The multiple kernel learning problem can thus be envisioned as learning a predictor belonging to an adaptive hypothesis space endowed with an adaptive inner product. The forthcoming sections explain how we solve this problem.

2.2 Multiple kernel learning primal problem

In the SVM methodology, the decision function is of the form given in equation (1), where the optimal parameters \alpha_i^\star and b^\star are obtained by solving the dual of the following optimization problem:

    \min_{f,b,\xi}  \frac{1}{2}\|f\|_{H}^2 + C \sum_i \xi_i
    s.t.  y_i (f(x_i) + b) \geq 1 - \xi_i,   \xi_i \geq 0,   \forall i.

In the MKL framework, one looks for a decision function of the form f(x) + b = \sum_m f_m(x) + b, where each function f_m belongs to a different RKHS H_m associated with a kernel K_m. According to the functional framework above, and inspired by the multiple smoothing splines framework of Wahba (1990, chap. 10), we propose to address the MKL SVM problem by solving the following convex problem (see proof in the appendix), which will be referred to as the primal MKL problem:

    \min_{\{f_m\},b,\xi,d}  \frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|_{H_m}^2 + C \sum_i \xi_i
    s.t.  y_i \sum_m f_m(x_i) + y_i b \geq 1 - \xi_i,   \forall i
          \xi_i \geq 0,   \forall i                                          (2)
          \sum_m d_m = 1,   d_m \geq 0,   \forall m,

where each d_m controls the squared norm of f_m in the objective function. The smaller d_m is, the smoother f_m (as measured by \|f_m\|_{H_m}) should be. When d_m = 0, \|f_m\|_{H_m} also has to be equal to zero to yield a finite objective value. The l1-norm constraint on the vector d is a sparsity constraint that will force some d_m to be zero, thus encouraging sparse basis kernel expansions.

2.3 Connections with the mixed-norm regularization formulation of MKL

The MKL formulation introduced by Bach et al. (2004a) and further developed by Sonnenburg et al. (2006) consists in solving an optimization problem expressed in functional form as

    \min_{\{f_m\},b,\xi}  \frac{1}{2}\Big(\sum_m \|f_m\|_{H_m}\Big)^2 + C \sum_i \xi_i
    s.t.  y_i \sum_m f_m(x_i) + y_i b \geq 1 - \xi_i,   \xi_i \geq 0,   \forall i.        (3)

Note that the objective function of this problem is not smooth, since \|f_m\|_{H_m} is not differentiable at f_m = 0. However, what makes this formulation interesting is that the mixed-norm penalization of f = \sum_m f_m is a soft-thresholding penalizer that leads to a sparse solution, for which the algorithm performs kernel selection (Bach, 2008). We have stated in the previous section that our problem should also lead to sparse solutions. In the following, we show that formulations (2) and (3) are equivalent.

For this purpose, we simply show that the variational formulation of the mixed-norm regularization is equal to the weighted 2-norm regularization (a particular case of a more general equivalence proposed by Micchelli and Pontil, 2005). Indeed, by the Cauchy-Schwarz inequality, for any vector d on the simplex,

    \Big(\sum_m \|f_m\|_{H_m}\Big)^2 = \Big(\sum_m \frac{\|f_m\|_{H_m}}{d_m^{1/2}}\, d_m^{1/2}\Big)^2
    \leq \Big(\sum_m \frac{\|f_m\|_{H_m}^2}{d_m}\Big)\Big(\sum_m d_m\Big) = \sum_m \frac{\|f_m\|_{H_m}^2}{d_m},

where equality is met when d_m^{1/2} is proportional to \|f_m\|_{H_m}/d_m^{1/2}, that is,

    d_m = \frac{\|f_m\|_{H_m}}{\sum_q \|f_q\|_{H_q}},        (4)

which leads to

    \min_{d \geq 0,\ \sum_m d_m = 1}  \sum_m \frac{\|f_m\|_{H_m}^2}{d_m} = \Big(\sum_m \|f_m\|_{H_m}\Big)^2.        (5)

Hence, owing to this variational formulation, the non-smooth mixed-norm objective function of problem (3) has been turned into the smooth objective function of problem (2). Although the number of variables has increased, we will see that this problem can be solved more efficiently.

2.4 The MKL dual problem

The dual problem is a key point for deriving MKL algorithms and for studying their convergence properties. Since our primal problem (2) is equivalent to the one of Bach et al. (2004a), they lead to the same dual. However, our primal formulation being convex and differentiable, it provides a simple derivation of the dual that does not use conic duality. The Lagrangian of problem (2) is

    L = \frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|_{H_m}^2 + C \sum_i \xi_i
        + \sum_i \alpha_i \Big(1 - \xi_i - y_i \sum_m f_m(x_i) - y_i b\Big) - \sum_i \nu_i \xi_i
        + \lambda \Big(\sum_m d_m - 1\Big) - \sum_m \eta_m d_m,        (6)

where \alpha_i and \nu_i are the Lagrange multipliers of the constraints related to the usual SVM problem, whereas \lambda and \eta_m are associated with the constraints on d_m.

Setting to zero the gradient of the Lagrangian with respect to the primal variables gives the following optimality conditions:

    (a)  f_m(\cdot) = d_m \sum_i \alpha_i y_i K_m(\cdot, x_i),   \forall m,
    (b)  \sum_i \alpha_i y_i = 0,
    (c)  C - \alpha_i - \nu_i = 0,   \forall i,        (7)
    (d)  -\frac{1}{2}\frac{\|f_m\|_{H_m}^2}{d_m^2} + \lambda - \eta_m = 0,   \forall m.

We note again here that f_m(\cdot) has to go to 0 if the coefficient d_m vanishes. Plugging these optimality conditions into the Lagrangian gives the dual problem

    \max_{\alpha,\lambda}  \sum_i \alpha_i - \lambda
    s.t.  \sum_i \alpha_i y_i = 0,
          0 \leq \alpha_i \leq C,   \forall i,        (8)
          \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K_m(x_i, x_j) \leq \lambda,   \forall m.

(Note that the formulation of Bach et al. (2004a) differs slightly, in that the kernels are weighted by some pre-defined coefficients that are not considered here.) This dual problem is difficult to optimize because of the last constraint. This constraint may be moved to the objective function, but the latter then becomes non-differentiable, causing new difficulties (Bach et al., 2004a). Hence, in the forthcoming section, we propose an approach based on the minimization of the primal. In this framework, we benefit from differentiability, which allows for an efficient derivation of an approximate primal solution, whose accuracy will be monitored by the duality gap.

3. Algorithm for solving the MKL primal problem

One possible approach for solving problem (2) is to use the alternate optimization algorithm applied by Grandvalet and Canu (1999, 2003) in another context. In the first step, problem (2) is optimized with respect to f_m, b and \xi, with d fixed. Then, in the second step, the weight vector d is updated to decrease the objective function of problem (2), with f_m, b and \xi being fixed. In Section 2.3, we showed that this second step can be carried out in closed form. However, this approach lacks convergence guarantees and may lead to numerical problems, in particular when some elements of d approach zero (Grandvalet, 1998). Note that these numerical problems can be handled by introducing a perturbed version of the alternate algorithm, as shown by Argyriou et al. (2008).

Instead of using an alternate optimization algorithm, we prefer to consider here the following constrained optimization problem:

    \min_{d}  J(d)   such that   \sum_{m=1}^{M} d_m = 1,   d_m \geq 0,        (9)

where

    J(d) = \min_{\{f_m\},b,\xi}  \frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|_{H_m}^2 + C \sum_i \xi_i
           s.t.  y_i \sum_m f_m(x_i) + y_i b \geq 1 - \xi_i,   \xi_i \geq 0,   \forall i.        (10)

We show below how to solve problem (9) on the simplex by a simple gradient method. We first note that the objective function J(d) is actually an optimal SVM objective value. We then discuss the existence and computation of the gradient of J(·), which is at the core of the proposed approach.

3.1 Computing the optimal SVM value and its derivatives

The Lagrangian of problem (10) is identical to the first line of equation (6). By setting to zero the derivatives of this Lagrangian with respect to the primal variables, we get conditions (7) (a) to (c), from which we derive the associated dual problem

    \max_{\alpha}  -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \sum_m d_m K_m(x_i, x_j) + \sum_i \alpha_i
    s.t.  \sum_i \alpha_i y_i = 0,   0 \leq \alpha_i \leq C,   \forall i,        (11)

which is identified as the standard SVM dual formulation using the combined kernel K(x_i, x_j) = \sum_m d_m K_m(x_i, x_j). The function J(d) is defined as the optimal objective value of problem (10). Because of strong duality, J(d) is also the objective value of the dual problem:

    J(d) = -\frac{1}{2}\sum_{i,j} \alpha_i^\star \alpha_j^\star y_i y_j \sum_m d_m K_m(x_i, x_j) + \sum_i \alpha_i^\star,        (12)

where \alpha^\star maximizes (11). Note that the objective value J(d) can be obtained by any SVM algorithm. Our method can thus take advantage of any progress in single kernel algorithms. In particular, if the SVM algorithm we use is able to handle large-scale problems, so will our MKL algorithm. Thus, the overall complexity of SimpleMKL is tied to that of the single kernel SVM algorithm.

From now on, we assume that each Gram matrix (K_m(x_i, x_j))_{i,j} is positive definite, with all eigenvalues greater than some \eta > 0 (to enforce this property, a small ridge may be added to the diagonal of the Gram matrices). This implies that, for any admissible value of d, the dual problem is strictly concave with convexity parameter \eta (Lemaréchal and Sagastizábal, 1997). In turn, this strict concavity ensures that \alpha^\star is unique, a characteristic that eases the analysis of the differentiability of J(·).

Existence and computation of derivatives of optimal value functions such as J(·) have been largely discussed in the literature. For our purpose, the appropriate reference is Theorem 4.1 in Bonnans and Shapiro (1998), which has already been applied by Chapelle et al. (2002) for tuning the squared-hinge loss SVM. This theorem is reproduced in the appendix for self-containedness.
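Since J(d) is nothing but a single-kernel SVM objective value on the combined kernel, it can be evaluated with any off-the-shelf solver. The following minimal sketch illustrates this in Python with scikit-learn's SVC on a precomputed kernel; it is an illustration only, not the authors' Matlab toolbox, and it assumes the array Ks of base Gram matrices (shape M x l x l) and labels y in {-1, +1} are given.

import numpy as np
from sklearn.svm import SVC

def svm_objective(Ks, d, y, C):
    """Evaluate J(d): solve the single-kernel SVM on K = sum_m d_m K_m and
    return the dual objective value together with the fitted model."""
    K = np.tensordot(d, Ks, axes=1)                    # combined Gram matrix
    svm = SVC(C=C, kernel="precomputed").fit(K, y)
    sv = svm.support_                                  # indices of the support vectors
    a = svm.dual_coef_[0]                              # a_i = y_i * alpha_i on the support vectors
    # J(d) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K(x_i, x_j)
    J = np.abs(a).sum() - 0.5 * a @ K[np.ix_(sv, sv)] @ a
    return J, svm

Any solver that exposes its dual variables could be substituted; the only point is that evaluating J(d) costs one standard SVM training.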

In a nutshell, the theorem says that differentiability of J(d) is ensured by the unicity of \alpha^\star and by the differentiability of the objective function that gives J(d). Furthermore, the derivatives of J(d) can be computed as if \alpha^\star did not depend on d. Thus, by simple differentiation of the dual function (11) with respect to d_m, we have

    \frac{\partial J}{\partial d_m} = -\frac{1}{2}\sum_{i,j} \alpha_i^\star \alpha_j^\star y_i y_j K_m(x_i, x_j),   \forall m.        (13)

We will see in the sequel that the applicability of this theorem can be extended to other SVM problems. Note that the complexity of the gradient computation is of the order of n_{SV}^2, with n_{SV} being the number of support vectors for the current d.

3.2 Reduced gradient algorithm

The optimization problem we have to deal with in (9) has a non-linear objective function with constraints over the simplex. With our positivity assumption on the kernel matrices, J(·) is convex and differentiable with Lipschitz gradient (Lemaréchal and Sagastizábal, 1997). The approach we use for solving this problem is a reduced gradient method, which converges for such functions (Luenberger, 1984).

Once the gradient of J(d) is computed, d is updated using a descent direction ensuring that the equality constraint and the non-negativity constraints on d are satisfied. We handle the equality constraint by computing the reduced gradient (Luenberger, 1984, Chap. 11). Let d_\mu be a non-zero entry of d; the reduced gradient of J(d), denoted \nabla_{red} J, has components

    [\nabla_{red} J]_m = \frac{\partial J}{\partial d_m} - \frac{\partial J}{\partial d_\mu},   \forall m \neq \mu,
    and   [\nabla_{red} J]_\mu = \sum_{m \neq \mu} \Big(\frac{\partial J}{\partial d_\mu} - \frac{\partial J}{\partial d_m}\Big).

We choose \mu to be the index of the largest component of the vector d, for better numerical stability (Bonnans, 2006).

The positivity constraints also have to be taken into account in the descent direction. Since we want to minimize J(·), -\nabla_{red} J is a descent direction. However, if there is an index m such that d_m = 0 and [\nabla_{red} J]_m > 0, using this direction would violate the positivity constraint for d_m. Hence, the descent direction for that component is set to 0. This gives the descent direction for updating d as

    D_m = 0                                                              if d_m = 0 and \frac{\partial J}{\partial d_m} - \frac{\partial J}{\partial d_\mu} > 0,
    D_m = -\frac{\partial J}{\partial d_m} + \frac{\partial J}{\partial d_\mu}       if d_m > 0 and m \neq \mu,        (14)
    D_\mu = \sum_{\nu \neq \mu,\ d_\nu > 0} \Big(\frac{\partial J}{\partial d_\nu} - \frac{\partial J}{\partial d_\mu}\Big)   for m = \mu.

The usual updating scheme is d \leftarrow d + \gamma D, where \gamma is the step size. Here, as detailed in Algorithm 1, we go one step beyond: once a descent direction D has been computed, we first look for the maximal admissible step size in that direction and check whether the objective value decreases or not. The maximal admissible step size corresponds to a component, say d_\nu, set to zero. If the objective value decreases, d is updated, we set D_\nu = 0 and normalize D to comply with the equality constraint. This procedure is repeated until the objective value stops decreasing.
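As a rough illustration of equations (13) and (14), the gradient and the simplex-constrained descent direction can be computed directly from the signed dual coefficients returned by the SVM solver. The sketch below uses NumPy and assumes that a (the signed coefficients y_i alpha_i restricted to the support vector indices sv) and the base Gram matrices Ks come from a fit such as the one sketched above.

import numpy as np

def mkl_gradient(Ks, sv, a):
    """Gradient of J(d), equation (13): dJ/dd_m = -1/2 a^T K_m[sv, sv] a."""
    return np.array([-0.5 * a @ Km[np.ix_(sv, sv)] @ a for Km in Ks])

def descent_direction(d, grad, tol=1e-12):
    """Reduced-gradient descent direction of equation (14) on the simplex."""
    mu = np.argmax(d)                    # reference component: the largest weight, for stability
    red = grad - grad[mu]                # reduced gradient
    D = -red
    D[(d <= tol) & (red > 0)] = 0.0      # do not push already-zero weights below zero
    D[mu] = 0.0
    D[mu] = -D.sum()                     # keep sum_m d_m constant along the direction
    return D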

Algorithm 1: SimpleMKL algorithm
    set d_m = 1/M for m = 1, ..., M
    while stopping criterion not met do
        compute J(d) using an SVM solver with K = \sum_m d_m K_m
        compute \partial J/\partial d_m for m = 1, ..., M, and the descent direction D of (14)
        set \mu = \arg\max_m d_m,  J' = 0,  d' = d,  D' = D
        while J' < J(d) do   {descent direction update}
            d = d',  D = D'
            \nu = \arg\min_{m | D_m < 0} (-d_m / D_m),   \gamma_{max} = -d_\nu / D_\nu
            d' = d + \gamma_{max} D,   D'_\mu = D_\mu - D_\nu,   D'_\nu = 0
            compute J' using an SVM solver with K = \sum_m d'_m K_m
        end while
        line search along D for \gamma \in [0, \gamma_{max}]   {calls an SVM solver for each trial value of \gamma}
        d \leftarrow d + \gamma D
    end while

At this point, we look for the optimal step size \gamma, which is determined by a one-dimensional line search, with a proper stopping criterion, such as Armijo's rule, to ensure global convergence.

In this algorithm, computing the descent direction and the line search are based on the evaluation of the objective function J(·), which requires solving an SVM problem. This may seem very costly but, for small variations of d, learning is very fast when the SVM solver is initialized with the previous values of \alpha (DeCoste and Wagstaff, 2000). Note that the gradient of the cost function is not computed after each update of the weight vector d. Instead, we take advantage of an easily updated descent direction as long as the objective value decreases. We will see in the numerical experiments that this approach saves a substantial amount of computation time compared to the usual update scheme, where the descent direction is recomputed after each update of d. Note that we have also investigated gradient projection algorithms (Bertsekas, 1999, Chap. 2.3), but this turned out to be slightly less efficient than the proposed approach, and we do not report these results.

The algorithm terminates when a stopping criterion is met. This stopping criterion can be based either on the duality gap, on the KKT conditions, on the variation of d between two consecutive steps or, even more simply, on a maximal number of iterations. Our implementation, based on the duality gap, is detailed in the forthcoming section.
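To fix ideas, here is a compact sketch of the outer loop, reusing the svm_objective, mkl_gradient and descent_direction helpers sketched above. It collapses Algorithm 1's direction-update inner loop and line search into a single backtracking search over gamma in (0, gamma_max], so it is a simplification of, not a substitute for, the procedure described in this section.

import numpy as np

def simple_mkl(Ks, y, C=100.0, d_init=None, max_iter=100, tol=1e-3):
    """Simplified SimpleMKL outer loop: reduced gradient descent on the kernel weights d."""
    M = len(Ks)
    d = np.full(M, 1.0 / M) if d_init is None else d_init.copy()
    J, svm = svm_objective(Ks, d, y, C)
    for _ in range(max_iter):
        grad = mkl_gradient(Ks, svm.support_, svm.dual_coef_[0])
        D = descent_direction(d, grad)
        if np.abs(D).max() < tol:                       # (near-)stationary point on the simplex
            break
        neg = D < -1e-12                                # components that decrease along D
        gamma_max = np.min(d[neg] / -D[neg]) if neg.any() else 1.0
        gamma, improved = gamma_max, False
        while gamma > 1e-8 * gamma_max:                 # simple backtracking on the step size
            d_new = np.clip(d + gamma * D, 0.0, None)
            d_new /= d_new.sum()                        # re-project onto the simplex
            J_new, svm_new = svm_objective(Ks, d_new, y, C)
            if J_new < J:
                d, J, svm, improved = d_new, J_new, svm_new, True
                break
            gamma *= 0.5
        if not improved:
            break
    return d, svm

In practice, the efficiency reported in Section 5 also relies on warm-starting the SVM solver with the previous alpha, which scikit-learn's SVC does not expose; the paper assumes a dedicated solver such as SimpleSVM for that purpose.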

3.3 Optimality conditions

In a convex constrained optimization algorithm such as the one we are considering, we have the opportunity to check for proper optimality conditions such as the KKT conditions or the duality gap (the difference between primal and dual objective values), which should be zero at the optimum. From the primal and dual objectives provided respectively in (2) and (8), the MKL duality gap is

    DualGap = J(d^\star) - \sum_i \alpha_i^\star + \frac{1}{2}\max_m \sum_{i,j} \alpha_i^\star \alpha_j^\star y_i y_j K_m(x_i, x_j),

where d^\star and \{\alpha_i^\star\} are optimal primal and dual variables, and J(d^\star) depends implicitly on the optimal primal variables \{f_m^\star\}, b^\star and \{\xi_i^\star\}. If J(d^\star) has been obtained through the dual problem (11), then this MKL duality gap can also be computed from the duality gap DG_{SVM} of the single kernel SVM algorithm. Indeed, equation (12) holds only when the single kernel SVM algorithm returns an exact solution with DG_{SVM} = 0. Otherwise, we have

    DG_{SVM} = J(d^\star) + \frac{1}{2}\sum_{i,j} \alpha_i^\star \alpha_j^\star y_i y_j \sum_m d_m^\star K_m(x_i, x_j) - \sum_i \alpha_i^\star,

and the MKL duality gap becomes

    DualGap = DG_{SVM} - \frac{1}{2}\sum_{i,j} \alpha_i^\star \alpha_j^\star y_i y_j \sum_m d_m^\star K_m(x_i, x_j) + \frac{1}{2}\max_m \sum_{i,j} \alpha_i^\star \alpha_j^\star y_i y_j K_m(x_i, x_j).

Hence, it can be obtained at a small additional computational cost compared to the SVM duality gap. In iterative procedures, it is common to stop the algorithm when the optimality conditions are respected up to a tolerance threshold \varepsilon. Obviously, SimpleMKL has no impact on DG_{SVM}; hence, one may assume, as we did here, that DG_{SVM} need not be monitored. Consequently, we terminate the algorithm when

    \frac{1}{2}\max_m \sum_{i,j} \alpha_i^\star \alpha_j^\star y_i y_j K_m(x_i, x_j) - \frac{1}{2}\sum_{i,j} \alpha_i^\star \alpha_j^\star y_i y_j \sum_m d_m K_m(x_i, x_j) \leq \varepsilon.        (15)
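With the per-kernel quadratic terms at hand, the stopping test (15) takes a couple of lines. A sketch, assuming as before that a holds the signed coefficients y_i alpha_i* on the support vector indices sv and Ks the base Gram matrices:

import numpy as np

def mkl_duality_gap_term(Ks, d, sv, a):
    """Left-hand side of the termination criterion (15)."""
    # alpha^T Y K_m Y alpha, one value per base kernel
    quad = np.array([a @ Km[np.ix_(sv, sv)] @ a for Km in Ks])
    return 0.5 * (quad.max() - d @ quad)

# terminate when mkl_duality_gap_term(Ks, d, sv, a) <= eps, e.g. eps = 0.01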

For some of the other MKL algorithms that will be presented in Section 4, the dual function may be more difficult to derive. Hence, it may be easier to rely on approximate KKT conditions as a stopping criterion. For the general MKL problem (9), the first-order optimality conditions are obtained through the KKT conditions:

    \frac{\partial J}{\partial d_m} + \lambda - \eta_m = 0,   \forall m,
    \eta_m d_m = 0,   \forall m,

where \lambda and \{\eta_m\} are respectively the Lagrange multipliers for the equality and inequality constraints of (9). These KKT conditions imply

    \frac{\partial J}{\partial d_m} = -\lambda   if d_m > 0,
    \frac{\partial J}{\partial d_m} \geq -\lambda   if d_m = 0.

However, as Algorithm 1 is not based on the Lagrangian formulation of problem (9), \lambda is not computed. Hence, we derive approximate necessary optimality conditions to be used as a termination criterion. Let us define dJ_{min} and dJ_{max} as

    dJ_{min} = \min_{\{m | d_m > 0\}} \frac{\partial J}{\partial d_m}   and   dJ_{max} = \max_{\{m | d_m > 0\}} \frac{\partial J}{\partial d_m};

then, the necessary optimality conditions are approximated by the following termination conditions:

    |dJ_{min} - dJ_{max}| \leq \varepsilon   and   \frac{\partial J}{\partial d_m} \geq dJ_{max}   if d_m = 0.

In other words, we consider that we are at the optimum when the gradient components for all positive d_m lie in an \varepsilon-tube and all gradient components for vanishing d_m are outside this tube. Note that these approximate necessary optimality conditions are available right away for any differentiable objective function J(d).

3.4 Cutting Planes, Steepest Descent and Computational Complexity

As we stated in the introduction, several algorithms have been proposed for solving the original MKL problem defined by Lanckriet et al. (2004b). All these algorithms are based on equivalent formulations of the same dual problem; they all aim at providing a pair of optimal vectors (d, \alpha). In this subsection, we contrast SimpleMKL with its closest relative, the SILP algorithm of Sonnenburg et al. (2005, 2006). From an implementation point of view, the two algorithms are alike, since both wrap a standard single kernel SVM algorithm. This feature makes both algorithms very easy to implement. They differ, however, in computational efficiency, because the kernel weights d_m are optimized in quite different ways, as detailed below.

Let us first recall that our differentiable function J(d) is defined as

    J(d) = \max_{\alpha}  -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \sum_m d_m K_m(x_i, x_j) + \sum_i \alpha_i
           s.t.  \sum_i \alpha_i y_i = 0,   0 \leq \alpha_i \leq C,   \forall i,

and both algorithms aim at minimizing this differentiable function. However, using a SILP approach in this case does not take advantage of the smoothness of the objective function. The SILP algorithm of Sonnenburg et al. (2006) is a cutting plane method to minimize J with respect to d. For each value of d, the best \alpha^\star is found and leads to an affine lower bound on J(d). The number of lower-bounding affine functions increases as more (d, \alpha) pairs are computed, and the next candidate vector d is the minimizer of the current lower bound on J(d), that is, of the maximum over all the affine functions. Cutting planes methods do converge, but they are known for their instability, notably when the number of lower-bounding affine functions is small: the approximation of the objective function is then loose and the iterates may oscillate (Bonnans et al., 2003). Our steepest descent approach, with the proposed line search, does not suffer from instability since we have a differentiable function to minimize. Figure 1 illustrates the behaviour of both algorithms in a simple case, with oscillations for cutting planes and direct convergence for gradient descent. Section 5 evaluates how these oscillations impact the computational time of the SILP algorithm on several examples. These experiments show that our algorithm needs fewer costly gradient computations. Conversely, the line search in the gradient-based approach requires more SVM retrainings in the process of querying the objective function.

However, the computation time per SVM training is considerably reduced, since the gradient-based approach produces estimates of d on a smooth trajectory, so that the previous SVM solution provides a good guess for the current SVM training. In SILP, with the oscillatory subsequent approximations of d, the benefit of warm-start training severely decreases.

Figure 1: Illustration of three iterations of the SILP algorithm and of a gradient descent algorithm on a one-dimensional problem. This dimensionality is not representative of the MKL framework, but our aim is to illustrate the typical oscillations of cutting planes around the optimal solution (with iterates d_0 to d_3). Note that computing an affine lower bound at a given d requires a gradient computation. Provided the step size is chosen correctly, gradient descent converges directly towards the optimal solution without overshooting (from d_0 to d^\star).

3.5 Convergence Analysis

In this paragraph, we briefly discuss the convergence of the algorithm we propose. We first suppose that problem (10) is always exactly solved, which means that the duality gap of this problem is 0. Under such conditions, the gradient computation in (13) is exact, and thus our algorithm performs reduced gradient descent on a continuously differentiable function J(·) (remember that we have assumed that the kernel matrices are positive definite) defined on the simplex {d | \sum_m d_m = 1, d_m \geq 0}, which does converge to the global minimum of J (Luenberger, 1984). However, in practice, problem (10) is not solved exactly, since most SVM algorithms stop when the duality gap is smaller than a given \varepsilon. In this case, the convergence of our projected gradient method is no longer guaranteed by standard arguments. Indeed, the output of the approximately solved SVM leads only to an \varepsilon-subgradient (Bonnans et al., 2003; Bach et al., 2004a). This situation is more difficult to analyze, and we plan to address it thoroughly in future work (see for instance d'Aspremont (2008) for an example of such an analysis in a similar context).

4. Extensions

In this section, we discuss how the proposed algorithm can be simply extended to other SVM algorithms such as SVM regression, one-class SVM or pairwise multiclass SVM algorithms. More generally, we will discuss other loss functions that can be used within our MKL algorithms.

4.1 Extensions to other SVM Algorithms

The algorithm described in the previous section focuses on binary classification SVMs, but it is worth noting that our MKL algorithm can be extended to other SVM algorithms with only little change. For SVM regression with the \varepsilon-insensitive loss, or clustering with the one-class soft margin loss, the problem only changes in the definition of the objective function J(d) in (10). For SVM regression (Vapnik et al., 1997; Schölkopf and Smola, 2001), we have

    J(d) = \min_{\{f_m\},b,\xi}  \frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|_{H_m}^2 + C \sum_i (\xi_i + \xi_i^*)
           s.t.  y_i - \sum_m f_m(x_i) - b \leq \varepsilon + \xi_i,   \forall i
                 \sum_m f_m(x_i) + b - y_i \leq \varepsilon + \xi_i^*,   \forall i        (16)
                 \xi_i \geq 0,   \xi_i^* \geq 0,   \forall i,

and for one-class SVMs (Schölkopf and Smola, 2001), we have

    J(d) = \min_{\{f_m\},b,\xi}  \frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|_{H_m}^2 + \frac{1}{\nu l}\sum_i \xi_i - b
           s.t.  \sum_m f_m(x_i) \geq b - \xi_i,   \xi_i \geq 0,   \forall i.        (17)

Again, J(d) can be defined according to the dual functions of these two optimization problems, which are respectively

    J(d) = \max_{\alpha,\beta}  \sum_i (\beta_i - \alpha_i) y_i - \varepsilon \sum_i (\beta_i + \alpha_i) - \frac{1}{2}\sum_{i,j} (\beta_i - \alpha_i)(\beta_j - \alpha_j) \sum_m d_m K_m(x_i, x_j)
           s.t.  \sum_i (\beta_i - \alpha_i) = 0,   0 \leq \alpha_i, \beta_i \leq C,   \forall i,        (18)

and

    J(d) = \max_{\alpha}  -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j \sum_m d_m K_m(x_i, x_j)
           s.t.  0 \leq \alpha_i \leq \frac{1}{\nu l},   \sum_i \alpha_i = 1,        (19)

where \{\alpha_i\} and \{\beta_i\} are Lagrange multipliers.
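As an illustration of the regression case, the objective (18) and its gradient (given as equation (20) below) can be evaluated with an epsilon-SVR solver in the same way as for classification. The sketch below uses scikit-learn's SVR, whose dual_coef_ holds the signed differences beta_i - alpha_i on the support vectors; it relies on the fact that, at the optimum, alpha_i * beta_i = 0, so that beta_i + alpha_i = |beta_i - alpha_i|.

import numpy as np
from sklearn.svm import SVR

def svr_objective_and_grad(Ks, d, y, C, eps):
    """J(d) for the epsilon-SVR formulation (18) and its gradient (equation 20),
    with c_i = beta_i - alpha_i taken from the fitted dual coefficients."""
    K = np.tensordot(d, Ks, axes=1)
    svr = SVR(C=C, epsilon=eps, kernel="precomputed").fit(K, y)
    sv, c = svr.support_, svr.dual_coef_[0]
    # at the optimum, beta_i + alpha_i = |beta_i - alpha_i| by complementarity
    J = c @ y[sv] - eps * np.abs(c).sum() - 0.5 * c @ K[np.ix_(sv, sv)] @ c
    grad = np.array([-0.5 * c @ Km[np.ix_(sv, sv)] @ c for Km in Ks])
    return J, grad

The sign convention of dual_coef_ is irrelevant for the gradient, since only quadratic forms in c appear there.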

Then, as long as J(d) is differentiable, a property strictly related to the strict concavity of its dual function, our descent algorithm can still be applied. The main effort for the extension of our algorithm is the evaluation of J(d) and the computation of its derivatives. As for the binary classification SVM, J(d) can be computed by means of efficient off-the-shelf SVM solvers, and the gradient of J(d) is easily obtained through the dual problems. For SVM regression, we have

    \frac{\partial J}{\partial d_m} = -\frac{1}{2}\sum_{i,j} (\beta_i^\star - \alpha_i^\star)(\beta_j^\star - \alpha_j^\star) K_m(x_i, x_j),        (20)

and for one-class SVMs, we have

    \frac{\partial J}{\partial d_m} = -\frac{1}{2}\sum_{i,j} \alpha_i^\star \alpha_j^\star K_m(x_i, x_j),        (21)

where \alpha^\star and \beta^\star are the optimal values of the Lagrange multipliers. These examples illustrate that extending SimpleMKL to other SVM problems is rather straightforward. This observation also holds for other SVM algorithms (based for instance on the \nu parameter, a squared hinge loss or a squared-\varepsilon tube) that we do not detail here. Again, our algorithm can be used provided J(d) is differentiable, by plugging into the algorithm the function that evaluates the objective value J(d) and its gradient. Of course, the duality gap may be considered as a stopping criterion if it can be computed.

4.2 Multiclass Multiple Kernel Learning

With SVMs, multiclass problems are customarily solved by combining several binary classifiers. The well-known one-against-all and one-against-one approaches are the two most common ways of building a multiclass decision function based on pairwise decision functions. Multiclass SVMs may also be defined right away as the solution of a global optimization problem (Weston and Watkins, 1999; Crammer and Singer, 2001), which may also be addressed with structured-output SVMs (Tsochantaridis et al., 2005). Very recently, an MKL algorithm based on structured-output SVMs has been proposed by Zien and Ong (2007). This work extends the work of Sonnenburg et al. (2006) to multiclass problems, with an MKL implementation still based on a QCQP or SILP approach. Several works have compared the performance of multiclass SVM algorithms (Duan and Keerthi, 2005; Hsu and Lin, 2002; Rifkin and Klautau, 2004). In this subsection, we do not deal with this aspect; we explain how SimpleMKL can be extended to pairwise SVM multiclass implementations. The problem of applying our algorithm to structured-output SVMs will be briefly discussed later.

Suppose we have a multiclass problem with P classes. For a one-against-all multiclass SVM, we need to train P binary SVM classifiers, where the p-th classifier is trained by considering all examples of class p as positive examples while all other examples are considered negative. For a one-against-one multiclass problem, P(P-1)/2 binary SVM classifiers are built from all pairs of distinct classes. Our multiclass MKL extension of SimpleMKL differs from the binary version only in the definition of a new cost function J(d).

As we now look for the combination of kernels that jointly optimizes all the pairwise decision functions, the objective function we want to optimize with respect to the kernel weights {d_m} is

    J(d) = \sum_{p \in P} J_p(d),

where P is the set of all pairs to be considered, and J_p(d) is the binary SVM objective value for the classification problem pertaining to pair p. Once this new objective function is defined, the lines of Algorithm 1 still apply. The gradient of J(d) is still very simple to obtain since, owing to linearity, we have

    \frac{\partial J}{\partial d_m} = -\frac{1}{2}\sum_{p \in P}\sum_{i,j} \alpha_{i,p}^\star \alpha_{j,p}^\star y_i y_j K_m(x_i, x_j),        (22)

where \alpha_{j,p}^\star is the Lagrange multiplier of the j-th example involved in the p-th decision function. Note that those Lagrange multipliers can be obtained independently for each pair.

The approach described above aims at finding the combination of kernels that jointly optimizes all binary classification problems: this single set of features should maximize the sum of margins. Another possible and straightforward approach consists in running SimpleMKL independently for each classification task. However, this choice is likely to result in as many combinations of kernels as there are binary classifiers.
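A sketch of the pairwise (one-against-one) objective and of the gradient in equation (22): each pair contributes its own binary SVM solved on the same combined kernel, and the per-kernel gradients simply add up. This again uses scikit-learn's SVC as a stand-in for the SVM solver; y is assumed to hold integer class labels.

import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def multiclass_objective_and_grad(Ks, d, y, C):
    """Sum of the pairwise binary SVM objectives J_p(d) and the gradient of equation (22)."""
    K = np.tensordot(d, Ks, axes=1)
    J, grad = 0.0, np.zeros(len(Ks))
    for p, q in combinations(np.unique(y), 2):          # one-against-one pairs
        idx = np.where((y == p) | (y == q))[0]
        yb = np.where(y[idx] == p, 1, -1)
        svm = SVC(C=C, kernel="precomputed").fit(K[np.ix_(idx, idx)], yb)
        sv = idx[svm.support_]                          # back to global example indices
        a = svm.dual_coef_[0]                           # y_i * alpha_i for this pair
        J += np.abs(a).sum() - 0.5 * a @ K[np.ix_(sv, sv)] @ a
        grad += np.array([-0.5 * a @ Km[np.ix_(sv, sv)] @ a for Km in Ks])
    return J, grad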

4.3 Other loss functions

Multiple kernel learning has attracted a great deal of interest, and since the seminal work of Lanckriet et al. (2004b), several works on this topic have flourished. For instance, multiple kernel learning has been transposed to least-squares fitting and logistic regression (Bach et al., 2004b). Independently, several authors have applied mixed-norm regularization, such as the additive spline regression model of Grandvalet and Canu (1999). This type of regularization, which is now known as the group lasso, may be seen as a linear version of multiple kernel learning (Bach, 2008). Several algorithms have been proposed for solving the group lasso problem. Some of them are based on projected gradient or on coordinate descent algorithms. However, they all consider the non-smooth version of the problem.

We previously mentioned that Zien and Ong (2007) have proposed an MKL algorithm based on structured-output SVMs. For such a problem, the loss function, which differs from the usual SVM hinge loss, leads to an algorithm based on cutting planes instead of the usual QP approach. Provided the gradient of the objective value can be obtained, our algorithm can be applied to the group lasso and to structured-output SVMs. The key point is whether the theorem of Bonnans et al. (2003) can be applied or not. Although we have not deeply investigated this point, we think that many problems comply with this requirement, but we leave these developments for future work.

4.4 Approximate regularization path

SimpleMKL requires setting the usual SVM hyperparameter C, which usually needs to be tuned for the problem at hand. A practical and useful technique for doing so is to compute the so-called regularization path, which describes the set of solutions as C varies from 0 to infinity. Exact path following techniques have been derived for some specific problems like SVMs or the lasso (Hastie et al., 2004; Efron et al., 2004). Besides, regularization paths can be sampled by predictor-corrector methods (Rosset, 2004; Bach et al., 2004b).

For model selection purposes, an approximation of the regularization path may be sufficient. This approach has been applied for instance by Koh et al. (2007) in regularized logistic regression. Here, we compute an approximate regularization path based on a warm-start technique. Suppose that, for a given value of C, we have computed the optimal pair (d^\star, \alpha^\star); the idea of a warm start is to use this solution to initialize another MKL problem with a different value of C. In our case, we iteratively compute the solutions for decreasing values of C (note that \alpha^\star has to be modified to be a feasible initialization of the more constrained SVM problem).
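A sketch of the warm-started approximate path, assuming a simple_mkl routine like the one sketched after Algorithm 1 that accepts an initial weight vector (d_init). Only the kernel weights are warm-started here; reusing the previous alpha additionally requires an SVM solver that accepts an initial dual point.

import numpy as np

def approximate_regularization_path(Ks, y, C_values):
    """Solve MKL for decreasing C values, warm-starting each problem at the previous weights."""
    d, path = None, []
    for C in sorted(C_values, reverse=True):            # from the largest C downwards
        d, _ = simple_mkl(Ks, y, C=C, d_init=d)
        path.append((C, d.copy()))
    return path

# e.g. C_values sampled on a logarithmic grid (an illustrative range, not the one used in the paper):
# C_values = np.logspace(3, -2, 20)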

5. Numerical experiments

In this experimental section, we essentially aim at illustrating three points. The first point is to show that our gradient descent algorithm is efficient. This is achieved through binary classification experiments, where SimpleMKL is compared to the SILP approach of Sonnenburg et al. (2006). Then, we illustrate the usefulness of a multiple kernel learning approach in the context of regression. The examples we use are based on wavelet-based regression, in which the multiple kernel learning framework fits naturally. The final experiment aims at evaluating the multiple kernel approach in a model selection problem for some multiclass problems.

5.1 Computation time

The aim of this first set of experiments is to assess the running times of SimpleMKL. (All the experiments have been run on a Pentium D 3 GHz with 3 GB of RAM.) First, we compare with SILP regarding the time required for computing a single solution of MKL with a given C hyperparameter. Then, we compute an approximate regularization path by varying the C values. We finally provide hints on the expected complexity of SimpleMKL, by measuring the growth of running time as the number of examples or kernels increases.

5.1.1 Time needed for reaching a single solution

In this first benchmark, we put SimpleMKL and SILP side by side, for a fixed value of the hyperparameter C (C = 100). This procedure, which does not involve a proper model selection step, is not representative of the typical use of SVMs. It is, however, relevant for the purpose of comparing algorithmic issues. The evaluation is made on five datasets from the UCI repository: Liver, Wpbc, Ionosphere, Pima, Sonar (Blake and Merz, 1998). The candidate kernels are: Gaussian kernels with 10 different bandwidths \sigma, on all variables and on each single variable; and polynomial kernels of degree 1 to 3, again on all variables and on each single variable. All kernel matrices have been normalized to unit trace, and are precomputed prior to running the algorithms.
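For concreteness, a candidate kernel bank of the kind just described could be assembled as follows. The bandwidth and degree grids and the inhomogeneous polynomial form are illustrative placeholders, not the exact settings used in the experiments; each Gram matrix is normalized to unit trace before being handed to the MKL solver.

import numpy as np

def gaussian_gram(X, sigma):
    """Gaussian Gram matrix on the given columns of the training data."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def candidate_kernels(X, sigmas=(0.5, 1, 2, 5, 10, 15, 20), degrees=(1, 2, 3)):
    """Gaussian kernels (several bandwidths) and polynomial kernels (degrees 1 to 3),
    computed on all variables and on each single variable, all unit-trace normalized."""
    n, p = X.shape
    subsets = [list(range(p))] + [[j] for j in range(p)]
    Ks = []
    for s in subsets:
        Xs = X[:, s]
        for sigma in sigmas:
            Ks.append(gaussian_gram(Xs, sigma))
        for deg in degrees:
            Ks.append((Xs @ Xs.T + 1.0) ** deg)
    return np.array([K / np.trace(K) for K in Ks])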

Both SimpleMKL and SILP wrap an SVM dual solver based on SimpleSVM, an active constraints method written in Matlab (Canu et al., 2003). The descent procedure of SimpleMKL is also implemented in Matlab, whereas the linear programming involved in SILP is implemented with the publicly available toolbox LPSOLVE (Berkelaar et al., 2004). For a fair comparison, we use the same stopping criterion for both algorithms: they halt when either the duality gap is lower than 0.01, or the number of iterations exceeds 2000. Quantitatively, the displayed results differ from the preliminary version of this work, where the stopping criterion was based on the stabilization of the weights, but they are qualitatively similar (Rakotomamonjy et al., 2007).

For each dataset, the algorithms were run 20 times with different train and test sets (70% of the examples for training and 30% for testing). Training examples were normalized to zero mean and unit variance. In Table 1, we report different performance measures: accuracy, number of selected kernels and running time. As the latter is mainly spent in querying the SVM solver and in computing the gradient of J with respect to d, the number of calls to these two routines is also reported.

Both algorithms are nearly identical in prediction accuracy. Their numbers of selected kernels are of the same magnitude, although SimpleMKL tends to select 10 to 20% more kernels. As both algorithms address the same convex optimization problem, with convergent methods starting from the same initialization, the observed differences are only due to the inaccuracy of the solution when the stopping criterion is met. Hence, the trajectories followed by each algorithm to reach the solution, detailed in Section 3.4, explain the differences in the number of selected kernels. The updates of d based on the descent algorithm of SimpleMKL are rather conservative (small steps departing from 1/M for all d_m), whereas the oscillations of cutting planes are likely to favor extreme solutions, hitting the edges of the simplex. This explanation is corroborated by Figure 2, which compares the behavior of the d_m coefficients through time. The instability of SILP is clearly visible, with very high oscillations in the first iterations and a noticeable residual noise in the long run. In comparison, the trajectories for SimpleMKL are much smoother.

If we now look at the overall difference in computation time reported in Table 1, clearly, on all data sets, SimpleMKL is faster than SILP, with an average gain factor of about 5. Furthermore, the larger the number of kernels, the larger the speed gain we achieve. Looking at the last column of Table 1, we see that the main reason for the improvement is that SimpleMKL converges in fewer iterations (that is, gradient computations). It may seem surprising that this gain is not counterbalanced by the fact that SimpleMKL requires many more calls to the SVM solver (on average, about 4 times more). As we stated in Section 3.4, when the number of kernels is large, computing the gradient may be expensive compared to SVM retraining with warm-start techniques.

Table 1: Average performance measures for the two MKL algorithms and a plain gradient descent algorithm, on Liver (l = 241, M = 91), Pima (l = 538, M = 117), Ionosphere (l = 246, M = 442), Wpbc (l = 136, M = 442) and Sonar (l = 146, M = 793). For each dataset, rows SILP, SimpleMKL and Grad. Desc. report the number of selected kernels, accuracy, time (s), number of SVM evaluations and number of gradient evaluations (mean ± standard deviation over the 20 runs).

Figure 2: Evolution of the five largest weights d_m for SimpleMKL and SILP; left: Pima; right: Ionosphere.

To understand why, with this large number of calls to the SVM solver, SimpleMKL is still much faster than SILP, we have to look back at Figure 2. On the one hand, the large variations in subsequent values of d for SILP entail that subsequent SVM problems are not likely to have similar solutions: a warm-start call to the SVM solver does not help much. On the other hand, with the smooth trajectories of d in SimpleMKL, the previous SVM solution is often a good guess for the current problem: a warm-start call to the SVM solver results in much less computation than a call from scratch.

Table 1 also shows the results obtained when replacing the update scheme described in Algorithm 1 by a usual reduced gradient update, which, at each iteration, modifies d by computing the optimal step size along the descent direction D of (14). The training of this variant is considerably slower than SimpleMKL and only slightly better than SILP. We see that the gradient descent updates require many more calls to the SVM solver and a number of gradient computations comparable with SILP. Note that, compared to SILP, the numerous additional calls to the SVM solver do not have a drastic effect on running time: the gradient updates are stable, so that they can benefit from warm starts, contrary to SILP.

To end this first series of experiments, Figure 3 depicts the evolution of the objective function for the data sets that were used in Figure 2. Besides the fact that SILP needs more iterations to achieve a good approximation of the final solution, it is worth noting that the objective values rapidly reach their steady state while still being far from convergence, when the d_m values are far from being settled. Thus, monitoring objective values is not suitable to assess convergence.

Figure 3: Evolution of the objective values for SimpleMKL and SILP; left: Pima; right: Ionosphere.

5.1.2 Time needed for getting an approximate regularization path

In practice, the optimal value of C is unknown, and one has to solve several SVM problems, spanning a wide range of C values, before choosing a solution according to some model selection criterion like the cross-validation error. Here, we further pursue the comparison of the running times of SimpleMKL and SILP, in a series of experiments that include the search for a sensible value of C.

In this new benchmark, we use the same data sets as in the previous experiments, with the same kernel settings. The task is only changed in that we now evaluate the running times needed by both algorithms to compute an approximate regularization path. For both algorithms, we use a simple warm-start technique, which consists in using the optimal solutions {d_m^\star} and {\alpha_i^\star} obtained for a given C to initialize a new MKL problem with C + \Delta C (DeCoste and Wagstaff, 2000). As described in Section 4.4, we start from the largest C and then approximate the regularization path by decreasing its value. The set of C values is obtained by evenly sampling this range on a logarithmic scale.

Figure 4 shows the variations of the number of selected kernels and of the values of d_m along the regularization path for the Pima and Wpbc datasets. The number of kernels is not a monotone function of C: for small values of C, the number of kernels is somewhat constant, then it rises rapidly. There is a small overshoot before reaching a plateau corresponding to very high values of C. This trend is similar for the number of leading terms in the kernel weight vector d. Both phenomena were observed consistently over the datasets we used.

Table 2 displays the average computation time (over 10 runs) required for building the approximate regularization path. As previously, SimpleMKL is more efficient than SILP, with a gain factor increasing with the number of kernels in the combination. The range of gain factors, from 5.9 to 23, is even more impressive than in the previous benchmark. SimpleMKL benefits from the continuity of solutions along the regularization path, whereas SILP does not take advantage of warm starts. Even provided with a good initialization, it needs many cutting planes to stabilize.

Figure 4: Regularization paths of the d_m and of the number of selected kernels versus C; left: Pima; right: Wpbc.

Table 2: Average computation time (in seconds) for getting an approximate regularization path, for the Liver, Pima, Ionosphere, Wpbc and Sonar datasets (columns: SimpleMKL, SILP, Ratio). For the Sonar data set, SILP was extremely slow, so that the regularization path was computed only once.

5.1.3 More on SimpleMKL running times

Here, we provide an empirical assessment of the expected complexity of SimpleMKL on different data sets from the UCI repository. We first look at the situation where kernel matrices can be pre-computed and stored in memory, before reporting experiments where the memory requirements are too high, leading to repeated kernel evaluations.

In a first set of experiments, we use Gaussian kernels, computed on random subsets of variables and with random widths. These kernels are precomputed and stored in memory, and we report the average CPU running times obtained from 20 runs differing in the random draw of training examples. The stopping criterion is the same as in the previous section: a relative duality gap of less than \varepsilon = 0.01.

The first two rows of Figure 5 depict the growth of computation time as the number of kernels increases. We observe a nearly linear trend for the four learning problems. This growth rate could be expected considering the linear convergence property of gradient techniques, but the absence of overhead is valuable. The last row of Figure 5 depicts the growth of computation time as the number of examples increases; here, the number of kernels is kept fixed. In these plots, the observed trend is clearly superlinear. Again, this trend could be expected, considering that SVM expected training times are superlinear in the number of training examples. As we already mentioned, the complexity of SimpleMKL is tightly linked to that of SVM training (for some examples of single kernel SVM running times, one can refer to the work of Loosli and Canu (2007)).

When all the kernels used for MKL cannot be stored in memory, one can resort to a decomposition method. Table 3 reports the average computation times, over 10 runs, in this more difficult situation. The large-scale SVM scheme of Joachims (1999) has been implemented, with basis kernels recomputed whenever needed. This approach is computationally expensive but comes with no memory limit. For these experiments, the stopping criterion is based on the variation of the weights d_m: as shown in Figure 2, the kernel weights rapidly reach a steady state, and many iterations are spent fine-tuning the weights to reach the duality gap tolerance. Here, we trade the optimality guarantees provided by the duality gap for substantial computational time savings; the algorithm terminates when the variation of the kernel weights is lower than 0.01. The results reported in Table 3 simply aim at showing that medium and large-scale situations can be handled by SimpleMKL. Note that Sonnenburg et al. (2006) have run a modified version of their SILP algorithm on larger scale datasets. However, for those experiments, they took advantage of some specific feature map properties; as they stated, for general cases where kernel matrices are dense, they have to rely on the SILP algorithm we used in this section for the efficiency comparison.

5.2 Multiple kernel regression examples

Several research papers have already claimed that using multiple kernel learning can lead to better generalization performance in some classification problems (Lanckriet et al., 2004a; Zien and Ong, 2007; Harchaoui and Bach, 2007). This next experiment aims at illustrating this point, but in the context of regression. The problem we deal with is a classical univariate regression problem where the design points are irregular (D'Amato et al., 2003).

Figure 5: SimpleMKL average computation times for different datasets (Credit, Yeast, Spamdata, Optdigits); top two rows: number of training examples fixed, number of kernels varying; bottom row (Spamdata, Optdigits): number of training examples varying, number of kernels fixed.

Table 3: Average computation time needed by SimpleMKL using decomposition methods, for the Yeast and Spamdata datasets (columns: number of examples, number of kernels, accuracy (%), time (s)).


More information

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z )

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z ) C4B Machne Learnng Answers II.(a) Show that for the logstc sgmod functon dσ(z) dz = σ(z) ( σ(z)) A. Zsserman, Hlary Term 20 Start from the defnton of σ(z) Note that Then σ(z) = σ = dσ(z) dz = + e z e z

More information

Chapter 12 Lyes KADEM [Thermodynamics II] 2007

Chapter 12 Lyes KADEM [Thermodynamics II] 2007 Chapter 2 Lyes KDEM [Therodynacs II] 2007 Gas Mxtures In ths chapter we wll develop ethods for deternng therodynac propertes of a xture n order to apply the frst law to systes nvolvng xtures. Ths wll be

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

Scattering by a perfectly conducting infinite cylinder

Scattering by a perfectly conducting infinite cylinder Scatterng by a perfectly conductng nfnte cylnder Reeber that ths s the full soluton everywhere. We are actually nterested n the scatterng n the far feld lt. We agan use the asyptotc relatonshp exp exp

More information

PROBABILITY AND STATISTICS Vol. III - Analysis of Variance and Analysis of Covariance - V. Nollau ANALYSIS OF VARIANCE AND ANALYSIS OF COVARIANCE

PROBABILITY AND STATISTICS Vol. III - Analysis of Variance and Analysis of Covariance - V. Nollau ANALYSIS OF VARIANCE AND ANALYSIS OF COVARIANCE ANALYSIS OF VARIANCE AND ANALYSIS OF COVARIANCE V. Nollau Insttute of Matheatcal Stochastcs, Techncal Unversty of Dresden, Gerany Keywords: Analyss of varance, least squares ethod, odels wth fxed effects,

More information

Fall 2012 Analysis of Experimental Measurements B. Eisenstein/rev. S. Errede. ) with a symmetric Pcovariance matrix of the y( x ) measurements V

Fall 2012 Analysis of Experimental Measurements B. Eisenstein/rev. S. Errede. ) with a symmetric Pcovariance matrix of the y( x ) measurements V Fall Analyss o Experental Measureents B Esensten/rev S Errede General Least Squares wth General Constrants: Suppose we have easureents y( x ( y( x, y( x,, y( x wth a syetrc covarance atrx o the y( x easureents

More information

1 Convex Optimization

1 Convex Optimization Convex Optmzaton We wll consder convex optmzaton problems. Namely, mnmzaton problems where the objectve s convex (we assume no constrants for now). Such problems often arse n machne learnng. For example,

More information

CSC 411 / CSC D11 / CSC C11

CSC 411 / CSC D11 / CSC C11 18 Boostng s a general strategy for learnng classfers by combnng smpler ones. The dea of boostng s to take a weak classfer that s, any classfer that wll do at least slghtly better than chance and use t

More information

Support Vector Machines. Vibhav Gogate The University of Texas at dallas

Support Vector Machines. Vibhav Gogate The University of Texas at dallas Support Vector Machnes Vbhav Gogate he Unversty of exas at dallas What We have Learned So Far? 1. Decson rees. Naïve Bayes 3. Lnear Regresson 4. Logstc Regresson 5. Perceptron 6. Neural networks 7. K-Nearest

More information

Which Separator? Spring 1

Which Separator? Spring 1 Whch Separator? 6.034 - Sprng 1 Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng 3 Margn of a pont " # y (w $ + b) proportonal

More information

An Optimal Bound for Sum of Square Roots of Special Type of Integers

An Optimal Bound for Sum of Square Roots of Special Type of Integers The Sxth Internatonal Syposu on Operatons Research and Its Applcatons ISORA 06 Xnang, Chna, August 8 12, 2006 Copyrght 2006 ORSC & APORC pp. 206 211 An Optal Bound for Su of Square Roots of Specal Type

More information

Denote the function derivatives f(x) in given points. x a b. Using relationships (1.2), polynomials (1.1) are written in the form

Denote the function derivatives f(x) in given points. x a b. Using relationships (1.2), polynomials (1.1) are written in the form SET OF METHODS FO SOUTION THE AUHY POBEM FO STIFF SYSTEMS OF ODINAY DIFFEENTIA EUATIONS AF atypov and YuV Nulchev Insttute of Theoretcal and Appled Mechancs SB AS 639 Novosbrs ussa Introducton A constructon

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information

Online Classification: Perceptron and Winnow

Online Classification: Perceptron and Winnow E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng

More information

Structure and Drive Paul A. Jensen Copyright July 20, 2003

Structure and Drive Paul A. Jensen Copyright July 20, 2003 Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.

More information

Difference Equations

Difference Equations Dfference Equatons c Jan Vrbk 1 Bascs Suppose a sequence of numbers, say a 0,a 1,a,a 3,... s defned by a certan general relatonshp between, say, three consecutve values of the sequence, e.g. a + +3a +1

More information

Solutions to exam in SF1811 Optimization, Jan 14, 2015

Solutions to exam in SF1811 Optimization, Jan 14, 2015 Solutons to exam n SF8 Optmzaton, Jan 4, 25 3 3 O------O -4 \ / \ / The network: \/ where all lnks go from left to rght. /\ / \ / \ 6 O------O -5 2 4.(a) Let x = ( x 3, x 4, x 23, x 24 ) T, where the varable

More information

Lecture 20: November 7

Lecture 20: November 7 0-725/36-725: Convex Optmzaton Fall 205 Lecturer: Ryan Tbshran Lecture 20: November 7 Scrbes: Varsha Chnnaobreddy, Joon Sk Km, Lngyao Zhang Note: LaTeX template courtesy of UC Berkeley EECS dept. Dsclamer:

More information

Assortment Optimization under MNL

Assortment Optimization under MNL Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING 1 ADVANCED ACHINE LEARNING ADVANCED ACHINE LEARNING Non-lnear regresson technques 2 ADVANCED ACHINE LEARNING Regresson: Prncple N ap N-dm. nput x to a contnuous output y. Learn a functon of the type: N

More information

Homework Assignment 3 Due in class, Thursday October 15

Homework Assignment 3 Due in class, Thursday October 15 Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.

More information

arxiv: v2 [math.co] 3 Sep 2017

arxiv: v2 [math.co] 3 Sep 2017 On the Approxate Asyptotc Statstcal Independence of the Peranents of 0- Matrces arxv:705.0868v2 ath.co 3 Sep 207 Paul Federbush Departent of Matheatcs Unversty of Mchgan Ann Arbor, MI, 4809-043 Septeber

More information

AN ANALYSIS OF A FRACTAL KINETICS CURVE OF SAVAGEAU

AN ANALYSIS OF A FRACTAL KINETICS CURVE OF SAVAGEAU AN ANALYI OF A FRACTAL KINETIC CURE OF AAGEAU by John Maloney and Jack Hedel Departent of Matheatcs Unversty of Nebraska at Oaha Oaha, Nebraska 688 Eal addresses: aloney@unoaha.edu, jhedel@unoaha.edu Runnng

More information

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 31 Multcollnearty Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur 6. Rdge regresson The OLSE s the best lnear unbased

More information

Machine Learning. Support Vector Machines. Eric Xing , Fall Lecture 9, October 6, 2015

Machine Learning. Support Vector Machines. Eric Xing , Fall Lecture 9, October 6, 2015 Machne Learnng 0-70 Fall 205 Support Vector Machnes Erc Xng Lecture 9 Octoer 6 205 Readng: Chap. 6&7 C.B ook and lsted papers Erc Xng @ CMU 2006-205 What s a good Decson Boundar? Consder a nar classfcaton

More information

On the number of regions in an m-dimensional space cut by n hyperplanes

On the number of regions in an m-dimensional space cut by n hyperplanes 6 On the nuber of regons n an -densonal space cut by n hyperplanes Chungwu Ho and Seth Zeran Abstract In ths note we provde a unfor approach for the nuber of bounded regons cut by n hyperplanes n general

More information

ITERATIVE ESTIMATION PROCEDURE FOR GEOSTATISTICAL REGRESSION AND GEOSTATISTICAL KRIGING

ITERATIVE ESTIMATION PROCEDURE FOR GEOSTATISTICAL REGRESSION AND GEOSTATISTICAL KRIGING ESE 5 ITERATIVE ESTIMATION PROCEDURE FOR GEOSTATISTICAL REGRESSION AND GEOSTATISTICAL KRIGING Gven a geostatstcal regresson odel: k Y () s x () s () s x () s () s, s R wth () unknown () E[ ( s)], s R ()

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty Addtonal Codes usng Fnte Dfference Method Benamn Moll 1 HJB Equaton for Consumpton-Savng Problem Wthout Uncertanty Before consderng the case wth stochastc ncome n http://www.prnceton.edu/~moll/ HACTproect/HACT_Numercal_Appendx.pdf,

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix Lectures - Week 4 Matrx norms, Condtonng, Vector Spaces, Lnear Independence, Spannng sets and Bass, Null space and Range of a Matrx Matrx Norms Now we turn to assocatng a number to each matrx. We could

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

Designing Fuzzy Time Series Model Using Generalized Wang s Method and Its application to Forecasting Interest Rate of Bank Indonesia Certificate

Designing Fuzzy Time Series Model Using Generalized Wang s Method and Its application to Forecasting Interest Rate of Bank Indonesia Certificate The Frst Internatonal Senar on Scence and Technology, Islac Unversty of Indonesa, 4-5 January 009. Desgnng Fuzzy Te Seres odel Usng Generalzed Wang s ethod and Its applcaton to Forecastng Interest Rate

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

Two Conjectures About Recency Rank Encoding

Two Conjectures About Recency Rank Encoding Internatonal Journal of Matheatcs and Coputer Scence, 0(205, no. 2, 75 84 M CS Two Conjectures About Recency Rank Encodng Chrs Buhse, Peter Johnson, Wlla Lnz 2, Matthew Spson 3 Departent of Matheatcs and

More information

Lecture 4. Instructor: Haipeng Luo

Lecture 4. Instructor: Haipeng Luo Lecture 4 Instructor: Hapeng Luo In the followng lectures, we focus on the expert problem and study more adaptve algorthms. Although Hedge s proven to be worst-case optmal, one may wonder how well t would

More information

Singular Value Decomposition: Theory and Applications

Singular Value Decomposition: Theory and Applications Sngular Value Decomposton: Theory and Applcatons Danel Khashab Sprng 2015 Last Update: March 2, 2015 1 Introducton A = UDV where columns of U and V are orthonormal and matrx D s dagonal wth postve real

More information

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification E395 - Pattern Recognton Solutons to Introducton to Pattern Recognton, Chapter : Bayesan pattern classfcaton Preface Ths document s a soluton manual for selected exercses from Introducton to Pattern Recognton

More information

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}.

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}. CS 189 Introducton to Machne Learnng Sprng 2018 Note 26 1 Boostng We have seen that n the case of random forests, combnng many mperfect models can produce a snglodel that works very well. Ths s the dea

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Konstantn Tretyakov (kt@ut.ee) MTAT.03.227 Machne Learnng So far Supervsed machne learnng Lnear models Least squares regresson Fsher s dscrmnant, Perceptron, Logstc model Non-lnear

More information

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also

More information

On the Calderón-Zygmund lemma for Sobolev functions

On the Calderón-Zygmund lemma for Sobolev functions arxv:0810.5029v1 [ath.ca] 28 Oct 2008 On the Calderón-Zygund lea for Sobolev functons Pascal Auscher october 16, 2008 Abstract We correct an naccuracy n the proof of a result n [Aus1]. 2000 MSC: 42B20,

More information

Multipoint Analysis for Sibling Pairs. Biostatistics 666 Lecture 18

Multipoint Analysis for Sibling Pairs. Biostatistics 666 Lecture 18 Multpont Analyss for Sblng ars Bostatstcs 666 Lecture 8 revously Lnkage analyss wth pars of ndvduals Non-paraetrc BS Methods Maxu Lkelhood BD Based Method ossble Trangle Constrant AS Methods Covered So

More information

Boostrapaggregating (Bagging)

Boostrapaggregating (Bagging) Boostrapaggregatng (Baggng) An ensemble meta-algorthm desgned to mprove the stablty and accuracy of machne learnng algorthms Can be used n both regresson and classfcaton Reduces varance and helps to avod

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal Inner Product Defnton 1 () A Eucldean space s a fnte-dmensonal vector space over the reals R, wth an nner product,. Defnton 2 (Inner Product) An nner product, on a real vector space X s a symmetrc, blnear,

More information

Sequential Minimal Optimization for SVM with Pinball Loss

Sequential Minimal Optimization for SVM with Pinball Loss Sequental Mnal Optzaton for SVM wth Pnball Loss Xaoln Huang a,, Le Sh b, Johan A.K. Suykens a a KU Leuven, Departent of Electrcal Engneerng (ESAT-STADIUS), B-300 Leuven, Belgu b School of Matheatcal Scences,

More information

Generalized Linear Methods

Generalized Linear Methods Generalzed Lnear Methods 1 Introducton In the Ensemble Methods the general dea s that usng a combnaton of several weak learner one could make a better learner. More formally, assume that we have a set

More information

Column Generation. Teo Chung-Piaw (NUS) 25 th February 2003, Singapore

Column Generation. Teo Chung-Piaw (NUS) 25 th February 2003, Singapore Colun Generaton Teo Chung-Paw (NUS) 25 th February 2003, Sngapore 1 Lecture 1.1 Outlne Cuttng Stoc Proble Slde 1 Classcal Integer Prograng Forulaton Set Coverng Forulaton Colun Generaton Approach Connecton

More information

Gradient Descent Learning and Backpropagation

Gradient Descent Learning and Backpropagation Artfcal Neural Networks (art 2) Chrstan Jacob Gradent Descent Learnng and Backpropagaton CSC 533 Wnter 200 Learnng by Gradent Descent Defnton of the Learnng roble Let us start wth the sple case of lnear

More information

Three Algorithms for Flexible Flow-shop Scheduling

Three Algorithms for Flexible Flow-shop Scheduling Aercan Journal of Appled Scences 4 (): 887-895 2007 ISSN 546-9239 2007 Scence Publcatons Three Algorths for Flexble Flow-shop Schedulng Tzung-Pe Hong, 2 Pe-Yng Huang, 3 Gwoboa Horng and 3 Chan-Lon Wang

More information

Solving Fuzzy Linear Programming Problem With Fuzzy Relational Equation Constraint

Solving Fuzzy Linear Programming Problem With Fuzzy Relational Equation Constraint Intern. J. Fuzz Maeatcal Archve Vol., 0, -0 ISSN: 0 (P, 0 0 (onlne Publshed on 0 Septeber 0 www.researchasc.org Internatonal Journal of Solvng Fuzz Lnear Prograng Proble W Fuzz Relatonal Equaton Constrant

More information

,..., k N. , k 2. ,..., k i. The derivative with respect to temperature T is calculated by using the chain rule: & ( (5) dj j dt = "J j. k i.

,..., k N. , k 2. ,..., k i. The derivative with respect to temperature T is calculated by using the chain rule: & ( (5) dj j dt = J j. k i. Suppleentary Materal Dervaton of Eq. 1a. Assue j s a functon of the rate constants for the N coponent reactons: j j (k 1,,..., k,..., k N ( The dervatve wth respect to teperature T s calculated by usng

More information

NP-Completeness : Proofs

NP-Completeness : Proofs NP-Completeness : Proofs Proof Methods A method to show a decson problem Π NP-complete s as follows. (1) Show Π NP. (2) Choose an NP-complete problem Π. (3) Show Π Π. A method to show an optmzaton problem

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

Natural Language Processing and Information Retrieval

Natural Language Processing and Information Retrieval Natural Language Processng and Informaton Retreval Support Vector Machnes Alessandro Moschtt Department of nformaton and communcaton technology Unversty of Trento Emal: moschtt@ds.untn.t Summary Support

More information

Chapter 8 Indicator Variables

Chapter 8 Indicator Variables Chapter 8 Indcator Varables In general, e explanatory varables n any regresson analyss are assumed to be quanttatve n nature. For example, e varables lke temperature, dstance, age etc. are quanttatve n

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Konstantn Tretyakov (kt@ut.ee) MTAT.03.227 Machne Learnng So far So far Supervsed machne learnng Lnear models Non-lnear models Unsupervsed machne learnng Generc scaffoldng So far

More information

Towards strong security in embedded and pervasive systems: energy and area optimized serial polynomial multipliers in GF(2 k )

Towards strong security in embedded and pervasive systems: energy and area optimized serial polynomial multipliers in GF(2 k ) Towards strong securty n ebedded and pervasve systes: energy and area optzed seral polynoal ultplers n GF( k ) Zoya Dyka, Peter Langendoerfer, Frank Vater and Steffen Peter IHP, I Technologepark 5, D-53

More information

On a direct solver for linear least squares problems

On a direct solver for linear least squares problems ISSN 2066-6594 Ann. Acad. Rom. Sc. Ser. Math. Appl. Vol. 8, No. 2/2016 On a drect solver for lnear least squares problems Constantn Popa Abstract The Null Space (NS) algorthm s a drect solver for lnear

More information

Study of Classification Methods Based on Three Learning Criteria and Two Basis Functions

Study of Classification Methods Based on Three Learning Criteria and Two Basis Functions Study of Classfcaton Methods Based on hree Learnng Crtera and wo Bass Functons Jae Kyu Suhr Abstract - hs paper nvestgates several classfcaton ethods based on the three learnng crtera and two bass functons.

More information