Fast Rates for Regularized Objectives


Karthik Sridharan, Nathan Srebro, Shai Shalev-Shwartz
Toyota Technological Institute Chicago

Abstract

We study convergence properties of empirical minimization of a stochastic strongly convex objective, where the stochastic component is linear. We show that the value attained by the empirical minimizer converges to the optimal value with rate 1/n. The result applies, in particular, to the SVM objective. Thus, we obtain a rate of 1/n on the convergence of the SVM objective (with fixed regularization parameter) to its infinite data limit. We demonstrate how this is essential for obtaining certain types of oracle inequalities for SVMs. The results extend also to approximate minimization as well as to strong convexity with respect to an arbitrary norm, and so also to objectives regularized using other l_p norms.

1 Introduction

We consider the problem of (approximately) minimizing a stochastic objective

    F(w) = E_θ[f(w; θ)]    (1)

where the optimization is with respect to w ∈ W, based on an i.i.d. sample θ_1, ..., θ_n. We focus on problems where f(w; θ) has a generalized linear form:

    f(w; θ) = l(⟨w, φ(θ)⟩; θ) + r(w).    (2)

The relevant special case is regularized linear prediction, where θ = (x, y), l(⟨w, φ(x)⟩; y) is the loss of predicting ⟨w, φ(x)⟩ when the true target is y, and r(w) is a regularizer. It is well known that when the domain W and the mapping φ(·) are bounded, and the function l(z; θ) is Lipschitz continuous in z, the empirical averages

    F̂(w) = Ê[f(w; θ)] = (1/n) Σ_{i=1}^n f(w; θ_i)    (3)

converge uniformly to their expectations F(w) with rate 1/√n. This justifies using the empirical minimizer

    ŵ = arg min_{w ∈ W} F̂(w),    (4)

and we can then establish convergence of F(ŵ) to the population optimum

    F(w*) = min_{w ∈ W} F(w)    (5)

with a rate of 1/√n.
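To make this setup concrete, here is a minimal Python sketch (our illustration, not code from the paper) of the empirical objective F̂(w) of (3) and an approximate empirical minimizer ŵ of (4) in the SVM case: hinge loss with the l_2 regularizer r(w) = (λ/2)‖w‖². The synthetic data, the value of λ, and the plain subgradient-descent solver are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1

# Synthetic i.i.d. sample theta_i = (x_i, y_i); purely illustrative.
X = rng.uniform(-1.0, 1.0, size=(n, d))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(n))

def F_hat(w):
    # Empirical objective (3): average hinge loss plus r(w) = (lam/2) * ||w||^2.
    return np.maximum(0.0, 1.0 - y * (X @ w)).mean() + 0.5 * lam * (w @ w)

def subgradient(w):
    # A subgradient of F_hat at w (the hinge loss is 1-Lipschitz but not differentiable).
    active = (y * (X @ w) < 1.0).astype(float)
    return -(active * y) @ X / n + lam * w

# Approximate empirical minimizer w_hat of (4) via subgradient descent,
# with the 1/(lam * t) step size that is standard for strongly convex objectives.
w_hat = np.zeros(d)
for t in range(1, 2001):
    w_hat -= subgradient(w_hat) / (lam * t)

print("F_hat(w_hat) ~", F_hat(w_hat))
```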
Recently, Hazan et al [1] studied an online analogue to this problem, and established that if f(w; θ) is strongly convex in w, the average online regret diminishes with a much faster rate, namely (log n)/n. The function f(w; θ) becomes strongly convex when, for example, we have r(w) = (λ/2)‖w‖² as in SVMs and other regularized learning settings. In this paper we present an analogous fast rate for empirical minimization of a strongly convex stochastic objective. In fact, we do not need to assume that we perform the empirical minimization exactly: we provide uniform (over all w ∈ W) guarantees on the population sub-optimality F(w) − F(w*) in terms of the empirical sub-optimality F̂(w) − F̂(ŵ), with a rate of 1/n. This is a stronger type of result than what can be obtained with an online-to-batch conversion, as it applies to any possible solution w, and not only to some specific algorithmically defined solution. For example, it can be used to analyze the performance of approximate minimizers obtained through approximate optimization techniques. Specifically, consider f(w; θ) as in (2), where l(z; θ) is convex and L-Lipschitz in z, the norm of φ(θ) is bounded by B, and r is λ-strongly convex. We show that for any a > 0 and δ > 0, with probability at least 1 − δ, for all w (of arbitrary magnitude):

    F(w) − F(w*) ≤ (1 + a)(F̂(w) − F̂(ŵ)) + O((1 + 1/a) L²B² log(1/δ) / (λn)).    (6)

We emphasize that here and throughout the paper the big-O notation hides only fixed numeric constants.

It might not be surprising that requiring strong convexity yields a rate of 1/n. Indeed, the connection between strong convexity, variance bounds, and rates of 1/n is well known. However, it is interesting to note the generality of the result here, and the simplicity of the conditions. In particular, we do not require any low-noise conditions, nor that the loss function is strongly convex (it need only be weakly convex). In particular, (6) applies, under no additional conditions, to the SVM objective. We therefore obtain convergence with a rate of 1/n for the SVM objective.

This 1/n rate on the SVM objective is always valid, and does not depend on any low-noise conditions or on specific properties of the kernel function. Such a fast rate might seem surprising at first glance to a reader familiar with the 1/√n rate on the expected loss of the SVM optimum. There is no contradiction here: what we establish is that although the loss might converge at a rate of 1/√n, the SVM objective (the regularized loss) always converges at a rate of 1/n. In fact, in Section 3 we see how a rate of 1/n on the objective corresponds to a rate of 1/√n on the loss. Specifically, we perform an oracle analysis of the optimum of the SVM objective (rather than of empirical minimization subject to a norm constraint, as in other oracle analyses of regularized linear learning), based on the existence of some (unknown) low-norm, low-error predictor w_o.

Strong convexity is a concept that depends on a choice of norm. We state our results in a general form, for any choice of norm ‖·‖. Strong convexity of r(w) must hold with respect to the chosen norm, and the data φ(θ) must be bounded with respect to the dual norm ‖·‖*, i.e. we must have ‖φ(θ)‖* ≤ B. This allows us to apply our results also to more general forms of regularizers, including squared l_p norm regularizers, r(w) = (λ/2)‖w‖_p² for 1 < p ≤ 2 (see Corollary 2). However, the reader may choose to read the paper always thinking of the norm ‖w‖, and so also its dual norm ‖w‖*, as the standard l_2-norm.

2 Main Result

We consider a generalized linear function f : W × Θ → R that can be written as in (2), defined over a closed convex subset W of a Banach space equipped with norm ‖·‖.

Lipschitz continuity and boundedness. We require that the mapping φ(·) is bounded by B, i.e. ‖φ(θ)‖* ≤ B, and that the function l(z; θ) is L-Lipschitz in z ∈ R for every θ.

Strong convexity. We require that F(w) is λ-strongly convex w.r.t. the norm ‖w‖. That is, for all w_1, w_2 ∈ W and α ∈ [0, 1] we have:

    F(αw_1 + (1 − α)w_2) ≤ αF(w_1) + (1 − α)F(w_2) − (λ/2)α(1 − α)‖w_1 − w_2‖².

Recalling that w* = arg min_w F(w), this ensures (see for example [2, Lemma 3]):

    F(w) ≥ F(w*) + (λ/2)‖w − w*‖².    (7)

We require only that the expectation F(w) = E[f(w; θ)] is strongly convex. Of course, requiring that f(w; θ) is λ-strongly convex for all θ (with respect to w) is enough to ensure the condition.
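As a quick numeric illustration (ours, not the paper's), the sketch below checks the strong-convexity inequality above for a regularized hinge objective, using an empirical average as a stand-in for the expectation F(w); the data distribution and the value of λ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 500, 4, 0.3
X = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))

def F(w):
    # Regularized hinge objective; the empirical average stands in for the expectation.
    return np.maximum(0.0, 1.0 - y * (X @ w)).mean() + 0.5 * lam * (w @ w)

# Check F(a*w1 + (1-a)*w2) <= a*F(w1) + (1-a)*F(w2) - (lam/2)*a*(1-a)*||w1 - w2||^2
for _ in range(1000):
    w1, w2 = rng.normal(size=d), rng.normal(size=d)
    a = rng.uniform()
    lhs = F(a * w1 + (1 - a) * w2)
    rhs = a * F(w1) + (1 - a) * F(w2) - 0.5 * lam * a * (1 - a) * np.dot(w1 - w2, w1 - w2)
    assert lhs <= rhs + 1e-9, "strong convexity inequality violated"

print("lambda-strong convexity inequality held on all random checks")
```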
In particular, for a generalized linear function of the form (2) it is enough to require that l(z; y) is convex in z and that r(w) is λ-strongly convex (w.r.t. the norm ‖w‖). We now provide a faster convergence rate using the above conditions.

Theorem 1. Let W be a closed convex subset of a Banach space with norm ‖·‖ and dual norm ‖·‖*, and consider f(w; θ) = l(⟨w, φ(θ)⟩; θ) + r(w) satisfying the Lipschitz continuity, boundedness, and strong convexity requirements with parameters B, L, and λ. Let w*, ŵ, F(w) and F̂(w) be as defined in (1)-(5). Then, for any δ > 0 and any a > 0, with probability at least 1 − δ over a sample of size n, we have that for all w ∈ W (where [x]_+ = max(x, 0)):

    F(w) − F(w*) ≤ (1 + a)[F̂(w) − F̂(w*)]_+ + 8(1 + 1/a) L²B² (32 + log(1/δ)) / (λn)
               ≤ (1 + a)(F̂(w) − F̂(ŵ)) + 8(1 + 1/a) L²B² (32 + log(1/δ)) / (λn).

It is particularly interesting to consider regularizers of the form r(w) = (λ/2)‖w‖_p², which are (p − 1)λ-strongly convex w.r.t. the corresponding l_p-norm [2]. Applying Theorem 1 to this case yields the following bound:

Corollary 2. Consider an l_p norm and its dual l_q, with 1 < p ≤ 2 and 1/q + 1/p = 1, and the objective f(w; θ) = l(⟨w, φ(θ)⟩; θ) + (λ/2)‖w‖_p², where ‖φ(θ)‖_q ≤ B and l(z; y) is convex and L-Lipschitz in z. The domain is the entire Banach space W = l_p. Then, for any δ > 0 and any a > 0, with probability at least 1 − δ over a sample of size n, we have that for all w ∈ W = l_p (of any magnitude):

    F(w) − F(w*) ≤ (1 + a)(F̂(w) − F̂(ŵ)) + O((1 + 1/a) L²B² log(1/δ) / ((p − 1)λn)).

Corollary 2 allows us to analyze the rate of convergence of the regularized risk for l_p-regularized linear learning, that is, training by minimizing the empirical average of

    f(w; x, y) = l(⟨w, x⟩, y) + (λ/2)‖w‖_p²    (8)

where l(z, y) is some convex loss function and ‖x‖_q ≤ B. For example, in SVMs we use the l_2 norm, and so bound ‖x‖_2 ≤ B, and the hinge loss l(z, y) = [1 − yz]_+, which is 1-Lipschitz. What we obtain is a bound on how quickly we can minimize the expectation F(w) = E[l(⟨w, x⟩, y)] + (λ/2)‖w‖_p², i.e. the regularized expected loss, or in other words, how quickly we converge to the infinite-data optimum of the objective. We see, then, that the SVM objective converges to its optimum value at a fast rate of 1/n, without any special assumptions.

This still does not mean that the expected loss L(ŵ) = E[l(⟨ŵ, x⟩, y)] converges at this rate. This behavior is demonstrated empirically on the left plot of Figure 1. For each data set size we plot the excess expected loss L(ŵ) − L(w*) and the sub-optimality of the regularized expected loss F(ŵ) − F(w*) (recall that F(ŵ) = L(ŵ) + (λ/2)‖ŵ‖²). Although the regularized expected loss converges to its infinite data limit, i.e. to the population minimizer, with rate roughly 1/n, the expected loss L(ŵ) converges at a slower rate of roughly 1/√n.

[Figure 1. Left: excess expected loss L(ŵ) − L(w*) and sub-optimality of the regularized expected loss F(ŵ) − F(w*) as a function of training set size, for a fixed λ = 0.8. Right: excess expected loss L(ŵ_λ) − L(w_o), relative to the overall optimum w_o = arg min_w L(w), with λ = √(300/n). Both plots are on a logarithmic scale and refer to a synthetic example with x uniform over [−1.5, 1.5]^300, and y = sign(x_1) when |x_1| > 1 but uniform otherwise.]

Studying the convergence rate of the SVM objective allows us to better understand and appreciate analyses of computational optimization approaches for this objective, as well as to obtain oracle inequalities on the generalization loss of ŵ, as we do in the following Section.
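The following sketch (our illustration; the paper specifies its experimental setup only through the caption of Figure 1, so the distribution, constants, and solver below are assumptions) shows the kind of experiment behind the left plot: estimate the objective sub-optimality F(ŵ) − F(w*) and the excess loss L(ŵ) − L(w*) at several sample sizes, using a very large held-out sample as a proxy for the population quantities.

```python
import numpy as np

rng = np.random.default_rng(2)
d, lam = 20, 0.5

def sample(m):
    X = rng.uniform(-1.5, 1.5, size=(m, d))
    y = np.where(np.abs(X[:, 0]) > 1.0, np.sign(X[:, 0]), rng.choice([-1.0, 1.0], size=m))
    return X, y

def hinge_risk(w, X, y):
    # Average hinge loss, i.e. the (empirical stand-in for the) expected loss L(w).
    return np.maximum(0.0, 1.0 - y * (X @ w)).mean()

def erm(X, y, steps=2000):
    # Subgradient descent on the regularized empirical objective.
    w = np.zeros(d)
    for t in range(1, steps + 1):
        active = (y * (X @ w) < 1.0).astype(float)
        g = -(active * y) @ X / len(y) + lam * w
        w -= g / (lam * t)
    return w

X_big, y_big = sample(100_000)        # large sample used as a proxy for the population
w_star = erm(X_big, y_big)            # proxy for the population optimum w*
F_star = hinge_risk(w_star, X_big, y_big) + 0.5 * lam * (w_star @ w_star)

for n in [100, 400, 1600, 6400]:
    X, y = sample(n)
    w_hat = erm(X, y)
    F_gap = hinge_risk(w_hat, X_big, y_big) + 0.5 * lam * (w_hat @ w_hat) - F_star
    L_gap = hinge_risk(w_hat, X_big, y_big) - hinge_risk(w_star, X_big, y_big)
    print(f"n={n:5d}  objective sub-optimality={F_gap:.5f}  excess loss={L_gap:.5f}")
```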
Before moving on, we briefly provide an example of applying Theorem 1 with respect to the l_1-norm. The bound in Corollary 2 diverges when p → 1, and the Corollary is not applicable for l_1 regularization. This is because ‖w‖_1² is not strongly convex w.r.t. the l_1-norm. An example of a regularizer that is strongly convex with respect to the l_1 norm is the (unnormalized) entropy regularizer [3]: r(w) = Σ_{i=1}^d |w_i| log(|w_i|). This regularizer is 1/B_w²-strongly convex w.r.t. ‖w‖_1, as long as ‖w‖_1 ≤ B_w (see [2]), yielding:

Corollary 3. Consider a function f(w; θ) = l(⟨w, φ(θ)⟩; θ) + Σ_{i=1}^d |w_i| log(|w_i|), where ‖φ(θ)‖_∞ ≤ B and l(z; y) is convex and L-Lipschitz in z. Take the domain to be the l_1 ball W = {w ∈ R^d : ‖w‖_1 ≤ B_w}. Then, for any δ > 0 and any a > 0, with probability at least 1 − δ over a sample of size n, we have that for all w ∈ W:

    F(w) − F(w*) ≤ (1 + a)(F̂(w) − F̂(ŵ)) + O((1 + 1/a) L²B²B_w² log(1/δ) / n).

3 Oracle Inequalities for SVMs

In this Section we apply the results from the previous Section to obtain an oracle inequality on the expected loss L(w) = E[l(⟨w, x⟩, y)] of an approximate minimizer of the SVM training objective F̂_λ(w) = Ê[f_λ(w)], where

    f_λ(w; x, y) = l(⟨w, x⟩, y) + (λ/2)‖w‖²,    (9)

and l(z, y) is the hinge loss, or any other 1-Lipschitz loss function. As before we denote B = sup_x ‖x‖ (all norms in this Section are l_2 norms). We assume, as an oracle assumption, that there exists a good predictor w_o with low norm ‖w_o‖ which attains low expected loss L(w_o).

Consider an optimization algorithm for F̂_λ(w) that is guaranteed to find w̃ such that F̂_λ(w̃) ≤ min_w F̂_λ(w) + ε_opt. Using the results of Section 2, we can translate this approximate optimality of the empirical objective into approximate optimality of the expected objective F_λ(w) = E[f_λ(w)]. Specifically, applying Corollary 2 with a = 1 we have that with probability at least 1 − δ:

    F_λ(w̃) − F_λ(w*) ≤ 2ε_opt + O(B² log(1/δ) / (λn)).    (10)

Optimizing to within ε_opt = O(B² / (λn)) is then enough to ensure

    F_λ(w̃) − F_λ(w*) = O(B² log(1/δ) / (λn)).    (11)
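Read operationally (our reading, with the constants hidden by the O(·) notation replaced by arbitrary parameters), (10) and (11) say that an SVM solver only needs to reach empirical sub-optimality of order B²/(λn); a small illustrative helper:

```python
import numpy as np

def eps_opt_target(B, lam, n, c=1.0):
    # Empirical sub-optimality of order B^2 / (lam * n); the constant c is arbitrary,
    # standing in for the constant hidden by the paper's O(.) notation.
    return c * B ** 2 / (lam * n)

def expected_objective_gap(eps_opt, B, lam, n, delta, c=1.0):
    # Right-hand side of (10), again with an arbitrary constant c in place of the O(.).
    return 2.0 * eps_opt + c * B ** 2 * np.log(1.0 / delta) / (lam * n)

eps = eps_opt_target(B=1.0, lam=0.01, n=10_000)
print(eps, expected_objective_gap(eps, B=1.0, lam=0.01, n=10_000, delta=0.01))
```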
In order to translate this into a bound on the expected loss L(w̃) we consider the following decomposition:

    L(w̃) = L(w_o) + (F_λ(w̃) − F_λ(w*)) + (F_λ(w*) − F_λ(w_o)) + (λ/2)‖w_o‖² − (λ/2)‖w̃‖²
         ≤ L(w_o) + O(B² log(1/δ) / (λn)) + (λ/2)‖w_o‖²    (12)

where we used the bound (11) to bound the second term, used the optimality of w* to ensure the third term is non-positive, and dropped the last, non-positive, term. This might seem like a rate of 1/n on the generalization error, but we need to choose λ so as to balance the second and third terms. The optimal choice for λ is

    λ(n) = c B √(log(1/δ)) / (‖w_o‖ √n),    (13)

for some constant c. We can now formally state our oracle inequality, which is obtained by substituting (13) into (12):

Corollary 4. Consider an SVM-type objective as in (9). For any w_o and any δ > 0, with probability at least 1 − δ over a sample of size n, we have that for all w̃ s.t. F̂_λ(n)(w̃) ≤ min_w F̂_λ(n)(w) + O(B²/(λ(n) n)), where λ(n) is chosen as in (13), the following holds:

    L(w̃) ≤ L(w_o) + O(√(B² ‖w_o‖² log(1/δ) / n)).

Corollary 4 is demonstrated empirically on the right plot of Figure 1. The way we set λ(n) in Corollary 4 depends on ‖w_o‖. However, using

    λ(n) = B √(log(1/δ) / n)    (14)

we obtain:

Corollary 5. Consider an SVM-type objective as in (9) with λ(n) set as in (14). For any δ > 0, with probability at least 1 − δ over a sample of size n, we have that for all w̃ s.t. F̂_λ(n)(w̃) ≤ min_w F̂_λ(n)(w) + O(B²/(λ(n) n)), the following holds:

    L(w̃) ≤ inf_{w_o} [ L(w_o) + O(√(B² (‖w_o‖⁴ + 1) log(1/δ) / n)) ].

The price we pay here is that the bound of Corollary 5 is larger by a factor of ‖w_o‖ relative to the bound of Corollary 4. Nevertheless, this bound allows us to converge with a rate of 1/√n to the expected loss of any fixed predictor.

It is interesting to repeat the analysis of this Section using the more standard result:

    F_λ(w) − F_λ(w*) ≤ F̂_λ(w) − F̂_λ(w*) + O(B_w B / √n)   for ‖w‖ ≤ B_w    (15)

where we ignore the dependence on δ. Setting B_w = √(2/λ), as this is a bound on the norm of both the empirical and population optimums, and using (15) instead of Corollary 2 in our analysis yields the oracle inequality:

    L(w̃) ≤ L(w_o) + O((B² ‖w_o‖² log(1/δ) / n)^{1/3}).    (16)

The oracle analysis studied here is very simple: our oracle assumption involves only a single predictor w_o, and we make no assumptions about the kernel or the noise. We note that a more sophisticated analysis has been carried out by Steinwart et al [4], who showed that rates faster than 1/√n are possible under certain conditions on the noise and on the complexity of the kernel class. In Steinwart et al's analyses the estimation rates (i.e. rates for the expected regularized risk) are given in terms of the approximation error quantity (λ/2)‖w*‖² + L(w*) − L*, where L* is the Bayes risk. In our result we consider the estimation rate for the regularized objective independently of the approximation error.
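A small sketch (ours, with arbitrary constants) of the two regularization-parameter choices discussed above: the oracle-dependent λ(n) of (13) and the oracle-free λ(n) of (14).

```python
import numpy as np

def lam_oracle(n, B, delta, w_o_norm, c=1.0):
    # Choice (13): lambda(n) = c * B * sqrt(log(1/delta)) / (||w_o|| * sqrt(n)); c unspecified.
    return c * B * np.sqrt(np.log(1.0 / delta)) / (w_o_norm * np.sqrt(n))

def lam_free(n, B, delta):
    # Choice (14): lambda(n) = B * sqrt(log(1/delta) / n); needs no knowledge of w_o.
    return B * np.sqrt(np.log(1.0 / delta) / n)

for n in [1_000, 10_000, 100_000]:
    print(n, lam_oracle(n, B=1.0, delta=0.01, w_o_norm=5.0), lam_free(n, B=1.0, delta=0.01))
```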
4 Proof of Main Result

To prove Theorem 1 we use techniques of reweighting and peeling, following Bartlett et al [5]. For each w, we define g_w(θ) = f(w; θ) − f(w*; θ), and so our goal is to bound the expectation of g_w in terms of its empirical average. We denote G = {g_w : w ∈ W}. Since our desired bound is not exactly uniform, and we would like to pay different attention to functions depending on their expected sub-optimality, we will instead consider the following reweighted class. For any r > 0 define

    G_r = { g_w^r = g_w / 4^{k(w)} : w ∈ W, k(w) = min{k ∈ Z_+ : E[g_w] ≤ r 4^k} }    (17)

where Z_+ is the set of non-negative integers. In other words, g_w^r ∈ G_r is just a scaled version of g_w ∈ G, and the scaling factor ensures that E[g_w^r] ≤ r. We will begin by bounding the variation between expected and empirical average values of g^r ∈ G_r. This is typically done in terms of the complexity of the class G_r. However, we will instead use the complexity of a slightly different class of functions, which ignores the non-random (i.e. non-data-dependent) regularization terms r(w). Define:

    H_r = { h_w^r = h_w / 4^{k(w)} : w ∈ W, k(w) = min{k ∈ Z_+ : E[g_w] ≤ r 4^k} }    (18)

where

    h_w(θ) = g_w(θ) − (r(w) − r(w*)) = l(⟨w, φ(θ)⟩; θ) − l(⟨w*, φ(θ)⟩; θ).    (19)

That is, h_w^r(θ) is the data-dependent component of g_w^r, dropping the (scaled) regularization terms. With this definition we have E[g_w^r] − Ê[g_w^r] = E[h_w^r] − Ê[h_w^r] (the regularization terms on the left hand side cancel out), and so it is enough to bound the deviation of the empirical means in H_r. This can be done in terms of the Rademacher complexity of the class, R(H_r) [6, Theorem 5]: for any δ > 0, with probability at least 1 − δ,

    sup_{h^r ∈ H_r} ( E[h^r] − Ê[h^r] ) ≤ 2 R(H_r) + ( sup_{h^r ∈ H_r, θ} |h^r(θ)| ) √(2 log(1/δ) / n).    (20)

We will now proceed to bounding the two terms on the right hand side.

Lemma 6. sup_{h^r ∈ H_r, θ} |h^r(θ)| ≤ LB √(2r/λ).

Proof. From the definition of h_w^r given in (18)-(19), the Lipschitz continuity of l(·; θ), and the bound ‖φ(θ)‖* ≤ B, we have for all w, θ:

    |h_w^r(θ)| ≤ |h_w(θ)| / 4^{k(w)} ≤ LB ‖w − w*‖ / 4^{k(w)}.    (21)

We now use the strong convexity of F(w), and in particular eq. (7), as well as the definitions of g_w and k(w), and finally note that 4^{k(w)} ≥ 1, to get:

    ‖w − w*‖² ≤ (2/λ)(F(w) − F(w*)) = (2/λ) E[g_w] ≤ (2/λ) 4^{k(w)} r ≤ (2/λ) 16^{k(w)} r.    (22)

Substituting (22) into (21) yields the desired bound.

Lemma 7. R(H_r) ≤ 2LB √(2r/(λn)).

Proof. We will use the following generic bound on the Rademacher complexity of linear functionals [7, Theorem 1]: for any t(w) which is λ-strongly convex (w.r.t. a norm whose dual norm is ‖·‖*),

    R({ ⟨w, φ⟩ : t(w) ≤ a }) ≤ ( sup ‖φ‖* ) √(2a / (λn)).    (23)

For each a > 0, define H(a) = {h_w : w ∈ W, E[g_w] ≤ a}. First note that E[g_w] = F(w) − F(w*) is λ-strongly convex. Using (23) and the Lipschitz composition property we therefore have R(H(a)) ≤ LB √(2a/(λn)). Now:

    R(H_r) = R( ∪_{j≥0} 4^{−j} H(r 4^j) ) ≤ Σ_{j≥0} 4^{−j} R(H(r 4^j)) ≤ Σ_{j≥0} 4^{−j} LB √(2r 4^j/(λn)) = LB √(2r/(λn)) Σ_{j≥0} 2^{−j} = 2LB √(2r/(λn)).

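As an aside (ours, not part of the proof), the quantity controlled by Lemma 7 can be made concrete in the simplest l_2 case: for a norm ball {w : ‖w‖_2 ≤ W_bound} the supremum inside the Rademacher complexity has a closed form, and the complexity itself can be estimated by Monte Carlo over the sign vectors; the data and W_bound below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, W_bound = 500, 10, 2.0
X = rng.normal(size=(n, d))   # plays the role of the data points phi(theta_i); illustrative

# Empirical Rademacher complexity of {x -> <w, x> : ||w||_2 <= W_bound}:
#   E_sigma sup_w (1/n) sum_i sigma_i <w, x_i> = W_bound * E_sigma || (1/n) sum_i sigma_i x_i ||_2
estimates = []
for _ in range(2000):
    sigma = rng.choice([-1.0, 1.0], size=n)
    estimates.append(W_bound * np.linalg.norm(sigma @ X) / n)

print("Monte Carlo estimate:", np.mean(estimates))
print("B * W_bound / sqrt(n):", np.max(np.linalg.norm(X, axis=1)) * W_bound / np.sqrt(n))
```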
We now proceed to bounding E[g_w] = F(w) − F(w*), and thus proving Theorem 1. For any r > 0, with probability at least 1 − δ we have, for all w ∈ W:

    E[g_w] − Ê[g_w] = 4^{k(w)} ( E[g_w^r] − Ê[g_w^r] ) = 4^{k(w)} ( E[h_w^r] − Ê[h_w^r] ) ≤ 4^{k(w)} √r D    (24)

where D = LB (√32 + 2√(log(1/δ))) / √(λn) ≤ 2√2 LB √((32 + log(1/δ)) / (λn)) is obtained by substituting Lemmas 6 and 7 into (20). We now consider two possible cases: k(w) = 0 and k(w) > 0.

The case k(w) = 0 corresponds to functions with an expected value close to optimal: E[g_w] ≤ r, i.e. F(w) ≤ F(w*) + r. In this case (24) becomes:

    E[g_w] ≤ Ê[g_w] + √r D.    (25)

We now turn to functions for which k(w) > 0, i.e. with expected values further away from optimal. In this case, the definition of k(w) ensures 4^{k(w)−1} r < E[g_w], and substituting this into (24) we have E[g_w] − Ê[g_w] ≤ (4D/√r) E[g_w]. Rearranging terms yields:

    E[g_w] ≤ Ê[g_w] / (1 − 4D/√r).    (26)

Combining the two cases (25) and (26) (and requiring r ≥ (4D)², so that 4D/√r ≤ 1), we always have:

    E[g_w] ≤ [Ê[g_w]]_+ / (1 − 4D/√r) + √r D.    (27)

Setting r = (1 + 1/a)² (4D)² yields the bound in Theorem 1.

5 Comparison with Previous Fast Rate Guarantees

Rates faster than 1/√n for estimation have been previously explored under various conditions, where strong convexity has played a significant role. Lee et al [8] showed faster rates for the squared loss, exploiting the strong convexity of this loss function, but only under a finite pseudo-dimensionality assumption, which does not hold in SVM-like settings. Bousquet [9] provided similar guarantees when the spectrum of the kernel matrix (the covariance of the data) is exponentially decaying. Tsybakov [10] introduced a margin condition under which rates faster than 1/√n are shown to be possible. It is also possible to ensure rates of 1/n by relying on low noise conditions [9, 11], but here we make no such assumption.

Most methods for deriving fast rates first bound the variance of the functions in the class by some monotone function of their expectations. Then, using methods as in Bartlett et al [5], one can get bounds that have a localized complexity term and additional terms of order faster than 1/√n. However, it is important to note that the localized complexity term typically dominates the rate and still needs to be controlled. For example, Bartlett et al [12] show that strict convexity of the loss function implies a variance bound, and provide a general result that can enable obtaining faster rates as long as the complexity term is low. For instance, for classes with finite VC dimension V, the resulting rate is n^{−(V+2)/(2V+2)}, which indeed is better than 1/√n but is not quite 1/n. Thus we see that even for a strictly convex loss function, such as the squared loss, additional conditions are necessary in order to obtain fast rates. In this work we show that strong convexity not only implies a variance bound, but in fact can be used to bound the localized complexity.

An important distinction is that we require strong convexity of the function F(w) with respect to the norm ‖w‖. This is rather different than requiring the loss function z ↦ l(z, y) to be strongly convex on the reals. In particular, the loss of a linear predictor, w ↦ l(⟨w, x⟩, y), can never be strongly convex in a multi-dimensional space, even if l is strongly convex, since it is flat in directions orthogonal to x. As mentioned, f(w; x, y) = l(⟨w, x⟩, y) can never be strongly convex in a high-dimensional space. However, we actually only require strong convexity of the expected loss F(w). If the loss function l(z, y) is λ-strongly convex in z, and the eigenvalues of the covariance of x are bounded away from zero, strong convexity of F(w) can be ensured. In particular, F(w) would be cλ-strongly convex, where c is the minimal eigenvalue of COV[x]. This enables us to use Theorem 1 to obtain rates of 1/n on the expected loss itself.
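A small sketch (ours, not the paper's) of the observation above: when l(z, y) is λ-strongly convex in z, F(w) inherits cλ-strong convexity with c the minimal eigenvalue of COV[x]; the code simply estimates c from a sample. The data distribution and the loss's strong-convexity constant are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 10_000, 5
# Anisotropic data whose covariance eigenvalues stay away from zero; illustrative.
X = rng.normal(size=(n, d)) * np.array([2.0, 1.0, 0.5, 0.2, 0.05])

c = np.linalg.eigvalsh(np.cov(X, rowvar=False)).min()   # minimal eigenvalue of COV[x]
lam_loss = 1.0   # strong convexity of z -> l(z, y) in z, e.g. the squared loss (z - y)^2 / 2
print("implied strong convexity of F(w):", c * lam_loss)
```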
However, we cannot expect the eigenvalues to be bounded away from zero in very high dimensional spaces, limiting the applicability of this result to low-dimensional spaces where, as discussed above, other results also apply.

An interesting observation about our proof technique is that the only concentration inequality we invoked was McDiarmid's inequality (in [6, Theorem 5], to obtain (20), a bound on the deviations in terms of the Rademacher complexity). This was possible because we could make a localization argument for the l_∞ norm of the functions in our function class in terms of their expectation.

6 Summary

We believe this is the first demonstration that, without any additional requirements, the SVM objective converges to its infinite data limit with a rate of O(1/n). This improves on previous results that considered the SVM objective only under special additional conditions. The results extend also to other regularized objectives.

Although the quantity that is ultimately of interest to us is the expected loss, and not the regularized expected loss, it is still important to understand the statistical behavior of the regularized expected loss. This is the quantity that we actually optimize, track, and often provide bounds on (e.g. in approximate or stochastic optimization approaches). A better understanding of its behavior can allow us to theoretically explore the behavior of regularized learning methods, to better understand empirical behavior observed in practice, and to appreciate guarantees of stochastic optimization approaches for such regularized objectives. As we saw in Section 3, deriving such fast rates is also essential for obtaining simple and general oracle inequalities, which also help guide our choice of regularization parameters.

References

[1] E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2006.
[2] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University, 2007.
[3] T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527-550, 2002.
[4] I. Steinwart, D. Hush, and C. Scovel. A new concentration result for regularized risk minimizers. High Dimensional Probability IV, IMS Lecture Notes, vol. 51, 2006.
[5] P. L. Bartlett, O. Bousquet, and S. Mendelson. Localized Rademacher complexities. In COLT '02: Proceedings of the 15th Annual Conference on Computational Learning Theory, pages 44-58, London, UK, 2002. Springer-Verlag.
[6] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In O. Bousquet, U. von Luxburg, and G. Rätsch, editors, Advanced Lectures in Machine Learning, pages 169-207. Springer, 2004.
[7] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In NIPS, 2008.
[8] W. S. Lee, P. L. Bartlett, and R. C. Williamson. The importance of convexity in learning with squared loss. In Computational Learning Theory, pages 140-146, 1996.
[9] O. Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. PhD thesis, Ecole Polytechnique, 2002.
[10] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32:135-166, 2004.
[11] I. Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels. Annals of Statistics, 35:575-607, 2007.
[12] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138-156, March 2006.
