Local Rademacher Complexities


Peter L. Bartlett, Department of Statistics and Division of Computer Science, University of California at Berkeley, 367 Evans Hall, Berkeley, CA

Olivier Bousquet, Empirical Inference Department, Max Planck Institute for Biological Cybernetics, Spemannstr. 38, Tübingen, Germany

Shahar Mendelson, Institute of Advanced Studies, The Australian National University, Canberra, ACT 0200, Australia

May 14, 2004

Abstract. We propose new bounds on the error of learning algorithms in terms of a data-dependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a subset of functions with small empirical error. We present some applications to classification and prediction with convex function classes, and with kernel classes in particular.

Keywords: Error Bounds, Rademacher Averages, Data-Dependent Complexity, Concentration Inequalities.

1 Introduction

Estimating the performance of statistical procedures is useful for providing a better understanding of the factors that influence their behavior, as well as for suggesting ways to improve them. Although asymptotic analysis is a crucial first step towards understanding the behavior, finite sample error bounds are of more value as they allow the design of model selection (or parameter tuning) procedures. These error bounds typically have the following form: with high probability, the error of the estimator (typically a function in a certain class) is bounded by an empirical estimate of error plus a penalty term depending on the complexity of the class of functions that can be chosen by the algorithm. The differences between the true and empirical errors of functions in that class can then be viewed as an empirical process. Many tools have been developed for understanding the behavior of such objects, and especially for evaluating their suprema, which can be thought of as a measure of how hard it is to estimate functions in the class at hand.
The goal is thus to obtain the sharpest possible estimates on the complexity of function classes. A problem arises since the notion of complexity might depend on the (unknown) underlying probability measure according to which the data is produced. Distribution-free notions of the complexity, such as the Vapnik-Chervonenkis dimension [35] or the metric entropy [28], typically give conservative estimates. Distribution-dependent estimates, based for example on entropy numbers in the L_2(P) distance, where P is the underlying distribution, are not useful when P is unknown. Thus, it is desirable to obtain data-dependent estimates which can readily be computed from the sample. One of the most interesting data-dependent complexity estimates is the so-called Rademacher average associated with the class. Although known for a long time to be related to the expected supremum of the empirical process (thanks to symmetrization inequalities), it was first proposed as an effective complexity measure by Koltchinskii [15], Bartlett, Boucheron and Lugosi [1] and Mendelson [25], and then further studied in [3]. Unfortunately, one of the shortcomings of the Rademacher averages is that they provide global estimates of the complexity of the function class, that is, they do not reflect the fact that the algorithm will likely pick functions that have a small error, and in particular, only a small subset of the function class will be used. As a result, the best error rate that can be obtained via the global Rademacher averages is at least of the order of 1/√n (where n is the sample size), which is suboptimal in some situations. Indeed, the type of algorithms we consider here are known in the statistical literature as M-estimators. They minimize an empirical loss criterion in a fixed class of functions. They have been extensively studied and their rate of convergence is known to be related to the modulus of continuity of the empirical process associated with the class of functions (rather than to the expected supremum of that empirical process). This modulus of continuity is well understood from the empirical processes theory viewpoint (see e.g. [34] and [33]).
Also, from the point of view of M-estimators, the quantity which determines the rate of convergence is actually a fixed point of this modulus of continuity. Results of this type have been obtained by van de Geer [31, 32] (among others), who also provides non-asymptotic exponential inequalities. Unfortunately, these are in terms of entropy (or random entropy) and hence are not useful when the probability distribution is unknown. The key property that allows one to prove fast rates of convergence is the fact that around the best function in the class, the variance of the increments of the empirical process (or the L_2(P) distance to the best function) is upper bounded by a linear function of the expectation of these increments. In the context of regression with squared loss, this happens as soon as the functions are bounded and the class of functions is convex. In the context of classification, Mammen and Tsybakov have shown [20] that this also happens under conditions on the conditional distribution (especially about its behavior around 1/2). They actually do not require the relationship (between variance and expectation of the increments) to be linear but allow for more general, power type inequalities. Their results, like those of van de Geer, are asymptotic. In order to exploit this key property and have finite sample bounds, rather than considering the Rademacher averages of the entire class as the complexity measure, it is possible to consider the Rademacher averages of a small subset of the class, usually the intersection of the class with a ball centered at a function of interest. These local Rademacher averages can serve as a complexity measure; clearly, they are always smaller than the corresponding global averages. Several authors have considered the use of local estimates of the complexity of the function class, in order to obtain better bounds. Before presenting their results, we introduce some notation which is used throughout the paper. Let (X, P) be a probability space.
Denote by F a class of measurable functions from X to R, and set X_1, ..., X_n to be independent random variables distributed according to P. Let σ_1, ..., σ_n be independent Rademacher random variables, that is, independent random variables for which Pr(σ_i = 1) = Pr(σ_i = −1) = 1/2. For a function f : X → R, define the following:

P_n f = (1/n) Σ_{i=1}^n f(X_i),   P f = E f(X),   and   R_n f = (1/n) Σ_{i=1}^n σ_i f(X_i).

For a class F, set R_n F = sup_{f ∈ F} R_n f. Define E_σ to be the expectation with respect to the random variables σ_1, ..., σ_n, conditioned on all of the other random variables. The Rademacher average of F is E R_n F, and the empirical (or conditional) Rademacher averages of F are

E_σ R_n F = (1/n) E( sup_{f ∈ F} Σ_{i=1}^n σ_i f(X_i) | X_1, ..., X_n ).

Some classical properties of Rademacher averages (and some simple lemmas which we use often) are listed in Appendix A. The simplest way to obtain the property allowing for fast rates of convergence is to consider nonnegative uniformly bounded functions (or increments with respect to a fixed null function). In that case, one trivially has, for all f ∈ F,

Var[f] ≤ c P f.

This is exploited by Koltchinskii and Panchenko [16], who consider the case of prediction with absolute loss when functions in F have values in [0, 1] and there is a perfect function f* in the class, i.e. P f* = 0. They introduce an iterative method involving local empirical Rademacher averages. They first construct a function

φ_n(r) = c_1 E_σ R_n { f : P_n f ≤ 2r } + c_2 √(rx/n) + c_3/n,

which can be computed from the data. For the sequence (ˆr_k) defined by ˆr_0 = 1, ˆr_{k+1} = φ_n(ˆr_k), they show that with probability at least 1 − 2N e^{−x},

P ˆf ≤ ˆr_N + 2x/n,

where ˆf is a minimizer of the empirical error, that is, a function in F satisfying P_n ˆf = inf_{f ∈ F} P_n f. Hence, this nonincreasing sequence of local Rademacher averages can be used as upper bounds on the error of the empirical minimizer ˆf. Furthermore, if ψ is a concave function such that ψ(r) ≥ E_σ R_n { f ∈ F : P_n f ≤ r }, and if the number of iterations N is at least 1 + ⌈log_2 log_2(n/x)⌉, then with probability at least 1 − N e^{−x},

ˆr_N ≤ c ( r* + x/n ),

where r* is a solution of the fixed-point equation ψ(r) = r.
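The empirical quantities just defined are straightforward to approximate when F is finite, by Monte Carlo sampling over the Rademacher signs. The following sketch is purely illustrative (the finite class, the sample, and all sizes are our own assumptions, not from the paper): it estimates the empirical Rademacher average E_σ R_n F and a localized version restricted to { f ∈ F : P_n f ≤ r }.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite class: each row of `values` holds (f(X_1), ..., f(X_n))
# for one function f in F, precomputed on a fixed sample X_1, ..., X_n.
n, n_funcs = 200, 50
values = rng.uniform(0.0, 1.0, size=(n_funcs, n))

def empirical_rademacher(values, n_sigma=2000, rng=rng):
    """Monte Carlo estimate of E_sigma R_n F = E_sigma sup_f (1/n) sum_i sigma_i f(X_i)."""
    n = values.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_sigma, n))
    # correlations[s, f] = (1/n) sum_i sigma[s, i] * f(X_i)
    correlations = sigma @ values.T / n
    return correlations.max(axis=1).mean()

def local_empirical_rademacher(values, r, n_sigma=2000, rng=rng):
    """Same quantity restricted to the empirical ball {f in F : P_n f <= r}."""
    mask = values.mean(axis=1) <= r
    if not mask.any():
        return 0.0
    return empirical_rademacher(values[mask], n_sigma, rng)

global_avg = empirical_rademacher(values)
local_avg = local_empirical_rademacher(values, r=0.5)
```

Since the localized supremum runs over a subset of F, the local estimate never exceeds the global one (up to Monte Carlo noise), which is the basic reason local averages give sharper complexity terms.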
Combining the above results, one has a procedure to obtain data-dependent error bounds that are of the order of the fixed point of the modulus of continuity at 0 of the empirical Rademacher averages. One limitation of this result is that it assumes that there is a function f* in the class with P f* = 0. In contrast, we are interested in prediction problems where P f is the error of an estimator, and in the presence of noise there may not be any perfect estimator (even the best in the class can have non-zero error).

More recently, Bousquet, Koltchinskii and Panchenko [9] obtained a more general result avoiding the iterative procedure. Their result is that for functions with values in [0, 1], with probability at least 1 − e^{−x}, for all f ∈ F,

P f ≤ c ( P_n f + ˆr + (x + log log n)/n ),   (1.1)

where ˆr is the fixed point of a concave function ψ satisfying ψ(0) = 0 and ψ(r) ≥ E_σ R_n { f ∈ F : P_n f ≤ r }. The main difference between this and the results of [16] is that there is no requirement that the class contain a perfect function. However, the local Rademacher averages are centered around the zero function instead of the one that minimizes P f. As a consequence, the fixed point ˆr cannot be expected to converge to zero when inf_{f ∈ F} P f > 0. In order to remove this limitation, Lugosi and Wegkamp [19] use localized Rademacher averages of a small ball around the minimizer ˆf of P_n. However, their result is restricted to nonnegative functions, and in particular functions with values in {0, 1}. Moreover, their bounds also involve some global information, in the form of the shatter coefficients S_F(X_1^n) of the function class (that is, the cardinality of the coordinate projections of the class F on the data X_1^n). They show that there are constants c_1, c_2 such that, with probability at least 1 − 8/n, the empirical minimizer ˆf satisfies

P ˆf ≤ inf_{f ∈ F} P f + 2 ψ_n(ˆr_n),

where

ψ_n(r) = c_1 ( E_σ R_n { f ∈ F : P_n f ≤ 16 P_n ˆf + 15r } + √( (P_n ˆf + r) log n / n ) + (log n)/n )

and ˆr_n = c_2 (log S_F(X_1^n) + log n)/n. The limitation of this result is that ˆr_n has to be chosen according to the (empirically measured) complexity of the whole class, which may not be as sharp as the Rademacher averages, and in general, is not a fixed point of ψ_n. Moreover, the balls over which the Rademacher averages are computed in ψ_n contain a factor of 16 in front of P_n ˆf. As we explain later, this induces a lower bound on ψ_n when there is no function with P f = 0 in the class. It seems that the only way to capture the right behavior in the general, noisy case is to analyze the increments of the empirical process, in other words, to directly consider the functions f − f*.
This approach was first proposed by Massart [22]; see also [26]. Massart introduces the following assumption:

Var[l_f(X) − l_{f*}(X)] ≤ d²(f, f*) ≤ B (P l_f − P l_{f*}),

where l_f is the loss associated with the function f (in other words, l_f(X, Y) = l(f(X), Y), which measures the discrepancy in the prediction made by f),¹ d is a pseudometric, and f* minimizes the expected loss. This is a more refined version of the assumption we mentioned earlier on the relationship between the variance and expectation of the increments of the empirical process. It is only satisfied for some loss functions l and function classes F. Under this assumption, Massart considers a nondecreasing function ψ satisfying

ψ(r) ≥ E sup_{f ∈ F, d²(f, f*) ≤ 2r} ( P_n l_f − P l_f − P_n l_{f*} + P l_{f*} ) + c x/n,

such that ψ(r)/√r is nonincreasing (we refer to this property as the sub-root property later in the paper). Then, with probability at least 1 − e^{−x}, for all f ∈ F,

P l_f − P l_{f*} ≤ c ( r* + x/n ),   (1.2)

where r* is the fixed point of ψ and c depends only on B and on the uniform bound on the range of functions in F. It can be proved that in many situations of interest, this bound suffices to prove minimax rates of convergence for penalized M-estimators. (Massart considers examples where the complexity term can be bounded using a priori global information about the function class.) However, the main limitation of this result is that it does not involve quantities that can be computed from the data. Finally, as we mentioned earlier, Mendelson [26] gives an analysis similar to that of Massart, in a slightly less general case (with no noise in the target values, i.e. the conditional distribution of Y given X is concentrated at one point). Mendelson introduces the notion of the star-hull of a class of functions (see the next section for a definition) and considers Rademacher averages of this star-hull as a localized measure of complexity. His results also involve a priori knowledge of the class, such as the rate of growth of covering numbers. We can now spell out our goal in more detail: in this paper we combine the increment-based approach of Massart and Mendelson (dealing with differences of functions, or more generally with bounded real-valued functions) with the empirical local Rademacher approach of Koltchinskii and Panchenko and of Lugosi and Wegkamp, in order to obtain data-dependent bounds which depend on a fixed point of the modulus of continuity of Rademacher averages computed around the empirically best function. Our first main result (Theorem 3.3) is a distribution-dependent result involving the fixed point r* of a local Rademacher average of the star-hull of the class F. This shows that functions with the sub-root property can readily be obtained from Rademacher averages, while in previous work the appropriate functions were obtained only via global information about the class.

¹ The previous results could also be stated in terms of loss functions, but we omitted this in order to simplify exposition. However, the extra notation is necessary to properly state Massart's result.
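The relationship between variance and expectation of the increments can be verified in closed form in a toy case. The following sketch is our own illustration, not an example from the paper: predict a constant c ∈ [0, 1] for a {0, 1}-valued label Y with P(Y = 1) = p, under the squared loss. The class of constant predictions is convex, the best constant is c* = p, and the increment l_c − l_{c*} satisfies Var[l_c − l_{c*}] = 4p(1 − p) · P(l_c − l_{c*}), so the condition holds with B = 4p(1 − p).

```python
import numpy as np

# Hypothetical setup: constant prediction c in [0,1] for a {0,1}-valued label Y
# with P(Y = 1) = p, under the squared loss l_c(Y) = (c - Y)^2.  The best
# predictor in this convex class is c* = p.
p = 0.3
cs = np.linspace(0.0, 1.0, 101)

# Exact moments of the increment l_c - l_{c*} = (c - p)(c + p - 2Y):
excess = (cs - p) ** 2                      # P(l_c - l_{c*})
variance = (cs - p) ** 2 * 4 * p * (1 - p)  # Var[l_c - l_{c*}]

B = 4 * p * (1 - p)  # constant in the condition Var <= B * P(l_c - l_{c*})
```

Here the variance of the increment is exactly proportional to its expectation, the ideal case of the linear relationship discussed above; for more general losses one only gets an inequality, or the power-type versions of Mammen and Tsybakov.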
The second main result (Theorems 4.1 and 4.2) is an empirical counterpart of the first one, where the complexity is the fixed point of an empirical local Rademacher average. We also show that this fixed point is within a constant factor of the non-empirical one. Equipped with this result, we can then prove (Theorem 5.4) a fully data-dependent analogue of Massart's result, where the Rademacher averages are localized around the minimizer of the empirical loss. We also show (Theorem 6.3) that in the context of classification, the local Rademacher averages of star-hulls can be approximated by solving a weighted empirical error minimization problem. Our final result (Corollary 6.7) concerns regression with kernel classes, that is, classes of functions that are generated by a positive definite kernel. These classes are widely used in interpolation and estimation problems as they yield computationally efficient algorithms. Our result gives a data-dependent complexity term that can be computed directly from the eigenvalues of the Gram matrix (the matrix whose entries are values of the kernel on the data). The sharpness of our results is demonstrated by the fact that we recover, in the distribution-dependent case (treated in Section 4), results similar to those of Massart [22], which, in the situations where they apply, give the minimax optimal rates or the best known results. Moreover, the data-dependent bounds that we obtain as counterparts of these results have the same rate of convergence (see Theorem 4.2).
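For kernel classes, the complexity term is computable from the spectrum of the Gram matrix. The sketch below is our illustration only (the Gaussian kernel, the synthetic data, and the exact form of the localized spectral quantity are assumptions, not the statement of Corollary 6.7): it computes the eigenvalues of the normalized Gram matrix and a localized quantity of the form √((2/n) Σ_j min(r, λ_j)), which is sub-root in r.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data and a Gaussian (RBF) kernel; K[i, j] = k(X_i, X_j).
n = 100
X = rng.normal(size=(n, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)

# Eigenvalues of the normalized Gram matrix (nonnegative, since K is PSD).
eigvals = np.clip(np.linalg.eigvalsh(K / n), 0.0, None)

def kernel_complexity(r):
    """A localized, data-dependent spectral complexity term of the form
    sqrt((2/n) * sum_j min(r, lambda_j)): small eigenvalues contribute fully,
    large ones are truncated at the localization radius r."""
    return np.sqrt(2.0 / n * np.minimum(r, eigvals).sum())
```

Note that since k(x, x) = 1 here, the eigenvalues of K/n sum to 1, and min(4r, λ) ≤ 4 min(r, λ) gives the sub-root property kernel_complexity(4r) ≤ 2 kernel_complexity(r) directly.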

The paper is organized as follows. In Section 2, we present some preliminary results obtained from concentration inequalities, which we use throughout. Section 3 establishes error bounds using local Rademacher averages and explains how to compute their fixed points from global information (e.g. estimates of the metric entropy or of the combinatorial dimensions of the indexing class), in which case the optimal estimates can be recovered. In Section 4, we give a data-dependent error bound using empirical and local Rademacher averages, and show the connection between the fixed points of the empirical and non-empirical Rademacher averages. In Section 5, we apply our results to loss classes. We give estimates that generalize the results of Koltchinskii and Panchenko by eliminating the requirement that some function in the class have zero loss, and are more general than those of Lugosi and Wegkamp, since there is no need, in our case, to estimate global shatter coefficients of the class. We also give a data-dependent extension of Massart's result where the local averages are computed around the minimizer of the empirical loss. Finally, Section 6 shows that the problem of estimating these local Rademacher averages in classification reduces to weighted empirical risk minimization. It also shows that the local averages for kernel classes can be sharply bounded in terms of the eigenvalues of the Gram matrix.

2 Preliminary Results

Recall that the star-hull of F around f_0 is defined by

star(F, f_0) = { f_0 + α(f − f_0) : f ∈ F, α ∈ [0, 1] }.

Throughout this paper, we will manipulate suprema of empirical processes, that is, quantities of the form sup_{f ∈ F} (P f − P_n f). We will always assume they are measurable without explicitly mentioning it. In other words, we assume that the class F and the distribution P satisfy appropriate (mild) conditions for measurability of this supremum (we refer to [11, 28] for a detailed account of such issues). The following theorem is the main result of this section and is at the core of all the proofs presented later.
It shows that if the functions in a class have small variance, the maximal deviation between empirical means and true means is controlled by the Rademacher averages of F. In particular, the bound improves as the largest variance of a class member decreases.

Theorem 2.1 Let F be a class of functions that map X into [a, b]. Assume that there is some r > 0 such that for every f ∈ F, Var[f(X_i)] ≤ r. Then, for every x > 0, with probability at least 1 − e^{−x},

sup_{f ∈ F} (P f − P_n f) ≤ inf_{α > 0} ( 2(1 + α) E R_n F + √(2rx/n) + (b − a)(1/3 + 1/α) x/n ),

and with probability at least 1 − 2e^{−x},

sup_{f ∈ F} (P f − P_n f) ≤ inf_{α ∈ (0,1)} ( 2 (1 + α)/(1 − α) E_σ R_n F + √(2rx/n) + (b − a)( 1/3 + 1/α + (1 + α)/(2α(1 − α)) ) x/n ).

Moreover, the same results hold for the quantity sup_{f ∈ F} (P_n f − P f).

This theorem, which is proved in Appendix B, is a more or less direct consequence of Talagrand's inequality for empirical processes [30]. However, the actual statement presented here is new in the sense that it displays the best known constants. Indeed, compared to the previous result of Koltchinskii and Panchenko [16], which was based on Massart's version of Talagrand's inequality [21], we have used the most refined concentration inequalities available: that of Bousquet [7] for the supremum of the empirical process and that of Boucheron, Lugosi and Massart [5] for the Rademacher averages. This last inequality is a powerful tool to obtain data-dependent bounds, since it allows one to replace the Rademacher average (which measures the complexity of the class of functions) by its empirical version, which can be efficiently computed in some cases. Details about these inequalities are given in Appendix A. When applied to the full function class F, the above theorem is not useful. Indeed, with only a trivial bound on the maximal variance, better results can be obtained via simpler concentration inequalities, such as the bounded difference inequality [23], which would allow √(rx/n) to be replaced by √(x/n). However, by applying Theorem 2.1 to subsets of F or to modified classes obtained from F, much better results can be obtained. Hence, the presence of an upper bound on the variance in the square root term is the key ingredient of this result. A last preliminary result that we will require is the following consequence of Theorem 2.1, which shows that if the local Rademacher averages are small, then balls in L_2(P) are probably contained in the corresponding empirical balls (that is, in L_2(P_n)) with a slightly larger radius.

Corollary 2.2 Let F be a class of functions that map X into [−b, b] with b > 0. For every x > 0 and every r that satisfies

r ≥ 10 b E R_n { f : f ∈ F, P f² ≤ r } + 11 b² x / n,

with probability at least 1 − e^{−x},

{ f ∈ F : P f² ≤ r } ⊆ { f ∈ F : P_n f² ≤ 2r }.
Proof: Since the range of any function in the set F_r = { f² : f ∈ F, P f² ≤ r } is contained in [0, b²], it follows that Var[f²(X_i)] ≤ P f⁴ ≤ b² P f² ≤ b² r. Thus, by the first part of Theorem 2.1 (with α = 1/4), with probability at least 1 − e^{−x}, every f ∈ F_r satisfies

P_n f² ≤ P f² + (5/2) E R_n { f² : f ∈ F, P f² ≤ r } + √(2 b² r x / n) + 13 b² x / (3n)
≤ r + (5/2) E R_n { f² : f ∈ F, P f² ≤ r } + r/2 + b² x/n + 13 b² x / (3n)
≤ (3/2) r + 5 b E R_n { f : f ∈ F, P f² ≤ r } + (16/3) b² x / n
≤ 2r,

where the second inequality follows from Lemma A.3, and we have used, in the second last inequality, Theorem A.6 (applied to φ(x) = x², with Lipschitz constant 2b on [−b, b]).

3 Error Bounds with Local Complexity

In this section, we show that the Rademacher averages associated with a small subset of the class may be considered as a complexity term in an error bound. Since these local Rademacher averages are always smaller than the corresponding global averages, they lead to sharper bounds.

We present a general error bound involving local complexities that is applicable to classes of bounded functions for which the variance is bounded by a fixed linear function of the expectation. In this case, the local Rademacher averages are defined as E R_n { f ∈ F : T(f) ≤ r }, where T(f) is an upper bound on the variance (typically chosen as T(f) = P f²). There is a trade-off between the size of the subset we consider in these local averages and its complexity; we shall see that the optimal choice is given by a fixed point of an upper bound on the local Rademacher averages. The functions we use as upper bounds are sub-root functions; among other useful properties, sub-root functions have a unique fixed point.

Definition 3.1 A function ψ : [0, ∞) → [0, ∞) is sub-root if it is nonnegative, nondecreasing, and if r ↦ ψ(r)/√r is nonincreasing for r > 0.

We only consider nontrivial sub-root functions, that is, sub-root functions that are not the constant function ψ ≡ 0.

Lemma 3.2 If ψ : [0, ∞) → [0, ∞) is a nontrivial sub-root function, then it is continuous on [0, ∞) and the equation ψ(r) = r has a unique positive solution. Moreover, if we denote the solution by r*, then for all r > 0, r ≥ ψ(r) if and only if r* ≤ r.

The proof of this lemma is in Appendix B. In view of the lemma, we will simply refer to the quantity r* as the unique positive solution of ψ(r) = r, or as the fixed point of ψ.

3.1 Error Bounds

We can now state and discuss the main result of this section. It is composed of two parts: in the first part, one requires a sub-root upper bound on the local Rademacher averages, and in the second part, it is shown that better results can be obtained when the class over which the averages are computed is slightly enlarged.

Theorem 3.3 Let F be a class of functions with ranges in [a, b] and assume that there are some functional T : F → R⁺ and some constant B such that for every f ∈ F, Var[f] ≤ T(f) ≤ B P f. Let ψ be a sub-root function and let r* be the fixed point of ψ.

1) Assume that ψ satisfies, for any r ≥ r*, ψ(r) ≥ B E R_n { f ∈ F : T(f) ≤ r }. Then, with c_1 = 704 and c_2 = 26, for any K > 1 and every x > 0, with probability at least 1 − e^{−x},

for all f ∈ F,   P f ≤ (K/(K − 1)) P_n f + (c_1 K / B) r* + x (11 (b − a) + c_2 B K) / n.

Also, with probability at least 1 − e^{−x},

for all f ∈ F,   P_n f ≤ ((K + 1)/K) P f + (c_1 K / B) r* + x (11 (b − a) + c_2 B K) / n.

2) If, in addition, for f ∈ F and α ∈ [0, 1], T(α f) ≤ α² T(f), and if ψ satisfies, for any r ≥ r*, ψ(r) ≥ B E R_n { f ∈ star(F, 0) : T(f) ≤ r }, then the same results hold true with c_1 = 6 and c_2 = 5.
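Lemma 3.2 also suggests a simple numerical recipe for the quantity r* appearing in the bound: since a nontrivial sub-root ψ satisfies ψ(r) > r below the fixed point and ψ(r) < r above it, iterating r ← ψ(r) converges to r*. A minimal sketch with a hypothetical sub-root function (the specific ψ is ours, chosen only because its fixed point has a closed form):

```python
import math

def fixed_point(psi, r0=1.0, tol=1e-12, max_iter=10000):
    """Iterate r <- psi(r); for a nontrivial sub-root psi this converges to the
    unique positive solution of psi(r) = r (Lemma 3.2)."""
    r = r0
    for _ in range(max_iter):
        r_next = psi(r)
        if abs(r_next - r) <= tol:
            return r_next
        r = r_next
    return r

# Hypothetical sub-root function psi(r) = a*sqrt(r) + b with a, b > 0: it is
# nonnegative, nondecreasing, and psi(r)/sqrt(r) = a + b/sqrt(r) is nonincreasing.
a, b = 0.5, 0.01
r_star = fixed_point(lambda r: a * math.sqrt(r) + b)
```

For this ψ the fixed point solves r = a√r + b, i.e. √r* = (a + √(a² + 4b))/2, which the iteration recovers; upper bounds on local Rademacher averages of the form a√r + b/n arise naturally, e.g. from Theorem 2.1.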

The proof of this theorem is given in Section 3.2. We can compare the results to our starting point (Theorem 2.1). The improvement comes from the fact that the complexity term, which was essentially sup_r ψ(r) in Theorem 2.1 (if we had applied it to the class F directly), is now reduced to r*, the fixed point of ψ. So the complexity term is always smaller (later, we show how to estimate r*). On the other hand, there is some loss since the constant in front of P_n f is strictly larger than one. Section 5.2 will show that this is not an issue in the applications we have in mind. In Sections 5.1 and 5.2, we investigate conditions that ensure the assumptions of this theorem are satisfied, and we provide applications of this result to prediction problems. The condition that the variance is upper bounded by the expectation turns out to be crucial to obtain these results. The idea behind Theorem 3.3 originates in the work of Massart [22], who proves a slightly different version of the first part. The difference is that we use local Rademacher averages instead of the expectation of the supremum of the empirical process on a ball. Moreover, we give smaller constants. As far as we know, the second part of Theorem 3.3 is new.

3.1.1 Choosing the Function ψ

Notice that the function ψ cannot be chosen arbitrarily and has to satisfy the sub-root property. One possible approach is to use classical upper bounds on the Rademacher averages, such as Dudley's entropy integral. This can give a sub-root upper bound and was used, for example, in [16] and in [22]. However, the second part of Theorem 3.3 indicates a possible choice for ψ, namely, one can take ψ as the local Rademacher averages of the star-hull of F around 0. The reason for this comes from the following lemma, which shows that if the class is star-shaped and T(f) behaves as a quadratic function, then the Rademacher averages are sub-root.
Lemma 3.4 If the class F is star-shaped around ˆf (which may depend on the data), and T : F → R⁺ is a (possibly random) function that satisfies T(α f) ≤ α² T(f) for any f ∈ F and any α ∈ [0, 1], then the (random) function ψ defined for r ≥ 0 by

ψ(r) = E_σ R_n { f ∈ F : T(f − ˆf) ≤ r }

is sub-root, and r ↦ E ψ(r) is also sub-root.

This lemma is proved in Appendix B. Notice that making a class star-shaped only increases it, so that

E R_n { f ∈ star(F, f_0) : T(f) ≤ r } ≥ E R_n { f ∈ F : T(f) ≤ r }.

However, this increase in size is moderate, as can be seen, for example, if one compares covering numbers of a class and its star-hull (see, for example, [26], Lemma 4.5).

3.1.2 Some Consequences

As a consequence of Theorem 3.3, we obtain an error bound when F consists of uniformly bounded nonnegative functions. Notice that in this case, the variance is trivially bounded by a constant times the expectation and one can directly use T(f) = P f.

Corollary 3.5 Let F be a class of functions with ranges in [0, 1]. Let ψ be a sub-root function such that for all r ≥ 0, E R_n { f ∈ F : P f ≤ r } ≤ ψ(r), and let r* be the fixed point of ψ. Then, for any K > 1 and every x > 0, with probability at least 1 − e^{−x}, every f ∈ F satisfies

P f ≤ (K/(K − 1)) P_n f + 704 K r* + x (11 + 26 K)/n.

Also, with probability at least 1 − e^{−x}, every f ∈ F satisfies

P_n f ≤ ((K + 1)/K) P f + 704 K r* + x (11 + 26 K)/n.

Proof: When f ∈ [0, 1], we have Var[f] ≤ P f, so that the result follows from applying Theorem 3.3 with T(f) = P f.

We also note that the same idea as in the proof of Theorem 3.3 gives a converse of Corollary 2.2, namely, that with high probability, the intersection of F with an empirical ball of a fixed radius is contained in the intersection of F with an L_2(P) ball with a slightly larger radius.

Lemma 3.6 Let F be a class of functions that map X into [−1, 1]. Fix x > 0. If

r ≥ 20 E R_n { f : f ∈ star(F, 0), P f² ≤ r } + 26 x/n,

then with probability at least 1 − e^{−x},

{ f ∈ star(F, 0) : P_n f² ≤ r } ⊆ { f ∈ star(F, 0) : P f² ≤ 2r }.

This result, proved in Section 3.2, will be useful later in the paper.

3.1.3 Estimating r* from Global Information

The error bounds involve fixed points of functions that define upper bounds on the local Rademacher averages. In some cases, these fixed points can be estimated from global information on the function class. We present a complete analysis only in a simple case, where F is a class of binary-valued functions with a finite VC dimension.

Corollary 3.7 Let F be a class of {0, 1}-valued functions with VC dimension d < ∞. Then for all K > 1 and every x > 0, with probability at least 1 − e^{−x}, every f ∈ F satisfies

P f ≤ (K/(K − 1)) P_n f + c K ( (d log(n/d))/n + x/n ).

The proof is in Appendix B. The above result is similar to results obtained by Vapnik and Chervonenkis [35] and by Lugosi and Wegkamp (Theorem 3.1 of [19]). However, they used inequalities for weighted empirical processes indexed by nonnegative functions. Our results have more flexibility since they can accommodate general functions, although this is not needed in this simple corollary. The proof uses a similar line of reasoning to proofs in [26, 27].
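To see the shape of the Corollary 3.7 bound concretely, the following sketch evaluates its right-hand side as a function of the sample size. The constant c is unspecified in the statement, so c = 1 below is a placeholder of ours, not the paper's constant:

```python
import math

def vc_local_bound(emp_risk, n, d, x, K=2.0, c=1.0):
    """Right-hand side of the Corollary 3.7 shape:
    K/(K-1) * P_n f + c*K*(d*log(n/d)/n + x/n).
    c = 1.0 is a placeholder, not the constant from the paper."""
    return K / (K - 1) * emp_risk + c * K * (d * math.log(n / d) / n + x / n)

bound_small_n = vc_local_bound(emp_risk=0.10, n=1000, d=10, x=5.0)
bound_large_n = vc_local_bound(emp_risk=0.10, n=100000, d=10, x=5.0)
```

The penalty decays at the rate d log(n/d)/n, faster than the 1/√n rate available from global Rademacher averages, which is the point of localizing.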
Clearly, it extends to any class of real-valued functions for which one has estimates for the entropy integral, such as classes with finite pseudo-dimension or a combinatorial dimension that grows more slowly than quadratically. See [26, 27] for more details. Notice also that the rate of d log(n/d)/n is the best known.

3.1.4 Proof Techniques

Before giving the proofs of the results mentioned above, let us sketch the techniques we use. The approach has its roots in classical empirical processes theory, where it was understood that the modulus of continuity of the empirical process is an important quantity (here, ψ plays this role). In order to obtain non-asymptotic results, two approaches have been developed: the first one consists of cutting the class F into smaller pieces, where one has control of the variance of the elements. This is the so-called peeling technique (see, for example, [31, 33, 34, 32] and references therein). The second approach consists of weighting the functions in F by dividing them by their variance. Many results have been obtained on such weighted empirical processes (see, for example, [28]). The results of Vapnik and Chervonenkis based on weighting [35] are restricted to classes of nonnegative functions. Also, most previous results, such as those of Pollard [28], van de Geer [32] or Haussler [13], give complexity terms that involve global measures of complexity of the class, such as covering numbers. None of these results use the recently introduced Rademacher averages as measures of complexity. It turns out that it is possible to combine the peeling and weighting ideas with concentration inequalities to obtain such results, as proposed by Massart in [22], and also used (for nonnegative functions) by Koltchinskii and Panchenko [16]. The idea is the following: first, apply Theorem 2.1 to the class of functions { f / w(f) : f ∈ F }, where w is some nonnegative weight of the order of the variance of f, so that the functions in this class have a small variance. Second, upper bound the Rademacher averages of this weighted class, by peeling off subclasses of F according to the variance of their elements, and bounding the Rademacher averages of these subclasses using ψ. Third, use the sub-root property of ψ, so that its fixed point gives a common upper bound on the complexity of all the subclasses (up to some scaling).
Finally, convert the upper bound for functions in the weighted class into a bound for functions in the initial class. The idea of peeling (that is, of partitioning the class F into slices where functions have variance within a certain range) is at the core of the proof of the first part of Theorem 3.3 (see, for example, Equation (3.1)). However, it does not appear explicitly in the proof of the second part. One explanation is that when one considers the star-hull of the class, it is enough to consider two subclasses: the functions with T(f) ≤ r and the ones with T(f) > r, and this is done by introducing the weighting factor T(f) ∨ r. This idea was exploited in the work of Mendelson [26] and, more recently, in [4]. Moreover, when one considers the set F_r = star(F, 0) ∩ { T(f) ≤ r }, any function f ∈ F with T(f) > r will have a scaled down representative in that set. So even though it seems that we look at the class star(F, 0) only locally, we still take into account all of the functions in F (with appropriate scaling).

3.2 Proofs

Before presenting the proof, let us first introduce some additional notation. Given a class F, λ > 1, and r > 0, let

w(f) = min { r λ^k : k ∈ N, r λ^k ≥ T(f) }

and set

G_r = { (r / w(f)) f : f ∈ F }.
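The weighting w(f) just defined is easy to compute for a given variance surrogate T(f). A small sketch (the numeric values of r, λ and T(f) are arbitrary illustrations of ours):

```python
import math

def weight(T_f, r, lam):
    """w(f) = min{ r*lam^k : k = 0, 1, 2, ..., r*lam^k >= T(f) } for T_f > 0,
    r > 0, lam > 1.  For T(f) <= r the weight is r, so g = r*f/w(f) = f;
    otherwise f is scaled down by lam^(-k), which puts g into star(F, 0)."""
    k = max(0, math.ceil(math.log(T_f / r, lam)))
    return r * lam ** k

r, lam = 0.1, 4.0
w_small = weight(0.05, r, lam)   # T(f) <= r: no scaling, w = r
w_large = weight(0.25, r, lam)   # T(f) > r: w = r*lam = 0.4, so g = f/4
```

By construction w(f) ≥ T(f) and w(f) < λ max(T(f), r), so the rescaled function g = r f / w(f) has variance at most r while f is shrunk by at most a factor λ relative to its own variance scale; this is exactly what lets the peeling argument bound all slices by ψ at comparable radii.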

Notice that w(f) ≥ r, so that G_r ⊆ { α f : f ∈ F, α ∈ [0, 1] } = star(F, 0). Define

V_r^+ = sup_{g ∈ G_r} (P g − P_n g)   and   V_r^− = sup_{g ∈ G_r} (P_n g − P g).

For the second part of the theorem, we need to introduce another class of functions,

G̃_r := { r f / (T(f) ∨ r) : f ∈ F },

and define

Ṽ_r^+ = sup_{g ∈ G̃_r} (P g − P_n g)   and   Ṽ_r^− = sup_{g ∈ G̃_r} (P_n g − P g).

Lemma 3.8 With the above notation, assume that there is a constant B > 0 such that for every f ∈ F, T(f) ≤ B P f. Fix K > 1, λ > 0 and r > 0. If V_r^+ ≤ r/(λBK), then

for all f ∈ F,   P f ≤ (K/(K − 1)) P_n f + r/(λBK).

Also, if V_r^− ≤ r/(λBK), then

for all f ∈ F,   P_n f ≤ ((K + 1)/K) P f + r/(λBK).

Similarly, if K > 1 and r > 0 are such that Ṽ_r^+ ≤ r/(BK), then

for all f ∈ F,   P f ≤ (K/(K − 1)) P_n f + r/(BK).

Also, if Ṽ_r^− ≤ r/(BK), then

for all f ∈ F,   P_n f ≤ ((K + 1)/K) P f + r/(BK).

Proof: Fix f ∈ F and define g = r f / w(f). Notice that for all g ∈ G_r, P g ≤ P_n g + V_r^+. When T(f) ≤ r, w(f) = r, so that g = f. Thus, the fact that P g ≤ P_n g + V_r^+ implies that P f ≤ P_n f + V_r^+ ≤ P_n f + r/(λBK). On the other hand, if T(f) > r, then w(f) = r λ^k with k > 0 and T(f) ∈ (r λ^{k−1}, r λ^k]. Moreover, g = f/λ^k, P g ≤ P_n g + V_r^+, and thus

(P f)/λ^k ≤ (P_n f)/λ^k + V_r^+.

Using the fact that T(f) > r λ^{k−1}, it follows that

P f ≤ P_n f + λ^k V_r^+ < P_n f + λ T(f) V_r^+ / r ≤ P_n f + P f / K.

Rearranging,

P f ≤ (K/(K − 1)) P_n f < (K/(K − 1)) P_n f + r/(λBK).

The proof of the second result is similar. For the third and fourth results, the reasoning is the same.

Proof of Theorem 3.3, first part: Let G_r be defined as above, where r is chosen such that r ≥ r*, and note that functions in G_r satisfy ‖g − P g‖_∞ ≤ b − a, since 0 ≤ r/w(f) ≤ 1. Also, we have Var[g] ≤ r. Indeed, if T(f) ≤ r, then g = f, and thus Var[g] = Var[f] ≤ r. Otherwise, when T(f) > r, g = f/λ^k (where k is such that T(f) ∈ (r λ^{k−1}, r λ^k]), so that Var[g] = Var[f]/λ^{2k} ≤ r. Applying Theorem 2.1, for all x > 0, with probability 1 − e^{−x},

V_r^+ ≤ 2(1 + α) E R_n G_r + √(2rx/n) + (b − a)(1/3 + 1/α) x/n.

Let F(x, y) := { f ∈ F : x ≤ T(f) ≤ y } and define k to be the smallest integer such that r λ^{k+1} ≥ B b. Then

E R_n G_r ≤ E R_n F(0, r) + E sup_{f ∈ F(r, Bb)} (r/w(f)) R_n f
≤ E R_n F(0, r) + Σ_{j=0}^{k} λ^{−j} E sup_{f ∈ F(r λ^j, r λ^{j+1})} R_n f
≤ ψ(r)/B + (1/B) Σ_{j=0}^{k} λ^{−j} ψ(r λ^{j+1}).   (3.1)

By our assumption it follows that for β ≥ 1, ψ(βr) ≤ √β ψ(r). Hence,

E R_n G_r ≤ (ψ(r)/B) ( 1 + √λ Σ_{j=0}^{k} λ^{−j/2} ),

and taking λ = 4, the right-hand side is upper bounded by 5 ψ(r)/B. Moreover, for r ≥ r*, ψ(r) ≤ √(r/r*) ψ(r*) = √(r r*), and thus

V_r^+ ≤ 10(1 + α) √(r r*)/B + √(2rx/n) + (b − a)(1/3 + 1/α) x/n.

Set A = 10(1 + α) √(r*)/B + √(2x/n) and C = (b − a)(1/3 + 1/α) x/n, and note that V_r^+ ≤ A √r + C. We now show that r can be chosen such that V_r^+ ≤ r/(λBK). Indeed, consider the largest solution r_0 of A √r + C = r/(λBK). It satisfies r_0 ≥ λ² A² B² K²/2 ≥ r* and r_0 ≤ (λBK)² A² + 2λBKC, so that applying Lemma 3.8, it follows that every f ∈ F satisfies

P f ≤ (K/(K − 1)) P_n f + λBK A² + 2C
= (K/(K − 1)) P_n f + λBK ( 100(1 + α)² r*/B² + 2x/n + 20(1 + α) √(2 x r*/n)/B ) + 2(b − a)(1/3 + 1/α) x/n.

Setting α = 1/10 and using Lemma A.3 (to show that √(2 x r*/n) ≤ B x/(5n) + 5 r*/(2B)) completes the proof of the first statement. The second statement is proved in the same way, by considering V_r^− instead of V_r^+.

Proof of Theorem 3.3, second part: The proof of this result uses the same argument as for the first part. However, we consider the class G̃_r defined above. One can easily check that G̃_r ⊆ {f ∈ star(F, 0) : T(f) ≤ r}, and thus E R_n G̃_r ≤ ψ(r)/B. Applying Theorem 2.1 to G̃_r, it follows that, for all x > 0, with probability 1 − e^{−x},

Ṽ_r^+ ≤ 2(1+α) ψ(r)/B + √(2rx/n) + (b−a)(1/3 + 1/α) x/n.

The reasoning is then the same as for the first part, and we use in the very last step that √(2xr*/n) ≤ Bx/n + r*/(2B), which gives the displayed constants.

Proof of Lemma 3.6: The map α ↦ α² is Lipschitz with constant 2 when α is restricted to [−1, 1]. Applying Theorem A.6, the assumption of the lemma implies

r ≥ 10 E R_n {f² : f ∈ star(F, 0), P f² ≤ r} + 26x/n.   (3.2)

Clearly, if f ∈ F, then f² maps to [0, 1] and Var[f²] ≤ P f². Thus, Theorem 2.1 can be applied to the class G_r = {r f²/(P f² ∨ r) : f ∈ F}, whose functions have range in [0, 1] and variance bounded by r. Therefore, with probability at least 1 − e^{−x}, every f ∈ F satisfies

(r/(P f² ∨ r)) (P f² − P_n f²) ≤ 2(1+α) E R_n G_r + √(2rx/n) + (1/3 + 1/α) x/n.

Select α = 1/4 and notice that √(2rx/n) ≤ r/4 + 2x/n to get

(r/(P f² ∨ r)) (P f² − P_n f²) ≤ (5/2) E R_n G_r + r/4 + 19x/(3n).

Hence, one either has P f² ≤ r, or, when P f² ≥ r, since it was assumed that P_n f² ≤ r,

P f² ≤ r + (P f²/r) ((5/2) E R_n G_r + r/4 + 19x/(3n)).

Now, if g ∈ G_r, there exists f_0 ∈ F such that g = r f_0²/(P f_0² ∨ r). If P f_0² ≤ r, then g = f_0². On the other hand, if P f_0² > r, then g = r f_0²/P f_0² = f_1², with f_1 = √(r/P f_0²) f_0 ∈ star(F, 0) and P f_1² ≤ r, which shows that

E R_n G_r ≤ E R_n {f² : f ∈ star(F, 0), P f² ≤ r}.

Thus, by Inequality (3.2), (5/2) E R_n G_r + r/4 + 19x/(3n) ≤ r/2, so that P f² ≤ r + P f²/2, that is, P f² ≤ 2r, which concludes the proof.

4 Data-Dependent Error Bounds

The results presented thus far use distribution-dependent measures of the complexity of the class at hand. Indeed, the sub-root function ψ of Theorem 3.3 is bounded in terms of the Rademacher averages of the star-hull of F, but these averages can only be computed if one knows the distribution P. Otherwise, we have seen that it is possible to compute an upper bound on the Rademacher averages using a priori global or distribution-free knowledge about the complexity of the class at hand (such as the VC dimension).
In this section, we present error bounds that can be computed directly from the data, without a priori information. Instead of computing ψ, we compute an estimate, ψ̂_n, of it. The function ψ̂_n is defined using the data and is an upper bound on ψ with high probability. To simplify the exposition, we restrict ourselves to the case where the functions have a range which is symmetric around zero, say [−1, 1]. Moreover, we can only treat the special case where T(f) = P f², but this is a minor restriction, as in most applications this is the function of interest (i.e., for which one can show T(f) ≤ B P f).

4.1 Results

We now present the main result of this section, which gives an analogue of the second part of Theorem 3.3, with a completely empirical bound (that is, the bound can be computed from the data only).

Theorem 4.1 Let F be a class of functions with ranges in [−1, 1] and assume that there is some constant B such that for every f ∈ F, P f² ≤ B P f. Let ψ̂_n be a sub-root function and let r̂* be the fixed point of ψ̂_n. Fix x > 0 and assume that ψ̂_n satisfies, for any r ≥ r̂*,

ψ̂_n(r) ≥ c_1 E_σ R_n {f ∈ star(F, 0) : P_n f² ≤ 2r} + c_2 x/n,

where c_1 = 2(10 ∨ B) and c_2 = c_1 + 11. Then, for any K > 1, with probability at least 1 − 3e^{−x},

∀ f ∈ F,   P f ≤ (K/(K−1)) P_n f + 6K r̂*/B + x(11 + 5BK)/n.

Also, with probability at least 1 − 3e^{−x},

∀ f ∈ F,   P_n f ≤ ((K+1)/K) P f + 6K r̂*/B + x(11 + 5BK)/n.

Although these are data-dependent bounds, they are not necessarily easy to compute. There are, however, favorable interesting situations where they can be computed efficiently, as Section 6 shows. It is natural to wonder how close the quantity r̂* appearing in the above theorem is to the quantity r* of Theorem 3.3. The next theorem shows that they are close with high probability.

Theorem 4.2 Let F be a class of functions with ranges in [−1, 1]. Fix x > 0 and consider the sub-root functions

ψ(r) = E R_n {f ∈ star(F, 0) : P f² ≤ r}

and

ψ̂_n(r) = c_1 E_σ R_n {f ∈ star(F, 0) : P_n f² ≤ 2r} + c_2 x/n,

with fixed points r* and r̂* respectively, and with c_1 = 2(10 ∨ B) and c_2 = c_1 + 11. Assume that r* ≥ c_3 x/n, where c_3 = 26 ∨ (c_2 + 2c_1)/3. Then, with probability at least 1 − 4e^{−x},

r* ≤ r̂* ≤ 9(1 + c_1)² r*.
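The quantities r̂* in Theorems 4.1 and 4.2 are defined only implicitly, as fixed points of sub-root functions. Since a sub-root function is nonnegative and nondecreasing, with ψ(r)/√r nonincreasing, the iteration r ← ψ(r) converges monotonically to the unique positive fixed point guaranteed by Lemma 3.2. The following minimal numeric sketch (not part of the original text; the function ψ used here is an arbitrary toy sub-root function, not an actual local Rademacher average) illustrates this in Python.

```python
import math

def fixed_point(psi, r0=1.0, n_iter=200):
    """Iterate r <- psi(r); for a sub-root psi this converges to its
    unique positive fixed point r* (uniqueness is Lemma 3.2)."""
    r = r0
    for _ in range(n_iter):
        r = psi(r)
    return r

# Toy sub-root function psi(r) = a*sqrt(r) + b (a, b chosen for illustration).
a, b = 1.0, 2.0
psi = lambda r: a * math.sqrt(r) + b

r_star = fixed_point(psi)
# Closed form here: sqrt(r*) = (a + sqrt(a^2 + 4b))/2, so r* = 4.
```

The same iteration applies verbatim to an empirically estimated ψ̂_n, since the empirical local Rademacher average of a star-shaped class is itself sub-root in r.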
Thus, with high probability, r̂* is an upper bound on r* and has the same asymptotic behavior. Notice that there was no attempt to optimize the constants in the above theorem. In addition, the constant 9(1 + c_1)² (equal to 3969 if B ≤ 10) in Theorem 4.2 does not appear in the upper bound of Theorem 4.1.

4.2 Proofs

The idea of the proofs is to show that one can upper bound ψ by an empirical estimate (with high probability). This requires two steps: the first one uses the concentration of the Rademacher averages to upper bound the expected Rademacher averages by their empirical versions. The second step uses Corollary 2.2 to prove that the ball over which the averages are computed (which is an L_2(P) ball) can be replaced by an empirical one. Thus, ψ̂_n is an upper bound on ψ, and one can apply Theorem 3.3, together with the following lemma, which shows how the fixed points of sub-root functions relate when the functions are ordered.

Lemma 4.3 Suppose that ψ, ψ̂ are sub-root. Let r* (resp. r̂*) be the fixed point of ψ (resp. ψ̂). If for some 0 ≤ α ≤ 1 we have α ψ̂(r*) ≤ ψ(r*) ≤ ψ̂(r*), then

α² r̂* ≤ r* ≤ r̂*.

Proof: Denote by r̂_α the fixed point of the sub-root function α ψ̂. Then, by Lemma 3.2, r̂_α ≤ r* ≤ r̂*. Also, since ψ̂ is sub-root, ψ̂(α² r̂*) ≥ α ψ̂(r̂*) = α r̂*, which means α ψ̂(α² r̂*) ≥ α² r̂*. Hence, Lemma 3.2 yields r̂_α ≥ α² r̂*.

Proof of Theorem 4.1: Consider the sub-root function

ψ_1(r) = (c_1/√2) E R_n {f ∈ star(F, 0) : P f² ≤ r} + (c_2 − c_1) x/n,

with fixed point r_1*. Applying Corollary 2.2 when r ≥ ψ_1(r), it follows that with probability at least 1 − e^{−x},

{f ∈ star(F, 0) : P f² ≤ r} ⊆ {f ∈ star(F, 0) : P_n f² ≤ 2r}.

Using this, together with the first inequality of Lemma A.4 (with α = 1/2), shows that if r ≥ ψ_1(r), then with probability at least 1 − 2e^{−x},

ψ_1(r) = (c_1/√2) E R_n {f ∈ star(F, 0) : P f² ≤ r} + (c_2 − c_1) x/n
       ≤ c_1 E_σ R_n {f ∈ star(F, 0) : P f² ≤ r} + c_2 x/n
       ≤ c_1 E_σ R_n {f ∈ star(F, 0) : P_n f² ≤ 2r} + c_2 x/n
       ≤ ψ̂_n(r).

Choosing r = r_1*, Lemma 4.3 shows that with probability at least 1 − 2e^{−x},

r_1* ≤ r̂*.   (4.1)

Also, for all r ≥ 0,

ψ_1(r) ≥ B E R_n {f ∈ star(F, 0) : P f² ≤ r},

and so from Theorem 3.3, with probability at least 1 − e^{−x}, every f ∈ F satisfies

P f ≤ (K/(K−1)) P_n f + 6K r_1*/B + (11 + 5BK) x/n.

Combining this with (4.1) gives the first result. The second result is proved in a similar manner.
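When the class is finite, the empirical quantity E_σ R_n {f ∈ star(F, 0) : P_n f² ≤ 2r} appearing in ψ̂_n can be estimated by Monte Carlo over the Rademacher signs. The sketch below is an added illustration (not from the original text) on a hypothetical random class; it exploits the star-shape around 0, under which the supremum over the intersection of star(F, 0) with the empirical ball is attained either by shrinking each f onto the ball or by taking g = 0.

```python
import numpy as np

def local_emp_rademacher(F_vals, r, n_mc=2000, seed=1):
    """Monte Carlo estimate of E_sigma R_n {g in star(F,0): P_n g^2 <= 2r}
    for a finite class given by its value matrix F_vals (one row = values
    of one f at the sample points X_1, ..., X_n)."""
    rng = np.random.default_rng(seed)
    n = F_vals.shape[1]
    norms = (F_vals ** 2).mean(axis=1)              # P_n f^2 for each f
    scale = np.minimum(1.0, np.sqrt(2 * r / norms)) # shrink onto the ball
    G = F_vals * scale[:, None]
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))
    # sup over the star hull restricted to the ball: best shrunk f, or g = 0
    sups = np.maximum((sigma @ G.T / n).max(axis=1), 0.0)
    return sups.mean()

rng = np.random.default_rng(0)
F_vals = rng.uniform(-1, 1, size=(30, 50))  # hypothetical class, n = 50
```

With the Rademacher draws held fixed across calls (same seed), the estimate is nondecreasing in r, as the sub-root property requires.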

Proof of Theorem 4.2: Consider the functions

ψ_1(r) = (c_1/√2) E R_n {f ∈ star(F, 0) : P f² ≤ r} + (c_2 − c_1) x/n

and

ψ_2(r) = c_1 E R_n {f ∈ star(F, 0) : P f² ≤ r} + c_3 x/n,

and denote by r_1* and r_2* the fixed points of ψ_1 and ψ_2 respectively. The proof of Theorem 4.1 shows that with probability at least 1 − 2e^{−x}, r_1* ≤ r̂*. Now apply Lemma 3.6 to show that if r ≥ ψ_2(r), then with probability at least 1 − e^{−x},

{f ∈ star(F, 0) : P_n f² ≤ r} ⊆ {f ∈ star(F, 0) : P f² ≤ 2r}.

Using this, together with the second inequality of Lemma A.4 (with α = 1/2), shows that if r ≥ ψ_2(r), then with probability at least 1 − 2e^{−x},

ψ̂_n(r) = c_1 E_σ R_n {f ∈ star(F, 0) : P_n f² ≤ 2r} + c_2 x/n
       ≤ √2 c_1 E_σ R_n {f ∈ star(F, 0) : P_n f² ≤ r} + c_2 x/n
       ≤ √2 c_1 E_σ R_n {f ∈ star(F, 0) : P f² ≤ 2r} + c_2 x/n
       ≤ 2 c_1 E R_n {f ∈ star(F, 0) : P f² ≤ 2r} + (c_2 + 2c_1) x/n
       ≤ 3 c_1 E R_n {f ∈ star(F, 0) : P f² ≤ r} + (c_2 + 2c_1) x/n
       ≤ 3 ψ_2(r),

where the sub-root property was used twice (in the first and second-last inequalities). Lemma 4.3 thus gives r̂* ≤ 9 r_2*. Also notice that for all r, ψ(r) ≤ ψ_1(r), and hence r* ≤ r_1*. Moreover, for all r ≥ ψ(r) (hence for r ≥ r* ≥ c_3 x/n), ψ_2(r) ≤ c_1 ψ(r) + r, so that ψ_2(r*) ≤ (c_1 + 1) r* = (c_1 + 1) ψ(r*). Lemma 4.3 implies that r_2* ≤ (1 + c_1)² r*.

5 Prediction with Bounded Loss

In this section, we discuss the application of our results to prediction problems, such as classification and regression. For such problems, there is an input space X and an output space Y, and the product X × Y is endowed with an unknown probability measure P. For example, classification corresponds to the case where Y is discrete, typically Y = {−1, 1}, and regression corresponds to the continuous case, typically Y = [−1, 1]. Note that assuming the boundedness of the target values is a typical assumption in theoretical analysis of regression procedures. To analyze the case of unbounded targets, one usually truncates the values at a certain threshold and bounds the probability of exceeding that threshold (see, for example, the techniques developed in [12]). The training sample is a sequence (X_1, Y_1), ..., (X_n, Y_n) of independent and identically distributed (i.i.d.) pairs sampled according to P.
A loss function ℓ : Y × Y → [0, 1] is defined, and the goal is to find a function f : X → Y from a class F that minimizes the expected loss E ℓ_f = E ℓ(f(X), Y).

Since the probability distribution P is unknown, one cannot directly minimize the expected loss over F. The key property that is needed to apply our results is the fact that Var[f] ≤ B P f (or P f² ≤ B P f, to obtain data-dependent bounds). This will trivially be the case for the class {ℓ_f : f ∈ F}, as all its functions are uniformly bounded and nonnegative. This case, studied in Section 5.1, is, however, not the most interesting. Indeed, it is when one studies the excess risk ℓ_f − ℓ_{f*} that our approach shows its superiority over previous ones; when the class {ℓ_f − ℓ_{f*}} satisfies the variance condition (and Section 5.2 gives examples of this), we obtain distribution-dependent bounds that are optimal in certain cases, and data-dependent bounds of the same order.

5.1 General Results without Assumptions

Define the following class of functions, called the loss class associated with F:

ℓ_F = {ℓ_f : f ∈ F} = {(x, y) ↦ ℓ(f(x), y) : f ∈ F}.

Notice that ℓ_F is a class of nonnegative functions. Applying Theorem 4.1 to this class of functions gives the following corollary.

Corollary 5.1 For a loss function ℓ : Y × Y → [0, 1], define

ψ̂_n(r) = 20 E_σ R_n {f ∈ star(ℓ_F, 0) : P_n f² ≤ 2r} + 31x/n,

with fixed point r̂*. Then, for any K > 1, with probability at least 1 − 3e^{−x},

∀ f ∈ F,   P ℓ_f ≤ (K/(K−1)) P_n ℓ_f + 6K r̂* + x(11 + 5K)/n.

A natural approach is to minimize the empirical loss P_n ℓ_f over the class F. The following result shows that this approach leads to an estimate with expected loss near minimal. How close it is to the minimal expected loss depends on the value of the minimum, as well as on the local Rademacher averages of the class.

Theorem 5.2 For a loss function ℓ : Y × Y → [0, 1], define ψ(r), ψ̂_n(r), r*, and r̂* as in Corollary 5.1. Let L* = inf_{f ∈ F} P ℓ_f. Then there is a constant c such that with probability at least 1 − 2e^{−x}, the minimizer f̂ ∈ F of P_n ℓ_f satisfies

P ℓ_{f̂} ≤ L* + c (√(L* r*) + r*).

Also, with probability at least 1 − 4e^{−x},

P ℓ_{f̂} ≤ L* + c (√(L* r̂*) + r̂*).

The proof of this theorem is given in Appendix B. This theorem has the same flavor as Theorem 4.2 of [19].
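The empirical risk minimization scheme analysed in Theorem 5.2 can be illustrated on a small, entirely hypothetical example (added here, not from the original text): threshold classifiers on [0, 1] with 0–1 loss, where the true risk of each classifier is known in closed form, so the true risk of f̂ can be compared with L*.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite class F: classifiers x -> 1{x > t} for thresholds t.
thresholds = np.linspace(0.0, 1.0, 21)
X = rng.uniform(size=500)
Y = (X > 0.3).astype(float)                      # noiseless target at t = 0.3

preds = (X[None, :] > thresholds[:, None]).astype(float)
emp_loss = (preds != Y).mean(axis=1)             # P_n l_f for each f in F
f_hat = np.argmin(emp_loss)                      # empirical risk minimizer

# Under uniform X, the true 0-1 risk of threshold t is exactly |t - 0.3|.
true_loss = np.abs(thresholds - 0.3)
L_star = true_loss.min()
```

Here L* = 0 (the target threshold lies on the grid), so Theorem 5.2 predicts a fast rate for the excess risk of f̂, rather than the slow √(d/n)-type rate that governs the case L* > 0.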
We have not used any property besides the positivity of the functions in the class. This indicates that there might not be a significant gain compared to earlier results (as, without further assumptions, the optimal rates are known).

Indeed, a careful examination of this result shows that when L* > 0, the difference between P ℓ_{f̂} and L* is essentially of order √(L* r*). For a class of {0, 1}-valued functions with VC-dimension d, for example, this would be √(L* d log n / n). On the other hand, the result of [19] is more refined, since the Rademacher averages are not localized around 0 (as they are here), but rather around the minimizer of the empirical error itself. Unfortunately, the small ball in [19] is not defined as {f : P_n ℓ_f ≤ P_n ℓ_{f̂} + r} but as {f : P_n ℓ_f ≤ 16 P_n ℓ_{f̂} + r}. This means that in the general situation where L* > 0, since P_n ℓ_{f̂} does not converge to 0 with increasing n (as it is expected to be close to P ℓ_{f̂}, which itself converges to L*), the radius of the ball around ℓ_{f̂} (which is 15 P_n ℓ_{f̂} + r) will not converge to 0. Thus, the localized Rademacher average over this ball will converge at speed √(d/n). In other words, our Theorem 5.2 and Theorem 4.2 of [19] essentially have the same behavior. But this is not surprising, as it is known that this is the optimal rate of convergence in this case. To get an improvement in the rates of convergence, one needs to make further assumptions on the distribution P or on the class F.

5.2 Improved Results for the Excess Risk

Consider a loss function ℓ and a function class F that satisfy the following conditions.

1. For every probability distribution P there is an f* ∈ F satisfying P ℓ_{f*} = inf_{f ∈ F} P ℓ_f.

2. There is a constant L such that ℓ is L-Lipschitz in its first argument: for all y, ŷ_1, ŷ_2,

|ℓ(ŷ_1, y) − ℓ(ŷ_2, y)| ≤ L |ŷ_1 − ŷ_2|.

3. There is a constant B ≥ 1 such that for every probability distribution and every f ∈ F,

P (f − f*)² ≤ B P (ℓ_f − ℓ_{f*}).

These conditions are not too restrictive, as they are met by several commonly used regularized algorithms with convex losses. Note that Condition 1 could be weakened, and one could consider a function which is only close to achieving the infimum, with an appropriate change to Condition 3. This generalization is straightforward, but it would make the results less readable, so we omit it. Condition 2 implies that, for all f ∈ F,

P (ℓ_f − ℓ_{f*})² ≤ L² P (f − f*)².
Condition 3 usually follows from a uniform convexity condition on ℓ. An important example is the quadratic loss, ℓ(y, y′) = (y − y′)², when the function class F is convex and uniformly bounded. In particular, if |f(x) − y| ∈ [0, 1] for all f ∈ F, x ∈ X and y ∈ Y, then the conditions are satisfied with L = 2 and B = 1 (see [18]). Other examples are described in [26] and in [2]. The first result we present is a direct but instructive corollary of Theorem 3.3.

Corollary 5.3 Let F be a class of functions with range in [−1, 1] and let ℓ be a loss function satisfying Conditions 1–3 above. Let f̂ be any element of F satisfying P_n ℓ_{f̂} = inf_{f ∈ F} P_n ℓ_f. Assume ψ is a sub-root function for which

ψ(r) ≥ B L E R_n {f ∈ F : L² P (f − f*)² ≤ r}.

Then for any x > 0 and any r ≥ ψ(r), with probability at least 1 − e^{−x},

P (ℓ_{f̂} − ℓ_{f*}) ≤ 705 r/B + (11L + 27B) x/n.

Proof: One applies Theorem 3.3 (first part) to the class ℓ_f − ℓ_{f*} with T(f) = L² P (f − f*)², and uses the fact that, by Theorem A.6 and by the symmetry of the Rademacher variables,

L E R_n {f : L² P (f − f*)² ≤ r} ≥ E R_n {ℓ_f − ℓ_{f*} : L² P (f − f*)² ≤ r}.

The result follows from noticing that P_n (ℓ_{f̂} − ℓ_{f*}) ≤ 0.

Instead of comparing the loss of f̂ to that of f*, one could compare it to the loss of the best measurable function (the regression function for regression function estimation, or the Bayes classifier for classification). The techniques proposed here can be adapted to this case. Using Corollary 5.3, one can (with minor modification) recover the results of [22] for model selection. These have been shown to match the minimax results in various situations. In that sense, Corollary 5.3 can be considered as sharp.

Next, we turn to the main result of this section. It is a version of Corollary 5.3 with a fully data-dependent bound. This is obtained by modifying ψ in three ways: the Rademacher averages are replaced by empirical ones, the radius of the ball is in L_2(P_n) norm instead of L_2(P), and finally, the center of the ball is f̂ instead of f*.

Theorem 5.4 Let F be a convex class of functions with range in [−1, 1] and let ℓ be a loss function satisfying Conditions 1–3 above. Let f̂ be any element of F satisfying P_n ℓ_{f̂} = inf_{f ∈ F} P_n ℓ_f. Define

ψ̂_n(r) = c_1 E_σ R_n {f ∈ F : P_n (f − f̂)² ≤ c_3 r} + c_2 x/n,   (5.1)

where c_1 = 2L(B ∨ 10L), c_2 = 11L² + c_1, and c_3 = (√2/L + c)², with c = √(2(705 + B(11L + 27B)/c_2)). Then with probability at least 1 − 4e^{−x},

P (ℓ_{f̂} − ℓ_{f*}) ≤ 705 r̂*/B + (11L + 27B) x/n,

where r̂* is the fixed point of ψ̂_n.

Remark 5.5 Unlike Corollary 5.3, the class F in Theorem 5.4 has to be convex. This ensures that it is star-shaped around any of its elements (which implies that ψ̂_n is sub-root even though f̂ is random).
However, convexity of the loss class is not necessary, so that this theorem still applies to many situations of interest, in particular to regularized regression, where the functions are taken in a vector space or a ball of a vector space.

Remark 5.6 Although the theorem is stated with explicit constants, there is no reason to think that these are optimal. The fact that the constant 705 appears is actually due to our failure to apply the second part of Theorem 3.3 to the initial loss class, which is not star-shaped (this would have given a 7 instead). However, with some additional effort, one can probably obtain much better constants.

As we explained earlier, although the statement of Theorem 5.4 is similar to Theorem 4.2 in [19], there is an important difference in the way the localized averages are defined: in our case the radius is a constant times r, while in [19] there is an additional term, involving the loss of the empirical risk minimizer, which may not converge to zero. Hence, the complexity decreases faster in our bound.
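Conditions 2 and 3 for the quadratic loss can be checked numerically on a small convex class. The sketch below is an added illustration with made-up data (not from the original text): it verifies that ℓ(ŷ, y) = (ŷ − y)² is 2-Lipschitz in its first argument on [0, 1], and that Condition 3 holds with B = 1 along a one-dimensional convex class, where it is in fact an equality.

```python
import numpy as np

rng = np.random.default_rng(0)

# Condition 2: quadratic loss is 2-Lipschitz in its first argument on [0, 1],
# since |(a - y)^2 - (b - y)^2| = |a - b| * |a + b - 2y| <= 2 |a - b|.
a_s, b_s, y_s = rng.uniform(size=(3, 1000))
lip_ok = np.all(np.abs((a_s - y_s) ** 2 - (b_s - y_s) ** 2)
                <= 2 * np.abs(a_s - b_s) + 1e-12)

# Condition 3 with B = 1 on the convex class {f_t = t * v : t in [0, 1]}
# over three equiprobable points (v and y are hypothetical values).
v = np.array([1.0, 0.5, 0.2])
y = np.array([0.0, 1.0, 0.5])

def risk(t):
    return ((t * v - y) ** 2).mean()          # P l_{f_t}

t_star = (v * y).mean() / (v ** 2).mean()     # minimizer; lies inside [0, 1]
ts = np.linspace(0.0, 1.0, 101)
lhs = np.array([((t - t_star) ** 2 * v ** 2).mean() for t in ts])  # P (f_t - f*)^2
rhs = np.array([risk(t) - risk(t_star) for t in ts])               # excess risk
```

On this one-dimensional class both sides equal (t − t*)² · mean(v²), so Condition 3 holds with B = 1 and cannot be improved, matching the constants quoted from [18].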

The additional property required in the proof of this result, compared to the proof of Theorem 4.1, is that, under the assumptions of the theorem, the minimizers of the empirical loss and of the true loss are close with respect to the L_2(P) and the L_2(P_n) distances (this has also been used in [20] and [31, 32]).

5.3 Proof of Theorem 5.4

Define the function ψ as

ψ(r) = (c_1/√2) E R_n {f ∈ F : L² P (f − f*)² ≤ r} + (c_2 − c_1) x/n.   (5.2)

Notice that since F is convex, and thus star-shaped around each of its points, Lemma 3.4 implies that ψ is sub-root. Now, for r ≥ ψ(r), Corollary 5.3 and Condition 3 on the loss function imply that, with probability at least 1 − e^{−x},

L² P (f̂ − f*)² ≤ B L² P (ℓ_{f̂} − ℓ_{f*}) ≤ 705 L² r + (11L + 27B) B L² x/n.   (5.3)

Denote the right-hand side by s. Since s ≥ r ≥ ψ(r), then s ≥ ψ(s) (by Lemma 3.2), and thus

s ≥ 10 L² E R_n {f ∈ F : L² P (f − f*)² ≤ s} + 11 L² x/n.

Therefore, Corollary 2.2 applied to the class LF yields that, with probability at least 1 − e^{−x},

{f ∈ F : L² P (f − f*)² ≤ s} ⊆ {f ∈ F : L² P_n (f − f*)² ≤ 2s}.

This, combined with (5.3), implies that with probability at least 1 − 2e^{−x},

P_n (f̂ − f*)² ≤ 2 (705 r + (11L + 27B) B x/n) ≤ 2 (705 + (11L + 27B) B / c_2) r,   (5.4)

where the second inequality follows from r ≥ ψ(r) ≥ c_2 x/n. Define c = √(2(705 + (11L + 27B) B / c_2)). By the triangle inequality in L_2(P_n), if (5.4) occurs, then any f ∈ F satisfies

√(P_n (f − f̂)²) ≤ √(P_n (f − f*)²) + √(P_n (f̂ − f*)²) ≤ √(P_n (f − f*)²) + c√r.

Appealing again to Corollary 2.2 applied to LF as before, but now for r ≥ ψ(r), it follows that with probability at least 1 − 3e^{−x},

{f ∈ F : L² P (f − f*)² ≤ r} ⊆ {f ∈ F : L² P_n (f − f̂)² ≤ (√2 + cL)² r}.


More information

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece,, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet as

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

10-701/ Machine Learning Mid-term Exam Solution

10-701/ Machine Learning Mid-term Exam Solution 0-70/5-78 Machie Learig Mid-term Exam Solutio Your Name: Your Adrew ID: True or False (Give oe setece explaatio) (20%). (F) For a cotiuous radom variable x ad its probability distributio fuctio p(x), it

More information

Sieve Estimators: Consistency and Rates of Convergence

Sieve Estimators: Consistency and Rates of Convergence EECS 598: Statistical Learig Theory, Witer 2014 Topic 6 Sieve Estimators: Cosistecy ad Rates of Covergece Lecturer: Clayto Scott Scribe: Julia Katz-Samuels, Brado Oselio, Pi-Yu Che Disclaimer: These otes

More information

MA131 - Analysis 1. Workbook 3 Sequences II

MA131 - Analysis 1. Workbook 3 Sequences II MA3 - Aalysis Workbook 3 Sequeces II Autum 2004 Cotets 2.8 Coverget Sequeces........................ 2.9 Algebra of Limits......................... 2 2.0 Further Useful Results........................

More information

Fall 2013 MTH431/531 Real analysis Section Notes

Fall 2013 MTH431/531 Real analysis Section Notes Fall 013 MTH431/531 Real aalysis Sectio 8.1-8. Notes Yi Su 013.11.1 1. Defiitio of uiform covergece. We look at a sequece of fuctios f (x) ad study the coverget property. Notice we have two parameters

More information

Learnability with Rademacher Complexities

Learnability with Rademacher Complexities Learability with Rademacher Complexities Daiel Khashabi Fall 203 Last Update: September 26, 206 Itroductio Our goal i study of passive ervised learig is to fid a hypothesis h based o a set of examples

More information

5.1 A mutual information bound based on metric entropy

5.1 A mutual information bound based on metric entropy Chapter 5 Global Fao Method I this chapter, we exted the techiques of Chapter 2.4 o Fao s method the local Fao method) to a more global costructio. I particular, we show that, rather tha costructig a local

More information

A Proof of Birkhoff s Ergodic Theorem

A Proof of Birkhoff s Ergodic Theorem A Proof of Birkhoff s Ergodic Theorem Joseph Hora September 2, 205 Itroductio I Fall 203, I was learig the basics of ergodic theory, ad I came across this theorem. Oe of my supervisors, Athoy Quas, showed

More information

Lecture 3: August 31

Lecture 3: August 31 36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

Chapter 6 Infinite Series

Chapter 6 Infinite Series Chapter 6 Ifiite Series I the previous chapter we cosidered itegrals which were improper i the sese that the iterval of itegratio was ubouded. I this chapter we are goig to discuss a topic which is somewhat

More information

A statistical method to determine sample size to estimate characteristic value of soil parameters

A statistical method to determine sample size to estimate characteristic value of soil parameters A statistical method to determie sample size to estimate characteristic value of soil parameters Y. Hojo, B. Setiawa 2 ad M. Suzuki 3 Abstract Sample size is a importat factor to be cosidered i determiig

More information

Measure and Measurable Functions

Measure and Measurable Functions 3 Measure ad Measurable Fuctios 3.1 Measure o a Arbitrary σ-algebra Recall from Chapter 2 that the set M of all Lebesgue measurable sets has the followig properties: R M, E M implies E c M, E M for N implies

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Patter Recogitio Classificatio: No-Parametric Modelig Hamid R. Rabiee Jafar Muhammadi Sprig 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Ageda Parametric Modelig No-Parametric Modelig

More information

IP Reference guide for integer programming formulations.

IP Reference guide for integer programming formulations. IP Referece guide for iteger programmig formulatios. by James B. Orli for 15.053 ad 15.058 This documet is iteded as a compact (or relatively compact) guide to the formulatio of iteger programs. For more

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week Lecture: Cocept Check Exercises Starred problems are optioal. Statistical Learig Theory. Suppose A = Y = R ad X is some other set. Furthermore, assume P X Y is a discrete

More information

The Growth of Functions. Theoretical Supplement

The Growth of Functions. Theoretical Supplement The Growth of Fuctios Theoretical Supplemet The Triagle Iequality The triagle iequality is a algebraic tool that is ofte useful i maipulatig absolute values of fuctios. The triagle iequality says that

More information

6.867 Machine learning, lecture 7 (Jaakkola) 1

6.867 Machine learning, lecture 7 (Jaakkola) 1 6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit

More information

NYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018)

NYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018) NYU Ceter for Data Sciece: DS-GA 003 Machie Learig ad Computatioal Statistics (Sprig 208) Brett Berstei, David Roseberg, Be Jakubowski Jauary 20, 208 Istructios: Followig most lab ad lecture sectios, we

More information

1 Inferential Methods for Correlation and Regression Analysis

1 Inferential Methods for Correlation and Regression Analysis 1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet

More information

Output Analysis and Run-Length Control

Output Analysis and Run-Length Control IEOR E4703: Mote Carlo Simulatio Columbia Uiversity c 2017 by Marti Haugh Output Aalysis ad Ru-Legth Cotrol I these otes we describe how the Cetral Limit Theorem ca be used to costruct approximate (1 α%

More information

ACO Comprehensive Exam 9 October 2007 Student code A. 1. Graph Theory

ACO Comprehensive Exam 9 October 2007 Student code A. 1. Graph Theory 1. Graph Theory Prove that there exist o simple plaar triagulatio T ad two distict adjacet vertices x, y V (T ) such that x ad y are the oly vertices of T of odd degree. Do ot use the Four-Color Theorem.

More information

Beurling Integers: Part 2

Beurling Integers: Part 2 Beurlig Itegers: Part 2 Isomorphisms Devi Platt July 11, 2015 1 Prime Factorizatio Sequeces I the last article we itroduced the Beurlig geeralized itegers, which ca be represeted as a sequece of real umbers

More information

Element sampling: Part 2

Element sampling: Part 2 Chapter 4 Elemet samplig: Part 2 4.1 Itroductio We ow cosider uequal probability samplig desigs which is very popular i practice. I the uequal probability samplig, we ca improve the efficiecy of the resultig

More information

The standard deviation of the mean

The standard deviation of the mean Physics 6C Fall 20 The stadard deviatio of the mea These otes provide some clarificatio o the distictio betwee the stadard deviatio ad the stadard deviatio of the mea.. The sample mea ad variace Cosider

More information

Sequences. Notation. Convergence of a Sequence

Sequences. Notation. Convergence of a Sequence Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it

More information

Rates of Convergence by Moduli of Continuity

Rates of Convergence by Moduli of Continuity Rates of Covergece by Moduli of Cotiuity Joh Duchi: Notes for Statistics 300b March, 017 1 Itroductio I this ote, we give a presetatio showig the importace, ad relatioship betwee, the modulis of cotiuity

More information

Seunghee Ye Ma 8: Week 5 Oct 28

Seunghee Ye Ma 8: Week 5 Oct 28 Week 5 Summary I Sectio, we go over the Mea Value Theorem ad its applicatios. I Sectio 2, we will recap what we have covered so far this term. Topics Page Mea Value Theorem. Applicatios of the Mea Value

More information

Intro to Learning Theory

Intro to Learning Theory Lecture 1, October 18, 2016 Itro to Learig Theory Ruth Urer 1 Machie Learig ad Learig Theory Comig soo 2 Formal Framework 21 Basic otios I our formal model for machie learig, the istaces to be classified

More information

Problem Set 4 Due Oct, 12

Problem Set 4 Due Oct, 12 EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios

More information

TR/46 OCTOBER THE ZEROS OF PARTIAL SUMS OF A MACLAURIN EXPANSION A. TALBOT

TR/46 OCTOBER THE ZEROS OF PARTIAL SUMS OF A MACLAURIN EXPANSION A. TALBOT TR/46 OCTOBER 974 THE ZEROS OF PARTIAL SUMS OF A MACLAURIN EXPANSION by A. TALBOT .. Itroductio. A problem i approximatio theory o which I have recetly worked [] required for its solutio a proof that the

More information

Glivenko-Cantelli Classes

Glivenko-Cantelli Classes CS28B/Stat24B (Sprig 2008 Statistical Learig Theory Lecture: 4 Gliveko-Catelli Classes Lecturer: Peter Bartlett Scribe: Michelle Besi Itroductio This lecture will cover Gliveko-Catelli (GC classes ad itroduce

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Support vector machine revisited

Support vector machine revisited 6.867 Machie learig, lecture 8 (Jaakkola) 1 Lecture topics: Support vector machie ad kerels Kerel optimizatio, selectio Support vector machie revisited Our task here is to first tur the support vector

More information

Lecture 2: Monte Carlo Simulation

Lecture 2: Monte Carlo Simulation STAT/Q SCI 43: Itroductio to Resamplig ethods Sprig 27 Istructor: Ye-Chi Che Lecture 2: ote Carlo Simulatio 2 ote Carlo Itegratio Assume we wat to evaluate the followig itegratio: e x3 dx What ca we do?

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

4 The Sperner property.

4 The Sperner property. 4 The Sperer property. I this sectio we cosider a surprisig applicatio of certai adjacecy matrices to some problems i extremal set theory. A importat role will also be played by fiite groups. I geeral,

More information

CHAPTER 10 INFINITE SEQUENCES AND SERIES

CHAPTER 10 INFINITE SEQUENCES AND SERIES CHAPTER 10 INFINITE SEQUENCES AND SERIES 10.1 Sequeces 10.2 Ifiite Series 10.3 The Itegral Tests 10.4 Compariso Tests 10.5 The Ratio ad Root Tests 10.6 Alteratig Series: Absolute ad Coditioal Covergece

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT

More information

Lecture Notes for Analysis Class

Lecture Notes for Analysis Class Lecture Notes for Aalysis Class Topological Spaces A topology for a set X is a collectio T of subsets of X such that: (a) X ad the empty set are i T (b) Uios of elemets of T are i T (c) Fiite itersectios

More information

Lecture 7: October 18, 2017

Lecture 7: October 18, 2017 Iformatio ad Codig Theory Autum 207 Lecturer: Madhur Tulsiai Lecture 7: October 8, 207 Biary hypothesis testig I this lecture, we apply the tools developed i the past few lectures to uderstad the problem

More information

Linear Regression Demystified

Linear Regression Demystified Liear Regressio Demystified Liear regressio is a importat subject i statistics. I elemetary statistics courses, formulae related to liear regressio are ofte stated without derivatio. This ote iteds to

More information

Math 155 (Lecture 3)

Math 155 (Lecture 3) Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,

More information

MAT1026 Calculus II Basic Convergence Tests for Series

MAT1026 Calculus II Basic Convergence Tests for Series MAT026 Calculus II Basic Covergece Tests for Series Egi MERMUT 202.03.08 Dokuz Eylül Uiversity Faculty of Sciece Departmet of Mathematics İzmir/TURKEY Cotets Mootoe Covergece Theorem 2 2 Series of Real

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

Lecture 11: Decision Trees

Lecture 11: Decision Trees ECE9 Sprig 7 Statistical Learig Theory Istructor: R. Nowak Lecture : Decisio Trees Miimum Complexity Pealized Fuctio Recall the basic results of the last lectures: let X ad Y deote the iput ad output spaces

More information

Notes for Lecture 11

Notes for Lecture 11 U.C. Berkeley CS78: Computatioal Complexity Hadout N Professor Luca Trevisa 3/4/008 Notes for Lecture Eigevalues, Expasio, ad Radom Walks As usual by ow, let G = (V, E) be a udirected d-regular graph with

More information

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015 ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],

More information