COMPLEXITY REGULARIZATION VIA LOCALIZED RANDOM PENALTIES


COMPLEXITY REGULARIZATION VIA LOCALIZED RANDOM PENALTIES

GÁBOR LUGOSI AND MARTEN WEGKAMP

Abstract. In this paper model selection via penalized empirical loss minimization in nonparametric classification problems is studied. Data-dependent penalties are constructed, which are based on estimates of the complexity of a small subclass of each model class, containing only those functions with small empirical loss. The penalties are novel since those considered in the literature are typically based on the entire model class. Oracle inequalities using these penalties are established, and the advantage of the new penalties over those based on the complexity of the whole model class is demonstrated.

1. Introduction. In this paper we propose a new complexity-penalized model selection method based on data-dependent penalties. We consider a simple binary classification problem, though most ideas may be extended to a more general framework. Given a random observation X ∈ R^d, one has to predict Y ∈ {0, 1}. A classifier or classification rule is a function f : R^d → {0, 1}, with loss

L(f) := P{ f(X) ≠ Y }.

A sample D_n = (X_1, Y_1), ..., (X_n, Y_n) of independent, identically distributed (i.i.d.) pairs is available. Each pair (X_i, Y_i) has the same distribution as (X, Y) and D_n is independent of (X, Y). The statistician's task is to select a classification rule f_n based on the data D_n such that the probability of error

L(f_n) = P{ f_n(X) ≠ Y | D_n }

is small. The Bayes classifier

f*(x) := I{ P{Y = 1 | X = x} ≥ P{Y = 0 | X = x} }

(where I denotes the indicator function) is the optimal rule, as

L* := inf_{f : R^d → {0,1}} L(f) = L(f*),

but both f* and L* are unknown to the statistician.

Date: March 3. Mathematics Subject Classification. Primary 62H30, 62G99; secondary 60E15. Key words and phrases. Classification, complexity regularization, concentration inequalities, oracle inequalities, Rademacher averages, random penalties, shatter coefficients. Supported by DGI grant BMF.
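The empirical risk minimization setup described above is easy to prototype. The following is a minimal sketch (not from the paper) of computing empirical losses over a finite class of candidate classifiers and returning the empirical minimizer; the toy class of decision stumps and all names are illustrative assumptions.

```python
import numpy as np

def empirical_loss(f, X, Y):
    """Empirical probability of error of classifier f on the sample (X, Y)."""
    return float(np.mean(f(X) != Y))

def empirical_minimizer(classifiers, X, Y):
    """Return the classifier with the smallest empirical loss in a finite class."""
    losses = [empirical_loss(f, X, Y) for f in classifiers]
    best = int(np.argmin(losses))
    return classifiers[best], losses[best]

# Toy example: decision stumps thresholding the first coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = (X[:, 0] > 0.3).astype(int)
stumps = [lambda x, t=t: (x[:, 0] > t).astype(int) for t in np.linspace(-1, 1, 21)]
f_hat, loss_hat = empirical_minimizer(stumps, X, Y)
print(loss_hat)
```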

In this note we study classifiers f̂_n : R^d → {0, 1} which minimize the empirical loss

L̂_n(f) = (1/n) Σ_{i=1}^n I{ f(X_i) ≠ Y_i }

over a class of rules F. For any f̂_n ∈ F minimizing the empirical probability of error, we have

E L(f̂_n) − L* = E[ L̂_n(f̂_n) − L* ] + E[ (L − L̂_n)(f̂_n) ]
= E[ inf_{f ∈ F} L̂_n(f) − L* ] + E[ (L − L̂_n)(f̂_n) ]
≤ inf_{f ∈ F} E L̂_n(f) − L* + E[ (L − L̂_n)(f̂_n) ]
= inf_{f ∈ F} L(f) − L* + E[ (L − L̂_n)(f̂_n) ].

Clearly, the approximation error inf_{f ∈ F} L(f) − L* is decreasing as F becomes richer. However, the more complex F, the more difficult the statistical problem becomes: the estimation error E[ (L − L̂_n)(f̂_n) ] increases with the complexity of F.

In many approaches to the problem described above, one fixes in advance a sequence of model classes F_1, F_2, ..., whose union is F. The problem of penalized model selection is to find a possibly data-dependent penalty Ĉ_{n,k}, assigned to each class F_k, such that minimizing the penalized empirical loss

L̂_n(f) + Ĉ_{n,k},   f ∈ F_k,  k = 1, 2, ...,

leads to a prediction rule f̂_n with smallest possible loss. Denote by f̂_{n,k} a function in F_k having minimal empirical loss and by L_k = inf_{f ∈ F_k} L(f) the minimal loss in class F_k. The main idea is that since f̂_{n,k} minimizes L̂_n(f) over F_k, we find, by the argument described above, that

E L(f̂_{n,k}) − L_k ≤ E[ (L − L̂_n)(f̂_{n,k}) ].

Our goal is to find the class F_k such that L(f̂_{n,k}) is as small as possible. To this end, a good balance has to be found between the approximation and estimation errors. The approximation error is unknown to us, but the estimation error may be estimated. The key to complexity-regularized model selection is that a tight bound for the estimation error is a good penalty Ĉ_{n,k}. More precisely, we show in Lemma 2.1 below that if for some constant κ > 0

P{ Ĉ_{n,k} < (L − L̂_n)(f̂_{n,k}) } ≤ κ n^{-2} k^{-2}   for all k,

then the oracle inequality

E L(f̂_n) − L* ≤ inf_k ( L_k − L* + E Ĉ_{n,k} ) + 2κ/n²

holds, and also a similar bound,

L(f̂_n) − L* ≤ inf_k ( L_k − L* + 2 Ĉ_{n,k} ),

holds with probability greater than 1 − 4κ/n².
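The selection criterion just described, minimizing the penalized empirical loss over all classes, is mechanical once the penalties are available. A minimal sketch (not from the paper), assuming finite classes given as lists of callables and penalties supplied externally; all names are illustrative.

```python
import numpy as np

def penalized_selection(classes, penalties, X, Y):
    """For each class F_k pick its empirical minimizer f_hat_{n,k}, then return the
    index k and the classifier minimizing the penalized empirical loss
        L_hat_n(f_hat_{n,k}) + C_hat_{n,k}."""
    best_score, best_k, best_f = np.inf, None, None
    for k, (F_k, C_k) in enumerate(zip(classes, penalties), start=1):
        losses = [float(np.mean(f(X) != Y)) for f in F_k]
        j = int(np.argmin(losses))
        if losses[j] + C_k < best_score:
            best_score, best_k, best_f = losses[j] + C_k, k, F_k[j]
    return best_k, best_f
```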

This simple result shows that the penalty should be, with large probability, an upper bound on the estimation error, and to guarantee good performance, the bound should be as tight as possible.

Originally, distribution-free bounds, based on uniform-deviation inequalities, were proposed as penalties. For example, the structural risk minimization method of Vapnik and Chervonenkis [27] uses penalties of the form

Ĉ_{n,k} = κ √( ( log S_k(2n) + log n ) / n ),

where κ is a constant and S_k(2n) is the 2n-maximal shatter coefficient of the class A_k = { {x : f(x) = 1} : f ∈ F_k }, that is,

(1.1)  S_k(2n) = max_{x_1,...,x_{2n}} | { {x_1, ..., x_{2n}} ∩ A : A ∈ A_k } | = max_{x_1,...,x_{2n}} | { (f(x_1), ..., f(x_{2n})) : f ∈ F_k } |,

see, for example, Vapnik [26], Devroye, Györfi, and Lugosi [9]. The fact that this type of penalty works follows from the Vapnik-Chervonenkis inequality. Such distribution-free bounds are attractive because of their simplicity, but precisely because of their distribution-free nature, they are necessarily loose in many cases. Recently various attempts have been made to define the penalties in a data-dependent way to achieve this goal, see, for example, Bartlett, Boucheron, and Lugosi [2], Koltchinskii [11], Koltchinskii and Panchenko [13], Lozano [15], Lugosi and Nobel [17], Massart [19], and Shawe-Taylor, Bartlett, Williamson, and Anthony [22]. For example, in [11] and [2], random complexity penalties based on Rademacher averages were proposed and investigated.
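For a concrete feel for such distribution-free penalties, here is a small sketch (not from the paper) that evaluates a penalty of the form κ √((log S_k(2n) + log n)/n), bounding the shatter coefficient through the VC dimension via Sauer's lemma, log S_k(2n) ≤ V_k log(2n + 1); the constant κ and the use of the VC bound are assumptions made for illustration.

```python
import math

def srm_penalty(n, vc_dim, kappa=1.0):
    """Distribution-free penalty kappa * sqrt((log S_k(2n) + log n) / n), with the
    worst-case shatter coefficient bounded by Sauer's lemma:
        log S_k(2n) <= V_k * log(2n + 1)."""
    log_shatter = vc_dim * math.log(2 * n + 1)
    return kappa * math.sqrt((log_shatter + math.log(n)) / n)

# Example: classes of increasing VC dimension on a sample of size n = 1000.
print([round(srm_penalty(1000, v), 4) for v in (1, 5, 20)])
```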

Rademacher averages are defined as

R_n(F_k) = E[ sup_{f ∈ F_k} (1/n) Σ_{i=1}^n σ_i I{ f(X_i) ≠ Y_i } | D_n ],

where σ_1, ..., σ_n are i.i.d. symmetric {−1, 1}-valued random variables, independent of D_n. The reason why this penalty was introduced is based on the fact that

E sup_{f ∈ F_k} (L − L̂_n)(f) ≤ 2 E R_n(F_k)

(see, e.g., van der Vaart and Wellner [25]), and since R_n(F_k) can be shown to be sharply concentrated around its mean. In fact, concentration inequalities have been a key tool in the analysis of data-based penalties (see Massart [19]) and this paper relies heavily on some recent concentration results. The model selection method based on Rademacher complexities satisfies an oracle inequality of the rough form

(1.2)  E L(f̂_n) − L* ≤ inf_k [ L_k − L* + κ_1 E R_n(F_k) + κ_2 √( log k / n ) ]

(see [2] and [11]) for values of the constants κ_1, κ_2 > 0. The advantage of this bound over the one obtained by the distribution-free penalties mentioned above may perhaps be better understood if we further bound

E R_n(F_k) ≤ E √( 2 log( 2 S_k(X_1^n) ) / n ),

where

(1.3)  S_k(X_1^n) = | { {X_1, ..., X_n} ∩ A : A ∈ A_k } | = | { (f(X_1), ..., f(X_n)) : f ∈ F_k } |

is the random shatter coefficient of the class F_k, which obviously never exceeds the worst-case shatter coefficient S_k(n) and may be significantly smaller for certain distributions. However, this improved penalty is still not completely satisfactory. To see this, recall that by a classical result of Vapnik and Chervonenkis, for any index k,

(1.4)  E L(f̂_{n,k}) − L_k ≤ c ( √( L_k E log S_k(X_1^n) / n ) + E log S_k(X_1^n) / n ),

which is much smaller than the corresponding expected Rademacher average if L_k is small. (For explicit constants we refer to Theorem 1.14 in Lugosi [16].) Since in typical classification problems the minimal error L_k in class F_k is often very small for some k, it is important to find penalties which allow to derive oracle inequalities with the appropriate dependence on L_k. In particular, a desirable goal would be to develop classifiers f̂_n for which an oracle inequality resembling

E L(f̂_n) − L* ≤ inf_k [ L_k − L* + κ_1 √( L_k E log S_k(X_1^n) / n ) + κ_2 E log S_k(X_1^n) / n ]

holds for all distributions.
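The conditional Rademacher average R_n(F_k) can be approximated numerically by drawing random signs; the sketch below (not from the paper) does this by Monte Carlo for a finite class, with the number of sign draws as an illustrative choice.

```python
import numpy as np

def rademacher_average(classifiers, X, Y, n_rounds=200, seed=0):
    """Monte Carlo estimate of the conditional Rademacher average
        R_n(F) = E[ sup_f (1/n) * sum_i sigma_i * 1{f(X_i) != Y_i} | D_n ],
    where sigma_1, ..., sigma_n are i.i.d. random signs."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    # Error indicator matrix, one row per classifier.
    errors = np.array([(f(X) != Y).astype(float) for f in classifiers])
    total = 0.0
    for _ in range(n_rounds):
        sigma = rng.choice([-1.0, 1.0], size=n)
        total += np.max(errors @ sigma) / n
    return total / n_rounds
```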

The main results of this note (Theorems 4.1 and 4.2) show that estimates of the desired property are indeed possible to construct in a conceptually simple way. By the key Lemma 2.1, it suffices to find a data-dependent upper estimate of (L − L̂_n)(f̂_{n,k}) which has the order of magnitude of the above upper bound. The difficulty is that L_k and E log S_k(X_1^n) both depend on the underlying distribution. The improvement is achieved by decreasing the penalties so that the supremum in the definition of the Rademacher average is not taken over the whole class F_k but rather over a small subclass F̂_k containing only functions which look good on the data. More precisely, define the random subclass F̂_k ⊂ F_k by

F̂_k = { f ∈ F_k : L̂_n(f) ≤ κ_1 L̂_n(f̂_{n,k}) + κ_2 (1/n) log S_k(X_1^n) + κ_3 (1/n) log(nk) }

for some non-negative constants κ_1, κ_2 and κ_3.

Risk estimates based on localized Rademacher averages have been considered in several recent works. The most closely related procedure is proposed by Koltchinskii and Panchenko [12], who, assuming inf_{f ∈ F_k} L̂_n(f) = 0, compute the Rademacher averages of subclasses of F_k with empirical loss less than r for different values of r obtained by a recursive procedure, and obtain bounds for the loss of the empirical risk minimizer in terms of the localized Rademacher averages obtained after a certain number of iterations. Our approach of bounding the loss is conceptually simpler: it suffices to compute the Rademacher complexities at only one scale which depends on the smallest empirical loss in the class and a term of a smaller order determined by the shatter coefficients of the whole class. Thus, we use global information to determine the scale of localization. Bartlett, Bousquet, and Mendelson [3] also derive closely related generalization bounds, based on localized Rademacher averages. In their approach the performance bounds also depend on Rademacher averages computed at different scales of localization, which are combined by the technique of peeling. For further recent related work we also refer to Bousquet [7], Bousquet, Koltchinskii, and Panchenko [8], and Tsybakov [24].

The rest of the paper is organized as follows. Section 2 presents some basic inequalities on model selection, which generalize some of the results in Bartlett, Boucheron, and Lugosi [2]. Section 3 proposes a simple but suboptimal penalty which already has some of the main features of the penalties presented in Section 4. It shows, in a transparent way, some of the underlying ideas of the main results. Section 4 introduces a new penalty based on the Rademacher average R_n(F̂_k) and it is shown that the new estimate yields an improvement of the desired form.
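The localization step itself is a one-line filter on empirical losses. Here is a minimal sketch (not from the paper) of forming the subclass F̂_k defined above; the constants κ_1, κ_2, κ_3 and the externally supplied empirical log shatter coefficient are placeholders.

```python
import numpy as np

def localized_subclass(classifiers, X, Y, log_shatter, k, kappas=(16.0, 1.0, 1.0)):
    """Keep only the classifiers whose empirical loss is at most
        kappa1 * min_f L_hat_n(f) + kappa2 * log_shatter / n + kappa3 * log(n*k) / n,
    mirroring the definition of the random subclass F_hat_k."""
    n = len(Y)
    losses = np.array([np.mean(f(X) != Y) for f in classifiers])
    threshold = (kappas[0] * losses.min()
                 + (kappas[1] * log_shatter + kappas[2] * np.log(n * k)) / n)
    return [f for f, loss in zip(classifiers, losses) if loss <= threshold]
```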

2. Preliminaries

In this section we present two basic auxiliary lemmata on model selection. The first lemma is general in the sense that it does not depend on the particular choice of the penalty Ĉ_{n,k}. This result was mentioned in the introduction and generalizes a result obtained by Bartlett, Boucheron, and Lugosi [2].

Lemma 2.1. Suppose that the random variables Ĉ_{n,1}, Ĉ_{n,2}, ... are such that

P{ Ĉ_{n,k} < (L − L̂_n)(f̂_{n,k}) } ≤ κ n^{-2} k^{-2}

for some κ > 0 and for all k. Then we have

E L(f̂_n) − L* ≤ inf_k ( L_k − L* + E Ĉ_{n,k} ) + 2κ/n².

It is clear that we can always take Ĉ_{n,k} ≤ 1.

Proof. Observe that

E[ sup_k ( (L − L̂_n)(f̂_{n,k}) − Ĉ_{n,k} ) ]
≤ P{ sup_k ( (L − L̂_n)(f̂_{n,k}) − Ĉ_{n,k} ) > 0 }   (since sup_k ( (L − L̂_n)(f̂_{n,k}) − Ĉ_{n,k} ) ≤ 1)
≤ Σ_{k=1}^∞ P{ (L − L̂_n)(f̂_{n,k}) > Ĉ_{n,k} }   (by the union bound)
≤ Σ_{k=1}^∞ κ n^{-2} k^{-2}   (by assumption)
≤ 2κ/n².

Therefore, we may conclude that

E L(f̂_n) = E[ L̂_n(f̂_n) + Ĉ_{n,K} ] + E[ (L − L̂_n)(f̂_n) − Ĉ_{n,K} ]   (where K is the selected model index, that is, f̂_n = f̂_{n,K})
≤ E[ inf_k ( L̂_n(f̂_{n,k}) + Ĉ_{n,k} ) ] + E[ sup_k ( (L − L̂_n)(f̂_{n,k}) − Ĉ_{n,k} ) ]   (by definition of f̂_n)
≤ E[ inf_k ( inf_{f ∈ F_k} L̂_n(f) + Ĉ_{n,k} ) ] + E[ sup_k ( (L − L̂_n)(f̂_{n,k}) − Ĉ_{n,k} ) ]   (by definition of f̂_{n,k})
≤ inf_k ( inf_{f ∈ F_k} E L̂_n(f) + E Ĉ_{n,k} ) + E[ sup_k ( (L − L̂_n)(f̂_{n,k}) − Ĉ_{n,k} ) ]   (interchange E and inf)
≤ inf_k ( L_k + E Ĉ_{n,k} ) + 2κ/n²   (by the preceding display),

and the proof is complete.

The preceding result is not entirely satisfactory on the following ground. Although it presents a sharp bound, it is a bound for the average risk behavior of f̂_n. However, the penalty is computed on the data at hand, and therefore the proposed criterion should have optimal performance for (almost) all possible sequences of the data. The following result presents a nonasymptotic oracle inequality which holds with large probability and an asymptotic almost sure version.

Lemma 2.2. Assume that for all n, k ≥ 1,

P{ Ĉ_{n,k} < (L − L̂_n)(f̂_{n,k}) } ≤ κ n^{-2} k^{-2}   and   P{ Ĉ_{n,k} < (L̂_n − L)(f_k^*) } ≤ κ n^{-2} k^{-2},

where f_k^* denotes a function in F_k with L(f_k^*) = L_k. Then for all n ≥ 1 we have

P{ L(f̂_n) − L* > inf_k ( L_k − L* + 2 Ĉ_{n,k} ) } ≤ 4κ/n²,

and the asymptotic almost sure bound

lim_{n→∞} I{ L(f̂_n) − L* ≤ inf_k ( L_k − L* + 2 Ĉ_{n,k} ) } = 1   almost surely.

Proof. Let K be the selected model index. Notice that

L(f̂_n) = L̂_n(f̂_n) + Ĉ_{n,K} + (L − L̂_n)(f̂_n) − Ĉ_{n,K}
≤ inf_k ( L̂_n(f̂_{n,k}) + Ĉ_{n,k} ) + sup_k ( (L − L̂_n)(f̂_{n,k}) − Ĉ_{n,k} )
≤ inf_k ( L̂_n(f_k^*) + Ĉ_{n,k} ) + sup_k ( (L − L̂_n)(f̂_{n,k}) − Ĉ_{n,k} )
≤ inf_k ( L_k + 2 Ĉ_{n,k} ) + sup_k ( (L̂_n − L)(f_k^*) − Ĉ_{n,k} ) + sup_k ( (L − L̂_n)(f̂_{n,k}) − Ĉ_{n,k} ).

By assumption, the last two terms on the right-hand side satisfy

P{ sup_k ( (L̂_n − L)(f_k^*) − Ĉ_{n,k} ) + sup_k ( (L − L̂_n)(f̂_{n,k}) − Ĉ_{n,k} ) > 0 } ≤ 2 Σ_{k=1}^∞ κ n^{-2} k^{-2} < 4κ/n²,

proving the first inequality. The almost sure statement is a direct consequence of the Borel-Cantelli lemma.

3. A simple version

The purpose of this short section is to offer a simplified, yet suggestive illustration of the ideas. As discussed in the introduction, an ideal penalty would be a tight upper bound for the expression on the right-hand side of (1.4). Motivated by this bound, we propose the simple penalty

Ĉ_{n,k} = 2 √( 2 L̂_n(f̂_{n,k}) ( 8 log S_k(2n) + 2 log(nk) ) / n ) + ( 8 log S_k(2n) + 2 log(nk) ) / n,

where S_k(2n) is the (worst-case) 2n-shatter coefficient defined in (1.1). Thus, the minimal loss L_k in class F_k is estimated by its natural empirical counterpart L̂_n(f̂_{n,k}) = inf_{f ∈ F_k} L̂_n(f), and the expected logarithmic shatter coefficient E log S_k(X_1^n) is estimated by the distribution-free upper bound log S_k(2n). (This term may be bounded further by V_k log(2n + 1), where V_k is the VC dimension of the class A_k.) The auxiliary terms of order (1/n) log(nk) are necessary to derive the desired oracle inequalities. The next theorem shows that the proposed penalty indeed works.

Theorem 3.1. Consider the penalized empirical loss minimizer f̂_n with the data-based penalty Ĉ_{n,k} defined above. Then for every n and for all distributions of (X, Y),

E L(f̂_n) − L* ≤ inf_k ( L_k − L* + E Ĉ_{n,k} ) + 16/n².

In particular,

E L(f̂_n) − L* ≤ inf_k [ L_k − L* + 4 √( L_k ( 8 log S_k(2n) + 2 log(nk) ) / n ) + 2 ( 8 log S_k(2n) + 2 log(nk) ) / n ] + 16/n².

The proof uses Lemma 2.1 and the following uniform deviation bound due to Vapnik and Chervonenkis [27]. (The slightly improved form used here is proved by Anthony and Shawe-Taylor [1].)

Proposition 3.2. Let S_k(X_1^{2n}) be the random shatter coefficient of A_k based on the i.i.d. observations X_1, ..., X_{2n}, defined as in (1.3). For all ε > 0 and n ≥ 1,

(3.1)  P{ sup_{f ∈ F_k} ( L(f) − 2 L̂_n(f) ) ≥ 2ε } ≤ 4 E S_k(X_1^{2n}) exp(−nε/4)

and

(3.2)  P{ sup_{f ∈ F_k} ( L̂_n(f) − 2 L(f) ) ≥ 2ε } ≤ 4 E S_k(X_1^{2n}) exp(−nε/4).

Proof. Observe that for all ε > 0 and n ≥ 1,

{ sup_{f ∈ F_k} ( L(f) − 2 L̂_n(f) ) ≥ 2ε } ⊂ { sup_{f ∈ F_k} ( L(f) − L̂_n(f) ) / √(L(f)) ≥ √ε },

and similarly,

{ sup_{f ∈ F_k} ( L̂_n(f) − 2 L(f) ) ≥ 2ε } ⊂ { sup_{f ∈ F_k} ( L̂_n(f) − L(f) ) / √(L̂_n(f)) ≥ √ε }.

The proposition then follows from the bounds of Anthony and Shawe-Taylor [1].

Proof of Theorem 3.1. We start with the proof of the first inequality of Theorem 3.1. In view of Lemma 2.1, it suffices to show that

P{ L(f̂_{n,k}) − L̂_n(f̂_{n,k}) > Ĉ_{n,k} } ≤ 8/(nk)².

By (3.1), applied with 2ε = ( 8 log S_k(2n) + 16 log(nk) ) / n,

P{ L(f̂_{n,k}) ≥ 2 L̂_n(f̂_{n,k}) + ( 8 log S_k(2n) + 16 log(nk) ) / n } ≤ 4 S_k(2n) exp( −( 8 log S_k(2n) + 16 log(nk) ) / 8 ) ≤ 4/(nk)²,

so that

P{ Ĉ_{n,k} < C_{n,k} } ≤ 4/(nk)²,

where

C_{n,k} = 2 √( L(f̂_{n,k}) ( log( 4 S_k(2n) ) + 2 log(nk) ) / n ).

Another application of inequality (3.1) yields

P{ L(f̂_{n,k}) − L̂_n(f̂_{n,k}) > Ĉ_{n,k} } ≤ P{ L(f̂_{n,k}) − L̂_n(f̂_{n,k}) > C_{n,k} } + 4/(nk)² ≤ 4 S_k(2n) exp( −log( 4 S_k(2n) ) − 2 log(nk) ) + 4/(nk)² ≤ 8/(nk)².

Conclude via Lemma 2.1 that

E L(f̂_n) ≤ min_k ( L_k + E Ĉ_{n,k} ) + 16/n².

For the second inequality, deduce that for all δ > 0,

E √( L̂_n(f̂_{n,k}) + δ ) ≤ √( E L̂_n(f̂_{n,k}) + δ ) ≤ √( E inf_{f ∈ F_k} L̂_n(f) + δ ) ≤ √( L_k + δ ),

by Jensen's inequality and the definition of f̂_{n,k}.

The bound of Theorem 3.1 has the right dependence on L_k as suggested by inequality (1.4) mentioned in the introduction. In particular, if L_k happens to equal zero for some class F_k, then the upper bound has an improved rate of convergence. The disadvantage of the simple penalty defined above is that instead of the expected shatter coefficients, a distribution-free (and therefore suboptimal) upper bound appears for each class F_k. Recently, Boucheron, Lugosi and Massart [4] proved that log S_k(X_1^n) concentrates sharply around its mean. For example, we have the following inequalities:

Proposition 3.3. For all ε > 0, n ≥ 1,

P{ E log S_k(X_1^n) > 2 log S_k(X_1^n) + 2ε } ≤ e^{−ε}

and

P{ log S_k(X_1^n) > 2 E log S_k(X_1^n) + 2ε } ≤ e^{−ε}.

Moreover, for each n ≥ 1,

E log S_k(X_1^n) ≤ log E S_k(X_1^n) ≤ (1/ln 2) E log S_k(X_1^n) ≤ 2 E log S_k(X_1^n).
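Proposition 3.3 is what lets one swap the expected log shatter coefficient for its empirical counterpart log S_k(X_1^n). For a finite class, the random shatter coefficient of (1.3) is just the number of distinct label patterns on the sample; a minimal sketch (not from the paper):

```python
import numpy as np

def random_shatter_coefficient(classifiers, X):
    """Random shatter coefficient S_k(X_1^n): the number of distinct binary vectors
    (f(X_1), ..., f(X_n)) obtained as f ranges over the (finite) class."""
    patterns = {tuple(f(X).astype(int)) for f in classifiers}
    return len(patterns)

# log S_k(X_1^n), usable as an empirical stand-in for E log S_k(X_1^n).
# Example with the hypothetical `stumps` and sample `X` from the earlier sketches:
# print(np.log(random_shatter_coefficient(stumps, X)))
```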

Proposition 3.3 implies that the expected random log shatter coefficient E log S_k(X_1^n) of F_k may be replaced by a constant times log S_k(X_1^n) and vice versa. Hence we may replace the distribution-free bounds log S_k(2n) by empirical estimates log S_k(X_1^n), at the price of slightly worse constants. The main oracle inequalities in Section 4 are accompanied by asymptotic almost-sure versions of bounds for the expected value. Such bounds are easy to obtain as well, simply by invoking Lemma 2.2 instead of Lemma 2.1. The details are omitted here.

4. Rademacher penalties

The main results of the paper are presented in this section. Assign, to each model class F_k,

(4.1)  û_{n,k} = 16 ( 4 log S_k(X_1^n) + 9 log(nk) ) / n,

with S_k(X_1^n) defined in (1.3), and the class

(4.2)  F̂_k = { f ∈ F_k : L̂_n(f) ≤ 16 L̂_n(f̂_{n,k}) + 15 û_{n,k} }.

Observe that the class F̂_k contains only those classifiers whose empirical loss is not much larger than that of the empirical minimizer. Note that the constant 16 has no special role; it has been chosen for convenience. Any constant larger than one would lead to similar results, at the price of modifying other constants. The term û_{n,k} depends on the shatter coefficient of the whole class F_k but it is typically small compared to L̂_n(f̂_{n,k}). The penalty is calculated in terms of the Rademacher average of this smaller class. More precisely, define the complexity estimate by

(4.3)  Ĉ_{n,k} = ( 8 R_n(F̂_k) + 10 ε_{n,k} + √( 2 ε_{n,k} ( 8 L̂_n(f̂_{n,k}) + 7 û_{n,k} ) ) ) ∧ 1,   where ε_{n,k} = (2/n) log(nk).

Again, not too much attention should be paid to the values of the constants involved. We favored simple readable proofs over optimal constants. Note that, through S_k(X_1^n), the penalty also depends on the random shatter coefficient of the whole class F_k. However, the term involving the shatter coefficient of the entire class F_k, of order (1/n) √( log(nk) log S_k(X_1^n) ), is typically much smaller (by a factor of order n^{-1/2}) than the Rademacher average of the whole class F_k. (For instance, see inequality (4.8) and Proposition 4.6 below.)
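Putting (4.1)-(4.3) together, the penalty is computable from the sample alone. The sketch below (not from the paper) strings the steps together for a finite class, with the Rademacher average approximated by Monte Carlo; the constants mirror the reconstruction above and, like all names here, should be read as illustrative placeholders.

```python
import numpy as np

def localized_rademacher_penalty(classifiers, X, Y, k, n_rounds=200, seed=0):
    """Localized Rademacher penalty, following the steps of (4.1)-(4.3):
    1. u_hat from the empirical log shatter coefficient and a log(nk)/n term,
    2. the subclass of classifiers with small empirical loss,
    3. a penalty built from the Rademacher average of that subclass, capped at 1."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    losses = np.array([np.mean(f(X) != Y) for f in classifiers])
    log_shatter = np.log(len({tuple(f(X).astype(int)) for f in classifiers}))
    u_hat = 16.0 * (4.0 * log_shatter + 9.0 * np.log(n * k)) / n
    min_loss = losses.min()
    keep = losses <= 16.0 * min_loss + 15.0 * u_hat
    errors = np.array([(f(X) != Y).astype(float)
                       for f, kept in zip(classifiers, keep) if kept])
    # Monte Carlo estimate of the conditional Rademacher average of the subclass.
    rad = np.mean([np.max(errors @ rng.choice([-1.0, 1.0], size=n)) / n
                   for _ in range(n_rounds)])
    eps = 2.0 * np.log(n * k) / n
    penalty = 8.0 * rad + 10.0 * eps + np.sqrt(2.0 * eps * (8.0 * min_loss + 7.0 * u_hat))
    return min(penalty, 1.0)
```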

We have the following performance bound for the expected loss of the minimizer f̂_n of the penalized empirical loss L̂_n(f) + Ĉ_{n,k}.

Theorem 4.1. For every n,

E L(f̂_n) − L* ≤ inf_k ( L_k − L* + E Ĉ_{n,k} ) + 22/n².

In addition, with probability greater than 1 − 44/n²,

L(f̂_n) − L* ≤ inf_k ( L_k − L* + 2 Ĉ_{n,k} ),

and also

lim_{n→∞} I{ L(f̂_n) − L* ≤ inf_k ( L_k − L* + 2 Ĉ_{n,k} ) } = 1   almost surely.

The next theorem is here to point out that the bound above is indeed a significant improvement over bounds of the type (1.2), and that the dependence on the minimal loss L_k and the random shatter coefficient has the form suggested by (1.4). For this purpose, we introduce

(4.4)  u_{n,k} = 16 ( 8 E log S_k(X_1^n) + 17 log(nk) ) / n

and the class

F̄_k = { f ∈ F_k : L(f) ≤ 64 L_k + 63 u_{n,k} }.

Recall from (4.3) that ε_{n,k} = (2/n) log(nk).

Theorem 4.2. The following oracle inequality holds:

E L(f̂_n) − L* ≤ min_{k ≥ 1} [ L_k − L* + 8 E R_n(F̄_k) + 15 ε_{n,k} + 16 √( ( L_k + u_{n,k} ) 2 ε_{n,k} ) ].

In particular, there exist universal constants κ_1 and κ_2 such that

E L(f̂_n) − L* ≤ inf_k [ L_k − L* + κ_1 √( L_k ( E log S_k(X_1^n) + log(nk) ) / n ) + κ_2 ( E log S_k(X_1^n) + log(nk) ) / n ].

This oracle inequality has the desired form outlined in the introduction and improves upon the results of [2] and [13]. For example, in the special case when L_k = 0 for k ≥ k_0, we obtain, for some numerical constants c_1 and c_2,

E L(f̂_n) ≤ min_{k ≥ k_0} c_1 ( E log S_k(X_1^n) + log(nk) ) / n + c_2/n²,

which is of a different order of magnitude from the penalties considered by [2] and [13]. Theorem 4.2 is only stated for the expected loss but an inequality which holds with large probability may be obtained just as in Theorem 4.1.

Proofs of Theorems 4.1 and 4.2. First, recall the definitions of û_{n,k} and u_{n,k} in (4.1) and (4.4), respectively, and in addition define

u_{n,k}^- = 8 ( 2 log E S_k(X_1^n) + 2 log(nk) ) / n

and the event

B_{n,k} := { u_{n,k}^- ≤ û_{n,k} ≤ u_{n,k} }.

Observe that Proposition 3.3 yields that, with probability at least 1 − 1/(nk)²,

u_{n,k}^- = 16 ( log E S_k(X_1^n) + log(nk) ) / n
≤ 16 ( 2 E log S_k(X_1^n) + log(nk) ) / n
≤ 16 ( 2 ( 2 log S_k(X_1^n) + 4 log(nk) ) + log(nk) ) / n = û_{n,k}

and

û_{n,k} = 16 ( 4 log S_k(X_1^n) + 9 log(nk) ) / n
≤ 16 ( 4 ( 2 E log S_k(X_1^n) + 2 log(nk) ) + 9 log(nk) ) / n = 16 ( 8 E log S_k(X_1^n) + 17 log(nk) ) / n = u_{n,k},

and therefore

(4.5)  P{ B_{n,k}^c } ≤ 1/(nk)².

Finally, we introduce the event

A_{n,k} = { sup_{f ∈ F_k} ( L(f) − 2 L̂_n(f) ) ≤ u_{n,k}^- } ∩ { sup_{f ∈ F_k} ( L̂_n(f) − 2 L(f) ) ≤ u_{n,k}^- }

and the class

F̃_k = { f ∈ F_k : L(f) ≤ 4 L_k + 3 u_{n,k}^- }.

The following intermediate result will be useful in the proofs of both theorems.

Lemma 4.3. We have

(4.6)  P{ A_{n,k} ∩ B_{n,k} } ≥ 1 − 9/(nk)²,

and on the set A_{n,k} ∩ B_{n,k}, the following hold:
(i) f̂_{n,k} ∈ F̃_k;
(ii) F̃_k ⊂ F̂_k, and in particular R_n(F̃_k) ≤ R_n(F̂_k);
(iii) L_k ≤ 2 L̂_n(f̂_{n,k}) + u_{n,k}^-.

Proof. To begin with, notice that

E S_k(X_1^{2n}) ≤ E[ S_k(X_1^n) S_k(X_{n+1}^{2n}) ] = ( E S_k(X_1^n) )²

by the definition of the shatter coefficient and by the independence of the X_i. Hence, by Proposition 3.2,

P{ A_{n,k}^c } ≤ 8 E S_k(X_1^{2n}) exp( −n u_{n,k}^- / 8 ) ≤ 8/(nk)².

This bound and (4.5) imply assertion (4.6).

To prove claim (i), observe that on A_{n,k},

L(f̂_{n,k}) ≤ 2 L̂_n(f̂_{n,k}) + u_{n,k}^-   (by definition of A_{n,k})
≤ 2 L̂_n(f_k^*) + u_{n,k}^-   (by definition of f̂_{n,k})
≤ 2 ( 2 L_k + u_{n,k}^- ) + u_{n,k}^-   (by definition of A_{n,k})
= 4 L_k + 3 u_{n,k}^-.

Thus f̂_{n,k} ∈ F̃_k. For claim (ii), notice that for any f ∈ F̃_k,

L̂_n(f) ≤ 2 L(f) + u_{n,k}^-   (by definition of A_{n,k})
≤ 2 ( 4 L_k + 3 u_{n,k}^- ) + u_{n,k}^-   (by definition of F̃_k)
= 8 L_k + 7 u_{n,k}^-
≤ 8 L(f̂_{n,k}) + 7 u_{n,k}^-   (by definition of L_k)
≤ 16 L̂_n(f̂_{n,k}) + 15 u_{n,k}^-   (by definition of A_{n,k})
≤ 16 L̂_n(f̂_{n,k}) + 15 û_{n,k}   (by definition of B_{n,k}).

Claim (ii) now follows. Claim (iii) is immediate from the definition of A_{n,k} since both f̂_{n,k} and f_k^* belong to F_k.

Next we link the Rademacher average R_n(F̃_k) to E sup_{f ∈ F̃_k} ( L(f) − L̂_n(f) ). By a classical symmetrization device (cf. Giné and Zinn [10] or van der Vaart and Wellner [25]),

(4.7)  E sup_{f ∈ F̃_k} ( L(f) − L̂_n(f) ) ≤ 2 E R_n(F̃_k).

Also, R_n(F̃_k) is known to concentrate sharply around its mean. For example, we have, by results of Boucheron, Lugosi, and Massart [4, 5], the following bounds.

Proposition 4.4. For all ε > 0, n ≥ 1,

P{ R_n(F̃_k) ≥ 2 E R_n(F̃_k) + ε } ≤ e^{−6nε/5}

and

P{ R_n(F̃_k) ≤ (1/2) E R_n(F̃_k) − ε } ≤ e^{−nε}.

Proof. Define Z := n R_n(F̃_k). Then it follows from Boucheron, Lugosi, and Massart [4] that

log E exp( λ(Z − EZ) ) ≤ EZ ( e^λ − 1 − λ ),

which implies further that for 0 ≤ λ < 3,

log E exp( λ(Z − EZ) ) ≤ λ² EZ / ( 2(1 − λ/3) ).

After an application of Markov's inequality we find

P{ Z ≥ EZ + √(2 EZ x) + x/3 } ≤ e^{−x}.

We obtain the desired upper-tail bound by inserting Z = n R_n(F̃_k) in the preceding display and invoking the inequality 2√(xy) ≤ x + y. The bound for the lower tail follows from the inequality

P{ Z ≤ EZ − √(2 x EZ) } ≤ e^{−x}

(see [4]) and since x + y ≥ 2√(xy).

Finally, we make key use of the following concentration inequality for the supremum of an empirical process, recently established by Talagrand [23]; see also Ledoux [14], Massart [19], Rio [21]. The best known constants reported here have been obtained by Bousquet [6].

Proposition 4.5. Set Σ²_{F̃_k} = sup_{f ∈ F̃_k} L(f)(1 − L(f)). For all ε > 0, n ≥ 1,

P{ sup_{f ∈ F̃_k} ( L(f) − L̂_n(f) ) ≥ 2 E sup_{f ∈ F̃_k} ( L(f) − L̂_n(f) ) + Σ_{F̃_k} √(2ε/n) + 4ε/(3n) } ≤ e^{−ε}.

We are now ready to prove Theorems 4.1 and 4.2.

Proof of Theorem 4.1. Deduce, using (i), (ii) and (iii) of Lemma 4.3, the following string of inequalities:

P{ L(f̂_{n,k}) ≥ L̂_n(f̂_{n,k}) + Ĉ_{n,k}, A_{n,k} ∩ B_{n,k} }
= P{ L(f̂_{n,k}) ≥ L̂_n(f̂_{n,k}) + 8 R_n(F̂_k) + 10 ε_{n,k} + √( ( 8 L̂_n(f̂_{n,k}) + 7 û_{n,k} ) 2 ε_{n,k} ), A_{n,k} ∩ B_{n,k} }
≤ P{ ∃ f ∈ F̃_k : L(f) ≥ L̂_n(f) + 8 R_n(F̂_k) + 10 ε_{n,k} + √( ( 8 L̂_n(f̂_{n,k}) + 7 û_{n,k} ) 2 ε_{n,k} ), A_{n,k} ∩ B_{n,k} }   (by property (i))
≤ P{ ∃ f ∈ F̃_k : L(f) ≥ L̂_n(f) + 8 R_n(F̃_k) + 10 ε_{n,k} + √( ( 8 L̂_n(f̂_{n,k}) + 7 u_{n,k}^- ) 2 ε_{n,k} ), A_{n,k} ∩ B_{n,k} }   (by property (ii) and definition of B_{n,k})
≤ P{ ∃ f ∈ F̃_k : L(f) ≥ L̂_n(f) + 8 R_n(F̃_k) + 10 ε_{n,k} + √( ( 4 L_k + 3 u_{n,k}^- ) 2 ε_{n,k} ) }   (by property (iii))
≤ P{ sup_{f ∈ F̃_k} ( L(f) − L̂_n(f) ) ≥ 8 R_n(F̃_k) + 10 ε_{n,k} + Σ_{F̃_k} √( 2 ε_{n,k} ) },

where the last inequality follows from

Σ²_{F̃_k} = sup_{f ∈ F̃_k} Var( I{f(X) ≠ Y} ) ≤ sup_{f ∈ F̃_k} L(f) ≤ 4 L_k + 3 u_{n,k}^-.

Invoke inequality (4.7), inequality (4.6) and Propositions 4.4 and 4.5 above to conclude that

P{ L(f̂_{n,k}) ≥ L̂_n(f̂_{n,k}) + Ĉ_{n,k} }
≤ P{ sup_{f ∈ F̃_k} ( L(f) − L̂_n(f) ) ≥ 8 R_n(F̃_k) + 10 ε_{n,k} + Σ_{F̃_k} √(2 ε_{n,k}) } + 9/(nk)²   (since P{ (A_{n,k} ∩ B_{n,k})^c } ≤ 9/(nk)² by (4.6) in Lemma 4.3)
≤ P{ sup_{f ∈ F̃_k} ( L(f) − L̂_n(f) ) ≥ 4 E R_n(F̃_k) + 2 ε_{n,k} + Σ_{F̃_k} √(2 ε_{n,k}) } + 10/(nk)²   (by Proposition 4.4)
≤ P{ sup_{f ∈ F̃_k} ( L(f) − L̂_n(f) ) ≥ 2 E sup_{f ∈ F̃_k} ( L(f) − L̂_n(f) ) + (4/3) ε_{n,k} + Σ_{F̃_k} √(2 ε_{n,k}) } + 10/(nk)²   (by (4.7))
≤ 11/(n²k²)   (by Proposition 4.5).

This proves the first assertion. The almost sure statement follows by invoking Lemma 2.2 and the preceding argument (which also shows that

P{ Ĉ_{n,k} < (L̂_n − L)(f_k^*) } ≤ 11/(n²k²),

although the last assertion could be shown in a much easier way as it only involves a single function f_k^*). Theorem 4.1 follows from Lemma 2.1 and Lemma 2.2.

In the proof of Theorem 4.2 we need the symmetrization device

(4.8)  E R_n(F̄_k) ≤ 2 E sup_{f ∈ F̄_k} | L̂_n(f) − L(f) | + sup_{f ∈ F̄_k} √( L(f) / n )

(see, e.g., Mendelson [20], p. 18), and also the following result due to Massart [18]. (The version stated here is taken from Lugosi [16].)

Proposition 4.6. Set Σ² = sup_{f ∈ F̄_k} L(f)(1 − L(f)). Then for all n ≥ 1,

E sup_{f ∈ F̄_k} | L(f) − L̂_n(f) | ≤ (8/n) E log( 2 S_k(X_1^{2n}) ) + 2Σ √( E log( 2 S_k(X_1^{2n}) ) / n ).

Proof. The statement follows almost immediately from Theorem 1.10 in Lugosi [16] by noting that the worst-case shatter coefficients may be replaced with impunity by the random shatter coefficients.

Proof of Theorem 4.2. Observe that on the event A_{n,k} ∩ B_{n,k}, F̂_k ⊂ F̄_k, where F̄_k is as defined in Theorem 4.2. Indeed, for any f ∈ F̂_k,

L(f) ≤ 2 L̂_n(f) + u_{n,k}^-   (by definition of A_{n,k})
≤ 2 ( 16 L̂_n(f̂_{n,k}) + 15 û_{n,k} ) + u_{n,k}   (by definition of F̂_k)
≤ 32 L̂_n(f̂_{n,k}) + 31 u_{n,k}   (by definition of B_{n,k})
≤ 32 L̂_n(f_k^*) + 31 u_{n,k}   (by definition of f̂_{n,k})
≤ 32 ( 2 L_k + u_{n,k} ) + 31 u_{n,k}   (by definition of A_{n,k})
= 64 L_k + 63 u_{n,k}.

Also, we notice that on the event A_{n,k},

L̂_n(f̂_{n,k}) ≤ L̂_n(f_k^*) ≤ 2 L_k + u_{n,k}.

These observations imply that

Ĉ_{n,k} I_{A_{n,k} ∩ B_{n,k}} ≤ 8 R_n(F̄_k) + 10 ε_{n,k} + √( ( 64 L_k + 63 u_{n,k} ) 2 ε_{n,k} ) ≤ 8 R_n(F̄_k) + 10 ε_{n,k} + 16 √( ( L_k + u_{n,k} ) 2 ε_{n,k} ).

Consequently, it follows from Lemma 4.3 above that

E Ĉ_{n,k} ≤ E[ Ĉ_{n,k} I_{A_{n,k} ∩ B_{n,k}} ] + P{ (A_{n,k} ∩ B_{n,k})^c }
≤ 8 E R_n(F̄_k) + 10 ε_{n,k} + 16 √( ( L_k + u_{n,k} ) 2 ε_{n,k} ) + 9/(nk)²
≤ 8 E R_n(F̄_k) + 15 ε_{n,k} + 16 √( ( L_k + u_{n,k} ) 2 ε_{n,k} ).

This bound and Theorem 4.1 yield the first inequality of Theorem 4.2. The second inequality follows from the symmetrization inequality (4.8) and Proposition 4.6 above.

Acknowledgements. We thank Olivier Bousquet for his invaluable remarks and advice. We also appreciate the helpful remarks by two referees.

References

[1] M. Anthony and J. Shawe-Taylor. A result of Vapnik with applications. Discrete Applied Mathematics, 47.
[2] P. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85-113.
[3] P. Bartlett, O. Bousquet, and S. Mendelson. Localized Rademacher complexities. In Proceedings of the 15th Annual Conference on Computational Learning Theory, pages 44-48.
[4] S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with applications. Random Structures and Algorithms, 16.
[5] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities using the entropy method. Annals of Probability, to appear.
[6] O. Bousquet. A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Acad. Sci. Paris, 334.
[7] O. Bousquet. New approaches to statistical learning theory. Annals of the Institute of Statistical Mathematics.
[8] O. Bousquet, V. Koltchinskii, and D. Panchenko. Some local measures of complexity of convex hulls and generalization bounds. In Proceedings of the 15th Annual Conference on Computational Learning Theory. Springer.
[9] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
[10] E. Giné and J. Zinn. Some limit theorems for empirical processes. Annals of Probability, 12.
[11] V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47.
[12] V. Koltchinskii and D. Panchenko. Rademacher processes and bounding the risk of function learning. In E. Giné, D. M. Mason, and J. A. Wellner, editors, High Dimensional Probability II.
[13] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30.
[14] M. Ledoux. On Talagrand's deviation inequalities for product measures. ESAIM: Probability and Statistics, 1:63-87.
[15] F. Lozano. Model selection using Rademacher penalization. In Proceedings of the Second ICSC Symposium on Neural Computation (NC2000). ICSC Academic Press, 2000.

[16] G. Lugosi. Pattern classification and learning theory. In L. Györfi, editor, Principles of Nonparametric Learning. Springer, Vienna.
[17] G. Lugosi and A. Nobel. Adaptive model selection using empirical complexities. Annals of Statistics, 27.
[18] P. Massart. About the constants in Talagrand's concentration inequalities for empirical processes. Annals of Probability, 28.
[19] P. Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, IX.
[20] S. Mendelson. A few notes on statistical learning theory. In S. Mendelson and A. Smola, editors, Advanced Lectures in Machine Learning, LNCS 2600. Springer.
[21] E. Rio. Inégalités de concentration pour les processus empiriques de classes de parties. Probability Theory and Related Fields, 119.
[22] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44.
[23] M. Talagrand. A new look at independence. Annals of Probability, 24:1-34. (Special Invited Paper).
[24] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. C. R. Acad. Sci. Paris, to appear.
[25] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York.
[26] V. N. Vapnik. Statistical Learning Theory. John Wiley, New York.
[27] V. N. Vapnik and A. Ya. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow (in Russian); German translation: Theorie der Zeichenerkennung, Akademie Verlag, Berlin.

Department of Economics, Pompeu Fabra University, Ramon Trias Fargas 25-27, Barcelona, Spain
E-mail address: lugosi@upf.es

Department of Statistics, Yale University, P.O. Box, New Haven, CT, United States of America
E-mail address: marten.wegkamp@yale.edu
