A survey on penalized empirical risk minimization
Sara A. van de Geer


We address the question of how to choose the penalty in empirical risk minimization. Roughly speaking, this penalty should be a good bound for the estimation error. The main point, however, is that the estimation error depends on unknown parameters. We discuss a nonlocal estimate of the estimation error. Moreover, we show that the $\ell_1$ penalty allows one to avoid explicitly estimating the estimation error.

The framework is as follows. Let the data $X_1, \dots, X_n$ be i.i.d. copies of a random variable $X \in \mathcal{X}$ with distribution $P$. The empirical distribution is $P_n = \sum_{i=1}^n \delta_{X_i}/n$. We are interested in the parameter $f_0 \in \Lambda$, $(\Lambda, d)$ being a metric space. This parameter $f_0$ is defined as the minimizer of the theoretical loss

$$R(f) := P\gamma_f, \quad f \in \Lambda,$$

where $\gamma_f : \mathcal{X} \to \mathbb{R}$ is a given loss function. To estimate $f_0$, we replace $R(f)$ by its empirical counterpart $R_n(f) := P_n\gamma_f$. Next, we choose a model class $\mathcal{F} \subset \Lambda$, and define the empirical risk minimizer

$$\hat f_n = \arg\min_{f \in \mathcal{F}} R_n(f).$$

Generally, it is necessary to choose a model class $\mathcal{F}$ which is strictly smaller than $\Lambda$. This is because $\Lambda$ may be a very rich set, and empirical risk minimization over $\Lambda$ may lead to overfitting the data.

Given the model class $\mathcal{F}$, the approximation error is defined as

$$B^2 = R(f^*) - R(f_0),$$

where $f^* = \arg\min_{f \in \mathcal{F}} R(f)$ is the minimizer over the class $\mathcal{F}$. The estimation error is

$$V = R(\hat f_n) - R(f^*).$$

The excess risk of $\hat f_n$ is $R(\hat f_n) - R(f_0)$. Thus we have a bias-variance type decomposition for the excess risk:

$$R(\hat f_n) - R(f_0) = B^2 + V.$$

Note that both the approximation error and the estimation error depend on $\mathcal{F}$. We express this by writing $B^2 = B^2(\mathcal{F})$ and $V = V(\mathcal{F})$.

Consider now a collection of candidate models $\{\mathcal{F}\}$. The optimal model $\mathcal{F}_{\mathrm{oracle}}$ is then the one which optimally trades off approximation error and estimation error, i.e.,

$$\mathcal{F}_{\mathrm{oracle}} = \arg\min_{\mathcal{F} \in \{\mathcal{F}\}} \{ B^2(\mathcal{F}) + V(\mathcal{F}) \}.$$

Our aim is to find an estimator that mimics this trade-off. The following elementary lemma tells us that we can bound the estimation error by the empirical process $\nu_n$, defined by

$$\nu_n(f) = \sqrt{n}\,(R_n(f) - R(f)).$$
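The decomposition of the excess risk and the empirical process can be made concrete on a toy problem. The following is a minimal sketch, not taken from the survey: it assumes threshold classifiers on Uniform[0, 1] inputs with randomly flipped labels, chosen only because the theoretical risk is then available in closed form. All names (`NOISE`, `T0`, `true_risk`, `sample`, `empirical_risk`, `analyse`) are hypothetical.

```python
import math
import random

NOISE = 0.1  # label-flip probability (an assumption of this toy example)
T0 = 0.3     # threshold of the overall minimizer f_0 (also assumed)

def true_risk(t):
    # Theoretical risk R(f_t) = P(Y != 1{U > t}) for U ~ Uniform[0, 1].
    return NOISE + (1 - 2 * NOISE) * abs(t - T0)

def sample(n, rng):
    # Draw n i.i.d. pairs (U, Y); Y = 1{U > T0}, flipped with probability NOISE.
    data = []
    for _ in range(n):
        u = rng.random()
        y = int(u > T0)
        if rng.random() < NOISE:
            y = 1 - y
        data.append((u, y))
    return data

def empirical_risk(t, data):
    # Empirical risk R_n(f_t): fraction of misclassified sample points.
    return sum(int(u > t) != y for u, y in data) / len(data)

def analyse(k, data):
    # Model class: thresholds j/k, j = 0..k. Returns the approximation error,
    # the estimation error, and -[nu_n(f_hat) - nu_n(f_star)] / sqrt(n).
    grid = [j / k for j in range(k + 1)]
    n = len(data)
    t_hat = min(grid, key=lambda t: empirical_risk(t, data))  # empirical minimizer
    t_star = min(grid, key=true_risk)                         # minimizer of R over the class
    nu = lambda t: math.sqrt(n) * (empirical_risk(t, data) - true_risk(t))
    b2 = true_risk(t_star) - true_risk(T0)
    v = true_risk(t_hat) - true_risk(t_star)
    bound = -(nu(t_hat) - nu(t_star)) / math.sqrt(n)
    return b2, v, bound

if __name__ == "__main__":
    rng = random.Random(0)
    data = sample(500, rng)
    for k in (2, 5, 20):
        b2, v, bound = analyse(k, data)
        print(f"k={k:2d}  B^2={b2:.3f}  V={v:.3f}  empirical-process bound={bound:.3f}")
```

In this sketch the excess risk of the empirical minimizer is the sum of the two returned errors by construction, and the third returned quantity is never smaller than the estimation error, because the empirical minimizer has empirical risk no larger than that of the minimizer of the theoretical risk. Finer grids reduce the approximation error, while the estimation error has more room to grow.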

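Elementary lemma 1. Let $\hat f_n = \arg\min_{f \in \mathcal{F}} R_n(f)$ and $f^* = \arg\min_{f \in \mathcal{F}} R(f)$. Then we have the following bound for the estimation error $V := R(\hat f_n) - R(f^*)$:

$$V \le -[\nu_n(\hat f_n) - \nu_n(f^*)]/\sqrt{n}.$$

The next lemma indicates that in penalized empirical risk minimization, one should take the penalty, $\widehat{\mathrm{pen}}(\mathcal{F})$, equal to a good bound for the estimation error.

Elementary lemma 2. Let $\hat f_n(\mathcal{F}) = \arg\min_{f \in \mathcal{F}} R_n(f)$ and

$$\hat{\mathcal{F}} = \arg\min_{\mathcal{F} \in \{\mathcal{F}\}} \{ R_n(\hat f_n(\mathcal{F})) + \widehat{\mathrm{pen}}(\mathcal{F}) \}.$$

Fix some $\mathcal{F}^* \in \{\mathcal{F}\}$ and some $f^* \in \mathcal{F}^*$, and define the approximation error $B^2(\mathcal{F}^*) = R(f^*) - R(f_0)$ and the estimation error bound

$$V(\mathcal{F}) = -[\nu_n(\hat f_n(\mathcal{F})) - \nu_n(f^*)]/\sqrt{n}. \qquad (1)$$

Suppose that with probability at least $1 - \epsilon$, we have

$$\widehat{\mathrm{pen}}(\mathcal{F}) \ge V(\mathcal{F}), \quad \forall\, \mathcal{F}.$$

Then with probability at least $1 - \epsilon$,

$$R(\hat f_n(\hat{\mathcal{F}})) - R(f_0) \le B^2(\mathcal{F}^*) + \widehat{\mathrm{pen}}(\mathcal{F}^*).$$

Concentration inequalities provide exponential probability inequalities for the concentration of the supremum of the empirical process around its mean (see e.g. [9]). One may now derive a nonlocal bound for $V(\mathcal{F})$ defined in (1). Note first that for a non-random choice of $f^*$,

$$E V(\mathcal{F}) = -E[\nu_n(\hat f_n)]/\sqrt{n} \le E\|R_n - R\|_{\mathcal{F}},$$

where we use the notation $\|\cdot\|_{\mathcal{F}}$ for the sup-norm over the class of functions $\mathcal{F}$. Moreover,

$$E\|R_n - R\|_{\mathcal{F}} \le 2E\|R_n^\sigma\|_{\mathcal{F}},$$

with $R_n^\sigma(f) = \sum_{i=1}^n \sigma_i\gamma_f(X_i)/n$ being the symmetrized version involving the Rademacher sequence $\{\sigma_i\}_{i=1}^n$. The latter is defined as a sequence of i.i.d. random variables, independent of $\{X_i\}_{i=1}^n$, with $P(\sigma_i = 1) = P(\sigma_i = -1) = 1/2$ $(i = 1, \dots, n)$. Finally,

$$E\|R_n^\sigma\|_{\mathcal{F}} = E\,E_{X_1,\dots,X_n}\|R_n^\sigma\|_{\mathcal{F}},$$

where $E_{X_1,\dots,X_n}$ denotes conditional expectation given $X_1, \dots, X_n$. Concentration inequalities (see [5]) now tell us (under conditions) that, with probability $1 - \epsilon$, up to an $n^{-1/2}$ term involving $\epsilon$, $2E_{X_1,\dots,X_n}\|R_n^\sigma\|_{\mathcal{F}}$ is a bound for $V(\mathcal{F})$. If we use this bound, it is rather difficult to get rid of the $n^{-1/2}$ term and establish rates faster than $n^{-1/2}$. The reason is that our estimate of the estimation error is a nonlocal one.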
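The conditional Rademacher average in the bound above depends only on the observed data, so it can be approximated by simulation. The following is a minimal sketch under assumptions that are not in the survey: a finite class of threshold classifiers with 0/1 loss on synthetic data, and a plain Monte Carlo average over random sign vectors. The names `threshold_losses` and `rademacher_bound` and all parameter values are hypothetical.

```python
import random

def threshold_losses(n, k, rng, noise=0.1, t0=0.3):
    # Toy data (assumed): U ~ Uniform[0, 1], Y = 1{U > t0} flipped w.p. noise.
    # Returns one row per threshold j/k holding its 0/1 losses on the sample.
    us = [rng.random() for _ in range(n)]
    ys = [int(u > t0) ^ int(rng.random() < noise) for u in us]
    return [[int(int(u > j / k) != y) for u, y in zip(us, ys)]
            for j in range(k + 1)]

def rademacher_bound(loss_matrix, n_draws=200, rng=None):
    # Monte Carlo estimate of 2 * E_sigma sup_f |(1/n) sum_i sigma_i * loss_f(X_i)|,
    # with the data (the loss matrix) held fixed and only the signs resampled.
    if rng is None:
        rng = random.Random(0)
    n = len(loss_matrix[0])
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(abs(sum(s * l for s, l in zip(sigma, row))) / n
                     for row in loss_matrix)
    return 2 * total / n_draws

if __name__ == "__main__":
    rng = random.Random(1)
    for n in (50, 200, 800):
        losses = threshold_losses(n, k=10, rng=rng)
        print(n, round(rademacher_bound(losses, rng=rng), 3))
```

For a fixed finite class the printed values should shrink roughly like the inverse square root of the sample size, which is the nonlocal rate discussed above; the sketch does not attempt any localization.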

We will now illustrate that generally, the estimation error is smaller than $O(n^{-1/2})$. More details are e.g. in [3], [4], [5] and [8]. We introduce the following two conditions, which both involve the same parameter $0 < \beta \le 1$.

Margin condition. Let $G(x) = \int_0^x g(z)\,dz$, with $g$ a strictly increasing function on the positive halfline, having $g(0) = 0$. Suppose

$$R(f) - R(f_0) \ge G(d^\beta(f, f_0)), \quad f \in \Lambda.$$

Empirical process condition. Let $f^* = \arg\min_{f \in \mathcal{F}} R(f)$. Suppose that for some positive constants $d_n$ and $C_n$, we have with probability at least $1 - \epsilon$

$$\sup_{f \in \mathcal{F}} \frac{|\nu_n(f) - \nu_n(f^*)|}{d^\beta(f, f^*) + d_n^\beta} \le C_n.$$

Lemma 3. Assume the margin condition and the empirical process condition. Let $\hat f_n = \arg\min_{f \in \mathcal{F}} R_n(f)$, and $B^2 = R(f^*) - R(f_0)$. Let $0 < \delta < 1$. With probability at least $1 - \epsilon$, we have

$$R(\hat f_n) - R(f_0) \le \frac{1+\delta}{1-\delta} \{ B^2 + V_n + n^{-1/2} d_n^\beta C_n \},$$

where

$$V_n = 2\delta H\!\left(\frac{C_n}{\sqrt{n}\,\delta}\right),$$

and $H(x) = \int_0^x g^{-1}(z)\,dz$.

As a typical example, suppose we have $\beta = 1$ and that $g$ is the identity. Then $G(x) = H(x) = x^2/2$, and we find

$$V_n = 2\delta \cdot \frac{1}{2}\left(\frac{C_n}{\sqrt{n}\,\delta}\right)^2 = \frac{C_n^2}{n\delta}.$$

The constant $C_n^2$ is typically something like the dimension or a more general measure of complexity of $\mathcal{F}$. If it does not grow too fast in $n$, and if in addition $d_n$ decreases fast in $n$, we indeed arrive at an estimation error of order smaller than $n^{-1/2}$. It will be clear however that in general it is not obvious how to verify the conditions, as they depend on the underlying distribution. In particular, it is often not clear what the function $g$ is in the margin condition. Thus, we do not know how large $V_n$ is. However, as is shown in the literature (see for example [1], [2], [5], [6], [7], [11]), there are ways to obtain a good local estimate.

We now turn to $\ell_1$ penalization, to avoid the problem of unknown margin behavior. Let $\gamma_f = \gamma \circ f$, and suppose $\gamma$ is convex, and Lipschitz with Lipschitz constant 1. Suppose $\Lambda \subset L_2(\nu)$, with $\nu$ some measure on $\mathcal{X}$. Let $\mathcal{F}_m$ be a convex subset of $\{ f_\alpha = \sum_{k=1}^m \alpha_k\psi_k \}$, where $\{\psi_k\}_{k=1}^m \subset L_2(\nu)$ are given base functions. We assume that $m \le n^D$ for some $D \ge 1$. Also, we assume a bound on $\max_{k=1,\dots,m} \|\psi_k\|$ involving $\log n$ [...].

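We consider the estimator

$$\hat f_n = \arg\min_{f_\alpha \in \mathcal{F}_m} \Big\{ R_n(f_\alpha) + \hat\lambda_n \sum_{k=1}^m |\alpha_k| \Big\}.$$

Here, we take $\hat\lambda_n$ proportional to $\hat\Psi_n$, with the constant 864 and further factors involving $\log n$ and $D$ [...], with

$$\hat\Psi_n^2 = \max_{k=1,\dots,m} P_n\psi_k^2.$$

We let

$$\Psi_0^2 = \max_{k=1,\dots,m} P\psi_k^2,$$

and let $\lambda_n$ be the theoretical counterpart of the smoothing parameter $\hat\lambda_n$, i.e.

$$\lambda_n = \hat\lambda_n \frac{\Psi_0}{\hat\Psi_n}.$$

Now, our further conditions depend on the unknown underlying distribution, so we call them non-verifiable conditions. Note however that our estimation procedure does not require them to be verifiable.

Non-verifiable conditions.
The margin condition holds.
It holds that $\|f - \tilde f\|_{2,\nu} \le d^\beta(f, \tilde f)$ for all $f, \tilde f \in \mathcal{F}_m$. Here $\beta$ is from the margin condition, and $\|\cdot\|_{2,\nu}$ denotes the $L_2(\nu)$-norm.
It holds that $\|f - \tilde f\|$ is bounded in terms of $K_n\, d(f, \tilde f)$ [...] for all $f, \tilde f \in \mathcal{F}$. Here $K_n$ is a sequence satisfying a growth condition (see Theorem 4).
For some diagonal matrix $W = \mathrm{diag}(w_1, \dots, w_m)$ of positive weights, the matrix $W\Sigma_\nu W$ has smallest eigenvalue equal to one. Here $\Sigma_\nu = \int \psi\psi^T\,d\nu$ with $\psi = (\psi_1, \dots, \psi_m)^T$.

We now define the estimation error bound as

$$V_n(\alpha) = 2\delta H(18\lambda_n C(\alpha)/\delta),$$

with $H(x) = \int_0^x g^{-1}(z)\,dz$, and with

$$C^2(\alpha) = D \sum_{k:\,\alpha_k \ne 0} w_k^2.$$

Let

$$\epsilon_n = \frac{1+\delta}{1-\delta} \min_{f_\alpha \in \mathcal{F}_m} \{ R(f_\alpha) - R(f_0) + V_n(\alpha) + 2\lambda_n\,[\ldots] \},$$

where the last term carries a factor involving $\log n$ [...].

The following theorem is a generalization of the result in [10].

Theorem 4. Consider the estimator

$$\hat f_n = \arg\min_{f_\alpha \in \mathcal{F}_m} \Big\{ R_n(f_\alpha) + \hat\lambda_n \sum_{k=1}^m |\alpha_k| \Big\}.$$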

Assume the non-verifiable conditions with growth rate condition $K_n^\beta G^{-1}(\epsilon_n) \le 1$. Then there is a universal constant $c$, such that with probability at least $1 - c/n^2$, we have

$$R(\hat f_n) - R(f_0) \le \epsilon_n.$$

References

[1] Audibert, J.-Y., Classification under polynomial entropy and margin assumptions and randomized estimators, Preprint, Laboratoire de Probabilités et Modèles Aléatoires (2004).
[2] Bartlett, P.L., Bousquet, O. and Mendelson, S., Local Rademacher complexities, Ann. Statist. 33 (2005).
[3] Blanchard, G., Lugosi, G. and Vayatis, N., On the rate of convergence of regularized boosting classifiers, J. Machine L. Research 4 (2003).
[4] Blanchard, G., Bousquet, O. and Massart, P., Statistical performance of support vector machines, Manuscript (2004).
[5] Bousquet, O., Boucheron, S. and Lugosi, G., Theory of classification: a survey of recent advances, (2005). To appear in ESAIM: Probability and Statistics.
[6] Koltchinskii, V., Local Rademacher complexities and oracle inequalities in risk minimization, (2003). To appear in Ann. Statist.
[7] Lugosi, G. and Wegkamp, M., Complexity regularization via localized random penalties, Ann. Statist. 32 (2004).
[8] Massart, P., Some applications of concentration inequalities to statistics, Annales de la Faculté de Toulouse 9 (2000).
[9] Massart, P., About the constants in Talagrand's concentration inequalities for empirical processes, Ann. Probab. 28 (2000).
[10] Tarigan, B. and van de Geer, S.A., Classifiers of support vector machine type, with $\ell_1$ complexity regularization, submitted (2005).
[11] Tsybakov, A.B., Optimal aggregation of classifiers in statistical learning, Ann. Statist. 32 (2004).
