Data-driven calibration of linear estimators with minimal penalties


Sylvain Arlot
CNRS; Willow Project-Team, Laboratoire d'Informatique de l'Ecole Normale Superieure (CNRS/ENS/INRIA UMR 8548), 23, avenue d'Italie, Paris, France
sylvain.arlot@ens.fr

Francis Bach
INRIA; Willow Project-Team, Laboratoire d'Informatique de l'Ecole Normale Superieure (CNRS/ENS/INRIA UMR 8548), 23, avenue d'Italie, Paris, France
francis.bach@ens.fr

Abstract

This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty, which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows' C_L penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly upon existing calibration procedures such as 10-fold cross-validation or generalized cross-validation.

1 Introduction

Kernel-based methods are now well-established tools for supervised learning, allowing one to perform various tasks, such as regression or binary classification, with linear and non-linear predictors [1, 2]. A central issue common to all regularization frameworks is the choice of the regularization parameter: while most practitioners use cross-validation procedures to select such a parameter, data-driven procedures not based on cross-validation are rarely used. The choice of the kernel, a seemingly unrelated issue, is also important for good predictive performance: several techniques exist, either based on cross-validation, Gaussian processes or multiple kernel learning [3, 4, 5].

In this paper, we consider least-squares regression and cast these two problems as the problem of selecting among several linear estimators, where the goal is to choose an estimator with a quadratic risk as small as possible. This problem includes for instance model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning (see Section 2).

The main contribution of the paper is to extend the notion of minimal penalty [6, 7] to all discrete classes of linear operators, and to use it for defining a fully data-driven selection algorithm satisfying a non-asymptotic oracle inequality. Our new theoretical results, presented in Section 4, extend similar results which were limited to unregularized least-squares regression (i.e., projection operators). Finally, in Section 5, we show that our algorithm improves the performance of classical selection procedures, such as GCV [8] and 10-fold cross-validation, for kernel ridge regression and multiple kernel learning, for moderate values of the sample size.

2 Linear estimators

In this section, we define the problem we aim to solve and give several examples of linear estimators.

2.1 Framework and notation

Let us assume that one observes Y_i = f(x_i) + ε_i ∈ R for i = 1, ..., n, where ε_1, ..., ε_n are i.i.d. centered random variables with E[ε_i²] = σ² unknown, f is an unknown measurable function X → R and x_1, ..., x_n ∈ X are deterministic design points. No assumption is made on the set X. The goal is to reconstruct the signal F = (f(x_i))_{1 ≤ i ≤ n} ∈ R^n with some estimator F̂ ∈ R^n, depending only on (x_1, Y_1), ..., (x_n, Y_n), and having a small quadratic risk (1/n)‖F̂ − F‖₂², where for t ∈ R^n we denote by ‖t‖₂ the ℓ₂-norm of t, defined as ‖t‖₂² := Σ_{i=1}^n t_i².

In this paper, we focus on linear estimators F̂ that can be written as a linear function of Y = (Y_1, ..., Y_n) ∈ R^n, that is, F̂ = AY, for some (deterministic) n × n matrix A. Here and in the rest of the paper, vectors such as Y or F̂ are assumed to be column vectors. We present in Section 2.2 several important families of estimators of this form. The matrix A may depend on x_1, ..., x_n (which are known and deterministic), but not on Y, and may be parameterized by certain quantities, usually a regularization parameter or kernel combination weights.

2.2 Examples of linear estimators

In this paper, our theoretical results apply to matrices A which are symmetric positive semi-definite, such as the ones defined below.

Ordinary least-squares regression / model selection. If we consider linear predictors from a design matrix X ∈ R^{n×p}, then F̂ = AY with A = X(XᵀX)^{-1}Xᵀ, which is a projection matrix (i.e., AᵀA = A); F̂ = AY is often called a projection estimator. In the variable selection setting, one wants to select a subset J ⊂ {1, ..., p}, and the matrices A are parameterized by J.

Kernel ridge regression / spline smoothing. We assume that a positive definite kernel k : X × X → R is given, and we are looking for a function f : X → R in the associated reproducing kernel Hilbert space (RKHS) F, with norm ‖·‖_F. If K denotes the n × n kernel matrix, defined by K_ab = k(x_a, x_b), then the ridge regression estimator (a.k.a. spline smoothing estimator for spline kernels [9]) is obtained by minimizing with respect to f ∈ F [2]:

(1/n) Σ_{i=1}^n (Y_i − f(x_i))² + λ ‖f‖²_F.

The unique solution is equal to f̂ = Σ_{i=1}^n α_i k(·, x_i), where α = (K + nλ I_n)^{-1} Y. This leads to the smoothing matrix A_λ = K(K + nλ I_n)^{-1}, parameterized by the regularization parameter λ ∈ R₊.

Multiple kernel learning / Group Lasso / Lasso. We now assume that we have p different kernels k_j, feature spaces F_j and feature maps Φ_j : X → F_j, j = 1, ..., p. The group Lasso [10] and multiple kernel learning [11, 5] frameworks consider the following objective function:

J(f_1, ..., f_p) = (1/n) Σ_{i=1}^n (y_i − Σ_{j=1}^p ⟨f_j, Φ_j(x_i)⟩)² + 2λ Σ_{j=1}^p ‖f_j‖_{F_j} = L(f_1, ..., f_p) + 2λ Σ_{j=1}^p ‖f_j‖_{F_j}.

Note that when Φ_j(x) is simply the j-th coordinate of x ∈ R^p, we get back the penalization by the ℓ₁-norm and thus the regular Lasso [12]. Using 2a^{1/2} = min_{b ≥ 0} {a/b + b}, we obtain a variational formulation of the sum of norms: 2 Σ_{j=1}^p ‖f_j‖ = min_{η ∈ R^p₊} Σ_{j=1}^p {‖f_j‖²/η_j + η_j}. Thus, minimizing J(f_1, ..., f_p) with respect to (f_1, ..., f_p) is equivalent to minimizing with respect to η ∈ R^p₊ (see [5] for more details):

min_{f_1, ..., f_p} L(f_1, ..., f_p) + λ Σ_{j=1}^p ‖f_j‖²_{F_j}/η_j + λ Σ_{j=1}^p η_j = λ yᵀ(Σ_{j=1}^p η_j K_j + nλ I_n)^{-1} y + λ Σ_{j=1}^p η_j,

where I_n is the identity matrix. Moreover, given η, this leads to a smoothing matrix of the form

A_{η,λ} = (Σ_{j=1}^p η_j K_j)(Σ_{j=1}^p η_j K_j + nλ I_n)^{-1},    (1)

parameterized by the regularization parameter λ ∈ R₊ and the kernel combination η ∈ R^p₊; note that A_{η,λ} depends only on λ^{-1}η, which can be grouped in a single parameter in R^p₊.
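To make the two smoothing matrices above concrete, here is a minimal numpy sketch (not taken from the paper; the kernel choice and all names are illustrative assumptions) that builds A_λ = K(K + nλI_n)^{-1} for kernel ridge regression and A_{η,λ} of Eq. (1) for a fixed kernel combination η:

import numpy as np

def exponential_kernel(x):
    # K_ab = prod_i exp(-|x_a,i - x_b,i|), the kernel used again in Section 5.
    return np.exp(-np.abs(x[:, None, :] - x[None, :, :]).sum(axis=-1))

def ridge_smoothing_matrix(K, lam):
    # A_lambda = K (K + n*lambda*I_n)^{-1}; the estimator is F_hat = A_lambda @ Y.
    n = K.shape[0]
    return K @ np.linalg.inv(K + n * lam * np.eye(n))

def mkl_smoothing_matrix(kernels, eta, lam):
    # Eq. (1): A_{eta,lambda} = (sum_j eta_j K_j)(sum_j eta_j K_j + n*lambda*I_n)^{-1}.
    K = sum(e * Kj for e, Kj in zip(eta, kernels))
    return ridge_smoothing_matrix(K, lam)

The degrees of freedom df(λ) = tr(A_λ) used throughout the rest of the paper is then simply np.trace(A).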

Thus, the Lasso/group Lasso can be seen as particular (convex) ways of optimizing over η. In this paper, we propose a non-convex alternative with better statistical properties (oracle inequality in Theorem 1). Note that in our setting, finding the solution of the problem is hard in general since the optimization is not convex. However, while the model selection problem is by nature combinatorial, our optimization problems for multiple kernels are all differentiable and are thus amenable to gradient descent procedures, which only find local optima.

Non-symmetric linear estimators. Other linear estimators are commonly used, such as nearest-neighbor regression or the Nadaraya-Watson estimator [13]; those however lead to non-symmetric matrices A, and are not entirely covered by our theoretical results.

3 Linear estimator selection

In this section, we first describe the statistical framework of linear estimator selection and introduce the notion of minimal penalty.

3.1 Unbiased risk estimation heuristics

Usually, several estimators of the form F̂ = AY can be used. The problem that we consider in this paper is then to select one of them, that is, to choose a matrix A. Let us assume that a family of matrices (A_λ)_{λ∈Λ} is given (examples are shown in Section 2.2), hence a family of estimators (F̂_λ)_{λ∈Λ} can be used, with F̂_λ := A_λ Y. The goal is to choose from data some λ̂ ∈ Λ, so that the quadratic risk of F̂_λ̂ is as small as possible. The best choice would be the oracle

λ* ∈ arg min_{λ∈Λ} { (1/n) ‖F̂_λ − F‖₂² },

which cannot be used since it depends on the unknown signal F. Therefore, the goal is to define a data-driven λ̂ satisfying an oracle inequality

(1/n) ‖F̂_λ̂ − F‖₂² ≤ C inf_{λ∈Λ} { (1/n) ‖F̂_λ − F‖₂² } + R_n,    (2)

with large probability, where the leading constant C should be close to 1 (at least for large n) and the remainder term R_n should be negligible compared to the risk of the oracle.

Many classical selection methods are built upon the unbiased risk estimation heuristics: if λ̂ minimizes a criterion crit(λ) such that

∀λ ∈ Λ,  E[crit(λ)] ≈ E[(1/n) ‖F̂_λ − F‖₂²],

then λ̂ satisfies an oracle inequality such as Eq. (2) with large probability. For instance, cross-validation [14, 15] and generalized cross-validation (GCV) [8] are built upon this heuristics.

One way of implementing this heuristics is penalization, which consists in minimizing the sum of the empirical risk and a penalty term, i.e., using a criterion of the form

crit(λ) = (1/n) ‖F̂_λ − Y‖₂² + pen(λ).

The unbiased risk estimation heuristics, also called Mallows' heuristics, then leads to the ideal (deterministic) penalty

pen_id(λ) := E[(1/n) ‖F̂_λ − F‖₂²] − E[(1/n) ‖F̂_λ − Y‖₂²].

When F̂_λ = A_λ Y, we have

‖F̂_λ − F‖₂² = ‖(A_λ − I_n)F‖₂² + ‖A_λ ε‖₂² + 2 ⟨A_λ ε, (A_λ − I_n)F⟩,    (3)
‖F̂_λ − Y‖₂² = ‖F̂_λ − F‖₂² + ‖ε‖₂² − 2 ⟨ε, A_λ ε⟩ + 2 ⟨ε, (I_n − A_λ)F⟩,    (4)

where ε = Y − F ∈ R^n and, for t, u ∈ R^n, ⟨t, u⟩ = Σ_{i=1}^n t_i u_i. Since ε is centered with covariance matrix σ² I_n, Eq. (3) and Eq. (4) imply that

pen_id(λ) = 2 σ² tr(A_λ)/n,    (5)

up to the term −E[(1/n)‖ε‖₂²] = −σ², which can be dropped off since it does not vary with λ.
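For a finite family of smoothing matrices, the resulting Mallows' C_L rule (empirical risk plus the ideal penalty of Eq. (5), with σ² assumed known) can be sketched in a few lines; this is illustrative code rather than the paper's implementation, and it reuses the ridge_smoothing_matrix helper defined above. Replacing the known σ² by a data-driven estimate is precisely the subject of Sections 3.2 and 4.

import numpy as np

def mallows_CL_select(A_list, Y, sigma2):
    # crit(lambda) = (1/n)||A_lambda Y - Y||^2 + 2*sigma2*tr(A_lambda)/n, cf. Eq. (5).
    n = len(Y)
    crits = [np.sum((A @ Y - Y) ** 2) / n + 2.0 * sigma2 * np.trace(A) / n
             for A in A_list]
    return int(np.argmin(crits))

# Example family: A_list = [ridge_smoothing_matrix(K, lam) for lam in np.logspace(-6, 2, 50)]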

Note that df(λ) = tr(A_λ) is called the effective dimensionality or degrees of freedom [16], so that the ideal penalty in Eq. (5) is proportional to the dimensionality associated with the matrix A_λ: for projection matrices, we get back the dimension of the subspace, which is classical in model selection.

The expression of the ideal penalty in Eq. (5) led to several selection procedures, in particular Mallows' C_L (called C_p in the case of projection estimators) [17], where σ² is replaced by some estimator σ̂². The estimator of σ² usually used with C_L is based upon the value of the empirical risk at some λ_0 with df(λ_0) large; it has the drawback of overestimating the risk, in a way which depends on λ_0 and F [18]. GCV, which implicitly estimates σ², has the drawback of overfitting if the family (A_λ)_{λ∈Λ} contains a matrix too close to I_n [19]; GCV also overestimates the risk even more than C_L for most A_λ (see (7.9) and Table 4 in [18]).

In this paper, we define an estimator of σ² directly related to the selection task which does not have similar drawbacks. Our estimator relies on the concept of minimal penalty, introduced by Birgé and Massart [6] and further studied in [7].

3.2 Minimal and optimal penalties

We deduce from Eq. (3) the bias-variance decomposition of the risk:

E[(1/n)‖F̂_λ − F‖₂²] = (1/n)‖(A_λ − I_n)F‖₂² + tr(A_λᵀ A_λ) σ²/n = bias + variance,    (6)

and from Eq. (4) the expectation of the empirical risk:

E[(1/n)‖F̂_λ − Y‖₂² − (1/n)‖ε‖₂²] = (1/n)‖(A_λ − I_n)F‖₂² − (2 tr(A_λ) − tr(A_λᵀ A_λ)) σ²/n.    (7)

Note that the variance term in Eq. (6) is not proportional to the effective dimensionality df(λ) = tr(A_λ) but to tr(A_λᵀ A_λ). Although several papers argue that these terms are of the same order (for instance, they are equal when A_λ is a projection matrix), this may not hold in general. If A_λ is symmetric with a spectrum Sp(A_λ) ⊂ [0, 1], as in all the examples of Section 2.2, we only have

0 ≤ tr(A_λᵀ A_λ) ≤ tr(A_λ) ≤ 2 tr(A_λ) − tr(A_λᵀ A_λ) ≤ 2 tr(A_λ).    (8)

In order to give a first intuitive interpretation of Eq. (6) and Eq. (7), let us consider the kernel ridge regression example and assume that the risk and the empirical risk behave as their expectations in Eq. (6) and Eq. (7); see also Fig. 1. Completely rigorous arguments based upon concentration inequalities are developed in [20] and summarized in Section 4, leading to the same conclusion as the present informal reasoning.

First, as proved in [20], the bias (1/n)‖(A_λ − I_n)F‖₂² is a decreasing function of the dimensionality df(λ) = tr(A_λ), and the variance tr(A_λᵀ A_λ) σ²/n is an increasing function of df(λ), as is 2 tr(A_λ) − tr(A_λᵀ A_λ). Therefore, Eq. (6) shows that the optimal λ realizes the best trade-off between bias (which decreases with df(λ)) and variance (which increases with df(λ)), which is a classical fact in model selection.

Second, the expectation of the empirical risk in Eq. (7) can be decomposed into the bias and a negative variance term which is the opposite of

pen_min(λ) := (2 tr(A_λ) − tr(A_λᵀ A_λ)) σ²/n.    (9)
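The two trace quantities driving Eq. (6)-(9) are cheap to compute. The following sketch (illustrative code, reusing the ridge_smoothing_matrix helper from above) evaluates pen_min(λ) of Eq. (9) for a symmetric smoothing matrix, and the ordering of Eq. (8) can be checked numerically on its output:

import numpy as np

def penalty_traces(A):
    # Returns tr(A) and tr(A^T A); for symmetric A these are the sums of eigenvalues
    # and of squared eigenvalues, respectively.
    return np.trace(A), np.trace(A.T @ A)

def pen_min(A, sigma2):
    # Eq. (9): pen_min(lambda) = (2 tr(A_lambda) - tr(A_lambda^T A_lambda)) * sigma2 / n.
    n = A.shape[0]
    trA, trAA = penalty_traces(A)
    return (2.0 * trA - trAA) * sigma2 / n

# For Sp(A) in [0, 1], Eq. (8) states: 0 <= tr(A^T A) <= tr(A) <= 2*tr(A) - tr(A^T A) <= 2*tr(A).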

Figure 1: Bias-variance decomposition of the generalization error, and minimal/optimal penalties. (The figure plots the bias, the variance, the generalization error and the empirical error as functions of the degrees of freedom tr(A).)

As suggested by the notation pen_min, we will show it is a minimal penalty in the following sense. If

∀C ≥ 0,  λ̂_min(C) ∈ arg min_{λ∈Λ} { (1/n)‖F̂_λ − Y‖₂² + C pen_min(λ) },

then, up to concentration inequalities that are detailed in Section 4.2, λ̂_min(C) behaves like a minimizer of

g_C(λ) = E[(1/n)‖F̂_λ − Y‖₂² + C pen_min(λ)] − σ² = (1/n)‖(A_λ − I_n)F‖₂² + (C − 1) pen_min(λ).

Therefore, two main cases can be distinguished:

if C < 1, then g_C(λ) decreases with df(λ), so that df(λ̂_min(C)) is huge: λ̂_min(C) overfits.

if C > 1, then g_C(λ) increases with df(λ) when df(λ) is large enough, so that df(λ̂_min(C)) is much smaller than when C < 1.

As a conclusion, pen_min(λ) is the minimal amount of penalization needed so that a minimizer λ̂ of a penalized criterion is not clearly overfitting.

Following an idea first proposed in [6] and further analyzed or used in several other papers such as [21, 7, 22], we now propose to use the fact that pen_min(λ) is a minimal penalty for estimating σ², and to plug this estimator into Eq. (5). This leads to the algorithm described in Section 4.1. Note that the minimal penalty given by Eq. (9) is new; it generalizes previous results [6, 7] where pen_min(A_λ) = tr(A_λ)σ²/n because all the A_λ were assumed to be projection matrices, i.e., A_λᵀ A_λ = A_λ. Furthermore, our results generalize the slope heuristics pen_id ≈ 2 pen_min (only valid for projection estimators [6, 7]) to general linear estimators, for which pen_id/pen_min ∈ (1, 2].

4 Main results

In this section, we first describe our algorithm and then present our theoretical results.

4.1 Algorithm

The following algorithm first computes an estimator Ĉ of σ² using the minimal penalty in Eq. (9), then considers the ideal penalty in Eq. (5) for selecting λ.

Input: Λ a finite set with Card(Λ) ≤ K n^α for some K, α ≥ 0, and matrices A_λ.
1. For every C > 0, compute λ̂_0(C) ∈ arg min_{λ∈Λ} { (1/n)‖F̂_λ − Y‖₂² + C (2 tr(A_λ) − tr(A_λᵀ A_λ))/n }.
2. Find Ĉ such that df(λ̂_0(Ĉ)) ∈ [n^{3/4}, n/10].
3. Select λ̂ ∈ arg min_{λ∈Λ} { (1/n)‖F̂_λ − Y‖₂² + 2 Ĉ tr(A_λ)/n }.

In steps 1 and 2 of the above algorithm, in practice, a grid in log-scale is used, and our theoretical results from the next section suggest using a step-size of order n^{-1/4}.
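A direct transcription of these three steps in numpy might look as follows. This is a sketch under my own implementation choices (not the paper's code): the log-scale grid C_grid is supplied by the caller, and the maximal-jump variant discussed in the next paragraph is used as a fallback whenever no grid value puts df(λ̂_0(C)) inside [n^{3/4}, n/10].

import numpy as np

def minimal_penalty_select(A_list, Y, C_grid):
    # Steps 1-2: estimate sigma^2 via the minimal penalty of Eq. (9);
    # step 3: plug the estimate into the ideal penalty of Eq. (5).
    n = len(Y)
    emp_risk = np.array([np.sum((A @ Y - Y) ** 2) / n for A in A_list])
    tr = np.array([np.trace(A) for A in A_list])
    tr2 = np.array([np.trace(A.T @ A) for A in A_list])

    # df(lambda_0(C)) for every C on the grid (C_grid should be log-spaced).
    df_of_C = np.array([tr[np.argmin(emp_risk + C * (2 * tr - tr2) / n)]
                        for C in C_grid])

    # Step 2: keep a C whose selected dimensionality lies in [n^{3/4}, n/10];
    # fallback: the C right after the maximal jump of C -> df(lambda_0(C)).
    ok = np.where((df_of_C >= n ** 0.75) & (df_of_C <= n / 10))[0]
    if len(ok) > 0:
        C_hat = C_grid[ok[0]]
    else:
        C_hat = C_grid[np.argmax(-np.diff(df_of_C)) + 1]

    # Step 3: Mallows' C_L penalty with sigma^2 replaced by C_hat.
    lam_hat = int(np.argmin(emp_risk + 2 * C_hat * tr / n))
    return lam_hat, C_hat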

Note that it may not be possible in all cases to find a C such that df(λ̂_0(C)) ∈ [n^{3/4}, n/10]; therefore, our condition in step 2 could be relaxed to finding a Ĉ such that for all C > Ĉ + δ, df(λ̂_0(C)) < n^{3/4} and for all C < Ĉ − δ, df(λ̂_0(C)) > n/10, with δ = n^{-1/4+ξ}, where ξ > 0 is a small constant. Alternatively, using the same grid in log-scale, we can select Ĉ with maximal jump between successive values of df(λ̂_0(C)); note that our theoretical result then does not entirely hold, as we show the presence of a jump around σ², but do not show the absence of similar jumps elsewhere.

4.2 Oracle inequality

Theorem 1 Let Ĉ and λ̂ be defined as in the algorithm of Section 4.1, with Card(Λ) ≤ K n^α for some K, α ≥ 0. Assume that for every λ ∈ Λ, A_λ is symmetric with Sp(A_λ) ⊂ [0, 1], that the ε_i are i.i.d. Gaussian with variance σ² > 0, and that

∃λ_1, λ_2 ∈ Λ with df(λ_1) ≥ n/2, df(λ_2) ≤ √n, and, for i ∈ {1, 2}, (1/n)‖(A_{λ_i} − I_n)F‖₂² ≤ σ² √(ln(n)/n).    (A_{1-2})

Then, numerical constants C_a and c_1 and an event of probability at least 1 − 8 K n^{-2} exist on which, for every n ≥ C_a,

(1 − 91(α + 2) √(ln(n)/n)) σ² ≤ Ĉ ≤ (1 + c_1 (α + 2) ln(n) n^{-1/4}) σ².    (10)

Furthermore, if

∃κ ≥ 1 such that ∀λ ∈ Λ, tr(A_λ) σ²/n ≤ κ E[(1/n)‖F̂_λ − F‖₂²],    (A_3)

then a constant C_b depending only on κ exists such that for every n ≥ C_b, on the same event,

(1/n)‖F̂_λ̂ − F‖₂² ≤ (1 + κ √(ln(n)/n)) inf_{λ∈Λ} { (1/n)‖F̂_λ − F‖₂² } + 36 (κ + α + 2) ln(n) σ²/n.    (11)

Theorem 1 is proved in [20]. The proof mainly follows from the informal arguments developed in Section 3.2, completed with the following two concentration inequalities: if ξ ∈ R^n is a standard Gaussian random vector, α ∈ R^n and M is a real-valued n × n matrix, then for every x ≥ 0,

P( |⟨α, ξ⟩| ≤ √(2x) ‖α‖₂ ) ≥ 1 − 2 e^{-x},    (12)

∀θ > 0, P( | ‖Mξ‖₂² − tr(Mᵀ M) | ≤ θ tr(Mᵀ M) + 2 (1 + θ^{-1}) ‖M‖² x ) ≥ 1 − 2 e^{-x},    (13)

where ‖M‖ is the operator norm of M. A proof of Eq. (12) and (13) can be found in [20].

4.3 Discussion of the assumptions of Theorem 1

Gaussian noise. When ε is sub-Gaussian, Eq. (12) and Eq. (13) can be proved for ξ = σ^{-1} ε at the price of additional technicalities, which implies that Theorem 1 is still valid.

Symmetry. The assumption that the matrices A_λ must be symmetric can certainly be relaxed, since it is only used for deriving from Eq. (13) a concentration inequality for ⟨A_λ ξ, ξ⟩. Note that Sp(A_λ) ⊂ [0, 1] barely is an assumption since it means that A_λ actually shrinks Y.

Assumption (A_{1-2}). (A_{1-2}) holds if max_{λ∈Λ} {df(λ)} ≥ n/2 and the bias is smaller than c df(λ)^{-d} for some c, d > 0, a quite classical assumption in the context of model selection. Besides, (A_{1-2}) is much less restrictive and can even be relaxed; see [20].

Assumption (A_3). The upper bound (A_3) on tr(A_λ) is certainly the strongest assumption of Theorem 1, but it is only needed for Eq. (11). According to Eq. (6), (A_3) holds with κ = 1 when A_λ is a projection matrix, since tr(A_λᵀ A_λ) = tr(A_λ). In the kernel ridge regression framework, (A_3) holds as soon as the eigenvalues of the kernel matrix K decrease like j^{-α} (see [20]). In general, (A_3) means that F̂_λ should not have a risk smaller than the parametric convergence rate associated with a model of dimension df(λ) = tr(A_λ). When (A_3) does not hold, selecting among estimators whose risks are below the parametric rate is a rather difficult problem, and it may not be possible to attain the risk of the oracle in general.
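Before moving on, the second concentration ingredient used in the proof, Eq. (13), can be sanity-checked by simulation; the following lines are purely illustrative (the matrix M, the values of θ and x, and the Monte-Carlo size are arbitrary choices of mine):

import numpy as np

rng = np.random.default_rng(0)
n, x, theta = 200, 5.0, 0.5
M = rng.standard_normal((n, n)) / np.sqrt(n)
trMM = np.trace(M.T @ M)
bound = theta * trMM + 2 * (1 + 1 / theta) * np.linalg.norm(M, 2) ** 2 * x

xi = rng.standard_normal((10000, n))                    # 10000 standard Gaussian vectors
dev = np.abs(np.sum((xi @ M.T) ** 2, axis=1) - trMM)    # | ||M xi||^2 - tr(M^T M) |
print((dev > bound).mean(), "<=", 2 * np.exp(-x))       # empirical vs. theoretical bound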

Figure 2: Selected degrees of freedom vs. penalty strength log(C/σ²): note that when penalizing by the minimal penalty, there is a strong jump at C = σ², while when using half the optimal penalty, this is not the case. Left: single kernel case; right: multiple kernel case (with the minimal penalty used either over a discrete family or with continuous optimization).

Nevertheless, an oracle inequality can still be proved without (A_3), at the price of enlarging Ĉ slightly and adding a small fraction of σ² tr(A_λ)/n in the right-hand side of Eq. (11); see [20]. Enlarging Ĉ is necessary in general: if tr(A_λᵀ A_λ) ≪ tr(A_λ) for most λ ∈ Λ, the minimal penalty is very close to 2 σ² tr(A_λ)/n, so that according to Eq. (10), overfitting is likely as soon as Ĉ underestimates σ², even by a very small amount.

4.4 Main consequences of Theorem 1 and comparison with previous results

Consistent estimation of σ². The first part of Theorem 1 shows that Ĉ is a consistent estimator of σ² in a general framework and under mild assumptions. Compared to classical estimators of σ², such as the one usually used with Mallows' C_L, Ĉ does not depend on the choice of some model assumed to have almost no bias, which can lead to overestimating σ² by an unknown amount [18].

Oracle inequality. Our algorithm satisfies an oracle inequality with high probability, as shown by Eq. (11): the risk of the selected estimator F̂_λ̂ is close to the risk of the oracle, up to a remainder term which is negligible when the dimensionality df(λ*) grows with n faster than ln(n), a typical situation when the bias is never equal to zero, for instance in kernel ridge regression.

Several oracle inequalities have been proved in the statistical literature for Mallows' C_L with a consistent estimator of σ², for instance in [23]. Nevertheless, except for the model selection problem (see [6] and references therein), all previous results were asymptotic, meaning that n is implicitly assumed to be large compared to each parameter of the problem. This assumption can be problematic for several learning problems, for instance in multiple kernel learning where the number p of kernels may grow with n. On the contrary, Eq. (11) is non-asymptotic, meaning that it holds for every fixed n as soon as the assumptions explicitly made in Theorem 1 are satisfied.

Comparison with other procedures. According to Theorem 1 and previous theoretical results [23, 19], C_L, GCV, cross-validation and our algorithm satisfy similar oracle inequalities in various frameworks. This should not lead to the conclusion that these procedures are completely equivalent. Indeed, second-order terms can be large for a given n, while they are hidden in asymptotic results and not tightly estimated by non-asymptotic results. As shown by the simulations in Section 5, our algorithm yields statistical performances as good as existing methods, and often quite better. Furthermore, our algorithm never overfits too much because df(λ̂) is by construction smaller than the effective dimensionality of λ̂_0(Ĉ) at which the jump occurs. This is a quite interesting property compared for instance to GCV, which is likely to overfit if it is not corrected, because GCV minimizes a criterion proportional to the empirical risk.

5 Simulations

Throughout this section, we consider exponential kernels on R^d, k(x, y) = Π_{i=1}^d e^{-|x_i - y_i|}, with the x's sampled i.i.d. from a standard multivariate Gaussian. The functions f are then selected randomly as f = Σ_{i=1}^m α_i k(·, z_i), where both α and z are i.i.d. standard Gaussian (i.e., f belongs to the RKHS).
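This simulation design can be reproduced in a few lines; the sketch below is illustrative only (the values of n, d, m, the noise level and the random seed are my own choices, not the paper's):

import numpy as np

rng = np.random.default_rng(0)
n, d, m, sigma = 500, 4, 10, 0.5

def exp_kernel(x, z):
    # k(x, z) = prod_i exp(-|x_i - z_i|), evaluated on all pairs of rows of x and z.
    return np.exp(-np.abs(x[:, None, :] - z[None, :, :]).sum(axis=-1))

x = rng.standard_normal((n, d))            # design points, i.i.d. standard Gaussian
z = rng.standard_normal((m, d))            # centers defining the target function
alpha = rng.standard_normal(m)
F = exp_kernel(x, z) @ alpha               # F_i = f(x_i) = sum_j alpha_j k(x_i, z_j)
Y = F + sigma * rng.standard_normal(n)     # noisy observations

Feeding Y and a family (A_λ) built from K = exp_kernel(x, x) into the selection sketches above then gives a qualitative reproduction of the single-kernel experiment.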

Figure 3: Comparison of various smoothing parameter selection methods (minimal penalty, GCV, 10-fold cross-validation) for various numbers of observations, averaged over 20 replications. Left: single kernel (vertical axis: mean(error / error of the oracle)); right: multiple kernels (vertical axis: mean(error / error of Mallows' C_L)); horizontal axis: log(n).

Jump. In Figure 2 (left), we consider data x_i ∈ R^6, n = 1000, and study the size of the jump in Figure 2 for kernel ridge regression. With half the optimal penalty (which is used in traditional variable selection for linear regression), we do not get any jump, while with the minimal penalty we always do. In Figure 2 (right), we plot the same curves for the multiple kernel learning problem with two kernels on two different 4-dimensional variables, with similar results. In addition, we show two ways of optimizing over λ ∈ Λ = R²₊: by discrete optimization with different kernel matrices (a situation covered by Theorem 1), or by continuous optimization with respect to η in Eq. (1), by gradient descent (a situation not covered by Theorem 1).

Comparison of estimator selection methods. In Figure 3, we plot model selection results for 20 replications of data (d = 4, n = 500), comparing GCV [8], our minimal penalty algorithm, and cross-validation methods. In the left part (single kernel), we compare to the oracle (which can be computed because we can enumerate Λ), and use for cross-validation all possible values of λ. In the right part (multiple kernel), we compare to the performance of Mallows' C_L when σ² is known (i.e., the penalty in Eq. (5)), and since we cannot enumerate all λ's, we use the solution obtained by MKL with CV [5]. We also compare to using our minimal penalty algorithm with the sum of the kernels.

6 Conclusion

A new light on the slope heuristics. Theorem 1 generalizes some results first proved in [6], where all the A_λ are assumed to be projection matrices, a framework where assumption (A_3) is automatically satisfied. To this extent, Birgé and Massart's slope heuristics has been modified in a way that sheds a new light on the "magical" factor 2 between the minimal and the optimal penalty, as proved in [6, 7]. Indeed, Theorem 1 shows that for general linear estimators,

pen_id(λ) / pen_min(λ) = 2 tr(A_λ) / (2 tr(A_λ) − tr(A_λᵀ A_λ)),    (14)

which can take any value in (1, 2] in general; this ratio is only equal to 2 when tr(A_λ) ≈ tr(A_λᵀ A_λ), hence mostly when A_λ is a projection matrix.

Future directions. In the case of projection estimators, the slope heuristics still holds when the design is random and the data are heteroscedastic [7]; we would like to know whether Eq. (14) is still valid for heteroscedastic data with general linear estimators. In addition, the good empirical performances of elbow heuristics based algorithms (i.e., based on the sharp variation of a certain quantity around good hyperparameter values) suggest that Theorem 1 can be generalized to many learning frameworks (and potentially to non-linear estimators), probably with small modifications in the algorithm, but always relying on the concept of minimal penalty. Another interesting open problem would be to extend the results of Section 4, where Card(Λ) ≤ K n^α is assumed, to continuous sets Λ such as the ones appearing naturally in kernel ridge regression and multiple kernel learning. We conjecture that Theorem 1 is valid without modification for a small continuous Λ, such as in kernel ridge regression, where taking a grid of size n in log-scale is almost equivalent to taking Λ = R₊.
On the contrary, in applications such as the Lasso with p ≫ n variables, the natural set Λ cannot be well covered by a grid of cardinality n^α with α small, and our minimal penalty algorithm and Theorem 1 certainly have to be modified.
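To see how far the ratio in Eq. (14) can be from the "magical" factor 2, one can evaluate it along a kernel ridge regularization path; the following lines are illustrative (any positive definite kernel matrix K can be used):

import numpy as np

def penalty_ratio(K, lam):
    # Eq. (14): pen_id/pen_min = 2 tr(A) / (2 tr(A) - tr(A^T A)), with A = K (K + n*lam*I_n)^{-1}.
    n = K.shape[0]
    A = K @ np.linalg.inv(K + n * lam * np.eye(n))
    trA, trAA = np.trace(A), np.trace(A.T @ A)
    return 2 * trA / (2 * trA - trAA)

# As lam -> 0, A tends to the identity (a projection) and the ratio tends to 2;
# for heavier regularization the eigenvalues of A shrink and the ratio moves towards 1.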

References

[1] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[2] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
[3] O. Chapelle and V. Vapnik. Model selection for support vector machines. In Advances in Neural Information Processing Systems (NIPS), 1999.
[4] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[5] F. Bach. Consistency of the group Lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179-1225, 2008.
[6] L. Birgé and P. Massart. Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2):33-73, 2007.
[7] S. Arlot and P. Massart. Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res., 10:245-279, 2009.
[8] P. Craven and G. Wahba. Smoothing noisy data with spline functions. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math., 31(4):377-403, 1978/79.
[9] G. Wahba. Spline Models for Observational Data. SIAM, 1990.
[10] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49-67, 2006.
[11] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27-72, 2003/04.
[12] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.
[14] D. M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125-127, 1974.
[15] M. Stone. Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B, 36:111-147, 1974.
[16] T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Comput., 17(9):2077-2098, 2005.
[17] C. L. Mallows. Some comments on C_p. Technometrics, 15:661-675, 1973.
[18] B. Efron. How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc., 81(394):461-470, 1986.
[19] Y. Cao and Y. Golubev. On oracle inequalities related to smoothing splines. Math. Methods Statist., 15(4), 2007.
[20] S. Arlot and F. Bach. Data-driven calibration of linear estimators with minimal penalties, September 2009. Long version. arXiv:0909.1884v1.
[21] É. Lebarbier. Detecting multiple change-points in the mean of a Gaussian process by model selection. Signal Processing, 85:717-736, 2005.
[22] C. Maugis and B. Michel. Slope heuristics for variable selection and clustering via Gaussian mixtures. Technical Report 6550, INRIA, 2008.
[23] K.-C. Li. Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. Ann. Statist., 15(3):958-975, 1987.
