arXiv: v2 [math.ST] 20 Mar 2008


Journal of Machine Learning Research 0 (0000) 0. Submitted 3/08; Published 0/00

Data-driven calibration of penalties for least-squares regression

Sylvain Arlot (sylvain.arlot@math.u-psud.fr)
Univ Paris-Sud, UMR 8628, Laboratoire de Mathematiques, Orsay, F ; CNRS, Orsay, F ; INRIA-Futurs, Projet Select

Pascal Massart (pascal.massart@math.u-psud.fr)
Univ Paris-Sud, UMR 8628, Laboratoire de Mathematiques, Orsay, F ; CNRS, Orsay, F ; INRIA-Futurs, Projet Select

Editor: Editor

Abstract

Penalization procedures often suffer from their dependence on multiplying factors, whose optimal values are either unknown or hard to estimate from the data. In this paper, we propose a completely data-driven calibration method for this parameter in the least-squares regression framework, without assuming a particular shape for the penalty. Our algorithm relies on the concept of minimal penalty, which was introduced in a recent paper by Birgé and Massart (2007) in the context of penalized least squares for Gaussian homoscedastic regression. Interestingly, the minimal penalty can be evaluated from the data themselves, which leads to a data-driven estimation of an optimal penalty that one can use in practice. Unfortunately, their approach relies heavily on the homoscedastic Gaussian nature of the stochastic framework that they consider. Our purpose in this paper is twofold: stating a more general heuristics to design a data-driven penalty (the slope heuristics), and proving that it works for penalized least-squares random-design regression, even when the data are heteroscedastic. For technical reasons which are explained in the paper, we could prove precise mathematical results only for histogram bin-width selection. Even though we could not work at the level of generality that we were expecting, this is at least a first step towards further results. Our mathematical results hold in a specific framework, but the approach and the method that we use are indeed general.
Keywords: Data-driven calibration, Non-parametric regression, Model selection by penalization, Heteroscedastic data, Histogram

1. Introduction

Model selection has received much interest in the last decades. A very common approach is penalization. In a nutshell, it chooses the model which minimizes the sum of the empirical risk (how well the algorithm fits the data) and some complexity measure of the model (called the penalty). This is the case of FPE (Akaike, 1970), AIC (Akaike, 1973) and Mallows' $C_p$ or $C_L$ (Mallows, 1973). Many other penalization procedures have been proposed since, among which we mention Rademacher complexities (Koltchinskii, 2001; Bartlett et al., 2002), local Rademacher complexities (Bartlett et al., 2005; Koltchinskii, 2006), bootstrap, resampling and V-fold penalties (Efron, 1983; Arlot, 2008b,a), to name but a few.

In this article, we consider the question of the efficiency of such penalization procedures, i.e. whether their quadratic risk is asymptotically equivalent to the risk of the oracle. This property is often called asymptotic optimality. It does not mean that the procedure finds a true model (which may not even exist), which would be the consistency problem. A procedure is efficient when it makes the best possible use of the data in terms of the quadratic risk of the final estimator.

There is a huge amount of literature about this question. Consider first Mallows' $C_p$ and Akaike's FPE and AIC. Their asymptotic optimality has been proven by Shibata (1981) for Gaussian errors, Li (1987) under suitable moment assumptions on the errors, and Polyak and Tsybakov (1990) for sharper moment conditions in the Fourier case. Then, non-asymptotic oracle inequalities (with a constant $C > 1$) have been proven by Barron et al. (1999) and Birgé and Massart (2001) in the Gaussian case, and Baraud (2000, 2002) under some moment conditions on the errors. In the Gaussian case, non-asymptotic oracle inequalities with a constant $C_n$ which goes to 1 when $n$ goes to infinity have been obtained by Birgé and Massart (2007).

However, both AIC and Mallows' $C_p$ still have serious drawbacks from the practical viewpoint. Indeed, AIC relies on a strong asymptotic assumption, so that the optimal multiplying factor may be quite different from one for small sample sizes. This is why corrected versions of AIC have been proposed (Sugiura, 1978; Hurvich and Tsai, 1989). On the other hand, the optimal calibration of Mallows' $C_p$ requires the knowledge of the noise level $\sigma^2$, which is assumed to be constant. With real data, one has to estimate $\sigma^2$ separately, but it is hard to do so independently of any model. In addition, it is quite unlikely that the best estimator of $\sigma^2$ automatically leads to the most efficient model selection procedure. One of the purposes of this article is to provide a data-dependent calibration rule which directly aims at the efficiency of the final procedure. Focusing directly on efficiency may significantly improve on the more classical plug-in method, in terms of the performance of the model selection procedure itself.

Actually, most penalization procedures have similar or even stronger drawbacks, often because of a gap between theoretical results and their practical use. For instance, there is a factor 2 between the (global) Rademacher complexities for which theoretical results have been proven, and the way they are used in practice (Lozano, 2000). Since this factor is unavoidable in some sense (Arlot (2007), Chap. 9), the optimal calibration of these penalties is a practical issue. The problem is tougher for local Rademacher complexities, since theoretical results are only valid with very large calibration constants (in particular the multiplying factor), and no one knows what their optimal values are. One of our goals is to address this question for such general-shape penalties (in particular data-dependent penalties), at least for the optimization of the multiplying factor.

There are not so many calibration algorithms available. The most popular ones are cross-validation methods (Allen, 1974; Stone, 1974), in particular V-fold cross-validation (Geisser, 1975), because these are general-purpose methods, relying on a heuristics likely to be widely valid. However, their computational cost may be too heavy, because they require performing the entire model selection procedure V times for each candidate value of the constant to be calibrated. For penalties based on the dimension of the models (assumed to be vector spaces), such as Mallows' $C_p$, an alternative calibration procedure has been proposed by George and Foster (2000).

A completely different approach is the one of Birgé and Massart (2007), who have also considered dimensionality-based penalties. Since our purpose is to extend their approach to a much wider range of applications, let us briefly recall their main claims. In the framework of Gaussian homoscedastic regression on a fixed design, assume that each model is a finite-dimensional vector space. Then, consider the penalty $\mathrm{pen}(m) = K D_m$, where $D_m$ is the dimension of the model $m$ and $K > 0$ is a constant to be calibrated. In several situations, it turns out that the optimal constant $K^\star$ (i.e. the one which leads to an asymptotically efficient procedure) is exactly twice the minimal constant $K_{\min}$ (defined as the one under which the ratio between the quadratic risk of the chosen estimator and the quadratic risk of the oracle goes to infinity with the sample size). In other words, the optimal penalty is twice the minimal penalty, which is called the slope heuristics by Birgé and Massart.

A crucial fact is that the minimal constant $K_{\min}$ can be estimated from the data, because very large models are selected if and only if $K < K_{\min}$. This leads them to the following strategy for choosing $K$ from the data. Define $\hat m(K)$ as the model selected by $\mathrm{pen}(m) = K D_m$, as a function of $K$. First, compute $\hat K_{\min}$ such that $D_{\hat m(K)}$ is huge for $K < \hat K_{\min}$ and reasonable when $K \geq \hat K_{\min}$. Second, define $\hat m := \hat m(2 \hat K_{\min})$. Such a method has been successfully applied to multiple change-point detection by Lebarbier (2005).

From the theoretical viewpoint, a crucial question for understanding (and validating) this approach is the existence of a minimal penalty. In other words, how much should we penalize at least? In the framework of Gaussian regression on a fixed design, this question has been addressed by Birgé and Massart (2001, 2007) and Baraud et al. (2007) (the latter considering the unknown variance case). However, nothing is known for non-Gaussian or heteroscedastic data. One of our goals is thus to fill part of this gap in the theoretical understanding of penalization procedures.

In this paper, we use a similar link between minimal and optimal penalties in order to calibrate any penalty (namely, the favorite penalty of the final user, including all the aforementioned penalties, and not necessarily dimensionality-based penalties), in a more general framework (e.g., we allow the noise to be heteroscedastic and non-Gaussian, which is much more realistic). This leads us to Algorithm 1, which is defined in Sect. 3.1 in the least-squares regression framework, and relies on a generalization of the slope heuristics.

We then tackle the theoretical validation of this algorithm, from the non-asymptotic viewpoint. By non-asymptotic, we mean in particular that the collection of models is allowed to depend on $n$. This is quite natural, since it is common in practice to introduce more explanatory variables (for instance) when one has more observations. Considering models with a large number of parameters (e.g. of the order of a power of the sample size $n$) is also necessary to approximate functions belonging to a general approximation space. Thus, the non-asymptotic viewpoint allows us not to assume that the regression function can be described with a very small number of parameters.

First, we prove the existence of minimal penalties for heteroscedastic regression on a random design (Thm. 1). Then, we prove in the same framework that twice the minimal penalty has some optimality properties (Thm. 2), which means that we have extended the so-called slope heuristics to heteroscedastic least-squares regression on a random design. To prove such a result, we have to assume that each model is the vector space of piecewise constant functions on some partition of the feature space. This is quite a restriction, but we conjecture that it is mainly technical, and that the slope heuristics stays valid at least in the general least-squares regression framework. We provide some evidence for this by proving two key concentration inequalities without the restriction to histograms. Another argument supporting this conjecture is that several simulation studies have recently shown that the slope heuristics can be used in several frameworks: mixture models (Maugis and Michel, 2007), clustering (Baudry, 2007), spatial statistics (Verzelen, 2007), estimation of oil reserves (Lepez, 2002) and genomics (Villers, 2007). Our results do not give a formal proof for these applications of the slope heuristics (cf. Sect. 3.2 for instances of completely data-driven penalties for which we have proven rigorously that our algorithm works). However, they are a first step towards such a result, by proving that it can be applied when the ideal penalty has a general shape.

This paper is organized as follows. We describe the framework and our main heuristics in Sect. 2. The resulting algorithm is defined in Sect. 3. Our main theoretical results are stated in Sect. 4. Appendix A is devoted to computational issues. All the proofs are given in Appendix B.

2. Framework

2.1 Least-squares regression

We observe some data $(X_i, Y_i)_{1 \leq i \leq n} \in \mathcal{X} \times \mathbb{R}$, i.i.d. with common law $P$. Our goal is to predict $Y$ given $X$, where $(X, Y) \sim P$ is independent from the data. Denoting by $s$ the regression function, we can write

\[ Y_i = s(X_i) + \sigma(X_i)\, \epsilon_i \tag{1} \]

where $\sigma : \mathcal{X} \to \mathbb{R}$ is the heteroscedastic noise level and the $\epsilon_i$ are i.i.d. centered noise terms, possibly dependent on $X_i$, but with mean 0 and variance 1 conditionally on $X_i$. Typically, the feature space $\mathcal{X}$ is a compact subset of $\mathbb{R}^d$.

Given a predictor $t : \mathcal{X} \to \mathcal{Y}$, its quality is measured by the (quadratic) prediction loss

\[ E_{(X,Y) \sim P}\left[ \gamma(t, (X, Y)) \right] =: P\gamma(t) \quad \text{where} \quad \gamma(t, (x, y)) = (t(x) - y)^2 \]

is the least-squares contrast. Then, the Bayes predictor (i.e. the minimizer of $P\gamma(t)$ over the set of all predictors) is the regression function $s$, and we define the excess loss as

\[ \ell(s, t) := P\gamma(t) - P\gamma(s) = E_{(X,Y) \sim P}\left[ (t(X) - s(X))^2 \right]. \]

Given a particular set of predictors $S_m$ (called a model), we define the best predictor over $S_m$,

\[ s_m := \arg\min_{t \in S_m} \{ P\gamma(t) \}, \]

and its empirical counterpart

\[ \hat s_m := \arg\min_{t \in S_m} \{ P_n\gamma(t) \} \]

(when it exists and is unique), where $P_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{(X_i, Y_i)}$. This estimator is the well-known empirical risk minimizer, also called least-squares estimator since $\gamma$ is the least-squares contrast.

2.2 Ideal model selection

We now assume that we have a family of models $(S_m)_{m \in \mathcal{M}_n}$, hence a family of estimators $(\hat s_m)_{m \in \mathcal{M}_n}$ (via empirical risk minimization). We are looking for some data-dependent $\hat m \in \mathcal{M}_n$ such that $\ell(s, \hat s_{\hat m})$ is as small as possible. This is the model selection problem. For instance, we would like to prove an oracle inequality of the form

\[ \ell(s, \hat s_{\hat m}) \leq C \inf_{m \in \mathcal{M}_n} \{ \ell(s, \hat s_m) \} + R_n \]

in expectation or on an event of large probability, with $C$ close to 1 and $R_n = o(n^{-1})$.

General penalization procedures can be described as follows. Let $\mathrm{pen} : \mathcal{M}_n \to \mathbb{R}^+$ be some penalty function, possibly data-dependent. Then, define

\[ \hat m \in \arg\min_{m \in \mathcal{M}_n} \{ \mathrm{crit}(m) \} \quad \text{with} \quad \mathrm{crit}(m) := P_n\gamma(\hat s_m) + \mathrm{pen}(m). \tag{2} \]

Since the ideal criterion $\mathrm{crit}$ is the true prediction error $P\gamma(\hat s_m)$, the ideal penalty is

\[ \mathrm{pen}_{\mathrm{id}}(m) := P\gamma(\hat s_m) - P_n\gamma(\hat s_m). \]

Of course, this quantity is unknown because it depends on the true distribution $P$. A natural idea is to choose $\mathrm{pen}$ as close as possible to $\mathrm{pen}_{\mathrm{id}}$ for every model $m \in \mathcal{M}_n$. We show below, in a very general setting, that when $\mathrm{pen}$ estimates the ideal penalty $\mathrm{pen}_{\mathrm{id}}$ well, $\hat m$ satisfies an oracle inequality with a leading constant $C$ close to 1.

By definition of $\hat m$,

\[ \forall m \in \mathcal{M}_n, \quad P_n\gamma(\hat s_{\hat m}) \leq P_n\gamma(\hat s_m) + \mathrm{pen}(m) - \mathrm{pen}(\hat m). \]

For every $m \in \mathcal{M}_n$, we define

\[ p_1(m) = P\left( \gamma(\hat s_m) - \gamma(s_m) \right), \qquad p_2(m) = P_n\left( \gamma(s_m) - \gamma(\hat s_m) \right), \qquad \delta(m) = (P_n - P)(\gamma(s_m)), \]

so that $\mathrm{pen}_{\mathrm{id}}(m) = p_1(m) + p_2(m) - \delta(m)$ and

\[ \ell(s, \hat s_m) = P_n\gamma(\hat s_m) + p_1(m) + p_2(m) - \delta(m) - P\gamma(s). \]

We then have, for every $m \in \mathcal{M}_n$,

\[ \ell(s, \hat s_{\hat m}) + (\mathrm{pen} - \mathrm{pen}_{\mathrm{id}})(\hat m) \leq \ell(s, \hat s_m) + (\mathrm{pen} - \mathrm{pen}_{\mathrm{id}})(m). \tag{3} \]

So, in order to derive an oracle inequality from (3), we have to show that for every $m \in \mathcal{M}_n$, $\mathrm{pen}(m)$ is close to $\mathrm{pen}_{\mathrm{id}}(m)$.
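For readers who want the intermediate steps, inequality (3) can be checked directly from the definitions of Sect. 2.2. The following is only a worked verification using the symbols already introduced ($p_1$, $p_2$, $\delta$, $\mathrm{pen}_{\mathrm{id}}$); it adds no new assumption.

```latex
% Step 1: the three terms telescope to the ideal penalty.
\begin{align*}
p_1(m) + p_2(m) - \delta(m)
  &= \bigl[P\gamma(\hat s_m) - P\gamma(s_m)\bigr]
   + \bigl[P_n\gamma(s_m) - P_n\gamma(\hat s_m)\bigr]
   - \bigl[P_n\gamma(s_m) - P\gamma(s_m)\bigr] \\
  &= P\gamma(\hat s_m) - P_n\gamma(\hat s_m)
   = \mathrm{pen}_{\mathrm{id}}(m).
\end{align*}
% Step 2: rewrite the empirical risk with the excess loss.
\begin{align*}
P_n\gamma(\hat s_m)
  &= P\gamma(\hat s_m) - \mathrm{pen}_{\mathrm{id}}(m)
   = \ell(s,\hat s_m) + P\gamma(s) - \mathrm{pen}_{\mathrm{id}}(m).
\end{align*}
% Step 3: plug Step 2 (for m and for \hat m) into the definition of \hat m,
%   P_n\gamma(\hat s_{\hat m}) + pen(\hat m) <= P_n\gamma(\hat s_m) + pen(m),
% and cancel P\gamma(s) on both sides:
\begin{align*}
\ell(s,\hat s_{\hat m}) + (\mathrm{pen} - \mathrm{pen}_{\mathrm{id}})(\hat m)
  \;\leq\;
\ell(s,\hat s_m) + (\mathrm{pen} - \mathrm{pen}_{\mathrm{id}})(m).
\end{align*}
```

In particular, if $\mathrm{pen} = \mathrm{pen}_{\mathrm{id}}$ exactly, both correction terms vanish and (3) reduces to $\ell(s, \hat s_{\hat m}) \leq \ell(s, \hat s_m)$ for every $m$, i.e. an oracle inequality with $C = 1$.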

2.3 The slope heuristics

When the penalty $\mathrm{pen}$ is too large, the left-hand side of (3) stays larger than $\ell(s, \hat s_{\hat m})$, so that we can still obtain an oracle inequality (possibly with a large constant $C$). On the contrary, when $\mathrm{pen}$ is too small, the left-hand side of (3) can become negligible compared to $\ell(s, \hat s_{\hat m})$ (which makes $C$ explode) or, worse, can be nonpositive (so that we can no longer derive an oracle inequality from (3)). We shall see in the following that this corresponds to the existence of a minimal penalty.

Consider first the case $\mathrm{pen}(m) = p_2(m)$ in (2). Then, $E[\mathrm{crit}(m)] = E[P_n\gamma(s_m)] = P\gamma(s_m)$, so that $\hat m$ tends to be the model with the smallest bias, hence the most complex one. As a consequence, the risk of $\hat s_{\hat m}$ is very large. When $\mathrm{pen}(m) = K p_2(m)$, if $K < 1$, $\mathrm{crit}(m)$ is a decreasing function of the complexity of $m$, so that $\hat m$ is still one of the most complex models. On the contrary, when $K > 1$, $\mathrm{crit}(m)$ starts to increase with the complexity of $m$ (at least for the largest models), so that $\hat m$ has a smaller complexity. This intuition supports the conjecture that the minimal amount of penalty required for the model selection procedure to work may be $p_2(m)$.

In several situations (such as the framework of Sect. 4.1, as we shall prove in the following), it turns out that

\[ \forall m \in \mathcal{M}_n, \quad p_1(m) \approx p_2(m). \]

As a consequence, the ideal penalty $\mathrm{pen}_{\mathrm{id}}(m) \approx p_1(m) + p_2(m)$ is close to $2 p_2(m)$. On the other hand, $p_2(m)$ is actually a minimal penalty. So, we deduce that the optimal penalty is close to twice the minimal penalty:

\[ \mathrm{pen}_{\mathrm{id}}(m) \approx 2\, \mathrm{pen}_{\min}(m). \]

This is the so-called slope heuristics, which was first introduced by Birgé and Massart (2007) in a Gaussian setting. The practical interest of this heuristics is that the minimal penalty can be estimated from the data. Indeed, when the penalty is too small, the selected model $\hat m$ is among the most complex ones. On the contrary, when the penalty is larger than the minimal one, the complexity of $\hat m$ should be much smaller. This leads to the algorithm described in the next section.

3. A data-driven calibration algorithm

We are now in a position to define a data-driven calibration algorithm for penalization procedures. It generalizes a method proposed by Birgé and Massart (2007) and implemented by Lebarbier (2005).

3.1 The general algorithm

Assume that we know the shape $\mathrm{pen}_{\mathrm{shape}} : \mathcal{M}_n \to \mathbb{R}^+$ of the ideal penalty (because of some prior knowledge, or because we have been able to estimate it first, see Sect. 3.2). This means that the penalty $K^\star \mathrm{pen}_{\mathrm{shape}}$ provides an approximately optimal procedure, for some unknown constant $K^\star > 0$. Our goal is to find some $\hat K$ such that $\hat K\, \mathrm{pen}_{\mathrm{shape}}$ is approximately optimal.

We also assume that we know some complexity measure $D_m$ for each model $m \in \mathcal{M}_n$. Typically, when the models are finite-dimensional vector spaces, $D_m$ is the dimension of $S_m$. According to the slope heuristics, detailed in Sect. 2.3, the following algorithm provides an optimal calibration of the penalty $\mathrm{pen}_{\mathrm{shape}}$.

Algorithm 1 (Data-driven penalization with slope heuristics)

1. Compute the selected model $\hat m(K)$ as a function of $K > 0$:
\[ \hat m(K) \in \arg\min_{m \in \mathcal{M}_n} \left\{ P_n\gamma(\hat s_m) + K\, \mathrm{pen}_{\mathrm{shape}}(m) \right\}. \]
2. Find $\hat K_{\min} > 0$ such that $D_{\hat m(K)}$ is very large for $K < \hat K_{\min}$ and reasonably small for $K > \hat K_{\min}$.
3. Select the model $\hat m := \hat m(2 \hat K_{\min})$.

Computational aspects of Algorithm 1 and the accurate definition of $\hat K_{\min}$ are discussed in App. A. In particular, once $P_n\gamma(\hat s_m)$ and $\mathrm{pen}_{\mathrm{shape}}(m)$ are known for every $m \in \mathcal{M}_n$, the first step of this algorithm can be performed with a complexity proportional to $\mathrm{Card}(\mathcal{M}_n)^2$ (cf. Algorithm 2 and Prop. 3). This is a crucial point compared to cross-validation methods, in particular when performing empirical risk minimization is computationally heavy.

3.2 Shape of the penalty

To use Algorithm 1 in practice, it is necessary to know a priori, or at least to estimate, the optimal shape $\mathrm{pen}_{\mathrm{shape}}$ of the penalty. We now explain how this can be done in several different situations.

At first reading, one can have in mind the simple example $\mathrm{pen}_{\mathrm{shape}}(m) = D_m$. It is valid for homoscedastic least-squares regression on linear models, as shown by several papers mentioned in the introduction. Indeed, when $\mathrm{Card}(\mathcal{M}_n)$ is smaller than some power of $n$, it is well known that Mallows' $C_p$ penalty, defined by $\mathrm{pen}(m) = 2 E[\sigma^2(X)]\, n^{-1} D_m$, is asymptotically optimal. For larger collections $\mathcal{M}_n$, more elaborate results (Birgé and Massart, 2001, 2007) have shown that a penalty proportional to $\ln(n)\, E[\sigma^2(X)]\, n^{-1} D_m$ (depending on the size of $\mathcal{M}_n$) is asymptotically optimal.

Algorithm 1 then provides an alternative to plugging an estimator of $E[\sigma^2(X)]$ into the above penalties. We would like to underline two main advances of our approach.

First, we avoid the difficult task of estimating $E[\sigma^2(X)]$, which generally relies on the existence of a large model without bias. Our algorithm provides a model-free estimation of the multiplying factor in front of the penalty.

Second, there is absolutely no reason why the best estimator $\hat\sigma^2$ of $E[\sigma^2(X)]$ (in terms of bias or quadratic risk, for instance) should lead to the most efficient model selection procedure. For instance, it is well known that underpenalization (i.e. underestimating the multiplicative factor) leads to very poor performance, whereas overpenalization is generally less costly. Then, one can expect that minimizing the probability of underestimating $E[\sigma^2(X)]$ may lead to better performance than minimizing the bias. Given that there are certainly several other important factors in optimizing the choice of $\hat\sigma^2$, some of them unknown, the plug-in approach seems quite tricky.
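As an illustration of Algorithm 1 (Sect. 3.1), here is a minimal Python sketch, assuming that the empirical risks $P_n\gamma(\hat s_m)$, the penalty shape and the complexities $D_m$ have already been computed for every model. Two choices are assumptions of this sketch and not taken from the paper: $K$ is scanned over a finite user-supplied grid (the paper's Algorithm 2 in App. A is not reproduced here), and $\hat K_{\min}$ is taken as the grid point just after the largest drop of $D_{\hat m(K)}$, a simple "dimension jump" rule standing in for the accurate definition given in App. A. The names `select_model` and `slope_heuristics` are hypothetical.

```python
import numpy as np


def select_model(emp_risk, pen_shape, K):
    """Step 1: index of a minimiser of P_n gamma(s_hat_m) + K * pen_shape(m)."""
    return int(np.argmin(np.asarray(emp_risk) + K * np.asarray(pen_shape)))


def slope_heuristics(emp_risk, pen_shape, dims, K_grid):
    """Return (K_min_hat, index of the selected model).

    Assumption of this sketch: K_min_hat is the grid value right after the
    largest drop in the selected complexity D_{m(K)} along the sorted grid.
    """
    emp_risk = np.asarray(emp_risk, dtype=float)
    pen_shape = np.asarray(pen_shape, dtype=float)
    dims = np.asarray(dims)
    K_grid = np.sort(np.asarray(K_grid, dtype=float))

    # Step 1: complexity of the selected model for each K on the grid.
    selected_dims = np.array(
        [dims[select_model(emp_risk, pen_shape, K)] for K in K_grid]
    )
    # Step 2: locate the largest drop of D_{m(K)} between consecutive grid points.
    drops = selected_dims[:-1] - selected_dims[1:]
    K_min_hat = float(K_grid[int(np.argmax(drops)) + 1])
    # Step 3: select the model with twice the estimated minimal constant.
    m_hat = select_model(emp_risk, pen_shape, 2.0 * K_min_hat)
    return K_min_hat, m_hat


# Toy, noise-free usage: the empirical risk behaves like bias - D/n,
# so the minimal constant for pen_shape(m) = D_m / n is 1 by construction.
n = 1000
dims = np.arange(1, n + 1)
emp_risk = 4.0 / dims - dims / n
pen_shape = dims / n
K_min_hat, m_hat = slope_heuristics(emp_risk, pen_shape, dims, np.linspace(0.0, 4.0, 17))
print(K_min_hat, dims[m_hat])
```

In this toy example the selected dimension stays at its maximum for every grid value $K \leq 1$ and collapses at the next grid point, so the estimate of the minimal constant is only accurate up to the grid step; a finer grid near the jump, or an exact computation of the path $K \mapsto \hat m(K)$, would sharpen it.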

With Algorithm 1, we do not care about the bias or the quadratic risk of $2 \hat K_{\min}$ as an estimator of $2 E[\sigma^2(X)]\, n^{-1}$. Since we define $\hat K_{\min}$ in terms of the output of the model selection procedure $\hat m(K)$, we focus directly on the model selection problem. In particular, we guarantee that the selected model is not too large, which solves part of the underpenalization issue. In brief, we would like to emphasize that Algorithm 1 with $\mathrm{pen}_{\mathrm{shape}}(m) = D_m$ is quite different from a simple plug-in version of Mallows' $C_p$. It leads to a truly data-dependent penalty, which may perform better in practice than the best deterministic penalty $K^\star D_m$.

In a more general framework, Algorithm 1 allows one to choose a different penalty shape $\mathrm{pen}_{\mathrm{shape}}$. For instance, in the heteroscedastic least-squares regression framework of Sect. 2.1, the optimal penalty is no longer proportional to the dimension $D_m$ of the models. This can be shown from computations made by Arlot (2008b) when $S_m$ is assumed to be the vector space of piecewise constant functions on a partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$:

\[ E[\mathrm{pen}_{\mathrm{id}}(m)] = E[(P - P_n)\gamma(\hat s_m)] \approx \frac{1}{n} \sum_{\lambda \in \Lambda_m} E\left[ \sigma(X)^2 \mid X \in I_\lambda \right]. \tag{4} \]

A more accurate result can even be found in Chap. 4 of (Arlot, 2007), where an example of a model selection problem is given in which no penalty proportional to $D_m$ can be asymptotically optimal.

A first answer to this issue can be given when both the distribution of $X$ and the shape of the noise level $\sigma$ are known, which is simply to use (4) to compute $\mathrm{pen}_{\mathrm{shape}}$. This is of course unsatisfactory, because one seldom has such prior knowledge in practice. Our suggestion in this situation is to use resampling penalties (Efron, 1983; Arlot, 2008a), or V-fold penalties (Arlot, 2008b), which have a much smaller computational cost. Indeed, up to a multiplicative factor (which is automatically estimated by Algorithm 1), these penalties should estimate $E[\mathrm{pen}_{\mathrm{id}}(m)]$ well in a general framework. In particular, their asymptotic optimality has been proven in the heteroscedastic least-squares regression framework by Arlot (2008b,a), in the framework of Sect. 4.1, and several theoretical results support the conjecture that they are valid much more generally.

3.3 The general prediction framework

In Sect. 2 and in the definition of Algorithm 1, we have restricted ourselves to the least-squares regression framework. This is actually not necessary at all to make Algorithm 1 well-defined, so that we can naturally extend it to the general prediction framework. More precisely, the $(X_i, Y_i)$ need only be assumed to belong to $\mathcal{X} \times \mathcal{Y}$ for some general $\mathcal{Y}$, and $\gamma : S \times (\mathcal{X} \times \mathcal{Y}) \to [0; +\infty)$ can be any contrast function. In particular, $\mathcal{Y} = \{0, 1\}$ leads to the binary classification problem, and a natural contrast function is the 0-1 loss $\gamma(t; (x, y)) = 1_{t(x) \neq y}$. In this case, the shape of the penalty $\mathrm{pen}_{\mathrm{shape}}$ can for instance be estimated with the global or local Rademacher complexities mentioned in the introduction, as well as several other classical penalties.

However, one can wonder whether the slope heuristics of Sect. 2.3, upon which Algorithm 1 relies, can be extended to this general framework. We do not have a complete answer to this question, only some preliminary evidence. First, in order to prove the validity of the slope heuristics in the least-squares regression framework (with the theoretical results of Sect. 4), we use several concentration results which are valid in a very general setting, including binary classification. Even if the factor 2 (which comes from the closeness of $E[p_1]$ and $E[p_2]$, cf. Sect. 2.3) may not be universally valid, we conjecture that Algorithm 1 can be used in several settings outside the least-squares regression case. Second, as already mentioned at the end of the introduction, several empirical studies have shown that Algorithm 1 can be successfully applied to several problems, with several shapes for the penalty. To our knowledge, a formal proof of this fact remains an interesting open problem.

4. Theoretical results

Algorithm 1 mainly relies on the slope heuristics, which is developed in Sect. 2.3. The goal of this section is to provide a theoretical justification of this heuristics. It is split into two main results. First, lower bounds on $D_{\hat m}$ and the risk of $\hat s_{\hat m}$ when the penalty is smaller than $\mathrm{pen}_{\min}(m) := E[p_2(m)]$ (Thm. 1). Second, an oracle inequality with constant almost one when $\mathrm{pen}(m) \approx 2 E[p_2(m)]$ (Thm. 2), relying on (3) and the comparison $p_1 \approx p_2$.

In order to prove these two theorems, we need two kinds of probabilistic results. First, $p_1$, $p_2$ and $\delta$ all concentrate around their expectations (which can be done in a quite general framework, at least for $p_2$ and $\delta$, see App. B.5). Second, $E[p_1(m)] \approx E[p_2(m)]$ for every $m \in \mathcal{M}_n$. The latter point is quite hard in general, so that we must make a structural assumption on the models. This is why, in this section, we restrict ourselves to the histogram case, assuming that for every $m \in \mathcal{M}_n$, $S_m$ is the set of piecewise constant functions on some fixed partition $(I_\lambda)_{\lambda \in \Lambda_m}$. We describe this framework in the next subsection.

Remember that we do not consider histograms as a final goal. We only make this assumption in order to prove some first theoretical results confirming that Algorithm 1 can be used in practical applications. Such theoretical results may also be quite interesting for better understanding how to use this algorithm in practice.

4.1 Histograms

A model of histograms $S_m$ is the set of piecewise constant functions (histograms) on some partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$. It is thus a vector space of dimension $D_m = \mathrm{Card}(\Lambda_m)$, spanned by the family $(1_{I_\lambda})_{\lambda \in \Lambda_m}$. As this basis is orthogonal in $L^2(\mu)$ for any probability measure $\mu$ on $\mathcal{X}$, computations are quite easy. This is the only reason why we assume that each $S_m$ is a model of histograms in this section. In particular, we have

\[ s_m = \sum_{\lambda \in \Lambda_m} \beta_\lambda 1_{I_\lambda} \quad \text{and} \quad \hat s_m = \sum_{\lambda \in \Lambda_m} \hat\beta_\lambda 1_{I_\lambda}, \]

where

\[ \beta_\lambda := E_P[Y \mid X \in I_\lambda], \qquad \hat\beta_\lambda := \frac{1}{n \hat p_\lambda} \sum_{X_i \in I_\lambda} Y_i, \qquad \hat p_\lambda := P_n(X \in I_\lambda). \]

Remark that $\hat s_m$ is uniquely defined if and only if each $I_\lambda$ contains at least one of the $X_i$. Otherwise, we consider that the model $m$ cannot be chosen.
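The closed-form expressions of Sect. 4.1 translate directly into code. The sketch below is an illustration only, assuming a one-dimensional feature space whose partition is given by a sorted array of cell boundaries; the function name `fit_histogram` and the `edges` argument are hypothetical. It returns $\hat\beta_\lambda$, $\hat p_\lambda$ and the empirical risk $P_n\gamma(\hat s_m)$, and reports an infinite risk when some cell is empty, mirroring the convention that such a model cannot be chosen.

```python
import numpy as np


def fit_histogram(x, y, edges):
    """Least-squares fit of a piecewise constant function on the cells of `edges`.

    Returns (beta_hat, p_hat, empirical_risk). If a cell contains no X_i, the
    estimator is not uniquely defined and the empirical risk is set to +inf.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    n_cells = len(edges) - 1

    # Index of the cell I_lambda containing each X_i (last cell closed on the right).
    cell = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_cells - 1)

    counts = np.bincount(cell, minlength=n_cells)            # n * p_hat_lambda
    sums = np.bincount(cell, weights=y, minlength=n_cells)   # sum of Y_i in each cell
    p_hat = counts / n                                       # p_hat_lambda = P_n(X in I_lambda)

    with np.errstate(invalid="ignore", divide="ignore"):
        beta_hat = sums / counts                             # mean of Y_i in each cell

    if np.any(counts == 0):
        return beta_hat, p_hat, np.inf                       # model m cannot be chosen

    # Empirical risk P_n gamma(s_hat_m) = (1/n) * sum_i (s_hat_m(X_i) - Y_i)^2.
    risk = float(np.mean((beta_hat[cell] - y) ** 2))
    return beta_hat, p_hat, risk
```

For example, with `x = [0.1, 0.2, 0.6, 0.9]`, `y = [1, 3, 2, 4]` and `edges = [0, 0.5, 1]`, each cell contains two points, so $\hat p_\lambda = 1/2$ on both cells and $\hat\beta_\lambda$ is the mean of the two responses in each cell.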

4.2 Main assumptions

For both our main results, we make the following assumptions. First, $(S_m)_{m \in \mathcal{M}_n}$ is a family of histogram models satisfying

(P1) Polynomial complexity of $\mathcal{M}_n$: $\mathrm{Card}(\mathcal{M}_n) \leq c_{\mathcal{M}}\, n^{\alpha_{\mathcal{M}}}$.
(P2) Richness of $\mathcal{M}_n$: $\exists m_0 \in \mathcal{M}_n$ s.t. $D_{m_0} \in [\sqrt{n}, c_{\mathrm{rich}} \sqrt{n}]$.

Assumption (P1) is quite classical when one aims at proving the asymptotic optimality of a model selection procedure (it is for instance implicitly assumed by Li (1987), in the homoscedastic fixed-design case). For any penalty function $\mathrm{pen} : \mathcal{M}_n \to \mathbb{R}^+$, we define the following model selection procedure:

\[ \hat m \in \arg\min_{m \in \mathcal{M}_n,\ \min_{\lambda \in \Lambda_m} \{ \hat p_\lambda \} > 0} \left\{ P_n\gamma(\hat s_m) + \mathrm{pen}(m) \right\}. \tag{5} \]

Moreover, we assume that the data $(X_i, Y_i)_{1 \leq i \leq n}$ are i.i.d. and satisfy the following:

(Ab) The data are bounded: $\|Y_i\|_\infty \leq A < \infty$.
(An) Uniform lower bound on the noise level: $\sigma(X_i) \geq \sigma_{\min} > 0$ a.s.
(Ap_u) The bias decreases as a power of $D_m$: there exist $\beta_+ > 0$ and $C_+ > 0$ such that $\ell(s, s_m) \leq C_+ D_m^{-\beta_+}$.
(Ar^X_ℓ) Lower regularity of the partitions for $\mathcal{L}(X)$: $D_m \min_{\lambda \in \Lambda_m} \{ P(X \in I_\lambda) \} \geq c^X_{r,\ell}$.

Further comments on these assumptions are made in the following, explaining in particular how to relax them.

4.3 Minimal penalties

Our first result is the existence of a minimal penalty.

Theorem 1. Make all the assumptions of Sect. 4.2. Let $K \in [0; 1)$, $L > 0$, and assume that there is an event of probability at least $1 - L n^{-2}$ on which

\[ \forall m \in \mathcal{M}_n, \quad 0 \leq \mathrm{pen}(m) \leq K\, E\left[ P_n\left( \gamma(s_m) - \gamma(\hat s_m) \right) \right]. \tag{6} \]

Then, if $\hat m$ is defined by (5), there exist two constants $K_1$, $K_2$ such that, with probability at least $1 - K_1 n^{-2}$,

\[ D_{\hat m} \geq K_2\, n \ln(n)^{-1}. \tag{7} \]

On the same event,

\[ \ell(s, \hat s_{\hat m}) \geq \ln(n) \inf_{m \in \mathcal{M}_n} \{ \ell(s, \hat s_m) \}. \tag{8} \]

The constants $K_1$ and $K_2$ may depend on $K$, $L$ and the constants in (P1), (P2), (Ab), (An), (Ap_u) and (Ar^X_ℓ), but not on $n$.

This theorem thus validates the first part of the heuristics of Sect. 2.3, proving that there is a minimal amount of penalization required, under which both the selected dimension $D_{\hat m}$ and the quadratic risk of the final estimator $\ell(s, \hat s_{\hat m})$ blow up. This coupling is quite interesting, since the dimension $D_{\hat m}$ is known in practice, contrary to $\ell(s, \hat s_{\hat m})$. It is then possible to detect from the data that the penalty is too small, as proposed in Algorithm 1.

The main interest of this result is its coupling with Thm. 2 below. However, Thm. 1 is also of interest in itself, since it helps to better understand the theoretical properties of penalization procedures. Indeed, it generalizes the results of Birgé and Massart (2007) on the existence of minimal penalties to heteroscedastic regression on a random design (even if we have to restrict ourselves to histogram models, as already explained). We then have a general formulation for the minimal penalty,

\[ \mathrm{pen}_{\min}(m) := E\left[ P_n\left( \gamma(s_m) - \gamma(\hat s_m) \right) \right], \]

which includes situations where it is not proportional to the dimension $D_m$ of the models (cf. Sect. 3.2 and references therein).

In addition, assumptions (Ab) and (An) on the data are much weaker than the Gaussian homoscedastic assumption. They are also much more realistic, and an important point is that they can be strongly relaxed. Roughly, the boundedness of the data can be replaced by some conditions on the moments of the noise, and the uniform lower bound on the noise level is no longer necessary when $\sigma$ satisfies some mild regularity assumptions. We refer to (Arlot, 2008a) (in particular Sect. 4.3) for detailed statements of these assumptions, and explanations on how to adapt our proofs to these situations.

Finally, let us briefly comment on (Ap_u) and (Ar^X_ℓ). The upper bound (Ap_u) on the bias holds in most reasonable situations, for instance when $\mathcal{X} \subset \mathbb{R}^k$ is bounded, the partition $(I_\lambda)_{\lambda \in \Lambda_m}$ is regular and the regression function $s$ is $\alpha$-Hölderian for some $\alpha > 0$ ($\beta_+$ depending on $\alpha$ and $k$). It ensures that large models have a significantly smaller bias than smaller ones (otherwise, the selected dimension would be allowed to be smaller with a significant probability). On the other hand, (Ar^X_ℓ) is satisfied at least for almost regular histograms, when $X$ has a lower-bounded density w.r.t. the Lebesgue measure on $\mathcal{X} \subset \mathbb{R}^k$.

The reason why we state Thm. 1 with a general formulation of (Ap_u) and (Ar^X_ℓ) (instead of assuming that $s$ is $\alpha$-Hölderian and $X$ has a lower-bounded density w.r.t. the Lebesgue measure, for instance) is to point out the generality of the minimal penalization phenomenon. It occurs as soon as the models are not too pathological. In particular, we do not make any assumption on the distribution of $X$ itself, but only that the models are not too badly chosen according to this distribution. Such a condition can be checked in practice if one has some prior knowledge on $\mathcal{L}(X)$, or if one has some unlabeled data (which is often the case).

4.4 Optimal penalties

Algorithm 1 relies on a link between the minimal penalty (pointed out by Thm. 1) and some optimal penalty. The following result is a formal proof of this link in our framework: penalties close to twice the minimal penalty satisfy an oracle inequality with a leading constant approximately equal to one.

12 S. Arlot ad P. Massart Theorem 2 Make all the assumptios of Sect. 4.2, ad add the followig: (Ap) The bias decreases like a power of D m : there exists β β + > 0 ad C +,C > 0 such that C D β m l(s,s m ) C + D β + m. Let δ (0,1), L > 0, ad assume that there is a evet of probability at least 1 L 2 o which, for every m M, (2 δ)e [P (γ(s m ) γ(ŝ m ))] pe(m) (2 + δ)e [P (γ(s m ) γ(ŝ m ))]. (9) The, if m is defied by (5) ad 0 < η < mi {β + ;1} /2, there exists a costat K 3 ad a sequece ǫ covergig to zero at ifiity such that, with probability at least 1 K 3 2, D bm 1 η ad l(s,ŝ bm ) ( ) 1 + δ 1 δ + ǫ Moreover, we have the oracle iequality E [l(s,ŝ bm )] ( ) [ 1 + δ 1 δ + ǫ E if {l(s,ŝ m )}. (10) m M if {l(s,ŝ m )} m M ] + A2 K 3 2. (11) The costat K 3 may deped o L,δ,η ad the costats i (P1), (P2), (Ab), (A), (Ap) ad (Ar X l ), but ot o. The small term ǫ is smaller tha l() 1/5 ; it ca also be take smaller tha δ for ay δ (0;δ 0 (β,β + )) at the price of elargig K 3. This theorem shows that twice the miimal pealty pe mi poited out by Thm. 1 satisfies a oracle iequality with a leadig costat almost equal to oe. It eve stays valid whe the pealty is oly close to twice the miimal oe, which meas i particular that oe ca estimate the shape of the miimal pealty by resamplig for istace (see Sect. 3.2). The ratioale behid this theorem is that the ideal pealty pe id (m) is close to its expectatio, which is itself close to 2E [P (γ(s m ) γ(ŝ m ))]. The, (3) directly implies a oracle iequality like (10), hece (11). I other words, we have prove the secod part of the slope heuristics of Sect Actually, Thm. 2 above is a corollary of a more geeral result (Thm. 5), that we state i App. B.2. I particular, if pe(m) KE [P (γ(s m ) γ(ŝ m ))] (12) istead of (9), we ca prove uder the same assumptios that the same oracle iequality holds with a large probability, with a leadig costat C(K) + ǫ istead of almost oe. Whe K (1,2], we have C(K) = (K 1) 1, ad whe K > 2, C(K) = K 1. 
This means that for every $K > 1$, the penalty defined by (12) is efficient, up to a multiplicative constant. This is well known in the homoscedastic case (Birgé and Massart, 2001; Baraud, 2000, 2002), but new in the heteroscedastic one. The most important consequences of this result follow from its combination with Thm. 1. We detail them in the next subsection.

Let us first comment on the additional

assumption (Ap), i.e. the lower bound on the bias. It means that $s$ is not too well approximated by the models $S_m$, which may seem surprising. Notice that it is classical to assume that $\ell(s, s_m) > 0$ for every $m \in \mathcal{M}_n$, for proving the asymptotic optimality of Mallows' $C_p$ (cf. Shibata (1981), Li (1987) and Birgé and Massart (2007)). Moreover, the stronger assumption (Ap) has already been made by Stone (1985) and Burman (2002) in the density estimation framework, for the same technical reasons as ours.

As detailed in (Arlot, 2008a) where a similar technique is used to derive an oracle inequality, when the lower bound in (Ap) is no longer assumed, (10) holds with two modifications in its right-hand side: the inf is restricted to models of dimension larger than $\ln(n)^{\gamma_1}$, and there is a remainder term $\ln(n)^{\gamma_2} n^{-1}$ (where $\gamma_1$ and $\gamma_2$ are numerical constants). This is essentially the same as (10), unless there is a model of small dimension with a very small bias, and the lower bound in (Ap) is sufficient to ensure that this does not happen. Notice that if there is such a very small model very close to $s$, it is hopeless to obtain an oracle inequality with a penalty which estimates $\mathrm{pen}_{\mathrm{id}}$, simply because deviations of $\mathrm{pen}_{\mathrm{id}}$ around its expectation would be much larger than the excess loss of the oracle. In such a situation, BIC-like methods are more appropriate.

Another argument in favour of (Ap) is that it is not too strong, because it is at least satisfied in the following case: $(I_\lambda)_{\lambda \in \Lambda_m}$ is "regular", $X$ has a lower-bounded density w.r.t. the Lebesgue measure on $\mathcal{X} \subset \mathbb{R}^k$, and $s$ is non-constant and $\alpha$-hölderian (w.r.t. $\|\cdot\|_\infty$), with
$$\beta_1 = k^{-1} + \alpha^{-1} - (k-1) k^{-1} \alpha^{-1} \quad \text{and} \quad \beta_2 = 2 \alpha k^{-1}.$$
We refer to Sect. [...] in (Arlot, 2007) for more details about this claim (including complete proofs). We finally mention that this is not the only case where (Ap) holds, which is the reason why we use (Ap) as an assumption, and not these sufficient conditions (cf. the comments at the end of Sect. 4.3).

4.5 Main theoretical and practical consequences

Combining Thm.
1 and 2, we are now in a position to prove the slope heuristics described in Sect. 2.3, as well as the validity of our Algorithm 1 (provided that $\mathrm{pen}_{\mathrm{shape}}$ is well chosen, for instance estimated by resampling).

Optimal penalty vs. minimal penalty

For the sake of simplicity, consider the penalty $K\, \mathbb{E}[p_2(m)]$ with any $K > 0$ (the same phenomenon occurring for a penalty approximately equal to this one). At first reading, one can think of the homoscedastic case where $\mathbb{E}[p_2(m)] \approx \sigma^2 D_m n^{-1}$, the general picture being quite similar (this generalization is one of the novelties of our results).

With Thm. 2, we have shown that it satisfies an oracle inequality with a leading constant $C_n(K)$ as soon as $K > 1$. Moreover, $C_n(2) \approx 1$. According to (Arlot, 2008b) (the proof of its Thm. 1, in particular Lemma 6), $C_n(K)$ stays away from 1 as soon as $K$ is not close to 2. This means that $K = 2$ is the optimal multiplying factor in front of $\mathbb{E}[p_2(m)]$.

On the other hand, when $K < 1$, Thm. 1 shows that no oracle inequality can hold with a leading constant $C_n(K)$ smaller than $\ln(n)$ (and even much larger in most cases, according to the proof of Thm. 1). Since $C_n(K) \le (K-1)^{-1} < \ln(n)$ as soon as $K > 1 + \ln(n)^{-1}$, this

means that $K = 1$ is the minimal multiplying factor in front of $\mathbb{E}[p_2(m)]$. More generally, we have proven that $\mathrm{pen}_{\min}(m) := \mathbb{E}[p_2(m)]$ is a minimal penalty.

In a nutshell, this is a formal proof of the heuristics of Sect. 2.3:
$$\text{"optimal" penalty} \approx 2 \times \text{"minimal" penalty}.$$
This has already been proposed by Birgé and Massart (2007), but their results were restricted to the Gaussian homoscedastic framework. In this paper, we extend them to a non-Gaussian and heteroscedastic setting.

Dimension jump

In addition, Thm. 1 and 2 prove the existence of a crucial phenomenon around the minimal penalty, which is the existence of a "dimension jump". This is the only reason why we can estimate the minimal penalty in practice (since the explosion of the prediction error cannot be directly observed), so that Algorithm 1 strongly relies on it.

Indeed, consider again the penalty $K\, \mathbb{E}[p_2(m)]$, and define $\widehat{m}(K)$ the selected model as a function of $K$. For each $K > 0$, with a large probability, we have $D_{\widehat{m}(K)} \le n^{1-\eta}$ if $K > 1$ and $D_{\widehat{m}(K)} \ge K_2\, n (\ln(n))^{-1}$ if $K < 1$ (the constant $K_2$ depends on $K$). More precisely, a careful look at the proofs shows that this holds simultaneously in the following sense: there are constants $K_4, K_5 > 0$ and an event of probability $1 - K_4 n^{-2}$ on which
$$\forall K \in \left(0,\, 1 - \ln(n)^{-1}\right), \quad D_{\widehat{m}(K)} \ge K_5\, n (\ln(n))^{-2}$$
and
$$\forall K \in \left(1 + \ln(n)^{-1},\, +\infty\right), \quad D_{\widehat{m}(K)} \le n^{1-\eta}.$$
This means that there must be a dimension jump around $K = 1$, from dimensions of order at least $n (\ln(n))^{-2}$ to dimensions much smaller, of order at most $n^{1-\eta}$. Actually, there can be several jumps instead of only one, but they occur for very close values of $K$ (at least when $n$ is large).

Let us now come back to Algorithm 1. Defining a "reasonably small" dimension as any dimension smaller than $n (\ln(n))^{-3}$, we have proven that $\widehat{K}_{\min}$ must be close to the true minimal multiplying factor. When the penalty is $K\, \mathbb{E}[p_2(m)]$, we have
$$1 - \frac{1}{\ln(n)} \le \widehat{K}_{\min} \le 1 + \frac{1}{\ln(n)}$$
with a probability at least $1 - K_4 n^{-2}$. Notice that $n (\ln(n))^{-3}$ can be replaced by any dimension between $K_5\, n (\ln(n))^{-2}$ and $n^{1-\eta}$, which are very far apart as soon as $n$ is large enough.
Hence, this dimension threshold does not have to be chosen accurately as soon as $n$ is not small.

Combined with Thm. 2, this shows that the model selection procedure of Algorithm 1 satisfies an oracle inequality with a leading constant smaller than $1 + 2 \ln(n)^{-1/5}$, on a large probability event. In addition, the same result holds when $\mathrm{pen}_{\mathrm{shape}}$ is only close to the ideal penalty shape, e.g. within a ratio $1 \pm \ln(n)^{-1}$. In particular, the resampling penalties of Efron (1983) and Arlot (2008b,a) satisfy this condition on a large probability event. We refer to Sect. 3.2 for further discussion on this question.
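As a rough numerical companion to Sect. 4.5, the sketch below tabulates the leading constant $C(K)$ stated after Thm. 2 and the window $1 \pm 1/\ln(n)$ around the minimal multiplying factor. The function names `leading_constant` and `k_min_window` are hypothetical, and the remainder term $\varepsilon_n$ as well as the unspecified constants ($K_3$, $K_4$) are ignored, so the numbers only illustrate the shape of the statements and are not a finite-sample guarantee.

```python
import math


def leading_constant(K):
    """C(K) as stated after Thm. 2: 1/(K-1) on (1, 2], K-1 for K > 2.

    The remainder term eps_n is ignored (assumption of this sketch).
    """
    if K <= 1:
        # No oracle inequality is claimed at or below the minimal penalty.
        raise ValueError("C(K) is only stated for K > 1")
    return 1.0 / (K - 1.0) if K <= 2 else K - 1.0


def k_min_window(n):
    """Interval [1 - 1/ln(n), 1 + 1/ln(n)] around the minimal factor K = 1."""
    r = 1.0 / math.log(n)
    return 1.0 - r, 1.0 + r


if __name__ == "__main__":
    for K in (1.25, 1.5, 2.0, 3.0, 4.0):
        print(K, leading_constant(K))
    print(k_min_window(200))
```

The constant is smallest at $K = 2$, where it equals 1, and it degrades in both directions: $K = 1.5$ and $K = 3$ both give a constant of 2.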

5. Conclusion

We have seen in this paper that it is possible to provide mathematical evidence that the method introduced by Birgé and Massart (2007) to design data-driven penalties remains efficient in a non-Gaussian context. Our purpose in this concluding section is to relate the heuristics that we have developed in Sect. 2 to the well-known Mallows' $C_p$ and Akaike's criteria and to the unbiased (or almost unbiased) estimation of the risk principle.

To explain our idea, which consists in guessing from the data themselves what the right penalty is, let us come back to Gaussian model selection. Towards this aim let us consider some empirical criterion $\gamma_n$ (which can be the least-squares criterion as in this paper but which could be the log-likelihood criterion as well). Let us also consider some collection of models $(S_m)_{m \in \mathcal{M}}$ and in each model $S_m$ some minimizer $s_m$ of $t \mapsto \mathbb{E}[\gamma_n(t)]$ over $S_m$ (assuming that such a point does exist). Defining for every $m \in \mathcal{M}$,
$$\widehat{b}_m = \gamma_n(s_m) - \gamma_n(s) \quad \text{and} \quad \widehat{v}_m = \gamma_n(s_m) - \gamma_n(\widehat{s}_m),$$
minimizing some penalized criterion $\gamma_n(\widehat{s}_m) + \mathrm{pen}(m)$ over $\mathcal{M}$ amounts to minimizing
$$\widehat{b}_m - \widehat{v}_m + \mathrm{pen}(m).$$
The point is that $\widehat{b}_m$ is an unbiased estimator of the bias term $\ell(s, s_m)$. With concentration arguments in mind, one can hope that minimizing the quantity above will be approximately equivalent to minimizing
$$\ell(s, s_m) - \mathbb{E}[\widehat{v}_m] + \mathrm{pen}(m).$$
Since the purpose of the game is to minimize the risk $\mathbb{E}[\ell(s, \widehat{s}_m)]$, an ideal penalty would therefore be
$$\mathrm{pen}(m) = \mathbb{E}[\widehat{v}_m] + \mathbb{E}[\ell(s_m, \widehat{s}_m)].$$
In the Mallows' $C_p$ case (for Gaussian fixed design regression least-squares), the models $S_m$ are linear and $\mathbb{E}[\widehat{v}_m] = \mathbb{E}[\ell(s_m, \widehat{s}_m)]$ are explicitly computable (at least if the level of noise is assumed to be known). For Akaike's penalized log-likelihood criterion, this is similar, at least asymptotically. More precisely, one uses the fact that
$$\mathbb{E}[\widehat{v}_m] \approx \mathbb{E}[\ell(s_m, \widehat{s}_m)] \approx \frac{D_m}{2n},$$
where $D_m$ stands for the number of parameters defining model $S_m$.
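The Mallows' $C_p$ case mentioned above can be checked numerically with a minimal sketch, under assumptions that are not in the text: a fixed design on a regular grid, regular histogram models with equal-size bins, Gaussian noise, and the empirical quadratic loss. Under these assumptions $\widehat{v}_m$ and $\ell(s_m, \widehat{s}_m)$ coincide sample by sample (both equal the squared empirical norm of the projected noise), with expectation $\sigma^2 D_m / n$. All function names are hypothetical.

```python
import math
import random


def bin_means(values, n_bins):
    """Least-squares projection onto piecewise-constant vectors
    (regular histogram with n_bins consecutive blocks of equal size)."""
    n = len(values)
    assert n % n_bins == 0, "sketch assumes equal-size bins"
    size = n // n_bins
    out = []
    for b in range(n_bins):
        block = values[b * size:(b + 1) * size]
        out.extend([sum(block) / size] * size)
    return out


def mean_sq(u, v):
    """Empirical quadratic distance (1/n) * sum (u_i - v_i)^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def v_hat_and_loss(n=200, n_bins=10, sigma=1.0, seed=0):
    """Return (v_hat_m, l(s_m, s_hat_m)) for one simulated fixed-design sample."""
    rng = random.Random(seed)
    x = [(i + 0.5) / n for i in range(n)]
    s = [math.sin(math.pi * xi) for xi in x]          # regression function
    y = [si + sigma * rng.gauss(0.0, 1.0) for si in s]
    s_m = bin_means(s, n_bins)                        # best approximation in S_m
    s_hat = bin_means(y, n_bins)                      # least-squares estimator
    v_hat = mean_sq(y, s_m) - mean_sq(y, s_hat)       # gamma_n(s_m) - gamma_n(s_hat_m)
    loss = mean_sq(s_m, s_hat)                        # l(s_m, s_hat_m)
    return v_hat, loss


if __name__ == "__main__":
    print(v_hat_and_loss())
```

Averaging the second output over many seeds should give a value near $\sigma^2 D_m / n = 0.05$ for the default arguments, which is the penalty shape used by Mallows' $C_p$ up to the factor 2.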
The conclusion of these considerations is that Mallows' $C_p$ as well as Akaike's criterion are indeed both based on the unbiased (or asymptotically unbiased) risk estimation principle.

The first idea that we are using in this paper is that one can go further in this direction and that the approximation $\mathbb{E}[\widehat{v}_m] \approx \mathbb{E}[\ell(s_m, \widehat{s}_m)]$ remains valid even in a non-asymptotic context. If one believes in it, then a good penalty becomes $2\,\mathbb{E}[\widehat{v}_m]$ or equivalently (still having concentration arguments in mind) $2\,\widehat{v}_m$. This in some sense explains the rule of thumb which is given by Birgé and Massart (2007) and further studied in this paper, and connects

it to Mallows' $C_p$ and Akaike's heuristics. Indeed, the minimal penalty is $\widehat{v}_m$ while the optimal penalty should be $\widehat{v}_m + \mathbb{E}[\ell(s_m, \widehat{s}_m)]$ and their ratio is approximately equal to 2.

The second idea that we are using in this paper is that one can guess the minimal penalty from the data. There are indeed several ways to perform the estimation of the minimal penalty. Here we are using the jump of dimension which occurs around the minimal penalty. When the shape of the minimal penalty is (at least approximately) of the form $\alpha D_m$, this amounts to estimating the unknown value $\alpha$ by the slope of the graph of $\gamma_n(\widehat{s}_m)$ for large enough values of $D_m$. It is easy to extend this method to other shapes of penalties, simply by replacing $D_m$ by some (known!) function $f(D_m)$. It is even possible to combine resampling ideas with the slope heuristics by taking a random function $f$ which is built from a randomized empirical criterion. As shown by Arlot (2007), this approach turns out to be much more efficient than the rougher choice $f(D_m) = D_m$ for highly heteroscedastic random regression frameworks.

Of course, the question of the optimality of the slope heuristics remains widely open, but we believe that on the one hand this heuristics can be helpful in practice and that on the other hand, proving its efficiency even on a toy model as we did in this paper is already something.

Let us finally mention that contrary to Birgé and Massart (2007), we have restricted our study to the situation where the collection of models $\mathcal{M}_n$ is "small", i.e. has a size growing at most like a power of $n$. For several problems, such as complete variable selection, this assumption does not hold, and it is known from the homoscedastic case that the minimal penalty is much larger than $\mathbb{E}[p_2(m)]$. For instance, using the results by Birgé and Massart (2007) in the Gaussian case, Émilie Lebarbier has used the slope heuristics with
$$f(D_m) = D_m \left( 2.5 + \ln\left( \frac{n}{D_m} \right) \right)$$
for multiple change points detection from noisy data.
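Returning to the slope estimate described earlier in this section (minimal penalty of shape $\alpha D_m$), a minimal sketch could fit a straight line to the empirical risk over the largest models and take $\widehat{\alpha}$ as minus its slope, then select with the penalty $2 \widehat{\alpha} D_m$. This is only one reading of "the slope of the graph": the procedure studied in this paper detects the dimension jump instead. The names `estimate_slope` and `slope_heuristics_select` and the cutoff `d_min` (what counts as "large enough") are assumptions of this sketch.

```python
def estimate_slope(dims, emp_risks, d_min):
    """Ordinary least-squares slope of the empirical risk against D_m,
    restricted to models with D_m >= d_min; returns alpha_hat = -slope."""
    pts = [(d, r) for d, r in zip(dims, emp_risks) if d >= d_min]
    if len(pts) < 2:
        raise ValueError("need at least two models with D_m >= d_min")
    k = len(pts)
    mean_d = sum(d for d, _ in pts) / k
    mean_r = sum(r for _, r in pts) / k
    sxx = sum((d - mean_d) ** 2 for d, _ in pts)
    if sxx == 0:
        raise ValueError("need at least two distinct dimensions")
    sxy = sum((d - mean_d) * (r - mean_r) for d, r in pts)
    return -sxy / sxx


def slope_heuristics_select(dims, emp_risks, d_min):
    """Select the index minimizing emp_risk + 2 * alpha_hat * D_m.

    Returns (selected_index, alpha_hat)."""
    alpha = estimate_slope(dims, emp_risks, d_min)
    crit = [r + 2.0 * alpha * d for d, r in zip(dims, emp_risks)]
    best = min(range(len(dims)), key=crit.__getitem__)
    return best, alpha


if __name__ == "__main__":
    dims = [1, 2, 4, 8, 16, 32, 64]
    # Synthetic empirical risks: bias 1/D plus a linear decrease of slope -0.01.
    risks = [1.0 / d + 2.0 - 0.01 * d for d in dims]
    print(slope_heuristics_select(dims, risks, d_min=16))
```

On the synthetic risks above, the residual bias of the large models makes the estimate slightly larger than the true slope 0.01, which illustrates why the cutoff matters.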
Let us now explain how we expect to generalize their heuristics to the non-Gaussian heteroscedastic case. First, group the models according to some complexity index $C_m$ (for instance their dimensions, or the approximate value of their resampling penalty suitably normalized): for $C \in \{1, \ldots, n^k\}$, define $S_C = \bigcup_{C_m = C} S_m$. Then, replace the model selection problem with the family $(S_m)_{m \in \mathcal{M}_n}$ by a "complexity selection problem", i.e. model selection with the family $(S_C)_{1 \le C \le n^k}$. We conjecture that this grouping of the models is sufficient to take into account the richness of $\mathcal{M}_n$ for the optimal calibration of the penalty. A theoretical justification of this point may rely on the extension of our results to any kind of model, not only histogram ones (each $S_C$ is not a histogram model, since it is not even a vector space). As already mentioned, this remains an interesting open problem.

Appendix A. Computational aspects of the slope heuristics

With Algorithm 1 (possibly combined with resampling penalties for step 1), we have a completely data-driven and optimal model selection procedure. From the practical viewpoint, the last two problems may be steps 1 and 2. First, at step 1, how can we compute exactly $\widehat{m}(K)$ for every $K \in (0, +\infty)$, this latter set being uncountable? The answer is that the whole trajectory $(\widehat{m}(K))_{K \ge 0}$ can be described with a small number of parameters, which can be computed quickly. This point is the object of Sect. A.1. Second, at step 2, how can the jump of dimension be detected automatically in practice? In other words, how should

$\widehat{K}_{\min}$ be defined exactly, as a function of $(\widehat{m}(K))_{K \ge 0}$? We try to answer this question in Sect. A.2.

A.1 Computation of $(\widehat{m}(K))_{K \ge 0}$

For every model $m \in \mathcal{M}_n$, define
$$f(m) = P_n \gamma(\widehat{s}_m), \qquad g(m) = \mathrm{pen}_{\mathrm{shape}}(m)$$
and
$$\forall K \ge 0, \quad \widehat{m}(K) \in \arg\min_{m \in \mathcal{M}_n} \{f(m) + K g(m)\}.$$
Since the latter definition can be ambiguous, we choose any total ordering $\preceq$ on $\mathcal{M}_n$ such that $g$ is non-decreasing. Then, $\widehat{m}(K)$ is defined as the smallest element of
$$E(K) := \arg\min_{m \in \mathcal{M}_n} \{f(m) + K g(m)\}$$
for $\preceq$. The main reason why the whole trajectory $(\widehat{m}(K))_{K \ge 0}$ can be computed efficiently is its very particular shape. Indeed, the results below (mostly Lemma 4) show that $K \mapsto \widehat{m}(K)$ is piecewise constant, and non-increasing for $\preceq$. We then have
$$\forall i \in \{0, \ldots, i_{\max}\}, \quad \forall K \in [K_i, K_{i+1}), \quad \widehat{m}(K) = m_i,$$
and the whole trajectory $(\widehat{m}(K))_{K \ge 0}$ can be represented by:
- a non-negative integer $i_{\max} \le \mathrm{Card}(\mathcal{M}_n) - 1$ (the number of jumps),
- an increasing sequence of non-negative reals $(K_i)_{0 \le i \le i_{\max}+1}$ (the location of the jumps, with $K_0 = 0$ and $K_{i_{\max}+1} = +\infty$),
- a non-increasing sequence of models $(m_i)_{0 \le i \le i_{\max}}$.

We are now in a position to give an efficient algorithm for step 1 of Algorithm 1. The point is that the $K_i$ and the $m_i$ can be computed sequentially, each step having a complexity proportional to $\mathrm{Card}(\mathcal{M}_n)$. This means that its overall complexity is lower than a constant times $i_{\max}\, \mathrm{Card}(\mathcal{M}_n) \le \mathrm{Card}(\mathcal{M}_n)^2$ (and the latter bound is quite pessimistic in general). Notice also that Algorithm 2 can be stopped earlier if the only goal is to identify $\widehat{K}_{\min}$ (which may be done only with the first $m_i$).

Algorithm 2 (Step 1 of Algorithm 1) For every $m \in \mathcal{M}_n$, define $f(m) = P_n \gamma(\widehat{s}_m)$ and $g(m) = \mathrm{pen}_{\mathrm{shape}}(m)$. Choose any total ordering $\preceq$ on $\mathcal{M}_n$ such that $g$ is non-decreasing.

Init: $K_0 = 0$, $m_0 = \arg\min_{m \in \mathcal{M}_n} \{f(m)\}$ (when this minimum is attained several times, $m_0$ is defined as the smallest one for $\preceq$).

Step $i$, $i \ge 1$: Let $G(m_{i-1}) := \{m \in \mathcal{M}_n \text{ s.t. } f(m) > f(m_{i-1}) \text{ and } g(m) < g(m_{i-1})\}$.

If $G(m_{i-1}) = \emptyset$, then put $K_i = +\infty$, $i_{\max} = i - 1$ and stop. Otherwise, define
$$K_i := \inf \left\{ \frac{f(m) - f(m_{i-1})}{g(m_{i-1}) - g(m)} \ \text{ s.t. } m \in G(m_{i-1}) \right\} \tag{13}$$
and $m_i$ the smallest element (for $\preceq$) of
$$F_i := \arg\min_{m \in G(m_{i-1})} \left\{ \frac{f(m) - f(m_{i-1})}{g(m_{i-1}) - g(m)} \right\}.$$

The validity of Algorithm 2 is justified by the following proposition, showing that these $K_i$ and $m_i$ are the same as the ones describing $(\widehat{m}(K))_{K \ge 0}$.

Proposition 3 If $\mathcal{M}_n$ is finite, Algorithm 2 terminates and $i_{\max} \le \mathrm{Card}(\mathcal{M}_n) - 1$. Using the notations of Algorithm 2, and defining $\widehat{m}(K)$ as the smallest element (for $\preceq$) of
$$E(K) := \arg\min_{m \in \mathcal{M}_n} \{f(m) + K g(m)\},$$
$(K_i)_{0 \le i \le i_{\max}+1}$ is increasing and $\forall i \in \{0, \ldots, i_{\max} - 1\}$, $\forall K \in [K_i, K_{i+1})$, $\widehat{m}(K) = m_i$.

It is proven in Sect. A.3.

A.2 Definition of $\widehat{K}_{\min}$

We now come to the question of defining $\widehat{K}_{\min}$ as a function of $(\widehat{m}(K))_{K > 0}$. As we have mentioned in Sect. [...], it corresponds to a "dimension jump", which should be observable since the whole trajectory of $(D_{\widehat{m}(K)})_{K \ge 0}$ is known.

As an illustration of this question, we represented on Fig. 1 $D_{\widehat{m}(K)}$ as a function of $K$, for two simulated samples. On the left (a), the dimension jump is quite clear, and we expect a formal definition of $\widehat{K}_{\min}$ to find this jump. The same picture holds for approximately 85% of the data sets. On the right (b), there seem to be several jumps, and a proper definition of $\widehat{K}_{\min}$ is problematic. What is sure is the necessity to find some automatic choice for $\widehat{K}_{\min}$, that is, defining it properly.

We now propose two definitions that seem reasonable to us. For the first one, choose a threshold $D_{\mathrm{reas.}}$, of order $n/(\ln(n))$, corresponding to the largest "reasonable" dimension for the selected model. Then, define
$$\widehat{K}_{\min} := \inf \left\{ K > 0 \ \text{ s.t. } D_{\widehat{m}(K)} \le D_{\mathrm{reas.}} \right\}.$$
With this definition, one can stop Algorithm 2 as soon as the threshold is reached. However, $\widehat{K}_{\min}$ may depend strongly on the choice of the threshold, which may not be quite obvious in the non-asymptotic situation (where $n/\ln(n)$ is not so far from $n$).

Our second idea is that $\widehat{K}_{\min}$ should match with the largest dimension jump, i.e.
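$$\widehat{K}_{\min} := K_{i_{\mathrm{max.jump}}} \quad \text{with} \quad i_{\mathrm{max.jump}} = \arg\max_{i \in \{0, \ldots, i_{\max} - 1\}} \left\{ \left| D_{m_{i+1}} - D_{m_i} \right| \right\}.$$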
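The path computation of Algorithm 2 and the two candidate definitions of $\widehat{K}_{\min}$ can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: models are represented by list indices, the total ordering is taken to be "by $g$, then by index", all function names are hypothetical, and the largest jump from $m_i$ to $m_{i+1}$ is located at $K_{i+1}$ (the value of $K$ at which the selected model actually changes), which is one reading of the index convention.

```python
import math


def penalty_path(f, g):
    """Breakpoints K_i and models m_i of the map K -> m_hat(K) (Algorithm 2).

    f[m]: empirical risk of model m; g[m]: penalty shape of model m.
    Returns (K, path) with K = [K_0, ..., K_{i_max+1}], path = [m_0, ..., m_{i_max}].
    """
    models = range(len(f))
    order = lambda m: (g[m], m)  # a total order for which g is non-decreasing
    cur = min(models, key=lambda m: (f[m], order(m)))  # m_0
    K, path = [0.0], [cur]
    while True:
        G = [m for m in models if f[m] > f[cur] and g[m] < g[cur]]
        if not G:
            K.append(math.inf)  # K_{i_max+1} = +infinity
            return K, path
        ratio = lambda m: (f[m] - f[cur]) / (g[cur] - g[m])
        nxt = min(G, key=lambda m: (ratio(m), order(m)))
        K.append(ratio(nxt))  # equation (13)
        path.append(nxt)
        cur = nxt


def k_min_threshold(K, path, dim, d_reas):
    """inf{K > 0 : D_{m_hat(K)} <= d_reas}, read off the piecewise-constant path."""
    for K_i, m in zip(K, path):
        if dim[m] <= d_reas:
            return K_i
    return math.inf


def k_min_max_jump(K, path, dim):
    """Value of K at which the largest drop of dimension occurs (assumed convention)."""
    if len(path) < 2:
        raise ValueError("the path has no jump")
    drops = [dim[path[i]] - dim[path[i + 1]] for i in range(len(path) - 1)]
    i_jump = max(range(len(drops)), key=drops.__getitem__)
    return K[i_jump + 1]


def model_at(K, path, k):
    """m_hat(k): the model m_i such that K_i <= k < K_{i+1}."""
    i = max(j for j in range(len(path)) if K[j] <= k)
    return path[i]


if __name__ == "__main__":
    dim = [1, 2, 4, 8, 16, 32]
    g = [float(d) for d in dim]
    f = [1.0 / d - d / 64.0 + 1.0 for d in dim]  # toy risks: bias 1/D, slope -1/64
    K, path = penalty_path(f, g)
    k_min = k_min_max_jump(K, path, dim)
    print(K, [dim[m] for m in path], dim[model_at(K, path, 2 * k_min)])
```

Each pass of the loop scans the whole collection once, in line with the $i_{\max}\,\mathrm{Card}(\mathcal{M}_n)$ complexity bound of Sect. A.1; the final selection uses `model_at` with the factor $2 \widehat{K}_{\min}$.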

[Figure 1: two panels showing the dimension of $\widehat{m}(K)$ as a function of $K$, with the "maximal jump" and "reasonable dimension" choices marked. (a) One clear jump. (b) Two jumps, two values for $\widehat{K}_{\min}$.]

Figure 1: $D_{\widehat{m}(K)}$ as a function of $K$ for two different samples. Data are simulated with $X \sim \mathcal{U}([0,1])$, $\epsilon \sim \mathcal{N}(0,1)$, $s(x) = \sin(\pi x)$, $\sigma \equiv 1$, $n = 200$. $(S_m)_{m \in \mathcal{M}_n}$ is the collection of regular histogram models with dimension between 1 and $n/(\ln(n))$. $\mathrm{pen}_{\mathrm{shape}}(m) = D_m$. "Reasonable dimensions" are below $n/(2\ln(n)) \approx 19$. See (Arlot, 2008b) for details (experiment S1).

Although this definition may seem less arbitrary than the previous one, it still depends strongly on $\mathcal{M}_n$, which may not contain so many large models for computational reasons. In order to ensure that there is a clear jump, an idea may be to add a few models of dimension $\approx n/2$, so that at least one has a well-defined empirical risk minimizer $\widehat{s}_m$. In practice, several huge models with a well-defined $\widehat{s}_m$ may be necessary, in order to decrease the variability of $\widehat{K}_{\min}$. This modification has the drawback of being quite arbitrary.

As an illustration, we compared the two definitions above ("reasonable dimension" vs. "maximal jump") on one thousand simulated samples similar to the one of Fig. 1. Three cases occurred:
1. The values of $\widehat{K}_{\min}$ do not differ (about 85% of the data sets; this is the (a) situation).
2. The values of $\widehat{K}_{\min}$ differ, but the selected models $\widehat{m}(2\widehat{K}_{\min})$ are still equal (about 8.5% of the data sets).
3. The finally selected models are different (about 6.5% of the data sets; this is the (b) situation).

Hence, in this non-asymptotic framework, the formal definition of $\widehat{K}_{\min}$ does not matter in general, but stays problematic in a few cases.

In terms of prediction error, we have compared the two methods by estimating the constant $C_{\mathrm{or}}$ that would appear in some oracle inequality:
$$C_{\mathrm{or}} := \frac{\mathbb{E}\left[\ell(s, \widehat{s}_{\widehat{m}})\right]}{\mathbb{E}\left[\inf_{m \in \mathcal{M}_n} \{\ell(s, \widehat{s}_m)\}\right]}.$$

With the "reasonable dimension" definition, $C_{\mathrm{or}} \approx$ [...]. With the "maximal jump" definition, $C_{\mathrm{or}} \approx$ [...]. As a comparison, Mallows' $C_p$ (with a classical estimator of the variance $\sigma^2$) has a performance of $C_{\mathrm{or}} \approx 1.93$ on the same data. For the three procedures, the standard deviation of the estimator of $C_{\mathrm{or}}$ is about [...]. See Chap. 4 of (Arlot, 2007) for more details.

This preliminary simulation study shows that Algorithm 1 works efficiently (it is competitive with Mallows' $C_p$ in a situation where this one is also optimal). It also suggests that the "reasonable dimension" definition may be better, but without very convincing evidence. In order to make the choice of $\widehat{K}_{\min}$ as automatic as possible, we suggest using the two methods simultaneously. When the selected models are not the same, send a warning to the final user, advising them to look at the curve $K \mapsto D_{\widehat{m}(K)}$ themselves. Otherwise, stay confident in the automatic choice of $\widehat{m}(2\widehat{K}_{\min})$.

A.3 Proof of Prop. 3

First of all, since $\mathcal{M}_n$ is finite, the infimum in (13) is attained as soon as $G(m_{i-1}) \ne \emptyset$, so that $m_i$ is well defined for every $i \le i_{\max}$. Moreover, by construction, $g(m_i)$ decreases with $i$, so that all the $m_i \in \mathcal{M}_n$ are distinct. Hence, Algorithm 2 terminates and $i_{\max} + 1 \le \mathrm{Card}(\mathcal{M}_n)$. We now prove by induction the following property for every $i \in \{0, \ldots, i_{\max}\}$:
$$\mathcal{P}_i: \quad K_i < K_{i+1} \quad \text{and} \quad \forall K \in [K_i, K_{i+1}), \ \widehat{m}(K) = m_i.$$
Notice also that $K_i$ can always be defined by (13) with the convention $\inf \emptyset = +\infty$.

$\mathcal{P}_0$ holds true

By definition of $K_1$, it is clear that $K_1 > 0$ (it may be equal to $+\infty$ if $G(m_0) = \emptyset$). For $K = K_0 = 0$, the definition of $m_0$ is the one of $\widehat{m}(0)$, so that $\widehat{m}(K) = m_0$. For $K \in (0, K_1)$, Lemma 4 shows that either $\widehat{m}(K) = \widehat{m}(0) = m_0$ or $\widehat{m}(K) \in G(m_0)$. In the latter case, by definition of $K_1$,
$$\frac{f(\widehat{m}(K)) - f(m_0)}{g(m_0) - g(\widehat{m}(K))} \ge K_1 > K$$
so that
$$f(\widehat{m}(K)) + K g(\widehat{m}(K)) > f(m_0) + K g(m_0),$$
which contradicts the definition of $\widehat{m}(K)$. Hence, $\mathcal{P}_0$ holds true.

$\mathcal{P}_i \Rightarrow \mathcal{P}_{i+1}$ for every $i \in \{0, \ldots, i_{\max} - 1\}$

Assume that $\mathcal{P}_i$ holds true. First, we have to prove that $K_{i+2} > K_{i+1}$.
Since $K_{i_{\max}+1} = +\infty$, this is clear if $i = i_{\max} - 1$. Otherwise, $K_{i+2} < +\infty$ and $m_{i+2}$ exists. Then, by definition of $m_{i+2}$ and $K_{i+2}$ (resp. $m_{i+1}$ and $K_{i+1}$), we have
$$f(m_{i+2}) - f(m_{i+1}) = K_{i+2} \left( g(m_{i+1}) - g(m_{i+2}) \right) \tag{14}$$
$$f(m_{i+1}) - f(m_i) = K_{i+1} \left( g(m_i) - g(m_{i+1}) \right). \tag{15}$$
