arXiv: v2 [math.ST] 20 Mar 2008

Journal of Machine Learning Research 0 (0000) 0. Submitted 3/08; Published 0/00

Data-driven calibration of penalties for least-squares regression

Sylvain Arlot (sylvain.arlot@math.u-psud.fr)
Univ Paris-Sud, UMR 8628, Laboratoire de Mathematiques, Orsay, F; CNRS, Orsay, F; INRIA-Futurs, Projet Select

Pascal Massart (pascal.massart@math.u-psud.fr)
Univ Paris-Sud, UMR 8628, Laboratoire de Mathematiques, Orsay, F; CNRS, Orsay, F; INRIA-Futurs, Projet Select

Editor: Editor

Abstract

Penalization procedures often suffer from their dependence on multiplying factors, whose optimal values are either unknown or hard to estimate from the data. In this paper, we propose a completely data-driven calibration method for this parameter in the least-squares regression framework, without assuming a particular shape for the penalty. Our algorithm relies on the concept of minimal penalty, which has been introduced in a recent paper by Birgé and Massart (2007) in the context of penalized least squares for Gaussian homoscedastic regression. Interestingly, the minimal penalty can be evaluated from the data themselves, which leads to a data-driven estimation of an optimal penalty that one can use in practice. Unfortunately their approach heavily relies on the homoscedastic Gaussian nature of the stochastic framework that they consider. Our purpose in this paper is twofold: stating a more general heuristics to design a data-driven penalty (the slope heuristics) and proving that it works for penalized least squares random design regression, even when the data is heteroscedastic. For some technical reasons which are explained in the paper, we could prove some precise mathematical results only for histogram bin-width selection. Even though we could not work at the level of generality that we were expecting, this is at least a first step towards further results. Our mathematical results hold in some specific framework, but the approach and the method that we use are indeed general.
Keywords: Data-driven calibration, Non-parametric regression, Model selection by penalization, Heteroscedastic data, Histogram

© 0000 Sylvain Arlot and Pascal Massart.

1. Introduction

Model selection has received much interest in the last decades. A very common approach is penalization. In a nutshell, it chooses the model which minimizes the sum of the empirical risk (how well the algorithm fits the data) and some complexity measure of the model (called the penalty). This is the case of FPE (Akaike, 1970), AIC (Akaike, 1973) and Mallows' $C_p$ or

$C_L$ (Mallows, 1973). Many other penalization procedures have been proposed since, among which we mention Rademacher complexities (Koltchinskii, 2001; Bartlett et al., 2002), local Rademacher complexities (Bartlett et al., 2005; Koltchinskii, 2006), bootstrap, resampling and $V$-fold penalties (Efron, 1983; Arlot, 2008b,a), to name but a few.

In this article, we consider the question of the efficiency of such penalization procedures, i.e. that their quadratic risk is asymptotically equivalent to the risk of the oracle. This property is often called asymptotic optimality. It does not mean that the procedure finds out a true model (which may not even exist), which would be the consistency problem. A procedure is efficient when it makes the best possible use of the data in terms of the quadratic risk of the final estimator.

There is a huge amount of literature about this question. Consider first Mallows' $C_p$ and Akaike's FPE and AIC. Their asymptotic optimality has been proven by Shibata (1981) for Gaussian errors, Li (1987) under suitable moment assumptions on the errors, and Polyak and Tsybakov (1990) for sharper moment conditions in the Fourier case. Then, non-asymptotic oracle inequalities (with a constant $C > 1$) have been proven by Barron et al. (1999) and Birgé and Massart (2001) in the Gaussian case, and Baraud (2000, 2002) under some moment conditions on the errors. In the Gaussian case, non-asymptotic oracle inequalities with a constant $C_n$ which goes to 1 when $n$ goes to infinity have been obtained by Birgé and Massart (2007).

However, both AIC and Mallows' $C_p$ still have serious drawbacks from the practical viewpoint. Indeed, AIC relies on a strong asymptotic assumption, so that the optimal multiplying factor may be quite different from one for small sample sizes. This is why corrected versions of AIC have been proposed (Sugiura, 1978; Hurvich and Tsai, 1989). On the other hand, the optimal calibration of Mallows' $C_p$ requires the knowledge of the noise level $\sigma^2$, which is assumed to be constant.
With real data, one has to estimate $\sigma^2$ separately, but it is hard to do so independently of any model. In addition, it is quite unlikely that the best estimator of $\sigma^2$ automatically leads to the most efficient model selection procedure. One of the purposes of this article is to provide a data-dependent calibration rule which directly aims at the efficiency of the final procedure. Focusing directly on efficiency may improve significantly on the more classical plug-in method, in terms of the performance of the model selection procedure itself.

Actually, most of the penalization procedures have similar or even stronger drawbacks, often because of a gap between theoretical results and their practical use. For instance, there is a factor 2 between the (global) Rademacher complexities for which theoretical results have been proven, and the way they are used in practice (Lozano, 2000). Since this factor is unavoidable in some sense (Arlot (2007), Chap. 9), the optimal calibration of these penalties is a practical issue. The problem is tougher for local Rademacher complexities, since theoretical results are only valid with very large calibration constants (in particular the multiplying factor), and no one knows what their optimal values are. One of our goals is to address this question for such general-shape penalties (in particular data-dependent penalties), at least for the optimization of the multiplying factor.

There are not so many calibration algorithms available. Obviously, the most popular ones are cross-validation methods (Allen, 1974; Stone, 1974), in particular $V$-fold cross-validation (Geisser, 1975), notably because these are general-purpose methods, relying

on a heuristics likely to be widely valid. However, their computational cost may be too heavy, because they require performing $V$ times the entire model selection procedure for each candidate value of the constant to be calibrated. For penalties based on the dimension of the models (assumed to be vector spaces), such as Mallows' $C_p$, an alternative calibration procedure has been proposed by George and Foster (2000).

A completely different approach is the one of Birgé and Massart (2007), who have also considered dimensionality-based penalties. Since our purpose is to extend their approach to a much wider range of applications, let us recall briefly their main claims. In the Gaussian homoscedastic regression on a fixed-design framework, assume that each model is a finite-dimensional vector space. Then, consider the penalty $\operatorname{pen}(m) = K D_m$, where $D_m$ is the dimension of the model $m$ and $K > 0$ is a positive constant, to be calibrated. In several situations, it turns out that the optimal constant $K^{\star}$ (i.e. the one which leads to an asymptotically efficient procedure) is exactly twice the minimal constant $K_{\min}$ (defined as the one under which the ratio between the quadratic risk of the chosen estimator and the quadratic risk of the oracle goes to infinity with the sample size). In other words, the optimal penalty is twice the minimal penalty, which is called the slope heuristics by Birgé and Massart.

A crucial fact is that the minimal constant $K_{\min}$ can be estimated from the data, because very large models are selected if and only if $K < K_{\min}$. This leads them to the following strategy for choosing $K$ from the data. Define $\widehat{m}(K)$ as the model selected by $\operatorname{pen}(m) = K D_m$, as a function of $K$. First, compute $\widehat{K}_{\min}$ such that $D_{\widehat{m}(K)}$ is huge for $K < \widehat{K}_{\min}$ and reasonable when $K \geq \widehat{K}_{\min}$. Second, define $\widehat{m} := \widehat{m}(2\widehat{K}_{\min})$. Such a method has been successfully applied for multiple change points detection by Lebarbier (2005).

From the theoretical viewpoint, a crucial question to understand (and validate) this approach is the existence of a minimal penalty.
In other words, how much should we penalize at least? In the framework of Gaussian regression on a fixed-design, this question has been addressed by Birgé and Massart (2001, 2007) and Baraud et al. (2007) (the latter considering the unknown variance case). However, nothing is known for non-Gaussian or heteroscedastic data. One of our goals is thus to fill part of this gap in the theoretical understanding of penalization procedures.

In this paper, we use a similar link between minimal and optimal penalties, in order to calibrate any penalty (namely, the favorite penalty of the final user, including all the aforementioned penalties, and not necessarily dimensionality-based penalties), in a more general framework (e.g., we allow the noise to be heteroscedastic and non-Gaussian, which is much more realistic). This leads us to Algorithm 1, which is defined in Sect. 3.1 in the least-squares regression framework, and relies on a generalization of the slope heuristics.

We then tackle the theoretical validation of this algorithm, from the non-asymptotic viewpoint. By non-asymptotic, we mean in particular that the collection of models is allowed to depend on $n$. This is quite natural since it is common in practice to introduce more explanatory variables (for instance) when one has more observations. Considering models with a large number of parameters (e.g. of the order of a power of the sample size $n$) is also necessary to approximate functions belonging to a general approximation space. Thus, the non-asymptotic viewpoint allows us not to assume that the regression function can be described with a very small number of parameters.

First, we prove the existence of minimal penalties for heteroscedastic regression on a random-design (Thm. 1). Then, we prove in the same framework that twice the minimal penalty has some optimality properties (Thm. 2), which means that we have extended the so-called slope heuristics to heteroscedastic least-squares regression on a random-design. For proving such a result, we have to assume that each model is the vector space of piecewise constant functions on some partition of the feature space. This is quite a restriction, but we conjecture that it is mainly technical, and that the slope heuristics stays valid at least in the general least-squares regression framework. We provide some evidence for this by proving two key concentration inequalities without the restriction to histograms. Another argument supporting this conjecture is that several simulation studies have shown recently that the slope heuristics could be used in several frameworks: mixture models (Maugis and Michel, 2007), clustering (Baudry, 2007), spatial statistics (Verzelen, 2007), estimation of oil reserves (Lepez, 2002) and genomics (Villers, 2007). Our results do not give a formal proof for these applications of the slope heuristics (cf. Sect. 3.2 for instances of completely data-driven penalties for which we have proven rigorously that our algorithm is working). However, they are a first step towards such a result, by proving that it can be applied when the ideal penalty has a general shape.

This paper is organized as follows. We describe the framework and our main heuristics in Sect. 2. The resulting algorithm is defined in Sect. 3. Our main theoretical results are stated in Sect. 4. Appendix A is devoted to computational issues. All the proofs are given in Appendix B.

2. Framework

2.1 Least-squares regression

We observe some data $(X_i, Y_i) \in \mathcal{X} \times \mathbb{R}$, $1 \leq i \leq n$, i.i.d. with common law $P$. Our goal is to predict $Y$ given $X$, where $(X, Y) \sim P$ is independent from the data.
Denoting by $s$ the regression function, we can write
\[ Y_i = s(X_i) + \sigma(X_i)\,\epsilon_i \tag{1} \]
where $\sigma : \mathcal{X} \to \mathbb{R}$ is the heteroscedastic noise-level and the $\epsilon_i$ are i.i.d. centered noise terms, possibly dependent on $X_i$, but with mean 0 and variance 1 conditionally on $X_i$. Typically, the feature space $\mathcal{X}$ is a compact set of $\mathbb{R}^d$.

Given a predictor $t : \mathcal{X} \to \mathcal{Y}$, its quality is measured by the (quadratic) prediction loss
\[ \mathbb{E}_{(X,Y) \sim P}\left[ \gamma(t, (X,Y)) \right] =: P\gamma(t) \]
where $\gamma(t, (x,y)) = (t(x) - y)^2$ is the least-squares contrast. Then, the Bayes predictor (i.e. the minimizer of $P\gamma(t)$ over the set of all predictors) is the regression function $s$, and we define the excess loss as
\[ \ell(s,t) := P\gamma(t) - P\gamma(s) = \mathbb{E}_{(X,Y) \sim P}\left( t(X) - s(X) \right)^2 . \]
Given a particular set of predictors $S_m$ (called a model), we define the best predictor over $S_m$
\[ s_m := \arg\min_{t \in S_m} \left\{ P\gamma(t) \right\}, \]

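and its empirical counterpart
\[ \widehat{s}_m := \arg\min_{t \in S_m} \left\{ P_n\gamma(t) \right\} \]
(when it exists and is unique), where $P_n = n^{-1} \sum_{i=1}^{n} \delta_{(X_i, Y_i)}$. This estimator is the well-known empirical risk minimizer, also called least-squares estimator since $\gamma$ is the least-squares contrast.

2.2 Ideal model selection

We now assume that we have a family of models $(S_m)_{m \in \mathcal{M}_n}$, hence a family of estimators $(\widehat{s}_m)_{m \in \mathcal{M}_n}$ (via empirical risk minimization). We are looking for some data-dependent $\widehat{m} \in \mathcal{M}_n$ such that $\ell(s, \widehat{s}_{\widehat{m}})$ is as small as possible. This is the model selection problem. For instance, we would like to prove some oracle inequality of the form
\[ \ell(s, \widehat{s}_{\widehat{m}}) \leq C \inf_{m \in \mathcal{M}_n} \left\{ \ell(s, \widehat{s}_m) \right\} + R_n \]
in expectation or on an event of large probability, with $C$ close to 1 and $R_n = o(n^{-1})$.

General penalization procedures can be described as follows. Let $\operatorname{pen} : \mathcal{M}_n \to \mathbb{R}_+$ be some penalty function, possibly data-dependent. Then, define
\[ \widehat{m} \in \arg\min_{m \in \mathcal{M}_n} \left\{ \operatorname{crit}(m) \right\} \quad \text{with} \quad \operatorname{crit}(m) := P_n\gamma(\widehat{s}_m) + \operatorname{pen}(m). \tag{2} \]
Since the ideal criterion $\operatorname{crit}$ is the true prediction error $P\gamma(\widehat{s}_m)$, the ideal penalty is
\[ \operatorname{pen}_{\mathrm{id}}(m) := P\gamma(\widehat{s}_m) - P_n\gamma(\widehat{s}_m). \]
Of course, this quantity is unknown because it depends on the true distribution $P$. A natural idea is to choose $\operatorname{pen}$ as close as possible to $\operatorname{pen}_{\mathrm{id}}$ for every model $m \in \mathcal{M}_n$. We show below, in a very general setting, that when $\operatorname{pen}$ estimates well the ideal penalty $\operatorname{pen}_{\mathrm{id}}$, $\widehat{m}$ satisfies an oracle inequality with a leading constant $C$ close to 1.

By definition of $\widehat{m}$,
\[ \forall m \in \mathcal{M}_n, \quad P_n\gamma(\widehat{s}_{\widehat{m}}) \leq P_n\gamma(\widehat{s}_m) + \operatorname{pen}(m) - \operatorname{pen}(\widehat{m}). \]
For every $m \in \mathcal{M}_n$, we define
\[ p_1(m) = P\left( \gamma(\widehat{s}_m) - \gamma(s_m) \right), \qquad p_2(m) = P_n\left( \gamma(s_m) - \gamma(\widehat{s}_m) \right), \qquad \delta(m) = (P_n - P)\left( \gamma(s_m) \right) \]
so that
\[ \ell(s, \widehat{s}_m) = P_n\gamma(\widehat{s}_m) + p_1(m) + p_2(m) - \delta(m) - P\gamma(s). \]
We then have, for every $m \in \mathcal{M}_n$,
\[ \ell(s, \widehat{s}_{\widehat{m}}) + (\operatorname{pen} - \operatorname{pen}_{\mathrm{id}})(\widehat{m}) \leq \ell(s, \widehat{s}_m) + (\operatorname{pen} - \operatorname{pen}_{\mathrm{id}})(m). \tag{3} \]
So, in order to derive an oracle inequality from (3), we have to show that for every $m \in \mathcal{M}_n$, $\operatorname{pen}(m)$ is close to $\operatorname{pen}_{\mathrm{id}}(m)$.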
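A short derivation, filled in here from the definitions of Sect. 2.2 only, may help to see where (3) comes from. The first display (the decomposition of the ideal penalty) is an intermediate step that we make explicit for the reader; it is obtained by inserting $\pm P\gamma(s_m)$ and $\pm P_n\gamma(s_m)$.

```latex
\begin{align*}
\operatorname{pen}_{\mathrm{id}}(m)
  &= P\gamma(\widehat{s}_m) - P_n\gamma(\widehat{s}_m) \\
  &= P\bigl(\gamma(\widehat{s}_m) - \gamma(s_m)\bigr)
     - (P_n - P)\bigl(\gamma(s_m)\bigr)
     + P_n\bigl(\gamma(s_m) - \gamma(\widehat{s}_m)\bigr) \\
  &= p_1(m) - \delta(m) + p_2(m), \\[4pt]
\ell(s,\widehat{s}_m)
  &= P\gamma(\widehat{s}_m) - P\gamma(s)
   = P_n\gamma(\widehat{s}_m) + \operatorname{pen}_{\mathrm{id}}(m) - P\gamma(s). \\[4pt]
\intertext{Hence $P_n\gamma(\widehat{s}_m) = \ell(s,\widehat{s}_m)
  - \operatorname{pen}_{\mathrm{id}}(m) + P\gamma(s)$ for every $m$
  (including $m = \widehat{m}$). Substituting into
  $P_n\gamma(\widehat{s}_{\widehat{m}}) + \operatorname{pen}(\widehat{m})
  \leq P_n\gamma(\widehat{s}_m) + \operatorname{pen}(m)$, the term
  $P\gamma(s)$ cancels and we get}
\ell(s,\widehat{s}_{\widehat{m}})
  + (\operatorname{pen} - \operatorname{pen}_{\mathrm{id}})(\widehat{m})
  &\leq \ell(s,\widehat{s}_m)
  + (\operatorname{pen} - \operatorname{pen}_{\mathrm{id}})(m).
\end{align*}
```

In particular, if $\operatorname{pen} = \operatorname{pen}_{\mathrm{id}}$ exactly, both correction terms vanish and $\widehat{m}$ attains the infimum of $\ell(s, \widehat{s}_m)$ over $\mathcal{M}_n$.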

2.3 The slope heuristics

When the penalty $\operatorname{pen}$ is too large, the left-hand side of (3) stays larger than $\ell(s, \widehat{s}_{\widehat{m}})$ so that we can still obtain an oracle inequality (possibly with a large constant $C$). On the contrary, when $\operatorname{pen}$ is too small, the left-hand side of (3) can become negligible in front of $\ell(s, \widehat{s}_{\widehat{m}})$ (which makes $C$ explode) or worse can be nonpositive (so that we can no longer derive an oracle inequality from (3)). We shall see in the following that this corresponds to the existence of a minimal penalty.

Consider first the case $\operatorname{pen}(m) = p_2(m)$ in (2). Then, $\mathbb{E}[\operatorname{crit}(m)] = \mathbb{E}[P_n\gamma(s_m)] = P\gamma(s_m)$, so that $\widehat{m}$ tends to be the model with the smallest bias, hence the most complex one. As a consequence, the risk of $\widehat{s}_{\widehat{m}}$ is very large. When $\operatorname{pen}(m) = K p_2(m)$, if $K < 1$, $\operatorname{crit}(m)$ is a decreasing function of the complexity of $m$, so that $\widehat{m}$ is still one of the most complex models. On the contrary, when $K > 1$, $\operatorname{crit}(m)$ starts to increase with the complexity of $m$ (at least for the largest models), so that $\widehat{m}$ has a smaller complexity. This intuition supports the conjecture that the minimal amount of penalty required for the model selection procedure to work may be $p_2(m)$.

In several situations (such as the framework of Sect. 4.1, as we shall prove in the following), it turns out that $\forall m \in \mathcal{M}_n$, $p_1(m) \approx p_2(m)$. As a consequence, the ideal penalty $\operatorname{pen}_{\mathrm{id}}(m) \approx p_1(m) + p_2(m)$ is close to $2 p_2(m)$. On the other hand, $p_2(m)$ is actually a minimal penalty. So, we deduce that the optimal penalty is close to twice the minimal penalty:
\[ \operatorname{pen}_{\mathrm{id}}(m) \approx 2 \operatorname{pen}_{\min}(m). \]
This is the so-called slope heuristics, which was first introduced by Birgé and Massart (2007) in a Gaussian setting.

The practical interest of this heuristics is that the minimal penalty can be estimated from the data. Indeed, when the penalty is too small, the selected model $\widehat{m}$ is among the most complex ones. On the contrary, when the penalty is larger than the minimal one, the complexity of $\widehat{m}$ should be much smaller. This leads to the algorithm described in the next section.

3. A data-driven calibration algorithm

We are now in position to define a data-driven calibration algorithm for penalization procedures. It generalizes a method proposed by Birgé and Massart (2007) and implemented by Lebarbier (2005).

3.1 The general algorithm

Assume that we know the shape $\operatorname{pen}_{\mathrm{shape}} : \mathcal{M}_n \to \mathbb{R}_+$ of the ideal penalty (because of some prior knowledge, or because we have been able to estimate it first, see Sect. 3.2). This means that the penalty $K^{\star} \operatorname{pen}_{\mathrm{shape}}$ provides an approximately optimal procedure, for some unknown constant $K^{\star} > 0$. Our goal is to find some $\widehat{K}$ such that $\widehat{K} \operatorname{pen}_{\mathrm{shape}}$ is approximately optimal.

We also assume that we know some complexity measure $D_m$ for each model $m \in \mathcal{M}_n$. Typically, when the models are finite-dimensional vector spaces, $D_m$ is the dimension of $S_m$. According to the slope heuristics, detailed in Sect. 2.3, the following algorithm provides an optimal calibration of the penalty $\operatorname{pen}_{\mathrm{shape}}$.

Algorithm 1 (Data-driven penalization with slope heuristics)

1. Compute the selected model $\widehat{m}(K)$ as a function of $K > 0$:
\[ \widehat{m}(K) \in \arg\min_{m \in \mathcal{M}_n} \left\{ P_n\gamma(\widehat{s}_m) + K \operatorname{pen}_{\mathrm{shape}}(m) \right\}. \]
2. Find $\widehat{K}_{\min} > 0$ such that $D_{\widehat{m}(K)}$ is very large for $K < \widehat{K}_{\min}$ and reasonably small for $K > \widehat{K}_{\min}$.
3. Select the model $\widehat{m} = \widehat{m}(2\widehat{K}_{\min})$.

Computational aspects of Algorithm 1 and the accurate definition of $\widehat{K}_{\min}$ are discussed in App. A. In particular, once $P_n\gamma(\widehat{s}_m)$ and $\operatorname{pen}_{\mathrm{shape}}(m)$ are known for every $m \in \mathcal{M}_n$, the first step of this algorithm can be performed with a complexity proportional to $\operatorname{Card}(\mathcal{M}_n)^2$ (cf. Algorithm 2 and Prop. 3). This is a crucial point compared to cross-validation methods, in particular when performing empirical risk minimization is computationally heavy.

3.2 Shape of the penalty

For using Algorithm 1 in practice, it is necessary to know a priori, or at least to estimate, the optimal shape $\operatorname{pen}_{\mathrm{shape}}$ of the penalty. We now explain how this can be done in several different situations.

At first reading, one can have in mind the simple example $\operatorname{pen}_{\mathrm{shape}}(m) = D_m$. It is valid for homoscedastic least-squares regression on linear models, as shown by several papers mentioned in the introduction. Indeed, when $\operatorname{Card}(\mathcal{M}_n)$ is smaller than some power of $n$, it is well known that Mallows' $C_p$ penalty defined by $\operatorname{pen}(m) = 2\,\mathbb{E}[\sigma^2(X)]\, n^{-1} D_m$ is asymptotically optimal. For larger collections $\mathcal{M}_n$, more elaborate results (Birgé and Massart, 2001, 2007) have shown that a penalty proportional to $\ln(n)\,\mathbb{E}[\sigma^2(X)]\, n^{-1} D_m$ (depending on the size of $\mathcal{M}_n$) is asymptotically optimal. Algorithm 1 then provides an alternative to plugging an estimator of $\mathbb{E}[\sigma^2(X)]$ into the above penalties. We would like to underline two main advances with our approach.
First, we avoid the difficult task of estimating $\mathbb{E}[\sigma^2(X)]$, which generally relies on the existence of a large model without bias. Our algorithm provides a model-free estimation of the multiplying factor in front of the penalty. Second, there is absolutely no reason that the best estimator $\widehat{\sigma}^2$ of $\mathbb{E}[\sigma^2(X)]$ (in terms of bias or quadratic risk, for instance) leads to the most efficient model selection procedure. For instance, it is well known that underpenalization (i.e. underestimating the multiplicative factor) leads to very poor performances, whereas overpenalization is generally less costly. Then, one can expect that minimizing the probability of underestimation of $\mathbb{E}[\sigma^2(X)]$ may lead to better performances than minimizing the bias. Adding that there are certainly several other important factors in order to optimize the choice of $\widehat{\sigma}^2$, some of them unknown, the plug-in approach seems quite tricky.
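To fix ideas, the three steps of Algorithm 1 (Sect. 3.1) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the procedure of App. A: step 1 is evaluated on a finite grid of values of $K$ rather than along the exact path, and $\widehat{K}_{\min}$ is taken as the grid point right after the largest drop of the selected dimension, which is one possible reading of step 2. The function names and the toy inputs are hypothetical.

```python
def select_model(emp_risk, pen_shape, K):
    """Index of a minimizer of emp_risk[m] + K * pen_shape[m] (step 1)."""
    crit = [r + K * p for r, p in zip(emp_risk, pen_shape)]
    return min(range(len(crit)), key=crit.__getitem__)


def slope_heuristics(emp_risk, pen_shape, dims, k_grid):
    """Grid-based sketch of Algorithm 1.

    emp_risk[m]  : empirical risk P_n gamma(s_hat_m) of model m
    pen_shape[m] : penalty shape pen_shape(m)
    dims[m]      : complexity D_m of model m
    k_grid       : candidate values of K (assumption: a finite grid)
    Returns (k_min_hat, index of the selected model).
    """
    k_grid = sorted(k_grid)
    # Step 1: dimension of the selected model for every K on the grid.
    selected_dims = [dims[select_model(emp_risk, pen_shape, K)] for K in k_grid]
    # Step 2 (assumed convention): K_min_hat is the first grid point after
    # the largest drop of the selected dimension.
    drops = [selected_dims[i] - selected_dims[i + 1]
             for i in range(len(k_grid) - 1)]
    jump = max(range(len(drops)), key=drops.__getitem__)
    k_min_hat = k_grid[jump + 1]
    # Step 3: select the model with twice the estimated minimal constant.
    m_hat = select_model(emp_risk, pen_shape, 2.0 * k_min_hat)
    return k_min_hat, m_hat


# Toy, fully synthetic input: models of dimension 1..50 with
# emp_risk = (bias term D^-2) - 0.01 * D, so the true minimal constant is 0.01.
dims = list(range(1, 51))
emp_risk = [d ** -2 - 0.01 * d for d in dims]
pen_shape = [float(d) for d in dims]
k_grid = [0.001 * i for i in range(1, 31)]

k_min_hat, m_hat = slope_heuristics(emp_risk, pen_shape, dims, k_grid)
print(k_min_hat, dims[m_hat])
```

On this toy input the selected dimension stays at 50 as long as $K \leq 0.01$ and then drops sharply, which is the behaviour that step 2 is meant to detect. A finer grid locates the jump more precisely, at a cost proportional to the grid size times $\operatorname{Card}(\mathcal{M}_n)$.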

With Algorithm 1, we do not care about the bias or the quadratic risk of $2\widehat{K}_{\min}$ as an estimator of $2\,\mathbb{E}[\sigma^2(X)]\, n^{-1}$. Since we define $\widehat{K}_{\min}$ in terms of the output of the model selection procedure $\widehat{m}(K)$, we focus directly on the model selection problem. In particular, we guarantee that the selected model is not too large, which solves part of the underpenalization issue. In brief, we would like to emphasize that Algorithm 1 with $\operatorname{pen}_{\mathrm{shape}}(m) = D_m$ is quite different from a simple plug-in version of Mallows' $C_p$. It leads to a really data-dependent penalty, which may perform better in practice than the best deterministic penalty $K^{\star} D_m$.

In a more general framework, Algorithm 1 allows one to choose a different shape of penalty $\operatorname{pen}_{\mathrm{shape}}$. For instance, in the heteroscedastic least-squares regression framework of Sect. 2.1, the optimal penalty is no longer proportional to the dimension $D_m$ of the models. This can be shown from computations made by Arlot (2008b) when $S_m$ is assumed to be the vector space of piecewise constant functions on a partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$:
\[ \mathbb{E}\left[ \operatorname{pen}_{\mathrm{id}}(m) \right] = \mathbb{E}\left[ (P - P_n)\gamma(\widehat{s}_m) \right] \approx \frac{1}{n} \sum_{\lambda \in \Lambda_m} \mathbb{E}\left[ \sigma(X)^2 \mid X \in I_\lambda \right]. \tag{4} \]
A more accurate result can even be found in Chap. 4 of (Arlot, 2007), where an example of a model selection problem is given in which no penalty proportional to $D_m$ can be asymptotically optimal.

A first answer to this issue can be given when both the distribution of $X$ and the shape of the noise level $\sigma$ are known, which is simply to use (4) to compute $\operatorname{pen}_{\mathrm{shape}}$. This is of course unsatisfactory because one seldom has such prior knowledge in practice. Our suggestion in this situation is the use of resampling penalties (Efron, 1983; Arlot, 2008a), or $V$-fold penalties (Arlot, 2008b) which have a much smaller computational cost. Indeed, up to a multiplicative factor (which is automatically estimated by Algorithm 1), these penalties should estimate well $\mathbb{E}[\operatorname{pen}_{\mathrm{id}}(m)]$ in a general framework.
In particular, their asymptotic optimality has been proven in the heteroscedastic least-squares regression framework by Arlot (2008b,a), in the framework of Sect. 4.1, and several theoretical results support the conjecture of their validity much more generally.

3.3 The general prediction framework

In Sect. 2 and in the definition of Algorithm 1, we have restricted ourselves to the least-squares regression framework. This is actually not necessary at all to make Algorithm 1 well-defined, so that we can naturally extend it to the general prediction framework. More precisely, the $(X_i, Y_i)$ can only be assumed to belong to $\mathcal{X} \times \mathcal{Y}$ for some general $\mathcal{Y}$, and $\gamma : S \times (\mathcal{X} \times \mathcal{Y}) \to [0; +\infty)$ can be any contrast function. In particular, $\mathcal{Y} = \{0, 1\}$ leads to the binary classification problem, and a natural contrast function is the 0-1 loss $\gamma(t; (x,y)) = \mathbf{1}_{t(x) \neq y}$. In this case, the shape of the penalty $\operatorname{pen}_{\mathrm{shape}}$ can for instance be estimated with the global or local Rademacher complexities mentioned in the introduction, as well as several other classical penalties.

However, one can wonder whether the slope heuristics of Sect. 2.3, upon which Algorithm 1 relies, can be extended to this general framework. We do not have a complete answer to this question, but several pieces of preliminary evidence. First, in order to prove the

validity of the slope heuristics in the least-squares regression framework (with the theoretical results of Sect. 4), we use several concentration results which are valid in a very general setting, including binary classification. Even if the factor 2 (which comes from the closeness of $\mathbb{E}[p_1]$ and $\mathbb{E}[p_2]$, cf. Sect. 2.3) may not be universally valid, we conjecture that Algorithm 1 can be used in several settings outside the least-squares regression case. Second, as already mentioned at the end of the introduction, several empirical studies have shown that Algorithm 1 can be successfully applied to several problems, with several shapes for the penalty. A formal proof of this fact remains an interesting open problem, to our knowledge.

4. Theoretical results

Algorithm 1 mainly relies on the slope heuristics, which is developed in Sect. 2.3. The goal of this section is to provide a theoretical justification of this heuristics.

It is split into two main results. First, lower bounds on $D_{\widehat{m}}$ and the risk of $\widehat{s}_{\widehat{m}}$ when the penalty is smaller than $\operatorname{pen}_{\min}(m) := \mathbb{E}[p_2(m)]$ (Thm. 1). Second, an oracle inequality with constant almost one when $\operatorname{pen}(m) \approx 2\,\mathbb{E}[p_2(m)]$ (Thm. 2), relying on (3) and the comparison $p_1 \approx p_2$.

In order to prove these two theorems, we need two kinds of probabilistic results. First, $p_1$, $p_2$ and $\delta$ all concentrate around their expectations (which can be done in a quite general framework, at least for $p_2$ and $\delta$, see App. B.5). Second, $\mathbb{E}[p_1(m)] \approx \mathbb{E}[p_2(m)]$ for every $m \in \mathcal{M}_n$. The latter point is quite hard in general, so that we must make a structural assumption on the models. This is why, in this section, we restrict ourselves to the histogram case, assuming that for every $m \in \mathcal{M}_n$, $S_m$ is the set of piecewise constant functions on some fixed partition $(I_\lambda)_{\lambda \in \Lambda_m}$. We describe this framework in the next subsection.

Remember that we do not consider histograms as a final goal. We only make this assumption in order to prove some first theoretical results confirming that Algorithm 1 can be used in practical applications.
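Such theoretical results may also be quite interesting in order to understand better how to use this algorithm in practice.

4.1 Histograms

A model of histograms $S_m$ is then the set of piecewise constant functions (histograms) on some partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$. It is thus a vector space of dimension $D_m = \operatorname{Card}(\Lambda_m)$, spanned by the family $(\mathbf{1}_{I_\lambda})_{\lambda \in \Lambda_m}$. As this basis is orthogonal in $L^2(\mu)$ for any probability measure $\mu$ on $\mathcal{X}$, computations are quite easy. This is the only reason why we assume that each $S_m$ is a model of histograms in this section. In particular, we have:
\[ s_m = \sum_{\lambda \in \Lambda_m} \beta_\lambda \mathbf{1}_{I_\lambda} \quad \text{and} \quad \widehat{s}_m = \sum_{\lambda \in \Lambda_m} \widehat{\beta}_\lambda \mathbf{1}_{I_\lambda}, \]
where
\[ \beta_\lambda := \mathbb{E}_P\left[ Y \mid X \in I_\lambda \right], \qquad \widehat{\beta}_\lambda := \frac{1}{n \widehat{p}_\lambda} \sum_{X_i \in I_\lambda} Y_i, \qquad \widehat{p}_\lambda := P_n(X \in I_\lambda). \]
Remark that $\widehat{s}_m$ is uniquely defined if and only if each $I_\lambda$ contains at least one of the $X_i$. Otherwise, we consider that the model $m$ cannot be chosen.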
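The formulas of Sect. 4.1 translate directly into code. The sketch below computes the coefficients $\widehat{\beta}_\lambda$ (bin averages, since $n \widehat{p}_\lambda$ is the number of points in $I_\lambda$) and the empirical risk $P_n\gamma(\widehat{s}_m)$. For illustration only, it assumes that $\mathcal{X} = [0, 1]$ and that the partition is regular with `n_bins` cells; the function names are hypothetical. Following the remark above, a model with an empty cell is rejected (the function returns `None`).

```python
def bin_index(x, n_bins):
    """Cell of the regular partition of [0, 1] containing x (x = 1 goes to the last cell)."""
    return min(int(x * n_bins), n_bins - 1)


def fit_histogram(xs, ys, n_bins):
    """Least-squares histogram estimator on a regular partition of [0, 1].

    Returns the list of bin averages beta_hat, or None when some cell
    contains no data point (the model cannot be chosen).
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        lam = bin_index(x, n_bins)
        sums[lam] += y
        counts[lam] += 1
    if min(counts) == 0:
        return None
    # beta_hat_lambda = (1 / (n * p_hat_lambda)) * sum of Y_i over X_i in I_lambda
    return [total / count for total, count in zip(sums, counts)]


def empirical_risk(xs, ys, beta_hat):
    """Empirical least-squares risk P_n gamma(s_hat_m) of a fitted histogram."""
    n_bins = len(beta_hat)
    squared_errors = [(beta_hat[bin_index(x, n_bins)] - y) ** 2
                      for x, y in zip(xs, ys)]
    return sum(squared_errors) / len(xs)


# Tiny hand-made example with two cells.
xs = [0.1, 0.2, 0.6, 0.9]
ys = [1.0, 3.0, 2.0, 4.0]
beta_hat = fit_histogram(xs, ys, n_bins=2)
print(beta_hat, empirical_risk(xs, ys, beta_hat))
```

Computing these two quantities for every model of the collection gives the inputs $P_n\gamma(\widehat{s}_m)$ and $D_m$ required by Algorithm 1.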

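4.2 Main assumptions

For both our main results, we make the following assumptions. First, $(S_m)_{m \in \mathcal{M}_n}$ is a family of histogram models satisfying

(P1) Polynomial complexity of $\mathcal{M}_n$: $\operatorname{Card}(\mathcal{M}_n) \leq c_{\mathcal{M}}\, n^{\alpha_{\mathcal{M}}}$.
(P2) Richness of $\mathcal{M}_n$: $\exists\, m_0 \in \mathcal{M}_n$ s.t. $D_{m_0} \in [\sqrt{n}, c_{\mathrm{rich}} \sqrt{n}]$.

Assumption (P1) is quite classical when one aims at proving the asymptotic optimality of a model selection procedure (it is for instance implicitly assumed by Li (1987), in the homoscedastic fixed-design case).

For any penalty function $\operatorname{pen} : \mathcal{M}_n \to \mathbb{R}_+$, we define the following model selection procedure:
\[ \widehat{m} \in \arg\min_{m \in \mathcal{M}_n,\ \min_{\lambda \in \Lambda_m}\{\widehat{p}_\lambda\} > 0} \left\{ P_n\gamma(\widehat{s}_m) + \operatorname{pen}(m) \right\}. \tag{5} \]

Moreover, we assume that the data $(X_i, Y_i)_{1 \leq i \leq n}$ are i.i.d. and satisfy the following:

(Ab) The data is bounded: $\|Y_i\|_\infty \leq A < \infty$.
(An) Uniform lower-bound on the noise-level: $\sigma(X_i) \geq \sigma_{\min} > 0$ a.s.
(Ap$_u$) The bias decreases as a power of $D_m$: there exist $\beta_+ > 0$ and $C_+ > 0$ such that $\ell(s, s_m) \leq C_+ D_m^{-\beta_+}$.
(Ar$_\ell^X$) Lower regularity of the partitions for $\mathcal{L}(X)$: $D_m \min_{\lambda \in \Lambda_m} \{ P(X \in I_\lambda) \} \geq c_{r,\ell}^X$.

Further comments are made in the following about these assumptions, explaining in particular how to relax them.

4.3 Minimal penalties

Our first result is the existence of a minimal penalty.

Theorem 1 Make all the assumptions of Sect. 4.2. Let $K \in [0; 1)$, $L > 0$, and assume that there is an event of probability at least $1 - L n^{-2}$ on which
\[ \forall m \in \mathcal{M}_n, \quad 0 \leq \operatorname{pen}(m) \leq K\, \mathbb{E}\left[ P_n\left( \gamma(s_m) - \gamma(\widehat{s}_m) \right) \right]. \tag{6} \]
Then, if $\widehat{m}$ is defined by (5), there exist two constants $K_1$, $K_2$ such that, with probability at least $1 - K_1 n^{-2}$,
\[ D_{\widehat{m}} \geq K_2\, n \ln(n)^{-1}. \tag{7} \]
On the same event,
\[ \ell(s, \widehat{s}_{\widehat{m}}) \geq \ln(n) \inf_{m \in \mathcal{M}_n} \left\{ \ell(s, \widehat{s}_m) \right\}. \tag{8} \]
The constants $K_1$ and $K_2$ may depend on $K$, $L$ and constants in (P1), (P2), (Ab), (An), (Ap$_u$) and (Ar$_\ell^X$), but not on $n$.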

This theorem thus validates the first part of the heuristics of Sect. 2.3, proving that there is a minimal amount of penalization required, under which both the selected dimension $D_{\widehat{m}}$ and the quadratic risk of the final estimator $\ell(s, \widehat{s}_{\widehat{m}})$ are blowing up. This coupling is quite interesting, since the dimension $D_{\widehat{m}}$ is known in practice, contrary to $\ell(s, \widehat{s}_{\widehat{m}})$. It is then possible to detect from the data that the penalty is too small, as proposed in Algorithm 1.

The main interest of this result is its coupling with Thm. 2 below. However, Thm. 1 is also of interest in itself, since it helps to understand better the theoretical properties of penalization procedures. Indeed, it generalizes the results of Birgé and Massart (2007) on the existence of minimal penalties to heteroscedastic regression on a random design (even if we have to restrict to histogram models, as already explained). We then have a general formulation for the minimal penalty
\[ \operatorname{pen}_{\min}(m) := \mathbb{E}\left[ P_n\left( \gamma(s_m) - \gamma(\widehat{s}_m) \right) \right], \]
which includes situations where it is not proportional to the dimension $D_m$ of the models (cf. Sect. 3.2 and references therein).

In addition, assumptions (Ab) and (An) on the data are much weaker than the Gaussian homoscedastic assumption. They are also much more realistic, and an important point is that they can be strongly relaxed. Roughly, the boundedness of the data can be replaced by some conditions on the moments of the noise, and the uniform lower bound on the noise level is no longer necessary when $\sigma$ satisfies some mild regularity assumptions. We refer to (Arlot, 2008a) (in particular Sect. 4.3) for detailed statements of these assumptions, and explanations on how to adapt our proofs to these situations.

Finally, let us comment briefly on (Ap$_u$) and (Ar$_\ell^X$). The upper bound (Ap$_u$) on the bias occurs in most reasonable situations, for instance when $\mathcal{X} \subset \mathbb{R}^k$ is bounded, the partition $(I_\lambda)_{\lambda \in \Lambda_m}$ is regular and the regression function $s$ is $\alpha$-Hölderian for some $\alpha > 0$ ($\beta_+$ depending on $\alpha$ and $k$).
It ensures that large models have a significantly smaller bias than smaller ones (otherwise, the selected dimension would be allowed to be smaller with a significant probability). On the other hand, (Ar$_\ell^X$) is satisfied at least for almost regular histograms, when $X$ has a lower bounded density w.r.t. the Lebesgue measure on $\mathcal{X} \subset \mathbb{R}^k$.

The reason why we state Thm. 1 with a general formulation of (Ap$_u$) and (Ar$_\ell^X$) (instead of assuming that $s$ is $\alpha$-Hölderian and $X$ has a lower bounded density w.r.t. Leb, for instance) is to point out the generality of the minimal penalization phenomenon. It occurs as soon as the models are not too pathological. In particular, we do not make any assumption on the distribution of $X$ itself, but only that the models are not too badly chosen according to this distribution. Such a condition can be checked in practice if one has some prior knowledge on $\mathcal{L}(X)$, or if one has some unlabeled data (which is often the case).

4.4 Optimal penalties

Algorithm 1 relies on a link between the minimal penalty (pointed out by Thm. 1) and some optimal penalty. The following result is a formal proof of this link in our framework: penalties close to twice the minimal penalty satisfy an oracle inequality with a leading constant approximately equal to one.

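Theorem 2 Make all the assumptions of Sect. 4.2, and add the following:

(Ap) The bias decreases like a power of $D_m$: there exist $\beta_- \geq \beta_+ > 0$ and $C_+, C_- > 0$ such that
\[ C_- D_m^{-\beta_-} \leq \ell(s, s_m) \leq C_+ D_m^{-\beta_+}. \]

Let $\delta \in (0, 1)$, $L > 0$, and assume that there is an event of probability at least $1 - L n^{-2}$ on which, for every $m \in \mathcal{M}_n$,
\[ (2 - \delta)\, \mathbb{E}\left[ P_n\left( \gamma(s_m) - \gamma(\widehat{s}_m) \right) \right] \leq \operatorname{pen}(m) \leq (2 + \delta)\, \mathbb{E}\left[ P_n\left( \gamma(s_m) - \gamma(\widehat{s}_m) \right) \right]. \tag{9} \]
Then, if $\widehat{m}$ is defined by (5) and $0 < \eta < \min\{\beta_+; 1\}/2$, there exist a constant $K_3$ and a sequence $\epsilon_n$ converging to zero at infinity such that, with probability at least $1 - K_3 n^{-2}$,
\[ D_{\widehat{m}} \leq n^{1-\eta} \quad \text{and} \quad \ell(s, \widehat{s}_{\widehat{m}}) \leq \left( \frac{1+\delta}{1-\delta} + \epsilon_n \right) \inf_{m \in \mathcal{M}_n} \left\{ \ell(s, \widehat{s}_m) \right\}. \tag{10} \]
Moreover, we have the oracle inequality
\[ \mathbb{E}\left[ \ell(s, \widehat{s}_{\widehat{m}}) \right] \leq \left( \frac{1+\delta}{1-\delta} + \epsilon_n \right) \mathbb{E}\left[ \inf_{m \in \mathcal{M}_n} \left\{ \ell(s, \widehat{s}_m) \right\} \right] + \frac{A^2 K_3}{n^2}. \tag{11} \]
The constant $K_3$ may depend on $L$, $\delta$, $\eta$ and the constants in (P1), (P2), (Ab), (An), (Ap) and (Ar$_\ell^X$), but not on $n$. The small term $\epsilon_n$ is smaller than $\ln(n)^{-1/5}$; it can also be taken smaller than $n^{-\delta}$ for any $\delta \in (0; \delta_0(\beta_-, \beta_+))$ at the price of enlarging $K_3$.

This theorem shows that twice the minimal penalty $\operatorname{pen}_{\min}$ pointed out by Thm. 1 satisfies an oracle inequality with a leading constant almost equal to one. It even stays valid when the penalty is only close to twice the minimal one, which means in particular that one can estimate the shape of the minimal penalty by resampling for instance (see Sect. 3.2).

The rationale behind this theorem is that the ideal penalty $\operatorname{pen}_{\mathrm{id}}(m)$ is close to its expectation, which is itself close to $2\,\mathbb{E}[P_n(\gamma(s_m) - \gamma(\widehat{s}_m))]$. Then, (3) directly implies an oracle inequality like (10), hence (11). In other words, we have proven the second part of the slope heuristics of Sect. 2.3.

Actually, Thm. 2 above is a corollary of a more general result (Thm. 5), that we state in App. B.2. In particular, if
\[ \operatorname{pen}(m) \approx K\, \mathbb{E}\left[ P_n\left( \gamma(s_m) - \gamma(\widehat{s}_m) \right) \right] \tag{12} \]
instead of (9), we can prove under the same assumptions that the same oracle inequality holds with a large probability, with a leading constant $C(K) + \epsilon_n$ instead of almost one. When $K \in (1, 2]$, we have $C(K) = (K - 1)^{-1}$, and when $K > 2$, $C(K) = K - 1$.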
This means that for every $K > 1$, the penalty defined by (12) is efficient, up to a multiplicative constant. This is well known in the homoscedastic case (Birgé and Massart, 2001; Baraud, 2000, 2002), but new in the heteroscedastic one. The most important consequences of this result follow from its combination with Thm. 1. We detail them in the next subsection.

Let us first comment on the additional

assumption (Ap), i.e. the lower bound on the bias. It means that $s$ is not too well approximated by the models $S_m$, which may seem surprising. Notice that it is classical to assume that $\ell(s, s_m) > 0$ for every $m \in \mathcal{M}_n$ when proving the asymptotic optimality of Mallows' $C_p$ (cf. Shibata (1981), Li (1987) and Birgé and Massart (2007)). Moreover, the stronger assumption (Ap) has already been made by Stone (1985) and Burman (2002) in the density estimation framework, for the same technical reasons as ours.

As detailed in (Arlot, 2008a), where a similar technique is used to derive an oracle inequality, when the lower bound in (Ap) is no longer assumed, (10) holds with two modifications in its right-hand side: the inf is restricted to models of dimension larger than $\ln(n)^{\gamma_1}$, and there is a remainder term $\ln(n)^{\gamma_2} n^{-1}$ (where $\gamma_1$ and $\gamma_2$ are numerical constants). This is essentially the same as (10), unless there is a model of small dimension with a very small bias, and the lower bound in (Ap) is sufficient to ensure that this does not happen. Notice that if there is such a very small model very close to $s$, it is hopeless to obtain an oracle inequality with a penalty which estimates $\mathrm{pen}_{\mathrm{id}}$, simply because deviations of $\mathrm{pen}_{\mathrm{id}}$ around its expectation would be much larger than the excess loss of the oracle. In such a situation, BIC-like methods are more appropriate.

Another argument in favour of (Ap) is that it is not too strong, because it is at least satisfied in the following case: $(I_\lambda)_{\lambda \in \Lambda_m}$ is regular, $X$ has a lower-bounded density w.r.t. the Lebesgue measure on $\mathcal{X} \subset \mathbb{R}^k$, and $s$ is non-constant and $\alpha$-Hölderian (w.r.t. $\|\cdot\|_\infty$), with $\beta_1 = k^{-1} + \alpha^{-1} - (k-1) k^{-1} \alpha^{-1}$ and $\beta_2 = 2 \alpha k^{-1}$. We refer to (Arlot, 2007) for more details about this claim (including complete proofs). We finally mention that this is not the only case where (Ap) holds, which is the reason why we use (Ap) as an assumption, and not these sufficient conditions (cf. the comments at the end of Sect. 4.3).

4.5 Main theoretical and practical consequences

Combining Thm. 1 and 2, we are now in position to prove the slope heuristics described in Sect. 2.3, as well as the validity of our Algorithm 1 (provided that $\mathrm{pen}_{\mathrm{shape}}$ is well chosen, for instance estimated by resampling).

Optimal penalty vs. minimal penalty

For the sake of simplicity, consider the penalty $K\, \mathbb{E}[p_2(m)]$ with any $K > 0$ (the same phenomenon occurs for a penalty approximately equal to this one). At first reading, one can think of the homoscedastic case where $\mathbb{E}[p_2(m)] \approx \sigma^2 D_m n^{-1}$, the general picture being quite similar (this generalization is one of the novelties of our results). With Thm. 2, we have shown that this penalty satisfies an oracle inequality with a leading constant $C_n(K)$ as soon as $K > 1$. Moreover, $C_n(2) \approx 1$. According to (Arlot, 2008b) (the proof of its Thm. 1, in particular Lemma 6), $C_n(K)$ stays away from 1 as soon as $K$ is not close to 2. This means that $K = 2$ is the optimal multiplying factor in front of $\mathbb{E}[p_2(m)]$.

On the other hand, when $K < 1$, Thm. 1 shows that no oracle inequality can hold with a leading constant $C_n(K)$ smaller than $\ln(n)$ (and even much larger in most cases, according to the proof of Thm. 1). Since $C_n(K) \leq (K-1)^{-1} < \ln(n)$ as soon as $K > 1 + \ln(n)^{-1}$, this

means that $K = 1$ is the minimal multiplying factor in front of $\mathbb{E}[p_2(m)]$. More generally, we have proven that $\mathrm{pen}_{\min}(m) := \mathbb{E}[p_2(m)]$ is a minimal penalty. In a nutshell, this is a formal proof of the heuristics of Sect. 2.3:

optimal penalty $\approx$ 2 $\times$ minimal penalty.

This has already been proposed by Birgé and Massart (2007), but their results were restricted to the Gaussian homoscedastic framework. In this paper, we extend them to a non-Gaussian and heteroscedastic setting.

Dimension jump

In addition, Thm. 1 and 2 prove the existence of a crucial phenomenon around the minimal penalty: a dimension jump. This is the only reason why we can estimate the minimal penalty in practice (since the explosion of the prediction error cannot be directly observed), so that Algorithm 1 strongly relies on it.

Indeed, consider again the penalty $K\, \mathbb{E}[p_2(m)]$, and define $\hat m(K)$, the selected model, as a function of $K$. For each $K > 0$, with a large probability, we have $D_{\hat m(K)} \leq n^{1-\eta}$ if $K > 1$ and $D_{\hat m(K)} \geq K_2\, n (\ln(n))^{-1}$ if $K < 1$ (the constant $K_2$ depends on $K$). More precisely, a careful look at the proofs shows that this holds simultaneously in the following sense: there are constants $K_4, K_5 > 0$ and an event of probability $1 - K_4 n^{-2}$ on which
$$\forall K \in \left(0, 1 - \ln(n)^{-1}\right), \quad D_{\hat m(K)} \geq K_5\, n (\ln(n))^{-2}$$
and
$$\forall K \in \left(1 + \ln(n)^{-1}, +\infty\right), \quad D_{\hat m(K)} \leq n^{1-\eta}.$$
This means that there must be a dimension jump around $K = 1$, from dimensions of order at least $n (\ln(n))^{-2}$ to much smaller dimensions, of order at most $n^{1-\eta}$. Actually, there can be several jumps instead of only one, but they occur for very close values of $K$ (at least when $n$ is large).

Let us now come back to Algorithm 1. Defining a reasonably small dimension as any dimension smaller than $n (\ln(n))^{-3}$, we have proven that $\widehat{K}_{\min}$ must be close to the true minimal multiplying factor. When the penalty is $K\, \mathbb{E}[p_2(m)]$, we have
$$1 - \frac{1}{\ln(n)} \leq \widehat{K}_{\min} \leq 1 + \frac{1}{\ln(n)}$$
with probability at least $1 - K_4 n^{-2}$. Notice that $n (\ln(n))^{-3}$ can be replaced by any dimension between $K_5\, n (\ln(n))^{-2}$ and $n^{1-\eta}$, which are very far apart as soon as $n$ is large enough. Hence, this dimension threshold does not have to be chosen accurately as soon as $n$ is not small.

Combined with Thm. 2, this shows that the model selection procedure of Algorithm 1 satisfies an oracle inequality with a leading constant smaller than $1 + 2 \ln(n)^{-1/5}$, on a large probability event. In addition, the same result holds when $\mathrm{pen}_{\mathrm{shape}}$ is only close to the ideal penalty shape, e.g. within a ratio $1 \pm \ln(n)^{-1}$. In particular, the resampling penalties of Efron (1983) and Arlot (2008b,a) satisfy this condition on a large probability event. We refer to Sect. 3.2 for further discussion of this question.

5. Conclusion

We have seen in this paper that it is possible to provide mathematical evidence that the method introduced by Birgé and Massart (2007) to design data-driven penalties remains efficient in a non-Gaussian context. Our purpose in this concluding section is to relate the heuristics that we have developed in Sect. 2 to the well-known Mallows' $C_p$ and Akaike's criteria, and to the principle of unbiased (or almost unbiased) estimation of the risk.

To explain our idea, which consists in guessing the right penalty from the data themselves, let us come back to Gaussian model selection. Towards this aim, let us consider some empirical criterion $\gamma_n$ (which can be the least-squares criterion as in this paper, but which could be the log-likelihood criterion as well). Let us also consider some collection of models $(S_m)_{m \in \mathcal{M}_n}$ and, in each model $S_m$, some minimizer $s_m$ of $t \mapsto \mathbb{E}[\gamma_n(t)]$ over $S_m$ (assuming that such a point does exist). Defining for every $m \in \mathcal{M}_n$,
$$\hat b_m = \gamma_n(s_m) - \gamma_n(s) \quad \text{and} \quad \hat v_m = \gamma_n(s_m) - \gamma_n(\hat s_m),$$
minimizing some penalized criterion $\gamma_n(\hat s_m) + \mathrm{pen}(m)$ over $\mathcal{M}_n$ amounts to minimizing
$$\hat b_m - \hat v_m + \mathrm{pen}(m).$$
The point is that $\hat b_m$ is an unbiased estimator of the bias term $\ell(s, s_m)$. Having concentration arguments in mind, one can hope that minimizing the quantity above will be approximately equivalent to minimizing
$$\ell(s, s_m) - \mathbb{E}[\hat v_m] + \mathrm{pen}(m).$$
Since the purpose of the game is to minimize the risk $\mathbb{E}[\ell(s, \hat s_m)]$, an ideal penalty would therefore be
$$\mathrm{pen}(m) = \mathbb{E}[\hat v_m] + \mathbb{E}[\ell(s_m, \hat s_m)].$$
In the Mallows' $C_p$ case (for Gaussian fixed-design least-squares regression), the models $S_m$ are linear and $\mathbb{E}[\hat v_m] = \mathbb{E}[\ell(s_m, \hat s_m)]$ are explicitly computable (at least if the level of noise is assumed to be known). For Akaike's penalized log-likelihood criterion, this is similar, at least asymptotically. More precisely, one uses the fact that
$$\mathbb{E}[\hat v_m] \approx \mathbb{E}[\ell(s_m, \hat s_m)] \approx \frac{D_m}{2n},$$
where $D_m$ stands for the number of parameters defining model $S_m$. The conclusion of these considerations is that Mallows' $C_p$ as well as Akaike's criterion are indeed both based on the unbiased (or asymptotically unbiased) risk estimation principle.

The first idea that we are using in this paper is that one can go further in this direction, and that the approximation $\mathbb{E}[\hat v_m] \approx \mathbb{E}[\ell(s_m, \hat s_m)]$ remains valid even in a non-asymptotic context. If one believes in it, then a good penalty becomes $2\,\mathbb{E}[\hat v_m]$ or equivalently (still having concentration arguments in mind) $2 \hat v_m$. This in some sense explains the rule of thumb given by Birgé and Massart (2007) and further studied in this paper, and connects

it to Mallows' $C_p$ and Akaike's heuristics. Indeed, the minimal penalty is $\hat v_m$ while the optimal penalty should be $\hat v_m + \mathbb{E}[\ell(s_m, \hat s_m)]$, and their ratio is approximately equal to 2.

The second idea that we are using in this paper is that one can guess the minimal penalty from the data. There are indeed several ways to estimate the minimal penalty. Here we are using the jump of dimension which occurs around the minimal penalty. When the shape of the minimal penalty is (at least approximately) of the form $\alpha D_m$, this amounts to estimating the unknown value $\alpha$ by the slope of the graph of $\gamma_n(\hat s_m)$ for large enough values of $D_m$. It is easy to extend this method to other shapes of penalties, simply by replacing $D_m$ by some (known!) function $f(D_m)$. It is even possible to combine resampling ideas with the slope heuristics by taking a random function $f$ which is built from a randomized empirical criterion. As shown by Arlot (2007), this approach turns out to be much more efficient than the rougher choice $f(D_m) = D_m$ for highly heteroscedastic random regression frameworks. Of course, the question of the optimality of the slope heuristics remains widely open, but we believe that on the one hand this heuristics can be helpful in practice, and that on the other hand, proving its efficiency even on a toy model, as we did in this paper, is already something.

Let us finally mention that, contrary to Birgé and Massart (2007), we have restricted our study to the situation where the collection of models $\mathcal{M}_n$ is small, i.e. has a size growing at most like a power of $n$. For several problems, such as complete variable selection, this assumption does not hold, and it is known from the homoscedastic case that the minimal penalty is much larger than $\mathbb{E}[p_2(m)]$. For instance, using the results by Birgé and Massart (2007) in the Gaussian case, Émilie Lebarbier has used the slope heuristics with $f(D_m) = D_m \left(2.5 + \ln\left(\frac{n}{D_m}\right)\right)$ for multiple change points detection from noisy data.
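The slope estimation mentioned above (for a minimal penalty of the form $\alpha D_m$) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `estimate_slope` and `select_dimension`, the cut-off `d_min` for "large enough" dimensions, and the synthetic risk curve (bias $1/D^2$ plus a linear term $-0.01\,D$) are all hypothetical choices made for the example.

```python
def estimate_slope(dims, risks, d_min):
    """Estimate alpha as minus the least-squares slope of the empirical risk
    gamma_n(s_hat_m) against D_m, using only models with D_m >= d_min."""
    pts = [(d, r) for d, r in zip(dims, risks) if d >= d_min]
    mean_d = sum(d for d, _ in pts) / len(pts)
    mean_r = sum(r for _, r in pts) / len(pts)
    cov = sum((d - mean_d) * (r - mean_r) for d, r in pts)
    var = sum((d - mean_d) ** 2 for d, _ in pts)
    return -cov / var


def select_dimension(dims, risks, alpha_hat):
    """Minimize the empirical risk penalized by twice the estimated minimal
    penalty, i.e. gamma_n(s_hat_m) + 2 * alpha_hat * D_m."""
    return min(zip(dims, risks), key=lambda dr: dr[1] + 2.0 * alpha_hat * dr[0])[0]


# Synthetic, noise-free risk curve (an assumption of this sketch, not data
# from the paper): the bias decays like 1/D^2 and the empirical risk keeps
# decreasing linearly with slope -alpha, here alpha = 0.01.
dims = list(range(1, 51))
risks = [1.0 / d**2 - 0.01 * d for d in dims]

alpha_hat = estimate_slope(dims, risks, d_min=25)
d_selected = select_dimension(dims, risks, alpha_hat)
print(f"alpha_hat = {alpha_hat:.5f}, selected dimension = {d_selected}")
```

On real data the risk curve is noisy and the choice of `d_min` matters; replacing $D_m$ by another known shape $f(D_m)$ only requires regressing the risks on $f(D_m)$ instead of on $D_m$.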
Let us now explain how we expect to generalize their heuristics to the non-Gaussian heteroscedastic case. First, group the models according to some complexity index $C_m$ (for instance their dimensions, or the approximate value of their resampling penalty suitably normalized): for $C \in \{1, \dots, n^k\}$, define $\widetilde{S}_C = \bigcup_{C_m = C} S_m$. Then, replace the model selection problem with the family $(S_m)_{m \in \mathcal{M}_n}$ by a complexity selection problem, i.e. model selection with the family $(\widetilde{S}_C)_{1 \leq C \leq n^k}$. We conjecture that this grouping of the models is sufficient to take into account the richness of $\mathcal{M}_n$ for the optimal calibration of the penalty. A theoretical justification of this point may rely on the extension of our results to any kind of model, not only histogram ones (each $\widetilde{S}_C$ is not a histogram model, since it is not even a vector space). As already mentioned, this remains an interesting open problem.

Appendix A. Computational aspects of the slope heuristics

With Algorithm 1 (possibly combined with resampling penalties for step 1), we have a completely data-driven and optimal model selection procedure. From the practical viewpoint, the two remaining difficulties lie in steps 1 and 2. First, at step 1, how can we compute $\hat m(K)$ exactly for every $K \in (0, +\infty)$, this latter set being uncountable? The answer is that the whole trajectory $(\hat m(K))_{K \geq 0}$ can be described with a small number of parameters, which can be computed quickly. This point is the object of Sect. A.1. Second, at step 2, how can the jump of dimension be detected automatically in practice? In other words, how should

$\widehat{K}_{\min}$ be defined exactly, as a function of $(\hat m(K))_{K \geq 0}$? We try to answer this question in Sect. A.2.

A.1 Computation of $(\hat m(K))_{K \geq 0}$

For every model $m \in \mathcal{M}_n$, define
$$f(m) = P_n \gamma(\hat s_m), \qquad g(m) = \mathrm{pen}_{\mathrm{shape}}(m)$$
and
$$\forall K \geq 0, \quad \hat m(K) \in \arg\min_{m \in \mathcal{M}_n} \{ f(m) + K g(m) \}.$$
Since the latter definition can be ambiguous, we choose any total ordering $\preceq$ on $\mathcal{M}_n$ such that $g$ is non-decreasing. Then, $\hat m(K)$ is defined as the smallest element of $E(K) := \arg\min_{m \in \mathcal{M}_n} \{ f(m) + K g(m) \}$ for $\preceq$.

The main reason why the whole trajectory $(\hat m(K))_{K \geq 0}$ can be computed efficiently is its very particular shape. Indeed, the results below (mostly Lemma 4) show that $K \mapsto \hat m(K)$ is piecewise constant, and non-increasing for $\preceq$. We then have
$$\forall i \in \{0, \dots, i_{\max}\}, \quad \forall K \in [K_i, K_{i+1}), \quad \hat m(K) = m_i,$$
and the whole trajectory $(\hat m(K))_{K \geq 0}$ can be represented by:
- a non-negative integer $i_{\max} \leq \mathrm{Card}(\mathcal{M}_n) - 1$ (the number of jumps),
- an increasing sequence of non-negative reals $(K_i)_{0 \leq i \leq i_{\max}+1}$ (the location of the jumps, with $K_0 = 0$ and $K_{i_{\max}+1} = +\infty$),
- a non-increasing sequence of models $(m_i)_{0 \leq i \leq i_{\max}}$.

We are now in position to give an efficient algorithm for step 1 in Algorithm 1. The point is that the $K_i$ and the $m_i$ can be computed sequentially, each step having a complexity proportional to $\mathrm{Card}(\mathcal{M}_n)$. This means that its overall complexity is lower than a constant times $i_{\max}\, \mathrm{Card}(\mathcal{M}_n) \leq \mathrm{Card}(\mathcal{M}_n)^2$ (and the latter bound is quite pessimistic in general). Notice also that Algorithm 2 can be stopped earlier if the only goal is to identify $\widehat{K}_{\min}$ (which may be done with only the first $m_i$).

Algorithm 2 (Step 1 of Algorithm 1) For every $m \in \mathcal{M}_n$, define $f(m) = P_n \gamma(\hat s_m)$ and $g(m) = \mathrm{pen}_{\mathrm{shape}}(m)$. Choose any total ordering $\preceq$ on $\mathcal{M}_n$ such that $g$ is non-decreasing.

Init: $K_0 = 0$, $m_0 = \arg\min_{m \in \mathcal{M}_n} \{ f(m) \}$ (when this minimum is attained several times, $m_0$ is defined as the smallest one for $\preceq$).

Step $i$, $i \geq 1$: Let
$$G(m_{i-1}) := \{ m \in \mathcal{M}_n \ \text{s.t.}\ f(m) > f(m_{i-1}) \ \text{and}\ g(m) < g(m_{i-1}) \}.$$

If $G(m_{i-1}) = \emptyset$, then put $K_i = +\infty$, $i_{\max} = i - 1$ and stop. Otherwise, define
$$K_i := \inf \left\{ \frac{f(m) - f(m_{i-1})}{g(m_{i-1}) - g(m)} \ \text{s.t.}\ m \in G(m_{i-1}) \right\} \tag{13}$$
and $m_i$ the smallest element (for $\preceq$) of
$$F_i := \arg\min_{m \in G(m_{i-1})} \left\{ \frac{f(m) - f(m_{i-1})}{g(m_{i-1}) - g(m)} \right\}.$$

The validity of Algorithm 2 is justified by the following proposition, showing that these $K_i$ and $m_i$ are the same as the ones describing $(\hat m(K))_{K \geq 0}$.

Proposition 3 If $\mathcal{M}_n$ is finite, Algorithm 2 terminates and $i_{\max} \leq \mathrm{Card}(\mathcal{M}_n) - 1$. Using the notations of Algorithm 2, and defining $\hat m(K)$ as the smallest element (for $\preceq$) of $E(K) := \arg\min_{m \in \mathcal{M}_n} \{ f(m) + K g(m) \}$, $(K_i)_{0 \leq i \leq i_{\max}+1}$ is increasing and
$$\forall i \in \{0, \dots, i_{\max} - 1\}, \quad \forall K \in [K_i, K_{i+1}), \quad \hat m(K) = m_i.$$
It is proven in Sect. A.3.

A.2 Definition of $\widehat{K}_{\min}$

We now come to the question of defining $\widehat{K}_{\min}$ as a function of $(\hat m(K))_{K > 0}$. As mentioned in Sect. 4.5 (Dimension jump), it corresponds to a dimension jump, which should be observable since the whole trajectory of $(D_{\hat m(K)})_{K \geq 0}$ is known.

As an illustration of this question, we represent in Fig. 1 $D_{\hat m(K)}$ as a function of $K$, for two simulated samples. On the left (a), the dimension jump is quite clear, and we expect a formal definition of $\widehat{K}_{\min}$ to find this jump. The same picture holds for approximately 85% of the data sets. On the right (b), there seem to be several jumps, and a proper definition of $\widehat{K}_{\min}$ is problematic. What is certain is the need for an automatic choice of $\widehat{K}_{\min}$, which requires defining it properly.

We now propose two definitions that seem reasonable to us. For the first one, choose a threshold $D_{\mathrm{reas.}}$, of order $n/(\ln(n))$, corresponding to the largest reasonable dimension for the selected model. Then, define
$$\widehat{K}_{\min} := \inf \left\{ K > 0 \ \text{s.t.}\ D_{\hat m(K)} \leq D_{\mathrm{reas.}} \right\}.$$
With this definition, one can stop Algorithm 2 as soon as the threshold is reached. However, $\widehat{K}_{\min}$ may depend strongly on the choice of the threshold, which may not be obvious in the non-asymptotic situation (where $n/\ln(n)$ is not so far from $n$).

Our second idea is that $\widehat{K}_{\min}$ should match the largest dimension jump, i.e.
$$\widehat{K}_{\min} := K_{i_{\mathrm{max.jump}}} \quad \text{with} \quad i_{\mathrm{max.jump}} = \arg\max_{i \in \{0, \dots, i_{\max}-1\}} \left\{ D_{m_{i+1}} - D_{m_i} \right\}.$$
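Algorithm 2 and the two candidate definitions of $\widehat{K}_{\min}$ can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, models are identified by their list index, ties in the ordering $\preceq$ are broken by index, and the four-model toy data are invented. For the largest-jump rule, the sketch measures each jump as the drop in dimension between consecutive selected models and returns the value of $K$ at which that drop occurs; this is one reading of "largest dimension jump", and its indexing convention is an assumption.

```python
import bisect
import math


def penalty_path(f, g):
    """Sketch of Algorithm 2: f[m] is the empirical risk of model m, g[m] its
    penalty shape. Returns (K, path) with path[i] selected on [K[i], K[i+1])."""
    models = range(len(f))
    # A total ordering on the models for which g is non-decreasing.
    rank = {m: r for r, m in enumerate(sorted(models, key=lambda m: (g[m], m)))}
    m_cur = min(models, key=lambda m: (f[m], rank[m]))  # m_0
    K, path = [0.0], [m_cur]
    while True:
        G = [m for m in models if f[m] > f[m_cur] and g[m] < g[m_cur]]
        if not G:  # no smaller model left: K_{i_max + 1} = +infinity
            K.append(math.inf)
            return K, path
        ratios = {m: (f[m] - f[m_cur]) / (g[m_cur] - g[m]) for m in G}
        K_next = min(ratios.values())  # equation (13)
        m_cur = min((m for m in G if ratios[m] == K_next), key=rank.get)
        K.append(K_next)
        path.append(m_cur)


def selected_model(K, path, value):
    """Model selected for the multiplying factor `value` (finite, >= 0)."""
    return path[bisect.bisect_right(K, value) - 1]


def k_min_threshold(K, dims_path, d_reas):
    """First definition: smallest K whose selected dimension is <= d_reas."""
    for K_i, dim in zip(K, dims_path):
        if dim <= d_reas:
            return K_i
    return math.inf


def k_min_max_jump(K, dims_path):
    """Second definition (this sketch's convention): the K at which the
    largest drop in dimension occurs; +infinity if the path has no jump."""
    drops = [dims_path[i] - dims_path[i + 1] for i in range(len(dims_path) - 1)]
    if not drops:
        return math.inf
    return K[max(range(len(drops)), key=drops.__getitem__) + 1]


# Invented toy example with pen_shape(m) = D_m.
dims = [1, 2, 10, 20]
f = [1.00, 0.50, 0.40, 0.30]
K, path = penalty_path(f, g=dims)
dims_path = [dims[m] for m in path]
k_thr = k_min_threshold(K, dims_path, d_reas=5)
k_jump = k_min_max_jump(K, dims_path)
print(dims_path, K, k_thr, k_jump)
```

In this toy example the two definitions give different values of $\widehat{K}_{\min}$, yet `selected_model(K, path, 2 * k_thr)` and `selected_model(K, path, 2 * k_jump)` return the same model, which mirrors the second of the three cases reported in the simulation study.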

[Figure 1 (image not reproduced): two panels plotting the dimension of $\hat m(K)$ against $K$, with the maximal jump and reasonable dimension choices marked. (a) One clear jump. (b) Two jumps, two values for $\widehat{K}_{\min}$.]

Figure 1: $D_{\hat m(K)}$ as a function of $K$ for two different samples. Data are simulated with $X \sim \mathcal{U}([0,1])$, $\epsilon \sim \mathcal{N}(0,1)$, $s(x) = \sin(\pi x)$, $\sigma \equiv 1$, $n = 200$. $(S_m)_{m \in \mathcal{M}_n}$ is the collection of regular histogram models with dimension between 1 and $n/(\ln(n))$. $\mathrm{pen}_{\mathrm{shape}}(m) = D_m$. Reasonable dimensions are below $n/(2 \ln(n)) \approx 19$. See (Arlot, 2008b) for details (experiment S1).

Although this definition may seem less arbitrary than the previous one, it still depends strongly on $\mathcal{M}_n$, which may not contain many large models, for computational reasons. In order to ensure that there is a clear jump, an idea may be to add a few models of dimension $n/2$, so that at least one has a well-defined empirical risk minimizer $\hat s_m$. In practice, several huge models with a well-defined $\hat s_m$ may be necessary, in order to decrease the variability of $\widehat{K}_{\min}$. This modification has the drawback of being quite arbitrary.

As an illustration, we compared the two definitions above (reasonable dimension vs. maximal jump) on one thousand simulated samples similar to the one of Fig. 1. Three cases occurred:
1. The values of $\widehat{K}_{\min}$ do not differ (about 85% of the data sets; this is the (a) situation).
2. The values of $\widehat{K}_{\min}$ differ, but the selected models $\hat m(2 \widehat{K}_{\min})$ are still equal (about 8.5% of the data sets).
3. The finally selected models are different (about 6.5% of the data sets; this is the (b) situation).

Hence, in this non-asymptotic framework, the formal definition of $\widehat{K}_{\min}$ does not matter in general, but stays problematic in a few cases. In terms of prediction error, we have compared the two methods by estimating the constant $C_{\mathrm{or}}$ that would appear in some oracle inequality:
$$C_{\mathrm{or}} := \frac{\mathbb{E}\left[\ell(s, \hat s_{\hat m})\right]}{\mathbb{E}\left[\inf_{m \in \mathcal{M}_n} \{ \ell(s, \hat s_m) \}\right]}.$$

With the reasonable dimension definition, $C_{\mathrm{or}} \approx$ […]. With the maximal jump definition, $C_{\mathrm{or}} \approx$ […]. As a comparison, Mallows' $C_p$ (with a classical estimator of the variance $\sigma^2$) has a performance of $C_{\mathrm{or}} \approx 1.93$ on the same data. For the three procedures, the standard deviation of the estimator of $C_{\mathrm{or}}$ is about […]. See Chap. 4 of (Arlot, 2007) for more details.

This preliminary simulation study shows that Algorithm 1 works efficiently (it is competitive with Mallows' $C_p$ in a situation where the latter is also optimal). It also suggests that the reasonable dimension definition may be better, but without very convincing evidence. In order to make the choice of $\widehat{K}_{\min}$ as automatic as possible, we suggest using the two methods simultaneously. When the selected models are not the same, send a warning to the final user, advising them to look at the curve $K \mapsto D_{\hat m(K)}$ themselves. Otherwise, stay confident in the automatic choice of $\hat m(2 \widehat{K}_{\min})$.

A.3 Proof of Prop. 3

First of all, since $\mathcal{M}_n$ is finite, the infimum in (13) is attained as soon as $G(m_{i-1}) \neq \emptyset$, so that $m_i$ is well defined for every $i \leq i_{\max}$. Moreover, by construction, $g(m_i)$ decreases with $i$, so that all the $m_i \in \mathcal{M}_n$ are distinct. Hence, Algorithm 2 terminates and $i_{\max} + 1 \leq \mathrm{Card}(\mathcal{M}_n)$. We now prove by induction the following property for every $i \in \{0, \dots, i_{\max}\}$:
$$\mathcal{P}_i: \quad K_i < K_{i+1} \quad \text{and} \quad \forall K \in [K_i, K_{i+1}), \ \hat m(K) = m_i.$$
Notice also that $K_i$ can always be defined by (13) with the convention $\inf \emptyset = +\infty$.

$\mathcal{P}_0$ holds true. By definition of $K_1$, it is clear that $K_1 > 0$ (it may be equal to $+\infty$ if $G(m_0) = \emptyset$). For $K = K_0 = 0$, the definition of $m_0$ is the one of $\hat m(0)$, so that $\hat m(K) = m_0$. For $K \in (0, K_1)$, Lemma 4 shows that either $\hat m(K) = \hat m(0) = m_0$ or $\hat m(K) \in G(m_0)$. In the latter case, by definition of $K_1$,
$$\frac{f(\hat m(K)) - f(m_0)}{g(m_0) - g(\hat m(K))} \geq K_1 > K$$
so that
$$f(\hat m(K)) + K g(\hat m(K)) > f(m_0) + K g(m_0),$$
which contradicts the definition of $\hat m(K)$. Hence, $\mathcal{P}_0$ holds true.

$\mathcal{P}_i \Rightarrow \mathcal{P}_{i+1}$ for every $i \in \{0, \dots, i_{\max} - 1\}$. Assume that $\mathcal{P}_i$ holds true. First, we have to prove that $K_{i+2} > K_{i+1}$. Since $K_{i_{\max}+1} = +\infty$, this is clear if $i = i_{\max} - 1$. Otherwise, $K_{i+2} < +\infty$ and $m_{i+2}$ exists. Then, by definition of $m_{i+2}$ and $K_{i+2}$ (resp. $m_{i+1}$ and $K_{i+1}$), we have
$$f(m_{i+2}) - f(m_{i+1}) = K_{i+2} \left( g(m_{i+1}) - g(m_{i+2}) \right) \tag{14}$$
$$f(m_{i+1}) - f(m_i) = K_{i+1} \left( g(m_i) - g(m_{i+1}) \right). \tag{15}$$


More information

10-701/ Machine Learning Mid-term Exam Solution

10-701/ Machine Learning Mid-term Exam Solution 0-70/5-78 Machie Learig Mid-term Exam Solutio Your Name: Your Adrew ID: True or False (Give oe setece explaatio) (20%). (F) For a cotiuous radom variable x ad its probability distributio fuctio p(x), it

More information

Regression with quadratic loss

Regression with quadratic loss Regressio with quadratic loss Maxim Ragisky October 13, 2015 Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X,Y, where, as before,

More information

CSE 527, Additional notes on MLE & EM

CSE 527, Additional notes on MLE & EM CSE 57 Lecture Notes: MLE & EM CSE 57, Additioal otes o MLE & EM Based o earlier otes by C. Grat & M. Narasimha Itroductio Last lecture we bega a examiatio of model based clusterig. This lecture will be

More information

1 Approximating Integrals using Taylor Polynomials

1 Approximating Integrals using Taylor Polynomials Seughee Ye Ma 8: Week 7 Nov Week 7 Summary This week, we will lear how we ca approximate itegrals usig Taylor series ad umerical methods. Topics Page Approximatig Itegrals usig Taylor Polyomials. Defiitios................................................

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5 CS434a/54a: Patter Recogitio Prof. Olga Veksler Lecture 5 Today Itroductio to parameter estimatio Two methods for parameter estimatio Maimum Likelihood Estimatio Bayesia Estimatio Itroducto Bayesia Decisio

More information

Quantile regression with multilayer perceptrons.

Quantile regression with multilayer perceptrons. Quatile regressio with multilayer perceptros. S.-F. Dimby ad J. Rykiewicz Uiversite Paris 1 - SAMM 90 Rue de Tolbiac, 75013 Paris - Frace Abstract. We cosider oliear quatile regressio ivolvig multilayer

More information

Lecture 12: September 27

Lecture 12: September 27 36-705: Itermediate Statistics Fall 207 Lecturer: Siva Balakrisha Lecture 2: September 27 Today we will discuss sufficiecy i more detail ad the begi to discuss some geeral strategies for costructig estimators.

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week 2 Lecture: Cocept Check Exercises Starred problems are optioal. Excess Risk Decompositio 1. Let X = Y = {1, 2,..., 10}, A = {1,..., 10, 11} ad suppose the data distributio

More information

Lecture 3 The Lebesgue Integral

Lecture 3 The Lebesgue Integral Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified

More information

Statistics 511 Additional Materials

Statistics 511 Additional Materials Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability

More information

Sequences and Series of Functions

Sequences and Series of Functions Chapter 6 Sequeces ad Series of Fuctios 6.1. Covergece of a Sequece of Fuctios Poitwise Covergece. Defiitio 6.1. Let, for each N, fuctio f : A R be defied. If, for each x A, the sequece (f (x)) coverges

More information

Properties and Hypothesis Testing

Properties and Hypothesis Testing Chapter 3 Properties ad Hypothesis Testig 3.1 Types of data The regressio techiques developed i previous chapters ca be applied to three differet kids of data. 1. Cross-sectioal data. 2. Time series data.

More information

Maximum Likelihood Estimation and Complexity Regularization

Maximum Likelihood Estimation and Complexity Regularization ECE90 Sprig 004 Statistical Regularizatio ad Learig Theory Lecture: 4 Maximum Likelihood Estimatio ad Complexity Regularizatio Lecturer: Rob Nowak Scribe: Pam Limpiti Review : Maximum Likelihood Estimatio

More information

1 Inferential Methods for Correlation and Regression Analysis

1 Inferential Methods for Correlation and Regression Analysis 1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet

More information

Chi-Squared Tests Math 6070, Spring 2006

Chi-Squared Tests Math 6070, Spring 2006 Chi-Squared Tests Math 6070, Sprig 2006 Davar Khoshevisa Uiversity of Utah February XXX, 2006 Cotets MLE for Goodess-of Fit 2 2 The Multiomial Distributio 3 3 Applicatio to Goodess-of-Fit 6 3 Testig for

More information

Problem Set 2 Solutions

Problem Set 2 Solutions CS271 Radomess & Computatio, Sprig 2018 Problem Set 2 Solutios Poit totals are i the margi; the maximum total umber of poits was 52. 1. Probabilistic method for domiatig sets 6pts Pick a radom subset S

More information

Agnostic Learning and Concentration Inequalities

Agnostic Learning and Concentration Inequalities ECE901 Sprig 2004 Statistical Regularizatio ad Learig Theory Lecture: 7 Agostic Learig ad Cocetratio Iequalities Lecturer: Rob Nowak Scribe: Aravid Kailas 1 Itroductio 1.1 Motivatio I the last lecture

More information

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4.

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4. 4. BASES I BAACH SPACES 39 4. BASES I BAACH SPACES Sice a Baach space X is a vector space, it must possess a Hamel, or vector space, basis, i.e., a subset {x γ } γ Γ whose fiite liear spa is all of X ad

More information

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014. Product measures, Toelli s ad Fubii s theorems For use i MAT3400/4400, autum 2014 Nadia S. Larse Versio of 13 October 2014. 1. Costructio of the product measure The purpose of these otes is to preset the

More information

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound Lecture 7 Ageda for the lecture Gaussia chael with average power costraits Capacity of additive Gaussia oise chael ad the sphere packig boud 7. Additive Gaussia oise chael Up to this poit, we have bee

More information

Intro to Learning Theory

Intro to Learning Theory Lecture 1, October 18, 2016 Itro to Learig Theory Ruth Urer 1 Machie Learig ad Learig Theory Comig soo 2 Formal Framework 21 Basic otios I our formal model for machie learig, the istaces to be classified

More information

Lecture 9: Boosting. Akshay Krishnamurthy October 3, 2017

Lecture 9: Boosting. Akshay Krishnamurthy October 3, 2017 Lecture 9: Boostig Akshay Krishamurthy akshay@csumassedu October 3, 07 Recap Last week we discussed some algorithmic aspects of machie learig We saw oe very powerful family of learig algorithms, amely

More information

Linear Regression Demystified

Linear Regression Demystified Liear Regressio Demystified Liear regressio is a importat subject i statistics. I elemetary statistics courses, formulae related to liear regressio are ofte stated without derivatio. This ote iteds to

More information

Mathematical Induction

Mathematical Induction Mathematical Iductio Itroductio Mathematical iductio, or just iductio, is a proof techique. Suppose that for every atural umber, P() is a statemet. We wish to show that all statemets P() are true. I a

More information

Roberto s Notes on Series Chapter 2: Convergence tests Section 7. Alternating series

Roberto s Notes on Series Chapter 2: Convergence tests Section 7. Alternating series Roberto s Notes o Series Chapter 2: Covergece tests Sectio 7 Alteratig series What you eed to kow already: All basic covergece tests for evetually positive series. What you ca lear here: A test for series

More information

The Growth of Functions. Theoretical Supplement

The Growth of Functions. Theoretical Supplement The Growth of Fuctios Theoretical Supplemet The Triagle Iequality The triagle iequality is a algebraic tool that is ofte useful i maipulatig absolute values of fuctios. The triagle iequality says that

More information

OPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES

OPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES OPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES Peter M. Maurer Why Hashig is θ(). As i biary search, hashig assumes that keys are stored i a array which is idexed by a iteger. However, hashig attempts to bypass

More information

Algebra of Least Squares

Algebra of Least Squares October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Lecture 2: Monte Carlo Simulation

Lecture 2: Monte Carlo Simulation STAT/Q SCI 43: Itroductio to Resamplig ethods Sprig 27 Istructor: Ye-Chi Che Lecture 2: ote Carlo Simulatio 2 ote Carlo Itegratio Assume we wat to evaluate the followig itegratio: e x3 dx What ca we do?

More information

Law of the sum of Bernoulli random variables

Law of the sum of Bernoulli random variables Law of the sum of Beroulli radom variables Nicolas Chevallier Uiversité de Haute Alsace, 4, rue des frères Lumière 68093 Mulhouse icolas.chevallier@uha.fr December 006 Abstract Let be the set of all possible

More information

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random Part III. Areal Data Aalysis 0. Comparative Tests amog Spatial Regressio Models While the otio of relative likelihood values for differet models is somewhat difficult to iterpret directly (as metioed above),

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator Slide Set 13 Liear Model with Edogeous Regressors ad the GMM estimator Pietro Coretto pcoretto@uisa.it Ecoometrics Master i Ecoomics ad Fiace (MEF) Uiversità degli Studi di Napoli Federico II Versio: Friday

More information

MA131 - Analysis 1. Workbook 3 Sequences II

MA131 - Analysis 1. Workbook 3 Sequences II MA3 - Aalysis Workbook 3 Sequeces II Autum 2004 Cotets 2.8 Coverget Sequeces........................ 2.9 Algebra of Limits......................... 2 2.0 Further Useful Results........................

More information

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A.

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A. Radom Walks o Discrete ad Cotiuous Circles by Jeffrey S. Rosethal School of Mathematics, Uiversity of Miesota, Mieapolis, MN, U.S.A. 55455 (Appeared i Joural of Applied Probability 30 (1993), 780 789.)

More information

Rates of Convergence by Moduli of Continuity

Rates of Convergence by Moduli of Continuity Rates of Covergece by Moduli of Cotiuity Joh Duchi: Notes for Statistics 300b March, 017 1 Itroductio I this ote, we give a presetatio showig the importace, ad relatioship betwee, the modulis of cotiuity

More information

Chapter 6 Principles of Data Reduction

Chapter 6 Principles of Data Reduction Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a

More information

Binary classification, Part 1

Binary classification, Part 1 Biary classificatio, Part 1 Maxim Ragisky September 25, 2014 The problem of biary classificatio ca be stated as follows. We have a radom couple Z = (X,Y ), where X R d is called the feature vector ad Y

More information

Seunghee Ye Ma 8: Week 5 Oct 28

Seunghee Ye Ma 8: Week 5 Oct 28 Week 5 Summary I Sectio, we go over the Mea Value Theorem ad its applicatios. I Sectio 2, we will recap what we have covered so far this term. Topics Page Mea Value Theorem. Applicatios of the Mea Value

More information

If a subset E of R contains no open interval, is it of zero measure? For instance, is the set of irrationals in [0, 1] is of measure zero?

If a subset E of R contains no open interval, is it of zero measure? For instance, is the set of irrationals in [0, 1] is of measure zero? 2 Lebesgue Measure I Chapter 1 we defied the cocept of a set of measure zero, ad we have observed that every coutable set is of measure zero. Here are some atural questios: If a subset E of R cotais a

More information

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample. Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized

More information

Randomized Algorithms I, Spring 2018, Department of Computer Science, University of Helsinki Homework 1: Solutions (Discussed January 25, 2018)

Randomized Algorithms I, Spring 2018, Department of Computer Science, University of Helsinki Homework 1: Solutions (Discussed January 25, 2018) Radomized Algorithms I, Sprig 08, Departmet of Computer Sciece, Uiversity of Helsiki Homework : Solutios Discussed Jauary 5, 08). Exercise.: Cosider the followig balls-ad-bi game. We start with oe black

More information

Distribution of Random Samples & Limit theorems

Distribution of Random Samples & Limit theorems STAT/MATH 395 A - PROBABILITY II UW Witer Quarter 2017 Néhémy Lim Distributio of Radom Samples & Limit theorems 1 Distributio of i.i.d. Samples Motivatig example. Assume that the goal of a study is to

More information

Bertrand s Postulate

Bertrand s Postulate Bertrad s Postulate Lola Thompso Ross Program July 3, 2009 Lola Thompso (Ross Program Bertrad s Postulate July 3, 2009 1 / 33 Bertrad s Postulate I ve said it oce ad I ll say it agai: There s always a

More information

Investigating the Significance of a Correlation Coefficient using Jackknife Estimates

Investigating the Significance of a Correlation Coefficient using Jackknife Estimates Iteratioal Joural of Scieces: Basic ad Applied Research (IJSBAR) ISSN 2307-4531 (Prit & Olie) http://gssrr.org/idex.php?joural=jouralofbasicadapplied ---------------------------------------------------------------------------------------------------------------------------

More information

Discrete Mathematics for CS Spring 2007 Luca Trevisan Lecture 22

Discrete Mathematics for CS Spring 2007 Luca Trevisan Lecture 22 CS 70 Discrete Mathematics for CS Sprig 2007 Luca Trevisa Lecture 22 Aother Importat Distributio The Geometric Distributio Questio: A biased coi with Heads probability p is tossed repeatedly util the first

More information

Statisticians use the word population to refer the total number of (potential) observations under consideration

Statisticians use the word population to refer the total number of (potential) observations under consideration 6 Samplig Distributios Statisticias use the word populatio to refer the total umber of (potetial) observatios uder cosideratio The populatio is just the set of all possible outcomes i our sample space

More information

Measure and Measurable Functions

Measure and Measurable Functions 3 Measure ad Measurable Fuctios 3.1 Measure o a Arbitrary σ-algebra Recall from Chapter 2 that the set M of all Lebesgue measurable sets has the followig properties: R M, E M implies E c M, E M for N implies

More information

Econ 325/327 Notes on Sample Mean, Sample Proportion, Central Limit Theorem, Chi-square Distribution, Student s t distribution 1.

Econ 325/327 Notes on Sample Mean, Sample Proportion, Central Limit Theorem, Chi-square Distribution, Student s t distribution 1. Eco 325/327 Notes o Sample Mea, Sample Proportio, Cetral Limit Theorem, Chi-square Distributio, Studet s t distributio 1 Sample Mea By Hiro Kasahara We cosider a radom sample from a populatio. Defiitio

More information

Math 216A Notes, Week 5

Math 216A Notes, Week 5 Math 6A Notes, Week 5 Scribe: Ayastassia Sebolt Disclaimer: These otes are ot early as polished (ad quite possibly ot early as correct) as a published paper. Please use them at your ow risk.. Thresholds

More information

Math F215: Induction April 7, 2013

Math F215: Induction April 7, 2013 Math F25: Iductio April 7, 203 Iductio is used to prove that a collectio of statemets P(k) depedig o k N are all true. A statemet is simply a mathematical phrase that must be either true or false. Here

More information

Notes 27 : Brownian motion: path properties

Notes 27 : Brownian motion: path properties Notes 27 : Browia motio: path properties Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces:[Dur10, Sectio 8.1], [MP10, Sectio 1.1, 1.2, 1.3]. Recall: DEF 27.1 (Covariace) Let X = (X

More information

Beurling Integers: Part 2

Beurling Integers: Part 2 Beurlig Itegers: Part 2 Isomorphisms Devi Platt July 11, 2015 1 Prime Factorizatio Sequeces I the last article we itroduced the Beurlig geeralized itegers, which ca be represeted as a sequece of real umbers

More information

Stat 421-SP2012 Interval Estimation Section

Stat 421-SP2012 Interval Estimation Section Stat 41-SP01 Iterval Estimatio Sectio 11.1-11. We ow uderstad (Chapter 10) how to fid poit estimators of a ukow parameter. o However, a poit estimate does ot provide ay iformatio about the ucertaity (possible

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/5.070J Fall 203 Lecture 3 9//203 Large deviatios Theory. Cramér s Theorem Cotet.. Cramér s Theorem. 2. Rate fuctio ad properties. 3. Chage of measure techique.

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information