Data-driven calibration of linear estimators with minimal penalties
Sylvain Arlot
CNRS; Willow Project-Team
Laboratoire d'Informatique de l'Ecole Normale Superieure (CNRS/ENS/INRIA UMR 8548)
23, avenue d'Italie, Paris, France
sylvain.arlot@ens.fr

Francis Bach
INRIA; Willow Project-Team
Laboratoire d'Informatique de l'Ecole Normale Superieure (CNRS/ENS/INRIA UMR 8548)
23, avenue d'Italie, Paris, France
francis.bach@ens.fr

Abstract

This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first consistently estimates the variance of the noise, based upon the concept of minimal penalty, which was previously introduced in the context of model selection. Then, plugging our variance estimate into Mallows' $C_L$ penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often significantly improves upon existing calibration procedures such as 10-fold cross-validation or generalized cross-validation.

1 Introduction

Kernel-based methods are now well-established tools for supervised learning, allowing one to perform various tasks, such as regression or binary classification, with linear and non-linear predictors [1, 2]. A central issue common to all regularization frameworks is the choice of the regularization parameter: while most practitioners use cross-validation procedures to select such a parameter, data-driven procedures not based on cross-validation are rarely used. The choice of the kernel, a seemingly unrelated issue, is also important for good predictive performance: several techniques exist, based either on cross-validation, Gaussian processes or multiple kernel learning [3, 4, 5].
In this paper, we consider least-squares regression and cast these two problems as the problem of selecting among several linear estimators, where the goal is to choose an estimator with a quadratic risk as small as possible. This problem includes for instance model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning (see Section 2). The main contribution of the paper is to extend the notion of minimal penalty [6, 7] to all discrete classes of linear operators, and to use it to define a fully data-driven selection algorithm satisfying a non-asymptotic oracle inequality. Our new theoretical results, presented in Section 4, extend similar results which were limited to unregularized least-squares regression (i.e., projection operators). Finally, in Section 5, we show that our algorithm improves the performance of classical selection procedures, such as GCV [8] and 10-fold cross-validation, for kernel ridge regression and multiple kernel learning, for moderate values of the sample size.
2 Linear estimators

In this section, we define the problem we aim to solve and give several examples of linear estimators.

2.1 Framework and notation

Let us assume that one observes $Y_i = f(x_i) + \varepsilon_i \in \mathbb{R}$ for $i = 1, \dots, n$, where $\varepsilon_1, \dots, \varepsilon_n$ are i.i.d. centered random variables with $\mathbb{E}[\varepsilon_i^2] = \sigma^2$ unknown, $f$ is an unknown measurable function $\mathcal{X} \to \mathbb{R}$ and $x_1, \dots, x_n \in \mathcal{X}$ are deterministic design points. No assumption is made on the set $\mathcal{X}$. The goal is to reconstruct the signal $F = (f(x_i))_{1 \le i \le n} \in \mathbb{R}^n$, with some estimator $\hat{F} \in \mathbb{R}^n$, depending only on $(x_1, Y_1), \dots, (x_n, Y_n)$, and having a small quadratic risk $\frac{1}{n}\|\hat{F} - F\|_2^2$, where for $t \in \mathbb{R}^n$, we denote by $\|t\|_2$ the $\ell_2$-norm of $t$, defined as $\|t\|_2^2 := \sum_{i=1}^n t_i^2$.

In this paper, we focus on linear estimators $\hat{F}$ that can be written as a linear function of $Y = (Y_1, \dots, Y_n) \in \mathbb{R}^n$, that is, $\hat{F} = AY$, for some (deterministic) $n \times n$ matrix $A$. Here and in the rest of the paper, vectors such as $Y$ or $\hat{F}$ are assumed to be column vectors. We present in Section 2.2 several important families of estimators of this form. The matrix $A$ may depend on $x_1, \dots, x_n$ (which are known and deterministic), but not on $Y$, and may be parameterized by certain quantities, usually a regularization parameter or kernel combination weights.

2.2 Examples of linear estimators

In this paper, our theoretical results apply to matrices $A$ which are symmetric positive semi-definite, such as the ones defined below.

Ordinary least-squares regression / model selection. If we consider linear predictors from a design matrix $X \in \mathbb{R}^{n \times p}$, then $\hat{F} = AY$ with $A = X(X^\top X)^{-1}X^\top$, which is a projection matrix (i.e., $A^\top A = A$); $\hat{F} = AY$ is often called a projection estimator. In the variable selection setting, one wants to select a subset $J \subset \{1, \dots, p\}$, and the matrices $A$ are parameterized by $J$.

Kernel ridge regression / spline smoothing. We assume that a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is given, and we are looking for a function $f : \mathcal{X} \to \mathbb{R}$ in the associated reproducing kernel Hilbert space (RKHS) $\mathcal{F}$, with norm $\|\cdot\|_{\mathcal{F}}$. If $K$ denotes the $n \times n$ kernel matrix, defined by $K_{ab} = k(x_a, x_b)$, then the ridge regression estimator (a.k.a.
spline smoothing estimator for spline kernels [9]) is obtained by minimizing with respect to $f \in \mathcal{F}$:

$$\frac{1}{n}\sum_{i=1}^n (Y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{F}}^2.$$

The unique solution is equal to $\hat{f} = \sum_{i=1}^n \alpha_i k(\cdot, x_i)$, where $\alpha = (K + n\lambda I_n)^{-1}Y$. This leads to the smoothing matrix $A_\lambda = K(K + n\lambda I_n)^{-1}$, parameterized by the regularization parameter $\lambda \in \mathbb{R}_+$.

Multiple kernel learning / Group Lasso / Lasso. We now assume that we have $p$ different kernels $k_j$, feature spaces $\mathcal{F}_j$ and feature maps $\Phi_j : \mathcal{X} \to \mathcal{F}_j$, $j = 1, \dots, p$. The group Lasso [10] and multiple kernel learning [11, 5] frameworks consider the following objective function:

$$J(f_1, \dots, f_p) = \frac{1}{n}\sum_{i=1}^n \Big(y_i - \sum_{j=1}^p \langle f_j, \Phi_j(x_i)\rangle\Big)^2 + 2\lambda \sum_{j=1}^p \|f_j\|_{\mathcal{F}_j} = L(f_1, \dots, f_p) + 2\lambda \sum_{j=1}^p \|f_j\|_{\mathcal{F}_j}.$$

Note that when $\Phi_j(x)$ is simply the $j$-th coordinate of $x \in \mathbb{R}^p$, we get back the penalization by the $\ell_1$-norm and thus the regular Lasso [12].

Using $a^{1/2} = \min_{b \ge 0} \frac{1}{2}\big\{\frac{a}{b} + b\big\}$, we obtain a variational formulation of the sum of norms:

$$2\sum_{j=1}^p \|f_j\| = \min_{\eta \in \mathbb{R}_+^p} \sum_{j=1}^p \Big\{\frac{\|f_j\|^2}{\eta_j} + \eta_j\Big\}.$$

Thus, minimizing $J(f_1, \dots, f_p)$ with respect to $(f_1, \dots, f_p)$ is equivalent to minimizing with respect to $\eta \in \mathbb{R}_+^p$ (see [5] for more details):

$$\min_{f_1, \dots, f_p}\Big\{ L(f_1, \dots, f_p) + \lambda \sum_{j=1}^p \frac{\|f_j\|^2}{\eta_j}\Big\} + \lambda \sum_{j=1}^p \eta_j = \lambda\, y^\top \Big(\sum_{j=1}^p \eta_j K_j + n\lambda I_n\Big)^{-1} y + \lambda \sum_{j=1}^p \eta_j,$$
where $I_n$ is the $n \times n$ identity matrix. Moreover, given $\eta$, this leads to a smoothing matrix of the form

$$A_{\eta,\lambda} = \Big(\sum_{j=1}^p \eta_j K_j\Big)\Big(\sum_{j=1}^p \eta_j K_j + n\lambda I_n\Big)^{-1}, \quad (1)$$

parameterized by the regularization parameter $\lambda \in \mathbb{R}_+$ and the kernel combination weights in $\mathbb{R}_+^p$; note that it depends only on $\lambda^{-1}\eta$, which can be grouped in a single parameter in $\mathbb{R}_+^p$. Thus, the Lasso/group Lasso can be seen as particular (convex) ways of optimizing over $\eta$. In this paper, we propose a non-convex alternative with better statistical properties (oracle inequality in Theorem 1). Note that in our setting, finding the solution of the problem is hard in general since the optimization is not convex. However, while the model selection problem is by nature combinatorial, our optimization problems for multiple kernels are all differentiable and are thus amenable to gradient descent procedures, which only find local optima.

Non-symmetric linear estimators. Other linear estimators are commonly used, such as nearest-neighbor regression or the Nadaraya-Watson estimator [13]; those however lead to non-symmetric matrices $A$, and are not entirely covered by our theoretical results.

3 Linear estimator selection

In this section, we first describe the statistical framework of linear estimator selection and introduce the notion of minimal penalty.

3.1 Unbiased risk estimation heuristics

Usually, several estimators of the form $\hat{F} = AY$ can be used. The problem that we consider in this paper is then to select one of them, that is, to choose a matrix $A$. Let us assume that a family of matrices $(A_\lambda)_{\lambda \in \Lambda}$ is given (examples are shown in Section 2.2), hence a family of estimators $(\hat{F}_\lambda)_{\lambda \in \Lambda}$ can be used, with $\hat{F}_\lambda := A_\lambda Y$. The goal is to choose from data some $\hat{\lambda} \in \Lambda$, so that the quadratic risk of $\hat{F}_{\hat{\lambda}}$ is as small as possible. The best choice would be the oracle:

$$\lambda^\star \in \arg\min_{\lambda \in \Lambda}\Big\{\frac{1}{n}\|\hat{F}_\lambda - F\|_2^2\Big\},$$

which cannot be used since it depends on the unknown signal $F$.
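For concreteness, the oracle can be computed on synthetic data where $F$ is known. The following is a minimal sketch for the kernel ridge family of Section 2.2; the design, signal, kernel and noise level are our own illustrative choices, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 50, 0.5

# Synthetic design and signal (illustrative choices only)
x = np.sort(rng.uniform(-1.0, 1.0, n))
F = np.sin(3 * x)                          # true signal F = (f(x_i))_i
Y = F + sigma * rng.standard_normal(n)     # observations Y_i = f(x_i) + eps_i

# Kernel matrix for k(u, v) = exp(-|u - v|)
K = np.exp(-np.abs(x[:, None] - x[None, :]))

def smoothing_matrix(lam):
    """A_lambda = K (K + n*lambda*I_n)^{-1}: the kernel ridge linear estimator."""
    return K @ np.linalg.inv(K + n * lam * np.eye(n))

lambdas = np.logspace(-6, 1, 30)
risks = [np.mean((smoothing_matrix(lam) @ Y - F) ** 2) for lam in lambdas]

# The oracle minimizes the true risk -- computable here only because F is known
lam_oracle = lambdas[int(np.argmin(risks))]
print(lam_oracle)
```

Each $A_\lambda$ here is symmetric with spectrum in $[0, 1)$, as required by the theory below; the oracle beats both the near-interpolating and the heavily smoothed ends of the grid.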
Therefore, the goal is to define a data-driven $\hat{\lambda}$ satisfying an oracle inequality

$$\frac{1}{n}\|\hat{F}_{\hat{\lambda}} - F\|_2^2 \le C_n \inf_{\lambda \in \Lambda}\Big\{\frac{1}{n}\|\hat{F}_\lambda - F\|_2^2\Big\} + R_n, \quad (2)$$

with large probability, where the leading constant $C_n$ should be close to 1 (at least for large $n$) and the remainder term $R_n$ should be negligible compared to the risk of the oracle.

Many classical selection methods are built upon the unbiased risk estimation heuristics: if $\hat{\lambda}$ minimizes a criterion $\mathrm{crit}(\lambda)$ such that

$$\forall \lambda \in \Lambda, \quad \mathbb{E}[\mathrm{crit}(\lambda)] \approx \mathbb{E}\Big[\frac{1}{n}\|\hat{F}_\lambda - F\|_2^2\Big],$$

then $\hat{\lambda}$ satisfies an oracle inequality such as Eq. (2) with large probability. For instance, cross-validation [14, 15] and generalized cross-validation (GCV) [8] are built upon this heuristics. One way of implementing this heuristics is penalization, which consists in minimizing the sum of the empirical risk and a penalty term, i.e., using a criterion of the form:

$$\mathrm{crit}(\lambda) = \frac{1}{n}\|\hat{F}_\lambda - Y\|_2^2 + \mathrm{pen}(\lambda).$$

The unbiased risk estimation heuristics, also called Mallows' heuristics, then leads to the ideal (deterministic) penalty

$$\mathrm{pen}_{\mathrm{id}}(\lambda) := \mathbb{E}\Big[\frac{1}{n}\|\hat{F}_\lambda - F\|_2^2\Big] - \mathbb{E}\Big[\frac{1}{n}\|\hat{F}_\lambda - Y\|_2^2\Big].$$
When $\hat{F}_\lambda = A_\lambda Y$, we have:

$$\|\hat{F}_\lambda - F\|_2^2 = \|(A_\lambda - I_n)F\|_2^2 + \|A_\lambda \varepsilon\|_2^2 + 2\,\langle A_\lambda \varepsilon, (A_\lambda - I_n)F\rangle, \quad (3)$$

$$\|\hat{F}_\lambda - Y\|_2^2 = \|\hat{F}_\lambda - F\|_2^2 + \|\varepsilon\|_2^2 - 2\,\langle \varepsilon, A_\lambda \varepsilon\rangle + 2\,\langle \varepsilon, (I_n - A_\lambda)F\rangle, \quad (4)$$

where $\varepsilon = Y - F \in \mathbb{R}^n$ and for $t, u \in \mathbb{R}^n$, $\langle t, u\rangle = \sum_{i=1}^n t_i u_i$. Since $\varepsilon$ is centered with covariance matrix $\sigma^2 I_n$, Eq. (3) and Eq. (4) imply that

$$\mathrm{pen}_{\mathrm{id}}(\lambda) = \frac{2\sigma^2 \operatorname{tr}(A_\lambda)}{n}, \quad (5)$$

up to the additive term $-\mathbb{E}\big[\frac{1}{n}\|\varepsilon\|_2^2\big] = -\sigma^2$, which can be dropped since it does not vary with $\lambda$. Note that $\mathrm{df}(\lambda) = \operatorname{tr}(A_\lambda)$ is called the effective dimensionality or degrees of freedom [16], so that the ideal penalty in Eq. (5) is proportional to the dimensionality associated with the matrix $A_\lambda$; for projection matrices, we get back the dimension of the subspace, which is classical in model selection.

The expression of the ideal penalty in Eq. (5) led to several selection procedures, in particular Mallows' $C_L$ (called $C_p$ in the case of projection estimators) [17], where $\sigma^2$ is replaced by some estimator $\hat{\sigma}^2$. The estimator of $\sigma^2$ usually used with $C_L$ is based upon the value of the empirical risk at some $\lambda_0$ with $\mathrm{df}(\lambda_0)$ large; it has the drawback of overestimating the risk, in a way which depends on $\lambda_0$ and $F$ [18]. GCV, which implicitly estimates $\sigma^2$, has the drawback of overfitting if the family $(A_\lambda)_{\lambda \in \Lambda}$ contains a matrix too close to $I_n$ [19]; GCV also overestimates the risk even more than $C_L$ for most $A_\lambda$ (see (7.9) and Table 4 in [18]).

In this paper, we define an estimator of $\sigma^2$ directly related to the selection task which does not have similar drawbacks. Our estimator relies on the concept of minimal penalty, introduced by Birgé and Massart [6] and further studied in [7].

3.2 Minimal and optimal penalties

We deduce from Eq. (3) the bias-variance decomposition of the risk:

$$\mathbb{E}\Big[\frac{1}{n}\|\hat{F}_\lambda - F\|_2^2\Big] = \frac{1}{n}\|(A_\lambda - I_n)F\|_2^2 + \frac{\operatorname{tr}(A_\lambda^\top A_\lambda)\,\sigma^2}{n} = \text{bias} + \text{variance}, \quad (6)$$

and from Eq. (4) the expectation of the empirical risk:

$$\mathbb{E}\Big[\frac{1}{n}\big(\|\hat{F}_\lambda - Y\|_2^2 - \|\varepsilon\|_2^2\big)\Big] = \frac{1}{n}\|(A_\lambda - I_n)F\|_2^2 - \frac{\big(2\operatorname{tr}(A_\lambda) - \operatorname{tr}(A_\lambda^\top A_\lambda)\big)\sigma^2}{n}. \quad (7)$$

Note that the variance term in Eq. (6) is not proportional to the effective dimensionality $\mathrm{df}(\lambda) = \operatorname{tr}(A_\lambda)$ but to $\operatorname{tr}(A_\lambda^\top A_\lambda)$.
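With $\sigma^2$ known, the ideal penalty of Eq. (5) gives a Mallows-type selection rule directly. The following is a minimal sketch on a toy pair of candidate matrices; the function name and the two candidates are our illustrative choices, not the paper's code.

```python
import numpy as np

def mallows_cl(Y, A_list, sigma2):
    """Select the index minimizing the empirical risk plus the ideal
    penalty pen_id(lambda) = 2 * sigma2 * tr(A_lambda) / n of Eq. (5)."""
    n = len(Y)
    crits = []
    for A in A_list:
        emp_risk = np.mean((A @ Y - Y) ** 2)
        pen = 2 * sigma2 * np.trace(A) / n
        crits.append(emp_risk + pen)
    return int(np.argmin(crits)), crits

# Tiny example: pure-noise data, so the constant fit should beat interpolation
rng = np.random.default_rng(1)
n = 40
Y = rng.standard_normal(n)
A_mean = np.full((n, n), 1.0 / n)   # projection on constants, tr(A) = 1
A_id = np.eye(n)                    # interpolation, tr(A) = n
best, crits = mallows_cl(Y, [A_mean, A_id], sigma2=1.0)
print(best)  # -> 0
```

Interpolation has zero empirical risk but pays the full penalty $2\sigma^2$, so the heavily penalized identity matrix loses to the one-dimensional projection, as the heuristics predicts.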
Although several papers argue that these terms are of the same order (for instance, they are equal when $A_\lambda$ is a projection matrix), this may not hold in general. If $A_\lambda$ is symmetric with a spectrum $\mathrm{Sp}(A_\lambda) \subset [0, 1]$, as in all the examples of Section 2.2, we only have

$$0 \le \operatorname{tr}(A_\lambda^\top A_\lambda) \le \operatorname{tr}(A_\lambda) \le 2\operatorname{tr}(A_\lambda) - \operatorname{tr}(A_\lambda^\top A_\lambda) \le 2\operatorname{tr}(A_\lambda). \quad (8)$$

In order to give a first intuitive interpretation of Eq. (6) and Eq. (7), let us consider the kernel ridge regression example and assume that the risk and the empirical risk behave as their expectations in Eq. (6) and Eq. (7); see also Fig. 1. Completely rigorous arguments based upon concentration inequalities are developed in [20] and summarized in Section 4, leading to the same conclusion as the present informal reasoning.

First, as proved in [20], the bias $\frac{1}{n}\|(A_\lambda - I_n)F\|_2^2$ is a decreasing function of the dimensionality $\mathrm{df}(\lambda) = \operatorname{tr}(A_\lambda)$, and the variance $\operatorname{tr}(A_\lambda^\top A_\lambda)\sigma^2 n^{-1}$ is an increasing function of $\mathrm{df}(\lambda)$, as is $2\operatorname{tr}(A_\lambda) - \operatorname{tr}(A_\lambda^\top A_\lambda)$. Therefore, Eq. (6) shows that the optimal $\lambda$ realizes the best trade-off between bias (which decreases with $\mathrm{df}(\lambda)$) and variance (which increases with $\mathrm{df}(\lambda)$), which is a classical fact in model selection.

Second, the expectation of the empirical risk in Eq. (7) can be decomposed into the bias and a negative variance term which is the opposite of

$$\mathrm{pen}_{\min}(\lambda) := \frac{\big(2\operatorname{tr}(A_\lambda) - \operatorname{tr}(A_\lambda^\top A_\lambda)\big)\sigma^2}{n}. \quad (9)$$
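The chain of inequalities in Eq. (8) and the ordering between the minimal penalty of Eq. (9) and the ideal penalty of Eq. (5) are easy to check numerically on a kernel ridge smoother; an illustrative sketch (the kernel, sample size and parameter values are our own choices):

```python
import numpy as np

n, lam, sigma2 = 30, 0.1, 1.0
x = np.linspace(0.0, 1.0, n)
K = np.exp(-np.abs(x[:, None] - x[None, :]))    # exponential kernel
A = K @ np.linalg.inv(K + n * lam * np.eye(n))  # A_lambda, Sp(A) in [0, 1)

tr, tr2 = np.trace(A), np.trace(A.T @ A)
# Eq. (8): 0 <= tr(A^T A) <= tr(A) <= 2 tr(A) - tr(A^T A) <= 2 tr(A)
print(0 <= tr2 <= tr <= 2 * tr - tr2 <= 2 * tr)   # -> True

pen_min = (2 * tr - tr2) * sigma2 / n           # Eq. (9)
pen_id = 2 * tr * sigma2 / n                    # Eq. (5)
print(pen_min <= pen_id <= 2 * pen_min)         # -> True, i.e. ratio in (1, 2]
```

For a projection matrix the two traces coincide and both inequalities collapse to equalities, recovering the factor 2 of the slope heuristics discussed below.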
Figure 1: Bias-variance decomposition of the generalization error, and minimal/optimal penalties (x-axis: degrees of freedom $\operatorname{tr}(A)$; curves: bias, variance $\sigma^2\operatorname{tr}(A^\top A)$, generalization error, empirical error).

As suggested by the notation $\mathrm{pen}_{\min}$, we will show it is a minimal penalty in the following sense. If

$$\forall C \ge 0, \quad \hat{\lambda}_{\min}(C) \in \arg\min_{\lambda \in \Lambda}\Big\{\frac{1}{n}\|\hat{F}_\lambda - Y\|_2^2 + C\,\mathrm{pen}_{\min}(\lambda)\Big\},$$

then, up to concentration inequalities that are detailed in Section 4.2, $\hat{\lambda}_{\min}(C)$ behaves like a minimizer of

$$g_C(\lambda) = \mathbb{E}\Big[\frac{1}{n}\|\hat{F}_\lambda - Y\|_2^2\Big] + C\,\mathrm{pen}_{\min}(\lambda) - \sigma^2 = \frac{1}{n}\|(A_\lambda - I_n)F\|_2^2 + (C - 1)\,\mathrm{pen}_{\min}(\lambda).$$

Therefore, two main cases can be distinguished:
- if $C < 1$, then $g_C(\lambda)$ decreases with $\mathrm{df}(\lambda)$, so that $\mathrm{df}(\hat{\lambda}_{\min}(C))$ is huge: $\hat{\lambda}_{\min}(C)$ overfits;
- if $C > 1$, then $g_C(\lambda)$ increases with $\mathrm{df}(\lambda)$ when $\mathrm{df}(\lambda)$ is large enough, so that $\mathrm{df}(\hat{\lambda}_{\min}(C))$ is much smaller than when $C < 1$.

As a conclusion, $\mathrm{pen}_{\min}(\lambda)$ is the minimal amount of penalization needed so that a minimizer $\hat{\lambda}$ of a penalized criterion is not clearly overfitting.

Following an idea first proposed in [6] and further analyzed or used in several other papers such as [21, 7, 22], we now propose to use the fact that $\mathrm{pen}_{\min}(\lambda)$ is a minimal penalty for estimating $\sigma^2$, and to plug this estimator into Eq. (5). This leads to the algorithm described in Section 4.1. Note that the minimal penalty given by Eq. (9) is new; it generalizes previous results [6, 7] where $\mathrm{pen}_{\min}(A_\lambda) = \operatorname{tr}(A_\lambda)\sigma^2/n$ because all $A_\lambda$ were assumed to be projection matrices, i.e., $A_\lambda^\top A_\lambda = A_\lambda$. Furthermore, our results generalize the slope heuristics $\mathrm{pen}_{\mathrm{id}} \approx 2\,\mathrm{pen}_{\min}$ (only valid for projection estimators [6, 7]) to general linear estimators, for which $\mathrm{pen}_{\mathrm{id}}/\mathrm{pen}_{\min} \in (1, 2]$.

4 Main results

In this section, we first describe our algorithm and then present our theoretical results.

4.1 Algorithm

The following algorithm first computes an estimator $\hat{C}$ of $\sigma^2$ using the minimal penalty in Eq. (9), then uses the ideal penalty in Eq. (5) for selecting $\lambda$.

Input: $\Lambda$ a finite set with $\mathrm{Card}(\Lambda) \le K n^\alpha$ for some $K, \alpha \ge 0$, and matrices $A_\lambda$.
1. For every $C > 0$, compute $\hat{\lambda}_0(C) \in \arg\min_{\lambda \in \Lambda}\big\{\frac{1}{n}\|\hat{F}_\lambda - Y\|_2^2 + \frac{C}{n}\big(2\operatorname{tr}(A_\lambda) - \operatorname{tr}(A_\lambda^\top A_\lambda)\big)\big\}$.
2. Find $\hat{C}$ such that $\mathrm{df}(\hat{\lambda}_0(\hat{C})) \in [n^{3/4}, n/10]$.
3. Select $\hat{\lambda} \in \arg\min_{\lambda \in \Lambda}\big\{\frac{1}{n}\|\hat{F}_\lambda - Y\|_2^2 + \frac{2\hat{C}}{n}\operatorname{tr}(A_\lambda)\big\}$.

In steps 1 and 2 of the above algorithm, in practice, a grid in log-scale is used, and our theoretical results from the next section suggest to use a step-size of order $n^{-1/4}$. Note that it may not be
possible in all cases to find a $\hat{C}$ such that $\mathrm{df}(\hat{\lambda}_0(\hat{C})) \in [n^{3/4}, n/10]$; therefore, our condition in step 2 could be relaxed to finding a $\hat{C}$ such that for all $C > \hat{C} + \delta$, $\mathrm{df}(\hat{\lambda}_0(C)) < n^{3/4}$ and for all $C < \hat{C} - \delta$, $\mathrm{df}(\hat{\lambda}_0(C)) > n/10$, with $\delta = n^{-1/4+\xi}$, where $\xi > 0$ is a small constant. Alternatively, using the same grid in log-scale, we can select $\hat{C}$ at the maximal jump between successive values of $\mathrm{df}(\hat{\lambda}_0(C))$; note that our theoretical result then does not entirely hold, as we show the presence of a jump around $\sigma^2$, but do not show the absence of similar jumps elsewhere.

4.2 Oracle inequality

Theorem 1. Let $\hat{C}$ and $\hat{\lambda}$ be defined as in the algorithm of Section 4.1, with $\mathrm{Card}(\Lambda) \le K n^\alpha$ for some $K, \alpha \ge 0$. Assume that for every $\lambda \in \Lambda$, $A_\lambda$ is symmetric with $\mathrm{Sp}(A_\lambda) \subset [0, 1]$, that the $\varepsilon_i$ are i.i.d. Gaussian with variance $\sigma^2 > 0$, and that there exist $\lambda_1, \lambda_2 \in \Lambda$ with

$$\mathrm{df}(\lambda_1) \ge \frac{n}{2}, \quad \mathrm{df}(\lambda_2) \le \sqrt{n}, \quad \text{and} \quad \forall i \in \{1, 2\}, \ \frac{1}{n}\|(A_{\lambda_i} - I_n)F\|_2^2 \le \sigma^2\sqrt{\frac{\ln(n)}{n}}. \quad (A_{1\text{-}2})$$

Then, a numerical constant $C_a$ and an event of probability at least $1 - 8Kn^{-2}$ exist on which, for every $n \ge C_a$,

$$\Big(1 - 91(\alpha + 2)\sqrt{\frac{\ln(n)}{n}}\Big)\sigma^2 \le \hat{C} \le \Big(1 + 44(\alpha + 2)\frac{\ln(n)}{n^{1/4}}\Big)\sigma^2. \quad (10)$$

Furthermore, if there exists $\kappa \ge 1$ such that

$$\forall \lambda \in \Lambda, \quad \frac{1}{n}\operatorname{tr}(A_\lambda)\sigma^2 \le \kappa\,\mathbb{E}\Big[\frac{1}{n}\|\hat{F}_\lambda - F\|_2^2\Big], \quad (A_3)$$

then a constant $C_b$ depending only on $\kappa$ exists such that for every $n \ge C_b$, on the same event,

$$\frac{1}{n}\|\hat{F}_{\hat{\lambda}} - F\|_2^2 \le \big(1 + 40\kappa\,\ln(n)^{-1/4}\big)\inf_{\lambda \in \Lambda}\Big\{\frac{1}{n}\|\hat{F}_\lambda - F\|_2^2\Big\} + \frac{36(\kappa + \alpha + 2)\ln(n)\sigma^2}{n}. \quad (11)$$

Theorem 1 is proved in [20]. The proof mainly follows from the informal arguments developed in Section 3.2, completed with the following two concentration inequalities: if $\xi \in \mathbb{R}^n$ is a standard Gaussian random vector, $\alpha \in \mathbb{R}^n$ and $M$ is a real-valued matrix, then for every $x \ge 0$,

$$\mathbb{P}\big(|\langle \alpha, \xi\rangle| \le \sqrt{2x}\,\|\alpha\|_2\big) \ge 1 - 2e^{-x}, \quad (12)$$

$$\forall \theta > 0, \quad \mathbb{P}\big(\big|\|M\xi\|_2^2 - \operatorname{tr}(M^\top M)\big| \le \theta\operatorname{tr}(M^\top M) + 2(1 + \theta^{-1})\|M\|^2 x\big) \ge 1 - 2e^{-x}, \quad (13)$$

where $\|M\|$ is the operator norm of $M$. A proof of Eq. (12) and Eq. (13) can be found in [20].

4.3 Discussion of the assumptions of Theorem 1

Gaussian noise. When $\varepsilon$ is sub-Gaussian, Eq. (12) and Eq. (13) can be proved for $\xi = \sigma^{-1}\varepsilon$ at the price of additional technicalities, which implies that Theorem 1 is still valid.

Symmetry. The assumption that the matrices $A_\lambda$ must be symmetric can certainly be relaxed, since it is only used for deriving from Eq.
(13) a concentration inequality for $\langle A_\lambda \xi, \xi\rangle$. Note that $\mathrm{Sp}(A_\lambda) \subset [0, 1]$ barely is an assumption, since it means that $A_\lambda$ actually shrinks $Y$.

Assumption $(A_{1\text{-}2})$. $(A_{1\text{-}2})$ holds if $\max_{\lambda \in \Lambda}\{\mathrm{df}(\lambda)\} \ge n/2$ and the bias is smaller than $c\,\mathrm{df}(\lambda)^{-d}$ for some $c, d > 0$, a quite classical assumption in the context of model selection. Besides, $(A_{1\text{-}2})$ is much less restrictive and can even be relaxed, see [20].

Assumption $(A_3)$. The upper bound $(A_3)$ on $\operatorname{tr}(A_\lambda)$ is certainly the strongest assumption of Theorem 1, but it is only needed for Eq. (11). According to Eq. (6), $(A_3)$ holds with $\kappa = 1$ when $A_\lambda$ is a projection matrix, since $\operatorname{tr}(A_\lambda^\top A_\lambda) = \operatorname{tr}(A_\lambda)$. In the kernel ridge regression framework, $(A_3)$ holds as soon as the eigenvalues of the kernel matrix $K$ decrease like $j^{-\alpha}$; see [20]. In general, $(A_3)$ means that $\hat{F}_\lambda$ should not have a risk smaller than the parametric convergence rate associated with a model of dimension $\mathrm{df}(\lambda) = \operatorname{tr}(A_\lambda)$. When $(A_3)$ does not hold, selecting among estimators whose risks are below the parametric rate is a rather difficult problem, and it may not be possible to attain the risk of the oracle in general.
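The three steps of the algorithm of Section 4.1 can be sketched as follows. This is a simplified illustration under our own assumptions (jump-based choice of $\hat{C}$ on a log-scale grid, synthetic kernel ridge data, hypothetical variable names), not the authors' experimental code.

```python
import numpy as np

def minimal_penalty_select(Y, A_list, C_grid):
    """Sketch of the algorithm of Section 4.1.
    Step 1: for each C, minimize emp. risk + (C/n)(2 tr A - tr A^T A).
    Step 2: take C_hat at the largest jump of df(lambda_0(C)).
    Step 3: select lambda with the plug-in penalty 2 * C_hat * tr(A) / n."""
    n = len(Y)
    emp = np.array([np.mean((A @ Y - Y) ** 2) for A in A_list])
    tr = np.array([np.trace(A) for A in A_list])
    tr2 = np.array([np.trace(A.T @ A) for A in A_list])

    # Step 1: minimal-penalty path C -> df(lambda_0(C))
    dfs = np.array([tr[np.argmin(emp + C * (2 * tr - tr2) / n)] for C in C_grid])

    # Step 2: the df path typically drops sharply near C = sigma^2;
    # pick C_hat just after the largest single-step drop
    jump = int(np.argmax(dfs[:-1] - dfs[1:]))
    C_hat = C_grid[jump + 1]

    # Step 3: plug C_hat into the ideal penalty of Eq. (5)
    best = int(np.argmin(emp + 2 * C_hat * tr / n))
    return best, C_hat

# Usage on synthetic kernel ridge data (sigma = 0.5, so sigma^2 = 0.25;
# the theory predicts C_hat should land in that vicinity)
rng = np.random.default_rng(3)
n, sigma = 200, 0.5
x = np.sort(rng.uniform(-1.0, 1.0, n))
Y = np.sin(4 * x) + sigma * rng.standard_normal(n)
K = np.exp(-np.abs(x[:, None] - x[None, :]))
A_list = [K @ np.linalg.inv(K + n * lam * np.eye(n))
          for lam in np.logspace(-7, 1, 40)]
best, C_hat = minimal_penalty_select(Y, A_list, C_grid=np.logspace(-3, 2, 200))
print(best, C_hat)
```

The smallest $\lambda$ in the grid nearly interpolates, which provides the overfitting regime needed for the jump; the relaxed step-2 condition of the text corresponds to tolerating a plateau around the drop rather than a single sharp jump.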
Figure 2: Selected degrees of freedom vs. penalty strength $\log(C/\sigma^2)$: note that when penalizing by the minimal penalty, there is a strong jump at $C = \sigma^2$, while when using half the optimal penalty, this is not the case. Left: single kernel case; right: multiple kernel case (curves: optimal/2, minimal (discrete), minimal (continuous)).

Nevertheless, an oracle inequality can still be proved without $(A_3)$, at the price of enlarging $\hat{C}$ slightly and adding a small fraction of $\sigma^2 n^{-1}\operatorname{tr}(A_\lambda)$ to the right-hand side of Eq. (11), see [20]. Enlarging $\hat{C}$ is necessary in general: if $\operatorname{tr}(A_\lambda^\top A_\lambda) \ll \operatorname{tr}(A_\lambda)$ for most $\lambda \in \Lambda$, the minimal penalty is very close to $2\sigma^2 n^{-1}\operatorname{tr}(A_\lambda)$, so that according to Eq. (10), overfitting is likely as soon as $\hat{C}$ underestimates $\sigma^2$, even by a very small amount.

4.4 Main consequences of Theorem 1 and comparison with previous results

Consistent estimation of $\sigma^2$. The first part of Theorem 1 shows that $\hat{C}$ is a consistent estimator of $\sigma^2$ in a general framework and under mild assumptions. Compared to classical estimators of $\sigma^2$, such as the one usually used with Mallows' $C_L$, $\hat{C}$ does not depend on the choice of some model assumed to have almost no bias, which can lead to overestimating $\sigma^2$ by an unknown amount [18].

Oracle inequality. Our algorithm satisfies an oracle inequality with high probability, as shown by Eq. (11): the risk of the selected estimator $\hat{F}_{\hat{\lambda}}$ is close to the risk of the oracle, up to a remainder term which is negligible when the dimensionality $\mathrm{df}(\lambda^\star)$ grows with $n$ faster than $\ln(n)$, a typical situation when the bias is never equal to zero, for instance in kernel ridge regression. Several oracle inequalities have been proved in the statistical literature for Mallows' $C_L$ with a consistent estimator of $\sigma^2$, for instance in [23]. Nevertheless, except for the model selection problem (see [6] and references therein), all previous results were asymptotic, meaning that $n$ is implicitly assumed to be large compared to each parameter of the problem.
This assumption can be problematic for several learning problems, for instance in multiple kernel learning, where the number $p$ of kernels may grow with $n$. On the contrary, Eq. (11) is non-asymptotic, meaning that it holds for every fixed $n$ as soon as the assumptions explicitly made in Theorem 1 are satisfied.

Comparison with other procedures. According to Theorem 1 and previous theoretical results [23, 19], $C_L$, GCV, cross-validation and our algorithm satisfy similar oracle inequalities in various frameworks. This should not lead to the conclusion that these procedures are completely equivalent. Indeed, second-order terms can be large for a given $n$, while they are hidden in asymptotic results and not tightly estimated by non-asymptotic results. As shown by the simulations in Section 5, our algorithm yields statistical performance as good as existing methods, and often quite better. Furthermore, our algorithm never overfits too much, because $\mathrm{df}(\hat{\lambda})$ is by construction smaller than the effective dimensionality of $\hat{\lambda}_0(\hat{C})$ at which the jump occurs. This is a quite interesting property compared, for instance, to GCV, which is likely to overfit if it is not corrected, because GCV minimizes a criterion proportional to the empirical risk.

5 Simulations

Throughout this section, we consider exponential kernels on $\mathbb{R}^d$, $k(x, y) = \prod_{i=1}^d e^{-|x_i - y_i|}$, with the $x_i$'s sampled i.i.d. from a standard multivariate Gaussian. The functions $f$ are then selected randomly as $\sum_{i=1}^m \alpha_i k(\cdot, z_i)$, where both $\alpha$ and $z$ are i.i.d. standard Gaussian (i.e., $f$ belongs to the RKHS).
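The simulation setup above (exponential kernel on $\mathbb{R}^d$, signal in the RKHS, Gaussian design) can be reproduced along the following lines; the values of $m$, $\sigma$ and the seed are our own illustrative choices, since the paper does not specify them here.

```python
import numpy as np

def exp_kernel(X, Z):
    """k(x, y) = prod_i exp(-|x_i - y_i|) = exp(-||x - y||_1)."""
    dist1 = np.abs(X[:, None, :] - Z[None, :, :]).sum(axis=-1)
    return np.exp(-dist1)

def make_data(n=500, d=4, m=10, sigma=0.2, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))      # design: i.i.d. standard Gaussian
    Z = rng.standard_normal((m, d))      # centers z_1, ..., z_m
    alpha = rng.standard_normal(m)
    F = exp_kernel(X, Z) @ alpha         # f = sum_i alpha_i k(., z_i): in the RKHS
    Y = F + sigma * rng.standard_normal(n)
    return X, Y, F

X, Y, F = make_data()
print(X.shape, Y.shape)
```

Because $k(x, x) = 1$, the kernel matrix built from such data has unit diagonal, and the signal has RKHS norm $\alpha^\top k(Z, Z)\alpha$ by construction.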
Figure 3: Comparison of various smoothing parameter selection methods (minimal penalty, GCV, 10-fold cross-validation; MKL+CV and kernel sum in the multiple kernel case) for various numbers of observations, averaged over 20 replications (x-axis: $\log(n)$; y-axis: mean(error / error of oracle) on the left, mean(error / error of Mallows) on the right). Left: single kernel; right: multiple kernels.

Jump. In Figure 2 (left), we consider data $x_i \in \mathbb{R}^6$, $n = 1000$, and study the size of the jump in Figure 2 for kernel ridge regression. With half the optimal penalty (which is used in traditional variable selection for linear regression), we do not get any jump, while with the minimal penalty we always do. In Figure 2 (right), we plot the same curves for the multiple kernel learning problem with two kernels on two different 4-dimensional variables, with similar results. In addition, we show two ways of optimizing over $\lambda \in \Lambda = \mathbb{R}_+^2$: by discrete optimization with $n$ different kernel matrices (a situation covered by Theorem 1), or by continuous optimization with respect to $\eta$ in Eq. (1) by gradient descent (a situation not covered by Theorem 1).

Comparison of estimator selection methods. In Figure 3, we plot model selection results for 20 replications of data ($d = 4$, $n = 500$), comparing GCV [8], our minimal penalty algorithm, and cross-validation methods. In the left part (single kernel), we compare to the oracle (which can be computed because we can enumerate $\Lambda$), and use for cross-validation all possible values of $\lambda$. In the right part (multiple kernel), we compare to the performance of Mallows' $C_L$ when $\sigma^2$ is known (i.e., the penalty in Eq. (5)), and since we cannot enumerate all $\lambda$'s, we use the solution obtained by MKL with CV [5]. We also compare to using our minimal penalty algorithm with the sum of the kernels.

6 Conclusion

A new light on the slope heuristics. Theorem 1 generalizes some results first proved in [6], where all $A_\lambda$ are assumed to be projection matrices, a framework where assumption $(A_3)$ is automatically satisfied.
To this extent, Birgé and Massart's slope heuristics has been modified in a way that sheds a new light on the "magical" factor 2 between the minimal and the optimal penalty, as proved in [6, 7]. Indeed, Theorem 1 shows that for general linear estimators,

$$\frac{\mathrm{pen}_{\mathrm{id}}(\lambda)}{\mathrm{pen}_{\min}(\lambda)} = \frac{2\operatorname{tr}(A_\lambda)}{2\operatorname{tr}(A_\lambda) - \operatorname{tr}(A_\lambda^\top A_\lambda)}, \quad (14)$$

which can take any value in $(1, 2]$ in general; this ratio is only equal to 2 when $\operatorname{tr}(A_\lambda) \approx \operatorname{tr}(A_\lambda^\top A_\lambda)$, hence mostly when $A_\lambda$ is a projection matrix.

Future directions. In the case of projection estimators, the slope heuristics still holds when the design is random and the data are heteroscedastic [7]; we would like to know whether Eq. (14) is still valid for heteroscedastic data with general linear estimators. In addition, the good empirical performance of elbow-heuristics-based algorithms (i.e., based on the sharp variation of a certain quantity around good hyperparameter values) suggests that Theorem 1 can be generalized to many learning frameworks (and potentially to non-linear estimators), probably with small modifications in the algorithm, but always relying on the concept of minimal penalty. Another interesting open problem would be to extend the results of Section 4, where $\mathrm{Card}(\Lambda) \le K n^\alpha$ is assumed, to continuous sets $\Lambda$ such as the ones appearing naturally in kernel ridge regression and multiple kernel learning. We conjecture that Theorem 1 is valid without modification for a "small" continuous $\Lambda$, such as in kernel ridge regression, where taking a grid of size $n$ in log-scale is almost equivalent to taking $\Lambda = \mathbb{R}_+$. On the contrary, in applications such as the Lasso with $p \gg n$ variables, the natural set $\Lambda$ cannot be well covered by a grid of cardinality $n^\alpha$ with $\alpha$ small, and our minimal penalty algorithm and Theorem 1 certainly have to be modified.
References

[1] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[2] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
[3] O. Chapelle and V. Vapnik. Model selection for support vector machines. In Advances in Neural Information Processing Systems (NIPS), 1999.
[4] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[5] F. Bach. Consistency of the group Lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179-1225, 2008.
[6] L. Birgé and P. Massart. Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2):33-73, 2007.
[7] S. Arlot and P. Massart. Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res., 10:245-279, 2009.
[8] P. Craven and G. Wahba. Smoothing noisy data with spline functions. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math., 31(4):377-403, 1978/79.
[9] G. Wahba. Spline Models for Observational Data. SIAM, 1990.
[10] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68(1):49-67, 2006.
[11] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27-72, 2003/04.
[12] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B, 58(1):267-288, 1996.
[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.
[14] D. M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125-127, 1974.
[15] M. Stone. Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B, 36:111-147, 1974.
[16] T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Comput., 17(9):2077-2098, 2005.
[17] C. L. Mallows. Some comments on C_p. Technometrics, 15:661-675, 1973.
[18] B. Efron. How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc., 81(394):461-470, 1986.
[19] Y. Cao and Y. Golubev. On oracle inequalities related to smoothing splines. Math. Methods Statist., 15(4), 2007.
[20] S. Arlot and F. Bach. Data-driven calibration of linear estimators with minimal penalties, September 2009. Long version. arXiv:0909.1884v1.
[21] É. Lebarbier. Detecting multiple change-points in the mean of a Gaussian process by model selection. Signal Process., 85:717-736, 2005.
[22] C. Maugis and B. Michel. Slope heuristics for variable selection and clustering via Gaussian mixtures. Technical Report 6550, INRIA, 2008.
[23] K.-C. Li. Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. Ann. Statist., 15(3):958-975, 1987.
REGRESSION 1 Outlie Liear regressio Regularizatio fuctios Polyomial curve fittig Stochastic gradiet descet for regressio MLE for regressio Step-wise forward regressio Regressio methods Statistical techiques
More informationECON 3150/4150, Spring term Lecture 3
Itroductio Fidig the best fit by regressio Residuals ad R-sq Regressio ad causality Summary ad ext step ECON 3150/4150, Sprig term 2014. Lecture 3 Ragar Nymoe Uiversity of Oslo 21 Jauary 2014 1 / 30 Itroductio
More informationAlgorithms for Clustering
CR2: Statistical Learig & Applicatios Algorithms for Clusterig Lecturer: J. Salmo Scribe: A. Alcolei Settig: give a data set X R p where is the umber of observatio ad p is the umber of features, we wat
More informationRandom Variables, Sampling and Estimation
Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig
More informationProblem Set 4 Due Oct, 12
EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios
More informationLecture 2: Monte Carlo Simulation
STAT/Q SCI 43: Itroductio to Resamplig ethods Sprig 27 Istructor: Ye-Chi Che Lecture 2: ote Carlo Simulatio 2 ote Carlo Itegratio Assume we wat to evaluate the followig itegratio: e x3 dx What ca we do?
More informationMachine Learning Theory (CS 6783)
Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT
More informationSTATISTICS 593C: Spring, Model Selection and Regularization
STATISTICS 593C: Sprig, 27 Model Selectio ad Regularizatio Jo A. Weller Lecture 2 (March 29): Geeral Notatio ad Some Examples Here is some otatio ad termiology that I will try to use (more or less) systematically
More informationQuantile regression with multilayer perceptrons.
Quatile regressio with multilayer perceptros. S.-F. Dimby ad J. Rykiewicz Uiversite Paris 1 - SAMM 90 Rue de Tolbiac, 75013 Paris - Frace Abstract. We cosider oliear quatile regressio ivolvig multilayer
More informationCSE 527, Additional notes on MLE & EM
CSE 57 Lecture Notes: MLE & EM CSE 57, Additioal otes o MLE & EM Based o earlier otes by C. Grat & M. Narasimha Itroductio Last lecture we bega a examiatio of model based clusterig. This lecture will be
More informationECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015
ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],
More informationDiscrete Mathematics for CS Spring 2008 David Wagner Note 22
CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig
More informationOutput Analysis and Run-Length Control
IEOR E4703: Mote Carlo Simulatio Columbia Uiversity c 2017 by Marti Haugh Output Aalysis ad Ru-Legth Cotrol I these otes we describe how the Cetral Limit Theorem ca be used to costruct approximate (1 α%
More informationAlgebra of Least Squares
October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal
More informationQuestions and answers, kernel part
Questios ad aswers, kerel part October 8, 205 Questios. Questio : properties of kerels, PCA, represeter theorem. [2 poits] Let F be a RK defied o some domai X, with feature map φ(x) x X ad reproducig kerel
More informationConvergence of random variables. (telegram style notes) P.J.C. Spreij
Covergece of radom variables (telegram style otes).j.c. Spreij this versio: September 6, 2005 Itroductio As we kow, radom variables are by defiitio measurable fuctios o some uderlyig measurable space
More informationElement sampling: Part 2
Chapter 4 Elemet samplig: Part 2 4.1 Itroductio We ow cosider uequal probability samplig desigs which is very popular i practice. I the uequal probability samplig, we ca improve the efficiecy of the resultig
More informationChapter 6 Sampling Distributions
Chapter 6 Samplig Distributios 1 I most experimets, we have more tha oe measuremet for ay give variable, each measuremet beig associated with oe radomly selected a member of a populatio. Hece we eed to
More informationChapter 3. Strong convergence. 3.1 Definition of almost sure convergence
Chapter 3 Strog covergece As poited out i the Chapter 2, there are multiple ways to defie the otio of covergece of a sequece of radom variables. That chapter defied covergece i probability, covergece i
More informationThe log-behavior of n p(n) and n p(n)/n
Ramauja J. 44 017, 81-99 The log-behavior of p ad p/ William Y.C. Che 1 ad Ke Y. Zheg 1 Ceter for Applied Mathematics Tiaji Uiversity Tiaji 0007, P. R. Chia Ceter for Combiatorics, LPMC Nakai Uivercity
More informationLecture 4. Hw 1 and 2 will be reoped after class for every body. New deadline 4/20 Hw 3 and 4 online (Nima is lead)
Lecture 4 Homework Hw 1 ad 2 will be reoped after class for every body. New deadlie 4/20 Hw 3 ad 4 olie (Nima is lead) Pod-cast lecture o-lie Fial projects Nima will register groups ext week. Email/tell
More informationAdmin REGULARIZATION. Schedule. Midterm 9/29/16. Assignment 5. Midterm next week, due Friday (more on this in 1 min)
Admi Assigmet 5! Starter REGULARIZATION David Kauchak CS 158 Fall 2016 Schedule Midterm ext week, due Friday (more o this i 1 mi Assigmet 6 due Friday before fall break Midterm Dowload from course web
More information1 Inferential Methods for Correlation and Regression Analysis
1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet
More information18.657: Mathematics of Machine Learning
8.657: Mathematics of Machie Learig Lecturer: Philippe Rigollet Lecture 0 Scribe: Ade Forrow Oct. 3, 05 Recall the followig defiitios from last time: Defiitio: A fuctio K : X X R is called a positive symmetric
More informationAgnostic Learning and Concentration Inequalities
ECE901 Sprig 2004 Statistical Regularizatio ad Learig Theory Lecture: 7 Agostic Learig ad Cocetratio Iequalities Lecturer: Rob Nowak Scribe: Aravid Kailas 1 Itroductio 1.1 Motivatio I the last lecture
More informationEconomics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator
Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters
More informationGeometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT
OCTOBER 7, 2016 LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT Geometry of LS We ca thik of y ad the colums of X as members of the -dimesioal Euclidea space R Oe ca
More informationLecture 3: August 31
36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,
More informationMachine Learning Regression I Hamid R. Rabiee [Slides are based on Bishop Book] Spring
Machie Learig Regressio I Hamid R. Rabiee [Slides are based o Bishop Book] Sprig 015 http://ce.sharif.edu/courses/93-94//ce717-1 Liear Regressio Liear regressio: ivolves a respose variable ad a sigle predictor
More informationIntroductory statistics
CM9S: Machie Learig for Bioiformatics Lecture - 03/3/06 Itroductory statistics Lecturer: Sriram Sakararama Scribe: Sriram Sakararama We will provide a overview of statistical iferece focussig o the key
More information(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3
MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special
More informationEECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1
EECS564 Estimatio, Filterig, ad Detectio Hwk 2 Sols. Witer 25 4. Let Z be a sigle observatio havig desity fuctio where. p (z) = (2z + ), z (a) Assumig that is a oradom parameter, fid ad plot the maximum
More informationMath 216A Notes, Week 5
Math 6A Notes, Week 5 Scribe: Ayastassia Sebolt Disclaimer: These otes are ot early as polished (ad quite possibly ot early as correct) as a published paper. Please use them at your ow risk.. Thresholds
More information6.867 Machine learning, lecture 7 (Jaakkola) 1
6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit
More informationECE 901 Lecture 13: Maximum Likelihood Estimation
ECE 90 Lecture 3: Maximum Likelihood Estimatio R. Nowak 5/7/009 The focus of this lecture is to cosider aother approach to learig based o maximum likelihood estimatio. Ulike earlier approaches cosidered
More informationNYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018)
NYU Ceter for Data Sciece: DS-GA 003 Machie Learig ad Computatioal Statistics (Sprig 208) Brett Berstei, David Roseberg, Be Jakubowski Jauary 20, 208 Istructios: Followig most lab ad lecture sectios, we
More informationLecture 19: Convergence
Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may
More informationSupplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate
Supplemetary Material for Fast Stochastic AUC Maximizatio with O/-Covergece Rate Migrui Liu Xiaoxua Zhag Zaiyi Che Xiaoyu Wag 3 iabao Yag echical Lemmas ized versio of Hoeffdig s iequality, ote that We
More informationLecture 3. Properties of Summary Statistics: Sampling Distribution
Lecture 3 Properties of Summary Statistics: Samplig Distributio Mai Theme How ca we use math to justify that our umerical summaries from the sample are good summaries of the populatio? Lecture Summary
More informationChapter 6 Principles of Data Reduction
Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a
More informationCS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5
CS434a/54a: Patter Recogitio Prof. Olga Veksler Lecture 5 Today Itroductio to parameter estimatio Two methods for parameter estimatio Maimum Likelihood Estimatio Bayesia Estimatio Itroducto Bayesia Decisio
More informationEstimation for Complete Data
Estimatio for Complete Data complete data: there is o loss of iformatio durig study. complete idividual complete data= grouped data A complete idividual data is the oe i which the complete iformatio of
More information6.867 Machine learning
6.867 Machie learig Mid-term exam October, ( poits) Your ame ad MIT ID: Problem We are iterested here i a particular -dimesioal liear regressio problem. The dataset correspodig to this problem has examples
More informationProperties and Hypothesis Testing
Chapter 3 Properties ad Hypothesis Testig 3.1 Types of data The regressio techiques developed i previous chapters ca be applied to three differet kids of data. 1. Cross-sectioal data. 2. Time series data.
More informationLecture 3 The Lebesgue Integral
Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified
More informationRates of Convergence by Moduli of Continuity
Rates of Covergece by Moduli of Cotiuity Joh Duchi: Notes for Statistics 300b March, 017 1 Itroductio I this ote, we give a presetatio showig the importace, ad relatioship betwee, the modulis of cotiuity
More informationarxiv: v2 [math.st] 20 Mar 2008
Joural of Machie Learig Research 0 (0000) 0 Submitted 3/08; Published 0/00 Data-drive calibratio of pealties for least-squares regressio arxiv:0802.0837v2 [math.st] 20 Mar 2008 Sylvai Arlot Uiv Paris-Sud,
More informationw (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ.
2 5. Weighted umber of late jobs 5.1. Release dates ad due dates: maximimizig the weight of o-time jobs Oce we add release dates, miimizig the umber of late jobs becomes a sigificatly harder problem. For
More informationMachine Learning Theory Tübingen University, WS 2016/2017 Lecture 3
Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture 3 Tolstikhi Ilya Abstract I this lecture we will prove the VC-boud, which provides a high-probability excess risk boud for the ERM algorithm whe
More informationFACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING. Lectures
FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING Lectures MODULE 5 STATISTICS II. Mea ad stadard error of sample data. Biomial distributio. Normal distributio 4. Samplig 5. Cofidece itervals
More informationInfinite Sequences and Series
Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet
More informationDimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector
Dimesio-free PAC-Bayesia bouds for the estimatio of the mea of a radom vector Olivier Catoi CREST CNRS UMR 9194 Uiversité Paris Saclay olivier.catoi@esae.fr Ilaria Giulii Laboratoire de Probabilités et
More informationCHAPTER 10 INFINITE SEQUENCES AND SERIES
CHAPTER 10 INFINITE SEQUENCES AND SERIES 10.1 Sequeces 10.2 Ifiite Series 10.3 The Itegral Tests 10.4 Compariso Tests 10.5 The Ratio ad Root Tests 10.6 Alteratig Series: Absolute ad Coditioal Covergece
More information5.1 Review of Singular Value Decomposition (SVD)
MGMT 69000: Topics i High-dimesioal Data Aalysis Falll 06 Lecture 5: Spectral Clusterig: Overview (cotd) ad Aalysis Lecturer: Jiamig Xu Scribe: Adarsh Barik, Taotao He, September 3, 06 Outlie Review of
More information62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 +
62. Power series Defiitio 16. (Power series) Give a sequece {c }, the series c x = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + is called a power series i the variable x. The umbers c are called the coefficiets of
More informationChapter 7 Isoperimetric problem
Chapter 7 Isoperimetric problem Recall that the isoperimetric problem (see the itroductio its coectio with ido s proble) is oe of the most classical problem of a shape optimizatio. It ca be formulated
More informationA Note on Adaptive Group Lasso
A Note o Adaptive Group Lasso Hasheg Wag ad Chelei Leg Pekig Uiversity & Natioal Uiversity of Sigapore July 7, 2006. Abstract Group lasso is a atural extesio of lasso ad selects variables i a grouped maer.
More informationBinary classification, Part 1
Biary classificatio, Part 1 Maxim Ragisky September 25, 2014 The problem of biary classificatio ca be stated as follows. We have a radom couple Z = (X,Y ), where X R d is called the feature vector ad Y
More informationMachine Learning Brett Bernstein
Machie Learig Brett Berstei Week Lecture: Cocept Check Exercises Starred problems are optioal. Statistical Learig Theory. Suppose A = Y = R ad X is some other set. Furthermore, assume P X Y is a discrete
More informationRiesz-Fischer Sequences and Lower Frame Bounds
Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 1 (00), No., 305 314 Riesz-Fischer Sequeces ad Lower Frame Bouds P. Casazza, O. Christese, S. Li ad A. Lider Abstract.
More informationarxiv: v1 [math.pr] 13 Oct 2011
A tail iequality for quadratic forms of subgaussia radom vectors Daiel Hsu, Sham M. Kakade,, ad Tog Zhag 3 arxiv:0.84v math.pr] 3 Oct 0 Microsoft Research New Eglad Departmet of Statistics, Wharto School,
More informationRegularization methods for large scale machine learning
Regularizatio methods for large scale machie learig Lorezo Rosasco March 7, 2017 Abstract After recallig a iverse problems perspective o supervised learig, we discuss regularizatio methods for large scale
More informationAdvanced Stochastic Processes.
Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.
More informationLECTURE 2 LEAST SQUARES CROSS-VALIDATION FOR KERNEL DENSITY ESTIMATION
Jauary 3 07 LECTURE LEAST SQUARES CROSS-VALIDATION FOR ERNEL DENSITY ESTIMATION Noparametric kerel estimatio is extremely sesitive to te coice of badwidt as larger values of result i averagig over more
More informationOptimization Methods MIT 2.098/6.255/ Final exam
Optimizatio Methods MIT 2.098/6.255/15.093 Fial exam Date Give: December 19th, 2006 P1. [30 pts] Classify the followig statemets as true or false. All aswers must be well-justified, either through a short
More informationCorrelation Regression
Correlatio Regressio While correlatio methods measure the stregth of a liear relatioship betwee two variables, we might wish to go a little further: How much does oe variable chage for a give chage i aother
More informationLecture 7: Density Estimation: k-nearest Neighbor and Basis Approach
STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.
More informationTopics Machine learning: lecture 2. Review: the learning problem. Hypotheses and estimation. Estimation criterion cont d. Estimation criterion
.87 Machie learig: lecture Tommi S. Jaakkola MIT CSAIL tommi@csail.mit.edu Topics The learig problem hypothesis class, estimatio algorithm loss ad estimatio criterio samplig, empirical ad epected losses
More informationLecture 12: February 28
10-716: Advaced Machie Learig Sprig 2019 Lecture 12: February 28 Lecturer: Pradeep Ravikumar Scribes: Jacob Tyo, Rishub Jai, Ojash Neopae Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:
More informationMachine Learning. Ilya Narsky, Caltech
Machie Learig Ilya Narsky, Caltech Lecture 4 Multi-class problems. Multi-class versios of Neural Networks, Decisio Trees, Support Vector Machies ad AdaBoost. Reductio of a multi-class problem to a set
More informationLecture 22: Review for Exam 2. 1 Basic Model Assumptions (without Gaussian Noise)
Lecture 22: Review for Exam 2 Basic Model Assumptios (without Gaussia Noise) We model oe cotiuous respose variable Y, as a liear fuctio of p umerical predictors, plus oise: Y = β 0 + β X +... β p X p +
More informationLecture 13: Maximum Likelihood Estimation
ECE90 Sprig 007 Statistical Learig Theory Istructor: R. Nowak Lecture 3: Maximum Likelihood Estimatio Summary of Lecture I the last lecture we derived a risk (MSE) boud for regressio problems; i.e., select
More informationAda Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities
CS8B/Stat4B Sprig 008) Statistical Learig Theory Lecture: Ada Boost, Risk Bouds, Cocetratio Iequalities Lecturer: Peter Bartlett Scribe: Subhrasu Maji AdaBoost ad Estimates of Coditioal Probabilities We
More informationSequences and Series of Functions
Chapter 6 Sequeces ad Series of Fuctios 6.1. Covergece of a Sequece of Fuctios Poitwise Covergece. Defiitio 6.1. Let, for each N, fuctio f : A R be defied. If, for each x A, the sequece (f (x)) coverges
More informationHarder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression
Harder, Better, Faster, Stroger Covergece Rates for Least-Squares Regressio Aoymous Author(s) Affiliatio Address email Abstract 1 2 3 4 5 6 We cosider the optimizatio of a quadratic objective fuctio whose
More information