Data-driven calibration of linear estimators with minimal penalties


Sylvain Arlot
CNRS; Willow Project-Team, Laboratoire d'Informatique de l'Ecole Normale Superieure (CNRS/ENS/INRIA UMR 8548), 23, avenue d'Italie, Paris, France
sylvain.arlot@ens.fr

Francis Bach
INRIA; Willow Project-Team, Laboratoire d'Informatique de l'Ecole Normale Superieure (CNRS/ENS/INRIA UMR 8548), 23, avenue d'Italie, Paris, France
francis.bach@ens.fr

Abstract

This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty, which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows' C_L penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly upon existing calibration procedures such as 10-fold cross-validation or generalized cross-validation.

1 Introduction

Kernel-based methods are now well-established tools for supervised learning, allowing one to perform various tasks, such as regression or binary classification, with linear and non-linear predictors [1, 2]. A central issue common to all regularization frameworks is the choice of the regularization parameter: while most practitioners use cross-validation procedures to select such a parameter, data-driven procedures not based on cross-validation are rarely used. The choice of the kernel, a seemingly unrelated issue, is also important for good predictive performance: several techniques exist, either based on cross-validation, Gaussian processes or multiple kernel learning [3, 4, 5].

In this paper, we consider least-squares regression and cast these two problems as the problem of selecting among several linear estimators, where the goal is to choose an estimator with a quadratic risk as small as possible. This problem includes for instance model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning (see Section 2).

The main contribution of the paper is to extend the notion of minimal penalty [6, 7] to all discrete classes of linear operators, and to use it for defining a fully data-driven selection algorithm satisfying a non-asymptotic oracle inequality. Our new theoretical results, presented in Section 4, extend similar results which were limited to unregularized least-squares regression (i.e., projection operators). Finally, in Section 5, we show that our algorithm improves the performance of classical selection procedures, such as GCV [8] and 10-fold cross-validation, for kernel ridge regression and multiple kernel learning, for moderate values of the sample size.

2 Linear estimators

In this section, we define the problem we aim to solve and give several examples of linear estimators.

2.1 Framework and notation

Let us assume that one observes Y_i = f(x_i) + ε_i ∈ R for i = 1, ..., n, where ε_1, ..., ε_n are i.i.d. centered random variables with E[ε_i²] = σ² unknown, f is an unknown measurable function X → R and x_1, ..., x_n ∈ X are deterministic design points. No assumption is made on the set X. The goal is to reconstruct the signal F = (f(x_i))_{1 ≤ i ≤ n} ∈ R^n with some estimator F̂ ∈ R^n, depending only on (x_1, Y_1), ..., (x_n, Y_n), and having a small quadratic risk (1/n)‖F̂ − F‖₂², where for t ∈ R^n we denote by ‖t‖₂ the ℓ₂-norm of t, defined as ‖t‖₂² := Σ_{i=1}^n t_i².

In this paper, we focus on linear estimators F̂ that can be written as a linear function of Y = (Y_1, ..., Y_n) ∈ R^n, that is, F̂ = AY, for some (deterministic) n × n matrix A. Here and in the rest of the paper, vectors such as Y or F̂ are assumed to be column vectors. We present in Section 2.2 several important families of estimators of this form. The matrix A may depend on x_1, ..., x_n (which are known and deterministic), but not on Y, and may be parameterized by certain quantities, usually a regularization parameter or kernel combination weights.

2.2 Examples of linear estimators

In this paper, our theoretical results apply to matrices A which are symmetric positive semi-definite, such as the ones defined below.

Ordinary least-squares regression / model selection. If we consider linear predictors from a design matrix X ∈ R^{n×p}, then F̂ = AY with A = X(XᵀX)^{-1}Xᵀ, which is a projection matrix (i.e., AᵀA = A); F̂ = AY is often called a projection estimator. In the variable selection setting, one wants to select a subset J ⊂ {1, ..., p}, and the matrices A are parameterized by J.

Kernel ridge regression / spline smoothing. We assume that a positive definite kernel k : X × X → R is given, and we are looking for a function f : X → R in the associated reproducing kernel Hilbert space (RKHS) F, with norm ‖·‖_F. If K denotes the n × n kernel matrix, defined by K_ab = k(x_a, x_b), then the ridge regression estimator (a.k.a. spline smoothing estimator for spline kernels [9]) is obtained by minimizing with respect to f ∈ F [2]:

(1/n) Σ_{i=1}^n (Y_i − f(x_i))² + λ ‖f‖²_F.

The unique solution is equal to f̂ = Σ_{i=1}^n α_i k(·, x_i), where α = (K + nλ I_n)^{-1} Y. This leads to the smoothing matrix A_λ = K(K + nλ I_n)^{-1}, parameterized by the regularization parameter λ ∈ R₊.

Multiple kernel learning / Group Lasso / Lasso. We now assume that we have p different kernels k_j, feature spaces F_j and feature maps Φ_j : X → F_j, j = 1, ..., p. The group Lasso [10] and multiple kernel learning [11, 5] frameworks consider the following objective function:

J(f_1, ..., f_p) = (1/n) Σ_{i=1}^n (y_i − Σ_{j=1}^p ⟨f_j, Φ_j(x_i)⟩)² + 2λ Σ_{j=1}^p ‖f_j‖_{F_j} = L(f_1, ..., f_p) + 2λ Σ_{j=1}^p ‖f_j‖_{F_j}.

Note that when Φ_j(x) is simply the j-th coordinate of x ∈ R^p, we get back the penalization by the ℓ₁-norm and thus the regular Lasso [12]. Using 2a^{1/2} = min_{b ≥ 0} {a/b + b}, we obtain a variational formulation of the sum of norms: 2 Σ_{j=1}^p ‖f_j‖ = min_{η ∈ R^p₊} Σ_{j=1}^p {‖f_j‖²/η_j + η_j}. Thus, minimizing J(f_1, ..., f_p) with respect to (f_1, ..., f_p) is equivalent to minimizing with respect to η ∈ R^p₊ (see [5] for more details):

min_{f_1, ..., f_p} L(f_1, ..., f_p) + λ Σ_{j=1}^p ‖f_j‖²_{F_j}/η_j + λ Σ_{j=1}^p η_j = λ yᵀ(Σ_{j=1}^p η_j K_j + nλ I_n)^{-1} y + λ Σ_{j=1}^p η_j,

where I_n is the identity matrix. Moreover, given η, this leads to a smoothing matrix of the form

A_{η,λ} = (Σ_{j=1}^p η_j K_j)(Σ_{j=1}^p η_j K_j + nλ I_n)^{-1},    (1)

parameterized by the regularization parameter λ ∈ R₊ and the kernel combination η ∈ R^p₊; note that A_{η,λ} depends only on λ^{-1}η, which can be grouped in a single parameter in R^p₊.
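To make the two smoothing matrices above concrete, here is a minimal numpy sketch (not taken from the paper; the kernel choice and all names are illustrative assumptions) that builds A_λ = K(K + nλI_n)^{-1} for kernel ridge regression and A_{η,λ} of Eq. (1) for a fixed kernel combination η:

import numpy as np

def exponential_kernel(x):
    # K_ab = prod_i exp(-|x_a,i - x_b,i|), the kernel used again in Section 5.
    return np.exp(-np.abs(x[:, None, :] - x[None, :, :]).sum(axis=-1))

def ridge_smoothing_matrix(K, lam):
    # A_lambda = K (K + n*lambda*I_n)^{-1}; the estimator is F_hat = A_lambda @ Y.
    n = K.shape[0]
    return K @ np.linalg.inv(K + n * lam * np.eye(n))

def mkl_smoothing_matrix(kernels, eta, lam):
    # Eq. (1): A_{eta,lambda} = (sum_j eta_j K_j)(sum_j eta_j K_j + n*lambda*I_n)^{-1}.
    K = sum(e * Kj for e, Kj in zip(eta, kernels))
    return ridge_smoothing_matrix(K, lam)

The degrees of freedom df(λ) = tr(A_λ) used throughout the rest of the paper is then simply np.trace(A).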

Thus, the Lasso/group Lasso can be seen as particular (convex) ways of optimizing over η. In this paper, we propose a non-convex alternative with better statistical properties (oracle inequality in Theorem 1). Note that in our setting, finding the solution of the problem is hard in general since the optimization is not convex. However, while the model selection problem is by nature combinatorial, our optimization problems for multiple kernels are all differentiable and are thus amenable to gradient descent procedures, which only find local optima.

Non-symmetric linear estimators. Other linear estimators are commonly used, such as nearest-neighbor regression or the Nadaraya-Watson estimator [13]; those however lead to non-symmetric matrices A, and are not entirely covered by our theoretical results.

3 Linear estimator selection

In this section, we first describe the statistical framework of linear estimator selection and introduce the notion of minimal penalty.

3.1 Unbiased risk estimation heuristics

Usually, several estimators of the form F̂ = AY can be used. The problem that we consider in this paper is then to select one of them, that is, to choose a matrix A. Let us assume that a family of matrices (A_λ)_{λ∈Λ} is given (examples are shown in Section 2.2), hence a family of estimators (F̂_λ)_{λ∈Λ} can be used, with F̂_λ := A_λ Y. The goal is to choose from data some λ̂ ∈ Λ, so that the quadratic risk of F̂_λ̂ is as small as possible. The best choice would be the oracle

λ* ∈ arg min_{λ∈Λ} { (1/n) ‖F̂_λ − F‖₂² },

which cannot be used since it depends on the unknown signal F. Therefore, the goal is to define a data-driven λ̂ satisfying an oracle inequality

(1/n) ‖F̂_λ̂ − F‖₂² ≤ C inf_{λ∈Λ} { (1/n) ‖F̂_λ − F‖₂² } + R_n,    (2)

with large probability, where the leading constant C should be close to 1 (at least for large n) and the remainder term R_n should be negligible compared to the risk of the oracle.

Many classical selection methods are built upon the unbiased risk estimation heuristics: if λ̂ minimizes a criterion crit(λ) such that

∀λ ∈ Λ,  E[crit(λ)] ≈ E[(1/n) ‖F̂_λ − F‖₂²],

then λ̂ satisfies an oracle inequality such as Eq. (2) with large probability. For instance, cross-validation [14, 15] and generalized cross-validation (GCV) [8] are built upon this heuristics.

One way of implementing this heuristics is penalization, which consists in minimizing the sum of the empirical risk and a penalty term, i.e., using a criterion of the form

crit(λ) = (1/n) ‖F̂_λ − Y‖₂² + pen(λ).

The unbiased risk estimation heuristics, also called Mallows' heuristics, then leads to the ideal (deterministic) penalty

pen_id(λ) := E[(1/n) ‖F̂_λ − F‖₂²] − E[(1/n) ‖F̂_λ − Y‖₂²].

When F̂_λ = A_λ Y, we have

‖F̂_λ − F‖₂² = ‖(A_λ − I_n)F‖₂² + ‖A_λ ε‖₂² + 2 ⟨A_λ ε, (A_λ − I_n)F⟩,    (3)
‖F̂_λ − Y‖₂² = ‖F̂_λ − F‖₂² + ‖ε‖₂² − 2 ⟨ε, A_λ ε⟩ + 2 ⟨ε, (I_n − A_λ)F⟩,    (4)

where ε = Y − F ∈ R^n and, for t, u ∈ R^n, ⟨t, u⟩ = Σ_{i=1}^n t_i u_i. Since ε is centered with covariance matrix σ² I_n, Eq. (3) and Eq. (4) imply that

pen_id(λ) = 2 σ² tr(A_λ)/n,    (5)

up to the term −E[(1/n)‖ε‖₂²] = −σ², which can be dropped off since it does not vary with λ.
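For a finite family of smoothing matrices, the resulting Mallows' C_L rule (empirical risk plus the ideal penalty of Eq. (5), with σ² assumed known) can be sketched in a few lines; this is illustrative code rather than the paper's implementation, and it reuses the ridge_smoothing_matrix helper defined above. Replacing the known σ² by a data-driven estimate is precisely the subject of Sections 3.2 and 4.

import numpy as np

def mallows_CL_select(A_list, Y, sigma2):
    # crit(lambda) = (1/n)||A_lambda Y - Y||^2 + 2*sigma2*tr(A_lambda)/n, cf. Eq. (5).
    n = len(Y)
    crits = [np.sum((A @ Y - Y) ** 2) / n + 2.0 * sigma2 * np.trace(A) / n
             for A in A_list]
    return int(np.argmin(crits))

# Example family: A_list = [ridge_smoothing_matrix(K, lam) for lam in np.logspace(-6, 2, 50)]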

Note that df(λ) = tr(A_λ) is called the effective dimensionality or degrees of freedom [16], so that the ideal penalty in Eq. (5) is proportional to the dimensionality associated with the matrix A_λ: for projection matrices, we get back the dimension of the subspace, which is classical in model selection.

The expression of the ideal penalty in Eq. (5) led to several selection procedures, in particular Mallows' C_L (called C_p in the case of projection estimators) [17], where σ² is replaced by some estimator σ̂². The estimator of σ² usually used with C_L is based upon the value of the empirical risk at some λ_0 with df(λ_0) large; it has the drawback of overestimating the risk, in a way which depends on λ_0 and F [18]. GCV, which implicitly estimates σ², has the drawback of overfitting if the family (A_λ)_{λ∈Λ} contains a matrix too close to I_n [19]; GCV also overestimates the risk even more than C_L for most A_λ (see (7.9) and Table 4 in [18]).

In this paper, we define an estimator of σ² directly related to the selection task which does not have similar drawbacks. Our estimator relies on the concept of minimal penalty, introduced by Birgé and Massart [6] and further studied in [7].

3.2 Minimal and optimal penalties

We deduce from Eq. (3) the bias-variance decomposition of the risk:

E[(1/n)‖F̂_λ − F‖₂²] = (1/n)‖(A_λ − I_n)F‖₂² + tr(A_λᵀ A_λ) σ²/n = bias + variance,    (6)

and from Eq. (4) the expectation of the empirical risk:

E[(1/n)‖F̂_λ − Y‖₂² − (1/n)‖ε‖₂²] = (1/n)‖(A_λ − I_n)F‖₂² − (2 tr(A_λ) − tr(A_λᵀ A_λ)) σ²/n.    (7)

Note that the variance term in Eq. (6) is not proportional to the effective dimensionality df(λ) = tr(A_λ) but to tr(A_λᵀ A_λ). Although several papers argue that these terms are of the same order (for instance, they are equal when A_λ is a projection matrix), this may not hold in general. If A_λ is symmetric with a spectrum Sp(A_λ) ⊂ [0, 1], as in all the examples of Section 2.2, we only have

0 ≤ tr(A_λᵀ A_λ) ≤ tr(A_λ) ≤ 2 tr(A_λ) − tr(A_λᵀ A_λ) ≤ 2 tr(A_λ).    (8)

In order to give a first intuitive interpretation of Eq. (6) and Eq. (7), let us consider the kernel ridge regression example and assume that the risk and the empirical risk behave as their expectations in Eq. (6) and Eq. (7); see also Fig. 1. Completely rigorous arguments based upon concentration inequalities are developed in [20] and summarized in Section 4, leading to the same conclusion as the present informal reasoning.

First, as proved in [20], the bias (1/n)‖(A_λ − I_n)F‖₂² is a decreasing function of the dimensionality df(λ) = tr(A_λ), and the variance tr(A_λᵀ A_λ) σ²/n is an increasing function of df(λ), as is 2 tr(A_λ) − tr(A_λᵀ A_λ). Therefore, Eq. (6) shows that the optimal λ realizes the best trade-off between bias (which decreases with df(λ)) and variance (which increases with df(λ)), which is a classical fact in model selection.

Second, the expectation of the empirical risk in Eq. (7) can be decomposed into the bias and a negative variance term which is the opposite of

pen_min(λ) := (2 tr(A_λ) − tr(A_λᵀ A_λ)) σ²/n.    (9)
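The two trace quantities driving Eq. (6)-(9) are cheap to compute. The following sketch (illustrative code, reusing the ridge_smoothing_matrix helper from above) evaluates pen_min(λ) of Eq. (9) for a symmetric smoothing matrix, and the ordering of Eq. (8) can be checked numerically on its output:

import numpy as np

def penalty_traces(A):
    # Returns tr(A) and tr(A^T A); for symmetric A these are the sums of eigenvalues
    # and of squared eigenvalues, respectively.
    return np.trace(A), np.trace(A.T @ A)

def pen_min(A, sigma2):
    # Eq. (9): pen_min(lambda) = (2 tr(A_lambda) - tr(A_lambda^T A_lambda)) * sigma2 / n.
    n = A.shape[0]
    trA, trAA = penalty_traces(A)
    return (2.0 * trA - trAA) * sigma2 / n

# For Sp(A) in [0, 1], Eq. (8) states: 0 <= tr(A^T A) <= tr(A) <= 2*tr(A) - tr(A^T A) <= 2*tr(A).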

Figure 1: Bias-variance decomposition of the generalization error, and minimal/optimal penalties. (The figure plots the bias, the variance, the generalization error and the empirical error as functions of the degrees of freedom tr(A).)

As suggested by the notation pen_min, we will show it is a minimal penalty in the following sense. If

∀C ≥ 0,  λ̂_min(C) ∈ arg min_{λ∈Λ} { (1/n)‖F̂_λ − Y‖₂² + C pen_min(λ) },

then, up to concentration inequalities that are detailed in Section 4.2, λ̂_min(C) behaves like a minimizer of

g_C(λ) = E[(1/n)‖F̂_λ − Y‖₂² + C pen_min(λ)] − σ² = (1/n)‖(A_λ − I_n)F‖₂² + (C − 1) pen_min(λ).

Therefore, two main cases can be distinguished:

if C < 1, then g_C(λ) decreases with df(λ), so that df(λ̂_min(C)) is huge: λ̂_min(C) overfits.

if C > 1, then g_C(λ) increases with df(λ) when df(λ) is large enough, so that df(λ̂_min(C)) is much smaller than when C < 1.

As a conclusion, pen_min(λ) is the minimal amount of penalization needed so that a minimizer λ̂ of a penalized criterion is not clearly overfitting.

Following an idea first proposed in [6] and further analyzed or used in several other papers such as [21, 7, 22], we now propose to use the fact that pen_min(λ) is a minimal penalty for estimating σ², and to plug this estimator into Eq. (5). This leads to the algorithm described in Section 4.1. Note that the minimal penalty given by Eq. (9) is new; it generalizes previous results [6, 7] where pen_min(A_λ) = tr(A_λ)σ²/n because all the A_λ were assumed to be projection matrices, i.e., A_λᵀ A_λ = A_λ. Furthermore, our results generalize the slope heuristics pen_id ≈ 2 pen_min (only valid for projection estimators [6, 7]) to general linear estimators, for which pen_id/pen_min ∈ (1, 2].

4 Main results

In this section, we first describe our algorithm and then present our theoretical results.

4.1 Algorithm

The following algorithm first computes an estimator Ĉ of σ² using the minimal penalty in Eq. (9), then considers the ideal penalty in Eq. (5) for selecting λ.

Input: Λ a finite set with Card(Λ) ≤ K n^α for some K, α ≥ 0, and matrices A_λ.
1. For every C > 0, compute λ̂_0(C) ∈ arg min_{λ∈Λ} { (1/n)‖F̂_λ − Y‖₂² + C (2 tr(A_λ) − tr(A_λᵀ A_λ))/n }.
2. Find Ĉ such that df(λ̂_0(Ĉ)) ∈ [n^{3/4}, n/10].
3. Select λ̂ ∈ arg min_{λ∈Λ} { (1/n)‖F̂_λ − Y‖₂² + 2 Ĉ tr(A_λ)/n }.

In steps 1 and 2 of the above algorithm, in practice, a grid in log-scale is used, and our theoretical results from the next section suggest using a step-size of order n^{-1/4}.
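A direct transcription of these three steps in numpy might look as follows. This is a sketch under my own implementation choices (not the paper's code): the log-scale grid C_grid is supplied by the caller, and the maximal-jump variant discussed in the next paragraph is used as a fallback whenever no grid value puts df(λ̂_0(C)) inside [n^{3/4}, n/10].

import numpy as np

def minimal_penalty_select(A_list, Y, C_grid):
    # Steps 1-2: estimate sigma^2 via the minimal penalty of Eq. (9);
    # step 3: plug the estimate into the ideal penalty of Eq. (5).
    n = len(Y)
    emp_risk = np.array([np.sum((A @ Y - Y) ** 2) / n for A in A_list])
    tr = np.array([np.trace(A) for A in A_list])
    tr2 = np.array([np.trace(A.T @ A) for A in A_list])

    # df(lambda_0(C)) for every C on the grid (C_grid should be log-spaced).
    df_of_C = np.array([tr[np.argmin(emp_risk + C * (2 * tr - tr2) / n)]
                        for C in C_grid])

    # Step 2: keep a C whose selected dimensionality lies in [n^{3/4}, n/10];
    # fallback: the C right after the maximal jump of C -> df(lambda_0(C)).
    ok = np.where((df_of_C >= n ** 0.75) & (df_of_C <= n / 10))[0]
    if len(ok) > 0:
        C_hat = C_grid[ok[0]]
    else:
        C_hat = C_grid[np.argmax(-np.diff(df_of_C)) + 1]

    # Step 3: Mallows' C_L penalty with sigma^2 replaced by C_hat.
    lam_hat = int(np.argmin(emp_risk + 2 * C_hat * tr / n))
    return lam_hat, C_hat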

Note that it may not be possible in all cases to find a C such that df(λ̂_0(C)) ∈ [n^{3/4}, n/10]; therefore, our condition in step 2 could be relaxed to finding a Ĉ such that for all C > Ĉ + δ, df(λ̂_0(C)) < n^{3/4} and for all C < Ĉ − δ, df(λ̂_0(C)) > n/10, with δ = n^{-1/4+ξ}, where ξ > 0 is a small constant. Alternatively, using the same grid in log-scale, we can select Ĉ with maximal jump between successive values of df(λ̂_0(C)); note that our theoretical result then does not entirely hold, as we show the presence of a jump around σ², but do not show the absence of similar jumps elsewhere.

4.2 Oracle inequality

Theorem 1 Let Ĉ and λ̂ be defined as in the algorithm of Section 4.1, with Card(Λ) ≤ K n^α for some K, α ≥ 0. Assume that for every λ ∈ Λ, A_λ is symmetric with Sp(A_λ) ⊂ [0, 1], that the ε_i are i.i.d. Gaussian with variance σ² > 0, and that

∃λ_1, λ_2 ∈ Λ with df(λ_1) ≥ n/2, df(λ_2) ≤ √n, and, for i ∈ {1, 2}, (1/n)‖(A_{λ_i} − I_n)F‖₂² ≤ σ² √(ln(n)/n).    (A_{1-2})

Then, numerical constants C_a and c_1 and an event of probability at least 1 − 8 K n^{-2} exist on which, for every n ≥ C_a,

(1 − 91(α + 2) √(ln(n)/n)) σ² ≤ Ĉ ≤ (1 + c_1 (α + 2) ln(n) n^{-1/4}) σ².    (10)

Furthermore, if

∃κ ≥ 1 such that ∀λ ∈ Λ, tr(A_λ) σ²/n ≤ κ E[(1/n)‖F̂_λ − F‖₂²],    (A_3)

then a constant C_b depending only on κ exists such that for every n ≥ C_b, on the same event,

(1/n)‖F̂_λ̂ − F‖₂² ≤ (1 + κ √(ln(n)/n)) inf_{λ∈Λ} { (1/n)‖F̂_λ − F‖₂² } + 36 (κ + α + 2) ln(n) σ²/n.    (11)

Theorem 1 is proved in [20]. The proof mainly follows from the informal arguments developed in Section 3.2, completed with the following two concentration inequalities: if ξ ∈ R^n is a standard Gaussian random vector, α ∈ R^n and M is a real-valued n × n matrix, then for every x ≥ 0,

P( |⟨α, ξ⟩| ≤ √(2x) ‖α‖₂ ) ≥ 1 − 2 e^{-x},    (12)

∀θ > 0, P( | ‖Mξ‖₂² − tr(Mᵀ M) | ≤ θ tr(Mᵀ M) + 2 (1 + θ^{-1}) ‖M‖² x ) ≥ 1 − 2 e^{-x},    (13)

where ‖M‖ is the operator norm of M. A proof of Eq. (12) and (13) can be found in [20].

4.3 Discussion of the assumptions of Theorem 1

Gaussian noise. When ε is sub-Gaussian, Eq. (12) and Eq. (13) can be proved for ξ = σ^{-1} ε at the price of additional technicalities, which implies that Theorem 1 is still valid.

Symmetry. The assumption that the matrices A_λ must be symmetric can certainly be relaxed, since it is only used for deriving from Eq. (13) a concentration inequality for ⟨A_λ ξ, ξ⟩. Note that Sp(A_λ) ⊂ [0, 1] barely is an assumption since it means that A_λ actually shrinks Y.

Assumption (A_{1-2}). (A_{1-2}) holds if max_{λ∈Λ} {df(λ)} ≥ n/2 and the bias is smaller than c df(λ)^{-d} for some c, d > 0, a quite classical assumption in the context of model selection. Besides, (A_{1-2}) is much less restrictive and can even be relaxed; see [20].

Assumption (A_3). The upper bound (A_3) on tr(A_λ) is certainly the strongest assumption of Theorem 1, but it is only needed for Eq. (11). According to Eq. (6), (A_3) holds with κ = 1 when A_λ is a projection matrix, since tr(A_λᵀ A_λ) = tr(A_λ). In the kernel ridge regression framework, (A_3) holds as soon as the eigenvalues of the kernel matrix K decrease like j^{-α} (see [20]). In general, (A_3) means that F̂_λ should not have a risk smaller than the parametric convergence rate associated with a model of dimension df(λ) = tr(A_λ). When (A_3) does not hold, selecting among estimators whose risks are below the parametric rate is a rather difficult problem, and it may not be possible to attain the risk of the oracle in general.
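Before moving on, the second concentration ingredient used in the proof, Eq. (13), can be sanity-checked by simulation; the following lines are purely illustrative (the matrix M, the values of θ and x, and the Monte-Carlo size are arbitrary choices of mine):

import numpy as np

rng = np.random.default_rng(0)
n, x, theta = 200, 5.0, 0.5
M = rng.standard_normal((n, n)) / np.sqrt(n)
trMM = np.trace(M.T @ M)
bound = theta * trMM + 2 * (1 + 1 / theta) * np.linalg.norm(M, 2) ** 2 * x

xi = rng.standard_normal((10000, n))                    # 10000 standard Gaussian vectors
dev = np.abs(np.sum((xi @ M.T) ** 2, axis=1) - trMM)    # | ||M xi||^2 - tr(M^T M) |
print((dev > bound).mean(), "<=", 2 * np.exp(-x))       # empirical vs. theoretical bound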

Figure 2: Selected degrees of freedom vs. penalty strength log(C/σ²): note that when penalizing by the minimal penalty, there is a strong jump at C = σ², while when using half the optimal penalty, this is not the case. Left: single kernel case; right: multiple kernel case (with the minimal penalty used either over a discrete family or with continuous optimization).

Nevertheless, an oracle inequality can still be proved without (A_3), at the price of enlarging Ĉ slightly and adding a small fraction of σ² tr(A_λ)/n in the right-hand side of Eq. (11); see [20]. Enlarging Ĉ is necessary in general: if tr(A_λᵀ A_λ) ≪ tr(A_λ) for most λ ∈ Λ, the minimal penalty is very close to 2 σ² tr(A_λ)/n, so that according to Eq. (10), overfitting is likely as soon as Ĉ underestimates σ², even by a very small amount.

4.4 Main consequences of Theorem 1 and comparison with previous results

Consistent estimation of σ². The first part of Theorem 1 shows that Ĉ is a consistent estimator of σ² in a general framework and under mild assumptions. Compared to classical estimators of σ², such as the one usually used with Mallows' C_L, Ĉ does not depend on the choice of some model assumed to have almost no bias, which can lead to overestimating σ² by an unknown amount [18].

Oracle inequality. Our algorithm satisfies an oracle inequality with high probability, as shown by Eq. (11): the risk of the selected estimator F̂_λ̂ is close to the risk of the oracle, up to a remainder term which is negligible when the dimensionality df(λ*) grows with n faster than ln(n), a typical situation when the bias is never equal to zero, for instance in kernel ridge regression.

Several oracle inequalities have been proved in the statistical literature for Mallows' C_L with a consistent estimator of σ², for instance in [23]. Nevertheless, except for the model selection problem (see [6] and references therein), all previous results were asymptotic, meaning that n is implicitly assumed to be large compared to each parameter of the problem. This assumption can be problematic for several learning problems, for instance in multiple kernel learning where the number p of kernels may grow with n. On the contrary, Eq. (11) is non-asymptotic, meaning that it holds for every fixed n as soon as the assumptions explicitly made in Theorem 1 are satisfied.

Comparison with other procedures. According to Theorem 1 and previous theoretical results [23, 19], C_L, GCV, cross-validation and our algorithm satisfy similar oracle inequalities in various frameworks. This should not lead to the conclusion that these procedures are completely equivalent. Indeed, second-order terms can be large for a given n, while they are hidden in asymptotic results and not tightly estimated by non-asymptotic results. As shown by the simulations in Section 5, our algorithm yields statistical performances as good as existing methods, and often quite better. Furthermore, our algorithm never overfits too much because df(λ̂) is by construction smaller than the effective dimensionality of λ̂_0(Ĉ) at which the jump occurs. This is a quite interesting property compared for instance to GCV, which is likely to overfit if it is not corrected, because GCV minimizes a criterion proportional to the empirical risk.

5 Simulations

Throughout this section, we consider exponential kernels on R^d, k(x, y) = Π_{i=1}^d e^{-|x_i - y_i|}, with the x's sampled i.i.d. from a standard multivariate Gaussian. The functions f are then selected randomly as f = Σ_{i=1}^m α_i k(·, z_i), where both α and z are i.i.d. standard Gaussian (i.e., f belongs to the RKHS).
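This simulation design can be reproduced in a few lines; the sketch below is illustrative only (the values of n, d, m, the noise level and the random seed are my own choices, not the paper's):

import numpy as np

rng = np.random.default_rng(0)
n, d, m, sigma = 500, 4, 10, 0.5

def exp_kernel(x, z):
    # k(x, z) = prod_i exp(-|x_i - z_i|), evaluated on all pairs of rows of x and z.
    return np.exp(-np.abs(x[:, None, :] - z[None, :, :]).sum(axis=-1))

x = rng.standard_normal((n, d))            # design points, i.i.d. standard Gaussian
z = rng.standard_normal((m, d))            # centers defining the target function
alpha = rng.standard_normal(m)
F = exp_kernel(x, z) @ alpha               # F_i = f(x_i) = sum_j alpha_j k(x_i, z_j)
Y = F + sigma * rng.standard_normal(n)     # noisy observations

Feeding Y and a family (A_λ) built from K = exp_kernel(x, x) into the selection sketches above then gives a qualitative reproduction of the single-kernel experiment.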

Figure 3: Comparison of various smoothing parameter selection methods (minimal penalty, GCV, 10-fold cross-validation) for various numbers of observations, averaged over 20 replications. Left: single kernel (vertical axis: mean(error / error of the oracle)); right: multiple kernels (vertical axis: mean(error / error of Mallows' C_L)); horizontal axis: log(n).

Jump. In Figure 2 (left), we consider data x_i ∈ R^6, n = 1000, and study the size of the jump in Figure 2 for kernel ridge regression. With half the optimal penalty (which is used in traditional variable selection for linear regression), we do not get any jump, while with the minimal penalty we always do. In Figure 2 (right), we plot the same curves for the multiple kernel learning problem with two kernels on two different 4-dimensional variables, with similar results. In addition, we show two ways of optimizing over λ ∈ Λ = R²₊: by discrete optimization with different kernel matrices (a situation covered by Theorem 1), or by continuous optimization with respect to η in Eq. (1), by gradient descent (a situation not covered by Theorem 1).

Comparison of estimator selection methods. In Figure 3, we plot model selection results for 20 replications of data (d = 4, n = 500), comparing GCV [8], our minimal penalty algorithm, and cross-validation methods. In the left part (single kernel), we compare to the oracle (which can be computed because we can enumerate Λ), and use for cross-validation all possible values of λ. In the right part (multiple kernel), we compare to the performance of Mallows' C_L when σ² is known (i.e., the penalty in Eq. (5)), and since we cannot enumerate all λ's, we use the solution obtained by MKL with CV [5]. We also compare to using our minimal penalty algorithm with the sum of the kernels.

6 Conclusion

A new light on the slope heuristics. Theorem 1 generalizes some results first proved in [6], where all the A_λ are assumed to be projection matrices, a framework where assumption (A_3) is automatically satisfied. To this extent, Birgé and Massart's slope heuristics has been modified in a way that sheds a new light on the "magical" factor 2 between the minimal and the optimal penalty, as proved in [6, 7]. Indeed, Theorem 1 shows that for general linear estimators,

pen_id(λ) / pen_min(λ) = 2 tr(A_λ) / (2 tr(A_λ) − tr(A_λᵀ A_λ)),    (14)

which can take any value in (1, 2] in general; this ratio is only equal to 2 when tr(A_λ) ≈ tr(A_λᵀ A_λ), hence mostly when A_λ is a projection matrix.

Future directions. In the case of projection estimators, the slope heuristics still holds when the design is random and the data are heteroscedastic [7]; we would like to know whether Eq. (14) is still valid for heteroscedastic data with general linear estimators. In addition, the good empirical performances of elbow heuristics based algorithms (i.e., based on the sharp variation of a certain quantity around good hyperparameter values) suggest that Theorem 1 can be generalized to many learning frameworks (and potentially to non-linear estimators), probably with small modifications in the algorithm, but always relying on the concept of minimal penalty. Another interesting open problem would be to extend the results of Section 4, where Card(Λ) ≤ K n^α is assumed, to continuous sets Λ such as the ones appearing naturally in kernel ridge regression and multiple kernel learning. We conjecture that Theorem 1 is valid without modification for a small continuous Λ, such as in kernel ridge regression, where taking a grid of size n in log-scale is almost equivalent to taking Λ = R₊.
On the contrary, in applications such as the Lasso with p ≫ n variables, the natural set Λ cannot be well covered by a grid of cardinality n^α with α small, and our minimal penalty algorithm and Theorem 1 certainly have to be modified.
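To see how far the ratio in Eq. (14) can be from the "magical" factor 2, one can evaluate it along a kernel ridge regularization path; the following lines are illustrative (any positive definite kernel matrix K can be used):

import numpy as np

def penalty_ratio(K, lam):
    # Eq. (14): pen_id/pen_min = 2 tr(A) / (2 tr(A) - tr(A^T A)), with A = K (K + n*lam*I_n)^{-1}.
    n = K.shape[0]
    A = K @ np.linalg.inv(K + n * lam * np.eye(n))
    trA, trAA = np.trace(A), np.trace(A.T @ A)
    return 2 * trA / (2 * trA - trAA)

# As lam -> 0, A tends to the identity (a projection) and the ratio tends to 2;
# for heavier regularization the eigenvalues of A shrink and the ratio moves towards 1.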

References

[1] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[2] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
[3] O. Chapelle and V. Vapnik. Model selection for support vector machines. In Advances in Neural Information Processing Systems (NIPS), 1999.
[4] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[5] F. Bach. Consistency of the group Lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179-1225, 2008.
[6] L. Birgé and P. Massart. Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2):33-73, 2007.
[7] S. Arlot and P. Massart. Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res., 10:245-279, 2009.
[8] P. Craven and G. Wahba. Smoothing noisy data with spline functions. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math., 31(4):377-403, 1978/79.
[9] G. Wahba. Spline Models for Observational Data. SIAM, 1990.
[10] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49-67, 2006.
[11] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27-72, 2003/04.
[12] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.
[14] D. M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125-127, 1974.
[15] M. Stone. Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B, 36:111-147, 1974.
[16] T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Comput., 17(9):2077-2098, 2005.
[17] C. L. Mallows. Some comments on C_p. Technometrics, 15:661-675, 1973.
[18] B. Efron. How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc., 81(394):461-470, 1986.
[19] Y. Cao and Y. Golubev. On oracle inequalities related to smoothing splines. Math. Methods Statist., 15(4), 2007.
[20] S. Arlot and F. Bach. Data-driven calibration of linear estimators with minimal penalties, September 2009. Long version. arXiv:0909.1884v1.
[21] É. Lebarbier. Detecting multiple change-points in the mean of a Gaussian process by model selection. Signal Processing, 85:717-736, 2005.
[22] C. Maugis and B. Michel. Slope heuristics for variable selection and clustering via Gaussian mixtures. Technical Report 6550, INRIA, 2008.
[23] K.-C. Li. Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. Ann. Statist., 15(3):958-975, 1987.
