Regularization with the Smooth-Lasso procedure


Regularization with the Smooth-Lasso procedure. Mohamed Hebiri. To cite this version: Mohamed Hebiri. Regularization with the Smooth-Lasso procedure. <hal v2> HAL Id: hal Submitted on 15 Oct 2008. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Regularization with the Smooth-Lasso procedure. Mohamed Hebiri. Laboratoire de Probabilités et Modèles Aléatoires, CNRS-UMR 7599, Université Paris 7 - Diderot, UFR de Mathématiques, 175 rue de Chevaleret, F Paris, France.

Abstract. We consider the linear regression problem. We propose the S-Lasso procedure to estimate the unknown regression parameters. This estimator enjoys sparsity of the representation while taking into account correlations between successive covariates (or predictors). The study covers the case p ≫ n, i.e. the number of covariates is much larger than the number of observations. From a theoretical point of view, for fixed p, we establish asymptotic normality and consistency in variable selection results for our procedure. When p ≫ n, we provide variable selection consistency results and show that the S-Lasso achieves a Sparsity Inequality, i.e., a bound in terms of the number of non-zero components of the oracle vector. It appears that the S-Lasso has nice variable selection properties compared to its challengers. Furthermore, we provide an estimator of the effective degree of freedom of the S-Lasso estimator. A simulation study shows that the S-Lasso performs better than the Lasso as far as variable selection is concerned, especially when high correlations between successive covariates exist. This procedure also appears to be a good challenger to the Elastic-Net [36].

Keywords: Lasso, LARS, Sparsity, Variable selection, Regularization paths, Mutual coherence, High-dimensional data. AMS 2000 subject classifications: Primary 62J05, 62J07; Secondary 62H20, 62F12. hebiri@math.jussieu.fr

1 Introduction

We focus on the usual linear regression model:
y_i = x_i' β* + ε_i, i = 1,...,n, (1)
where the design x_i = (x_{i,1},...,x_{i,p})' ∈ R^p is deterministic, β* = (β*_1,...,β*_p)' ∈ R^p is the unknown parameter and ε_1,...,ε_n are independent identically distributed (i.i.d.) centered Gaussian random variables with known variance σ². We wish to estimate β* in the sparse case, that is, when many of its unknown components equal zero. Thus only a subset of the design covariates (ξ_j)_j is truly of interest, where ξ_j = (x_{1,j},...,x_{n,j})', j = 1,...,p. Moreover, the case p ≫ n is not excluded, so that we can consider p depending on n. In such a framework, two main issues arise: i) the interpretability of the resulting prediction; ii) the control of the variance in the estimation. Regularization is therefore needed. For this purpose we use selection-type procedures of the following form:
β̂ = Argmin_{β ∈ R^p} { ||Y − Xβ||²_n + pen(β) }, (2)
where X = (x_1',...,x_n')', Y = (y_1,...,y_n)' and pen : R^p → R is a positive convex function called the penalty. For any vector a = (a_1,...,a_n)', we have adopted the notation ||a||²_n = n^{-1} Σ_{i=1}^n a_i² (we denote by <·,·>_n the corresponding inner product in R^n). The choice of the penalty appears to be crucial. Although well-suited for variable selection purposes, concave-type penalties ([12], [27] and [6]) are often computationally hard to optimize. Lasso-type procedures (modifications of the l_1-penalized least squares (Lasso) estimator introduced by Tibshirani [25]) have been extensively studied during the last few years. Among many others, see [2, 4, 34] and references therein. Such procedures seem to respond to our objective as they perform both regression parameter estimation and variable selection with low computational cost. We will explore this type of procedure in our study.

In this paper, we propose a novel modification of the Lasso that we call the Smooth-Lasso (S-Lasso) estimator. It is defined as the solution of the optimization problem (2) when the penalty function is a combination of the Lasso penalty (i.e., Σ_{j=1}^p |β_j|) and the l_2-fusion penalty (i.e., Σ_{j=2}^p (β_j − β_{j−1})²). The l_2-fusion penalty was first introduced in [15]. We add it to the Lasso procedure in order to overcome the variable selection problems observed with the Lasso estimator. Indeed, the Lasso estimator has good selection properties but fails in some situations. More precisely, in several works ([2, 16, 18, 29, 32, 34, 35] among others) conditions for the consistency in variable selection of the Lasso procedure are given. It was shown that the Lasso is

not consistent when high correlations exist between the covariates. We give similar consistency conditions for the S-Lasso procedure and show that it is consistent in variable selection in many more situations than the Lasso estimator. From a practical point of view, problems are also encountered when we solve the Lasso criterion with the Lasso modification of the LARS algorithm [10]. Indeed, this algorithm tends to select only one representative covariate in each group of correlated covariates. We attempt to respond to this problem in the case where the covariates are ranked, so that high correlations can exist between successive covariates. We will see through simulations that such situations support the use of the S-Lasso estimator. This estimator is inspired by the Fused-Lasso [26]. Both the S-Lasso and the Fused-Lasso combine an l_1 penalty with a fusion term [15]. The fusion term is suggested to catch correlations between covariates. More relevant covariates can then be selected due to correlations between them. The main difference between the two procedures is that we use the l_2 distance between successive coefficients (i.e., the l_2-fusion penalty) whereas the Fused-Lasso uses the l_1 distance (i.e., the l_1-fusion penalty: Σ_{j=2}^p |β_j − β_{j−1}|). Hence, compared to the Fused-Lasso, we sacrifice sparsity between successive coefficients in the estimation of β* in favor of an easier optimization due to the strict convexity of the l_2 distance. However, sparsity is still ensured by the Lasso penalty, and the l_2-fusion penalty helps us to catch correlations between covariates. Consequently, even if there is no perfect match between successive coefficients, our results remain interpretable. Moreover, when successive coefficients are significantly different, a perfect match does not seem really appropriate. From a theoretical point of view, the l_2 distance also helps us to provide theoretical properties for the S-Lasso, which in some situations appears to outperform the Lasso and the Elastic-Net [36], another Lasso-type procedure. Let us mention that variable selection consistency of the Fused-Lasso and of the corresponding Fused adaptive Lasso has also been studied in [20], but in a different context from the one in the present paper. The results obtained in [20] are established not only under the sparsity assumption; the model is also supposed to be blocky, that is, the non-zero coefficients are represented in a block fashion with equal values inside each block. Many techniques have been proposed to overcome the weaknesses of the Lasso. The Fused-Lasso procedure is one of them, and we mention here some of the most popular methods. The Adaptive Lasso was introduced in [35]; it is similar to the Lasso but with adaptive weights used to penalize each regression coefficient separately. This procedure reaches the Oracle Properties (i.e. consistency in variable selection and asymptotic normality). Another approach is used in the Relaxed Lasso [17] and aims to doubly control the Lasso estimate: one parameter to control variable selection and the other to control shrinkage of the selected coefficients. To overcome

the problem due to the correlation between covariates, group variable selection has been proposed by Yuan and Lin [31] with the Group-Lasso procedure, which selects groups of correlated covariates instead of single covariates at each step. A first step towards a consistency study was proposed in [1], and Sparsity Inequalities were given in [5]. Another choice of penalty has been proposed with the Elastic-Net [36]. It is in the same spirit that we shall treat the S-Lasso from a theoretical point of view.

The paper is organized as follows. In the next section, we present one way to solve the S-Lasso problem with the attractive property of piecewise linearity of its regularization path. Section 3 gives theoretical performances of the considered estimator, such as consistency in variable selection and asymptotic normality when p ≤ n, whereas consistency in estimation and variable selection in the high dimensional case are considered in Section 4. We also give an estimate of the effective degree of freedom of the S-Lasso estimator in Section 5. Then, we provide a way to control the variance of the estimator by scaling in Section 6, where a connection with soft-thresholding is also established. A generalization and a comparative study with the Elastic-Net are given in Section 7. We finally give experimental results in Section 8, showing the S-Lasso performances against some popular methods. All proofs are postponed to an Appendix section.

2 The S-Lasso procedure

As described above, we define the S-Lasso estimator β̂^SL as the solution of the optimization problem (2) when the penalty function is:
pen(β) = λ ||β||_1 + µ Σ_{j=2}^p (β_j − β_{j−1})², (3)
where λ and µ are two positive parameters that control the smoothness of our estimator. For any vector a = (a_1,...,a_p)', we have used the notation ||a||_1 = Σ_{j=1}^p |a_j|. Note that when µ = 0, the solution is the Lasso estimator, so that the Lasso appears as a special case of the S-Lasso estimator. Now we deal with the resolution of the S-Lasso problem (2)-(3) and its computational cost. From now on, we suppose w.l.o.g. that X = (x_1',...,x_n')' is standardized (that is, n^{-1} Σ_{i=1}^n x_{i,j}² = 1 and n^{-1} Σ_{i=1}^n x_{i,j} = 0) and that Y = (y_1,...,y_n)' is centered (that is, n^{-1} Σ_{i=1}^n y_i = 0). The following lemma shows that the S-Lasso criterion can be expressed as a Lasso criterion by augmenting the data artificially.

Lemma 1. Given the data set (X, Y) and (λ, µ), define the extended dataset (X̃, Ỹ) by
X̃ = (1 + µ)^{-1} (X' , √(nµ) J')' and Ỹ = (Y' , 0')',
where 0 is a vector of size p containing only zeros and J is the p × p matrix whose first row is identically zero and whose j-th row, for j = 2,...,p, has −1 in position j−1, 1 in position j and 0 elsewhere, so that (Jβ)_j = β_j − β_{j−1}. (4)
Let r = λ/(1 + µ) and b = (1 + µ) β. Then the S-Lasso criterion can be written as
||Ỹ − X̃b||²_n + r ||b||_1.
Let b̂ be the minimizer of this Lasso criterion; then β̂^SL = b̂/(1 + µ).

This result is a consequence of simple algebra. Lemma 1 motivates the following comments on the S-Lasso procedure.

Remark 1 (Regularization paths). The S-Lasso modification of the LARS algorithm is an iterative algorithm. For a fixed µ (appearing in (3)), it constructs at each step an estimator based on the correlation between the covariates and the current residual. Each step corresponds to a value of λ. Then, for a fixed µ, we get the evolution of the S-Lasso estimator coefficient values when λ varies. This evolution describes the regularization paths of the S-Lasso estimator, which are piecewise linear [21]. This property implies that the S-Lasso problem can be solved with the same computational cost as the ordinary least squares (OLS) estimate, using the Lasso modification of the LARS algorithm.

Remark 2 (Implementation). The number of covariates that the LARS algorithm and its Lasso version can select is limited by the number of rows of the matrix X. Applied to the augmented data (X̃, Ỹ) introduced in Lemma 1, the Lasso modification of the LARS algorithm is able to select all p covariates. We are then no longer limited by the sample size, as we are for the Lasso [10].
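As a small numerical illustration of Lemma 1 (a sketch only, not the author's implementation: the helper names difference_matrix and s_lasso_augmented are hypothetical, and a plain proximal-gradient loop stands in for the LARS algorithm used in the paper), one can build the augmented data set, solve the resulting Lasso problem and rescale the solution:

```python
import numpy as np

def difference_matrix(p):
    """p x p matrix J of (4): first row zero, then (J b)_j = b_j - b_{j-1} for j >= 2."""
    J = np.zeros((p, p))
    for j in range(1, p):
        J[j, j] = 1.0
        J[j, j - 1] = -1.0
    return J

def s_lasso_augmented(X, Y, lam, mu, n_iter=5000):
    """Sketch of Lemma 1: solve the augmented Lasso by ISTA, then rescale."""
    n, p = X.shape
    J = difference_matrix(p)
    X_tilde = np.vstack([X, np.sqrt(n * mu) * J]) / (1.0 + mu)
    Y_tilde = np.concatenate([Y, np.zeros(p)])
    r = lam / (1.0 + mu)
    # ISTA on (1/n) ||Y_tilde - X_tilde b||^2 + r ||b||_1
    L = 2.0 * np.linalg.norm(X_tilde, 2) ** 2 / n   # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = -2.0 / n * X_tilde.T @ (Y_tilde - X_tilde @ b)
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - r / L, 0.0)   # soft-thresholding step
    return b / (1.0 + mu)   # beta_hat^SL = b_hat / (1 + mu)
```

Any Lasso solver can be substituted for the ISTA loop; only the data augmentation and the final rescaling are specific to Lemma 1.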

3 Theoretical properties of the S-Lasso estimator when p ≤ n

In this section we introduce theoretical results for the S-Lasso with a moderate number of covariates (p ≤ n). We first provide rates of convergence of the S-Lasso estimator and show how, through a control of the regularization parameters, we can establish root-n consistency and asymptotic normality. Then we look for variable selection consistency. More precisely, we give conditions under which the S-Lasso estimator succeeds in finding the set of the non-zero regression coefficients. We show that with a suitable choice of the tuning parameters (λ, µ), the S-Lasso is consistent in variable selection. All the results of this section are proved in Appendix A.

3.1 Asymptotic Normality

In this section, we allow the tuning parameters (λ, µ) to depend on the sample size n. We emphasize this dependence by adding a subscript n to these parameters. We also fix the number of covariates p. Let I(·) denote the indicator function and define the sign function such that for any x ∈ R, Sgn(x) equals 1, −1 or 0 according to whether x is positive, negative or equal to 0. Knight and Fu [14] gave the asymptotic distribution of the Lasso estimator. We provide here the asymptotic distribution of the S-Lasso. Let C_n = n^{-1} X'X be the Gram matrix; then:

Theorem 1. Given the data set (X, Y), assume that the correlation matrix verifies C_n → C when n → ∞, in probability, where C is a positive definite matrix. Assume that there exists a sequence v_n such that v_n → 0 and that the regularization parameters verify λ_n v_n^{-1} → λ ≥ 0 and µ_n v_n^{-1} → µ ≥ 0. Then, if (√n v_n)^{-1} → κ ≥ 0, we have
v_n^{-1} (β̂^SL − β*) →_D Argmin_{u ∈ R^p} V(u), when n → ∞,
where
V(u) = −2κ u'W + u'Cu + λ Σ_{j=1}^p { u_j Sgn(β*_j) I(β*_j ≠ 0) + |u_j| I(β*_j = 0) } + 2µ Σ_{j=2}^p { (u_j − u_{j−1})(β*_j − β*_{j−1}) I(β*_j ≠ β*_{j−1}) },
with W ~ N(0, σ² C).

Remark 3. When κ ≠ 0 is a finite constant: in this case v_n^{-1} is O(√n), so that the estimator β̂^SL is root-n consistent. Moreover, when λ = µ = 0, we obtain the following standard asymptotic normality: √n (β̂^SL − β*) →_D N(0, σ² C^{-1}). When κ = 0: in this case, the rate of convergence is slower than √n, so that we no longer have the optimal rate. Moreover, the limit is not random anymore. Note first that the correlation penalty does not alter the asymptotic bias when successive regression coefficients are equal. We also remark that the sequence v_n must be chosen properly, as it determines our convergence rate. We would like v_n to be as close as possible to 1/√n. This sequence is calibrated by the user such that λ_n/v_n → λ and µ_n/v_n → µ.

3.2 Consistency in variable selection

In this section, variable selection consistency of the S-Lasso estimator is considered. For this purpose, we introduce the following sparsity sets: A* = {j : β*_j ≠ 0} and Â_n = {j : β̂^SL_j ≠ 0}. The set A* consists of the non-zero coefficients of the oracle regression vector β*. The set Â_n consists of the non-zero coefficients of the S-Lasso estimator β̂^SL and is also called the active set of this estimator. Before stating our results, let us introduce some notation. For any vector a ∈ R^p and any set of indexes B ⊂ {1,...,p}, denote by a_B the restriction of the vector a to the indexes in B. In the same way, if we denote by |B| the cardinality of the set B, then for any s × q matrix M we use the following conventions: i) M_{B,B} is the |B| × |B| matrix consisting of the rows and columns of M whose indexes are in B; ii) M_{·,B} is the s × |B| matrix consisting of the columns of M whose indexes are in B; iii) M_{B,·} is the |B| × q matrix consisting of the rows of M whose indexes are in B. Moreover, we define J̃ as the p × p matrix J'J, where J was defined in (4). Finally, we define for j ∈ {1,...,p} the quantity Ω_j = Ω_j(λ, µ, A*, β*) by
Ω_j = | C_{j,A*} (C_{A*,A*} + µ J̃_{A*,A*})^{-1} ( 2^{-1} Sgn(β*_{A*}) + (µ/λ) J̃_{A*,A*} β*_{A*} ) − (µ/λ) J̃_{j,A*} β*_{A*} |, (5)
where C is defined as in Theorem 1. Now consider the following conditions: for every j ∈ (A*)^c,
Ω_j(λ, µ, A*, β*) < 1, (6)
Ω_j(λ, µ, A*, β*) ≤ 1. (7)
These conditions on the correlation matrix C and on the regression vector β*_{A*} are the analogues of, respectively, the sufficient and the necessary conditions derived for the Lasso ([35], [34] and [32]). Now we state the consistency results.

Theorem 2. If condition (6) holds, then for every couple of regularization parameters (λ_n, µ_n) such that λ_n → 0, λ_n n^{1/2} → ∞ and µ_n → 0, the S-Lasso estimator β̂^SL as defined in (2)-(3) is consistent in variable selection. That is,
P(Â_n = A*) → 1, when n → ∞.

Theorem 3. If there exist sequences (λ_n, µ_n) such that β̂^SL converges to β* and Â_n converges to A* in probability, then condition (7) is satisfied.

We have just established necessary and sufficient conditions for the selection consistency of the S-Lasso estimator. Due to the assumptions needed in Theorem 2 (more precisely λ_n n^{1/2} → ∞), root-n consistency and variable selection consistency cannot be treated here simultaneously. We may want to know whether the S-Lasso estimator can be consistent with a rate slower than n^{1/2} and consistent in variable selection at the same time.

Remark 4. Here are special cases of conditions (6)-(7).
When µ = 0 and lim_n µ_n/λ_n = 0: these conditions are exactly the sufficient and necessary conditions of the Lasso estimator. In this case, Yuan and Lin [32] showed that condition (6) becomes necessary and sufficient for the Lasso estimator consistency in variable selection.
When µ = 0 and lim_n µ_n/λ_n = γ ≠ 0: in this case, condition (6) becomes
sup_{j ∈ (A*)^c} | C_{j,A*} C_{A*,A*}^{-1} ( 2^{-1} Sgn(β*_{A*}) + γ J̃_{A*,A*} β*_{A*} ) − γ J̃_{j,A*} β*_{A*} | < 1.
Here a good calibration of γ leads to consistency in variable selection:
if (C_{j,A*} C_{A*,A*}^{-1} J̃_{A*,A*} − J̃_{j,A*}) β*_{A*} > 0, then γ must be chosen between
(−1 − 2^{-1} C_{j,A*} C_{A*,A*}^{-1} Sgn(β*_{A*})) / ((C_{j,A*} C_{A*,A*}^{-1} J̃_{A*,A*} − J̃_{j,A*}) β*_{A*})
and
(1 − 2^{-1} C_{j,A*} C_{A*,A*}^{-1} Sgn(β*_{A*})) / ((C_{j,A*} C_{A*,A*}^{-1} J̃_{A*,A*} − J̃_{j,A*}) β*_{A*});
if (C_{j,A*} C_{A*,A*}^{-1} J̃_{A*,A*} − J̃_{j,A*}) β*_{A*} < 0, then γ must be chosen between the same quantities but with their order inverted.
When µ ≠ 0 and lim_n µ_n/λ_n = γ ≠ 0: this case is similar to the previous one. In addition, it allows another control on the condition through a calibration of µ, so that condition (6) can be satisfied with a better control.
We conclude that if we sacrifice the optimal rate of convergence (i.e. root-n consistency), we are able, through a proper choice of the tuning parameters (λ_n, µ_n),

to get consistency in variable selection. Note that Zou [35] showed that the Lasso estimator cannot be consistent in variable selection even with a rate of convergence slower than √n. He then added weights to the Lasso (i.e. the adaptive Lasso estimator) in order to get the Oracle Properties (that is, both asymptotic normality and variable selection consistency). Note that we can easily adapt the techniques used for the adaptive Lasso to provide a weighted S-Lasso estimator which achieves the Oracle Properties.

4 Theoretical results when the dimension p is larger than the sample size n

In this section, we propose to study the performance of the S-Lasso estimator in the high dimensional case. In particular, we provide a non-asymptotic bound on the squared risk. We also provide a bound on the estimation risk under the sup-norm (i.e., the l_∞-norm: ||β̂^SL − β*||_∞ = sup_j |β̂^SL_j − β*_j|). This last result helps us to provide a variable selection consistent estimator obtained by thresholding the S-Lasso estimator. The results of this section are proved in Appendix B.

4.1 Sparsity Inequality

Now we establish a Sparsity Inequality (SI) achieved by the S-Lasso estimator, that is, a bound on the squared risk that takes into account the sparsity of the oracle regression vector β*. More precisely, we prove that the rate of convergence is |A*| log(p)/n. For this purpose, we need some assumptions on the Gram matrix C_n, which is normalized in our setting. Recall that ξ_j = (x_{1,j},...,x_{n,j})'. Then we define the regularization parameters λ_n and µ_n in the following forms:
λ_n = κ1 σ √(log(p)/n) and µ_n = κ2 σ √(log(p))/n, (8)
where κ1 > 2√2 and κ2 > 0 are constants. Let us define the maximal correlation quantity ρ1 = max_{j∈A*} max_{k∈{1,...,p}, k≠j} |(C_n)_{j,k}|. Using these notations, we formulate the following assumptions:

Assumption (A1). The true regression vector β* is such that there exists a finite constant L1 such that:
β*'_{A*} J̃_{A*,A*} β*_{A*} ≤ L1 √(log(p)) |A*|, (9)

where J̃ = J'J and J was defined in (4).

Assumption (A2). We have:
ρ1 ≤ 1/(16 |A*|). (10)

Note that Assumption (A1) is not restrictive. A sufficient condition is that the largest non-zero component of β*_{A*} is bounded by L1 √(log(p)), which can be very large. Assumption (A2) is the well-known coherence condition considered in [3], which was introduced in [7]. Most of the SIs provided in the literature use such a condition. We refer to [3] for more details. Theorem 4 below provides an upper bound for the squared error of the estimator β̂^SL and for its l_1 estimation error, which take into account the sparsity index |A*|.

Theorem 4. Let us consider the linear regression model (1). Let β̂^SL be the S-Lasso estimator. Let A* be the sparsity set. Suppose that p ≥ n (and even p ≫ n is allowed). If Assumptions (A1)-(A2) hold, then with probability greater than 1 − u_{n,p} we have
||Xβ̂^SL − Xβ*||²_n ≤ c2 |A*| log(p)/n, (11)
and
||β̂^SL − β*||_1 ≤ c1 |A*| √(log(p)/n), (12)
where c2 = (16κ1² + L1κ2)σ², c1 = (16κ1 + L1κ1^{-1}κ2)σ, and where u_{n,p} = p^{1−κ1²/8}, with κ1 and κ2 the constants appearing in (8).

The proof of Theorem 4 is based on the argmin definition of the estimator and on some technical concentration inequalities. Similar bounds were provided for the Lasso estimator in [4]. Let us mention that the constants c1 and c2 are not optimal. We focused our attention on the dependency on n (and then on p and |A*|). It turns out that our results are near optimal. For instance, for the l_2 risk, the S-Lasso estimator nearly reaches the optimal rate |A*| log(p/|A*| + 1)/n, up to a logarithmic factor [3, Theorem 5.1].

4.2 Sup-norm bound and variable selection

Now we provide a bound on the sup-norm ||β̂^SL − β*||_∞. Thanks to this result, one may be able to define a rule in order to get a variable selection consistent estimator

12 whe p. That is, we ca costruct a estimator which succeeds to recover the support of β i high dimesioal settigs. Small modificatios are to be imposed to provide our selectio results i this sectio. Let K be the symmetric p p matrix defied by K = C + µ J. Istead of Assumptio (A2), we will cosider the followig Assumptio (A3). We assume that max j, k {1,...,p} k j (K ) j,k 1 16 A. Remark 5. Note that the matrix J is tridiagoal with its off-diagoal terms equal to 1. If we do ot cosider the diagoal terms, we remark that C ad K differ oly i the terms o the secod diagoals (i.e., (K ) j 1,j (C ) j 1,j for j = 2,..., p as soo as µ 0). The, as we do ot cosider the diagoal terms i Assumptios (A2) ad (A3), they differ oly i the restrictio they impose to terms o the secod diagoals. Terms i the secod diagoals of C correspod to correlatios betwee successive covariates. The whe high correlatios exist betwee successive covariates, a suitable choice of µ makes Assumptio (A3) satisfied while Assumptios (A2) does ot. Hece, Assumptio (A3) fits better with setup cosidered i the paper. I the sequel, a coveiet choice of the tuig parameter µ is µ = κ 3 σ/ log (p), where κ 3 > 0 is a costat. Moreover, from Assumptio (A1), we have βa J A,A β A L 1 log (p) A. This iequality guaratees the existece of a costat L 2 > 0 such that Jβ L 2 log (p). Theorem 5. Let us cosider the liear regressio model (1). Let λ = κ 1 σ log(p)/ ad µ = κ 3 σ/ log (p) with κ 1 > 2 2 ad κ 3 > 0. Suppose that p (ad eve p ). Uder Assumptios (A1) ad (A3) ad with probability greater tha 2 1 p 1 κ 1 8, we have where c equals to log (p) ˆβ SL β c, ( ) Bσ α 1 + 4L 1B 9α 2 A 2 + 2L 1B 3αA 2 + 2L 1 B 3α(α 1)A 2 + 8L 1 L 2 B 2 9α(α 1)A 4 λ + (4L 2B 3A 2 + L 2B A 2 )λ. 11

Note that the leading term in c is Bσα/(α−1) + 4L1B/(9α²|A*|²) + 2L1B/(3α|A*|²) + 2L1B/(3α(α−1)|A*|²). One may recover the result obtained for the Lasso by setting L1 to zero [16]. Secondly, the calibration of µ_n aims at making the convergence rate under the sup-norm equal to √(log(p)/n). On the one hand, the proof of Theorem 5 allows us to choose this parameter with a faster convergence to zero without affecting the rate of convergence. On the other hand, a more restrictive Assumption (A1) on β*'_{A*} J̃_{A*,A*} β*_{A*} and on J̃β* can be formulated in order to make µ_n converge more slowly to zero. If we let β*'_{A*} J̃_{A*,A*} β*_{A*} ≤ L1 |A*| in Assumption (A1), we can set µ_n as O(√(log(p)/n)), the slowest convergence to zero we can get for µ_n.

Let us now provide a consistent version of the S-Lasso estimator. Consider β̂^ThSL, the thresholded S-Lasso estimator, defined componentwise by
β̂^ThSL_j = β̂^SL_j I(|β̂^SL_j| > c √(log(p)/n)),
where c is given in Theorem 5. This estimator consists of the S-Lasso estimator with its small coefficients reduced to zero. We thus enforce the selection property of the S-Lasso estimator. Variable selection consistency of this estimator is established under one more restriction:

Assumption (A4). The smallest non-zero coefficient of β* is such that there exists a constant c_l > 0 with
min_{j∈A*} |β*_j| > c_l √(log(p)/n).

Assumption (A4) bounds from below the smallest regression coefficient of β*. This is a common assumption used to provide sign consistency in the high dimensional case. This condition appears in [19, 29, 33, 34], but with a larger (in terms of sample size dependence) and thus more restrictive threshold. We refer to [16] for a longer discussion. An equivalent lower bound on the oracle regression coefficients can be found in [2, 16]. With this new assumption, we can state the following sign consistency result.

Theorem 6. Let us consider the thresholded S-Lasso estimator β̂^ThSL as described above. Choose moreover λ_n = κ1 σ √(log(p)/n) and µ_n = κ3 σ/√(n log(p)) with positive constants κ1 > 2√2 and κ3. Under Assumptions (A1), (A3) and (A4), if c_l > 2c, with c given by Theorem 5, then with probability greater than 1 − 2p^{1−κ1²/8} we have
Sgn(β̂^ThSL) = Sgn(β*), (13)
and then, as n → +∞,
P(Sgn(β̂^ThSL) = Sgn(β*)) → 1. (14)
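The thresholding rule defining β̂^ThSL is a one-line operation once β̂^SL has been computed. Here is a minimal sketch (the constant c is left as a user-supplied value, its exact expression being the one given in Theorem 5):

```python
import numpy as np

def threshold_s_lasso(beta_sl, n, c):
    """Thresholded S-Lasso: keep beta_hat_j only if |beta_hat_j| > c*sqrt(log(p)/n)."""
    p = beta_sl.size
    tau = c * np.sqrt(np.log(p) / n)
    return beta_sl * (np.abs(beta_sl) > tau)
```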

Remark 6. As observed in Remark 5, Assumption (A3) is more easily satisfied when correlation exists between successive covariates. Then, in situations where the correlation matrix C_n is tridiagonal with its off-diagonal terms equal to δ with δ ∈ [0, 1], the constant κ3 appearing in the definition of µ_n can be adjusted in order to get Assumption (A3) satisfied.

5 Model Selection

As already said [Remark 1 in Section 2], each step of the S-Lasso version of the LARS algorithm provides an estimator of β*. In this section, we are interested in the choice of the best estimator according to its prediction accuracy. For a new n × p matrix x_new of instances (independent of X), denote by ŷ^SL = x_new β̂^SL the estimator of its unknown response value y_new, and let m = E(y_new | x_new). We aim to minimize the true risk E{||m − ŷ^SL||²_n}. First, we easily obtain
E{||m − ŷ^SL||²_n} = E{ ||Y − ŷ^SL||²_n − σ² + (2/n) Σ_{i=1}^n Cov(y_i, ŷ^SL_i) },
where the expectation is taken over the random variable Y. The last term in this equation is called the optimism [9]. Moreover, Tibshirani [25] links this quantity to the degree of freedom df(ŷ^SL) of the estimator ŷ^SL, so that the above equality becomes
E{||m − ŷ^SL||²_n} = E{ ||Y − ŷ^SL||²_n − σ² + (2/n) df(ŷ^SL) σ² }. (15)
This final expression involves the degree of freedom, which is unknown. Various methods exist to estimate the degree of freedom, such as the bootstrap [11] or data perturbation methods [24]. We give an explicit form of the degree of freedom in order to reduce the computational cost, as in [10] and [37].

Degrees of freedom: the degree of freedom is a quantity of interest in model selection. Before stating our result, let us introduce some useful properties of the regularization paths of the S-Lasso estimator:
- Given a response Y and a regularization parameter µ ≥ 0, there is a finite sequence 0 = λ^(K) < λ^(K−1) < ... < λ^(0) such that β̂^SL = 0 for every λ ≥ λ^(0). In this notation, superscripts correspond to the steps of the S-Lasso version of the LARS algorithm.
- Given a response Y and a regularization parameter µ ≥ 0, for λ ∈ (λ^(k+1), λ^(k)) the same covariates are used to construct the estimator. Let us denote by A_ζ the active set for a fixed couple ζ = (λ, µ) and by X_{·,A_ζ} the corresponding design matrix.

In what follows, we will use the subscript ζ to emphasize the fact that the considered quantity depends on ζ.

Theorem 7. For fixed µ ≥ 0 and λ > 0, an unbiased estimate of the effective degree of freedom of the S-Lasso estimate is given by
df̂(ŷ^SL_ζ) = Tr[ X_{·,A_ζ} ( X'_{·,A_ζ} X_{·,A_ζ} + nµ J̃_{A_ζ,A_ζ} )^{-1} X'_{·,A_ζ} ],
where J̃ = J'J is the p × p tridiagonal matrix with diagonal entries (1, 2, 2, ..., 2, 1) and with all entries on the first sub- and super-diagonals equal to −1. (16)

As the estimate given in Theorem 7 has an important computational cost, we propose the following estimator of the degree of freedom of the S-Lasso estimator:
df̂(ŷ^SL_ζ) = (|A_ζ| − 2)/(1 + 2µ) + 2/(1 + µ), (17)
which is very easy to compute. Let I_s be the s × s identity matrix, where s is an integer. We found the former approximation of the degree of freedom under the orthogonal covariance matrix assumption (that is, n^{-1} X'X = I_p). Moreover, we approximate the matrix (I_{|A_ζ|} + µ J̃_{A_ζ,A_ζ}) by the diagonal matrix with 1 + µ as its first and last terms, and 1 + 2µ elsewhere.

Remark 7 (Comparison to the Lasso and the Elastic-Net). A similar work leads to an estimate of the degree of freedom of the Lasso, df̂(ŷ^L_ζ) = |A_ζ|, and to an estimate of the degree of freedom of the Elastic-Net estimator, df̂(ŷ^EN_ζ) = |A_ζ|/(1 + µ). These approximations of the degrees of freedom provide the following comparison for a fixed ζ: df̂(ŷ^SL_ζ) ≤ df̂(ŷ^EN_ζ) ≤ df̂(ŷ^L_ζ). A conclusion is that the S-Lasso estimator is the one whose models are the least penalized, and the Lasso estimator the most. As a consequence, the S-Lasso estimator should select larger models than the Lasso or the Elastic-Net estimators.
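Both degree-of-freedom formulas are easy to evaluate numerically. The sketch below (hypothetical helper names; J_tilde stands for the matrix J̃ of (16) and active for the index set A_ζ) computes the exact trace estimate of Theorem 7 and the cheap approximation (17):

```python
import numpy as np

def df_exact(X, active, mu, J_tilde):
    """Theorem 7: Tr[ X_A (X_A' X_A + n*mu*J_tilde_{A,A})^{-1} X_A' ]."""
    n = X.shape[0]
    XA = X[:, active]
    JA = J_tilde[np.ix_(active, active)]
    # Use the cyclic property of the trace: Tr[X_A M^{-1} X_A'] = Tr[M^{-1} X_A' X_A]
    M = np.linalg.solve(XA.T @ XA + n * mu * JA, XA.T @ XA)
    return np.trace(M)

def df_approx(n_active, mu):
    """Cheap approximation (17), derived under an orthogonal design."""
    return (n_active - 2) / (1.0 + 2.0 * mu) + 2.0 / (1.0 + mu)
```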

6 The Normalized S-Lasso estimator

In this section, we look for a scaled S-Lasso estimator which would have better empirical performance than the original S-Lasso presented above. The idea behind this study is to better control shrinkage. Indeed, using the S-Lasso procedure (2)-(3) induces a double shrinkage: one coming from the Lasso penalty and the other from the fusion penalty. We want to undo the shrinkage implied by the fusion penalty, as shrinkage is already ensured by the Lasso penalty. We therefore suggest studying the S-Lasso criterion (2)-(3) without the Lasso penalty (i.e. with only the l_2-fusion penalty) in order to find the constant by which we have to scale. Define
β̄ = Argmin_{β ∈ R^p} ||Y − Xβ||²_n + µ Σ_{j=2}^p (β_j − β_{j−1})².
We easily obtain β̄ = ((X'X)/n + µ J̃)^{-1} (X'Y)/n := L^{-1} (X'Y)/n, where J̃ is given by (16). Moreover, as the design matrix X is standardized, the symmetric matrix L can be written as the p × p matrix whose diagonal is (1 + µ, 1 + 2µ, ..., 1 + 2µ, 1 + µ), whose entries on the first sub- and super-diagonals are ξ'_j ξ_{j+1}/n − µ, and whose remaining entries are ξ'_j ξ_k/n. In order to get rid of the shrinkage due to the fusion penalty, we force L to have ones (or something close to a diagonal of ones) as its diagonal elements. We then scale the estimator β̄ by a factor c. Here are two choices we will use in the rest of the paper: i) the first is c = 1 + µ, so that the first and last diagonal elements of c^{-1}L become equal to one; ii) the second is c = 1 + 2µ, which offers the advantage that all the diagonal elements of c^{-1}L become equal to one, except the first and the last. This second choice seems more appropriate to undo this extra shrinkage, especially in high dimensional problems. We first give a generalization of Lemma 1.

Lemma 2. Given the dataset (X, Y) and (λ, µ), define the augmented dataset (X̃, Ỹ) by
X̃ = ν_1^{-1} (X' , √(nµ) J')' and Ỹ = (Y' , 0')',

where ν_1 is a constant which depends only on µ, and J is given by (4). Let r = λ/ν_1 and b = (ν_2/c) β, where ν_2 is a constant which depends only on µ and c is the scaling constant which appears in the previous study. Then the S-Lasso criterion can be written as
||Ỹ − X̃b||²_n + r ||b||_1. (18)
Let b̂ be the minimizer of this Lasso criterion; then we define the Scaled Smooth-Lasso (SS-Lasso) by β̂^SSL = β̂^SSL(ν_1, ν_2, c) = (c/ν_2) b̂. Moreover, let J̃ = J'J. Then we have
β̂^SSL = Argmin_{β ∈ R^p} { (ν_2/(ν_1 c)) β' ( X'X/n + µ J̃ ) β − 2 <Y, Xβ>_n + λ Σ_{j=1}^p |β_j| }. (19)

Equation (19) is only a rearrangement of the Lasso criterion (18). The SS-Lasso expression (19) emphasizes the importance of the scaling constant c. In a way, the SS-Lasso estimator stabilizes the Lasso estimator β̂^L (criterion (18) based on (X, Y) instead of (X̃, Ỹ)), as we have
β̂^L = Argmin_{β ∈ R^p} { β' (X'X/n) β − 2 <Y, Xβ>_n + λ Σ_{j=1}^p |β_j| }.
The choice of ν_1 and ν_2 should be linked to this scaling constant c in order to get better empirical performances and to have fewer parameters to calibrate. Let us define some specific cases.
i) Case 1: when ν_1 = ν_2 = 1 + µ and c = 1: this is the original S-Lasso estimator as seen in Section 2.
ii) Case 2: when ν_1 = ν_2 = 1 + µ and c = 1 + µ: we call this scaled S-Lasso estimator the Normalized Smooth Lasso (NS-Lasso) and we denote it by β̂^NSL. In this case, we have β̂^NSL = (1 + µ) β̂^SL.
iii) Case 3: when ν_1 = ν_2 = 1 + 2µ and c = 1 + 2µ: we call this scaled version the Highly Normalized Smooth Lasso (HS-Lasso) and we denote it by β̂^HSL.
Other choices are possible for ν_1 and ν_2 in order to better control shrinkage. For instance, we can consider a compromise between the NS-Lasso and the HS-Lasso by defining ν_1 = 1 + µ and ν_2 = 1 + 2µ.

Remark 8 (Connection with Soft Thresholding). Let us consider the limit case of the NS-Lasso estimator. Denote β̂^NSL_∞ = lim_{µ→∞} β̂^NSL; then, using (19), we have
β̂^NSL_∞ = Argmin_β { β'β − 2 <Y, Xβ>_n + λ ||β||_1 }.

As a consequence,
(β̂^NSL_∞)_j = ( |<Y, ξ_j>_n| − λ/2 )_+ Sgn(<Y, ξ_j>_n),
which is the Univariate Soft Thresholding [8]. Hence, when µ → ∞, the NS-Lasso works as if all the covariates were independent. The Lasso, which corresponds to the NS-Lasso when µ = 0, often fails to select covariates when high correlations exist between relevant and irrelevant covariates. It seems that the NS-Lasso is able to avoid such problems by increasing µ and working as if all the covariates were independent. Then, for a fixed λ, the control of the regularization parameter µ appears to be crucial. When we vary it, the NS-Lasso bridges the Lasso and the Soft Thresholding.
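The limiting rule of Remark 8 can be implemented directly. The following sketch (assuming, as in the rest of the paper, that the columns ξ_j of X are standardized) applies the Univariate Soft Thresholding to the marginal correlations:

```python
import numpy as np

def univariate_soft_threshold(X, Y, lam):
    """Limit of the NS-Lasso as mu -> infinity: componentwise soft-thresholding
    of the marginal correlations <Y, xi_j>_n = xi_j' Y / n."""
    n = X.shape[0]
    corr = X.T @ Y / n
    return np.sign(corr) * np.maximum(np.abs(corr) - lam / 2.0, 0.0)
```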

19 This coditio ca be see as a geeralizatio of the Irrepresetable Coditio ivolved i the Lasso variable selectio cosistecy. Let us discuss how the two assumptios ca be compared i the case p. First, ote that Assumptio (A3-EN), as well as EIC suggests low correlatios betwee covariates. Moreover Assumptio (A1), (A4) ad (A3-EN) seem more restrictive tha EIC as all the correlatios are costraied i (21). However, EIC is harder to iterpret i term of the coefficiets of the regressio vector β. It also depeds o the sig of β. The mai differece is that the cosistecy result i the preset paper holds uiformly o the solutios of the Elastic Net criterio while the result from [13] higes upo the existece of a cosistet solutio for variable selectio. Obviously, this is more restrictive as we are certai to provide the sig-cosistet solutio uder the EIC. Fially, we have also provided results o the sup-orm ad sparsity iequalities o the squared risk of our estimators. Such results are ew for estimators defied with the pealty (20), icludig the S-Lasso ad the Elastic-Net. 8 Experimetal results I the preset sectio we illustrate the good predictio ad selectio properties of the NS-Lasso ad the HS-Lasso estimators. For this purpose, we compare it to the Lasso ad the Elastic-Net. It appears that S-Lasso is a good challeger to the Elastic-Net [36] eve whe large correlatios betwee covariates exist. We further show that i most cases, our procedure outperforms the Elastic-Net ad the Lasso whe we cosider the ratio betwee the relevat selected covariates ad irrelevat selected covariates. Simulatios: Data. Four simulatios are geerated accordig to the liear regressio model y = xβ + σε, ε N(0, 1), x = (ξ 1,...,ξ p ) R p. The first ad the secod examples were itroduced i the origial Lasso paper [25]. The third simulatio creates a grouped covariates situatio. It was itroduced i [36] ad aims to poit the efficiecy of the Elastic-Net compared to the Lasso. The last simulatio itroduces large correlatio betwee successive covariates. 18

(a) In this example, we simulate 20 observations with 8 covariates. The true regression vector is β* = (3, 1.5, 0, 0, 2, 0, 0, 0)', so that only three covariates are truly relevant. Let σ = 3, and let the correlation between ξ_j and ξ_k be such that Cov(ξ_j, ξ_k) = 2^{−|j−k|}.

(b) The second example is the same as the first one, except that we generate 50 observations and that β*_j = 0.85 for every j ∈ {1,...,8}, so that all the covariates are relevant.

(c) In the third example, we simulate 50 data with 40 covariates. The true regression vector is such that β*_j = 3 for j = 1,...,15 and β*_j = 0 for j = 16,...,40. Let σ = 15 and let the covariates be generated as follows:
ξ_j = Z_1 + ε_j, Z_1 ~ N(0, 1), j = 1,...,5,
ξ_j = Z_2 + ε_j, Z_2 ~ N(0, 1), j = 6,...,10,
ξ_j = Z_3 + ε_j, Z_3 ~ N(0, 1), j = 11,...,15,
where the ε_j, j = 1,...,15, are i.i.d. N(0, 0.01) variables. Moreover, for j = 16,...,40, the ξ_j's are i.i.d. N(0, 1) variables.

(d) In the last example, we generate 50 data with 30 covariates. The true regression vector is such that β*_j = 3 − 0.1 j for j = 1,...,10, β*_j = j for j = 20,...,25, and β*_j = 0 for the other j. The noise is such that σ = 9, and the correlations are such that Cov(ξ_j, ξ_k) = exp(−|j − k|/2) for (j, k) ∈ {11,...,25}²; the other covariates are i.i.d. N(0, 1), also independent from ξ_11,...,ξ_25. In this model there are big correlations between relevant covariates and even between relevant and irrelevant covariates.

Validation. The selection of the tuning parameters λ and µ is based on the minimization of a BIC-type criterion [22]. For a given β̂, the associated BIC error is defined as:
BIC(β̂) = ||Y − Xβ̂||²_n + (log(n) σ²/n) df̂(β̂),
where df̂(β̂) is given by (17) if we consider the S-Lasso, and denotes the analogous quantity if we consider the Lasso or the Elastic-Net. Such a criterion provides an accurate estimator which enjoys good variable selection properties ([23] and [30]). In the simulation studies, for each replication, we also provide the Mean Squared Error (MSE) of the selected estimator on a new and independent dataset with the same size as the training set (that is, n). This gives information on the robustness of the procedures.
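To make the simulation designs and the validation step concrete, here is a small sketch (hypothetical helper names) that generates data as in example (c) and evaluates the BIC-type criterion for a given estimate; df_hat stands for the degree-of-freedom estimate (17), or its analogue for the Lasso and the Elastic-Net:

```python
import numpy as np

def make_example_c(n=50, p=40, sigma=15.0, seed=None):
    """Grouped-covariates design of example (c): three groups of five nearly
    identical covariates carry the signal, the remaining 25 are pure noise."""
    rng = np.random.default_rng(seed)
    X = np.empty((n, p))
    for g in range(3):                                   # groups 1-5, 6-10, 11-15
        Z = rng.standard_normal(n)
        for j in range(5 * g, 5 * g + 5):
            X[:, j] = Z + rng.normal(0.0, 0.1, size=n)   # N(0, 0.01) perturbations
    X[:, 15:] = rng.standard_normal((n, p - 15))         # i.i.d. N(0, 1) noise covariates
    beta_star = np.concatenate([3.0 * np.ones(15), np.zeros(p - 15)])
    Y = X @ beta_star + sigma * rng.standard_normal(n)
    return X, Y, beta_star

def bic_error(X, Y, beta_hat, sigma, df_hat):
    """BIC-type criterion: ||Y - X beta_hat||^2_n + log(n) * sigma^2 / n * df_hat."""
    n = X.shape[0]
    rss_n = np.mean((Y - X @ beta_hat) ** 2)
    return rss_n + np.log(n) * sigma ** 2 / n * df_hat
```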

Method     Example (a)    Example (b)    Example (c)    Example (d)
Lasso      3.8 [±0.1]     6.5 [±0.1]     6 [±0.1]       18.4 [±0.2]
E-Net      4.9 [±0.1]     6.9 [±0.1]     15.9 [±0.1]    20.5 [±0.2]
NS-Lasso   3.9 [±0.1]     6.5 [±0.1]     15.3 [±0.2]    18.9 [±0.2]
HS-Lasso   3.5 [±0.1]     5.9 [±0.1]     15 [±0.1]      18.1 [±0.2]

Table 1: Mean of the number of non-zero coefficients [and its standard error] selected respectively by the Lasso, the Elastic-Net (E-Net), the Normalized Smooth Lasso (NS-Lasso) and the Highly Smooth Lasso (HS-Lasso) procedures.

Interpretations. All the results exposed here are based on 200 replications. Figure 1 and Figure 2 give respectively the BIC error and the test error of the considered procedures in each example. Concerning the selection part, Figure 3 shows the frequencies of selection of each covariate for all the procedures, and Table 1 shows the mean number of non-zero coefficients that each procedure selected. Finally, for each procedure, Table 2 gives the ratio between the number of relevant covariates and the number of noise covariates that the procedure selected. Let us call this ratio SNR. Then we can express it as
SNR = Σ_{j∈Â} I(j ∈ A*) / Σ_{j∈Â} I(j ∉ A*).
This is a good indication of the selection power of the procedures. As the Lasso is a special case of the S-Lasso and of the Elastic-Net, the Lasso BIC error (Figure 1) is always larger than the BIC error of the other methods. These seem to have equivalent BIC errors. When considering the test error (Figure 2), it seems again that all the procedures are similar in all of the examples. They manage to produce good predictions independently of the sparsity of the model. The most attractive aspect concerns variable selection. For this purpose we treat each example separately.

Example (a): the Elastic-Net selects a model which is too large (Table 1). This is reflected by the worst SNR (Table 2). As a consequence, we can observe in Figure 3

that it also includes the second covariate more often than the other procedures do. This is due to the grouping effect, as the first covariate is relevant. For similar reasons, the S-Lasso often selects the second covariate. However, this covariate is selected less often than by the Elastic-Net, as the S-Lasso seems to be a little disturbed by the third covariate, which is irrelevant. This aspect of the S-Lasso procedure is also present in the selection of covariate 5, as its neighbor covariates 4 and 6 are irrelevant. We can also observe that the S-Lasso procedure is the one which selects irrelevant covariates least often when these covariates are far away from relevant ones (in terms of index distance). Finally, even if the Lasso procedure selects the relevant covariates less often than the Elastic-Net and the S-Lasso procedures, it also has as good an SNR. The Lasso presents good selection performances in this example.

Method     Example (a)    Example (c)    Example (d)
Lasso      2.3 [±0.1]     2.9 [±0.1]     4.7 [±0.2]
E-Net      1.7 [±0.1]     13.1 [±0.3]    3.4 [±0.2]
NS-Lasso   2.5 [±0.1]     13.5 [±0.3]    6.8 [±0.3]
HS-Lasso   1.79 [±0.1]    11.4 [±0.3]    6.4 [±0.3]

Table 2: Mean of the ratio between the number of relevant covariates and the number of noise covariates (SNR) [and its standard error] that each of the Lasso, the Elastic-Net, the NS-Lasso and the HS-Lasso procedures selected.

Example (b): we can see in Figure 3 how the S-Lasso and Elastic-Net selections depend on how the covariates are ranked. Compared to the Lasso, they both select the covariates in the middle (that is, covariates 2 to 7) more often than the ones at the borders (covariates 1 and 8). We also remark that this aspect is more pronounced for the S-Lasso than for the Elastic-Net.

Example (c): the Lasso procedure performs poorly. It selects more noise covariates and fewer relevant ones than the other procedures (Figure 3). It also has the worst SNR (Table 2). In this example, Figure 3 also shows that the Elastic-Net selects relevant covariates more often than the S-Lasso procedures, but it also selects more noise covariates than the NS-Lasso procedure. Then, even if the Elastic-Net has very good performance in variable selection, the NS-Lasso procedure has similar performances with a close SNR (Table 2). The NS-Lasso appears to have very good performance in this example. However, it again selects relevant covariates at the borders less often than the Elastic-Net.

Example (d): we decompose the study into two parts. First, the independent part, which considers the covariates ξ_1,...,ξ_10 and ξ_26,...,ξ_30. The second part considers the

other covariates, which are dependent.

Figure 1: BIC error in each example. For each plot, we construct the boxplot for procedure 1 = Lasso; 2 = Elastic-Net; 3 = NS-Lasso; 4 = HS-Lasso.

Regarding the independent covariates, Figure 3 shows that all the procedures perform roughly in the same way, though the S-Lasso procedure enjoys a slightly better selection (in both the relevant and the noise group of covariates). For the dependent and relevant covariates, the Lasso performs worse than the other procedures. It clearly selects these relevant covariates less often. As in example (c), the reason is that the Lasso modification of the LARS algorithm tends to select only one representative of a group of highly correlated covariates. The high value of the SNR for the Lasso (when compared to the Elastic-Net) is explained by its good performance when it treats noise covariates. In this example the Elastic-Net correctly selects relevant covariates, but it is also the procedure which selects the most noise covariates, and it has the worst SNR. We also note that both the NS-Lasso and the HS-Lasso outperform the Lasso and the Elastic-Net. This gain is especially pronounced in the center of the groups. Observe that for the covariates ξ_20, ξ_21, ξ_25 and ξ_26 (that

is, at the borders), the NS-Lasso and the HS-Lasso have slightly worse performance than in the center of the groups. This is again due to the attraction imposed by the fusion penalty (3) in the S-Lasso criterion.

Figure 2: Test error in each example. For each plot, we construct the boxplot for procedure 1 = Lasso; 2 = Elastic-Net; 3 = NS-Lasso; 4 = HS-Lasso.

Conclusion of the experiments. The S-Lasso procedure seems to respond to our expectations. Indeed, when successive correlations exist, it tends to select the whole group of relevant covariates and not only one representative of the group, as done by the Lasso procedure. It also appears that the S-Lasso procedure has very good selection properties with respect to both relevant and noise covariates. However, it has slightly worse performance at the borders than in the centers of groups of covariates (due to the attraction of irrelevant covariates). It almost always has a better SNR than the Elastic-Net, so we can consider it a good challenger to this procedure.

Figure 3: Number of covariate detections for each procedure in all the examples (Top-Left: Example (a); Top-Right: Example (b); Bottom-Left: Example (c); Bottom-Right: Example (d)).

9 Conclusion

In this paper, we introduced a new procedure called the Smooth-Lasso which takes into account correlations between successive covariates. We established several theoretical results. The main conclusions are that when p ≤ n, the S-Lasso is consistent in variable selection and asymptotically normal with a rate lower than √n. In the high dimensional setting, we provided a condition, related to the mutual coherence condition, under which the thresholded version of the Smooth-Lasso is consistent in variable selection. This condition is fulfilled when correlations between successive covariates exist. Moreover, simulation studies showed that normalized versions of the Smooth-Lasso have nice variable selection properties, which are emphasized when high correlations exist between successive covariates. It appears that the Smooth-Lasso almost always outperforms the Lasso and is a good challenger to the Elastic-Net.

Appendix A.

Since the matrix C_n + µ_n J̃ plays a crucial role in the proofs, we shorten the notation to K_n = C_n + µ_n J̃ and, when p ≤ n, we define its limit K = C + µ J̃. In this appendix we prove the results for the case p ≤ n.

Proof of Theorem 1. Let
Ψ_n(u) = ||Y − X(β* + v_n u)||²_n + λ_n Σ_{j=1}^p |β*_j + v_n u_j| + µ_n Σ_{j=2}^p (β*_j − β*_{j−1} + v_n (u_j − u_{j−1}))²,
for u = (u_1,...,u_p)' ∈ R^p, and let û_n = Argmin_u Ψ_n(u). Let ε = (ε_1,...,ε_n)'; we then

have
Ψ_n(u) − Ψ_n(0) =: V_n(u)
= v_n² u'(X'X/n)u − 2 v_n <ε, Xu>_n + λ_n Σ_{j=1}^p ( |β*_j + v_n u_j| − |β*_j| ) + µ_n Σ_{j=2}^p { (β*_j − β*_{j−1} + v_n (u_j − u_{j−1}))² − (β*_j − β*_{j−1})² }
= v_n² [ u'(X'X/n)u − (2/v_n) <ε, Xu>_n + (λ_n/v_n) Σ_{j=1}^p v_n^{-1} ( |β*_j + v_n u_j| − |β*_j| ) + (µ_n/v_n) Σ_{j=2}^p v_n^{-1} { (β*_j − β*_{j−1} + v_n (u_j − u_{j−1}))² − (β*_j − β*_{j−1})² } ]
= v_n² V̄_n(u).
Note that û_n = Argmin_u Ψ_n(u) = Argmin_u V̄_n(u); we then have to consider the limiting distribution of V̄_n(u). First, we have X'X/n → C. Moreover, as 1/(√n v_n) → κ and as, given X, the random variable n^{-1/2} ε'X →_D W with W ~ N(0, σ²C), Slutsky's theorem implies that
(2/v_n) <ε, Xu>_n →_D 2κ W'u.
Now we treat the last two terms. If β*_j ≠ 0,
v_n^{-1} ( |β*_j + v_n u_j| − |β*_j| ) → u_j Sgn(β*_j),
and it is equal to |u_j| otherwise. Then, as λ_n/v_n → λ,
(λ_n/v_n) Σ_{j=1}^p v_n^{-1} ( |β*_j + v_n u_j| − |β*_j| ) → λ Σ_{j=1}^p { u_j Sgn(β*_j) I(β*_j ≠ 0) + |u_j| I(β*_j = 0) }.
For the remaining term, we show that if β*_j ≠ β*_{j−1},
v_n^{-1} { (β*_j − β*_{j−1} + v_n (u_j − u_{j−1}))² − (β*_j − β*_{j−1})² } → 2 (u_j − u_{j−1})(β*_j − β*_{j−1}),
and it is equal to v_n (u_j − u_{j−1})², which tends to 0, otherwise. Then, as µ_n/v_n → µ,
(µ_n/v_n) Σ_{j=2}^p v_n^{-1} { (β*_j − β*_{j−1} + v_n (u_j − u_{j−1}))² − (β*_j − β*_{j−1})² } →

2µ Σ_{j=2}^p { (u_j − u_{j−1})(β*_j − β*_{j−1}) I(β*_j ≠ β*_{j−1}) }.
Therefore we have V̄_n(u) → V(u) in probability, for every u ∈ R^p. And since C is a positive definite matrix, V(u) has a unique minimizer. Moreover, as V̄_n(u) is convex, standard M-estimation results [28] lead to û_n →_D Argmin_u V(u).

Proof of Theorem 2. We begin by giving two results which we will use in our proof. The first one concerns the optimality conditions of the S-Lasso estimator. Recall that by definition
β̂^SL = Argmin_{β ∈ R^p} ||Y − Xβ||²_n + λ ||β||_1 + µ β'J̃β.
Denote by f(a)|_{a=a_0} the evaluation of the function f at the point a_0. As the above problem is a non-differentiable convex problem, classical tools lead to the following optimality conditions for the S-Lasso estimator:

Lemma 3. The vector β̂^SL = (β̂^SL_1,...,β̂^SL_p)' is the S-Lasso estimate as defined in (2)-(3) if and only if
d( ||Y − Xβ||²_n + µ β'J̃β )/dβ_j |_{β_j = β̂^SL_j} = −λ Sgn(β̂^SL_j) for j such that β̂^SL_j ≠ 0, (22)
| d( ||Y − Xβ||²_n + µ β'J̃β )/dβ_j |_{β_j = β̂^SL_j} | ≤ λ for j such that β̂^SL_j = 0. (23)

Recall that A* = {j : β*_j ≠ 0}. The second result states that if we restrict ourselves to the covariates which we are after (i.e. the indexes in A*), we get a consistent estimate as soon as the regularization parameters λ_n and µ_n are properly chosen.

Lemma 4. Let β̃_{A*} be a minimizer of
||Y − X_{·,A*} β_{A*}||²_n + λ_n Σ_{j∈A*} |β_j| + µ_n β'_{A*} J̃_{A*,A*} β_{A*}.
If λ_n → 0 and µ_n → 0, then β̃_{A*} converges to β*_{A*} in probability.

This lemma can be seen as a special and restricted case of Theorem 1. We now prove Theorem 2. Let β̃_{A*} be as in Lemma 4. We define an estimator β̃ by extending β̃_{A*} by zeros on (A*)^c. Hence, consistency of β̃ is ensured as a simple consequence of Lemma 4. Now we need to prove that, with probability tending to one, this estimator is optimal for the problem (2)-(3), that is, that the optimality conditions (22)-(23) are fulfilled with probability tending to one. From now on, we write A for A*. By definition of β̃_A, the optimality condition (22) is satisfied. We now must check the optimality condition (23). Combining the fact that Y = Xβ* + ε with the convergence of the matrix X'X/n and of the vector ε'X/n, we have
n^{-1} (X'Y − X'X_{·,A} β̃_A) = C_{·,A} (β*_A − β̃_A) + O_p(n^{-1/2}). (24)
Moreover, the optimality condition (22) for the estimator β̃ can be written as
n^{-1} (X'_{·,A} Y − X'_{·,A} X_{·,A} β̃_A) = (λ_n/2) Sgn(β̃_A) − µ_n J̃_{A,A} (β*_A − β̃_A) + µ_n J̃_{A,A} β*_A. (25)
Combining (24) and (25), we easily obtain
(β*_A − β̃_A) = (C_{A,A} + µ_n J̃_{A,A})^{-1} ( (λ_n/2) Sgn(β̃_A) + µ_n J̃_{A,A} β*_A ) + O_p(n^{-1/2}).
Since β̃ is consistent and λ_n n^{1/2} → ∞, for each j ∈ A^c the left-hand side of the optimality condition (23),
(n λ_n)^{-1} (ξ'_j Y − ξ'_j X_{·,A} β̃_A) − (µ_n/λ_n) J̃_{j,A} β̃_A =: L^{(n)}_j,
converges in probability to
C_{j,A} (K_{A,A})^{-1} ( 2^{-1} Sgn(β*_A) + (µ/λ) J̃_{A,A} β*_A ) − (µ/λ) J̃_{j,A} β*_A =: L_j.
By condition (6), this quantity is strictly smaller than one in absolute value. Then
lim_n P( for all j ∈ A^c, |L^{(n)}_j| ≤ 1 ) ≥ P( for all j ∈ A^c, |L_j| ≤ 1 ) = 1,
which ends the proof.


More information

Sequences and Series of Functions

Sequences and Series of Functions Chapter 6 Sequeces ad Series of Fuctios 6.1. Covergece of a Sequece of Fuctios Poitwise Covergece. Defiitio 6.1. Let, for each N, fuctio f : A R be defied. If, for each x A, the sequece (f (x)) coverges

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract I this lecture we derive risk bouds for kerel methods. We will start by showig that Soft Margi kerel SVM correspods to miimizig

More information

Lecture 22: Review for Exam 2. 1 Basic Model Assumptions (without Gaussian Noise)

Lecture 22: Review for Exam 2. 1 Basic Model Assumptions (without Gaussian Noise) Lecture 22: Review for Exam 2 Basic Model Assumptios (without Gaussia Noise) We model oe cotiuous respose variable Y, as a liear fuctio of p umerical predictors, plus oise: Y = β 0 + β X +... β p X p +

More information

Random Variables, Sampling and Estimation

Random Variables, Sampling and Estimation Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig

More information

A statistical method to determine sample size to estimate characteristic value of soil parameters

A statistical method to determine sample size to estimate characteristic value of soil parameters A statistical method to determie sample size to estimate characteristic value of soil parameters Y. Hojo, B. Setiawa 2 ad M. Suzuki 3 Abstract Sample size is a importat factor to be cosidered i determiig

More information

Gini Index and Polynomial Pen s Parade

Gini Index and Polynomial Pen s Parade Gii Idex ad Polyomial Pe s Parade Jules Sadefo Kamdem To cite this versio: Jules Sadefo Kamdem. Gii Idex ad Polyomial Pe s Parade. 2011. HAL Id: hal-00582625 https://hal.archives-ouvertes.fr/hal-00582625

More information

A Risk Comparison of Ordinary Least Squares vs Ridge Regression

A Risk Comparison of Ordinary Least Squares vs Ridge Regression Joural of Machie Learig Research 14 (2013) 1505-1511 Submitted 5/12; Revised 3/13; Published 6/13 A Risk Compariso of Ordiary Least Squares vs Ridge Regressio Paramveer S. Dhillo Departmet of Computer

More information

Outline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression

Outline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression REGRESSION 1 Outlie Liear regressio Regularizatio fuctios Polyomial curve fittig Stochastic gradiet descet for regressio MLE for regressio Step-wise forward regressio Regressio methods Statistical techiques

More information

6.867 Machine learning, lecture 7 (Jaakkola) 1

6.867 Machine learning, lecture 7 (Jaakkola) 1 6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week 2 Lecture: Cocept Check Exercises Starred problems are optioal. Excess Risk Decompositio 1. Let X = Y = {1, 2,..., 10}, A = {1,..., 10, 11} ad suppose the data distributio

More information

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices Radom Matrices with Blocks of Itermediate Scale Strogly Correlated Bad Matrices Jiayi Tog Advisor: Dr. Todd Kemp May 30, 07 Departmet of Mathematics Uiversity of Califoria, Sa Diego Cotets Itroductio Notatio

More information

Asymptotic Results for the Linear Regression Model

Asymptotic Results for the Linear Regression Model Asymptotic Results for the Liear Regressio Model C. Fli November 29, 2000 1. Asymptotic Results uder Classical Assumptios The followig results apply to the liear regressio model y = Xβ + ε, where X is

More information

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random Part III. Areal Data Aalysis 0. Comparative Tests amog Spatial Regressio Models While the otio of relative likelihood values for differet models is somewhat difficult to iterpret directly (as metioed above),

More information

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A.

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A. Radom Walks o Discrete ad Cotiuous Circles by Jeffrey S. Rosethal School of Mathematics, Uiversity of Miesota, Mieapolis, MN, U.S.A. 55455 (Appeared i Joural of Applied Probability 30 (1993), 780 789.)

More information

Information-based Feature Selection

Information-based Feature Selection Iformatio-based Feature Selectio Farza Faria, Abbas Kazeroui, Afshi Babveyh Email: {faria,abbask,afshib}@staford.edu 1 Itroductio Feature selectio is a topic of great iterest i applicatios dealig with

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak

More information

Intro to Learning Theory

Intro to Learning Theory Lecture 1, October 18, 2016 Itro to Learig Theory Ruth Urer 1 Machie Learig ad Learig Theory Comig soo 2 Formal Framework 21 Basic otios I our formal model for machie learig, the istaces to be classified

More information

Rank tests and regression rank scores tests in measurement error models

Rank tests and regression rank scores tests in measurement error models Rak tests ad regressio rak scores tests i measuremet error models J. Jurečková ad A.K.Md.E. Saleh Charles Uiversity i Prague ad Carleto Uiversity i Ottawa Abstract The rak ad regressio rak score tests

More information

1 Review of Probability & Statistics

1 Review of Probability & Statistics 1 Review of Probability & Statistics a. I a group of 000 people, it has bee reported that there are: 61 smokers 670 over 5 960 people who imbibe (drik alcohol) 86 smokers who imbibe 90 imbibers over 5

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

Infinite Sequences and Series

Infinite Sequences and Series Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet

More information

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece,, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet as

More information

Sequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

Sequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece 1, 1, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet

More information

Complex Analysis Spring 2001 Homework I Solution

Complex Analysis Spring 2001 Homework I Solution Complex Aalysis Sprig 2001 Homework I Solutio 1. Coway, Chapter 1, sectio 3, problem 3. Describe the set of poits satisfyig the equatio z a z + a = 2c, where c > 0 ad a R. To begi, we see from the triagle

More information

Empirical Process Theory and Oracle Inequalities

Empirical Process Theory and Oracle Inequalities Stat 928: Statistical Learig Theory Lecture: 10 Empirical Process Theory ad Oracle Iequalities Istructor: Sham Kakade 1 Risk vs Risk See Lecture 0 for a discussio o termiology. 2 The Uio Boud / Boferoi

More information

1 Inferential Methods for Correlation and Regression Analysis

1 Inferential Methods for Correlation and Regression Analysis 1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet

More information

Linear Support Vector Machines

Linear Support Vector Machines Liear Support Vector Machies David S. Roseberg The Support Vector Machie For a liear support vector machie (SVM), we use the hypothesis space of affie fuctios F = { f(x) = w T x + b w R d, b R } ad evaluate

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT

More information

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4.

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4. 4. BASES I BAACH SPACES 39 4. BASES I BAACH SPACES Sice a Baach space X is a vector space, it must possess a Hamel, or vector space, basis, i.e., a subset {x γ } γ Γ whose fiite liear spa is all of X ad

More information

Problem Set 4 Due Oct, 12

Problem Set 4 Due Oct, 12 EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios

More information

Statistics 511 Additional Materials

Statistics 511 Additional Materials Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability

More information

Output Analysis and Run-Length Control

Output Analysis and Run-Length Control IEOR E4703: Mote Carlo Simulatio Columbia Uiversity c 2017 by Marti Haugh Output Aalysis ad Ru-Legth Cotrol I these otes we describe how the Cetral Limit Theorem ca be used to costruct approximate (1 α%

More information

17. Joint distributions of extreme order statistics Lehmann 5.1; Ferguson 15

17. Joint distributions of extreme order statistics Lehmann 5.1; Ferguson 15 17. Joit distributios of extreme order statistics Lehma 5.1; Ferguso 15 I Example 10., we derived the asymptotic distributio of the maximum from a radom sample from a uiform distributio. We did this usig

More information

Introduction to Machine Learning DIS10

Introduction to Machine Learning DIS10 CS 189 Fall 017 Itroductio to Machie Learig DIS10 1 Fu with Lagrage Multipliers (a) Miimize the fuctio such that f (x,y) = x + y x + y = 3. Solutio: The Lagragia is: L(x,y,λ) = x + y + λ(x + y 3) Takig

More information

IP Reference guide for integer programming formulations.

IP Reference guide for integer programming formulations. IP Referece guide for iteger programmig formulatios. by James B. Orli for 15.053 ad 15.058 This documet is iteded as a compact (or relatively compact) guide to the formulatio of iteger programs. For more

More information

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator Slide Set 13 Liear Model with Edogeous Regressors ad the GMM estimator Pietro Coretto pcoretto@uisa.it Ecoometrics Master i Ecoomics ad Fiace (MEF) Uiversità degli Studi di Napoli Federico II Versio: Friday

More information

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering CEE 5 Autum 005 Ucertaity Cocepts for Geotechical Egieerig Basic Termiology Set A set is a collectio of (mutually exclusive) objects or evets. The sample space is the (collectively exhaustive) collectio

More information

Lecture 3 The Lebesgue Integral

Lecture 3 The Lebesgue Integral Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified

More information

Technical Proofs for Homogeneity Pursuit

Technical Proofs for Homogeneity Pursuit Techical Proofs for Homogeeity Pursuit bstract This is the supplemetal material for the article Homogeeity Pursuit, submitted for publicatio i Joural of the merica Statistical ssociatio. B Proofs B. Proof

More information

ECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors

ECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors ECONOMETRIC THEORY MODULE XIII Lecture - 34 Asymptotic Theory ad Stochastic Regressors Dr. Shalabh Departmet of Mathematics ad Statistics Idia Istitute of Techology Kapur Asymptotic theory The asymptotic

More information

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ.

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ. 2 5. Weighted umber of late jobs 5.1. Release dates ad due dates: maximimizig the weight of o-time jobs Oce we add release dates, miimizig the umber of late jobs becomes a sigificatly harder problem. For

More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

Quantile regression with multilayer perceptrons.

Quantile regression with multilayer perceptrons. Quatile regressio with multilayer perceptros. S.-F. Dimby ad J. Rykiewicz Uiversite Paris 1 - SAMM 90 Rue de Tolbiac, 75013 Paris - Frace Abstract. We cosider oliear quatile regressio ivolvig multilayer

More information

Fall 2013 MTH431/531 Real analysis Section Notes

Fall 2013 MTH431/531 Real analysis Section Notes Fall 013 MTH431/531 Real aalysis Sectio 8.1-8. Notes Yi Su 013.11.1 1. Defiitio of uiform covergece. We look at a sequece of fuctios f (x) ad study the coverget property. Notice we have two parameters

More information

MA Advanced Econometrics: Properties of Least Squares Estimators

MA Advanced Econometrics: Properties of Least Squares Estimators MA Advaced Ecoometrics: Properties of Least Squares Estimators Karl Whela School of Ecoomics, UCD February 5, 20 Karl Whela UCD Least Squares Estimators February 5, 20 / 5 Part I Least Squares: Some Fiite-Sample

More information

Testing the number of parameters with multidimensional MLP

Testing the number of parameters with multidimensional MLP Testig the umber of parameters with multidimesioal MLP Joseph Rykiewicz To cite this versio: Joseph Rykiewicz. Testig the umber of parameters with multidimesioal MLP. ASMDA 2005, 2005, Brest, Frace. pp.561-568,

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

Let us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f.

Let us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f. Lecture 5 Let us give oe more example of MLE. Example 3. The uiform distributio U[0, ] o the iterval [0, ] has p.d.f. { 1 f(x =, 0 x, 0, otherwise The likelihood fuctio ϕ( = f(x i = 1 I(X 1,..., X [0,

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

Discrete Mathematics for CS Spring 2008 David Wagner Note 22 CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig

More information

10-701/ Machine Learning Mid-term Exam Solution

10-701/ Machine Learning Mid-term Exam Solution 0-70/5-78 Machie Learig Mid-term Exam Solutio Your Name: Your Adrew ID: True or False (Give oe setece explaatio) (20%). (F) For a cotiuous radom variable x ad its probability distributio fuctio p(x), it

More information

6.867 Machine learning

6.867 Machine learning 6.867 Machie learig Mid-term exam October, ( poits) Your ame ad MIT ID: Problem We are iterested here i a particular -dimesioal liear regressio problem. The dataset correspodig to this problem has examples

More information

Since X n /n P p, we know that X n (n. Xn (n X n ) Using the asymptotic result above to obtain an approximation for fixed n, we obtain

Since X n /n P p, we know that X n (n. Xn (n X n ) Using the asymptotic result above to obtain an approximation for fixed n, we obtain Assigmet 9 Exercise 5.5 Let X biomial, p, where p 0, 1 is ukow. Obtai cofidece itervals for p i two differet ways: a Sice X / p d N0, p1 p], the variace of the limitig distributio depeds oly o p. Use the

More information

Lecture 2: Monte Carlo Simulation

Lecture 2: Monte Carlo Simulation STAT/Q SCI 43: Itroductio to Resamplig ethods Sprig 27 Istructor: Ye-Chi Che Lecture 2: ote Carlo Simulatio 2 ote Carlo Itegratio Assume we wat to evaluate the followig itegratio: e x3 dx What ca we do?

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

Study the bias (due to the nite dimensional approximation) and variance of the estimators

Study the bias (due to the nite dimensional approximation) and variance of the estimators 2 Series Methods 2. Geeral Approach A model has parameters (; ) where is ite-dimesioal ad is oparametric. (Sometimes, there is o :) We will focus o regressio. The fuctio is approximated by a series a ite

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

Estimation for Complete Data

Estimation for Complete Data Estimatio for Complete Data complete data: there is o loss of iformatio durig study. complete idividual complete data= grouped data A complete idividual data is the oe i which the complete iformatio of

More information

The Growth of Functions. Theoretical Supplement

The Growth of Functions. Theoretical Supplement The Growth of Fuctios Theoretical Supplemet The Triagle Iequality The triagle iequality is a algebraic tool that is ofte useful i maipulatig absolute values of fuctios. The triagle iequality says that

More information

Supplemental Material: Proofs

Supplemental Material: Proofs Proof to Theorem Supplemetal Material: Proofs Proof. Let be the miimal umber of traiig items to esure a uique solutio θ. First cosider the case. It happes if ad oly if θ ad Rak(A) d, which is a special

More information

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization ECE 90 Lecture 4: Maximum Likelihood Estimatio ad Complexity Regularizatio R Nowak 5/7/009 Review : Maximum Likelihood Estimatio We have iid observatios draw from a ukow distributio Y i iid p θ, i,, where

More information

Lecture 2. The Lovász Local Lemma

Lecture 2. The Lovász Local Lemma Staford Uiversity Sprig 208 Math 233A: No-costructive methods i combiatorics Istructor: Ja Vodrák Lecture date: Jauary 0, 208 Origial scribe: Apoorva Khare Lecture 2. The Lovász Local Lemma 2. Itroductio

More information

Beurling Integers: Part 2

Beurling Integers: Part 2 Beurlig Itegers: Part 2 Isomorphisms Devi Platt July 11, 2015 1 Prime Factorizatio Sequeces I the last article we itroduced the Beurlig geeralized itegers, which ca be represeted as a sequece of real umbers

More information

We are mainly going to be concerned with power series in x, such as. (x)} converges - that is, lims N n

We are mainly going to be concerned with power series in x, such as. (x)} converges - that is, lims N n Review of Power Series, Power Series Solutios A power series i x - a is a ifiite series of the form c (x a) =c +c (x a)+(x a) +... We also call this a power series cetered at a. Ex. (x+) is cetered at

More information

Measure and Measurable Functions

Measure and Measurable Functions 3 Measure ad Measurable Fuctios 3.1 Measure o a Arbitrary σ-algebra Recall from Chapter 2 that the set M of all Lebesgue measurable sets has the followig properties: R M, E M implies E c M, E M for N implies

More information

CHAPTER 10 INFINITE SEQUENCES AND SERIES

CHAPTER 10 INFINITE SEQUENCES AND SERIES CHAPTER 10 INFINITE SEQUENCES AND SERIES 10.1 Sequeces 10.2 Ifiite Series 10.3 The Itegral Tests 10.4 Compariso Tests 10.5 The Ratio ad Root Tests 10.6 Alteratig Series: Absolute ad Coditioal Covergece

More information

1 Duality revisited. AM 221: Advanced Optimization Spring 2016

1 Duality revisited. AM 221: Advanced Optimization Spring 2016 AM 22: Advaced Optimizatio Sprig 206 Prof. Yaro Siger Sectio 7 Wedesday, Mar. 9th Duality revisited I this sectio, we will give a slightly differet perspective o duality. optimizatio program: f(x) x R

More information

January 25, 2017 INTRODUCTION TO MATHEMATICAL STATISTICS

January 25, 2017 INTRODUCTION TO MATHEMATICAL STATISTICS Jauary 25, 207 INTRODUCTION TO MATHEMATICAL STATISTICS Abstract. A basic itroductio to statistics assumig kowledge of probability theory.. Probability I a typical udergraduate problem i probability, we

More information

Coefficient of variation and Power Pen s parade computation

Coefficient of variation and Power Pen s parade computation Coefficiet of variatio ad Power Pe s parade computatio Jules Sadefo Kamdem To cite this versio: Jules Sadefo Kamdem. Coefficiet of variatio ad Power Pe s parade computatio. 20. HAL Id: hal-0058658

More information

Problem Set 2 Solutions

Problem Set 2 Solutions CS271 Radomess & Computatio, Sprig 2018 Problem Set 2 Solutios Poit totals are i the margi; the maximum total umber of poits was 52. 1. Probabilistic method for domiatig sets 6pts Pick a radom subset S

More information