Regularization with the Smooth-Lasso procedure


Regularization with the Smooth-Lasso procedure. Mohamed Hebiri. To cite this version: Mohamed Hebiri. Regularization with the Smooth-Lasso procedure. <hal v2> HAL Id: hal Submitted on 15 Oct 2008. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Regularization with the Smooth-Lasso procedure. Mohamed Hebiri. Laboratoire de Probabilités et Modèles Aléatoires, CNRS-UMR 7599, Université Paris 7 - Diderot, UFR de Mathématiques, 175 rue de Chevaleret, F Paris, France.

Abstract. We consider the linear regression problem. We propose the S-Lasso procedure to estimate the unknown regression parameters. This estimator enjoys sparsity of the representation while taking into account correlations between successive covariates (or predictors). The study covers the case p ≫ n, i.e. the number of covariates is much larger than the number of observations. From a theoretical point of view, for fixed p, we establish asymptotic normality and consistency in variable selection results for our procedure. When p ≫ n, we provide variable selection consistency results and show that the S-Lasso achieves a Sparsity Inequality, i.e., a bound in terms of the number of non-zero components of the oracle vector. It appears that the S-Lasso has nice variable selection properties compared to its challengers. Furthermore, we provide an estimator of the effective degree of freedom of the S-Lasso estimator. A simulation study shows that the S-Lasso performs better than the Lasso as far as variable selection is concerned, especially when high correlations between successive covariates exist. This procedure also appears to be a good challenger to the Elastic-Net [36].

Keywords: Lasso, LARS, Sparsity, Variable selection, Regularization paths, Mutual coherence, High-dimensional data. AMS 2000 subject classifications: Primary 62J05, 62J07; Secondary 62H20, 62F12. hebiri@math.jussieu.fr

1 Introduction

We focus on the usual linear regression model:
y_i = x_i' β* + ε_i, i = 1,...,n, (1)
where the design x_i = (x_{i,1},...,x_{i,p})' ∈ R^p is deterministic, β* = (β*_1,...,β*_p)' ∈ R^p is the unknown parameter and ε_1,...,ε_n are independent identically distributed (i.i.d.) centered Gaussian random variables with known variance σ². We wish to estimate β* in the sparse case, that is, when many of its unknown components equal zero. Thus only a subset of the design covariates (ξ_j)_j is truly of interest, where ξ_j = (x_{1,j},...,x_{n,j})', j = 1,...,p. Moreover, the case p ≫ n is not excluded, so that we can consider p depending on n. In such a framework, two main issues arise: i) the interpretability of the resulting prediction; ii) the control of the variance in the estimation. Regularization is therefore needed. For this purpose we use selection-type procedures of the following form:
β̂ = Argmin_{β ∈ R^p} { ||Y − Xβ||²_n + pen(β) }, (2)
where X = (x_1',...,x_n')', Y = (y_1,...,y_n)' and pen : R^p → R is a positive convex function called the penalty. For any vector a = (a_1,...,a_n)', we have adopted the notation ||a||²_n = n^{-1} Σ_{i=1}^n a_i² (we denote by <·,·>_n the corresponding inner product in R^n). The choice of the penalty appears to be crucial. Although well-suited for variable selection purposes, concave-type penalties ([12], [27] and [6]) are often computationally hard to optimize. Lasso-type procedures (modifications of the l_1-penalized least squares (Lasso) estimator introduced by Tibshirani [25]) have been extensively studied during the last few years. Among many others, see [2, 4, 34] and references therein. Such procedures seem to respond to our objective as they perform both regression parameter estimation and variable selection with low computational cost. We will explore this type of procedure in our study.

In this paper, we propose a novel modification of the Lasso that we call the Smooth-Lasso (S-Lasso) estimator. It is defined as the solution of the optimization problem (2) when the penalty function is a combination of the Lasso penalty (i.e., Σ_{j=1}^p |β_j|) and the l_2-fusion penalty (i.e., Σ_{j=2}^p (β_j − β_{j−1})²). The l_2-fusion penalty was first introduced in [15]. We add it to the Lasso procedure in order to overcome the variable selection problems observed with the Lasso estimator. Indeed, the Lasso estimator has good selection properties but fails in some situations. More precisely, in several works ([2, 16, 18, 29, 32, 34, 35] among others) conditions for the consistency in variable selection of the Lasso procedure are given. It was shown that the Lasso is

not consistent when high correlations exist between the covariates. We give similar consistency conditions for the S-Lasso procedure and show that it is consistent in variable selection in many more situations than the Lasso estimator. From a practical point of view, problems are also encountered when we solve the Lasso criterion with the Lasso modification of the LARS algorithm [10]. Indeed, this algorithm tends to select only one representative covariate in each group of correlated covariates. We attempt to respond to this problem in the case where the covariates are ranked, so that high correlations can exist between successive covariates. We will see through simulations that such situations support the use of the S-Lasso estimator. This estimator is inspired by the Fused-Lasso [26]. Both the S-Lasso and the Fused-Lasso combine an l_1 penalty with a fusion term [15]. The fusion term is suggested to catch correlations between covariates. More relevant covariates can then be selected due to correlations between them. The main difference between the two procedures is that we use the l_2 distance between successive coefficients (i.e., the l_2-fusion penalty) whereas the Fused-Lasso uses the l_1 distance (i.e., the l_1-fusion penalty: Σ_{j=2}^p |β_j − β_{j−1}|). Hence, compared to the Fused-Lasso, we sacrifice sparsity between successive coefficients in the estimation of β* in favor of an easier optimization due to the strict convexity of the l_2 distance. However, sparsity is still ensured by the Lasso penalty, and the l_2-fusion penalty helps us to catch correlations between covariates. Consequently, even if there is no perfect match between successive coefficients, our results remain interpretable. Moreover, when successive coefficients are significantly different, a perfect match does not seem really appropriate. From a theoretical point of view, the l_2 distance also helps us to provide theoretical properties for the S-Lasso, which in some situations appears to outperform the Lasso and the Elastic-Net [36], another Lasso-type procedure. Let us mention that variable selection consistency of the Fused-Lasso and of the corresponding Fused adaptive Lasso has also been studied in [20], but in a different context from the one in the present paper. The results obtained in [20] are established not only under the sparsity assumption; the model is also supposed to be blocky, that is, the non-zero coefficients are represented in a block fashion with equal values inside each block. Many techniques have been proposed to overcome the weaknesses of the Lasso. The Fused-Lasso procedure is one of them, and we mention here some of the most popular methods. The Adaptive Lasso was introduced in [35]; it is similar to the Lasso but with adaptive weights used to penalize each regression coefficient separately. This procedure reaches the Oracle Properties (i.e. consistency in variable selection and asymptotic normality). Another approach is used in the Relaxed Lasso [17] and aims to doubly control the Lasso estimate: one parameter to control variable selection and the other to control shrinkage of the selected coefficients. To overcome

the problem due to the correlation between covariates, group variable selection has been proposed by Yuan and Lin [31] with the Group-Lasso procedure, which selects groups of correlated covariates instead of single covariates at each step. A first step towards a consistency study was proposed in [1], and Sparsity Inequalities were given in [5]. Another choice of penalty has been proposed with the Elastic-Net [36]. It is in the same spirit that we shall treat the S-Lasso from a theoretical point of view.

The paper is organized as follows. In the next section, we present one way to solve the S-Lasso problem with the attractive property of piecewise linearity of its regularization path. Section 3 gives theoretical performances of the considered estimator, such as consistency in variable selection and asymptotic normality when p ≤ n, whereas consistency in estimation and variable selection in the high dimensional case are considered in Section 4. We also give an estimate of the effective degree of freedom of the S-Lasso estimator in Section 5. Then, we provide a way to control the variance of the estimator by scaling in Section 6, where a connection with soft-thresholding is also established. A generalization and a comparative study with the Elastic-Net are given in Section 7. We finally give experimental results in Section 8, showing the S-Lasso performances against some popular methods. All proofs are postponed to an Appendix section.

2 The S-Lasso procedure

As described above, we define the S-Lasso estimator β̂^SL as the solution of the optimization problem (2) when the penalty function is:
pen(β) = λ ||β||_1 + µ Σ_{j=2}^p (β_j − β_{j−1})², (3)
where λ and µ are two positive parameters that control the smoothness of our estimator. For any vector a = (a_1,...,a_p)', we have used the notation ||a||_1 = Σ_{j=1}^p |a_j|. Note that when µ = 0, the solution is the Lasso estimator, so that the Lasso appears as a special case of the S-Lasso estimator. Now we deal with the resolution of the S-Lasso problem (2)-(3) and its computational cost. From now on, we suppose w.l.o.g. that X = (x_1',...,x_n')' is standardized (that is, n^{-1} Σ_{i=1}^n x_{i,j}² = 1 and n^{-1} Σ_{i=1}^n x_{i,j} = 0) and that Y = (y_1,...,y_n)' is centered (that is, n^{-1} Σ_{i=1}^n y_i = 0). The following lemma shows that the S-Lasso criterion can be expressed as a Lasso criterion by augmenting the data artificially.

Lemma 1. Given the data set (X, Y) and (λ, µ), define the extended dataset (X̃, Ỹ) by
X̃ = (1 + µ)^{-1} (X' , √(nµ) J')' and Ỹ = (Y' , 0')',
where 0 is a vector of size p containing only zeros and J is the p × p matrix whose first row is identically zero and whose j-th row, for j = 2,...,p, has −1 in position j−1, 1 in position j and 0 elsewhere, so that (Jβ)_j = β_j − β_{j−1}. (4)
Let r = λ/(1 + µ) and b = (1 + µ) β. Then the S-Lasso criterion can be written as
||Ỹ − X̃b||²_n + r ||b||_1.
Let b̂ be the minimizer of this Lasso criterion; then β̂^SL = b̂/(1 + µ).

This result is a consequence of simple algebra. Lemma 1 motivates the following comments on the S-Lasso procedure.

Remark 1 (Regularization paths). The S-Lasso modification of the LARS algorithm is an iterative algorithm. For a fixed µ (appearing in (3)), it constructs at each step an estimator based on the correlation between the covariates and the current residual. Each step corresponds to a value of λ. Then, for a fixed µ, we get the evolution of the S-Lasso estimator coefficient values when λ varies. This evolution describes the regularization paths of the S-Lasso estimator, which are piecewise linear [21]. This property implies that the S-Lasso problem can be solved with the same computational cost as the ordinary least squares (OLS) estimate, using the Lasso modification of the LARS algorithm.

Remark 2 (Implementation). The number of covariates that the LARS algorithm and its Lasso version can select is limited by the number of rows of the matrix X. Applied to the augmented data (X̃, Ỹ) introduced in Lemma 1, the Lasso modification of the LARS algorithm is able to select all p covariates. We are then no longer limited by the sample size, as we are for the Lasso [10].
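As a small numerical illustration of Lemma 1 (a sketch only, not the author's implementation: the helper names difference_matrix and s_lasso_augmented are hypothetical, and a plain proximal-gradient loop stands in for the LARS algorithm used in the paper), one can build the augmented data set, solve the resulting Lasso problem and rescale the solution:

```python
import numpy as np

def difference_matrix(p):
    """p x p matrix J of (4): first row zero, then (J b)_j = b_j - b_{j-1} for j >= 2."""
    J = np.zeros((p, p))
    for j in range(1, p):
        J[j, j] = 1.0
        J[j, j - 1] = -1.0
    return J

def s_lasso_augmented(X, Y, lam, mu, n_iter=5000):
    """Sketch of Lemma 1: solve the augmented Lasso by ISTA, then rescale."""
    n, p = X.shape
    J = difference_matrix(p)
    X_tilde = np.vstack([X, np.sqrt(n * mu) * J]) / (1.0 + mu)
    Y_tilde = np.concatenate([Y, np.zeros(p)])
    r = lam / (1.0 + mu)
    # ISTA on (1/n) ||Y_tilde - X_tilde b||^2 + r ||b||_1
    L = 2.0 * np.linalg.norm(X_tilde, 2) ** 2 / n   # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = -2.0 / n * X_tilde.T @ (Y_tilde - X_tilde @ b)
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - r / L, 0.0)   # soft-thresholding step
    return b / (1.0 + mu)   # beta_hat^SL = b_hat / (1 + mu)
```

Any Lasso solver can be substituted for the ISTA loop; only the data augmentation and the final rescaling are specific to Lemma 1.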

3 Theoretical properties of the S-Lasso estimator when p ≤ n

In this section we introduce theoretical results for the S-Lasso with a moderate number of covariates (p ≤ n). We first provide rates of convergence of the S-Lasso estimator and show how, through a control of the regularization parameters, we can establish root-n consistency and asymptotic normality. Then we look for variable selection consistency. More precisely, we give conditions under which the S-Lasso estimator succeeds in finding the set of the non-zero regression coefficients. We show that with a suitable choice of the tuning parameters (λ, µ), the S-Lasso is consistent in variable selection. All the results of this section are proved in Appendix A.

3.1 Asymptotic Normality

In this section, we allow the tuning parameters (λ, µ) to depend on the sample size n. We emphasize this dependence by adding a subscript n to these parameters. We also fix the number of covariates p. Let I(·) denote the indicator function and define the sign function such that for any x ∈ R, Sgn(x) equals 1, −1 or 0 according to whether x is positive, negative or equal to 0. Knight and Fu [14] gave the asymptotic distribution of the Lasso estimator. We provide here the asymptotic distribution of the S-Lasso. Let C_n = n^{-1} X'X be the Gram matrix; then:

Theorem 1. Given the data set (X, Y), assume that the correlation matrix verifies C_n → C when n → ∞, in probability, where C is a positive definite matrix. Assume that there exists a sequence v_n such that v_n → 0 and that the regularization parameters verify λ_n v_n^{-1} → λ ≥ 0 and µ_n v_n^{-1} → µ ≥ 0. Then, if (√n v_n)^{-1} → κ ≥ 0, we have
v_n^{-1} (β̂^SL − β*) →_D Argmin_{u ∈ R^p} V(u), when n → ∞,
where
V(u) = −2κ u'W + u'Cu + λ Σ_{j=1}^p { u_j Sgn(β*_j) I(β*_j ≠ 0) + |u_j| I(β*_j = 0) } + 2µ Σ_{j=2}^p { (u_j − u_{j−1})(β*_j − β*_{j−1}) I(β*_j ≠ β*_{j−1}) },
with W ~ N(0, σ² C).

Remark 3. When κ ≠ 0 is a finite constant: in this case v_n^{-1} is O(√n), so that the estimator β̂^SL is root-n consistent. Moreover, when λ = µ = 0, we obtain the following standard asymptotic normality: √n (β̂^SL − β*) →_D N(0, σ² C^{-1}). When κ = 0: in this case, the rate of convergence is slower than √n, so that we no longer have the optimal rate. Moreover, the limit is not random anymore. Note first that the correlation penalty does not alter the asymptotic bias when successive regression coefficients are equal. We also remark that the sequence v_n must be chosen properly, as it determines our convergence rate. We would like v_n to be as close as possible to 1/√n. This sequence is calibrated by the user such that λ_n/v_n → λ and µ_n/v_n → µ.

3.2 Consistency in variable selection

In this section, variable selection consistency of the S-Lasso estimator is considered. For this purpose, we introduce the following sparsity sets: A* = {j : β*_j ≠ 0} and Â_n = {j : β̂^SL_j ≠ 0}. The set A* consists of the non-zero coefficients of the oracle regression vector β*. The set Â_n consists of the non-zero coefficients of the S-Lasso estimator β̂^SL and is also called the active set of this estimator. Before stating our results, let us introduce some notation. For any vector a ∈ R^p and any set of indexes B ⊂ {1,...,p}, denote by a_B the restriction of the vector a to the indexes in B. In the same way, if we denote by |B| the cardinality of the set B, then for any s × q matrix M we use the following conventions: i) M_{B,B} is the |B| × |B| matrix consisting of the rows and columns of M whose indexes are in B; ii) M_{·,B} is the s × |B| matrix consisting of the columns of M whose indexes are in B; iii) M_{B,·} is the |B| × q matrix consisting of the rows of M whose indexes are in B. Moreover, we define J̃ as the p × p matrix J'J, where J was defined in (4). Finally, we define for j ∈ {1,...,p} the quantity Ω_j = Ω_j(λ, µ, A*, β*) by
Ω_j = | C_{j,A*} (C_{A*,A*} + µ J̃_{A*,A*})^{-1} ( 2^{-1} Sgn(β*_{A*}) + (µ/λ) J̃_{A*,A*} β*_{A*} ) − (µ/λ) J̃_{j,A*} β*_{A*} |, (5)
where C is defined as in Theorem 1. Now consider the following conditions: for every j ∈ (A*)^c,
Ω_j(λ, µ, A*, β*) < 1, (6)
Ω_j(λ, µ, A*, β*) ≤ 1. (7)
These conditions on the correlation matrix C and on the regression vector β*_{A*} are the analogues of, respectively, the sufficient and the necessary conditions derived for the Lasso ([35], [34] and [32]). Now we state the consistency results.

Theorem 2. If condition (6) holds, then for every couple of regularization parameters (λ_n, µ_n) such that λ_n → 0, λ_n n^{1/2} → ∞ and µ_n → 0, the S-Lasso estimator β̂^SL as defined in (2)-(3) is consistent in variable selection. That is,
P(Â_n = A*) → 1, when n → ∞.

Theorem 3. If there exist sequences (λ_n, µ_n) such that β̂^SL converges to β* and Â_n converges to A* in probability, then condition (7) is satisfied.

We have just established necessary and sufficient conditions for the selection consistency of the S-Lasso estimator. Due to the assumptions needed in Theorem 2 (more precisely λ_n n^{1/2} → ∞), root-n consistency and variable selection consistency cannot be treated here simultaneously. We may want to know whether the S-Lasso estimator can be consistent with a rate slower than n^{1/2} and consistent in variable selection at the same time.

Remark 4. Here are special cases of conditions (6)-(7).
When µ = 0 and lim_n µ_n/λ_n = 0: these conditions are exactly the sufficient and necessary conditions of the Lasso estimator. In this case, Yuan and Lin [32] showed that condition (6) becomes necessary and sufficient for the Lasso estimator consistency in variable selection.
When µ = 0 and lim_n µ_n/λ_n = γ ≠ 0: in this case, condition (6) becomes
sup_{j ∈ (A*)^c} | C_{j,A*} C_{A*,A*}^{-1} ( 2^{-1} Sgn(β*_{A*}) + γ J̃_{A*,A*} β*_{A*} ) − γ J̃_{j,A*} β*_{A*} | < 1.
Here a good calibration of γ leads to consistency in variable selection:
if (C_{j,A*} C_{A*,A*}^{-1} J̃_{A*,A*} − J̃_{j,A*}) β*_{A*} > 0, then γ must be chosen between
(−1 − 2^{-1} C_{j,A*} C_{A*,A*}^{-1} Sgn(β*_{A*})) / ((C_{j,A*} C_{A*,A*}^{-1} J̃_{A*,A*} − J̃_{j,A*}) β*_{A*})
and
(1 − 2^{-1} C_{j,A*} C_{A*,A*}^{-1} Sgn(β*_{A*})) / ((C_{j,A*} C_{A*,A*}^{-1} J̃_{A*,A*} − J̃_{j,A*}) β*_{A*});
if (C_{j,A*} C_{A*,A*}^{-1} J̃_{A*,A*} − J̃_{j,A*}) β*_{A*} < 0, then γ must be chosen between the same quantities but with their order inverted.
When µ ≠ 0 and lim_n µ_n/λ_n = γ ≠ 0: this case is similar to the previous one. In addition, it allows another control on the condition through a calibration of µ, so that condition (6) can be satisfied with a better control.
We conclude that if we sacrifice the optimal rate of convergence (i.e. root-n consistency), we are able, through a proper choice of the tuning parameters (λ_n, µ_n),

to get consistency in variable selection. Note that Zou [35] showed that the Lasso estimator cannot be consistent in variable selection even with a rate of convergence slower than √n. He then added weights to the Lasso (i.e. the adaptive Lasso estimator) in order to get the Oracle Properties (that is, both asymptotic normality and variable selection consistency). Note that we can easily adapt the techniques used for the adaptive Lasso to provide a weighted S-Lasso estimator which achieves the Oracle Properties.

4 Theoretical results when the dimension p is larger than the sample size n

In this section, we propose to study the performance of the S-Lasso estimator in the high dimensional case. In particular, we provide a non-asymptotic bound on the squared risk. We also provide a bound on the estimation risk under the sup-norm (i.e., the l_∞-norm: ||β̂^SL − β*||_∞ = sup_j |β̂^SL_j − β*_j|). This last result helps us to provide a variable selection consistent estimator obtained by thresholding the S-Lasso estimator. The results of this section are proved in Appendix B.

4.1 Sparsity Inequality

Now we establish a Sparsity Inequality (SI) achieved by the S-Lasso estimator, that is, a bound on the squared risk that takes into account the sparsity of the oracle regression vector β*. More precisely, we prove that the rate of convergence is |A*| log(p)/n. For this purpose, we need some assumptions on the Gram matrix C_n, which is normalized in our setting. Recall that ξ_j = (x_{1,j},...,x_{n,j})'. Then we define the regularization parameters λ_n and µ_n in the following forms:
λ_n = κ1 σ √(log(p)/n) and µ_n = κ2 σ √(log(p))/n, (8)
where κ1 > 2√2 and κ2 > 0 are constants. Let us define the maximal correlation quantity ρ1 = max_{j∈A*} max_{k∈{1,...,p}, k≠j} |(C_n)_{j,k}|. Using these notations, we formulate the following assumptions:

Assumption (A1). The true regression vector β* is such that there exists a finite constant L1 such that:
β*'_{A*} J̃_{A*,A*} β*_{A*} ≤ L1 √(log(p)) |A*|, (9)

where J̃ = J'J and J was defined in (4).

Assumption (A2). We have:
ρ1 ≤ 1/(16 |A*|). (10)

Note that Assumption (A1) is not restrictive. A sufficient condition is that the largest non-zero component of β*_{A*} is bounded by L1 √(log(p)), which can be very large. Assumption (A2) is the well-known coherence condition considered in [3], which was introduced in [7]. Most of the SIs provided in the literature use such a condition. We refer to [3] for more details. Theorem 4 below provides an upper bound for the squared error of the estimator β̂^SL and for its l_1 estimation error, which take into account the sparsity index |A*|.

Theorem 4. Let us consider the linear regression model (1). Let β̂^SL be the S-Lasso estimator. Let A* be the sparsity set. Suppose that p ≥ n (and even p ≫ n is allowed). If Assumptions (A1)-(A2) hold, then with probability greater than 1 − u_{n,p} we have
||Xβ̂^SL − Xβ*||²_n ≤ c2 |A*| log(p)/n, (11)
and
||β̂^SL − β*||_1 ≤ c1 |A*| √(log(p)/n), (12)
where c2 = (16κ1² + L1κ2)σ², c1 = (16κ1 + L1κ1^{-1}κ2)σ, and where u_{n,p} = p^{1−κ1²/8}, with κ1 and κ2 the constants appearing in (8).

The proof of Theorem 4 is based on the argmin definition of the estimator and on some technical concentration inequalities. Similar bounds were provided for the Lasso estimator in [4]. Let us mention that the constants c1 and c2 are not optimal. We focused our attention on the dependency on n (and then on p and |A*|). It turns out that our results are near optimal. For instance, for the l_2 risk, the S-Lasso estimator nearly reaches the optimal rate |A*| log(p/|A*| + 1)/n, up to a logarithmic factor [3, Theorem 5.1].

4.2 Sup-norm bound and variable selection

Now we provide a bound on the sup-norm ||β̂^SL − β*||_∞. Thanks to this result, one may be able to define a rule in order to get a variable selection consistent estimator

12 whe p. That is, we ca costruct a estimator which succeeds to recover the support of β i high dimesioal settigs. Small modificatios are to be imposed to provide our selectio results i this sectio. Let K be the symmetric p p matrix defied by K = C + µ J. Istead of Assumptio (A2), we will cosider the followig Assumptio (A3). We assume that max j, k {1,...,p} k j (K ) j,k 1 16 A. Remark 5. Note that the matrix J is tridiagoal with its off-diagoal terms equal to 1. If we do ot cosider the diagoal terms, we remark that C ad K differ oly i the terms o the secod diagoals (i.e., (K ) j 1,j (C ) j 1,j for j = 2,..., p as soo as µ 0). The, as we do ot cosider the diagoal terms i Assumptios (A2) ad (A3), they differ oly i the restrictio they impose to terms o the secod diagoals. Terms i the secod diagoals of C correspod to correlatios betwee successive covariates. The whe high correlatios exist betwee successive covariates, a suitable choice of µ makes Assumptio (A3) satisfied while Assumptios (A2) does ot. Hece, Assumptio (A3) fits better with setup cosidered i the paper. I the sequel, a coveiet choice of the tuig parameter µ is µ = κ 3 σ/ log (p), where κ 3 > 0 is a costat. Moreover, from Assumptio (A1), we have βa J A,A β A L 1 log (p) A. This iequality guaratees the existece of a costat L 2 > 0 such that Jβ L 2 log (p). Theorem 5. Let us cosider the liear regressio model (1). Let λ = κ 1 σ log(p)/ ad µ = κ 3 σ/ log (p) with κ 1 > 2 2 ad κ 3 > 0. Suppose that p (ad eve p ). Uder Assumptios (A1) ad (A3) ad with probability greater tha 2 1 p 1 κ 1 8, we have where c equals to log (p) ˆβ SL β c, ( ) Bσ α 1 + 4L 1B 9α 2 A 2 + 2L 1B 3αA 2 + 2L 1 B 3α(α 1)A 2 + 8L 1 L 2 B 2 9α(α 1)A 4 λ + (4L 2B 3A 2 + L 2B A 2 )λ. 11

Note that the leading term in c is Bσα/(α−1) + 4L1B/(9α²|A*|²) + 2L1B/(3α|A*|²) + 2L1B/(3α(α−1)|A*|²). One may recover the result obtained for the Lasso by setting L1 to zero [16]. Secondly, the calibration of µ_n aims at making the convergence rate under the sup-norm equal to √(log(p)/n). On the one hand, the proof of Theorem 5 allows us to choose this parameter with a faster convergence to zero without affecting the rate of convergence. On the other hand, a more restrictive Assumption (A1) on β*'_{A*} J̃_{A*,A*} β*_{A*} and on J̃β* can be formulated in order to make µ_n converge more slowly to zero. If we let β*'_{A*} J̃_{A*,A*} β*_{A*} ≤ L1 |A*| in Assumption (A1), we can set µ_n as O(√(log(p)/n)), the slowest convergence to zero we can get for µ_n.

Let us now provide a consistent version of the S-Lasso estimator. Consider β̂^ThSL, the thresholded S-Lasso estimator, defined componentwise by
β̂^ThSL_j = β̂^SL_j I(|β̂^SL_j| > c √(log(p)/n)),
where c is given in Theorem 5. This estimator consists of the S-Lasso estimator with its small coefficients reduced to zero. We thus enforce the selection property of the S-Lasso estimator. Variable selection consistency of this estimator is established under one more restriction:

Assumption (A4). The smallest non-zero coefficient of β* is such that there exists a constant c_l > 0 with
min_{j∈A*} |β*_j| > c_l √(log(p)/n).

Assumption (A4) bounds from below the smallest regression coefficient of β*. This is a common assumption used to provide sign consistency in the high dimensional case. This condition appears in [19, 29, 33, 34], but with a larger (in terms of sample size dependence) and thus more restrictive threshold. We refer to [16] for a longer discussion. An equivalent lower bound on the oracle regression coefficients can be found in [2, 16]. With this new assumption, we can state the following sign consistency result.

Theorem 6. Let us consider the thresholded S-Lasso estimator β̂^ThSL as described above. Choose moreover λ_n = κ1 σ √(log(p)/n) and µ_n = κ3 σ/√(n log(p)) with positive constants κ1 > 2√2 and κ3. Under Assumptions (A1), (A3) and (A4), if c_l > 2c, with c given by Theorem 5, then with probability greater than 1 − 2p^{1−κ1²/8} we have
Sgn(β̂^ThSL) = Sgn(β*), (13)
and then, as n → +∞,
P(Sgn(β̂^ThSL) = Sgn(β*)) → 1. (14)
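The thresholding rule defining β̂^ThSL is a one-line operation once β̂^SL has been computed. Here is a minimal sketch (the constant c is left as a user-supplied value, its exact expression being the one given in Theorem 5):

```python
import numpy as np

def threshold_s_lasso(beta_sl, n, c):
    """Thresholded S-Lasso: keep beta_hat_j only if |beta_hat_j| > c*sqrt(log(p)/n)."""
    p = beta_sl.size
    tau = c * np.sqrt(np.log(p) / n)
    return beta_sl * (np.abs(beta_sl) > tau)
```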

Remark 6. As observed in Remark 5, Assumption (A3) is more easily satisfied when correlation exists between successive covariates. Then, in situations where the correlation matrix C_n is tridiagonal with its off-diagonal terms equal to δ with δ ∈ [0, 1], the constant κ3 appearing in the definition of µ_n can be adjusted in order to get Assumption (A3) satisfied.

5 Model Selection

As already said [Remark 1 in Section 2], each step of the S-Lasso version of the LARS algorithm provides an estimator of β*. In this section, we are interested in the choice of the best estimator according to its prediction accuracy. For a new n × p matrix x_new of instances (independent of X), denote by ŷ^SL = x_new β̂^SL the estimator of its unknown response value y_new, and let m = E(y_new | x_new). We aim to minimize the true risk E{||m − ŷ^SL||²_n}. First, we easily obtain
E{||m − ŷ^SL||²_n} = E{ ||Y − ŷ^SL||²_n − σ² + (2/n) Σ_{i=1}^n Cov(y_i, ŷ^SL_i) },
where the expectation is taken over the random variable Y. The last term in this equation is called the optimism [9]. Moreover, Tibshirani [25] links this quantity to the degree of freedom df(ŷ^SL) of the estimator ŷ^SL, so that the above equality becomes
E{||m − ŷ^SL||²_n} = E{ ||Y − ŷ^SL||²_n − σ² + (2/n) df(ŷ^SL) σ² }. (15)
This final expression involves the degree of freedom, which is unknown. Various methods exist to estimate the degree of freedom, such as the bootstrap [11] or data perturbation methods [24]. We give an explicit form of the degree of freedom in order to reduce the computational cost, as in [10] and [37].

Degrees of freedom: the degree of freedom is a quantity of interest in model selection. Before stating our result, let us introduce some useful properties of the regularization paths of the S-Lasso estimator:
- Given a response Y and a regularization parameter µ ≥ 0, there is a finite sequence 0 = λ^(K) < λ^(K−1) < ... < λ^(0) such that β̂^SL = 0 for every λ ≥ λ^(0). In this notation, superscripts correspond to the steps of the S-Lasso version of the LARS algorithm.
- Given a response Y and a regularization parameter µ ≥ 0, for λ ∈ (λ^(k+1), λ^(k)) the same covariates are used to construct the estimator. Let us denote by A_ζ the active set for a fixed couple ζ = (λ, µ) and by X_{·,A_ζ} the corresponding design matrix.

In what follows, we will use the subscript ζ to emphasize the fact that the considered quantity depends on ζ.

Theorem 7. For fixed µ ≥ 0 and λ > 0, an unbiased estimate of the effective degree of freedom of the S-Lasso estimate is given by
df̂(ŷ^SL_ζ) = Tr[ X_{·,A_ζ} ( X'_{·,A_ζ} X_{·,A_ζ} + nµ J̃_{A_ζ,A_ζ} )^{-1} X'_{·,A_ζ} ],
where J̃ = J'J is the p × p tridiagonal matrix with diagonal entries (1, 2, 2, ..., 2, 1) and with all entries on the first sub- and super-diagonals equal to −1. (16)

As the estimate given in Theorem 7 has an important computational cost, we propose the following estimator of the degree of freedom of the S-Lasso estimator:
df̂(ŷ^SL_ζ) = (|A_ζ| − 2)/(1 + 2µ) + 2/(1 + µ), (17)
which is very easy to compute. Let I_s be the s × s identity matrix, where s is an integer. We found the former approximation of the degree of freedom under the orthogonal covariance matrix assumption (that is, n^{-1} X'X = I_p). Moreover, we approximate the matrix (I_{|A_ζ|} + µ J̃_{A_ζ,A_ζ}) by the diagonal matrix with 1 + µ as its first and last terms, and 1 + 2µ elsewhere.

Remark 7 (Comparison to the Lasso and the Elastic-Net). A similar work leads to an estimate of the degree of freedom of the Lasso, df̂(ŷ^L_ζ) = |A_ζ|, and to an estimate of the degree of freedom of the Elastic-Net estimator, df̂(ŷ^EN_ζ) = |A_ζ|/(1 + µ). These approximations of the degrees of freedom provide the following comparison for a fixed ζ: df̂(ŷ^SL_ζ) ≤ df̂(ŷ^EN_ζ) ≤ df̂(ŷ^L_ζ). A conclusion is that the S-Lasso estimator is the one whose models are the least penalized, and the Lasso estimator the most. As a consequence, the S-Lasso estimator should select larger models than the Lasso or the Elastic-Net estimators.
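Both degree-of-freedom formulas are easy to evaluate numerically. The sketch below (hypothetical helper names; J_tilde stands for the matrix J̃ of (16) and active for the index set A_ζ) computes the exact trace estimate of Theorem 7 and the cheap approximation (17):

```python
import numpy as np

def df_exact(X, active, mu, J_tilde):
    """Theorem 7: Tr[ X_A (X_A' X_A + n*mu*J_tilde_{A,A})^{-1} X_A' ]."""
    n = X.shape[0]
    XA = X[:, active]
    JA = J_tilde[np.ix_(active, active)]
    # Use the cyclic property of the trace: Tr[X_A M^{-1} X_A'] = Tr[M^{-1} X_A' X_A]
    M = np.linalg.solve(XA.T @ XA + n * mu * JA, XA.T @ XA)
    return np.trace(M)

def df_approx(n_active, mu):
    """Cheap approximation (17), derived under an orthogonal design."""
    return (n_active - 2) / (1.0 + 2.0 * mu) + 2.0 / (1.0 + mu)
```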

6 The Normalized S-Lasso estimator

In this section, we look for a scaled S-Lasso estimator which would have better empirical performance than the original S-Lasso presented above. The idea behind this study is to better control shrinkage. Indeed, using the S-Lasso procedure (2)-(3) induces a double shrinkage: one coming from the Lasso penalty and the other from the fusion penalty. We want to undo the shrinkage implied by the fusion penalty, as shrinkage is already ensured by the Lasso penalty. We therefore suggest studying the S-Lasso criterion (2)-(3) without the Lasso penalty (i.e. with only the l_2-fusion penalty) in order to find the constant by which we have to scale. Define
β̄ = Argmin_{β ∈ R^p} ||Y − Xβ||²_n + µ Σ_{j=2}^p (β_j − β_{j−1})².
We easily obtain β̄ = ((X'X)/n + µ J̃)^{-1} (X'Y)/n := L^{-1} (X'Y)/n, where J̃ is given by (16). Moreover, as the design matrix X is standardized, the symmetric matrix L can be written as the p × p matrix whose diagonal is (1 + µ, 1 + 2µ, ..., 1 + 2µ, 1 + µ), whose entries on the first sub- and super-diagonals are ξ'_j ξ_{j+1}/n − µ, and whose remaining entries are ξ'_j ξ_k/n. In order to get rid of the shrinkage due to the fusion penalty, we force L to have ones (or something close to a diagonal of ones) as its diagonal elements. We then scale the estimator β̄ by a factor c. Here are two choices we will use in the rest of the paper: i) the first is c = 1 + µ, so that the first and last diagonal elements of c^{-1}L become equal to one; ii) the second is c = 1 + 2µ, which offers the advantage that all the diagonal elements of c^{-1}L become equal to one, except the first and the last. This second choice seems more appropriate to undo this extra shrinkage, especially in high dimensional problems. We first give a generalization of Lemma 1.

Lemma 2. Given the dataset (X, Y) and (λ, µ), define the augmented dataset (X̃, Ỹ) by
X̃ = ν_1^{-1} (X' , √(nµ) J')' and Ỹ = (Y' , 0')',

where ν_1 is a constant which depends only on µ, and J is given by (4). Let r = λ/ν_1 and b = (ν_2/c) β, where ν_2 is a constant which depends only on µ and c is the scaling constant which appears in the previous study. Then the S-Lasso criterion can be written as
||Ỹ − X̃b||²_n + r ||b||_1. (18)
Let b̂ be the minimizer of this Lasso criterion; then we define the Scaled Smooth-Lasso (SS-Lasso) by β̂^SSL = β̂^SSL(ν_1, ν_2, c) = (c/ν_2) b̂. Moreover, let J̃ = J'J. Then we have
β̂^SSL = Argmin_{β ∈ R^p} { (ν_2/(ν_1 c)) β' ( X'X/n + µ J̃ ) β − 2 <Y, Xβ>_n + λ Σ_{j=1}^p |β_j| }. (19)

Equation (19) is only a rearrangement of the Lasso criterion (18). The SS-Lasso expression (19) emphasizes the importance of the scaling constant c. In a way, the SS-Lasso estimator stabilizes the Lasso estimator β̂^L (criterion (18) based on (X, Y) instead of (X̃, Ỹ)), as we have
β̂^L = Argmin_{β ∈ R^p} { β' (X'X/n) β − 2 <Y, Xβ>_n + λ Σ_{j=1}^p |β_j| }.
The choice of ν_1 and ν_2 should be linked to this scaling constant c in order to get better empirical performances and to have fewer parameters to calibrate. Let us define some specific cases.
i) Case 1: when ν_1 = ν_2 = 1 + µ and c = 1: this is the original S-Lasso estimator as seen in Section 2.
ii) Case 2: when ν_1 = ν_2 = 1 + µ and c = 1 + µ: we call this scaled S-Lasso estimator the Normalized Smooth Lasso (NS-Lasso) and we denote it by β̂^NSL. In this case, we have β̂^NSL = (1 + µ) β̂^SL.
iii) Case 3: when ν_1 = ν_2 = 1 + 2µ and c = 1 + 2µ: we call this scaled version the Highly Normalized Smooth Lasso (HS-Lasso) and we denote it by β̂^HSL.
Other choices are possible for ν_1 and ν_2 in order to better control shrinkage. For instance, we can consider a compromise between the NS-Lasso and the HS-Lasso by defining ν_1 = 1 + µ and ν_2 = 1 + 2µ.

Remark 8 (Connection with Soft Thresholding). Let us consider the limit case of the NS-Lasso estimator. Denote β̂^NSL_∞ = lim_{µ→∞} β̂^NSL; then, using (19), we have
β̂^NSL_∞ = Argmin_β { β'β − 2 <Y, Xβ>_n + λ ||β||_1 }.

As a consequence,
(β̂^NSL_∞)_j = ( |<Y, ξ_j>_n| − λ/2 )_+ Sgn(<Y, ξ_j>_n),
which is the Univariate Soft Thresholding [8]. Hence, when µ → ∞, the NS-Lasso works as if all the covariates were independent. The Lasso, which corresponds to the NS-Lasso when µ = 0, often fails to select covariates when high correlations exist between relevant and irrelevant covariates. It seems that the NS-Lasso is able to avoid such problems by increasing µ and working as if all the covariates were independent. Then, for a fixed λ, the control of the regularization parameter µ appears to be crucial. When we vary it, the NS-Lasso bridges the Lasso and the Soft Thresholding.
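The limiting rule of Remark 8 can be implemented directly. The following sketch (assuming, as in the rest of the paper, that the columns ξ_j of X are standardized) applies the Univariate Soft Thresholding to the marginal correlations:

```python
import numpy as np

def univariate_soft_threshold(X, Y, lam):
    """Limit of the NS-Lasso as mu -> infinity: componentwise soft-thresholding
    of the marginal correlations <Y, xi_j>_n = xi_j' Y / n."""
    n = X.shape[0]
    corr = X.T @ Y / n
    return np.sign(corr) * np.maximum(np.abs(corr) - lam / 2.0, 0.0)
```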

19 This coditio ca be see as a geeralizatio of the Irrepresetable Coditio ivolved i the Lasso variable selectio cosistecy. Let us discuss how the two assumptios ca be compared i the case p. First, ote that Assumptio (A3-EN), as well as EIC suggests low correlatios betwee covariates. Moreover Assumptio (A1), (A4) ad (A3-EN) seem more restrictive tha EIC as all the correlatios are costraied i (21). However, EIC is harder to iterpret i term of the coefficiets of the regressio vector β. It also depeds o the sig of β. The mai differece is that the cosistecy result i the preset paper holds uiformly o the solutios of the Elastic Net criterio while the result from [13] higes upo the existece of a cosistet solutio for variable selectio. Obviously, this is more restrictive as we are certai to provide the sig-cosistet solutio uder the EIC. Fially, we have also provided results o the sup-orm ad sparsity iequalities o the squared risk of our estimators. Such results are ew for estimators defied with the pealty (20), icludig the S-Lasso ad the Elastic-Net. 8 Experimetal results I the preset sectio we illustrate the good predictio ad selectio properties of the NS-Lasso ad the HS-Lasso estimators. For this purpose, we compare it to the Lasso ad the Elastic-Net. It appears that S-Lasso is a good challeger to the Elastic-Net [36] eve whe large correlatios betwee covariates exist. We further show that i most cases, our procedure outperforms the Elastic-Net ad the Lasso whe we cosider the ratio betwee the relevat selected covariates ad irrelevat selected covariates. Simulatios: Data. Four simulatios are geerated accordig to the liear regressio model y = xβ + σε, ε N(0, 1), x = (ξ 1,...,ξ p ) R p. The first ad the secod examples were itroduced i the origial Lasso paper [25]. The third simulatio creates a grouped covariates situatio. It was itroduced i [36] ad aims to poit the efficiecy of the Elastic-Net compared to the Lasso. The last simulatio itroduces large correlatio betwee successive covariates. 18

(a) In this example, we simulate 20 observations with 8 covariates. The true regression vector is β* = (3, 1.5, 0, 0, 2, 0, 0, 0)', so that only three covariates are truly relevant. Let σ = 3, and let the correlation between ξ_j and ξ_k be such that Cov(ξ_j, ξ_k) = 2^{−|j−k|}.

(b) The second example is the same as the first one, except that we generate 50 observations and that β*_j = 0.85 for every j ∈ {1,...,8}, so that all the covariates are relevant.

(c) In the third example, we simulate 50 data with 40 covariates. The true regression vector is such that β*_j = 3 for j = 1,...,15 and β*_j = 0 for j = 16,...,40. Let σ = 15 and let the covariates be generated as follows:
ξ_j = Z_1 + ε_j, Z_1 ~ N(0, 1), j = 1,...,5,
ξ_j = Z_2 + ε_j, Z_2 ~ N(0, 1), j = 6,...,10,
ξ_j = Z_3 + ε_j, Z_3 ~ N(0, 1), j = 11,...,15,
where the ε_j, j = 1,...,15, are i.i.d. N(0, 0.01) variables. Moreover, for j = 16,...,40, the ξ_j's are i.i.d. N(0, 1) variables.

(d) In the last example, we generate 50 data with 30 covariates. The true regression vector is such that β*_j = 3 − 0.1 j for j = 1,...,10, β*_j = j for j = 20,...,25, and β*_j = 0 for the other j. The noise is such that σ = 9, and the correlations are such that Cov(ξ_j, ξ_k) = exp(−|j − k|/2) for (j, k) ∈ {11,...,25}²; the other covariates are i.i.d. N(0, 1), also independent from ξ_11,...,ξ_25. In this model there are big correlations between relevant covariates and even between relevant and irrelevant covariates.

Validation. The selection of the tuning parameters λ and µ is based on the minimization of a BIC-type criterion [22]. For a given β̂, the associated BIC error is defined as:
BIC(β̂) = ||Y − Xβ̂||²_n + (log(n) σ²/n) df̂(β̂),
where df̂(β̂) is given by (17) if we consider the S-Lasso, and denotes the analogous quantity if we consider the Lasso or the Elastic-Net. Such a criterion provides an accurate estimator which enjoys good variable selection properties ([23] and [30]). In the simulation studies, for each replication, we also provide the Mean Squared Error (MSE) of the selected estimator on a new and independent dataset with the same size as the training set (that is, n). This gives information on the robustness of the procedures.
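To make the simulation designs and the validation step concrete, here is a small sketch (hypothetical helper names) that generates data as in example (c) and evaluates the BIC-type criterion for a given estimate; df_hat stands for the degree-of-freedom estimate (17), or its analogue for the Lasso and the Elastic-Net:

```python
import numpy as np

def make_example_c(n=50, p=40, sigma=15.0, seed=None):
    """Grouped-covariates design of example (c): three groups of five nearly
    identical covariates carry the signal, the remaining 25 are pure noise."""
    rng = np.random.default_rng(seed)
    X = np.empty((n, p))
    for g in range(3):                                   # groups 1-5, 6-10, 11-15
        Z = rng.standard_normal(n)
        for j in range(5 * g, 5 * g + 5):
            X[:, j] = Z + rng.normal(0.0, 0.1, size=n)   # N(0, 0.01) perturbations
    X[:, 15:] = rng.standard_normal((n, p - 15))         # i.i.d. N(0, 1) noise covariates
    beta_star = np.concatenate([3.0 * np.ones(15), np.zeros(p - 15)])
    Y = X @ beta_star + sigma * rng.standard_normal(n)
    return X, Y, beta_star

def bic_error(X, Y, beta_hat, sigma, df_hat):
    """BIC-type criterion: ||Y - X beta_hat||^2_n + log(n) * sigma^2 / n * df_hat."""
    n = X.shape[0]
    rss_n = np.mean((Y - X @ beta_hat) ** 2)
    return rss_n + np.log(n) * sigma ** 2 / n * df_hat
```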

Method     Example (a)    Example (b)    Example (c)    Example (d)
Lasso      3.8 [±0.1]     6.5 [±0.1]     6 [±0.1]       18.4 [±0.2]
E-Net      4.9 [±0.1]     6.9 [±0.1]     15.9 [±0.1]    20.5 [±0.2]
NS-Lasso   3.9 [±0.1]     6.5 [±0.1]     15.3 [±0.2]    18.9 [±0.2]
HS-Lasso   3.5 [±0.1]     5.9 [±0.1]     15 [±0.1]      18.1 [±0.2]

Table 1: Mean of the number of non-zero coefficients [and its standard error] selected respectively by the Lasso, the Elastic-Net (E-Net), the Normalized Smooth Lasso (NS-Lasso) and the Highly Smooth Lasso (HS-Lasso) procedures.

Interpretations. All the results exposed here are based on 200 replications. Figure 1 and Figure 2 give respectively the BIC error and the test error of the considered procedures in each example. Concerning the selection part, Figure 3 shows the frequencies of selection of each covariate for all the procedures, and Table 1 shows the mean number of non-zero coefficients that each procedure selected. Finally, for each procedure, Table 2 gives the ratio between the number of relevant covariates and the number of noise covariates that the procedure selected. Let us call this ratio SNR. Then we can express it as
SNR = Σ_{j∈Â} I(j ∈ A*) / Σ_{j∈Â} I(j ∉ A*).
This is a good indication of the selection power of the procedures. As the Lasso is a special case of the S-Lasso and of the Elastic-Net, the Lasso BIC error (Figure 1) is always larger than the BIC error of the other methods. These seem to have equivalent BIC errors. When considering the test error (Figure 2), it seems again that all the procedures are similar in all of the examples. They manage to produce good predictions independently of the sparsity of the model. The most attractive aspect concerns variable selection. For this purpose we treat each example separately.

Example (a): the Elastic-Net selects a model which is too large (Table 1). This is reflected by the worst SNR (Table 2). As a consequence, we can observe in Figure 3

that it also includes the second covariate more often than the other procedures do. This is due to the grouping effect, as the first covariate is relevant. For similar reasons, the S-Lasso often selects the second covariate. However, this covariate is selected less often than by the Elastic-Net, as the S-Lasso seems to be a little disturbed by the third covariate, which is irrelevant. This aspect of the S-Lasso procedure is also present in the selection of covariate 5, as its neighbor covariates 4 and 6 are irrelevant. We can also observe that the S-Lasso procedure is the one which selects irrelevant covariates least often when these covariates are far away from relevant ones (in terms of index distance). Finally, even if the Lasso procedure selects the relevant covariates less often than the Elastic-Net and the S-Lasso procedures, it also has as good an SNR. The Lasso presents good selection performances in this example.

Method     Example (a)    Example (c)    Example (d)
Lasso      2.3 [±0.1]     2.9 [±0.1]     4.7 [±0.2]
E-Net      1.7 [±0.1]     13.1 [±0.3]    3.4 [±0.2]
NS-Lasso   2.5 [±0.1]     13.5 [±0.3]    6.8 [±0.3]
HS-Lasso   1.79 [±0.1]    11.4 [±0.3]    6.4 [±0.3]

Table 2: Mean of the ratio between the number of relevant covariates and the number of noise covariates (SNR) [and its standard error] that each of the Lasso, the Elastic-Net, the NS-Lasso and the HS-Lasso procedures selected.

Example (b): we can see in Figure 3 how the S-Lasso and Elastic-Net selections depend on how the covariates are ranked. Compared to the Lasso, they both select the covariates in the middle (that is, covariates 2 to 7) more often than the ones at the borders (covariates 1 and 8). We also remark that this aspect is more pronounced for the S-Lasso than for the Elastic-Net.

Example (c): the Lasso procedure performs poorly. It selects more noise covariates and fewer relevant ones than the other procedures (Figure 3). It also has the worst SNR (Table 2). In this example, Figure 3 also shows that the Elastic-Net selects relevant covariates more often than the S-Lasso procedures, but it also selects more noise covariates than the NS-Lasso procedure. Then, even if the Elastic-Net has very good performance in variable selection, the NS-Lasso procedure has similar performances with a close SNR (Table 2). The NS-Lasso appears to have very good performance in this example. However, it again selects relevant covariates at the borders less often than the Elastic-Net.

Example (d): we decompose the study into two parts. First, the independent part, which considers the covariates ξ_1,...,ξ_10 and ξ_26,...,ξ_30. The second part considers the

other covariates, which are dependent.

Figure 1: BIC error in each example. For each plot, we construct the boxplot for procedure 1 = Lasso; 2 = Elastic-Net; 3 = NS-Lasso; 4 = HS-Lasso.

Regarding the independent covariates, Figure 3 shows that all the procedures perform roughly in the same way, though the S-Lasso procedure enjoys a slightly better selection (in both the relevant and the noise group of covariates). For the dependent and relevant covariates, the Lasso performs worse than the other procedures. It clearly selects these relevant covariates less often. As in example (c), the reason is that the Lasso modification of the LARS algorithm tends to select only one representative of a group of highly correlated covariates. The high value of the SNR for the Lasso (when compared to the Elastic-Net) is explained by its good performance when it treats noise covariates. In this example the Elastic-Net correctly selects relevant covariates, but it is also the procedure which selects the most noise covariates, and it has the worst SNR. We also note that both the NS-Lasso and the HS-Lasso outperform the Lasso and the Elastic-Net. This gain is especially pronounced in the center of the groups. Observe that for the covariates ξ_20, ξ_21, ξ_25 and ξ_26 (that

is, at the borders), the NS-Lasso and the HS-Lasso have slightly worse performance than in the center of the groups. This is again due to the attraction imposed by the fusion penalty (3) in the S-Lasso criterion.

Figure 2: Test error in each example. For each plot, we construct the boxplot for procedure 1 = Lasso; 2 = Elastic-Net; 3 = NS-Lasso; 4 = HS-Lasso.

Conclusion of the experiments. The S-Lasso procedure seems to respond to our expectations. Indeed, when successive correlations exist, it tends to select the whole group of relevant covariates and not only one representative of the group, as done by the Lasso procedure. It also appears that the S-Lasso procedure has very good selection properties with respect to both relevant and noise covariates. However, it has slightly worse performance at the borders than in the centers of groups of covariates (due to the attraction of irrelevant covariates). It almost always has a better SNR than the Elastic-Net, so we can consider it a good challenger to this procedure.

Figure 3: Number of covariate detections for each procedure in all the examples (Top-Left: Example (a); Top-Right: Example (b); Bottom-Left: Example (c); Bottom-Right: Example (d)).

9 Conclusion

In this paper, we introduced a new procedure called the Smooth-Lasso which takes into account correlations between successive covariates. We established several theoretical results. The main conclusions are that when p ≤ n, the S-Lasso is consistent in variable selection and asymptotically normal with a rate lower than √n. In the high dimensional setting, we provided a condition, related to the mutual coherence condition, under which the thresholded version of the Smooth-Lasso is consistent in variable selection. This condition is fulfilled when correlations between successive covariates exist. Moreover, simulation studies showed that normalized versions of the Smooth-Lasso have nice variable selection properties, which are emphasized when high correlations exist between successive covariates. It appears that the Smooth-Lasso almost always outperforms the Lasso and is a good challenger to the Elastic-Net.

Appendix A.

Since the matrix C_n + µ_n J̃ plays a crucial role in the proofs, we shorten the notation to K_n = C_n + µ_n J̃ and, when p ≤ n, we define its limit K = C + µ J̃. In this appendix we prove the results for the case p ≤ n.

Proof of Theorem 1. Let
Ψ_n(u) = ||Y − X(β* + v_n u)||²_n + λ_n Σ_{j=1}^p |β*_j + v_n u_j| + µ_n Σ_{j=2}^p (β*_j − β*_{j−1} + v_n (u_j − u_{j−1}))²,
for u = (u_1,...,u_p)' ∈ R^p, and let û_n = Argmin_u Ψ_n(u). Let ε = (ε_1,...,ε_n)'; we then

have
Ψ_n(u) − Ψ_n(0) =: V_n(u)
= v_n² u'(X'X/n)u − 2 v_n <ε, Xu>_n + λ_n Σ_{j=1}^p ( |β*_j + v_n u_j| − |β*_j| ) + µ_n Σ_{j=2}^p { (β*_j − β*_{j−1} + v_n (u_j − u_{j−1}))² − (β*_j − β*_{j−1})² }
= v_n² [ u'(X'X/n)u − (2/v_n) <ε, Xu>_n + (λ_n/v_n) Σ_{j=1}^p v_n^{-1} ( |β*_j + v_n u_j| − |β*_j| ) + (µ_n/v_n) Σ_{j=2}^p v_n^{-1} { (β*_j − β*_{j−1} + v_n (u_j − u_{j−1}))² − (β*_j − β*_{j−1})² } ]
= v_n² V̄_n(u).
Note that û_n = Argmin_u Ψ_n(u) = Argmin_u V̄_n(u); we then have to consider the limiting distribution of V̄_n(u). First, we have X'X/n → C. Moreover, as 1/(√n v_n) → κ and as, given X, the random variable n^{-1/2} ε'X →_D W with W ~ N(0, σ²C), Slutsky's theorem implies that
(2/v_n) <ε, Xu>_n →_D 2κ W'u.
Now we treat the last two terms. If β*_j ≠ 0,
v_n^{-1} ( |β*_j + v_n u_j| − |β*_j| ) → u_j Sgn(β*_j),
and it is equal to |u_j| otherwise. Then, as λ_n/v_n → λ,
(λ_n/v_n) Σ_{j=1}^p v_n^{-1} ( |β*_j + v_n u_j| − |β*_j| ) → λ Σ_{j=1}^p { u_j Sgn(β*_j) I(β*_j ≠ 0) + |u_j| I(β*_j = 0) }.
For the remaining term, we show that if β*_j ≠ β*_{j−1},
v_n^{-1} { (β*_j − β*_{j−1} + v_n (u_j − u_{j−1}))² − (β*_j − β*_{j−1})² } → 2 (u_j − u_{j−1})(β*_j − β*_{j−1}),
and it is equal to v_n (u_j − u_{j−1})², which tends to 0, otherwise. Then, as µ_n/v_n → µ,
(µ_n/v_n) Σ_{j=2}^p v_n^{-1} { (β*_j − β*_{j−1} + v_n (u_j − u_{j−1}))² − (β*_j − β*_{j−1})² } →

2µ Σ_{j=2}^p { (u_j − u_{j−1})(β*_j − β*_{j−1}) I(β*_j ≠ β*_{j−1}) }.
Therefore we have V̄_n(u) → V(u) in probability, for every u ∈ R^p. And since C is a positive definite matrix, V(u) has a unique minimizer. Moreover, as V̄_n(u) is convex, standard M-estimation results [28] lead to û_n →_D Argmin_u V(u).

Proof of Theorem 2. We begin by giving two results which we will use in our proof. The first one concerns the optimality conditions of the S-Lasso estimator. Recall that by definition
β̂^SL = Argmin_{β ∈ R^p} ||Y − Xβ||²_n + λ ||β||_1 + µ β'J̃β.
Denote by f(a)|_{a=a_0} the evaluation of the function f at the point a_0. As the above problem is a non-differentiable convex problem, classical tools lead to the following optimality conditions for the S-Lasso estimator:

Lemma 3. The vector β̂^SL = (β̂^SL_1,...,β̂^SL_p)' is the S-Lasso estimate as defined in (2)-(3) if and only if
d( ||Y − Xβ||²_n + µ β'J̃β )/dβ_j |_{β_j = β̂^SL_j} = −λ Sgn(β̂^SL_j) for j such that β̂^SL_j ≠ 0, (22)
| d( ||Y − Xβ||²_n + µ β'J̃β )/dβ_j |_{β_j = β̂^SL_j} | ≤ λ for j such that β̂^SL_j = 0. (23)

Recall that A* = {j : β*_j ≠ 0}. The second result states that if we restrict ourselves to the covariates which we are after (i.e. the indexes in A*), we get a consistent estimate as soon as the regularization parameters λ_n and µ_n are properly chosen.

Lemma 4. Let β̃_{A*} be a minimizer of
||Y − X_{·,A*} β_{A*}||²_n + λ_n Σ_{j∈A*} |β_j| + µ_n β'_{A*} J̃_{A*,A*} β_{A*}.
If λ_n → 0 and µ_n → 0, then β̃_{A*} converges to β*_{A*} in probability.

This lemma can be seen as a special and restricted case of Theorem 1. We now prove Theorem 2. Let β̃_{A*} be as in Lemma 4. We define an estimator β̃ by extending β̃_{A*} by zeros on (A*)^c. Hence, consistency of β̃ is ensured as a simple consequence of Lemma 4. Now we need to prove that, with probability tending to one, this estimator is optimal for the problem (2)-(3), that is, that the optimality conditions (22)-(23) are fulfilled with probability tending to one. From now on, we write A for A*. By definition of β̃_A, the optimality condition (22) is satisfied. We now must check the optimality condition (23). Combining the fact that Y = Xβ* + ε with the convergence of the matrix X'X/n and of the vector ε'X/n, we have
n^{-1} (X'Y − X'X_{·,A} β̃_A) = C_{·,A} (β*_A − β̃_A) + O_p(n^{-1/2}). (24)
Moreover, the optimality condition (22) for the estimator β̃ can be written as
n^{-1} (X'_{·,A} Y − X'_{·,A} X_{·,A} β̃_A) = (λ_n/2) Sgn(β̃_A) − µ_n J̃_{A,A} (β*_A − β̃_A) + µ_n J̃_{A,A} β*_A. (25)
Combining (24) and (25), we easily obtain
(β*_A − β̃_A) = (C_{A,A} + µ_n J̃_{A,A})^{-1} ( (λ_n/2) Sgn(β̃_A) + µ_n J̃_{A,A} β*_A ) + O_p(n^{-1/2}).
Since β̃ is consistent and λ_n n^{1/2} → ∞, for each j ∈ A^c the left-hand side of the optimality condition (23),
(n λ_n)^{-1} (ξ'_j Y − ξ'_j X_{·,A} β̃_A) − (µ_n/λ_n) J̃_{j,A} β̃_A =: L^{(n)}_j,
converges in probability to
C_{j,A} (K_{A,A})^{-1} ( 2^{-1} Sgn(β*_A) + (µ/λ) J̃_{A,A} β*_A ) − (µ/λ) J̃_{j,A} β*_A =: L_j.
By condition (6), this quantity is strictly smaller than one in absolute value. Then
lim_n P( for all j ∈ A^c, |L^{(n)}_j| ≤ 1 ) ≥ P( for all j ∈ A^c, |L_j| ≤ 1 ) = 1,
which ends the proof.


More information

Sequences and Series of Functions

Sequences and Series of Functions Chapter 6 Sequeces ad Series of Fuctios 6.1. Covergece of a Sequece of Fuctios Poitwise Covergece. Defiitio 6.1. Let, for each N, fuctio f : A R be defied. If, for each x A, the sequece (f (x)) coverges

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract I this lecture we derive risk bouds for kerel methods. We will start by showig that Soft Margi kerel SVM correspods to miimizig

More information

Lecture 22: Review for Exam 2. 1 Basic Model Assumptions (without Gaussian Noise)

Lecture 22: Review for Exam 2. 1 Basic Model Assumptions (without Gaussian Noise) Lecture 22: Review for Exam 2 Basic Model Assumptios (without Gaussia Noise) We model oe cotiuous respose variable Y, as a liear fuctio of p umerical predictors, plus oise: Y = β 0 + β X +... β p X p +

More information

Random Variables, Sampling and Estimation

Random Variables, Sampling and Estimation Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig

More information

A statistical method to determine sample size to estimate characteristic value of soil parameters

A statistical method to determine sample size to estimate characteristic value of soil parameters A statistical method to determie sample size to estimate characteristic value of soil parameters Y. Hojo, B. Setiawa 2 ad M. Suzuki 3 Abstract Sample size is a importat factor to be cosidered i determiig

More information

Gini Index and Polynomial Pen s Parade

Gini Index and Polynomial Pen s Parade Gii Idex ad Polyomial Pe s Parade Jules Sadefo Kamdem To cite this versio: Jules Sadefo Kamdem. Gii Idex ad Polyomial Pe s Parade. 2011. HAL Id: hal-00582625 https://hal.archives-ouvertes.fr/hal-00582625

More information

A Risk Comparison of Ordinary Least Squares vs Ridge Regression

A Risk Comparison of Ordinary Least Squares vs Ridge Regression Joural of Machie Learig Research 14 (2013) 1505-1511 Submitted 5/12; Revised 3/13; Published 6/13 A Risk Compariso of Ordiary Least Squares vs Ridge Regressio Paramveer S. Dhillo Departmet of Computer

More information

Outline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression

Outline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression REGRESSION 1 Outlie Liear regressio Regularizatio fuctios Polyomial curve fittig Stochastic gradiet descet for regressio MLE for regressio Step-wise forward regressio Regressio methods Statistical techiques

More information

6.867 Machine learning, lecture 7 (Jaakkola) 1

6.867 Machine learning, lecture 7 (Jaakkola) 1 6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week 2 Lecture: Cocept Check Exercises Starred problems are optioal. Excess Risk Decompositio 1. Let X = Y = {1, 2,..., 10}, A = {1,..., 10, 11} ad suppose the data distributio

More information

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices Radom Matrices with Blocks of Itermediate Scale Strogly Correlated Bad Matrices Jiayi Tog Advisor: Dr. Todd Kemp May 30, 07 Departmet of Mathematics Uiversity of Califoria, Sa Diego Cotets Itroductio Notatio

More information

Asymptotic Results for the Linear Regression Model

Asymptotic Results for the Linear Regression Model Asymptotic Results for the Liear Regressio Model C. Fli November 29, 2000 1. Asymptotic Results uder Classical Assumptios The followig results apply to the liear regressio model y = Xβ + ε, where X is

More information

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random Part III. Areal Data Aalysis 0. Comparative Tests amog Spatial Regressio Models While the otio of relative likelihood values for differet models is somewhat difficult to iterpret directly (as metioed above),

More information

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A.

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A. Radom Walks o Discrete ad Cotiuous Circles by Jeffrey S. Rosethal School of Mathematics, Uiversity of Miesota, Mieapolis, MN, U.S.A. 55455 (Appeared i Joural of Applied Probability 30 (1993), 780 789.)

More information

Information-based Feature Selection

Information-based Feature Selection Iformatio-based Feature Selectio Farza Faria, Abbas Kazeroui, Afshi Babveyh Email: {faria,abbask,afshib}@staford.edu 1 Itroductio Feature selectio is a topic of great iterest i applicatios dealig with

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak

More information

Intro to Learning Theory

Intro to Learning Theory Lecture 1, October 18, 2016 Itro to Learig Theory Ruth Urer 1 Machie Learig ad Learig Theory Comig soo 2 Formal Framework 21 Basic otios I our formal model for machie learig, the istaces to be classified

More information

Rank tests and regression rank scores tests in measurement error models

Rank tests and regression rank scores tests in measurement error models Rak tests ad regressio rak scores tests i measuremet error models J. Jurečková ad A.K.Md.E. Saleh Charles Uiversity i Prague ad Carleto Uiversity i Ottawa Abstract The rak ad regressio rak score tests

More information

1 Review of Probability & Statistics

1 Review of Probability & Statistics 1 Review of Probability & Statistics a. I a group of 000 people, it has bee reported that there are: 61 smokers 670 over 5 960 people who imbibe (drik alcohol) 86 smokers who imbibe 90 imbibers over 5

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

Infinite Sequences and Series

Infinite Sequences and Series Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet

More information

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece,, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet as

More information

Sequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

Sequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece 1, 1, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet

More information

Complex Analysis Spring 2001 Homework I Solution

Complex Analysis Spring 2001 Homework I Solution Complex Aalysis Sprig 2001 Homework I Solutio 1. Coway, Chapter 1, sectio 3, problem 3. Describe the set of poits satisfyig the equatio z a z + a = 2c, where c > 0 ad a R. To begi, we see from the triagle

More information

Empirical Process Theory and Oracle Inequalities

Empirical Process Theory and Oracle Inequalities Stat 928: Statistical Learig Theory Lecture: 10 Empirical Process Theory ad Oracle Iequalities Istructor: Sham Kakade 1 Risk vs Risk See Lecture 0 for a discussio o termiology. 2 The Uio Boud / Boferoi

More information

1 Inferential Methods for Correlation and Regression Analysis

1 Inferential Methods for Correlation and Regression Analysis 1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet

More information

Linear Support Vector Machines

Linear Support Vector Machines Liear Support Vector Machies David S. Roseberg The Support Vector Machie For a liear support vector machie (SVM), we use the hypothesis space of affie fuctios F = { f(x) = w T x + b w R d, b R } ad evaluate

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT

More information

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4.

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4. 4. BASES I BAACH SPACES 39 4. BASES I BAACH SPACES Sice a Baach space X is a vector space, it must possess a Hamel, or vector space, basis, i.e., a subset {x γ } γ Γ whose fiite liear spa is all of X ad

More information

Problem Set 4 Due Oct, 12

Problem Set 4 Due Oct, 12 EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios

More information

Statistics 511 Additional Materials

Statistics 511 Additional Materials Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability

More information

Output Analysis and Run-Length Control

Output Analysis and Run-Length Control IEOR E4703: Mote Carlo Simulatio Columbia Uiversity c 2017 by Marti Haugh Output Aalysis ad Ru-Legth Cotrol I these otes we describe how the Cetral Limit Theorem ca be used to costruct approximate (1 α%

More information

17. Joint distributions of extreme order statistics Lehmann 5.1; Ferguson 15

17. Joint distributions of extreme order statistics Lehmann 5.1; Ferguson 15 17. Joit distributios of extreme order statistics Lehma 5.1; Ferguso 15 I Example 10., we derived the asymptotic distributio of the maximum from a radom sample from a uiform distributio. We did this usig

More information

Introduction to Machine Learning DIS10

Introduction to Machine Learning DIS10 CS 189 Fall 017 Itroductio to Machie Learig DIS10 1 Fu with Lagrage Multipliers (a) Miimize the fuctio such that f (x,y) = x + y x + y = 3. Solutio: The Lagragia is: L(x,y,λ) = x + y + λ(x + y 3) Takig

More information

IP Reference guide for integer programming formulations.

IP Reference guide for integer programming formulations. IP Referece guide for iteger programmig formulatios. by James B. Orli for 15.053 ad 15.058 This documet is iteded as a compact (or relatively compact) guide to the formulatio of iteger programs. For more

More information

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator Slide Set 13 Liear Model with Edogeous Regressors ad the GMM estimator Pietro Coretto pcoretto@uisa.it Ecoometrics Master i Ecoomics ad Fiace (MEF) Uiversità degli Studi di Napoli Federico II Versio: Friday

More information

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering CEE 5 Autum 005 Ucertaity Cocepts for Geotechical Egieerig Basic Termiology Set A set is a collectio of (mutually exclusive) objects or evets. The sample space is the (collectively exhaustive) collectio

More information

Lecture 3 The Lebesgue Integral

Lecture 3 The Lebesgue Integral Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified

More information

Technical Proofs for Homogeneity Pursuit

Technical Proofs for Homogeneity Pursuit Techical Proofs for Homogeeity Pursuit bstract This is the supplemetal material for the article Homogeeity Pursuit, submitted for publicatio i Joural of the merica Statistical ssociatio. B Proofs B. Proof

More information

ECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors

ECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors ECONOMETRIC THEORY MODULE XIII Lecture - 34 Asymptotic Theory ad Stochastic Regressors Dr. Shalabh Departmet of Mathematics ad Statistics Idia Istitute of Techology Kapur Asymptotic theory The asymptotic

More information

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ.

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ. 2 5. Weighted umber of late jobs 5.1. Release dates ad due dates: maximimizig the weight of o-time jobs Oce we add release dates, miimizig the umber of late jobs becomes a sigificatly harder problem. For

More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

Quantile regression with multilayer perceptrons.

Quantile regression with multilayer perceptrons. Quatile regressio with multilayer perceptros. S.-F. Dimby ad J. Rykiewicz Uiversite Paris 1 - SAMM 90 Rue de Tolbiac, 75013 Paris - Frace Abstract. We cosider oliear quatile regressio ivolvig multilayer

More information

Fall 2013 MTH431/531 Real analysis Section Notes

Fall 2013 MTH431/531 Real analysis Section Notes Fall 013 MTH431/531 Real aalysis Sectio 8.1-8. Notes Yi Su 013.11.1 1. Defiitio of uiform covergece. We look at a sequece of fuctios f (x) ad study the coverget property. Notice we have two parameters

More information

MA Advanced Econometrics: Properties of Least Squares Estimators

MA Advanced Econometrics: Properties of Least Squares Estimators MA Advaced Ecoometrics: Properties of Least Squares Estimators Karl Whela School of Ecoomics, UCD February 5, 20 Karl Whela UCD Least Squares Estimators February 5, 20 / 5 Part I Least Squares: Some Fiite-Sample

More information

Testing the number of parameters with multidimensional MLP

Testing the number of parameters with multidimensional MLP Testig the umber of parameters with multidimesioal MLP Joseph Rykiewicz To cite this versio: Joseph Rykiewicz. Testig the umber of parameters with multidimesioal MLP. ASMDA 2005, 2005, Brest, Frace. pp.561-568,

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

Let us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f.

Let us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f. Lecture 5 Let us give oe more example of MLE. Example 3. The uiform distributio U[0, ] o the iterval [0, ] has p.d.f. { 1 f(x =, 0 x, 0, otherwise The likelihood fuctio ϕ( = f(x i = 1 I(X 1,..., X [0,

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

Discrete Mathematics for CS Spring 2008 David Wagner Note 22 CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig

More information

10-701/ Machine Learning Mid-term Exam Solution

10-701/ Machine Learning Mid-term Exam Solution 0-70/5-78 Machie Learig Mid-term Exam Solutio Your Name: Your Adrew ID: True or False (Give oe setece explaatio) (20%). (F) For a cotiuous radom variable x ad its probability distributio fuctio p(x), it

More information

6.867 Machine learning

6.867 Machine learning 6.867 Machie learig Mid-term exam October, ( poits) Your ame ad MIT ID: Problem We are iterested here i a particular -dimesioal liear regressio problem. The dataset correspodig to this problem has examples

More information

Since X n /n P p, we know that X n (n. Xn (n X n ) Using the asymptotic result above to obtain an approximation for fixed n, we obtain

Since X n /n P p, we know that X n (n. Xn (n X n ) Using the asymptotic result above to obtain an approximation for fixed n, we obtain Assigmet 9 Exercise 5.5 Let X biomial, p, where p 0, 1 is ukow. Obtai cofidece itervals for p i two differet ways: a Sice X / p d N0, p1 p], the variace of the limitig distributio depeds oly o p. Use the

More information

Lecture 2: Monte Carlo Simulation

Lecture 2: Monte Carlo Simulation STAT/Q SCI 43: Itroductio to Resamplig ethods Sprig 27 Istructor: Ye-Chi Che Lecture 2: ote Carlo Simulatio 2 ote Carlo Itegratio Assume we wat to evaluate the followig itegratio: e x3 dx What ca we do?

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

Study the bias (due to the nite dimensional approximation) and variance of the estimators

Study the bias (due to the nite dimensional approximation) and variance of the estimators 2 Series Methods 2. Geeral Approach A model has parameters (; ) where is ite-dimesioal ad is oparametric. (Sometimes, there is o :) We will focus o regressio. The fuctio is approximated by a series a ite

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

Estimation for Complete Data

Estimation for Complete Data Estimatio for Complete Data complete data: there is o loss of iformatio durig study. complete idividual complete data= grouped data A complete idividual data is the oe i which the complete iformatio of

More information

The Growth of Functions. Theoretical Supplement

The Growth of Functions. Theoretical Supplement The Growth of Fuctios Theoretical Supplemet The Triagle Iequality The triagle iequality is a algebraic tool that is ofte useful i maipulatig absolute values of fuctios. The triagle iequality says that

More information

Supplemental Material: Proofs

Supplemental Material: Proofs Proof to Theorem Supplemetal Material: Proofs Proof. Let be the miimal umber of traiig items to esure a uique solutio θ. First cosider the case. It happes if ad oly if θ ad Rak(A) d, which is a special

More information

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization ECE 90 Lecture 4: Maximum Likelihood Estimatio ad Complexity Regularizatio R Nowak 5/7/009 Review : Maximum Likelihood Estimatio We have iid observatios draw from a ukow distributio Y i iid p θ, i,, where

More information

Lecture 2. The Lovász Local Lemma

Lecture 2. The Lovász Local Lemma Staford Uiversity Sprig 208 Math 233A: No-costructive methods i combiatorics Istructor: Ja Vodrák Lecture date: Jauary 0, 208 Origial scribe: Apoorva Khare Lecture 2. The Lovász Local Lemma 2. Itroductio

More information

Beurling Integers: Part 2

Beurling Integers: Part 2 Beurlig Itegers: Part 2 Isomorphisms Devi Platt July 11, 2015 1 Prime Factorizatio Sequeces I the last article we itroduced the Beurlig geeralized itegers, which ca be represeted as a sequece of real umbers

More information

We are mainly going to be concerned with power series in x, such as. (x)} converges - that is, lims N n

We are mainly going to be concerned with power series in x, such as. (x)} converges - that is, lims N n Review of Power Series, Power Series Solutios A power series i x - a is a ifiite series of the form c (x a) =c +c (x a)+(x a) +... We also call this a power series cetered at a. Ex. (x+) is cetered at

More information

Measure and Measurable Functions

Measure and Measurable Functions 3 Measure ad Measurable Fuctios 3.1 Measure o a Arbitrary σ-algebra Recall from Chapter 2 that the set M of all Lebesgue measurable sets has the followig properties: R M, E M implies E c M, E M for N implies

More information

CHAPTER 10 INFINITE SEQUENCES AND SERIES

CHAPTER 10 INFINITE SEQUENCES AND SERIES CHAPTER 10 INFINITE SEQUENCES AND SERIES 10.1 Sequeces 10.2 Ifiite Series 10.3 The Itegral Tests 10.4 Compariso Tests 10.5 The Ratio ad Root Tests 10.6 Alteratig Series: Absolute ad Coditioal Covergece

More information

1 Duality revisited. AM 221: Advanced Optimization Spring 2016

1 Duality revisited. AM 221: Advanced Optimization Spring 2016 AM 22: Advaced Optimizatio Sprig 206 Prof. Yaro Siger Sectio 7 Wedesday, Mar. 9th Duality revisited I this sectio, we will give a slightly differet perspective o duality. optimizatio program: f(x) x R

More information

January 25, 2017 INTRODUCTION TO MATHEMATICAL STATISTICS

January 25, 2017 INTRODUCTION TO MATHEMATICAL STATISTICS Jauary 25, 207 INTRODUCTION TO MATHEMATICAL STATISTICS Abstract. A basic itroductio to statistics assumig kowledge of probability theory.. Probability I a typical udergraduate problem i probability, we

More information

Coefficient of variation and Power Pen s parade computation

Coefficient of variation and Power Pen s parade computation Coefficiet of variatio ad Power Pe s parade computatio Jules Sadefo Kamdem To cite this versio: Jules Sadefo Kamdem. Coefficiet of variatio ad Power Pe s parade computatio. 20. HAL Id: hal-0058658

More information

Problem Set 2 Solutions

Problem Set 2 Solutions CS271 Radomess & Computatio, Sprig 2018 Problem Set 2 Solutios Poit totals are i the margi; the maximum total umber of poits was 52. 1. Probabilistic method for domiatig sets 6pts Pick a radom subset S

More information