arXiv:0801.4610v2 [math.ST] 12 Feb 2008

Electronic Journal of Statistics, Vol. 2 (2008) 90–102. ISSN: 1935-7524. DOI: 10.1214/08-EJS177

Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators

Karim Lounici

Laboratoire de Statistiques, CREST, 3, avenue Pierre Larousse, 92240 Malakoff, France, and Laboratoire de Probabilités et Modèles Aléatoires (UMR CNRS 7599), Université Paris 7, 2 pl. Jussieu, BP 7012, 75251 Paris Cedex 05, France. e-mail: lounici@math.jussieu.fr

Abstract: We derive the $\ell_\infty$ convergence rate simultaneously for Lasso and Dantzig estimators in a high-dimensional linear regression model under a mutual coherence assumption on the Gram matrix of the design and two different assumptions on the noise: Gaussian noise and general noise with finite variance. Then we prove that simultaneously the thresholded Lasso and Dantzig estimators with a proper choice of the threshold enjoy a sign concentration property provided that the non-zero components of the target vector are not too small.

AMS 2000 subject classifications: Primary 62J05; secondary 62F12.

Keywords and phrases: Linear model, Lasso, Dantzig, Sparsity, Model selection, Sign consistency.

Received January 2008.

1. Introduction

The Lasso is an $\ell_1$ penalized least squares estimator in linear regression models proposed by Tibshirani [17]. The Lasso enjoys two important properties. First, it is naturally sparse, i.e., it has a large number of zero components. Second, it is computationally feasible even for high-dimensional data (Efron et al. [8], Osborne et al. [16]), whereas classical procedures such as BIC are not feasible when the number of parameters becomes large. The first property raises the question of model selection consistency of Lasso, i.e., of identification of the subset of non-zero parameters. A closely related problem is sign consistency, i.e., identification of the non-zero parameters and their signs (cf. Bunea [2], Meinshausen and Bühlmann [13], Meinshausen and Yu [14], Wainwright [20], Zhao and Yu [22] and the references cited in these papers).

Zou [23] has proved estimation and variable selection results for the adaptive Lasso: a variant of Lasso where the weights on the different components in the $\ell_1$ penalty vary and are data dependent. We mention also work on the convergence of the Lasso estimator under the prediction loss: Bickel, Ritov and Tsybakov [1], Bunea, Tsybakov and Wegkamp [3], Greenshtein and Ritov [9], Koltchinskii [11; 12], Van der Geer [18; 19].

Knight and Fu [10] have proved the estimation consistency of the Lasso estimator in the case where the number of parameters is fixed and smaller than the sample size. The $\ell_2$ consistency of Lasso with convergence rate has been proved in Bickel, Ritov and Tsybakov [1], Meinshausen and Yu [14], Zhang and Huang [21]. These results trivially imply the $\ell_p$ consistency, with $2 \le p \le \infty$, however with a suboptimal rate (cf., e.g., Theorem 3 in [21]). Bickel, Ritov and Tsybakov [1] have proved that the Dantzig selector of Candes and Tao [6] shares a lot of common properties with the Lasso. In particular they have shown simultaneous $\ell_p$ consistency with rates of the Lasso and Dantzig estimators for $1 \le p \le 2$. To our knowledge, there is no result on the $\ell_\infty$ convergence rate and sign consistency of the Dantzig estimator.

The notion of $\ell_\infty$ and sign consistency should be properly defined when the number of parameters is larger than the sample size. We may have indeed an infinity of possible target vectors and solutions to the Lasso and Dantzig minimization problems. This difficulty is not discussed in [2; 13; 14; 20; 21] where either the target vector or the Lasso estimator or both are assumed to be unique. We show that under a sparsity scenario, it is possible to derive $\ell_\infty$ and sign consistency results even when the number of parameters is larger than the sample size. We refer to Theorem 6.3 and the Remark, p. 2, in [1] which suggest a way to clarify the difficulty mentioned above.

In this paper, we consider a high-dimensional linear regression model where the number of parameters can be much greater than the sample size. We show that under a mutual coherence assumption on the Gram matrix of the design, the target vector which has few non-zero components is unique. We do not assume the Lasso or Dantzig estimators to be unique. We establish the $\ell_\infty$ convergence rate of all the Lasso and Dantzig estimators simultaneously under two different assumptions on the noise. The rate that we get improves upon those obtained for the Lasso in the previous works. Then we show a sign concentration property of all the thresholded Lasso and Dantzig estimators simultaneously for a proper choice of the threshold if we assume that the non-zero components of the sparse target vector are large enough. Our condition on the size of the non-zero components of the target vector is less restrictive than in [20–22]. In addition, we prove analogous results for the Dantzig estimator, which to our knowledge was not done before.

The paper is organized as follows. In Section 2 we present the Gaussian linear regression model, the assumptions, the results and we compare them with the existing results in the literature. In Section 3 we consider a general noise with zero mean and finite variance and we show that the results remain essentially the same, up to a slight modification of the convergence rate. In Section 4 we provide the proofs of the results.

2. Model and Results

Consider the linear regression model

$$Y = X\theta^* + W, \qquad (1)$$

where $X$ is an $n \times M$ deterministic matrix, $\theta^* \in \mathbb{R}^M$ and $W = (W_1, \dots, W_n)^T$ is a zero-mean random vector such that $E[W_i^2] \le \sigma^2$, $1 \le i \le n$, for some $\sigma^2 > 0$. For any $\theta \in \mathbb{R}^M$, define $J(\theta) = \{j : \theta_j \neq 0\}$. Let $M(\theta) = |J(\theta)|$ be the cardinality of $J(\theta)$ and $\mathrm{sign}(\theta) = (\mathrm{sign}(\theta_1), \dots, \mathrm{sign}(\theta_M))^T$ where

$$\mathrm{sign}(t) = \begin{cases} 1 & \text{if } t > 0, \\ 0 & \text{if } t = 0, \\ -1 & \text{if } t < 0. \end{cases}$$

For any vector $\theta \in \mathbb{R}^M$ and any subset $J$ of $\{1, \dots, M\}$, we denote by $\theta_J$ the vector in $\mathbb{R}^M$ which has the same coordinates as $\theta$ on $J$ and zero coordinates on the complement $J^c$ of $J$. For any integers $1 \le d, p < \infty$ and $z = (z_1, \dots, z_d) \in \mathbb{R}^d$, the $\ell_p$ norm of the vector $z$ is denoted by $|z|_p = \big(\sum_{j=1}^d |z_j|^p\big)^{1/p}$ and $|z|_\infty = \max_{1 \le j \le d} |z_j|$.

Note that the assumption of uniqueness of $\theta^*$ is not satisfied if $M > n$. In this case, if a vector $\theta = \theta^0$ satisfies (1), then there exists an affine space $\Theta = \{\theta : X\theta = X\theta^0\}$ of dimension $\ge M - n$ of vectors satisfying (1). So the question of sign consistency becomes problematic when $M > n$ because we can easily find two distinct vectors $\theta^1$ and $\theta^2$ satisfying (1) such that $\mathrm{sign}(\theta^1) \neq \mathrm{sign}(\theta^2)$. However we will show that under our assumption of sparsity $\theta^*$ is unique.

The Lasso and Dantzig estimators $\hat\theta^L$, $\hat\theta^D$ solve respectively the minimization problems

$$\min_{\theta \in \mathbb{R}^M} \ \frac{1}{n}|Y - X\theta|_2^2 + 2r|\theta|_1 \qquad (2)$$

and

$$\min_{\theta \in \mathbb{R}^M} |\theta|_1 \quad \text{subject to} \quad \Big|\frac{1}{n}X^T(Y - X\theta)\Big|_\infty \le r, \qquad (3)$$

where $r > 0$ is a constant. A convenient choice in our context will be $r = A\sigma\sqrt{(\log M)/n}$, for some $A > 0$. We denote respectively by $\hat\Theta^L$ and $\hat\Theta^D$ the set of solutions to the Lasso and Dantzig minimization problems (2) and (3).

The definition of the Lasso minimization problem we use here is not the same as the one in [17], where it is defined as

$$\min_{\theta \in \mathbb{R}^M} \frac{1}{n}|Y - X\theta|_2^2 \quad \text{subject to} \quad |\theta|_1 \le t,$$

for some $t > 0$. However these minimization problems are strongly related, cf. [5]. The Dantzig estimator was introduced and studied in [6]. Define

$$\Phi(\theta) = \frac{1}{n}|Y - X\theta|_2^2 + 2r|\theta|_1.$$
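As an illustration of the two criteria, the following minimal Python sketch (standard library only) evaluates the Lasso objective $\Phi$, evaluates the left-hand side of the constraint in (3), and minimizes $\Phi$ by cyclic coordinate descent. The solver, the function names and the toy design are assumptions made for this illustration; the paper does not prescribe any particular algorithm.

```python
def soft_threshold(x, t):
    """Soft-thresholding operator: sign(x) * max(|x| - t, 0)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def residual(X, Y, theta):
    """Y - X theta, with X given as a list of rows."""
    return [y - sum(x * t for x, t in zip(row, theta)) for row, y in zip(X, Y)]

def lasso_objective(X, Y, theta, r):
    """Phi(theta) = |Y - X theta|_2^2 / n + 2 r |theta|_1."""
    n = len(X)
    res = residual(X, Y, theta)
    return sum(e * e for e in res) / n + 2.0 * r * sum(abs(t) for t in theta)

def dantzig_lhs(X, Y, theta):
    """Left-hand side of the constraint in (3): |X^T (Y - X theta) / n|_inf."""
    n, M = len(X), len(X[0])
    res = residual(X, Y, theta)
    return max(abs(sum(X[i][j] * res[i] for i in range(n))) / n for j in range(M))

def lasso_coordinate_descent(X, Y, r, sweeps=200):
    """Cyclic coordinate descent for Phi (assumes no column of X is zero)."""
    n, M = len(X), len(X[0])
    theta = [0.0] * M
    res = list(Y)  # residual Y - X theta at theta = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(M)]
    for _ in range(sweeps):
        for j in range(M):
            # Correlation of column j with the residual that ignores theta_j.
            rho = sum(X[i][j] * (res[i] + X[i][j] * theta[j]) for i in range(n)) / n
            new = soft_threshold(rho, r) / col_sq[j]
            delta = new - theta[j]
            if delta != 0.0:
                for i in range(n):
                    res[i] -= X[i][j] * delta
                theta[j] = new
    return theta

# Hypothetical noiseless toy example: n = 4, M = 3, entries +-1.
X = [[1.0, 1.0, 1.0],
     [1.0, -1.0, 1.0],
     [1.0, 1.0, 1.0],
     [1.0, -1.0, -1.0]]
theta_star = [2.0, 0.0, 0.0]
Y = [sum(x * t for x, t in zip(row, theta_star)) for row in X]
r = 0.5

theta_hat = lasso_coordinate_descent(X, Y, r)
print(theta_hat)                                   # [1.5, 0.0, 0.0]
print(lasso_objective(X, Y, theta_hat, r))         # 1.75
print(lasso_objective(X, Y, theta_star, r))        # 2.0
print(dantzig_lhs(X, Y, theta_hat) <= r + 1e-9)    # True
```

In this toy run the minimizer shrinks the single active coefficient by exactly $r$, and it is feasible for the constraint in (3) with the same $r$.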
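A necessary and sufficient condition for a vector $\theta$ to minimize $\Phi$ is that the zero vector in $\mathbb{R}^M$ belongs to the subdifferential of $\Phi$ at point $\theta$, i.e.,

$$\begin{cases} \frac{1}{n}\big(X^T(Y - X\theta)\big)_j = \mathrm{sign}(\theta_j)\, r & \text{if } \theta_j \neq 0, \\[4pt] \Big|\frac{1}{n}\big(X^T(Y - X\theta)\big)_j\Big| \le r & \text{if } \theta_j = 0. \end{cases}$$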

Thus, any vector $\theta \in \hat\Theta^L$ satisfies the Dantzig constraint

$$\Big|\frac{1}{n}X^T(Y - X\theta)\Big|_\infty \le r. \qquad (4)$$

The Lasso estimator is unique if $M < n$, since in this case $\Phi(\theta)$ is strongly convex. However, for $M > n$ it is not necessarily unique. The uniqueness of the Dantzig estimator is not granted either. From now on, we set $\hat\Theta = \hat\Theta^L$ or $\hat\Theta^D$ and $\hat\theta$ denotes an element of $\hat\Theta$.

Now we state the assumptions on our model. The first assumption concerns the noise variables.

Assumption 1. The random variables $W_1, \dots, W_n$ are i.i.d. $N(0, \sigma^2)$.

We also need assumptions on the Gram matrix $\Psi = \frac{1}{n}X^T X$.

Assumption 2. The elements $\Psi_{i,j}$ of the Gram matrix $\Psi$ satisfy

$$\Psi_{j,j} = 1, \quad \forall\, 1 \le j \le M, \qquad (5)$$

and

$$\max_{i \neq j} |\Psi_{i,j}| \le \frac{1}{\alpha(1 + 2c_0)s}, \qquad (6)$$

for some integer $s \ge 1$ and some constant $\alpha > 1$, where $c_0 = 1$ if we consider the Dantzig estimator, and $c_0 = 3$ if we consider the Lasso estimator.

The notion of mutual coherence was introduced in [7] where the authors required that $\max_{i \neq j} |\Psi_{i,j}|$ were sufficiently small. Assumption 2 is stated in a slightly weaker form in [1]–[4].

Consider two vectors $\theta^1$ and $\theta^2$ satisfying (1) such that $M(\theta^1) \le s$ and $M(\theta^2) \le s$. Denote $\theta = \theta^1 - \theta^2$ and $J = J(\theta^1) \cup J(\theta^2)$. We clearly have $X\theta = 0$ and $|J| \le 2s$. Assume that $\theta \neq 0$. Under Assumption 2, similarly as we derive the inequality (11) in Section 4 below and using the fact that $|\theta|_1 \le \sqrt{2s}\,|\theta|_2$, we get that

$$\frac{|X\theta|_2^2}{n|\theta|_2^2} \ge 1 - \frac{2}{\alpha(1 + 2c_0)} > 0.$$

This contradicts the fact that $X\theta = 0$. Thus we have $\theta^1 = \theta^2$. We have proved that under Assumption 2 the vector $\theta^*$ satisfying (1) with $M(\theta^*) \le s$ is unique.

Our first result concerns the $\ell_\infty$ rate of convergence of Lasso and Dantzig estimators.

Theorem 1. Take $r = A\sigma\sqrt{(\log M)/n}$ and $A > 2\sqrt{2}$. Let Assumptions 1, 2 be satisfied. If $M(\theta^*) \le s$, then

$$P\Big(\sup_{\hat\theta \in \hat\Theta} |\hat\theta - \theta^*|_\infty \le c_2 r\Big) \ge 1 - M^{1 - A^2/8},$$

with

$$c_2 = \frac{3}{2}\Big(1 + \frac{(1 + c_0)^2}{(1 + 2c_0)(\alpha - 1)}\Big).$$
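For a concrete design, Assumption 2 and the constant $c_2$ can be evaluated numerically. The sketch below is a minimal illustration in plain Python; the helper names and the two small designs are hypothetical, and the formula for $c_2$ is the one obtained by combining (12), (13) and (16) in Section 4.

```python
def gram(X):
    """Gram matrix Psi = X^T X / n for a design given as a list of rows."""
    n, M = len(X), len(X[0])
    return [[sum(X[i][j] * X[i][k] for i in range(n)) / n for k in range(M)]
            for j in range(M)]

def mutual_coherence(Psi):
    """Largest absolute off-diagonal entry of Psi."""
    M = len(Psi)
    return max((abs(Psi[i][j]) for i in range(M) for j in range(M) if i != j),
               default=0.0)

def satisfies_assumption_2(X, s, alpha, c0, tol=1e-12):
    """Check (5) and (6): unit diagonal, coherence <= 1 / (alpha (1 + 2 c0) s)."""
    Psi = gram(X)
    unit_diagonal = all(abs(Psi[j][j] - 1.0) <= tol for j in range(len(Psi)))
    bound = 1.0 / (alpha * (1 + 2 * c0) * s)
    return unit_diagonal and mutual_coherence(Psi) <= bound + tol

def c2_constant(c0, alpha):
    """c_2 = (3/2) * (1 + (1 + c0)^2 / ((1 + 2 c0) (alpha - 1)))."""
    return 1.5 * (1.0 + (1.0 + c0) ** 2 / ((1.0 + 2.0 * c0) * (alpha - 1.0)))

# Hypothetical design with orthogonal +-1 columns (a 4 x 4 Hadamard matrix).
H = [[1, 1, 1, 1],
     [1, -1, 1, -1],
     [1, 1, -1, -1],
     [1, -1, -1, 1]]
# Hypothetical design whose third column is correlated with the first two.
X_bad = [[1, 1, 1],
         [1, -1, 1],
         [1, 1, 1],
         [1, -1, -1]]

print(satisfies_assumption_2(H, s=2, alpha=2.0, c0=3))      # True (coherence 0)
print(mutual_coherence(gram(X_bad)))                        # 0.5
print(satisfies_assumption_2(X_bad, s=1, alpha=2.0, c0=1))  # False (0.5 > 1/6)
print(c2_constant(c0=1, alpha=2.0))                         # Dantzig, about 3.5
print(c2_constant(c0=3, alpha=2.0))                         # Lasso, about 4.93
```

The second design shows how restrictive (6) is: even with $s = 1$, a coherence of 0.5 is far above the admissible level.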

Theorem 1 states that in high dimensions $M$ the set of estimators $\hat\Theta$ is necessarily well concentrated around the vector $\theta^*$. A similar phenomenon was already observed in [1], cf. Remark, page 2, for concentration in $\ell_p$ norms, $1 \le p \le 2$. Note that $c_2$ in Theorem 1 is an absolute constant.

Using Theorem 1, we can easily prove the consistency of the Lasso and Dantzig estimators simultaneously when $n \to \infty$. We allow the quantities $s$, $M$, $\hat\Theta$, $\theta^*$ to vary with $n$. In particular, we assume that

$$M \to \infty \quad \text{and} \quad \lim_{n \to \infty} \frac{\log M}{n} = 0,$$

as $n \to \infty$, and that Assumptions 1, 2 hold true for any $n$. Then we have

$$\sup_{\hat\theta \in \hat\Theta} |\hat\theta - \theta^*|_\infty \to 0 \qquad (7)$$

in probability, as $n \to \infty$. The condition $(\log M)/n \to 0$ means that the number of parameters cannot grow arbitrarily fast when $n \to \infty$. We have the restriction $M = o(\exp(n))$, which is natural in this context.

A result on $\ell_\infty$ consistency of Lasso has been previously stated in Theorem 3 of [21], where $\hat\theta^L$ was assumed to be unique and under another assumption on the matrix $\Psi$. It is not directly related to our Assumption 2, but can be deduced from a restricted version of Assumption 2 where $\alpha$ is taken to be substantially larger than 1. The result in [21] is a trivial consequence of the $\ell_2$ consistency, and has therefore the rate $|\hat\theta^L - \theta^*|_\infty = O_P(s^{1/2} r)$, which is slower than the correct rate given in Theorem 1. In fact, the rate in [21] depends on the unknown sparsity $s$, which is not the case in Theorem 1. Note also that Theorem 3 in [21] concerns the Lasso only, whereas our result covers simultaneously the Lasso and Dantzig estimators.

We now study the sign consistency. We make the following assumption.

Assumption 3. There exists an absolute constant $c_1 > 0$ such that

$$\rho = \min_{j \in J(\theta^*)} |\theta^*_j| > c_1 r.$$

We will take $r = A\sigma\sqrt{(\log M)/n}$. We can find similar assumptions on $\rho$ in the work on sign consistency of the Lasso estimator mentioned above. More precisely, the lower bound on $\rho$ is of the order $s^{1/4} r^{1/2}$ in [14], $n^{-\delta/2}$ with $0 < \delta < 1$ in [20; 22], $\sqrt{\log(Mn)/n}$ in [2] and $\sqrt{s}\, r$ in [21]. Note that our assumption is the least restrictive.

We now introduce thresholded Lasso and Dantzig estimators. For any $\hat\theta \in \hat\Theta$ the associated thresholded estimator $\tilde\theta \in \mathbb{R}^M$ is defined by

$$\tilde\theta_j = \begin{cases} \hat\theta_j & \text{if } |\hat\theta_j| > c_2 r, \\ 0 & \text{elsewhere.} \end{cases}$$

Denote by $\tilde\Theta$ the set of all such $\tilde\theta$. We have first the following non-asymptotic result that we call the sign concentration property.

Theorem 2. Take $r = A\sigma\sqrt{(\log M)/n}$ and $A > 2\sqrt{2}$. Let Assumptions 1–3 be satisfied. We assume furthermore that $c_1 > 2c_2$, where $c_2$ is defined in Theorem 1. Then

$$P\Big(\mathrm{sign}(\tilde\theta) = \mathrm{sign}(\theta^*), \ \forall\, \tilde\theta \in \tilde\Theta\Big) \ge 1 - M^{1 - A^2/8}.$$

Theorem 2 guarantees that every vector $\tilde\theta \in \tilde\Theta$ and $\theta^*$ share the same signs with high probability. Letting $n$ and $M$ tend to $\infty$ we can deduce from Theorem 2 an asymptotic result under the following additional assumption.

Assumption 4. We have $M \to \infty$ and $\lim_{n \to \infty} \frac{\log M}{n} = 0$, as $n \to \infty$.

Then the following asymptotic result called sign consistency follows immediately from Theorem 2.

Corollary 1. Let the assumptions of Theorem 2 hold for any $n$ large enough. Let Assumption 4 be satisfied. Then

$$P\Big(\mathrm{sign}(\tilde\theta) = \mathrm{sign}(\theta^*), \ \forall\, \tilde\theta \in \tilde\Theta\Big) \to 1,$$

as $n \to \infty$.

The sign consistency of Lasso was proved in [13; 22] with the Strong Irrepresentable Condition on the matrix $\Psi$, which is somewhat different from ours. Papers [13; 22] assume a lower bound on $\rho$ of the order $n^{-\delta/2}$ with $0 < \delta < 1$, whereas our Assumption 3 is less restrictive. Note also that these papers assume $\hat\theta^L$ to be unique. Wainwright [20] does not assume $\hat\theta^L$ to be unique and discusses sign consistency of Lasso under a mutual coherence assumption on the matrix $\Psi$ and the following condition on the lower bound: $\sqrt{(\log M)/n} = o(\rho)$ as $n \to \infty$, which is more restrictive than our Assumption 3. In particular Proposition 1 in [20] states that as $n \to \infty$, if the sequence of $\theta^*$ satisfies the above condition for all $n$ large enough, then

$$P\Big(\exists\, \hat\theta^L \in \hat\Theta^L \ \text{s.t.} \ \mathrm{sign}(\hat\theta^L) = \mathrm{sign}(\theta^*)\Big) \to 1.$$

This result does not guarantee sign consistency for all the estimators $\hat\theta^L \in \hat\Theta^L$ but only for some unspecified subsequence that is not necessarily the one chosen in practice. On the contrary, Corollary 1 guarantees that all the thresholded Lasso and Dantzig estimators and $\theta^*$ share the same sign vector asymptotically. It follows from this result that any solution selected by the minimization algorithm is covered and that the case $M > n$, where the set $\hat\Theta$ is not necessarily reduced to a unique estimator, can still be treated.

We note also that the papers mentioned above treat the sign consistency for the Lasso only, whereas we prove it simultaneously for Lasso and Dantzig estimators. The improvement in the conditions that we get is probably due to the fact that we consider thresholded Lasso and Dantzig estimators. In addition, note that not only the consistency results but also the exact non-asymptotic bounds are provided by Theorems 1 and 2.
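The thresholding step and the sign comparison are simple enough to state as code. The following Python sketch uses made-up numbers chosen so that the estimate lies within $c_2 r$ of the target in sup-norm and the non-zero target coefficients exceed $2 c_2 r$; it illustrates the mechanism behind Theorem 2 and is not a simulation from the paper.

```python
def sign(t):
    """sign(t) = 1 if t > 0, 0 if t = 0, -1 if t < 0."""
    return 1 if t > 0 else (-1 if t < 0 else 0)

def sign_vector(theta):
    return [sign(t) for t in theta]

def threshold_estimator(theta_hat, c2, r):
    """Keep theta_hat_j when |theta_hat_j| > c2 * r, set it to 0 otherwise."""
    level = c2 * r
    return [t if abs(t) > level else 0.0 for t in theta_hat]

# Hypothetical values: the threshold is c2 * r = 0.21.
c2, r = 3.5, 0.06
theta_star = [2.0, 0.0, 0.0, -1.5]     # non-zero entries well above 2 * c2 * r
theta_hat = [1.8, 0.15, -0.1, -1.3]    # sup-norm error 0.2 <= c2 * r

theta_tilde = threshold_estimator(theta_hat, c2, r)
print(theta_tilde)                                             # [1.8, 0.0, 0.0, -1.3]
print(sign_vector(theta_hat) == sign_vector(theta_star))       # False
print(sign_vector(theta_tilde) == sign_vector(theta_star))     # True
```

Without thresholding, the small spurious coefficients of the estimate carry wrong signs; thresholding at $c_2 r$ removes exactly those.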

3. Convergence rate and sign consistency under a general noise

In the literature on Lasso and Dantzig estimators, the noise is usually assumed to be Gaussian [1; 6; 13; 20; 21] or admitting a finite exponential moment [2; 14]. The exception is the paper by Zhao and Yu [22], who proved the sign consistency of the Lasso when the noise admits a finite moment of order $2k$ where $k \ge 1$ is an integer. An interesting question is to determine whether the results of the previous section remain valid under a less restrictive assumption on the noise. In this section, we only assume that the random variables $W_i$, $i = 1, \dots, n$, are independent with zero mean and finite variance $E[W_i^2] \le \sigma^2$. We show that the results remain similar. We need the following assumption.

Assumption 5. The matrix $X$ is such that

$$\frac{1}{n}\sum_{i=1}^n \max_{1 \le j \le M} |X_{i,j}|^2 \le c',$$

for a constant $c' > 0$.

For example, if all $X_{i,j}$ are bounded in absolute value by a constant uniformly in $i, j$, then Assumption 5 is satisfied. The next theorem gives the $\ell_\infty$ rate of convergence of Lasso and Dantzig estimators under a mild noise assumption.

Theorem 3. Assume that $W_i$ are independent random variables with $E[W_i] = 0$, $E[W_i^2] \le \sigma^2$, $i = 1, \dots, n$. Take $r = \sigma\sqrt{(\log M)^{1+\delta}/n}$, with $\delta > 0$. Let Assumptions 2, 5 be satisfied. Then

$$P\Big(\sup_{\hat\theta \in \hat\Theta} |\hat\theta - \theta^*|_\infty \le c_2 r\Big) \ge 1 - \frac{c}{(\log M)^\delta},$$

where $c_2$ is defined in Theorem 1, and $c > 0$ is a constant depending only on $c'$.

Therefore the $\ell_\infty$ convergence rate under the bounded second moment noise assumption is only slightly slower than the one obtained under the Gaussian noise assumption, and the concentration phenomenon is less pronounced. If we assume that $\lim_{n \to \infty} (\log M)^{1+\delta}/n = 0$ and that Assumptions 2, 3 and 5 hold true for any $n$ with $r = \sigma\sqrt{(\log M)^{1+\delta}/n}$, then the sign consistency of thresholded Lasso and Dantzig estimators follows from our Theorem 3 similarly as we have proved Theorem 2 and Corollary 1.

Zhao and Yu [22] stated in their Theorem 3 a result on the sign consistency of Lasso under the finite variance assumption on the noise. They assumed $\hat\theta^L$ to be unique and the matrix $X$ to satisfy the condition

$$\max_{1 \le i \le n} \frac{1}{n}\sum_{j=1}^M X_{i,j}^2 \to 0, \quad \text{as } n \to \infty.$$

This condition is rather strong. It does not hold if $M > n$ and all the $X_{i,j}$ are bounded in absolute value by a constant. In addition, [22] assumes that the dimension $M = O(n^\delta)$ with $0 < \delta < 1$, whereas we only need that $M = o(\exp(n^{1/(1+\delta)}))$ with $\delta > 0$. Note also that [22] proves the sign consistency for the Lasso only, whereas we prove it for thresholded Lasso and Dantzig estimators.
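To compare the two regimes numerically, the sketch below computes the tuning parameter of Theorem 1, the tuning parameter of Theorem 3, the Gaussian probability bound $1 - M^{1 - A^2/8}$, and the quantity bounded in Assumption 5. The function names and the values of $M$, $n$, $A$, $\delta$ are illustrative assumptions; the unknown constant $c$ of Theorem 3 is deliberately left out.

```python
import math

def r_gaussian(A, sigma, M, n):
    """Tuning parameter of Theorem 1: A * sigma * sqrt(log(M) / n)."""
    return A * sigma * math.sqrt(math.log(M) / n)

def r_general(sigma, M, n, delta):
    """Tuning parameter of Theorem 3: sigma * sqrt(log(M)^(1 + delta) / n)."""
    return sigma * math.sqrt(math.log(M) ** (1.0 + delta) / n)

def gaussian_probability_bound(A, M):
    """Lower bound 1 - M^(1 - A^2 / 8) from Theorem 1 (needs A > 2 sqrt(2))."""
    return 1.0 - M ** (1.0 - A * A / 8.0)

def assumption_5_constant(X):
    """(1/n) * sum_i max_j X_ij^2, the quantity bounded by c' in Assumption 5."""
    return sum(max(x * x for x in row) for row in X) / len(X)

# Hypothetical problem sizes.
M, n, sigma, A, delta = 1000, 200, 1.0, 3.0, 0.5
rg = r_gaussian(A, sigma, M, n)
rq = r_general(sigma, M, n, delta)
# The ratio of the two rates is (log M)^(delta / 2) / A: it grows with M,
# but only logarithmically.
ratio = rq / rg
print(rg, rq, ratio)
print(gaussian_probability_bound(4.0, M))            # about 0.999
print(assumption_5_constant([[1, -1], [-1, 1]]))     # 1.0 for +-1 entries
```

The ratio makes the phrase "only slightly slower" quantitative: the extra factor in the rate is a power of $\log M$.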

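4. Proofs

We begin by stating and proving two preliminary lemmas. The first lemma originates from Lemma 1 of [3] and Lemma 2 of [1].

Lemma 1. Let Assumption 1 and (5) of Assumption 2 be satisfied. Take $r = A\sigma\sqrt{(\log M)/n}$. Here $\hat\Theta$ denotes either $\hat\Theta^L$ or $\hat\Theta^D$. Then we have, on an event of probability at least $1 - M^{1 - A^2/8}$, that

$$\sup_{\hat\theta \in \hat\Theta} |\Psi(\theta^* - \hat\theta)|_\infty \le \frac{3r}{2}, \qquad (8)$$

and for all $\hat\theta \in \hat\Theta$,

$$|\Delta_{J(\theta^*)^c}|_1 \le c_0 |\Delta_{J(\theta^*)}|_1, \qquad (9)$$

where $\Delta = \hat\theta - \theta^*$, $c_0 = 1$ for the Dantzig estimator and $c_0 = 3$ for the Lasso.

Proof. Define the random variables $Z_j = \frac{1}{n}\sum_{i=1}^n X_{i,j}W_i$, $1 \le j \le M$. Using (5) we get that $Z_j \sim N(0, \sigma^2/n)$, $1 \le j \le M$. Define the event

$$\mathcal{A} = \bigcap_{j=1}^M \{|Z_j| \le r/2\}.$$

Standard inequalities on the tail of Gaussian variables yield

$$P(\mathcal{A}^c) \le M\, P(|Z_1| > r/2) \le M \exp\Big(-\frac{n}{2\sigma^2}\Big(\frac{r}{2}\Big)^2\Big) = M^{1 - A^2/8}.$$

On the event $\mathcal{A}$, we have

$$\Big|\frac{1}{n}X^T W\Big|_\infty \le \frac{r}{2}. \qquad (10)$$

Any vector $\hat\theta$ in $\hat\Theta^L$ or $\hat\Theta^D$ satisfies the Dantzig constraint (4). Thus we have on $\mathcal{A}$ that

$$\sup_{\hat\theta \in \hat\Theta} |\Psi(\theta^* - \hat\theta)|_\infty \le \frac{3r}{2}.$$

Now we prove the second inequality. For any $\hat\theta^D \in \hat\Theta^D$, we have by definition that $|\hat\theta^D|_1 \le |\theta^*|_1$, thus

$$|\Delta_{J(\theta^*)^c}|_1 = \sum_{j \in J(\theta^*)^c} |\hat\theta^D_j| \le \sum_{j \in J(\theta^*)} \big(|\theta^*_j| - |\hat\theta^D_j|\big) \le |\Delta_{J(\theta^*)}|_1.$$

Consider now the Lasso estimators. By definition, we have for any $\hat\theta^L \in \hat\Theta^L$

$$\frac{1}{n}|Y - X\hat\theta^L|_2^2 + 2r|\hat\theta^L|_1 \le \frac{1}{n}|W|_2^2 + 2r|\theta^*|_1.$$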

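Developing the left hand side of the above inequality, we get

$$2r|\hat\theta^L|_1 \le 2r|\theta^*|_1 + \frac{2}{n}(\hat\theta^L - \theta^*)^T X^T W.$$

On the event $\mathcal{A}$, we have for any $\hat\theta^L \in \hat\Theta^L$

$$2|\hat\theta^L|_1 \le 2|\theta^*|_1 + |\hat\theta^L - \theta^*|_1.$$

Adding $|\hat\theta^L - \theta^*|_1$ on both sides, we get

$$|\hat\theta^L - \theta^*|_1 + 2|\hat\theta^L|_1 \le 2|\theta^*|_1 + 2|\hat\theta^L - \theta^*|_1,$$

$$|\hat\theta^L - \theta^*|_1 \le 2\big(|\hat\theta^L - \theta^*|_1 + |\theta^*|_1 - |\hat\theta^L|_1\big).$$

Now we remark that if $j \in J(\theta^*)^c$, then we have $|\hat\theta^L_j - \theta^*_j| + |\theta^*_j| - |\hat\theta^L_j| = 0$. Thus we have on the event $\mathcal{A}$ that

$$|\Delta_{J(\theta^*)^c}|_1 - |\Delta_{J(\theta^*)}|_1 \le 2|\Delta_{J(\theta^*)}|_1, \quad \text{that is,} \quad |\Delta_{J(\theta^*)^c}|_1 \le 3|\Delta_{J(\theta^*)}|_1,$$

for any $\hat\theta^L \in \hat\Theta^L$.

Lemma 2. Let Assumption 2 be satisfied. Then

$$\kappa(s, c_0) = \min_{J \subset \{1, \dots, M\},\ |J| \le s} \ \ \min_{\lambda \neq 0:\ |\lambda_{J^c}|_1 \le c_0|\lambda_J|_1} \frac{|X\lambda|_2}{\sqrt{n}\,|\lambda_J|_2} \ge \sqrt{1 - \frac{1}{\alpha}} > 0.$$

Proof. For any subset $J$ of $\{1, \dots, M\}$ such that $|J| \le s$ and $\lambda \in \mathbb{R}^M$ such that $|\lambda_{J^c}|_1 \le c_0|\lambda_J|_1$, we have

$$\frac{|X\lambda_J|_2^2}{n|\lambda_J|_2^2} = 1 + \frac{\lambda_J^T(\Psi - I_M)\lambda_J}{|\lambda_J|_2^2} \ge 1 - \frac{1}{\alpha(1 + 2c_0)s}\,\frac{\sum_{i,j=1}^M |\lambda_J^{(i)}||\lambda_J^{(j)}|}{|\lambda_J|_2^2} = 1 - \frac{1}{\alpha(1 + 2c_0)s}\,\frac{|\lambda_J|_1^2}{|\lambda_J|_2^2}, \qquad (11)$$

where we have used Assumption 2 in the second line, $I_M$ denotes the $M \times M$ identity matrix and $\lambda_J = (\lambda_J^{(1)}, \dots, \lambda_J^{(M)})$ denotes the components of the vector $\lambda_J$. This yields

$$\begin{aligned} \frac{|X\lambda|_2^2}{n|\lambda_J|_2^2} &\ge \frac{|X\lambda_J|_2^2}{n|\lambda_J|_2^2} + 2\,\frac{\lambda_J^T X^T X \lambda_{J^c}}{n|\lambda_J|_2^2} \\ &\ge 1 - \frac{|\lambda_J|_1^2}{\alpha(1 + 2c_0)s\,|\lambda_J|_2^2} - \frac{2|\lambda_J|_1|\lambda_{J^c}|_1}{\alpha(1 + 2c_0)s\,|\lambda_J|_2^2} \\ &\ge 1 - \frac{|\lambda_J|_1^2}{\alpha s\,|\lambda_J|_2^2} \\ &\ge 1 - \frac{1}{\alpha} > 0. \end{aligned}$$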

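We have used Assumption 2 in the second line, the inequality $|\lambda_{J^c}|_1 \le c_0|\lambda_J|_1$ in the third line and the fact that $|\lambda_J|_1 \le \sqrt{|J|}\,|\lambda_J|_2 \le \sqrt{s}\,|\lambda_J|_2$ in the last line.

Proof of Theorem 1. For all $1 \le j \le M$, $\hat\theta \in \hat\Theta$ we have

$$\big(\Psi(\theta^* - \hat\theta)\big)_j = (\theta^*_j - \hat\theta_j) + \sum_{i=1,\, i \neq j}^M \Psi_{i,j}(\theta^*_i - \hat\theta_i).$$

Assumption 2 yields

$$\Big|\big(\Psi(\theta^* - \hat\theta)\big)_j - (\theta^*_j - \hat\theta_j)\Big| \le \frac{1}{\alpha(1 + 2c_0)s}\sum_{i=1,\, i \neq j}^M |\theta^*_i - \hat\theta_i|, \quad \forall j.$$

Thus we have

$$|\theta^* - \hat\theta|_\infty \le |\Psi(\theta^* - \hat\theta)|_\infty + \frac{1}{\alpha(1 + 2c_0)s}\,|\theta^* - \hat\theta|_1. \qquad (12)$$

Set $\Delta = \hat\theta - \theta^*$. Lemma 1 yields that on an event $\mathcal{A}$ of probability at least $1 - M^{1 - A^2/8}$ we have for any $\hat\theta \in \hat\Theta$

$$|\Psi\Delta|_\infty \le \frac{3r}{2}, \qquad (13)$$

and

$$|\Delta|_1 = |\Delta_{J(\theta^*)^c}|_1 + |\Delta_{J(\theta^*)}|_1 \le (1 + c_0)|\Delta_{J(\theta^*)}|_1 \le (1 + c_0)\sqrt{s}\,|\Delta_{J(\theta^*)}|_2.$$

Thus we have, on the same event $\mathcal{A}$,

$$\frac{1}{n}|X\Delta|_2^2 = \Delta^T\Psi\Delta \le |\Psi\Delta|_\infty|\Delta|_1 \le \frac{3r}{2}(1 + c_0)\sqrt{s}\,|\Delta_{J(\theta^*)}|_2, \qquad (14)$$

for any $\hat\theta \in \hat\Theta$. Lemma 2 yields

$$\frac{1}{n}|X\Delta|_2^2 \ge \Big(1 - \frac{1}{\alpha}\Big)|\Delta_{J(\theta^*)}|_2^2, \qquad (15)$$

for any $\hat\theta \in \hat\Theta$. Combining (14) and (15), we obtain that

$$|\Delta_{J(\theta^*)}|_2 \le \frac{3}{2}\,r\,(1 + c_0)\,\frac{\alpha}{\alpha - 1}\,\sqrt{s}, \qquad (16)$$

for any $\hat\theta \in \hat\Theta$. Combining (12), (13) and (16) we obtain that

$$\sup_{\hat\theta \in \hat\Theta} |\hat\theta - \theta^*|_\infty \le \frac{3}{2}\Big(1 + \frac{(1 + c_0)^2}{(1 + 2c_0)(\alpha - 1)}\Big) r.$$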
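The first step of this proof, inequality (12), is purely deterministic and can be checked numerically. The Python sketch below does so for a small Gram matrix with unit diagonal, using the actual coherence $\mu = \max_{i \neq j}|\Psi_{i,j}|$ in place of its upper bound $1/(\alpha(1 + 2c_0)s)$ from Assumption 2. The matrix and the test vectors are hypothetical.

```python
def check_inequality_12(Psi, delta):
    """Return (|delta|_inf, |Psi delta|_inf + mu * |delta|_1), mu = coherence of Psi.

    Assumes Psi has unit diagonal, as required by (5).
    """
    M = len(Psi)
    mu = max(abs(Psi[i][j]) for i in range(M) for j in range(M) if i != j)
    psi_delta = [sum(Psi[j][i] * delta[i] for i in range(M)) for j in range(M)]
    lhs = max(abs(d) for d in delta)
    rhs = max(abs(v) for v in psi_delta) + mu * sum(abs(d) for d in delta)
    return lhs, rhs

# Hypothetical Gram matrix with unit diagonal and coherence 0.5.
Psi = [[1.0, 0.0, 0.5],
       [0.0, 1.0, 0.5],
       [0.5, 0.5, 1.0]]

for delta in ([1.0, -2.0, 0.5], [0.0, 0.0, 3.0], [-1.0, 1.0, 1.0]):
    lhs, rhs = check_inequality_12(Psi, delta)
    print(delta, lhs, rhs, lhs <= rhs)
# The first vector gives lhs = 2.0 and rhs = 1.75 + 0.5 * 3.5 = 3.5.
```

The check holds for every vector, which is why the probabilistic part of the argument only has to control $|\Psi\Delta|_\infty$ and $|\Delta|_1$.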

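Proof of Theorem 2. Theorem 1 yields $\sup_{\hat\theta \in \hat\Theta} |\hat\theta - \theta^*|_\infty \le c_2 r$ on an event $\mathcal{A}$ of probability at least $1 - M^{1 - A^2/8}$. Take $\hat\theta \in \hat\Theta$. For $j \in J(\theta^*)^c$, we have $\theta^*_j = 0$, and $|\hat\theta_j| \le c_2 r$ on $\mathcal{A}$. For $j \in J(\theta^*)$, we have by Assumption 3 that $|\theta^*_j| \ge c_1 r$ and $|\theta^*_j| - |\hat\theta_j| \le |\theta^*_j - \hat\theta_j| \le c_2 r$ on $\mathcal{A}$. Since we assume that $c_1 > 2c_2$, we have on $\mathcal{A}$ that $|\hat\theta_j| \ge (c_1 - c_2) r > c_2 r$. Thus on the event $\mathcal{A}$ we have: $j \in J(\theta^*) \Leftrightarrow |\hat\theta_j| > c_2 r$. This yields $\mathrm{sign}(\tilde\theta_j) = \mathrm{sign}(\hat\theta_j) = \mathrm{sign}(\theta^*_j)$ if $j \in J(\theta^*)$ on the event $\mathcal{A}$. If $j \notin J(\theta^*)$, $\mathrm{sign}(\theta^*_j) = 0$ and $\tilde\theta_j = 0$ on $\mathcal{A}$, so that $\mathrm{sign}(\tilde\theta_j) = 0$. The same reasoning holds true simultaneously for all $\hat\theta \in \hat\Theta$ on the event $\mathcal{A}$. Thus we get the result.

Proof of Theorem 3. The proof of Theorem 3 is similar to the one of Theorem 1 up to a modification of the bound on $P(\mathcal{A}^c)$ in Lemma 1. Recall that $Z_j = \frac{1}{n}\sum_{i=1}^n X_{i,j}W_i$, $1 \le j \le M$, and the event $\mathcal{A}$ is defined by

$$\mathcal{A} = \bigcap_{j=1}^M \{|Z_j| \le r/2\} = \Big\{\max_{1 \le j \le M} |Z_j| \le r/2\Big\}.$$

The Markov inequality yields that

$$P(\mathcal{A}^c) \le \frac{4E[\max_{1 \le j \le M} Z_j^2]}{r^2}.$$

Then we use Lemma 3 given below with $p = \infty$ and the random vectors $Y_i = (X_{i,1}W_i/n, \dots, X_{i,M}W_i/n) \in \mathbb{R}^M$, $i = 1, \dots, n$. We get that

$$P(\mathcal{A}^c) \le c''\,\frac{\log M}{r^2}\,\sigma^2 \sum_{i=1}^n \max_{1 \le j \le M} \frac{X_{i,j}^2}{n^2},$$

where $c'' > 0$ is an absolute constant. Taking $r = \sigma\sqrt{(\log M)^{1+\delta}/n}$ and using Assumption 5 yields that

$$P(\mathcal{A}^c) \le \frac{c}{(\log M)^\delta},$$

where $c > 0$ is a constant depending only on $c'$.

The following result is Lemma 5.2.2, page 188 of [15].

Lemma 3. Let $Y_1, \dots, Y_n \in \mathbb{R}^M$ be independent random vectors with zero means and finite variance, and let $M \ge 3$. Then for every $p \in [2, \infty]$, we have

$$E\Big[\Big|\sum_{i=1}^n Y_i\Big|_p^2\Big] \le c \min[p, \log M] \sum_{i=1}^n E\big[|Y_i|_p^2\big],$$

where $c > 0$ is an absolute constant.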
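For intuition about Lemma 3 with $p = \infty$, both sides can be computed exactly in a small case. The Python sketch below takes $Y_i = \varepsilon_i v_i$ with fixed vectors $v_i$ and independent uniform random signs $\varepsilon_i$, so the $Y_i$ are independent with zero mean, and it enumerates all sign patterns. The vectors are hypothetical, and the resulting ratio only shows which constant this single instance requires; it says nothing about the best constant $c$.

```python
import itertools
import math

def sup_norm(v):
    return max(abs(x) for x in v)

def exact_second_moment(vectors):
    """E |sum_i eps_i v_i|_inf^2 over independent uniform signs, by enumeration."""
    n, M = len(vectors), len(vectors[0])
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        s = [sum(e * v[j] for e, v in zip(signs, vectors)) for j in range(M)]
        total += sup_norm(s) ** 2
    return total / 2 ** n

# Hypothetical instance with n = 4 vectors in dimension M = 3.
vectors = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0],
           [1.0, 1.0, 1.0]]
M = len(vectors[0])

lhs = exact_second_moment(vectors)                      # 3.5
sum_of_moments = sum(sup_norm(v) ** 2 for v in vectors)  # 4.0
rhs_without_constant = math.log(M) * sum_of_moments
print(lhs, rhs_without_constant, lhs / rhs_without_constant)
```

Here the sup-norm of the sum equals 2 unless the first three signs all differ from the fourth, which happens with probability 1/8, so the left-hand side is $4 \cdot 7/8 = 3.5$.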

Acknowledgements

I wish to thank my advisor, Alexandre Tsybakov, for insightful comments and the time he kindly devoted to me.

References

[1] P.J. Bickel, Y. Ritov and A.B. Tsybakov (2007). Simultaneous analysis of Lasso and Dantzig selector. Submitted to Ann. Statist. Available at http://www.proba.jussieu.fr/pageperso/tsybakov/.
[2] F. Bunea (2007). Consistent selection via the Lasso for high dimensional approximating regression models. IMS Lecture Notes-Monograph Series, to appear.
[3] F. Bunea, A.B. Tsybakov and M.H. Wegkamp (2007). Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics 1, 169–194. MR2312149
[4] F. Bunea, A.B. Tsybakov and M.H. Wegkamp (2007). Aggregation for Gaussian regression. Ann. Statist. 35 (4), 1674–1697. MR2351101
[5] S.S. Chen, D.L. Donoho and M.A. Saunders (1999). Atomic Decomposition by Basis Pursuit. SIAM Journal on Scientific Computing 20, 33–61. MR1854649
[6] E. Candes and T. Tao (2007). The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., to appear.
[7] D.L. Donoho, M. Elad and V. Temlyakov (2006). Stable recovery of Sparse Overcomplete representations in the Presence of Noise. IEEE Trans. on Information Theory 52, 6–18. MR2237332
[8] B. Efron, T. Hastie, I. Johnstone and R. Tibshirani (2004). Least angle regression. Ann. Statist. 32, 402–451. MR2060166
[9] E. Greenshtein and Y. Ritov (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10 (6), 971–988. MR2108039
[10] K. Knight and W.J. Fu (2000). Asymptotics for lasso-type estimators. Ann. Statist. 28, 1356–1378. MR1805787
[11] V. Koltchinskii (2006). Sparsity in penalized empirical risk minimization. Manuscript.
[12] V. Koltchinskii (2007). Dantzig selector and sparsity oracle inequalities. Manuscript.
[13] N. Meinshausen and P. Bühlmann (2006). High dimensional graphs and variable selection with the Lasso. Ann. Statist. 34, 1436–1462. MR2278363
[14] N. Meinshausen and B. Yu (2006). Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist., to appear.
[15] A. Nemirovski (2000). Topics in nonparametric statistics. In Lectures on probability theory and statistics (Saint Flour, 1998), Lecture Notes in Math., vol. 1738. Springer, Berlin, 85–277. MR1775640

[16] M.R. Osborne, B. Presnell and B.A. Turlach (2000). On the Lasso and its dual. Journal of Computational and Graphical Statistics 9, 319–337. MR1822089
[17] R. Tibshirani (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 58, 267–288. MR1379242
[18] S.A. Van der Geer (2007). High dimensional generalized linear models and the Lasso. Ann. Statist., to appear.
[19] S.A. Van der Geer (2007). The Deterministic Lasso. Tech Report n° 140, Seminar für Statistik ETH, Zürich.
[20] M.J. Wainwright (2006). Sharp thresholds for noisy and high-dimensional recovery of sparsity using $\ell_1$-constrained quadratic programming. Technical report n° 709, Department of Statistics, UC Berkeley.
[21] C.H. Zhang and J. Huang (2007). The sparsity and bias of the Lasso selection in high-dimensional linear regression. Ann. Statist., to appear.
[22] P. Zhao and B. Yu (2007). On model selection consistency of Lasso. Journal of Machine Learning Research 7, 2541–2567. MR2274449
[23] H. Zou (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical Association 101, n° 476, 1418–1429. MR2279469