A REMARK ON THE LASSO AND THE DANTZIG SELECTOR


YOHANN DE CASTRO

Abstract. This article investigates a new parameter for high-dimensional regression with noise: the distortion. The latter has attracted a lot of attention recently with the appearance of new deterministic constructions of almost-Euclidean sections of the $\ell_1$-ball. It measures how far the intersection between the kernel of the design matrix and the unit $\ell_1$-ball is from an $\ell_2$-ball. We show that the distortion holds enough information to derive oracle inequalities (i.e. a comparison to an ideal situation where one knows the $s$ largest coefficients of the target) for the lasso and the Dantzig selector.

Key words and phrases: Lasso; Dantzig selector; Oracle inequality; Almost-Euclidean section; Distortion.

1. Introduction

In the past decade, much emphasis has been put on recovering a large number of unknown variables from few noisy observations. Consider the high-dimensional linear model where one observes a vector $y \in \mathbb{R}^n$ such that
$$ y = X\beta^* + \varepsilon, $$
where $X \in \mathbb{R}^{n\times p}$ is called the design matrix (known from the experimenter), $\beta^* \in \mathbb{R}^p$ is an unknown target vector one would like to recover, and $\varepsilon \in \mathbb{R}^n$ is a stochastic error term that contains all the perturbations of the experiment. A standard hypothesis in high-dimensional regression [HTF09] requires that one can provide a constant $\lambda_n^0 \in \mathbb{R}$, as small as possible, such that
(1)  $\|X^\top \varepsilon\|_{\ell_\infty} \le \lambda_n^0$
with an overwhelming probability, where $X^\top \in \mathbb{R}^{p\times n}$ denotes the transpose of $X$. In the case of an $n$-multivariate Gaussian noise, it is known that one can take $\lambda_n^0 = \mathcal{O}\big(\sigma_n\sqrt{\log p}\,\big)$ (up to the normalization of the columns of $X$), where $\sigma_n > 0$ denotes the standard deviation of the noise; see Lemma A.1.

Suppose that you have far fewer observations $y_i$ than unknown variables $\beta^*_i$. For instance, let us mention the Compressed Sensing problem [Don06, CRT06], where one would like to simultaneously acquire and compress a signal using few (non-adaptive) linear measurements, i.e. $n \ll p$. In general terms, we are interested in accurately estimating the target vector $\beta^*$ and the response $X\beta^*$ from few and corrupted observations. During the past decade this challenging issue has attracted a lot of attention in the statistical community. In 1996, R. Tibshirani introduced the lasso [Tib96]:
(2)  $\hat\beta_\ell \in \arg\min_{\beta\in\mathbb{R}^p} \Big\{ \tfrac12\|X\beta - y\|_{\ell_2}^2 + \lambda_\ell\|\beta\|_{\ell_1} \Big\},$
where $\lambda_\ell > 0$ denotes a tuning parameter. Two decades later, this estimator continues to play a key role in our understanding of high-dimensional inverse problems. Its popularity might be due to the fact that it is computationally tractable: indeed, the lasso can be recast as a Second Order Cone Program (SOCP) that can be solved using an interior point method.
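For readers who want to experiment with (2) numerically, the following short Python sketch solves the lasso program exactly as written, using the generic convex solver CVXPY. This is only an illustration under arbitrary choices (the synthetic Gaussian design, the seed, and the tuning value `lam_ell` are assumptions of this example, not prescriptions of the paper).

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p, s = 50, 200, 5           # few observations, many variables, s-sparse target
sigma_n = 0.5                  # noise standard deviation

X = rng.standard_normal((n, p))            # i.i.d. Gaussian design
beta_star = np.zeros(p)
beta_star[:s] = rng.standard_normal(s)     # s-sparse target beta*
y = X @ beta_star + sigma_n * rng.standard_normal(n)

# Tuning parameter of the order of the column norms times sigma_n * sqrt(2 log p), cf. Lemma A.1
lam_ell = 2 * np.max(np.linalg.norm(X, axis=0)) * sigma_n * np.sqrt(2 * np.log(p))

beta = cp.Variable(p)
objective = cp.Minimize(0.5 * cp.sum_squares(X @ beta - y) + lam_ell * cp.norm(beta, 1))
cp.Problem(objective).solve()

beta_lasso = beta.value
print("l1 estimation error:", np.linalg.norm(beta_lasso - beta_star, 1))
```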

Recently, E.J. Candès and T. Tao [CT07] introduced the Dantzig selector:
(3)  $\hat\beta_d \in \arg\min_{\beta\in\mathbb{R}^p} \|\beta\|_{\ell_1}$  subject to  $\|X^\top(y - X\beta)\|_{\ell_\infty} \le \lambda_d,$
where $\lambda_d > 0$ is a tuning parameter. It is known that it can be recast as a linear program; hence it is also computationally tractable. A great statistical challenge is then to find efficiently verifiable conditions on $X$ ensuring that the lasso (2) or the Dantzig selector (3) recovers most of the information about the target vector $\beta^*$.

1.1. Our goal. What do we precisely mean by "most of the information about the target"? What is the amount of information one could recover from few observations? These are two of the important questions raised by Compressed Sensing. Suppose that you want to find an $s$-sparse vector (i.e. a vector with at most $s$ non-zero coefficients) that represents the target; then you would probably want it to contain the $s$ largest (in magnitude) coefficients $\beta^*_i$. More precisely, denote by $S \subseteq \{1,\dots,p\}$ the set of the indices of the $s$ largest coefficients. The $s$-best term approximation vector is $\beta^*_S \in \mathbb{R}^p$, where $(\beta^*_S)_i = \beta^*_i$ if $i \in S$ and $0$ otherwise. Observe that it is the $s$-sparse projection with respect to any $\ell_q$-norm for $1 \le q < +\infty$ (i.e. it minimizes the $\ell_q$-distance to $\beta^*$ among all the $s$-sparse vectors), and hence the most natural approximation by an $s$-sparse vector.

Suppose that someone gives you all the keys to recover $\beta^*_S$. More precisely, imagine that you know the subset $S$ in advance and that you observe
$$ y_{\mathrm{oracle}} = X\beta^*_S + \varepsilon. $$
This is an ideal situation, referred to as the oracle case. Assume that the noise $\varepsilon$ is a Gaussian white noise of standard deviation $\sigma_n$, i.e. $\varepsilon \sim \mathcal{N}_n(0, \sigma_n^2\,\mathrm{Id}_n)$, where $\mathcal{N}_n$ denotes the $n$-multivariate Gaussian distribution. Then the optimal estimator is the ordinary least squares $\hat\beta_{\mathrm{ideal}} \in \mathbb{R}^p$ on the subset $S$, namely
$$ \hat\beta_{\mathrm{ideal}} \in \arg\min_{\beta\in\mathbb{R}^p,\ \mathrm{Supp}(\beta)\subseteq S} \|X\beta - y_{\mathrm{oracle}}\|_{\ell_2}, $$
where $\mathrm{Supp}(\beta) \subseteq \{1,\dots,p\}$ denotes the support (i.e. the set of the indices of the non-zero coefficients) of the vector $\beta$. It holds
$$ \|\hat\beta_{\mathrm{ideal}} - \beta^*\|_{\ell_1} = \|\hat\beta_{\mathrm{ideal}} - \beta^*_S\|_{\ell_1} + \|\beta^*_{S^c}\|_{\ell_1} \le \sqrt{s}\,\|\hat\beta_{\mathrm{ideal}} - \beta^*_S\|_{\ell_2} + \|\beta^*_{S^c}\|_{\ell_1}, $$
where $\beta^*_{S^c} = \beta^* - \beta^*_S$ denotes the error vector of the $s$-best term approximation. A calculation of the least squares solution shows that
$$ \mathbb{E}\,\|\hat\beta_{\mathrm{ideal}} - \beta^*_S\|_{\ell_2}^2 = \mathbb{E}\,\|(X_S^\top X_S)^{-1}X_S^\top \varepsilon\|_{\ell_2}^2 = \mathrm{Trace}\big((X_S^\top X_S)^{-1}\big)\,\sigma_n^2 \le \frac{s}{\rho_S^2}\,\sigma_n^2, $$
where $X_S \in \mathbb{R}^{n\times s}$ denotes the matrix composed of the columns $X_i \in \mathbb{R}^n$ of $X$ such that $i \in S$, and $\rho_S$ denotes the smallest singular value of $X_S$. It yields that $\big(\mathbb{E}\,\|\hat\beta_{\mathrm{ideal}} - \beta^*_S\|_{\ell_2}^2\big)^{1/2} \le \sigma_n\sqrt{s}/\rho_S$. In a nutshell, the $\ell_1$-distance between the target $\beta^*$ and the optimal estimator $\hat\beta_{\mathrm{ideal}}$ can reasonably be said to be of the order of
(4)  $\dfrac{\sigma_n\, s}{\rho_S} + \|\beta^*_{S^c}\|_{\ell_1}.$
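As with the lasso, the computational tractability of the Dantzig selector (3) is easy to see in practice: the program can be handed directly to a linear-programming-capable solver. The sketch below is again only an illustration (CVXPY, the synthetic design and the value of `lam_d` are assumptions of this example).

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n, p, s = 50, 200, 5
sigma_n = 0.5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = rng.standard_normal(s)
y = X @ beta_star + sigma_n * rng.standard_normal(n)

# Tuning parameter of the same order as in Lemma A.1
lam_d = 2 * np.max(np.linalg.norm(X, axis=0)) * sigma_n * np.sqrt(2 * np.log(p))

# Dantzig selector (3): minimize ||beta||_1 subject to ||X^T (y - X beta)||_inf <= lam_d
beta = cp.Variable(p)
constraints = [cp.norm(X.T @ (y - X @ beta), "inf") <= lam_d]
cp.Problem(cp.Minimize(cp.norm(beta, 1)), constraints).solve()

print("l1 estimation error (Dantzig selector):", np.linalg.norm(beta.value - beta_star, 1))
```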

In this article, we say that the lasso satisfies a variable selection oracle inequality of order $s$ if and only if its $\ell_1$-distance to the target, namely $\|\hat\beta_\ell - \beta^*\|_{\ell_1}$, is bounded by (4) up to a satisfactory multiplicative factor.

In some situations it could be interesting to have a good approximation of $X\beta^*$. In the oracle case we have
$$ \|X\hat\beta_{\mathrm{ideal}} - X\beta^*\|_{\ell_2} \le \|X\hat\beta_{\mathrm{ideal}} - X\beta^*_S\|_{\ell_2} + \|X\beta^*_{S^c}\|_{\ell_2} \le \|X\hat\beta_{\mathrm{ideal}} - X\beta^*_S\|_{\ell_2} + \rho\,\|\beta^*_{S^c}\|_{\ell_1}, $$
where $\rho$ denotes the largest singular value of $X$. An easy calculation gives that
$$ \mathbb{E}\,\|X\hat\beta_{\mathrm{ideal}} - X\beta^*_S\|_{\ell_2}^2 = \mathrm{Trace}\big(X_S(X_S^\top X_S)^{-1}X_S^\top\big)\,\sigma_n^2 = \sigma_n^2\, s. $$
Hence a tolerable upper bound is given by
(5)  $\sigma_n\sqrt{s} + \rho\,\|\beta^*_{S^c}\|_{\ell_1}.$
We say that the lasso satisfies an error prediction oracle inequality of order $s$ if and only if its prediction error is upper bounded by (5) up to a satisfactory multiplicative factor (say, logarithmic in $p$).

1.2. Framework. In this article we investigate designs with known distortion. We begin with the definition of the latter.

Definition. A subspace $\Gamma \subseteq \mathbb{R}^p$ has a distortion $1 \le \delta \le \sqrt{p}$ if and only if
$$ \forall x \in \Gamma, \qquad \|x\|_{\ell_1} \le \sqrt{p}\,\|x\|_{\ell_2} \le \delta\,\|x\|_{\ell_1}. $$

A long-standing issue in approximation theory in Banach spaces is to find almost-Euclidean sections of the unit $\ell_1$-ball, i.e. subspaces with a distortion $\delta$ close to $1$ and a dimension close to $p$. In particular, we recall that it has been established [Kas77] that, with an overwhelming probability, a random subspace of dimension $p - n$ (with respect to the Haar measure on the Grassmannian) satisfies
(6)  $\delta \le C\,\Big(\dfrac{p\,(1 + \log(p/n))}{n}\Big)^{1/2},$
where $C > 0$ is a universal constant. In other words, it was shown that for all $n \le p$ there exists a subspace $\Gamma_n$ of dimension $p - n$ such that, for all $x \in \Gamma_n$,
$$ \|x\|_{\ell_2} \le C\,\Big(\frac{1 + \log(p/n)}{n}\Big)^{1/2}\|x\|_{\ell_1}. $$

Remark. Hence our framework also covers unitarily invariant random matrices, for instance matrices with i.i.d. Gaussian entries. Observe that their distortion satisfies (6).

Recently, new deterministic constructions of almost-Euclidean sections of the $\ell_1$-ball have been given. Most of them can be viewed as related to the context of error-correcting codes. Indeed, the construction of [Ind07] is based on amplifying the minimum distance of a code using expanders, while the construction of [GLR08] is based on Low-Density Parity Check (LDPC) codes. Finally, the construction of [IS10] is related to the tensor product of error-correcting codes. The main reason for this surprising fact is that the vectors of a subspace with low distortion must be well spread, i.e. a small subset of their coordinates cannot contain most of their $\ell_1$-norm (cf. [Ind07, GLR08]). This property is required from a good error-correcting code, where the weight (i.e. the $\ell_0$-norm) of each codeword cannot be concentrated on a small subset of its coordinates. Similarly, this property was intensively studied in Compressed Sensing; see for instance the Nullspace Property in [CDD09].
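The distortion of the kernel of a concrete design can be explored numerically. Computing $\delta$ exactly is a hard optimization problem in general, but a Monte Carlo lower bound, obtained by sampling directions in $\ker(X)$ through an orthonormal basis of the null space, already illustrates how a Gaussian design compares with the Kashin-type bound (6). The sketch below is only such an illustration under these stated assumptions; it is not part of the paper.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(2)
n, p = 50, 200
X = rng.standard_normal((n, p))

B = null_space(X)                    # orthonormal basis of ker(X), shape (p, p - n)
coeffs = rng.standard_normal((B.shape[1], 20000))
V = B @ coeffs                       # random vectors of the kernel

# Distortion: sup over the kernel of sqrt(p) * ||x||_2 / ||x||_1 (here, a Monte Carlo lower bound)
ratios = np.sqrt(p) * np.linalg.norm(V, axis=0) / np.linalg.norm(V, ord=1, axis=0)
delta_lower = ratios.max()

kashin_bound = np.sqrt(p * (1 + np.log(p / n)) / n)   # right-hand side of (6) with C = 1
print("Monte Carlo lower bound on the distortion:", delta_lower)
print("Kashin-type bound (6), up to the universal constant:", kashin_bound)
```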

Remark. The main point of this article is that all of these deterministic constructions give efficient designs for the lasso and the Dantzig selector.

1.3. The Universal Distortion Property. In the past decade, numerous conditions have been given to prove oracle inequalities for the lasso and the Dantzig selector. An overview of important conditions can be found in [vdGB09]. We introduce a new condition, the Universal Distortion Property (UDP).

Definition (UDP$(s_0, \kappa_0, \Delta)$). Given $1 \le s_0 \le p$, $0 < \kappa_0 < 1/2$ and $\Delta > 0$, we say that a matrix $X \in \mathbb{R}^{n\times p}$ satisfies the universal distortion condition of order $s_0$, magnitude $\kappa_0$ and parameter $\Delta$ if and only if for all $\gamma \in \mathbb{R}^p$, for all integers $s \in \{1,\dots,s_0\}$ and for all subsets $S \subseteq \{1,\dots,p\}$ such that $|S| = s$, it holds
(7)  $\|\gamma_S\|_{\ell_1} \le \Delta\sqrt{s}\,\|X\gamma\|_{\ell_2} + \kappa_0\,\|\gamma\|_{\ell_1}.$

Remark. Observe that the design $X$ is not normalized. Equation (8) in Theorem 1.2 shows that $\Delta$ can depend on the inverse of the smallest singular value of $X$; hence the quantity $\Delta\,\|X\gamma\|_{\ell_2}$ is scale invariant. The UDP condition is similar to the Magic Condition [BLPR11] and the Compatibility Condition [vdGB09]. The main point of this article is that UDP is verifiable as soon as one can give an upper bound on the distortion of the kernel of the design matrix; see Theorem 1.2. Hence, instead of proving that a sufficient condition (such as RIP [CRT06], REC [BRT09], Compatibility [vdGB09], ...) holds, it is sufficient to compute the distortion and the smallest singular value of the design. This matters because these conditions can be hard to check for a given matrix: we recall that it is an open problem to find a computationally efficient algorithm deciding whether a given matrix satisfies the RIP condition [CRT06] or not.

We call the property "Universal Distortion" because it is satisfied by all full rank matrices (Universal) and because the parameters $s_0$ and $\Delta$ can be expressed in terms of the distortion of the kernel $\Gamma$ of $X$.

Theorem 1.1. Let $X \in \mathbb{R}^{n\times p}$ be a full rank matrix. Denote by $\delta$ the distortion of its kernel,
$$ \delta = \sup_{\gamma\in\ker(X)\setminus\{0\}} \frac{\sqrt{p}\,\|\gamma\|_{\ell_2}}{\|\gamma\|_{\ell_1}}, $$
and by $\rho_n$ its smallest singular value. Then for all $\gamma \in \mathbb{R}^p$,
$$ \|\gamma\|_{\ell_2} \le \frac{\delta}{\sqrt{p}}\,\|\gamma\|_{\ell_1} + \frac{2\delta}{\rho_n}\,\|X\gamma\|_{\ell_2}. $$
Equivalently, we have $B := \{\gamma\in\mathbb{R}^p :\ (\delta/\sqrt{p})\|\gamma\|_{\ell_1} + (2\delta/\rho_n)\|X\gamma\|_{\ell_2} \le 1\} \subseteq B_2^p$, where $B_2^p$ denotes the Euclidean unit ball in $\mathbb{R}^p$.

This result implies that every full rank matrix satisfies UDP, with parameters described as follows.

Theorem 1.2. Let $X \in \mathbb{R}^{n\times p}$ be a full rank matrix. Denote by $\delta$ the distortion of its kernel and by $\rho_n$ its smallest singular value. Let $0 < \kappa_0 < 1/2$; then $X$ satisfies UDP$(s_0, \kappa_0, \Delta)$ where
(8)  $s_0 = \Big(\dfrac{\kappa_0}{\delta}\Big)^2 p \quad\text{and}\quad \Delta = \dfrac{2\delta}{\rho_n}.$
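The inequality of Theorem 1.1 (and hence the UDP parameters of Theorem 1.2) can be spot-checked numerically. In the sketch below, the distortion is replaced by the Monte Carlo lower bound of the previous illustration, so observed ratios slightly above one only reflect that underestimate; this is a sanity check under stated assumptions, not a certificate.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(3)
n, p = 50, 200
X = rng.standard_normal((n, p))

# Monte Carlo lower bound on the distortion delta of ker(X), as in the previous sketch
B = null_space(X)
V = B @ rng.standard_normal((B.shape[1], 20000))
delta_hat = np.max(np.sqrt(p) * np.linalg.norm(V, axis=0) / np.linalg.norm(V, ord=1, axis=0))

rho_n = np.linalg.svd(X, compute_uv=False).min()   # smallest (non-zero) singular value

# Spot-check Theorem 1.1: ||g||_2 <= (delta/sqrt(p)) ||g||_1 + (2 delta / rho_n) ||X g||_2
# on random dense and sparse test vectors.
ratios = []
for _ in range(1000):
    g = rng.standard_normal(p)
    if rng.random() < 0.5:                # make half of the test vectors sparse
        mask = rng.random(p) < 0.05
        if not mask.any():
            continue
        g = g * mask
    rhs = (delta_hat / np.sqrt(p)) * np.linalg.norm(g, 1) \
        + (2 * delta_hat / rho_n) * np.linalg.norm(X @ g)
    ratios.append(np.linalg.norm(g) / rhs)

print("largest observed ratio lhs/rhs:", max(ratios))
```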

This theorem is sharp in the following sense. The parameter $s_0$ represents (see Theorem 2.1) the maximum number of coefficients that can be recovered using the lasso; we call it the sparsity level. It is known [CDD09] that the best bound one could expect is $s_{\mathrm{opt}} \simeq n/\log(p/n)$, up to a multiplicative constant. In the case where (6) holds, the sparsity level satisfies
(9)  $s_0 \ge \kappa_0^2\, s_{\mathrm{opt}},$
up to a multiplicative constant. It shows that any design matrix with low distortion satisfies UDP with an optimal sparsity level.

2. Oracle inequalities

The results presented here fall into two parts. In the first part we assume only that UDP holds; in particular, it is not excluded that one can get better upper bounds on the parameters than Theorem 1.2, and the smaller $\Delta$ is, the sharper the oracle inequalities are. In the second part we give oracle inequalities in terms of the distortion of the design only.

Theorem 2.1. Let $X \in \mathbb{R}^{n\times p}$ be a full column rank matrix. Assume that $X$ satisfies UDP$(s_0,\kappa_0,\Delta)$ and that (1) holds. Then for any
(10)  $\lambda_\ell > \lambda_n^0/(1 - 2\kappa_0),$
it holds
(11)  $\|\hat\beta_\ell - \beta^*\|_{\ell_1} \le \dfrac{2}{\big(1 - \frac{\lambda_n^0}{\lambda_\ell}\big) - 2\kappa_0}\ \min_{S\subseteq\{1,\dots,p\},\ |S|=s\le s_0} \Big(\lambda_\ell\,\Delta^2 s + \|\beta^*_{S^c}\|_{\ell_1}\Big).$
Moreover, for every full column rank matrix $X\in\mathbb{R}^{n\times p}$, for all $0<\kappa_0<1/2$ and $\lambda_\ell$ satisfying (10), we have
(12)  $\|\hat\beta_\ell - \beta^*\|_{\ell_1} \le \dfrac{2}{\big(1 - \frac{\lambda_n^0}{\lambda_\ell}\big) - 2\kappa_0}\ \min_{S\subseteq\{1,\dots,p\},\ |S|=s\le(\kappa_0/\delta)^2 p} \Big(\lambda_\ell\,\dfrac{4\delta^2}{\rho_n^2}\,s + \|\beta^*_{S^c}\|_{\ell_1}\Big),$
where $\rho_n$ denotes the smallest singular value of $X$ and $\delta$ the distortion of its kernel.

Consider the case where the noise satisfies the hypothesis of Lemma A.1 and take $\lambda_n^0 = \lambda_n^0(1)$. Assume that $\kappa_0$ is a constant (say $\kappa_0 = 1/3$) and take $\lambda_\ell$ of the order of $3\lambda_n^0$ (so that (10) holds); the minimum in (11) then becomes
$$ \min_{S\subseteq\{1,\dots,p\},\ |S|=s\le s_0} \Big( 6\,\|X\|_{\ell_2}\,\Delta^2\sqrt{2\log p}\;\sigma_n\,s + \|\beta^*_{S^c}\|_{\ell_1} \Big), $$
so that (11) is an oracle inequality up to a multiplicative factor of the order of $\sqrt{\log p}$. In the same way, the minimum in (12) becomes
$$ \min_{S\subseteq\{1,\dots,p\},\ |S|=s\le p/(9\delta^2)} \Big( 24\,\|X\|_{\ell_2}\,\dfrac{\delta^2}{\rho_n^2}\,\sqrt{2\log p}\;\sigma_n\,s + \|\beta^*_{S^c}\|_{\ell_1} \Big), $$
so that (12) is an oracle inequality up to a multiplicative factor of the order of $C_{\mathrm{mult}} := \delta^2\sqrt{\log p}\,/\rho_n^2$. In the optimal case (6), this latter becomes
(13)  $C_{\mathrm{mult}} = C^2\,\dfrac{p\,(1+\log(p/n))}{n}\cdot\dfrac{\sqrt{\log p}}{\rho_n^2},$
where $C>0$ is the same universal constant as in (6). Roughly speaking, up to a factor of the order of (13), the lasso is as good as the oracle that knows the $s_0$-best term approximation of the target. Moreover, as mentioned in (9), $s_0$ is an optimal sparsity level. However, this multiplicative constant takes small values only in a restrictive range of the parameter $n$; as a matter of fact, it is meaningful when $n$ is a constant fraction of $p$.
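To get a feel for the oracle inequality of Theorem 2.1, one can compare the lasso's $\ell_1$ error with the ideal quantity (4) on a compressible target. The script below is a rough empirical illustration only: it uses CVXPY for the lasso, an arbitrary synthetic design, and the true support for the oracle term, which a practitioner would of course not know.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
n, p, s = 100, 400, 8
sigma_n = 0.5

X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 3 * rng.standard_normal(s)
beta_star[s:] = 0.01 * rng.standard_normal(p - s)   # small tail: compressible, not exactly sparse
y = X @ beta_star + sigma_n * rng.standard_normal(n)

lam_ell = 2 * np.max(np.linalg.norm(X, axis=0)) * sigma_n * np.sqrt(2 * np.log(p))
beta = cp.Variable(p)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(X @ beta - y) + lam_ell * cp.norm(beta, 1))).solve()
lasso_err = np.linalg.norm(beta.value - beta_star, 1)

# Oracle benchmark (4): sigma_n * s / rho_S + ||beta*_{S^c}||_1, with S the s largest coefficients
S = np.argsort(-np.abs(beta_star))[:s]
rho_S = np.linalg.svd(X[:, S], compute_uv=False).min()
oracle = sigma_n * s / rho_S + np.linalg.norm(np.delete(beta_star, S), 1)

print("lasso l1 error:", lasso_err)
print("oracle benchmark (4):", oracle)
print("observed multiplicative factor:", lasso_err / oracle)
```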

Similarly, we show oracle inequalities in prediction error in terms of the distortion of the kernel of the design.

Theorem 2.2. Let $X\in\mathbb{R}^{n\times p}$ be a full column rank matrix. Assume that $X$ satisfies UDP$(s_0,\kappa_0,\Delta)$ and that (1) holds. Then for any $\lambda_\ell$ satisfying (10), it holds
(14)  $\|X\hat\beta_\ell - X\beta^*\|_{\ell_2} \le \min_{S\subseteq\{1,\dots,p\},\ |S|=s\le s_0} \Big( 4\lambda_\ell\,\Delta\sqrt{s} + \dfrac{\|\beta^*_{S^c}\|_{\ell_1}}{\Delta\sqrt{s}} \Big).$
Moreover, for every full column rank matrix $X\in\mathbb{R}^{n\times p}$, for all $0<\kappa_0<1/2$ and $\lambda_\ell$ satisfying (10), we have
(15)  $\|X\hat\beta_\ell - X\beta^*\|_{\ell_2} \le \min_{S\subseteq\{1,\dots,p\},\ |S|=s\le(\kappa_0/\delta)^2 p} \Big( \dfrac{8\lambda_\ell\,\delta\sqrt{s}}{\rho_n} + \dfrac{\rho_n\,\|\beta^*_{S^c}\|_{\ell_1}}{2\delta\sqrt{s}} \Big),$
where $\rho_n$ denotes the smallest singular value of $X$ and $\delta$ the distortion of its kernel.

Consider the case where the noise satisfies the hypothesis of Lemma A.1 and take $\lambda_n^0 = \lambda_n^0(1)$. Assume that $\kappa_0$ is a constant (say $\kappa_0 = 1/3$) and take $\lambda_\ell = 3\lambda_n^0$; then (14) becomes
$$ \|X\hat\beta_\ell - X\beta^*\|_{\ell_2} \le \min_{S\subseteq\{1,\dots,p\},\ |S|=s\le s_0} \Big[ 24\,\|X\|_{\ell_2}\sqrt{2\log p}\;\sigma_n\,\Delta\sqrt{s} + \dfrac{\|\beta^*_{S^c}\|_{\ell_1}}{\Delta\sqrt{s}} \Big], $$
which is not an oracle inequality stricto sensu because of the $1/(\Delta\sqrt{s})$ in the second term; as a matter of fact, it tends to lower the $s$-best term approximation term $\|\beta^*_{S^c}\|_{\ell_1}$. Nevertheless, it is almost an oracle inequality, up to a multiplicative factor of the order of $\sqrt{\log p}$. In the same way, (15) becomes
$$ \|X\hat\beta_\ell - X\beta^*\|_{\ell_2} \le \min_{S\subseteq\{1,\dots,p\},\ |S|=s\le p/(9\delta^2)} \Big[ 48\,\|X\|_{\ell_2}\,\dfrac{\delta}{\rho_n}\,\sqrt{2\log p}\;\sigma_n\sqrt{s} + \dfrac{\rho_n\,\|\beta^*_{S^c}\|_{\ell_1}}{2\delta\sqrt{s}} \Big], $$
which is an oracle inequality up to a multiplicative factor of the order of $C_{\mathrm{mult}} := \delta\sqrt{\log p}\,/\rho_n$. In the optimal case (6), this latter becomes
(16)  $C_{\mathrm{mult}} = C\,\dfrac{\big(p\,\log p\,(1+\log(p/n))\big)^{1/2}}{\rho_n\sqrt{n}},$
where $C>0$ is the same universal constant as in (6).

2.1. Results for the Dantzig selector. Similarly, we derive the same results for the Dantzig selector. The only difference is that the parameter $\kappa_0$ must be less than $1/4$. Here again the results fall into two parts: in the first one we only assume that UDP holds; in the second we invoke Theorem 1.2 to derive results in terms of the distortion of the design.

Theorem 2.3. Let $X\in\mathbb{R}^{n\times p}$ be a full column rank matrix. Assume that $X$ satisfies UDP$(s_0,\kappa_0,\Delta)$ with $\kappa_0<1/4$ and that (1) holds. Then for any
(17)  $\lambda_d > \lambda_n^0/(1-4\kappa_0),$
it holds
(18)  $\|\hat\beta_d - \beta^*\|_{\ell_1} \le \dfrac{4}{\big(1-\frac{\lambda_n^0}{\lambda_d}\big)-4\kappa_0}\,\min_{S\subseteq\{1,\dots,p\},\ |S|=s\le s_0} \Big( \lambda_d\,\Delta^2 s + \|\beta^*_{S^c}\|_{\ell_1} \Big).$

Moreover, for every full column rank matrix $X\in\mathbb{R}^{n\times p}$, for all $0<\kappa_0<1/4$ and $\lambda_d$ satisfying (17), we have
(19)  $\|\hat\beta_d - \beta^*\|_{\ell_1} \le \dfrac{4}{\big(1-\frac{\lambda_n^0}{\lambda_d}\big)-4\kappa_0}\,\min_{S\subseteq\{1,\dots,p\},\ |S|=s\le(\kappa_0/\delta)^2 p} \Big( \lambda_d\,\dfrac{4\delta^2}{\rho_n^2}\,s + \|\beta^*_{S^c}\|_{\ell_1} \Big),$
where $\rho_n$ denotes the smallest singular value of $X$ and $\delta$ the distortion of its kernel.

The prediction error is given by the following theorem.

Theorem 2.4. Let $X\in\mathbb{R}^{n\times p}$ be a full column rank matrix. Assume that $X$ satisfies UDP$(s_0,\kappa_0,\Delta)$ with $\kappa_0<1/4$ and that (1) holds. Then for any $\lambda_d$ satisfying (17), it holds
(20)  $\|X\hat\beta_d - X\beta^*\|_{\ell_2} \le \min_{S\subseteq\{1,\dots,p\},\ |S|=s\le s_0} \Big( 4\lambda_d\,\Delta\sqrt{s} + \dfrac{\|\beta^*_{S^c}\|_{\ell_1}}{\Delta\sqrt{s}} \Big).$
Moreover, for every full column rank matrix $X\in\mathbb{R}^{n\times p}$, for all $0<\kappa_0<1/4$ and $\lambda_d$ satisfying (17), we have
(21)  $\|X\hat\beta_d - X\beta^*\|_{\ell_2} \le \min_{S\subseteq\{1,\dots,p\},\ |S|=s\le(\kappa_0/\delta)^2 p} \Big( \dfrac{8\lambda_d\,\delta\sqrt{s}}{\rho_n} + \dfrac{\rho_n\,\|\beta^*_{S^c}\|_{\ell_1}}{2\delta\sqrt{s}} \Big),$
where $\rho_n$ denotes the smallest singular value of $X$ and $\delta$ the distortion of its kernel.

Observe that the same comments as in the lasso case (e.g. (13) and (16)) hold. Eventually, every deterministic construction of almost-Euclidean sections gives a design that satisfies the oracle inequalities above.

3. An overview of the standard results

Oracle inequalities for the lasso and the Dantzig selector have been established under a variety of different conditions on the design. In this section we show that the UDP condition is comparable to the standard conditions (RIP, REC and Compatibility) and that our results are in line with the literature on high-dimensional regression.

3.1. The standard conditions. We recall some sufficient conditions here. For all $s\in\{1,\dots,p\}$ we denote by $\Sigma_s\subseteq\mathbb{R}^p$ the set of all the $s$-sparse vectors.

Restricted Isometry Property: A matrix $X\in\mathbb{R}^{n\times p}$ satisfies RIP$(\theta_S)$ if and only if there exists $0<\theta_S<1$ (as small as possible) such that for all $s\in\{1,\dots,S\}$ and for all $\gamma\in\Sigma_s$ it holds
$$ (1-\theta_S)\,\|\gamma\|_{\ell_2}^2 \le \|X\gamma\|_{\ell_2}^2 \le (1+\theta_S)\,\|\gamma\|_{\ell_2}^2. $$
The constant $\theta_S$ is called the $S$-restricted isometry constant.

Restricted Eigenvalue assumption [BRT09]: A matrix $X$ satisfies RE$(s, c_0)$ if and only if
$$ \kappa(s,c_0) = \min_{S\subseteq\{1,\dots,p\},\ |S|\le s}\ \ \min_{\gamma\neq 0,\ \|\gamma_{S^c}\|_{\ell_1}\le c_0\|\gamma_S\|_{\ell_1}}\ \frac{\|X\gamma\|_{\ell_2}}{\|\gamma_S\|_{\ell_2}} > 0. $$
The constant $\kappa(s,c_0)$ is called the $(s,c_0)$-restricted $\ell_2$-eigenvalue.

Compatibility Condition [vdGB09]: We say that a matrix $X\in\mathbb{R}^{n\times p}$ satisfies Compatibility$(s, c_0)$ if and only if
(22)  $\phi(s,c_0) = \min_{S\subseteq\{1,\dots,p\},\ |S|\le s}\ \ \min_{\gamma\neq 0,\ \|\gamma_{S^c}\|_{\ell_1}\le c_0\|\gamma_S\|_{\ell_1}}\ \dfrac{\sqrt{s}\,\|X\gamma\|_{\ell_2}}{\|\gamma_S\|_{\ell_1}} > 0.$
The constant $\phi(s,c_0)$ is called the $(s,c_0)$-restricted $\ell_1$-eigenvalue.

$H_s(\kappa)$ Condition [JN10]: $X\in\mathbb{R}^{n\times p}$ satisfies the $H_s(\kappa)$ condition (with $\kappa<1/2$) if and only if for all $\gamma\in\mathbb{R}^p$ and for all $S\subseteq\{1,\dots,p\}$ such that $|S|\le s$, it holds
$$ \|\gamma_S\|_{\ell_1} \le s\,\hat\lambda\,\|X\gamma\|_{\ell_2} + \kappa\,\|\gamma\|_{\ell_1}, $$
where $\hat\lambda$ denotes the maximum of the $\ell_2$-norms of the columns of $X$.

Remark. The first term of the right-hand side (i.e. $s\,\hat\lambda\,\|X\gamma\|_{\ell_2}$) is greater than the first term of the right-hand side of the UDP condition (i.e. $\Delta\sqrt{s}\,\|X\gamma\|_{\ell_2}$); hence the $H$ condition is weaker than the UDP condition. Nevertheless, the authors of [JN10] established limits of performance for their conditions: the condition $H_s(1/3)$ is feasible only in a severely restricted range of the sparsity parameter $s$. Notice that this is not the case for the UDP condition: the bound (9) shows that it is feasible for a large range of the sparsity parameter $s$. Moreover, a comparison of the two approaches is given in Table 1.

Let us emphasize that the above description is not meant to be exhaustive. In particular, we do not mention the irrepresentable condition [ZY06], which ensures exact recovery of the support. The next proposition shows that the UDP condition is weaker than the RIP, RE and Compatibility conditions.

Proposition 3.1. Let $X\in\mathbb{R}^{n\times p}$ be a full column rank matrix; then the following holds.
The RIP$(\theta_{5s})$ condition with $\theta_{5s}<1$ implies UDP$(s,\kappa_0,\Delta)$ for all pairs $(\kappa_0,\Delta)$ such that
(23)  $\Big[1 + \dfrac{2\sqrt{1-\theta_{5s}}}{\sqrt{1+\theta_{5s}}}\Big]^{-1} < \kappa_0 < \dfrac12 \quad\text{and}\quad \Delta \ge \Big[\sqrt{1-\theta_{5s}} - \dfrac{1-\kappa_0}{2\kappa_0}\sqrt{1+\theta_{5s}}\Big]^{-1}.$
The RE$(s,c_0)$ condition implies UDP$\big(s,\ \tfrac{1}{1+c_0},\ \tfrac{1}{\kappa(s,c_0)}\big)$.
The Compatibility$(s,c_0)$ condition implies UDP$\big(s,\ \tfrac{1}{1+c_0},\ \tfrac{1}{\phi(s,c_0)}\big)$.

Remark. The point here is to show that the UDP condition is similar to the standard conditions of high-dimensional regression. For the sake of simplicity, we do not study whether the converse of Proposition 3.1 is true. As a matter of fact, the UDP, RE and Compatibility conditions are expressions of the same flavor: they aim at controlling the eigenvalues of $X$ on a cone
$$ \big\{\gamma\in\mathbb{R}^p :\ \exists\, S\subseteq\{1,\dots,p\}\ \text{such that}\ |S|\le s\ \text{and}\ \|\gamma_{S^c}\|_{\ell_1}\le c\,\|\gamma_S\|_{\ell_1}\big\}, $$
where $c>0$ is a tuning parameter.

3.2. The results. Table 1 shows that our results are similar to standard results in the literature.

Reference    | Condition | Risk                                              | Prediction
[CP09]       | RIP       | $\ell_2 \lesssim \sigma\sqrt{s\log p} + \ell_2$   | $L_2 \lesssim \sigma\sqrt{s\log p} + L_2$
[BRT09]      | REC       | $\ell_1 \lesssim \sigma s\sqrt{\log p}$ (*)       | $L_2 \lesssim \sigma\sqrt{s\log p}$ (*)
[vdGB09]     | Comp      | $\ell_1 \lesssim \sigma s\sqrt{\log p}$ (*)       | $L_2 \lesssim \sigma\sqrt{s\log p}$ (*)
[JN10]       | H         | $\ell_1 \lesssim \sigma s\sqrt{\log p} + \ell_1$  | No
This article | UDP       | $\ell_1 \lesssim \sigma s\sqrt{\log p} + \ell_1$  | $L_2 \lesssim \sigma\sqrt{s\log p} + \ell_1/\sqrt{s}$

where the notation is given by:

Location        | $\ell_1$                            | $\ell_2$                            | $L_2$
Right-hand side | $\|\beta^*_{S^c}\|_{\ell_1}$        | $\|\beta^*_{S^c}\|_{\ell_2}$        | $\|X\hat\beta_{\mathrm{ideal}} - X\beta^*\|_{\ell_2}$
Left-hand side  | $\|\hat\beta - \beta^*\|_{\ell_1}$  | $\|\hat\beta - \beta^*\|_{\ell_2}$  | $\|X\hat\beta - X\beta^*\|_{\ell_2}$

Table 1. Comparison of results in risk and prediction for the lasso and the Dantzig selector. Observe that all the inequalities are satisfied with an overwhelming probability. The notation $\lesssim$ means that the inequality holds up to a multiplicative factor that may depend on the parameters of the condition. The (*) notation means that the result is given for $s$-sparse targets. The $\hat\beta$ notation represents the estimator (i.e. the lasso or the Dantzig selector). The parameters $\sigma$ and $p$ represent respectively the standard deviation of the noise and the dimension of the target vector $\beta^*$.

Appendix A

The appendix is devoted to the proofs of the different results of this paper.

Lemma A.1. Suppose that $\varepsilon = (\varepsilon_i)_{i=1}^n$ is such that the $\varepsilon_i$'s are i.i.d. Gaussian with mean zero and variance $\sigma_n^2$. Choose $t>0$ and set
$$ \lambda_n^0(t) = (1+t)\,\|X\|_{\ell_2}\,\sigma_n\sqrt{2\log p}, $$
where $\|X\|_{\ell_2}$ denotes the maximum $\ell_2$-norm of the columns of $X$. Then
$$ \mathbb{P}\big(\|X^\top\varepsilon\|_{\ell_\infty} \le \lambda_n^0(t)\big) \ge 1 - \frac{1}{(1+t)\sqrt{\pi\log p}\; p^{(1+t)^2-1}}. $$

Proof of Lemma A.1. Observe that $X^\top\varepsilon \sim \mathcal{N}_p(0, \sigma_n^2 X^\top X)$; hence, for all $j=1,\dots,p$, $\langle X_j,\varepsilon\rangle \sim \mathcal{N}\big(0, \sigma_n^2\|X_j\|_{\ell_2}^2\big)$. Using Šidák's inequality [Sid68], it yields
$$ \mathbb{P}\big(\|X^\top\varepsilon\|_{\ell_\infty} \le \lambda_n^0\big) \ge \mathbb{P}\big(\|\bar\varepsilon\|_{\ell_\infty} \le \lambda_n^0\big) = \prod_{i=1}^p \mathbb{P}\big(|\bar\varepsilon_i| \le \lambda_n^0\big), $$
where the $\bar\varepsilon_i$'s are i.i.d. with respect to $\mathcal{N}\big(0, \sigma_n^2\|X\|_{\ell_2}^2\big)$. Denote by $\Phi$ and $\varphi$ respectively the cumulative distribution function and the probability density function of the standard normal. Set $\theta = (1+t)\sqrt{2\log p}$. It holds
$$ \prod_{i=1}^p \mathbb{P}\big(|\bar\varepsilon_i| \le \lambda_n^0\big) = \mathbb{P}\big(|\bar\varepsilon_1| \le \lambda_n^0\big)^p = \big(2\Phi(\theta) - 1\big)^p > \big(1 - 2\varphi(\theta)/\theta\big)^p, $$
using an integration by parts to get $1-\Phi(\theta) < \varphi(\theta)/\theta$. It yields that
$$ \mathbb{P}\big(\|X^\top\varepsilon\|_{\ell_\infty} \le \lambda_n^0\big) \ge 1 - \frac{2p\,\varphi(\theta)}{\theta} = 1 - \frac{1}{(1+t)\sqrt{\pi\log p}\;p^{(1+t)^2-1}}. $$
This concludes the proof.
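A quick Monte Carlo experiment can check that the bound of Lemma A.1 is not vacuous for moderate $p$. The script below estimates $\mathbb{P}(\|X^\top\varepsilon\|_{\ell_\infty} \le \lambda_n^0(t))$ by simulation and compares it with the guaranteed lower bound; the values of $n$, $p$ and $t$ are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, t = 50, 500, 1.0
sigma_n = 1.0
X = rng.standard_normal((n, p))

col_norm = np.max(np.linalg.norm(X, axis=0))                   # ||X||_{l2}: maximal column l2-norm
lam = (1 + t) * col_norm * sigma_n * np.sqrt(2 * np.log(p))    # lambda_n^0(t) of Lemma A.1

trials, hits = 5000, 0
for _ in range(trials):
    eps = sigma_n * rng.standard_normal(n)
    if np.max(np.abs(X.T @ eps)) <= lam:
        hits += 1

empirical = hits / trials
guarantee = 1 - 1 / ((1 + t) * np.sqrt(np.pi * np.log(p)) * p ** ((1 + t) ** 2 - 1))
print("empirical probability:", empirical)
print("lower bound of Lemma A.1:", guarantee)
```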

Proof of Theorem 1.1 and Theorem 1.2. Consider the following singular value decomposition: $X = U^\top D A$, where $U\in\mathbb{R}^{n\times n}$ is such that $UU^\top = \mathrm{Id}_n$, $D = \mathrm{Diag}(\rho_1,\dots,\rho_n)$ is a diagonal matrix whose entries $\rho_1 \ge \dots \ge \rho_n > 0$ are the singular values of $X$, and $A\in\mathbb{R}^{n\times p}$ is such that $AA^\top = \mathrm{Id}_n$. We recall that the only assumption on the design is that it has full rank, which yields $\rho_n > 0$. Let $\delta$ be the distortion of the kernel $\Gamma$ of the design. Denote by $\pi_\Gamma$ (resp. $\pi_{\Gamma^\perp}$) the $\ell_2$-projection onto $\Gamma$ (resp. $\Gamma^\perp$). Let $\gamma\in\mathbb{R}^p$; then $\gamma = \pi_\Gamma(\gamma) + \pi_{\Gamma^\perp}(\gamma)$, and an easy calculation shows that $\pi_{\Gamma^\perp}(\gamma) = A^\top A\gamma$. Let $s\in\{1,\dots,p\}$ and let $S\subseteq\{1,\dots,p\}$ be such that $|S| = s$. It holds $\|\gamma_S\|_{\ell_1} \le \sqrt{s}\,\|\gamma\|_{\ell_2}$ and
$$ \|\gamma\|_{\ell_2} \le \|\pi_\Gamma(\gamma)\|_{\ell_2} + \|\pi_{\Gamma^\perp}(\gamma)\|_{\ell_2} \le \frac{\delta}{\sqrt p}\,\|\pi_\Gamma(\gamma)\|_{\ell_1} + \|A^\top A\gamma\|_{\ell_2} \le \frac{\delta}{\sqrt p}\big(\|\gamma\|_{\ell_1} + \|A^\top A\gamma\|_{\ell_1}\big) + \|A\gamma\|_{\ell_2} \le \frac{\delta}{\sqrt p}\,\|\gamma\|_{\ell_1} + (1+\delta)\,\|A\gamma\|_{\ell_2} \le \frac{\delta}{\sqrt p}\,\|\gamma\|_{\ell_1} + \frac{1+\delta}{\rho_n}\,\|X\gamma\|_{\ell_2} \le \frac{\delta}{\sqrt p}\,\|\gamma\|_{\ell_1} + \frac{2\delta}{\rho_n}\,\|X\gamma\|_{\ell_2}, $$
using the triangle inequality, the bound $\|A^\top A\gamma\|_{\ell_1} \le \sqrt p\,\|A\gamma\|_{\ell_2}$, and the distortion of the kernel $\Gamma$. This proves Theorem 1.1. Multiplying by $\sqrt s$ and using $\|\gamma_S\|_{\ell_1} \le \sqrt s\,\|\gamma\|_{\ell_2}$, we eventually set $\kappa_0 = (\sqrt s/\sqrt p)\,\delta$ and $\Delta = 2\delta/\rho_n$, which gives Theorem 1.2. This ends the proof.

Proof of Theorem 2.1. We recall that $\lambda_n^0$ denotes an upper bound on the amplification of the noise; see (1). We begin with a standard result.

Lemma A.2. Let $h = \hat\beta_\ell - \beta^* \in \mathbb{R}^p$ and $\lambda_\ell \ge \lambda_n^0$. Then for all subsets $S\subseteq\{1,\dots,p\}$ it holds
(A.1)  $\dfrac{1}{2\lambda_\ell}\Big[\dfrac12\|Xh\|_{\ell_2}^2 + (\lambda_\ell - \lambda_n^0)\|h\|_{\ell_1}\Big] \le \|h_S\|_{\ell_1} + \|\beta^*_{S^c}\|_{\ell_1}.$

Proof. By optimality we have
$$ \tfrac12\|X\hat\beta_\ell - y\|_{\ell_2}^2 + \lambda_\ell\|\hat\beta_\ell\|_{\ell_1} \le \tfrac12\|X\beta^* - y\|_{\ell_2}^2 + \lambda_\ell\|\beta^*\|_{\ell_1}. $$
It yields
$$ \tfrac12\|Xh\|_{\ell_2}^2 \le \langle X^\top\varepsilon, h\rangle + \lambda_\ell\|\beta^*\|_{\ell_1} - \lambda_\ell\|\hat\beta_\ell\|_{\ell_1}. $$
Let $S\subseteq\{1,\dots,p\}$; we have
$$ \tfrac12\|Xh\|_{\ell_2}^2 + \lambda_\ell\|\hat\beta_{\ell,S^c}\|_{\ell_1} \le \lambda_\ell\big(\|\beta^*_S\|_{\ell_1} - \|\hat\beta_{\ell,S}\|_{\ell_1}\big) + \lambda_\ell\|\beta^*_{S^c}\|_{\ell_1} + \langle X^\top\varepsilon, h\rangle \le \lambda_\ell\|h_S\|_{\ell_1} + \lambda_\ell\|\beta^*_{S^c}\|_{\ell_1} + \lambda_n^0\|h\|_{\ell_1}, $$
using (1). Adding $\lambda_\ell\|\beta^*_{S^c}\|_{\ell_1}$ on both sides, it holds
$$ \tfrac12\|Xh\|_{\ell_2}^2 + (\lambda_\ell - \lambda_n^0)\|h_{S^c}\|_{\ell_1} \le (\lambda_\ell + \lambda_n^0)\|h_S\|_{\ell_1} + 2\lambda_\ell\|\beta^*_{S^c}\|_{\ell_1}. $$

Adding $(\lambda_\ell - \lambda_n^0)\|h_S\|_{\ell_1}$ on both sides, we conclude the proof.

Using (7) and (A.1), it follows that
(A.2)  $\dfrac{1}{2\lambda_\ell}\Big[\dfrac12\|Xh\|_{\ell_2}^2 + (\lambda_\ell-\lambda_n^0)\|h\|_{\ell_1}\Big] \le \Delta\sqrt{s}\,\|Xh\|_{\ell_2} + \kappa_0\|h\|_{\ell_1} + \|\beta^*_{S^c}\|_{\ell_1}.$
It yields
$$ \frac12\Big[\Big(1-\frac{\lambda_n^0}{\lambda_\ell}\Big) - 2\kappa_0\Big]\|h\|_{\ell_1} \le -\frac{1}{4\lambda_\ell}\|Xh\|_{\ell_2}^2 + \Delta\sqrt{s}\,\|Xh\|_{\ell_2} + \|\beta^*_{S^c}\|_{\ell_1} \le \lambda_\ell\,\Delta^2 s + \|\beta^*_{S^c}\|_{\ell_1}, $$
using the fact that the polynomial $x \mapsto -\frac{1}{4\lambda_\ell}x^2 + \Delta\sqrt{s}\,x$ is not greater than $\lambda_\ell\Delta^2 s$. This concludes the proof.

Proof of Theorem 2.3. We begin with a standard result.

Lemma A.3. Let $h = \hat\beta_d - \beta^* \in \mathbb{R}^p$ and $\lambda_d \ge \lambda_n^0$. Then for all subsets $S\subseteq\{1,\dots,p\}$ it holds
(A.3)  $\dfrac{1}{4\lambda_d}\Big[\|Xh\|_{\ell_2}^2 + (\lambda_d - \lambda_n^0)\|h\|_{\ell_1}\Big] \le \|h_S\|_{\ell_1} + \|\beta^*_{S^c}\|_{\ell_1}.$

Proof. Set $h = \hat\beta_d - \beta^*$. Recall that $\|X^\top\varepsilon\|_{\ell_\infty} \le \lambda_n^0$; it yields
$$ \|Xh\|_{\ell_2}^2 \le \|X^\top Xh\|_{\ell_\infty}\,\|h\|_{\ell_1} = \|X^\top(y - X\hat\beta_d) + X^\top(X\beta^* - y)\|_{\ell_\infty}\,\|h\|_{\ell_1} \le (\lambda_d + \lambda_n^0)\,\|h\|_{\ell_1}. $$
Hence we get
(A.4)  $\|Xh\|_{\ell_2}^2 \le (\lambda_d+\lambda_n^0)\,\|h_{S^c}\|_{\ell_1} + (\lambda_d+\lambda_n^0)\,\|h_S\|_{\ell_1}.$
Since $\beta^*$ is feasible, it yields $\|\hat\beta_d\|_{\ell_1} \le \|\beta^*\|_{\ell_1}$. Thus
$$ \|\hat\beta_{d,S^c}\|_{\ell_1} \le \big(\|\beta^*_S\|_{\ell_1} - \|\hat\beta_{d,S}\|_{\ell_1}\big) + \|\beta^*_{S^c}\|_{\ell_1} \le \|h_S\|_{\ell_1} + \|\beta^*_{S^c}\|_{\ell_1}. $$
Since $\|h_{S^c}\|_{\ell_1} \le \|\hat\beta_{d,S^c}\|_{\ell_1} + \|\beta^*_{S^c}\|_{\ell_1}$, it yields
(A.5)  $\|h_{S^c}\|_{\ell_1} \le \|h_S\|_{\ell_1} + 2\,\|\beta^*_{S^c}\|_{\ell_1}.$
Combining (A.4) + $2\lambda_d\cdot$(A.5), we get
$$ \|Xh\|_{\ell_2}^2 + (\lambda_d-\lambda_n^0)\|h_{S^c}\|_{\ell_1} \le (3\lambda_d+\lambda_n^0)\|h_S\|_{\ell_1} + 4\lambda_d\|\beta^*_{S^c}\|_{\ell_1}. $$
Adding $(\lambda_d-\lambda_n^0)\|h_S\|_{\ell_1}$ on both sides, we conclude the proof.

Using (7) and (A.3), it follows that
(A.6)  $\dfrac{1}{4\lambda_d}\Big[\|Xh\|_{\ell_2}^2 + (\lambda_d-\lambda_n^0)\|h\|_{\ell_1}\Big] \le \Delta\sqrt{s}\,\|Xh\|_{\ell_2} + \kappa_0\|h\|_{\ell_1} + \|\beta^*_{S^c}\|_{\ell_1}.$
It yields
$$ \frac14\Big[\Big(1-\frac{\lambda_n^0}{\lambda_d}\Big) - 4\kappa_0\Big]\|h\|_{\ell_1} \le -\frac{1}{4\lambda_d}\|Xh\|_{\ell_2}^2 + \Delta\sqrt{s}\,\|Xh\|_{\ell_2} + \|\beta^*_{S^c}\|_{\ell_1} \le \lambda_d\,\Delta^2 s + \|\beta^*_{S^c}\|_{\ell_1}, $$
using the fact that the polynomial $x\mapsto -\frac{1}{4\lambda_d}x^2 + \Delta\sqrt{s}\,x$ is not greater than $\lambda_d\Delta^2 s$. This concludes the proof.

Proof of Theorem 2.2 and Theorem 2.4. Using (A.2), we know that
$$ \frac{1}{2\lambda_\ell}\Big[\frac12\|Xh\|_{\ell_2}^2 + (\lambda_\ell-\lambda_n^0)\|h\|_{\ell_1}\Big] \le \Delta\sqrt{s}\,\|Xh\|_{\ell_2} + \kappa_0\|h\|_{\ell_1} + \|\beta^*_{S^c}\|_{\ell_1}. $$
Since (10) ensures that the $\|h\|_{\ell_1}$ terms can be dropped, it follows that
$$ \|Xh\|_{\ell_2}^2 \le 4\lambda_\ell\,\Delta\sqrt{s}\,\|Xh\|_{\ell_2} + 4\lambda_\ell\,\|\beta^*_{S^c}\|_{\ell_1}. $$
This latter is of the form $x^2 \le bx + c$, which implies that $x \le b + c/b$. Hence
$$ \|Xh\|_{\ell_2} \le 4\lambda_\ell\,\Delta\sqrt{s} + \frac{\|\beta^*_{S^c}\|_{\ell_1}}{\Delta\sqrt{s}}. $$
The same analysis (starting from (A.6)) holds for Theorem 2.4.

Proof of Proposition 3.1. One can check that RE$(s,c_0)$ implies UDP$\big(s, \frac{1}{1+c_0}, \frac{1}{\kappa(s,c_0)}\big)$ and that Compatibility$(s,c_0)$ implies UDP$\big(s, \frac{1}{1+c_0}, \frac{1}{\phi(s,c_0)}\big)$.

Assume that $X$ satisfies RIP$(\theta_{5s})$. Let $\gamma\in\mathbb{R}^p$, $s\in\{1,\dots,s_0\}$ and $T_0\subseteq\{1,\dots,p\}$ be such that $|T_0| = s$. Choose a pair $(\kappa_0,\Delta)$ as in (23). If $\|\gamma_{T_0}\|_{\ell_1} \le \kappa_0\|\gamma\|_{\ell_1}$, then $\|\gamma_{T_0}\|_{\ell_1} \le \Delta\sqrt{s}\,\|X\gamma\|_{\ell_2} + \kappa_0\|\gamma\|_{\ell_1}$. Suppose that $\|\gamma_{T_0}\|_{\ell_1} > \kappa_0\|\gamma\|_{\ell_1}$; then
(A.7)  $\|\gamma_{T_0^c}\|_{\ell_1} < \dfrac{1-\kappa_0}{\kappa_0}\,\|\gamma_{T_0}\|_{\ell_1}.$
Denote by $T_1$ the set of the indices of the $4s$ largest coefficients (in absolute value) in $T_0^c$, by $T_2$ the set of the indices of the $4s$ largest coefficients in $(T_0\cup T_1)^c$, etc. Hence we decompose $T_0^c$ into disjoint sets $T_0^c = T_1\cup T_2\cup\dots\cup T_l$. Using (A.7), it yields
(A.8)  $\displaystyle\sum_{i\ge 2}\|\gamma_{T_i}\|_{\ell_2} \le (4s)^{-1/2}\sum_{i\ge 1}\|\gamma_{T_i}\|_{\ell_1} = (4s)^{-1/2}\|\gamma_{T_0^c}\|_{\ell_1} \le \dfrac{1-\kappa_0}{2\kappa_0\sqrt{s}}\,\|\gamma_{T_0}\|_{\ell_1}.$
Using RIP$(\theta_{5s})$ and (A.8), it follows that
$$ \|X\gamma\|_{\ell_2} \ge \|X(\gamma_{T_0\cup T_1})\|_{\ell_2} - \sum_{i\ge 2}\|X(\gamma_{T_i})\|_{\ell_2} \ge \sqrt{1-\theta_{5s}}\,\|\gamma_{T_0\cup T_1}\|_{\ell_2} - \sqrt{1+\theta_{5s}}\sum_{i\ge 2}\|\gamma_{T_i}\|_{\ell_2} \ge \Big[\sqrt{1-\theta_{5s}} - \frac{1-\kappa_0}{2\kappa_0}\sqrt{1+\theta_{5s}}\Big]\frac{\|\gamma_{T_0}\|_{\ell_1}}{\sqrt s}. $$
The lower bound on $\kappa_0$ in (23) shows that the right-hand side is positive. Observe that we took $\Delta$ such that this latter is exactly $\|\gamma_{T_0}\|_{\ell_1}/(\Delta\sqrt s)$. Eventually we get $\|\gamma_{T_0}\|_{\ell_1} \le \Delta\sqrt{s}\,\|X\gamma\|_{\ell_2} \le \Delta\sqrt{s}\,\|X\gamma\|_{\ell_2} + \kappa_0\|\gamma\|_{\ell_1}$. This ends the proofs.

Acknowledgments. The author would like to thank Jean-Marc Azaïs and Franck Barthe for their support. The author would also like to thank the anonymous reviewers for their valuable comments and suggestions.

References

[BLPR11] K. Bertin, E. Le Pennec and V. Rivoirard, Adaptive Dantzig density estimation, Ann. Inst. Henri Poincaré Probab. Stat. 47 (2011), no. 1, 43–74.
[BRT09] P.J. Bickel, Y. Ritov and A.B. Tsybakov, Simultaneous analysis of lasso and Dantzig selector, Ann. Statist. 37 (2009), no. 4, 1705–1732.
[CDD09] A. Cohen, W. Dahmen and R. DeVore, Compressed sensing and best k-term approximation, J. Amer. Math. Soc. 22 (2009), no. 1, 211–231.
[CP09] E.J. Candès and Y. Plan, Near-ideal model selection by l1 minimization, Ann. Statist. 37 (2009), no. 5A, 2145–2177.
[CRT06] E.J. Candès, J.K. Romberg and T. Tao, Stable signal recovery from incomplete and inaccurate measurements, Comm. Pure Appl. Math. 59 (2006), no. 8, 1207–1223.
[CT07] E.J. Candès and T. Tao, The Dantzig selector: statistical estimation when p is much larger than n, Ann. Statist. 35 (2007), no. 6, 2313–2351.
[DET06] D.L. Donoho, M. Elad and V.N. Temlyakov, Stable recovery of sparse overcomplete representations in the presence of noise, IEEE Trans. Inform. Theory 52 (2006), no. 1, 6–18.
[Don06] D.L. Donoho, Compressed sensing, IEEE Trans. Inform. Theory 52 (2006), no. 4, 1289–1306.
[GLR08] V. Guruswami, J.R. Lee and A. Razborov, Almost Euclidean subspaces of l1^N via expander codes, Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2008, pp. 353–362.
[HTF09] T. Hastie, R. Tibshirani and J. Friedman, High-dimensional problems: p >> N, in The Elements of Statistical Learning, Springer, 2009.
[Ind07] P. Indyk, Uncertainty principles, extractors, and explicit embeddings of l2 into l1, Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing (STOC), 2007, pp. 615–620.
[IS10] P. Indyk and S. Szarek, Almost-Euclidean subspaces of l1^N via tensor products: a simple approach to randomness reduction, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM), 2010, pp. 632–641.
[JN10] A. Juditsky and A. Nemirovski, Accuracy guarantees for l1-recovery.
[Kas77] B. Kashin, The widths of certain finite-dimensional sets and classes of smooth functions, Izv. Akad. Nauk SSSR Ser. Mat. 41 (1977), no. 2, 334–351, 478.
[KT07] B. Kashin and V.N. Temlyakov, A remark on the problem of compressed sensing, Mat. Zametki 82 (2007), no. 6, 829–837.
[Sid68] Z. Šidák, On multivariate normal probabilities of rectangles: their dependence on correlations, Ann. Math. Statist. 39 (1968), 1425–1434.
[Tib96] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B 58 (1996), no. 1, 267–288.
[vdGB09] S.A. van de Geer and P. Bühlmann, On the conditions used to prove oracle results for the Lasso, Electron. J. Stat. 3 (2009), 1360–1392.
[ZY06] P. Zhao and B. Yu, On model selection consistency of Lasso, J. Mach. Learn. Res. 7 (2006), 2541–2563.

Yohann de Castro is with the Département de Mathématiques, Université Paris-Sud, Faculté des Sciences d'Orsay, 91405 Orsay, France. This article was written mostly during his PhD at the Institut de Mathématiques de Toulouse.
E-mail address: yohann.decastro@math.u-psud.fr