THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES. By Sara van de Geer and Johannes Lederer. ETH Zürich

Submitted to the Annals of Applied Statistics. arXiv: math.PR/

THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES

By Sara van de Geer and Johannes Lederer, ETH Zürich

Research supported by SNF 0PA. AMS 2000 subject classifications: Primary 62J05; secondary 62J99. Keywords and phrases: compatibility, correlation, entropy, high-dimensional model, Lasso.

We study high-dimensional linear models and the $\ell_1$-penalized least squares estimator, also known as the Lasso estimator. In the literature, oracle inequalities have been derived under restricted eigenvalue or compatibility conditions. In this paper, we complement this with entropy conditions which allow one to improve the dual norm bound, and demonstrate how this leads to new oracle inequalities. The new oracle inequalities show that a smaller choice for the tuning parameter and a trade-off between $\ell_1$-norms and small compatibility constants are possible. This implies, in particular for correlated design, improved bounds for the prediction error of the Lasso estimator as compared to the methods based on restricted eigenvalue or compatibility conditions only.

1. Introduction. We derive oracle inequalities for the Lasso estimator for various designs. Results in the literature are generally based on restricted eigenvalue or compatibility conditions (see Section 3 for definitions). We refer to [2], [4], [5], [6], [8], [10], [11]; see also [3] and the references therein. In a sense, compatibility or restricted eigenvalue conditions and the so-called dual norm bound we describe below belong together. In contrast, if compatibility constants or restricted eigenvalues are very small, the design may have high correlations, and then the dual norm bound is too rough. In this paper, we discuss an approach that joins both situations. The work is a follow-up of [12]. It combines results of the latter with the parallel developments in the area based on the dual norm bound.

We consider an input space $\mathcal{X}$ and $p$ feature mappings $\psi_j : \mathcal{X} \to \mathbb{R}$, $j = 1, \dots, p$. We let $(x_1, \dots, x_n)^T \in \mathcal{X}^n$ be a given input vector.

We let $Y := (Y_1, \dots, Y_n)^T \in \mathbb{R}^n$ be an output vector, and consider the linear model
$$Y = \sum_{j=1}^p \psi_j \beta_j^0 + \varepsilon,$$
with $\varepsilon \in \mathbb{R}^n$ a noise vector and $\beta^0 \in \mathbb{R}^p$ a vector of unknown coefficients. Here, with some abuse of notation, $\psi_j$ denotes the vector $\psi_j = (\psi_j(x_1), \dots, \psi_j(x_n))^T$. The design matrix is $X := (\psi_1, \dots, \psi_p)$ and the Gram matrix is $\hat\Sigma := X^T X / n$. Throughout, we assume that $\sum_{i=1}^n \psi_j^2(x_i) \le n$ for all $j$. We write a linear function with coefficients $\beta$ as $f_\beta := \sum_{j=1}^p \psi_j \beta_j$, $\beta \in \mathbb{R}^p$. The Lasso estimator is
$$\hat\beta := \arg\min_\beta \left\{ \|Y - f_\beta\|_2^2 / n + \lambda \|\beta\|_1 \right\}.$$
We denote the estimator of the regression function $f^0 := f_{\beta^0}$ by $\hat f := f_{\hat\beta}$.

Oracle results using compatibility or restricted eigenvalue conditions are based on the dual norm bound
$$\sup_{\|\beta\|_1 = 1} |\varepsilon^T f_\beta| / n = \max_{1 \le j \le p} |\varepsilon^T \psi_j| / n.$$
Let us define
$$\|f_\beta\|_n^2 := \sum_{i=1}^n f_\beta^2(x_i) / n = \beta^T \hat\Sigma \beta.$$
The point we make in this paper is that the dual norm bound does not take into account possible small values of $\|f_{\hat\beta} - f_{\beta^0}\|_n$. Our results are based on bounds for
$$\sup_{\|\beta\|_1 \le 1, \ \|f_\beta\|_n \le R} |\varepsilon^T f_\beta| / n$$
as a function of $R > 0$. We then apply these to $\hat\beta - \beta^0$ (or with $\beta^0$ here replaced by a sparse approximation). We use an improvement of the dual norm bound, and show in Theorem 4.1 the consequences. The main observation here is that with highly correlated design, one can generally take the tuning parameter $\lambda$ of much smaller order than the usual $\sqrt{\log p / n}$. Moreover, small compatibility constants may be traded off against the $\ell_1$-norm of the coefficients of an oracle.
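For concreteness, the following is a minimal numerical sketch of this setup (not part of the original analysis; the toy data, variable names, and the use of scikit-learn are ours). It assumes scikit-learn is available and uses the fact that its Lasso minimizes $\|Y - X\beta\|_2^2/(2n) + \alpha\|\beta\|_1$, so $\alpha = \lambda/2$ matches the objective above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy instance of the setup in Section 1 (illustrative only).
rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).sum(axis=0) / n)        # enforce ||psi_j||_n <= 1 (here = 1)
beta0 = np.zeros(p); beta0[:5] = 1.0          # sparse truth
eps = rng.standard_normal(n)
Y = X @ beta0 + eps

Sigma_hat = X.T @ X / n                        # Gram matrix
lam = 4 * np.sqrt(np.log(p) / n)               # tuning parameter of the usual order

# sklearn's Lasso minimizes ||Y - X b||_2^2/(2n) + alpha*||b||_1,
# so alpha = lam/2 corresponds to ||Y - f_beta||_n^2 + lam*||beta||_1.
beta_hat = Lasso(alpha=lam / 2, fit_intercept=False, max_iter=50_000).fit(X, Y).coef_

pred_norm2 = beta_hat @ Sigma_hat @ beta_hat   # ||f_{beta_hat}||_n^2 = beta^T Sigma_hat beta
dual_bound = np.max(np.abs(X.T @ eps) / n)     # max_j |eps^T psi_j| / n (dual norm bound)
print(pred_norm2, dual_bound)
```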

2. Organization of the paper. In Section 3, we present our notation and the definitions of compatibility constants and restricted eigenvalues. Section 4 contains the main result, based on a pre-assumed improvement of the dual norm bound. In Section 5, we present a result from empirical process theory, which shows that the improvement of the dual norm bound used in Section 4 holds under entropy conditions on $\mathcal{F} := \{f_\beta : \|\beta\|_1 = 1\}$. In Section 6, we first give a geometrical interpretation of the compatibility constant and discuss the relation with eigenvalues. The next question to address is then how to read off the entropy conditions directly from the design. We show that a Gram matrix with strongly decreasing eigenvalues leads to a small entropy of $\mathcal{F}$. Alternatively, we derive an entropy bound for $\mathcal{F}$ based on the covering number of the design $\{\psi_j\}$, a result much in the spirit of [7]. We moreover link these covering numbers with the correlation structure of the design. Section 7 concludes and Section 8 contains the proofs.

3. Notation and definitions.

3.1. The compatibility constant. Let $S \subseteq \{1, \dots, p\}$ be an index set with cardinality $s$. We define for all $\beta \in \mathbb{R}^p$,
$$\beta_{S,j} := \beta_j \, 1\{j \in S\}, \quad j = 1, \dots, p, \qquad \beta_{S^c} := \beta - \beta_S.$$
Below, we present for constants $L > 0$ the compatibility constant $\phi(L, S)$ introduced in [10]. For normalized $\psi_j$ (i.e., $\|\psi_j\|_n = 1$ for all $j$), one can view $1 - \phi^2(1, S)/2$ as an $\ell_1$-version of the canonical correlation between the linear space spanned by the variables in $S$ on the one hand, and the linear space of the variables in $S^c$ on the other hand. Instead of all linear combinations with normalized $\ell_2$-norm, we now consider all linear combinations with normalized $\ell_1$-norm of the coefficients. For a geometric interpretation, we refer to Section 6.

Definition. The compatibility constant is
$$\phi^2(L, S) := \min\left\{ s \|f_{\beta_S} - f_{\beta_{S^c}}\|_n^2 : \|\beta_S\|_1 = 1, \ \|\beta_{S^c}\|_1 \le L \right\}.$$
The compatibility constant is closely related to (and never smaller than) the restricted eigenvalue as defined in [2], which is
$$RE^2(L, S) = \min\left\{ \frac{\|f_{\beta_S} - f_{\beta_{S^c}}\|_n^2}{\|\beta_S\|_2^2} : \|\beta_{S^c}\|_1 \le L \|\beta_S\|_1 \right\}.$$
See also [8], and see [13] for a discussion of the relation between restricted eigenvalues and compatibility.
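The minimization defining $\phi^2(L, S)$ is non-convex, so exact computation is not immediate. The following sketch (ours, not part of the paper; the helper name and the random-search strategy are assumptions for illustration) produces a crude Monte Carlo upper bound on $\phi^2(L, S)$ for a given design matrix; it assumes $S$ is a proper subset of $\{1,\dots,p\}$.

```python
import numpy as np

def compat_upper_bound(X, S, L=6.0, n_draws=20000, seed=1):
    """Crude Monte Carlo upper bound on phi^2(L, S): sample gamma with
    ||gamma_S||_1 = 1 and ||gamma_{S^c}||_1 <= L and minimize s*||X gamma||_n^2
    over the draws.  The true constant is the infimum, so this only gives an
    upper bound (a sketch, not the paper's method)."""
    n, p = X.shape
    S = np.asarray(S)
    Sc = np.setdiff1d(np.arange(p), S)
    s = len(S)
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_draws):
        g = np.zeros(p)
        w = rng.dirichlet(np.ones(s)) * rng.choice([-1, 1], size=s)
        g[S] = w                                     # ||gamma_S||_1 = 1
        r = L * rng.uniform()                        # radius in [0, L]
        v = rng.dirichlet(np.ones(len(Sc))) * rng.choice([-1, 1], size=len(Sc))
        g[Sc] = r * v                                # ||gamma_{S^c}||_1 <= L
        best = min(best, s * np.sum((X @ g) ** 2) / n)
    return best
```

For strongly correlated designs this upper bound is typically small already for moderate $L$, which is the regime where the refined analysis below is of interest.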

3.2. Projections. As the true $\beta^0$ is perhaps only approximately sparse, we will consider a sparse approximation. The projection of $f^0 := f_{\beta^0}$ on the space spanned by the variables in $S$ is
$$f^S := \arg\min_{f = f_{\beta_S}} \|f - f^0\|_n.$$
The coefficients of $f^S$ are denoted by $b^S$, i.e., $f^S = f_{b^S}$. Note that $f^S$ only has non-zero coefficients inside $S$, that is, $(b^S)_S = b^S$.

4. Main result. We let $\mathcal{T}_\alpha$ be the set
$$\mathcal{T}_\alpha := \left\{ \sup_\beta \frac{4 |\varepsilon^T f_\beta| / n}{\|f_\beta\|_n^{1-\alpha} \|\beta\|_1^\alpha} \le \lambda_0 \right\}.$$
Here, $0 \le \alpha \le 1$ and $\lambda_0 > 0$ are fixed constants. Note that on $\mathcal{T}_\alpha$,
$$\sup_{\|\beta\|_1 = 1, \ \|f_\beta\|_n \le R} |\varepsilon^T f_\beta| / n \le \lambda_0 R^{1-\alpha} / 4,$$
i.e., we have a refinement of the dual norm bound described in Section 1. Note that for fixed $\lambda_0$ and for $\alpha' < \alpha$, it holds that $\mathcal{T}_{\alpha'} \subseteq \mathcal{T}_\alpha$. This is because, by the triangle inequality,
$$\|f_\beta\|_n = \Big\|\sum_j \psi_j \beta_j\Big\|_n \le \sum_j \|\psi_j\|_n |\beta_j| \le \|\beta\|_1,$$
so that $\|f_\beta\|_n^{1-\alpha'} \|\beta\|_1^{\alpha'} \le \|f_\beta\|_n^{1-\alpha} \|\beta\|_1^{\alpha}$, i.e., the requirement defining $\mathcal{T}_{\alpha'}$ is the stronger one. We want to choose $\alpha$ preferably small, yet keep the probability of the set $\mathcal{T}_\alpha$ large. For $\alpha = 1$, one has
$$\mathcal{T}_1 = \left\{ \max_{1 \le j \le p} 4 |\varepsilon^T \psi_j| / n \le \lambda_0 \right\},$$
by the dual norm bound. Thus, e.g. when $\varepsilon \sim N(0, I)$, the probability $\mathbb{P}(\mathcal{T}_1)$ of $\mathcal{T}_1$ is large when $\lambda_0 \asymp \sqrt{\log p / n}$. We detail in Section 5 how one can lower bound $\mathbb{P}(\mathcal{T}_\alpha)$ for a proper value of $\alpha$ depending on the design $\{\psi_j\}$.
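The $\alpha = 1$ statement is easy to confirm by simulation. The following sketch (ours; the particular constant in $\lambda_0$ is an assumption made only for this illustration) estimates $\mathbb{P}(\mathcal{T}_1)$ for Gaussian errors with $\lambda_0$ a multiple of $\sqrt{\log p / n}$.

```python
import numpy as np

# Monte Carlo check that for alpha = 1 and eps ~ N(0, I), the set
# T_1 = { max_j 4|eps^T psi_j|/n <= lambda_0 } has large probability once
# lambda_0 is a multiple of sqrt(log p / n).
rng = np.random.default_rng(0)
n, p, reps = 200, 1000, 500
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).sum(axis=0) / n)          # ||psi_j||_n = 1
lam0 = 4 * np.sqrt(2 * np.log(2 * p) / n)       # an illustrative choice of lambda_0

hits = 0
for _ in range(reps):
    eps = rng.standard_normal(n)
    if 4 * np.max(np.abs(X.T @ eps)) / n <= lam0:
        hits += 1
print("estimated P(T_1):", hits / reps)
```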

Generally, the value for $\lambda_0$ will be of order $\sqrt{\log p / n}$, as in the case $\alpha = 1$, or $\lambda_0 \asymp \sqrt{\log n / n}$, or even $\lambda_0 \asymp 1/\sqrt{n}$. The choice of the tuning parameter $\lambda$ depends on $\lambda_0$. The following technical lemma will be used.

Lemma 4.1. Let $0 \le \alpha \le 1$ and let $a$, $b$ and $\lambda_0$ be positive numbers. Then
$$\lambda_0 a^{1-\alpha} b^\alpha \le \frac{a^2}{4} + \lambda b + \lambda_0^2 \left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)}.$$
Here, when $\alpha = 1$,
$$\left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)} := \begin{cases} \infty & \lambda < \lambda_0, \\ 1 & \lambda = \lambda_0, \\ 0 & \lambda > \lambda_0. \end{cases}$$

In the proof of the main result, Theorem 4.1, we invoke Lemma 4.1 to handle the noise part $\varepsilon^T f_\beta$ with $\beta = \hat\beta - \beta^0$ (or actually with $\beta^0$ replaced here by a sparse approximation). On $\mathcal{T}_\alpha$, it holds that
$$4 |\varepsilon^T f_\beta| / n \le \frac{\|f_\beta\|_n^2}{4} + \lambda \|\beta\|_1 + \lambda_0^2 \left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)},$$
uniformly in $\beta \in \mathbb{R}^p$. In the right hand side of this inequality, the first term $\|f_\beta\|_n^2 / 4$ can be incorporated in the risk, and the second term $\lambda \|\beta\|_1$ will be overruled by the penalty. Finally, the third term $\lambda_0^2 (\lambda_0/\lambda)^{2\alpha/(1-\alpha)}$ governs the choice of the tuning parameter $\lambda$.
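A quick numerical spot-check of the inequality displayed in Lemma 4.1 on random inputs (an illustration only; it does not replace the proof in Section 8, and the ranges of the draws are arbitrary):

```python
import numpy as np

# Spot-check of Lemma 4.1: the slack rhs - lhs should stay nonnegative.
rng = np.random.default_rng(0)
worst = np.inf
for _ in range(100_000):
    alpha = rng.uniform(0.0, 0.9)
    a, b, lam0, lam = rng.uniform(0.1, 10.0, size=4)
    lhs = lam0 * a ** (1 - alpha) * b ** alpha
    rhs = a ** 2 / 4 + lam * b + lam0 ** 2 * (lam0 / lam) ** (2 * alpha / (1 - alpha))
    worst = min(worst, rhs - lhs)
print("minimum slack over draws:", worst)   # remains nonnegative
```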

We now come to the main result. We formulate it for an arbitrary index set $S$, partitioned into sets $S_1$ and $S_2$ in an arbitrary way. We will elaborate on the choice of $S$ in Remarks 4.2 and 4.5. Corollaries 4.1 and 4.2 take for a given $S$ some special choices for the tuning parameter $\lambda$ and for the partition of $S$ into $S_1$ and $S_2$. Recall that $f^S$ is the projection of $f^0 = f_{\beta^0}$ and that $b^S$ are the coefficients of $f^S$.

Theorem 4.1. Let $S$ be an arbitrary index set, partitioned into two sets $S_1$ and $S_2$, i.e. $S = S_1 \cup S_2$, $S_1 \cap S_2 = \emptyset$. Let $s_1$ be the cardinality of $S_1$. Let $\mathcal{T}_\alpha$ be the set
$$\mathcal{T}_\alpha := \left\{ \sup_\beta \frac{4 |\varepsilon^T f_\beta| / n}{\|f_\beta\|_n^{1-\alpha} \|\beta\|_1^\alpha} \le \lambda_0 \right\}.$$
Then on $\mathcal{T}_\alpha$,
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 \le \frac{56 \lambda^2 s_1}{\phi^2(6, S_1)} + \frac{28}{3}\lambda \|(b^S)_{S_2}\|_1 + \frac{7}{6}\lambda_0^2 \left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)} + 7 \|f^S - f^0\|_n^2.$$

Remark 4.1. We did not attempt to optimize the constants we provide in Theorem 4.1.

Remark 4.2. Given a value of the tuning parameter $\lambda$, we can now define (up to constants) the estimation error using the variables in $S$ as
$$\mathcal{E}(S) := \min_{S_1 \subseteq S, \ S_2 = S \setminus S_1} \left\{ \frac{8 \lambda^2 s_1}{\phi^2(6, S_1)} + \lambda \|(b^S)_{S_2}\|_1 \right\}.$$
The oracle set $S_*$ is then the set which trades off estimation error and approximation error, i.e., the set $S$ that minimizes
$$\mathcal{E}(S) + \|f^S - f^0\|_n^2.$$
Note that $S_*$ depends on $\lambda$, say $S_* = S_*(\lambda)$. The best value for the tuning parameter $\lambda$ is then obtained by minimizing
$$\mathcal{E}(S_*(\lambda)) + \|f^{S_*(\lambda)} - f^0\|_n^2.$$

Remark 4.3. In practice, the tuning parameter $\lambda$ can be chosen by cross-validation. As this method tries to mimic minimization of the prediction error, it can be conjectured that one then arrives at rates at least as good as the ones we discuss here, choosing values of $\lambda$ depending on the design, the (unknown) error distribution, and the (unknown) sparsity. This is however not rigorously proven.

Remark 4.4. We have restricted ourselves to improvements of the dual norm bound of the form given by the sets $\mathcal{T}_\alpha$. The situation can be generalized by considering sets of the form
$$\left\{ \sup_\beta \frac{4 |\varepsilon^T f_\beta| / n}{G^{-1}\!\left(\|f_\beta\|_n / \|\beta\|_1\right) \, \|\beta\|_1} \le \lambda_0 \right\},$$
where $G$ is a given increasing convex function with $G(0) = 0$.

Corollary 4.1.
(a) If we take $S_2 = \emptyset$, we have $S_1 = S$ and $s_1 = |S| =: s$. This is a good choice when the compatibility constants are large for all subsets of $S$. With the choice
$$\lambda \asymp \lambda_0 \left(\frac{\phi^2(6, S)}{s}\right)^{(1-\alpha)/2},$$
we get on $\mathcal{T}_\alpha$,
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 = O\!\left( \lambda_0^2 \left(\frac{s}{\phi^2(6, S)}\right)^\alpha \right) + O\!\left( \|f^S - f^0\|_n^2 \right).$$
Recall that the dual norm bound has $\alpha = 1$. With $\lambda_0 \asymp \sqrt{\log p / n}$ we then arrive at the usual oracle inequality as provided by, among others, [2], [4], [5], [6], [8], [10], [11]. When $\alpha < 1$, the compatibility constant may be very small, as the design is highly correlated. The effect is however somewhat tempered by the power $\alpha$ in the bound.
(b) More generally, let
$$\lambda \asymp \lambda_0 \left(\frac{\phi^2(6, S_1)}{s_1}\right)^{(1-\alpha)/2}.$$
Then on $\mathcal{T}_\alpha$,
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 = O\!\left( \lambda_0^2 \left(\frac{s_1}{\phi^2(6, S_1)}\right)^\alpha \right) + O\!\left( \lambda_0 \left(\frac{\phi^2(6, S_1)}{s_1}\right)^{(1-\alpha)/2} \|(b^S)_{S_2}\|_1 \right) + O\!\left( \|f^S - f^0\|_n^2 \right).$$

Corollary 4.2.
(a) With the choice $S_1 = \emptyset$, the first term in the bound of Theorem 4.1 vanishes, so the result does not involve the compatibility constant. This may be desirable when the design is highly correlated. The result then corresponds to what is sometimes called slow rates, although we will see that when $\alpha < 1$, the rates can still be much faster than $1/\sqrt{n}$. When $\alpha = 1$, we must take $\lambda > \lambda_0$ due to the term $(\lambda_0/\lambda)^{2\alpha/(1-\alpha)}$. When $\alpha < 1$, we choose
$$\lambda \asymp \left( \lambda_0^2 \|b^S\|_1^{-(1-\alpha)} \right)^{1/(1+\alpha)}.$$
We get on $\mathcal{T}_\alpha$,
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 = O\!\left( \left(\lambda_0 \|b^S\|_1^\alpha\right)^{2/(1+\alpha)} \right) + O\!\left( \|f^S - f^0\|_n^2 \right).$$
(b) More generally, let
$$\lambda \asymp \left( \lambda_0^2 \|(b^S)_{S_2}\|_1^{-(1-\alpha)} \right)^{1/(1+\alpha)}.$$

Then on $\mathcal{T}_\alpha$,
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 = O\!\left( \frac{\lambda^2 s_1}{\phi^2(6, S_1)} \right) + O\!\left( \left(\lambda_0 \|(b^S)_{S_2}\|_1^\alpha\right)^{2/(1+\alpha)} \right) + O\!\left( \|f^S - f^0\|_n^2 \right).$$

Remark 4.5. Note that by taking $S_1$ smaller, the value of $s_1/\phi^2(6, S_1)$ will not increase, but on the other hand, the value of $\|(b^S)_{S_2}\|_1$ will become larger. Thus, the best rate will emerge if we trade off these two effects. Indeed, suppose that for some $S_1$,
$$\lambda_0 \left(\frac{s_1}{\phi^2(6, S_1)}\right)^{(1+\alpha)/2} \asymp \|(b^S)_{S_2}\|_1.$$
Then on $\mathcal{T}_\alpha$, for
$$\lambda \asymp \lambda_0 \left(\frac{\phi^2(6, S_1)}{s_1}\right)^{(1-\alpha)/2} \asymp \left( \lambda_0^2 \|(b^S)_{S_2}\|_1^{-(1-\alpha)} \right)^{1/(1+\alpha)},$$
we have
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 = O\!\left( \lambda_0^2 \left(\frac{s_1}{\phi^2(6, S_1)}\right)^\alpha \right) + O\!\left( \|f^S - f^0\|_n^2 \right) = O\!\left( \left(\lambda_0 \|(b^S)_{S_2}\|_1^\alpha\right)^{2/(1+\alpha)} \right) + O\!\left( \|f^S - f^0\|_n^2 \right).$$
In particular for the case $\alpha < 1$, it is however not clear when such a trade-off is possible. It may well be that for any $S_1$, the term $s_1/\phi^2(6, S_1)$ either heavily dominates or is heavily dominated by the $\ell_1$-part $\|(b^S)_{S_2}\|_1$. See Section 6 for a further discussion.

5. Improving the dual norm bound. In this section, we provide probability bounds for the set $\mathcal{T}_\alpha$ introduced in Section 4. The results follow from empirical process theory, see e.g. [14] and [15]. Theorem 5.1 is taken from [3].

Definition. Let $\mathcal{F}$ be a class of real-valued functions on $\mathcal{X}$. Endow $\mathcal{F}$ with the norm $\|\cdot\|_n$. Let $\delta > 0$ be some radius. A $\delta$-packing set is a set of functions in $\mathcal{F}$ that are each at least $\delta$ apart. A $\delta$-covering set is a set of functions $\{\phi_1, \dots, \phi_N\}$ such that
$$\sup_{f \in \mathcal{F}} \min_{k = 1, \dots, N} \|f - \phi_k\|_n \le \delta.$$
The $\delta$-covering number $N(\delta, \mathcal{F}, \|\cdot\|_n)$ of $\mathcal{F}$ is the minimum size of a $\delta$-covering set. The entropy of $\mathcal{F}$ is $H(\cdot, \mathcal{F}, \|\cdot\|_n) = \log N(\cdot, \mathcal{F}, \|\cdot\|_n)$. It is easy to see that $N(\delta, \mathcal{F}, \|\cdot\|_n)$ can be bounded by the size of a maximal $\delta$-packing set.

We assume the errors are sub-Gaussian, that is, for some positive constants $K$ and $\sigma_0$,
$$(5.1) \qquad K^2 \left( \mathbb{E} \exp[\varepsilon_i^2 / K^2] - 1 \right) \le \sigma_0^2, \quad i = 1, \dots, n.$$
The following theorem is Corollary 14.6 in [3]. It is in the spirit of a weighted concentration inequality, and uses the notation $x_+ := \max\{x, 0\}$.

Theorem 5.1. Assume (5.1). Let $\mathcal{F}$ be a class of functions with $\|f\|_n \le 1$ for all $f \in \mathcal{F}$, and with, for some $0 < \alpha < 1$ and some constant $A$,
$$\log\left(1 + N(\delta, \mathcal{F}, \|\cdot\|_n)\right) \le \left(\frac{A}{\delta}\right)^{2\alpha}, \quad 0 < \delta \le 1.$$
Define
$$B := \exp\left[ \frac{A^{2\alpha} \alpha^{2\alpha}}{1 - \alpha} \right] - 1 \quad \text{and} \quad K_0^2 := 3\left(5 K^2 + \sigma_0^2\right).$$
It holds that
$$\mathbb{E} \exp\left[ \sup_{f \in \mathcal{F}} \frac{\left( |\varepsilon^T f| / \sqrt{n} - K_0 A^\alpha \|f\|_n^{1-\alpha} \right)_+^2}{K_0^2 \|f\|_n^{2(1-\alpha)}} \right] \le 1 + 2/B.$$

Corollary 5.1. Assume the conditions of Theorem 5.1. Chebyshev's inequality shows that for all $t > 0$,
$$\mathbb{P}\left( \exists f \in \mathcal{F} : \ |\varepsilon^T f| / \sqrt{n} \ge K_0 (A^\alpha + t) \|f\|_n^{1-\alpha} \right) \le \exp[-t^2] (1 + 2/B).$$
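Condition (5.1) is easy to check for a given error law. A small sketch (ours) for standard normal errors, where one may take $K = 3$ (the specific constants are our choices, not the paper's):

```python
import numpy as np

# Monte Carlo check of the sub-Gaussian condition (5.1) for eps ~ N(0,1).
# For K = 3: E exp(eps^2/K^2) = (1 - 2/9)^(-1/2) ~ 1.134, so
# K^2 (E exp(eps^2/K^2) - 1) ~ 1.21, and (5.1) holds e.g. with sigma_0^2 = 1.25.
rng = np.random.default_rng(0)
K = 3.0
eps = rng.standard_normal(1_000_000)
lhs = K ** 2 * (np.mean(np.exp(eps ** 2 / K ** 2)) - 1.0)
print(lhs)   # close to 1.21
```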

Corollary 5.2. Consider now linear functions
$$f_\beta := \sum_{j=1}^p \psi_j \beta_j, \quad \beta \in \mathbb{R}^p,$$
where $\|\psi_j\|_n \le 1$. Then $\|f_\beta\|_n \le \|\beta\|_1$. Hence, $\{f_\beta / \|\beta\|_1 : \beta \in \mathbb{R}^p\}$ is a class of functions with $\|\cdot\|_n$-norm bounded by $1$. Suppose now
$$\log\left(1 + N(\delta, \{f_\beta : \|\beta\|_1 = 1\}, \|\cdot\|_n)\right) \le \left(\frac{A}{\delta}\right)^{2\alpha}, \quad 0 < \delta \le 1.$$
Under the sub-Gaussianity condition (5.1), we then have for all $t > 0$ and for
$$\lambda_0 = \frac{4 K_0 (A^\alpha + t)}{\sqrt{n}},$$
the lower bound
$$\mathbb{P}(\mathcal{T}_\alpha) \ge 1 - \exp[-t^2](1 + 2/B).$$

6. Compatibility, eigenvalues, entropy and correlations. We study the set
$$\mathcal{F} := \{f_\beta : \|\beta\|_1 = 1\}.$$
It is considered as a subset of $L_2(Q_n)$, where $Q_n := \sum_{i=1}^n \delta_{x_i} / n$. The $L_2(Q_n)$-norm is $\|\cdot\|_n$.

6.1. Geometric interpretation of the compatibility constant. We first look at the minimal $\ell_1$-eigenvalue
$$\Lambda_{\min,1}^2(S) := \min\left\{ s \beta_S^T \hat\Sigma \beta_S : \|\beta_S\|_1 = 1 \right\}$$
as introduced in [3]. Note that $\Lambda_{\min,1}(S)/\sqrt{s}$ is the minimal distance between any point $f_{\beta_S}$ with $\|\beta_S\|_1 = 1$ and the point $\{0\}$. We tacitly assume that the $\{\psi_j\}_{j \in S}$ are linearly independent. The set $\{f_{\beta_S} : \|\beta_S\|_1 = 1\}$ is then an $\ell_1$-version of a sphere: it is the boundary of the convex hull of $\{\psi_j\}_{j \in S} \cup \{-\psi_j\}_{j \in S}$ in $s$-dimensional space with $\{0\}$ in its center. It is a parallelogram when $s = 2$ (see Figure 1), and then a rectangle when the $\psi_j$, $j \in S$, have equal length.
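The two eigenvalue-type quantities are easy to compare numerically. The sketch below (ours; the random search over the $\ell_1$-sphere is an assumption made for illustration) returns a crude upper bound on $\Lambda_{\min,1}^2(S)$ together with the exact $\Lambda_{\min}^2(S)$.

```python
import numpy as np

def l1_and_l2_min_eigenvalues(X, S, n_draws=50000, seed=0):
    """Compare Lambda_min,1^2(S) and Lambda_min^2(S) on a toy design (a sketch).
    The l2 quantity is exact (smallest eigenvalue of Sigma_hat_S); the l1 quantity
    is a crude random-search upper bound over the l1-sphere ||beta_S||_1 = 1."""
    n = X.shape[0]
    XS = X[:, S]
    s = XS.shape[1]
    Sigma_S = XS.T @ XS / n
    lam2_min = np.linalg.eigvalsh(Sigma_S)[0]          # Lambda_min^2(S)
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_draws):
        b = rng.dirichlet(np.ones(s)) * rng.choice([-1, 1], size=s)   # ||b||_1 = 1
        best = min(best, s * b @ Sigma_S @ b)          # upper bound on Lambda_min,1^2(S)
    return best, lam2_min
```

In accordance with the sandwich displayed below, the first returned value (even though only an upper bound) never falls below the second.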

[Figure 1. Left panel: the set $A = \{f_{\beta_S} : \|\beta_S\|_1 = 1\}$. Right panel: $\ell_1$- and $\ell_2$-eigenvalues.]

Let $\hat\Sigma_S$ be the Gram matrix of the variables in $S$ and let $\Lambda_{\min}(S)$ be the minimal ($\ell_2$-)eigenvalue of the matrix $\hat\Sigma_S$:
$$\Lambda_{\min}^2(S) := \min\left\{ \beta_S^T \hat\Sigma \beta_S : \|\beta_S\|_2 = 1 \right\}.$$
Then
$$\Lambda_{\min,1}^2(S) \ge \Lambda_{\min}^2(S) \ge \Lambda_{\min,1}^2(S) / s.$$
One can construct examples where $\Lambda_{\min}^2(S)$ is as small as $3/s$ ($s > 2$) and $\Lambda_{\min,1}^2(S)$ is at least $1/2$ (see [13]), that is, they can differ by the maximal amount $s$ in order of magnitude. See also Figure 1, which is to be understood as representing a case $s > 2$. Thus, minimal $\ell_1$-eigenvalues can be much larger than minimal ($\ell_2$-)eigenvalues.

The normalized compatibility constant $\phi(L, S)/\sqrt{s}$ is the minimal distance between the sets $A := \{f_{\beta_S} : \|\beta_S\|_1 = 1\}$ and $B := \{f_{\beta_{S^c}} : \|\beta_{S^c}\|_1 \le L\}$, that is,
$$\frac{\phi(L, S)}{\sqrt{s}} = \min\left\{ \|a - b\|_n : a \in A, \ b \in B \right\}.$$
See Figure 2 for an impression of the situation. Observe that $A$ is the boundary of the convex hull of $\{+\psi_j\}_{j \in S} \cup \{-\psi_j\}_{j \in S}$, and $B$ is the convex hull of $\{+\psi_j\}_{j \in S^c} \cup \{-\psi_j\}_{j \in S^c}$, including its interior, blown up with a factor $L$ (typically, the $\{\psi_j\}_{j \in S^c}$ form a linearly dependent system in $\mathbb{R}^n$). Furthermore, since $\{0\} \in B$,
$$\phi(L, S) \le \Lambda_{\min,1}(S).$$
This shows that when $\ell_1$-eigenvalues are small, the compatibility constant is necessarily also small. Small $\ell_2$-eigenvalues may have less of this effect.

[Figure 2. The compatibility constant $\phi(L, S)/\sqrt{s}$ as the distance between the sets $A$ and $B$.]

6.2. Eigenvalues and entropy. We now let
$$\hat\Sigma = E \Omega^2 E^T$$
be the spectral decomposition of the Gram matrix $\hat\Sigma$, with $E$ the matrix of eigenvectors ($E^T E = E E^T = I$) and $\Omega^2 = \mathrm{diag}(\omega_1^2, \dots, \omega_p^2)$ the matrix of ($\ell_2$-)eigenvalues. We assume they are in decreasing order: $\omega_1^2 \ge \cdots \ge \omega_p^2$.

Lemma 6.1. Suppose that for some strictly decreasing function $V$,
$$\omega_{j+1}^2 \le V(j), \quad j = 1, \dots, p.$$
Then for all $\delta > 0$,
$$H\!\left(2\delta, \{f_\beta : \|\beta\|_1 = 1\}, \|\cdot\|_n\right) \le V^{-1}(\delta^2) \log\left(\frac{3}{\delta}\right).$$

Example 6.1. Suppose that for some positive constants $m$ and $C$,
$$\omega_j^2 \le \frac{C^2}{j^{2m}}, \quad j = 1, \dots, p.$$
Then by Lemma 6.1,
$$H\!\left(\delta, \{f_\beta : \|\beta\|_1 = 1\}, \|\cdot\|_n\right) \le \left(\frac{2C}{\delta}\right)^{1/m} \log\left(\frac{6}{\delta}\right).$$
For $\delta \ge 1/n$ (say) we therefore have
$$H\!\left(\delta, \{f_\beta : \|\beta\|_1 = 1\}, \|\cdot\|_n\right) \le \left(\frac{2C}{\delta}\right)^{1/m} \log(6n).$$
When $m > 1/2$, one can use a minor generalization of Corollary 5.2, where the entropy bound is only required for values of $\delta > 1/n$. One then takes
$$\alpha = \frac{1}{2m}, \quad A = (C_m \log n)^{1/(2\alpha)},$$
where $C_m$ is a constant depending on $m$ and $C$. Then the value of $\lambda_0$ defined there becomes
$$\lambda_0 = \frac{4 K_0 \left(\sqrt{C_m \log n} + t\right)}{\sqrt{n}},$$
which is, for fixed $m$ and $K_0$ and a fixed (large) $t$, of order $\sqrt{\log n / n}$.
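For a concrete correlated design one can inspect the eigenvalue decay of $\hat\Sigma$ directly. The following sketch (ours; the AR(1)-type toy covariance and all parameter values are assumptions for illustration) shows the kind of strongly decreasing spectrum that the premise of Lemma 6.1 and Example 6.1 refers to.

```python
import numpy as np

# Eigenvalue decay of the Gram matrix for a strongly correlated toy design.
rng = np.random.default_rng(0)
n, p, rho = 200, 300, 0.9
cols = np.arange(p)
Sigma_pop = rho ** np.abs(cols[:, None] - cols[None, :])   # AR(1)-type covariance
X = rng.multivariate_normal(np.zeros(p), Sigma_pop, size=n)
X /= np.sqrt((X ** 2).sum(axis=0) / n)                     # normalize: ||psi_j||_n = 1

Sigma_hat = X.T @ X / n
omega2 = np.sort(np.linalg.eigvalsh(Sigma_hat))[::-1]      # omega_1^2 >= ... >= omega_p^2
print(omega2[:10])               # strongly decreasing leading eigenvalues
print(np.sum(omega2 > 1e-10))    # at most n nonzero eigenvalues
```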

6.3. Entropy based on coverings of $\{\psi_j\}$. We can consider $\{f_\beta : \|\beta\|_1 = 1\}$ as a subset of $\mathrm{conv}(\{\pm\psi_j\})$, where $\{\pm\psi_j\} := \{\psi_j\} \cup \{-\psi_j\}$ and $\mathrm{conv}(\{\pm\psi_j\})$ is its convex hull. In fact, if the $\{\psi_j\}$ form a linearly dependent system in $\mathbb{R}^n$, $\mathcal{F}$ is exactly equal to $\mathrm{conv}(\{\pm\psi_j\})$. The paper [7] gives a bound for the entropy of a convex hull for the case where the $u$-covering number of the extreme points is a polynomial in $1/u$. This result can also be found in [9]. There is a redundant log-term in these entropy bounds, see [1] and [15], but removing this log-term may result in very large constants, depending on the dimension $W$ as given in Example 6.2 (see [3] for some explicit constants). This means that when the dimension $W$ of the extreme points is large (growing with $n$, say), the simple bound with log-term we provide below in Lemma 6.2 may be better than the more involved ones. We give a bound for the entropy of $\mathcal{F}$ by balancing the $u$-covering number of $\{\psi_j\}$ and the squared radius $u^2$. The result is as in [9], with as only new element its extension to general covering numbers (i.e., not only polynomial ones). Lemma 6.2 and its proof can be found in [3].

Lemma 6.2. Let $N(u) := N(u, \{\psi_j\}, \|\cdot\|_n)$, $u > 0$. We have
$$H\!\left(2\delta, \{f_\beta : \|\beta\|_1 = 1\}, \|\cdot\|_n\right) \le \min_{0 < u < 1} \left\{ \left( N(u) + \frac{6 u^2}{\delta^2} \right) \log\left(\frac{8}{\delta}\right) \right\} + \log\left(2 N(\delta)\right).$$
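The covering number $N(u)$ entering Lemma 6.2 can be bounded from the design itself. A small greedy sketch (ours; the greedy construction is a standard device, not the paper's method):

```python
import numpy as np

def greedy_cover_size(X, u):
    """Greedy u-covering of the columns {psi_j} in the ||.||_n norm (a sketch).
    Returns the size of a u-covering set built greedily; the covering number
    N(u, {psi_j}, ||.||_n) is at most this size."""
    n, p = X.shape
    centers = []
    for j in range(p):
        psi = X[:, j]
        covered = any(np.sqrt(np.sum((psi - X[:, c]) ** 2) / n) <= u for c in centers)
        if not covered:
            centers.append(j)
    return len(centers)
```

For correlated designs with clusters of similar covariables the greedy cover is much smaller than $p$, which by Lemma 6.2 translates into a small entropy of $\mathcal{F}$.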

Example 6.2. In this example, we assume the $u$-covering numbers of $\{\psi_j\}$ are bounded by a polynomial in $1/u$. That is, we suppose that for some positive constants $W$ and $C$,
$$N(u, \{\psi_j\}, \|\cdot\|_n) \le \left(\frac{C}{u}\right)^W, \quad u > 0.$$
The constant $W$ can be thought of as the dimension of $\{\psi_j\}$. By Lemma 6.2, we can choose
$$\alpha = \frac{W}{2 + W}, \quad A = (C_W \log n)^{1/(2\alpha)},$$
and we get, as in Example 6.1,
$$\lambda_0 = \frac{4 K_0 \left(\sqrt{C_W \log n} + t\right)}{\sqrt{n}},$$
where $C_W$ is a constant depending on $W$ and $C$.

A refined analysis of the relation between compatibility constants, covering numbers and entropy is still to be carried out. We confine ourselves here to the following, rather trivial, observation (without proof).

Lemma 6.3. Consider normalized design: $\|\psi_j\|_n = 1$ for all $j$. Let $\{\psi_{j_1}, \dots, \psi_{j_N}\}$ be a maximal $u$-packing set of $\{\psi_j\}$. Then for any $S$ with $\{j_1, \dots, j_N\} \subseteq S \subsetneq \{1, \dots, p\}$ and any $L \ge 1$,
$$\phi^2(L, S) \le s u^2.$$
One may argue that, as $u$-packing sets are approximations of the original design $\{\psi_j\}$ with fewer covariables, they are good candidates for the sparsity set $S_1$ used in Theorem 4.1. Lemma 6.3 however shows that such sparsity sets will have very small compatibility constants.

6.4. Decorrelation numbers. Decorrelation numbers are closely related to packing numbers. First, define the inner product
$$\rho(\phi, \tilde\phi) := \phi^T \tilde\phi / n.$$
Note that $\hat\Sigma_{j,k} = \rho(\psi_j, \psi_k)$ and that in the case of standardized design (i.e. $\sum_{i=1}^n \psi_j(x_i) = 0$ and $\|\psi_j\|_n = 1$ for all $j$), the inner product $\rho(\psi_j, \psi_k)$ is for $j \ne k$ the (empirical) correlation between $\psi_j$ and $\psi_k$.

Definition. For $\rho > 0$, the $\rho$-decorrelation number $M(\rho)$ is the largest value of $M$ such that there exists $\{\phi_1, \dots, \phi_M\} \subseteq \{\pm\psi_j\}$ with $\rho(\phi_j, \phi_k) < \rho$ for all $j \ne k$.
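A decorrelation number can be approximated greedily from the Gram matrix. The sketch below (ours; the greedy scan is an illustration and only yields a valid witness, hence a lower bound on $M(\rho)$):

```python
import numpy as np

def decorrelation_number(X, rho):
    """Greedy lower bound on the rho-decorrelation number M(rho) (a sketch).
    Scans the signed columns {+psi_j, -psi_j} and keeps a vector whenever its
    inner product rho(.,.) with all previously kept vectors is below rho."""
    n, p = X.shape
    signed = np.hstack([X, -X])                  # the set {+psi_j, -psi_j}
    kept = []
    for k in range(signed.shape[1]):
        phi = signed[:, k]
        if all(phi @ signed[:, c] / n < rho for c in kept):
            kept.append(k)
    return len(kept)
```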

Hence, if the $\rho$-decorrelation number is small, then there are many large correlations, i.e., the design is highly correlated. It is clear that when $\|\psi_j\|_n = \|\psi_k\|_n = 1$, it holds that
$$\|\psi_j - \psi_k\|_n^2 = 2\left(1 - \rho(\psi_j, \psi_k)\right).$$
In other words, large correlations correspond to covariables that are near to each other. This can be translated into covering numbers as shown in Lemma 6.4. Its proof is straightforward and omitted.

Lemma 6.4. Consider normalized design: $\|\psi_j\|_n = 1$ for all $j$. For all $0 < u < 1$,
$$N\!\left(u, \{\pm\psi_j\}, \|\cdot\|_n\right) \le M\!\left(1 - u^2/2\right).$$

7. Conclusion. We have combined results for the prediction error of the Lasso with both compatibility conditions and entropy conditions. Small entropies of $\{f_\beta : \|\beta\|_1 = 1\}$ correspond to highly correlated design and possibly to small compatibility constants. Our analysis shows that small entropies allow for a smaller choice of the tuning parameter and possibly for a compensation of small compatibility constants. This means that the Lasso enjoys good prediction error properties, even in the case where the design is highly correlated.

8. Proofs.

Proof of Lemma 4.1. We use that for positive $u$ and $v$ and for $p \ge 1$, $q \ge 1$ with $1/p + 1/q = 1$, the conjugate inequality
$$uv \le u^p / p + v^q / q$$
holds. Taking $p = 2/(1-\alpha)$ and replacing $u$ by $u^{(1-\alpha)/2}$ gives
$$u^{(1-\alpha)/2} v \le \frac{1-\alpha}{2} u + \frac{1+\alpha}{2} v^{2/(1+\alpha)}.$$
With $p = (1+\alpha)/(2\alpha)$, and replacing $u$ by $u^{2\alpha/(1+\alpha)}$, we get
$$u^{2\alpha/(1+\alpha)} v \le \frac{2\alpha}{1+\alpha} u + \frac{1-\alpha}{1+\alpha} v^{(1+\alpha)/(1-\alpha)}.$$
Thus, writing $\lambda_0 a^{1-\alpha} b^\alpha = (a^2/2)^{(1-\alpha)/2} \left(2^{(1-\alpha)/2} \lambda_0 b^\alpha\right)$ and applying the first inequality,
$$\lambda_0 a^{1-\alpha} b^\alpha \le \frac{(1-\alpha) a^2}{4} + \frac{1+\alpha}{2} \left(2^{(1-\alpha)/2} \lambda_0\right)^{2/(1+\alpha)} b^{2\alpha/(1+\alpha)}.$$
Writing $b^{2\alpha/(1+\alpha)} = (\lambda b)^{2\alpha/(1+\alpha)} \lambda^{-2\alpha/(1+\alpha)}$ and applying the second inequality to the last term, and then bounding the resulting numerical constants by $1$, we arrive at
$$\lambda_0 a^{1-\alpha} b^\alpha \le \frac{a^2}{4} + \lambda b + \lambda_0^2 \left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)}.$$

Proof of Theorem 4.1. Since
$$\|Y - \hat f\|_n^2 + \lambda \|\hat\beta\|_1 \le \|Y - f^S\|_n^2 + \lambda \|b^S\|_1,$$
we have the Basic Inequality
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta\|_1 \le 2 \varepsilon^T (\hat f - f^S)/n + \lambda \|b^S\|_1 + \|f^S - f^0\|_n^2.$$
Hence, on $\mathcal{T}_\alpha$,
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta\|_1 \le \lambda_0 \|\hat f - f^S\|_n^{1-\alpha} \|\hat\beta - b^S\|_1^\alpha / 2 + \lambda \|b^S\|_1 + \|f^S - f^0\|_n^2.$$
Apply Lemma 4.1 (with $\lambda_0/2$ and $\lambda/2$ in place of $\lambda_0$ and $\lambda$) to find
$$\lambda_0 \|\hat f - f^S\|_n^{1-\alpha} \|\hat\beta - b^S\|_1^\alpha / 2 \le \frac{1}{4}\|\hat f - f^S\|_n^2 + \frac{\lambda}{2} \|\hat\beta - b^S\|_1 + \frac{\lambda_0^2}{4}\left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)},$$
and use
$$\frac{1}{4}\|\hat f - f^S\|_n^2 \le \frac{1}{2}\|\hat f - f^0\|_n^2 + \frac{1}{2}\|f^S - f^0\|_n^2.$$
Thus, we get on $\mathcal{T}_\alpha$,
$$\|\hat f - f^0\|_n^2 + 2\lambda \|\hat\beta\|_1 \le \lambda \|\hat\beta - b^S\|_1 + 2\lambda \|b^S\|_1 + \frac{\lambda_0^2}{2}\left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)} + 3\|f^S - f^0\|_n^2.$$
Defining $S_3 := S^c$, we rewrite this to
$$\|\hat f - f^0\|_n^2 + 2\lambda \|\hat\beta_{S_1}\|_1 + 2\lambda \|\hat\beta_{S_2 \cup S_3}\|_1 \le \lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1 + \lambda \|\hat\beta_{S_2} - (b^S)_{S_2}\|_1 + \lambda \|\hat\beta_{S_3}\|_1 + 2\lambda \|(b^S)_{S_1}\|_1 + 2\lambda \|(b^S)_{S_2}\|_1 + \frac{\lambda_0^2}{2}\left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)} + 3\|f^S - f^0\|_n^2,$$
and hence, by the triangle inequality,
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta_{S_2 \cup S_3}\|_1 \le 3\lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1 + 3\lambda \|(b^S)_{S_2}\|_1 + \frac{\lambda_0^2}{2}\left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)} + 3\|f^S - f^0\|_n^2.$$

Adding $\lambda \|(b^S)_{S_2}\|_1$ to both sides and again using the triangle inequality, we obtain
$$\|\hat f - f^0\|_n^2 + \lambda \|(\hat\beta - b^S)_{S_2 \cup S_3}\|_1 \le \underbrace{3\lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1}_{=: I} + \underbrace{4\lambda \|(b^S)_{S_2}\|_1 + \frac{\lambda_0^2}{2}\left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)} + 3\|f^S - f^0\|_n^2}_{=: II}.$$

Case i. If $I \ge II$, we arrive at
$$\|\hat f - f^0\|_n^2 + \lambda \|(\hat\beta - b^S)_{S_2 \cup S_3}\|_1 \le 6\lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1.$$
In particular, $\|(\hat\beta - b^S)_{S_1^c}\|_1 \le 6 \|(\hat\beta - b^S)_{S_1}\|_1$, so the compatibility constant $\phi(6, S_1)$ applies to $\hat\beta - b^S$. We first add a term $\lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1$ to the left and right hand side and then apply the compatibility condition to get
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 \le 7\lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1 \le \frac{7\lambda \sqrt{s_1}}{\phi(6, S_1)} \|\hat f - f^S\|_n \le \frac{1}{2}\|\hat f - f^0\|_n^2 + \frac{7}{2}\|f^S - f^0\|_n^2 + \frac{28 \lambda^2 s_1}{\phi^2(6, S_1)}.$$
Here we used $\|\hat f - f^S\|_n \le \|\hat f - f^0\|_n + \|f^S - f^0\|_n$ and the decoupling device
$$2xy \le b x^2 + y^2 / b, \quad x, y \in \mathbb{R}, \ b > 0.$$
So then
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 \le \frac{56 \lambda^2 s_1}{\phi^2(6, S_1)} + 7\|f^S - f^0\|_n^2.$$

Case ii. If $I < II$, we get
$$\|\hat f - f^0\|_n^2 + \lambda \|(\hat\beta - b^S)_{S_2 \cup S_3}\|_1 \le 2 II,$$
and hence, since also $\lambda \|(\hat\beta - b^S)_{S_1}\|_1 = I/3 < II/3$,
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 \le \frac{7}{3} II = \frac{28}{3}\lambda \|(b^S)_{S_2}\|_1 + \frac{7}{6}\lambda_0^2\left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)} + 7\|f^S - f^0\|_n^2.$$
Combining the two cases gives the statement of the theorem.

Proof of Lemma 6.1. Let $\|\beta\|_1 = 1$. Then $\|\beta\|_2 \le 1$, and hence $\|E^T \beta\|_2 \le 1$. For $N \ge V^{-1}(\delta^2)$ it holds that $\omega_{N+1}^2 \le \delta^2$ and hence
$$\sum_{j=N+1}^p \omega_j^2 (E^T \beta)_j^2 \le \omega_{N+1}^2 \sum_{j=N+1}^p (E^T \beta)_j^2 \le \delta^2.$$
We now note that $\|\beta\|_1 = 1$ implies $\|f_\beta\|_n \le 1$ and hence
$$\sum_{j=1}^N \omega_j^2 (E^T \beta)_j^2 \le 1.$$
Lemma 14.7 in [3] states that a ball with radius 1 in $N$-dimensional Euclidean space can be covered by $(3/\delta)^N$ balls with radius $\delta$ (see also Problem 2.1.6 in [15]).

References.
[1] K. Ball and A. Pajor. The entropy of convex bodies with few extreme points. London Mathematical Society Lecture Note Series, 158:25-32, 1990.
[2] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37:1705-1732, 2009.
[3] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.
[4] F. Bunea, A.B. Tsybakov, and M.H. Wegkamp. Aggregation and sparsity via l1-penalized least squares. In Proceedings of 19th Annual Conference on Learning Theory, COLT 2006, Lecture Notes in Artificial Intelligence 4005, Heidelberg, 2006. Springer Verlag.
[5] F. Bunea, A.B. Tsybakov, and M.H. Wegkamp. Aggregation for Gaussian regression. Annals of Statistics, 35:1674-1697, 2007a.
[6] F. Bunea, A. Tsybakov, and M.H. Wegkamp. Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics, 1:169-194, 2007b.
[7] R.M. Dudley. Universal Donsker classes and metric entropy. Annals of Probability, 15, 1987.
[8] V. Koltchinskii. Sparsity in penalized empirical risk minimization. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 45:7-57, 2009.
[9] D. Pollard. Empirical Processes: Theory and Applications. IMS Lecture Notes, 1990.
[10] S. van de Geer. The deterministic Lasso. The JSM Proceedings, 2007.
[11] S. van de Geer. High-dimensional generalized linear models and the Lasso. Annals of Statistics, 36:614-645, 2008.
[12] S. van de Geer. On non-asymptotic bounds for estimation in generalized linear models with highly correlated design. In Asymptotics: Particles, Processes and Inverse Problems (E.A. Cator, G. Jongbloed, C. Kraaikamp, H.P. Lopuhaä, J.A. Wellner, eds.), IMS Lecture Notes Monograph Series, volume 55, 2007.
[13] S. van de Geer and P. Bühlmann. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360-1392, 2009.
[14] S.A. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

[15] A.W. van der Vaart and J.A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer-Verlag, New York, 1996.

Seminar for Statistics
ETH Zürich
Rämistrasse
Zürich, Switzerland
