THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES. By Sara van de Geer and Johannes Lederer. ETH Zürich

Submitted to the Annals of Applied Statistics. arXiv: math.PR/

THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES

By Sara van de Geer and Johannes Lederer, ETH Zürich

Research supported by SNF 0PA. AMS 2000 subject classifications: Primary 62J05; secondary 62J99. Keywords and phrases: compatibility, correlation, entropy, high-dimensional model, Lasso.

We study high-dimensional linear models and the $\ell_1$-penalized least squares estimator, also known as the Lasso estimator. In the literature, oracle inequalities have been derived under restricted eigenvalue or compatibility conditions. In this paper, we complement this with entropy conditions which allow one to improve the dual norm bound, and demonstrate how this leads to new oracle inequalities. The new oracle inequalities show that a smaller choice for the tuning parameter and a trade-off between $\ell_1$-norms and small compatibility constants are possible. This implies, in particular for correlated design, improved bounds for the prediction error of the Lasso estimator as compared to the methods based on restricted eigenvalue or compatibility conditions only.

1. Introduction. We derive oracle inequalities for the Lasso estimator for various designs. Results in the literature are generally based on restricted eigenvalue or compatibility conditions (see Section 3 for definitions). We refer to [2], [4], [5], [6], [8], [10], [11]; see also [3] and the references therein. In a sense, compatibility or restricted eigenvalue conditions and the so-called dual norm bound we describe below belong together. In contrast, if compatibility constants or restricted eigenvalues are very small, the design may have high correlations, and then the dual norm bound is too rough. In this paper, we discuss an approach that joins both situations. The work is a follow-up of [12]. It combines results of the latter with the parallel developments in the area based on the dual norm bound.

We consider an input space $\mathcal{X}$ and $p$ feature mappings $\psi_j : \mathcal{X} \to \mathbb{R}$, $j = 1, \dots, p$. We let $(x_1, \dots, x_n)^T \in \mathcal{X}^n$ be a given input vector.

We let $Y := (Y_1, \dots, Y_n)^T \in \mathbb{R}^n$ be an output vector, and consider the linear model
$$Y = \sum_{j=1}^p \psi_j \beta_j^0 + \varepsilon,$$
with $\varepsilon \in \mathbb{R}^n$ a noise vector and $\beta^0 \in \mathbb{R}^p$ a vector of unknown coefficients. Here, with some abuse of notation, $\psi_j$ denotes the vector $\psi_j = (\psi_j(x_1), \dots, \psi_j(x_n))^T$. The design matrix is $X := (\psi_1, \dots, \psi_p)$ and the Gram matrix is $\hat\Sigma := X^T X / n$. Throughout, we assume that $\sum_{i=1}^n \psi_j^2(x_i) \le n$ for all $j$. We write a linear function with coefficients $\beta$ as $f_\beta := \sum_{j=1}^p \psi_j \beta_j$, $\beta \in \mathbb{R}^p$. The Lasso estimator is
$$\hat\beta := \arg\min_\beta \left\{ \|Y - f_\beta\|_2^2 / n + \lambda \|\beta\|_1 \right\}.$$
We denote the estimator of the regression function $f^0 := f_{\beta^0}$ by $\hat f := f_{\hat\beta}$.

Oracle results using compatibility or restricted eigenvalue conditions are based on the dual norm bound
$$\sup_{\|\beta\|_1 = 1} |\varepsilon^T f_\beta| / n = \max_{1 \le j \le p} |\varepsilon^T \psi_j| / n.$$
Let us define
$$\|f_\beta\|_n^2 := \sum_{i=1}^n f_\beta^2(x_i) / n = \beta^T \hat\Sigma \beta.$$
The point we make in this paper is that the dual norm bound does not take into account possible small values of $\|f_{\hat\beta} - f_{\beta^0}\|_n$. Our results are based on bounds for
$$\sup_{\|\beta\|_1 \le 1, \ \|f_\beta\|_n \le R} |\varepsilon^T f_\beta| / n$$
as a function of $R > 0$. We then apply these to $\hat\beta - \beta^0$ (or with $\beta^0$ here replaced by a sparse approximation). We use an improvement of the dual norm bound, and show in Theorem 4.1 the consequences. The main observation here is that with highly correlated design, one can generally take the tuning parameter $\lambda$ of much smaller order than the usual $\sqrt{\log p / n}$. Moreover, small compatibility constants may be traded off against the $\ell_1$-norm of the coefficients of an oracle.
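For concreteness, the following is a minimal numerical sketch of this setup (not part of the original analysis; the toy data, variable names, and the use of scikit-learn are ours). It assumes scikit-learn is available and uses the fact that its Lasso minimizes $\|Y - X\beta\|_2^2/(2n) + \alpha\|\beta\|_1$, so $\alpha = \lambda/2$ matches the objective above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy instance of the setup in Section 1 (illustrative only).
rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).sum(axis=0) / n)        # enforce ||psi_j||_n <= 1 (here = 1)
beta0 = np.zeros(p); beta0[:5] = 1.0          # sparse truth
eps = rng.standard_normal(n)
Y = X @ beta0 + eps

Sigma_hat = X.T @ X / n                        # Gram matrix
lam = 4 * np.sqrt(np.log(p) / n)               # tuning parameter of the usual order

# sklearn's Lasso minimizes ||Y - X b||_2^2/(2n) + alpha*||b||_1,
# so alpha = lam/2 corresponds to ||Y - f_beta||_n^2 + lam*||beta||_1.
beta_hat = Lasso(alpha=lam / 2, fit_intercept=False, max_iter=50_000).fit(X, Y).coef_

pred_norm2 = beta_hat @ Sigma_hat @ beta_hat   # ||f_{beta_hat}||_n^2 = beta^T Sigma_hat beta
dual_bound = np.max(np.abs(X.T @ eps) / n)     # max_j |eps^T psi_j| / n (dual norm bound)
print(pred_norm2, dual_bound)
```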

2. Organization of the paper. In Section 3, we present our notation and the definitions of compatibility constants and restricted eigenvalues. Section 4 contains the main result, based on a pre-assumed improvement of the dual norm bound. In Section 5, we present a result from empirical process theory, which shows that the improvement of the dual norm bound used in Section 4 holds under entropy conditions on $\mathcal{F} := \{f_\beta : \|\beta\|_1 = 1\}$. In Section 6, we first give a geometrical interpretation of the compatibility constant and discuss the relation with eigenvalues. The next question to address is then how to read off the entropy conditions directly from the design. We show that a Gram matrix with strongly decreasing eigenvalues leads to a small entropy of $\mathcal{F}$. Alternatively, we derive an entropy bound for $\mathcal{F}$ based on the covering number of the design $\{\psi_j\}$, a result much in the spirit of [7]. We moreover link these covering numbers with the correlation structure of the design. Section 7 concludes and Section 8 contains the proofs.

3. Notation and definitions.

3.1. The compatibility constant. Let $S \subseteq \{1, \dots, p\}$ be an index set with cardinality $s$. We define for all $\beta \in \mathbb{R}^p$,
$$\beta_{S,j} := \beta_j \, 1\{j \in S\}, \quad j = 1, \dots, p, \qquad \beta_{S^c} := \beta - \beta_S.$$
Below, we present for constants $L > 0$ the compatibility constant $\phi(L, S)$ introduced in [10]. For normalized $\psi_j$ (i.e., $\|\psi_j\|_n = 1$ for all $j$), one can view $1 - \phi^2(1, S)/2$ as an $\ell_1$-version of the canonical correlation between the linear space spanned by the variables in $S$ on the one hand, and the linear space of the variables in $S^c$ on the other hand. Instead of all linear combinations with normalized $\ell_2$-norm, we now consider all linear combinations with normalized $\ell_1$-norm of the coefficients. For a geometric interpretation, we refer to Section 6.

Definition. The compatibility constant is
$$\phi^2(L, S) := \min\left\{ s \|f_{\beta_S} - f_{\beta_{S^c}}\|_n^2 : \|\beta_S\|_1 = 1, \ \|\beta_{S^c}\|_1 \le L \right\}.$$
The compatibility constant is closely related to (and never smaller than) the restricted eigenvalue as defined in [2], which is
$$RE^2(L, S) = \min\left\{ \frac{\|f_{\beta_S} - f_{\beta_{S^c}}\|_n^2}{\|\beta_S\|_2^2} : \|\beta_{S^c}\|_1 \le L \|\beta_S\|_1 \right\}.$$
See also [8], and see [13] for a discussion of the relation between restricted eigenvalues and compatibility.
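The minimization defining $\phi^2(L, S)$ is non-convex, so exact computation is not immediate. The following sketch (ours, not part of the paper; the helper name and the random-search strategy are assumptions for illustration) produces a crude Monte Carlo upper bound on $\phi^2(L, S)$ for a given design matrix; it assumes $S$ is a proper subset of $\{1,\dots,p\}$.

```python
import numpy as np

def compat_upper_bound(X, S, L=6.0, n_draws=20000, seed=1):
    """Crude Monte Carlo upper bound on phi^2(L, S): sample gamma with
    ||gamma_S||_1 = 1 and ||gamma_{S^c}||_1 <= L and minimize s*||X gamma||_n^2
    over the draws.  The true constant is the infimum, so this only gives an
    upper bound (a sketch, not the paper's method)."""
    n, p = X.shape
    S = np.asarray(S)
    Sc = np.setdiff1d(np.arange(p), S)
    s = len(S)
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_draws):
        g = np.zeros(p)
        w = rng.dirichlet(np.ones(s)) * rng.choice([-1, 1], size=s)
        g[S] = w                                     # ||gamma_S||_1 = 1
        r = L * rng.uniform()                        # radius in [0, L]
        v = rng.dirichlet(np.ones(len(Sc))) * rng.choice([-1, 1], size=len(Sc))
        g[Sc] = r * v                                # ||gamma_{S^c}||_1 <= L
        best = min(best, s * np.sum((X @ g) ** 2) / n)
    return best
```

For strongly correlated designs this upper bound is typically small already for moderate $L$, which is the regime where the refined analysis below is of interest.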

3.2. Projections. As the true $\beta^0$ is perhaps only approximately sparse, we will consider a sparse approximation. The projection of $f^0 := f_{\beta^0}$ on the space spanned by the variables in $S$ is
$$f^S := \arg\min_{f = f_{\beta_S}} \|f - f^0\|_n.$$
The coefficients of $f^S$ are denoted by $b^S$, i.e., $f^S = f_{b^S}$. Note that $f^S$ only has non-zero coefficients inside $S$, that is, $(b^S)_S = b^S$.

4. Main result. We let $\mathcal{T}_\alpha$ be the set
$$\mathcal{T}_\alpha := \left\{ \sup_\beta \frac{4 |\varepsilon^T f_\beta| / n}{\|f_\beta\|_n^{1-\alpha} \|\beta\|_1^\alpha} \le \lambda_0 \right\}.$$
Here, $0 \le \alpha \le 1$ and $\lambda_0 > 0$ are fixed constants. Note that on $\mathcal{T}_\alpha$,
$$\sup_{\|\beta\|_1 = 1, \ \|f_\beta\|_n \le R} |\varepsilon^T f_\beta| / n \le \lambda_0 R^{1-\alpha} / 4,$$
i.e., we have a refinement of the dual norm bound described in Section 1. Note that for fixed $\lambda_0$ and for $\alpha' < \alpha$, it holds that $\mathcal{T}_{\alpha'} \subseteq \mathcal{T}_\alpha$. This is because, by the triangle inequality,
$$\|f_\beta\|_n = \Big\|\sum_j \psi_j \beta_j\Big\|_n \le \sum_j \|\psi_j\|_n |\beta_j| \le \|\beta\|_1,$$
so that $\|f_\beta\|_n^{1-\alpha'} \|\beta\|_1^{\alpha'} \le \|f_\beta\|_n^{1-\alpha} \|\beta\|_1^{\alpha}$, i.e., the requirement defining $\mathcal{T}_{\alpha'}$ is the stronger one. We want to choose $\alpha$ preferably small, yet keep the probability of the set $\mathcal{T}_\alpha$ large. For $\alpha = 1$, one has
$$\mathcal{T}_1 = \left\{ \max_{1 \le j \le p} 4 |\varepsilon^T \psi_j| / n \le \lambda_0 \right\},$$
by the dual norm bound. Thus, e.g. when $\varepsilon \sim N(0, I)$, the probability $\mathbb{P}(\mathcal{T}_1)$ of $\mathcal{T}_1$ is large when $\lambda_0 \asymp \sqrt{\log p / n}$. We detail in Section 5 how one can lower bound $\mathbb{P}(\mathcal{T}_\alpha)$ for a proper value of $\alpha$ depending on the design $\{\psi_j\}$.
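The $\alpha = 1$ statement is easy to confirm by simulation. The following sketch (ours; the particular constant in $\lambda_0$ is an assumption made only for this illustration) estimates $\mathbb{P}(\mathcal{T}_1)$ for Gaussian errors with $\lambda_0$ a multiple of $\sqrt{\log p / n}$.

```python
import numpy as np

# Monte Carlo check that for alpha = 1 and eps ~ N(0, I), the set
# T_1 = { max_j 4|eps^T psi_j|/n <= lambda_0 } has large probability once
# lambda_0 is a multiple of sqrt(log p / n).
rng = np.random.default_rng(0)
n, p, reps = 200, 1000, 500
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).sum(axis=0) / n)          # ||psi_j||_n = 1
lam0 = 4 * np.sqrt(2 * np.log(2 * p) / n)       # an illustrative choice of lambda_0

hits = 0
for _ in range(reps):
    eps = rng.standard_normal(n)
    if 4 * np.max(np.abs(X.T @ eps)) / n <= lam0:
        hits += 1
print("estimated P(T_1):", hits / reps)
```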

Generally, the value for $\lambda_0$ will be of order $\sqrt{\log p / n}$, as in the case $\alpha = 1$, or $\lambda_0 \asymp \sqrt{\log n / n}$, or even $\lambda_0 \asymp 1/\sqrt{n}$. The choice of the tuning parameter $\lambda$ depends on $\lambda_0$. The following technical lemma will be used.

Lemma 4.1. Let $0 \le \alpha \le 1$ and let $a$, $b$ and $\lambda_0$ be positive numbers. Then
$$\lambda_0 a^{1-\alpha} b^\alpha \le \frac{a^2}{4} + \lambda b + \lambda_0^2 \left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)}.$$
Here, when $\alpha = 1$,
$$\left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)} := \begin{cases} \infty & \lambda < \lambda_0, \\ 1 & \lambda = \lambda_0, \\ 0 & \lambda > \lambda_0. \end{cases}$$

In the proof of the main result, Theorem 4.1, we invoke Lemma 4.1 to handle the noise part $\varepsilon^T f_\beta$ with $\beta = \hat\beta - \beta^0$ (or actually with $\beta^0$ replaced here by a sparse approximation). On $\mathcal{T}_\alpha$, it holds that
$$4 |\varepsilon^T f_\beta| / n \le \frac{\|f_\beta\|_n^2}{4} + \lambda \|\beta\|_1 + \lambda_0^2 \left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)},$$
uniformly in $\beta \in \mathbb{R}^p$. In the right hand side of this inequality, the first term $\|f_\beta\|_n^2 / 4$ can be incorporated in the risk, and the second term $\lambda \|\beta\|_1$ will be overruled by the penalty. Finally, the third term $\lambda_0^2 (\lambda_0/\lambda)^{2\alpha/(1-\alpha)}$ governs the choice of the tuning parameter $\lambda$.
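A quick numerical spot-check of the inequality displayed in Lemma 4.1 on random inputs (an illustration only; it does not replace the proof in Section 8, and the ranges of the draws are arbitrary):

```python
import numpy as np

# Spot-check of Lemma 4.1: the slack rhs - lhs should stay nonnegative.
rng = np.random.default_rng(0)
worst = np.inf
for _ in range(100_000):
    alpha = rng.uniform(0.0, 0.9)
    a, b, lam0, lam = rng.uniform(0.1, 10.0, size=4)
    lhs = lam0 * a ** (1 - alpha) * b ** alpha
    rhs = a ** 2 / 4 + lam * b + lam0 ** 2 * (lam0 / lam) ** (2 * alpha / (1 - alpha))
    worst = min(worst, rhs - lhs)
print("minimum slack over draws:", worst)   # remains nonnegative
```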

We now come to the main result. We formulate it for an arbitrary index set $S$, partitioned into sets $S_1$ and $S_2$ in an arbitrary way. We will elaborate on the choice of $S$ in Remarks 4.2 and 4.5. Corollaries 4.1 and 4.2 take for a given $S$ some special choices for the tuning parameter $\lambda$ and for the partition of $S$ into $S_1$ and $S_2$. Recall that $f^S$ is the projection of $f^0 = f_{\beta^0}$ and that $b^S$ are the coefficients of $f^S$.

Theorem 4.1. Let $S$ be an arbitrary index set, partitioned into two sets $S_1$ and $S_2$, i.e. $S = S_1 \cup S_2$, $S_1 \cap S_2 = \emptyset$. Let $s_1$ be the cardinality of $S_1$. Let $\mathcal{T}_\alpha$ be the set
$$\mathcal{T}_\alpha := \left\{ \sup_\beta \frac{4 |\varepsilon^T f_\beta| / n}{\|f_\beta\|_n^{1-\alpha} \|\beta\|_1^\alpha} \le \lambda_0 \right\}.$$
Then on $\mathcal{T}_\alpha$,
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 \le \frac{56 \lambda^2 s_1}{\phi^2(6, S_1)} + \frac{28}{3}\lambda \|(b^S)_{S_2}\|_1 + \frac{7}{6}\lambda_0^2 \left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)} + 7 \|f^S - f^0\|_n^2.$$

Remark 4.1. We did not attempt to optimize the constants we provide in Theorem 4.1.

Remark 4.2. Given a value of the tuning parameter $\lambda$, we can now define (up to constants) the estimation error using the variables in $S$ as
$$\mathcal{E}(S) := \min_{S_1 \subseteq S, \ S_2 = S \setminus S_1} \left\{ \frac{8 \lambda^2 s_1}{\phi^2(6, S_1)} + \lambda \|(b^S)_{S_2}\|_1 \right\}.$$
The oracle set $S_*$ is then the set which trades off estimation error and approximation error, i.e., the set $S$ that minimizes
$$\mathcal{E}(S) + \|f^S - f^0\|_n^2.$$
Note that $S_*$ depends on $\lambda$, say $S_* = S_*(\lambda)$. The best value for the tuning parameter $\lambda$ is then obtained by minimizing
$$\mathcal{E}(S_*(\lambda)) + \|f^{S_*(\lambda)} - f^0\|_n^2.$$

Remark 4.3. In practice, the tuning parameter $\lambda$ can be chosen by cross-validation. As this method tries to mimic minimization of the prediction error, it can be conjectured that one then arrives at rates at least as good as the ones we discuss here, choosing values of $\lambda$ depending on the design, the (unknown) error distribution, and the (unknown) sparsity. This is however not rigorously proven.

Remark 4.4. We have restricted ourselves to improvements of the dual norm bound of the form given by the sets $\mathcal{T}_\alpha$. The situation can be generalized by considering sets of the form
$$\left\{ \sup_\beta \frac{4 |\varepsilon^T f_\beta| / n}{G^{-1}\!\left(\|f_\beta\|_n / \|\beta\|_1\right) \, \|\beta\|_1} \le \lambda_0 \right\},$$
where $G$ is a given increasing convex function with $G(0) = 0$.

Corollary 4.1.
(a) If we take $S_2 = \emptyset$, we have $S_1 = S$ and $s_1 = |S| =: s$. This is a good choice when the compatibility constants are large for all subsets of $S$. With the choice
$$\lambda \asymp \lambda_0 \left(\frac{\phi^2(6, S)}{s}\right)^{(1-\alpha)/2},$$
we get on $\mathcal{T}_\alpha$,
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 = O\!\left( \lambda_0^2 \left(\frac{s}{\phi^2(6, S)}\right)^\alpha \right) + O\!\left( \|f^S - f^0\|_n^2 \right).$$
Recall that the dual norm bound has $\alpha = 1$. With $\lambda_0 \asymp \sqrt{\log p / n}$ we then arrive at the usual oracle inequality as provided by, among others, [2], [4], [5], [6], [8], [10], [11]. When $\alpha < 1$, the compatibility constant may be very small, as the design is highly correlated. The effect is however somewhat tempered by the power $\alpha$ in the bound.
(b) More generally, let
$$\lambda \asymp \lambda_0 \left(\frac{\phi^2(6, S_1)}{s_1}\right)^{(1-\alpha)/2}.$$
Then on $\mathcal{T}_\alpha$,
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 = O\!\left( \lambda_0^2 \left(\frac{s_1}{\phi^2(6, S_1)}\right)^\alpha \right) + O\!\left( \lambda_0 \left(\frac{\phi^2(6, S_1)}{s_1}\right)^{(1-\alpha)/2} \|(b^S)_{S_2}\|_1 \right) + O\!\left( \|f^S - f^0\|_n^2 \right).$$

Corollary 4.2.
(a) With the choice $S_1 = \emptyset$, the first term in the bound of Theorem 4.1 vanishes, so the result does not involve the compatibility constant. This may be desirable when the design is highly correlated. The result then corresponds to what is sometimes called slow rates, although we will see that when $\alpha < 1$, the rates can still be much faster than $1/\sqrt{n}$. When $\alpha = 1$, we must take $\lambda > \lambda_0$ due to the term $(\lambda_0/\lambda)^{2\alpha/(1-\alpha)}$. When $\alpha < 1$, we choose
$$\lambda \asymp \left( \lambda_0^2 \|b^S\|_1^{-(1-\alpha)} \right)^{1/(1+\alpha)}.$$
We get on $\mathcal{T}_\alpha$,
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 = O\!\left( \left(\lambda_0 \|b^S\|_1^\alpha\right)^{2/(1+\alpha)} \right) + O\!\left( \|f^S - f^0\|_n^2 \right).$$
(b) More generally, let
$$\lambda \asymp \left( \lambda_0^2 \|(b^S)_{S_2}\|_1^{-(1-\alpha)} \right)^{1/(1+\alpha)}.$$

Then on $\mathcal{T}_\alpha$,
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 = O\!\left( \frac{\lambda^2 s_1}{\phi^2(6, S_1)} \right) + O\!\left( \left(\lambda_0 \|(b^S)_{S_2}\|_1^\alpha\right)^{2/(1+\alpha)} \right) + O\!\left( \|f^S - f^0\|_n^2 \right).$$

Remark 4.5. Note that by taking $S_1$ smaller, the value of $s_1/\phi^2(6, S_1)$ will not increase, but on the other hand, the value of $\|(b^S)_{S_2}\|_1$ will become larger. Thus, the best rate will emerge if we trade off these two effects. Indeed, suppose that for some $S_1$,
$$\lambda_0 \left(\frac{s_1}{\phi^2(6, S_1)}\right)^{(1+\alpha)/2} \asymp \|(b^S)_{S_2}\|_1.$$
Then on $\mathcal{T}_\alpha$, for
$$\lambda \asymp \lambda_0 \left(\frac{\phi^2(6, S_1)}{s_1}\right)^{(1-\alpha)/2} \asymp \left( \lambda_0^2 \|(b^S)_{S_2}\|_1^{-(1-\alpha)} \right)^{1/(1+\alpha)},$$
we have
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 = O\!\left( \lambda_0^2 \left(\frac{s_1}{\phi^2(6, S_1)}\right)^\alpha \right) + O\!\left( \|f^S - f^0\|_n^2 \right) = O\!\left( \left(\lambda_0 \|(b^S)_{S_2}\|_1^\alpha\right)^{2/(1+\alpha)} \right) + O\!\left( \|f^S - f^0\|_n^2 \right).$$
In particular for the case $\alpha < 1$, it is however not clear when such a trade-off is possible. It may well be that for any $S_1$, the term $s_1/\phi^2(6, S_1)$ either heavily dominates or is heavily dominated by the $\ell_1$-part $\|(b^S)_{S_2}\|_1$. See Section 6 for a further discussion.

5. Improving the dual norm bound. In this section, we provide probability bounds for the set $\mathcal{T}_\alpha$ introduced in Section 4. The results follow from empirical process theory, see e.g. [14] and [15]. Theorem 5.1 is taken from [3].

Definition. Let $\mathcal{F}$ be a class of real-valued functions on $\mathcal{X}$. Endow $\mathcal{F}$ with the norm $\|\cdot\|_n$. Let $\delta > 0$ be some radius. A $\delta$-packing set is a set of functions in $\mathcal{F}$ that are each at least $\delta$ apart. A $\delta$-covering set is a set of functions $\{\phi_1, \dots, \phi_N\}$ such that
$$\sup_{f \in \mathcal{F}} \min_{k = 1, \dots, N} \|f - \phi_k\|_n \le \delta.$$
The $\delta$-covering number $N(\delta, \mathcal{F}, \|\cdot\|_n)$ of $\mathcal{F}$ is the minimum size of a $\delta$-covering set. The entropy of $\mathcal{F}$ is $H(\cdot, \mathcal{F}, \|\cdot\|_n) = \log N(\cdot, \mathcal{F}, \|\cdot\|_n)$. It is easy to see that $N(\delta, \mathcal{F}, \|\cdot\|_n)$ can be bounded by the size of a maximal $\delta$-packing set.

We assume the errors are sub-Gaussian, that is, for some positive constants $K$ and $\sigma_0$,
$$(5.1) \qquad K^2 \left( \mathbb{E} \exp[\varepsilon_i^2 / K^2] - 1 \right) \le \sigma_0^2, \quad i = 1, \dots, n.$$
The following theorem is Corollary 14.6 in [3]. It is in the spirit of a weighted concentration inequality, and uses the notation $x_+ := \max\{x, 0\}$.

Theorem 5.1. Assume (5.1). Let $\mathcal{F}$ be a class of functions with $\|f\|_n \le 1$ for all $f \in \mathcal{F}$, and with, for some $0 < \alpha < 1$ and some constant $A$,
$$\log\left(1 + N(\delta, \mathcal{F}, \|\cdot\|_n)\right) \le \left(\frac{A}{\delta}\right)^{2\alpha}, \quad 0 < \delta \le 1.$$
Define
$$B := \exp\left[ \frac{A^{2\alpha} \alpha^{2\alpha}}{1 - \alpha} \right] - 1 \quad \text{and} \quad K_0^2 := 3\left(5 K^2 + \sigma_0^2\right).$$
It holds that
$$\mathbb{E} \exp\left[ \sup_{f \in \mathcal{F}} \frac{\left( |\varepsilon^T f| / \sqrt{n} - K_0 A^\alpha \|f\|_n^{1-\alpha} \right)_+^2}{K_0^2 \|f\|_n^{2(1-\alpha)}} \right] \le 1 + 2/B.$$

Corollary 5.1. Assume the conditions of Theorem 5.1. Chebyshev's inequality shows that for all $t > 0$,
$$\mathbb{P}\left( \exists f \in \mathcal{F} : \ |\varepsilon^T f| / \sqrt{n} \ge K_0 (A^\alpha + t) \|f\|_n^{1-\alpha} \right) \le \exp[-t^2] (1 + 2/B).$$
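Condition (5.1) is easy to check for a given error law. A small sketch (ours) for standard normal errors, where one may take $K = 3$ (the specific constants are our choices, not the paper's):

```python
import numpy as np

# Monte Carlo check of the sub-Gaussian condition (5.1) for eps ~ N(0,1).
# For K = 3: E exp(eps^2/K^2) = (1 - 2/9)^(-1/2) ~ 1.134, so
# K^2 (E exp(eps^2/K^2) - 1) ~ 1.21, and (5.1) holds e.g. with sigma_0^2 = 1.25.
rng = np.random.default_rng(0)
K = 3.0
eps = rng.standard_normal(1_000_000)
lhs = K ** 2 * (np.mean(np.exp(eps ** 2 / K ** 2)) - 1.0)
print(lhs)   # close to 1.21
```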

Corollary 5.2. Consider now linear functions
$$f_\beta := \sum_{j=1}^p \psi_j \beta_j, \quad \beta \in \mathbb{R}^p,$$
where $\|\psi_j\|_n \le 1$. Then $\|f_\beta\|_n \le \|\beta\|_1$. Hence, $\{f_\beta / \|\beta\|_1 : \beta \in \mathbb{R}^p\}$ is a class of functions with $\|\cdot\|_n$-norm bounded by $1$. Suppose now
$$\log\left(1 + N(\delta, \{f_\beta : \|\beta\|_1 = 1\}, \|\cdot\|_n)\right) \le \left(\frac{A}{\delta}\right)^{2\alpha}, \quad 0 < \delta \le 1.$$
Under the sub-Gaussianity condition (5.1), we then have for all $t > 0$ and for
$$\lambda_0 = \frac{4 K_0 (A^\alpha + t)}{\sqrt{n}},$$
the lower bound
$$\mathbb{P}(\mathcal{T}_\alpha) \ge 1 - \exp[-t^2](1 + 2/B).$$

6. Compatibility, eigenvalues, entropy and correlations. We study the set
$$\mathcal{F} := \{f_\beta : \|\beta\|_1 = 1\}.$$
It is considered as a subset of $L_2(Q_n)$, where $Q_n := \sum_{i=1}^n \delta_{x_i} / n$. The $L_2(Q_n)$-norm is $\|\cdot\|_n$.

6.1. Geometric interpretation of the compatibility constant. We first look at the minimal $\ell_1$-eigenvalue
$$\Lambda_{\min,1}^2(S) := \min\left\{ s \beta_S^T \hat\Sigma \beta_S : \|\beta_S\|_1 = 1 \right\}$$
as introduced in [3]. Note that $\Lambda_{\min,1}(S)/\sqrt{s}$ is the minimal distance between any point $f_{\beta_S}$ with $\|\beta_S\|_1 = 1$ and the point $\{0\}$. We tacitly assume that the $\{\psi_j\}_{j \in S}$ are linearly independent. The set $\{f_{\beta_S} : \|\beta_S\|_1 = 1\}$ is then an $\ell_1$-version of a sphere: it is the boundary of the convex hull of $\{\psi_j\}_{j \in S} \cup \{-\psi_j\}_{j \in S}$ in $s$-dimensional space with $\{0\}$ in its center. It is a parallelogram when $s = 2$ (see Figure 1), and then a rectangle when the $\psi_j$, $j \in S$, have equal length.
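The two eigenvalue-type quantities are easy to compare numerically. The sketch below (ours; the random search over the $\ell_1$-sphere is an assumption made for illustration) returns a crude upper bound on $\Lambda_{\min,1}^2(S)$ together with the exact $\Lambda_{\min}^2(S)$.

```python
import numpy as np

def l1_and_l2_min_eigenvalues(X, S, n_draws=50000, seed=0):
    """Compare Lambda_min,1^2(S) and Lambda_min^2(S) on a toy design (a sketch).
    The l2 quantity is exact (smallest eigenvalue of Sigma_hat_S); the l1 quantity
    is a crude random-search upper bound over the l1-sphere ||beta_S||_1 = 1."""
    n = X.shape[0]
    XS = X[:, S]
    s = XS.shape[1]
    Sigma_S = XS.T @ XS / n
    lam2_min = np.linalg.eigvalsh(Sigma_S)[0]          # Lambda_min^2(S)
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_draws):
        b = rng.dirichlet(np.ones(s)) * rng.choice([-1, 1], size=s)   # ||b||_1 = 1
        best = min(best, s * b @ Sigma_S @ b)          # upper bound on Lambda_min,1^2(S)
    return best, lam2_min
```

In accordance with the sandwich displayed below, the first returned value (even though only an upper bound) never falls below the second.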

[Figure 1. Left panel: the set $A = \{f_{\beta_S} : \|\beta_S\|_1 = 1\}$. Right panel: $\ell_1$- and $\ell_2$-eigenvalues.]

Let $\hat\Sigma_S$ be the Gram matrix of the variables in $S$ and let $\Lambda_{\min}(S)$ be the minimal ($\ell_2$-)eigenvalue of the matrix $\hat\Sigma_S$:
$$\Lambda_{\min}^2(S) := \min\left\{ \beta_S^T \hat\Sigma \beta_S : \|\beta_S\|_2 = 1 \right\}.$$
Then
$$\Lambda_{\min,1}^2(S) \ge \Lambda_{\min}^2(S) \ge \Lambda_{\min,1}^2(S) / s.$$
One can construct examples where $\Lambda_{\min}^2(S)$ is as small as $3/s$ ($s > 2$) and $\Lambda_{\min,1}^2(S)$ is at least $1/2$ (see [13]), that is, they can differ by the maximal amount $s$ in order of magnitude. See also Figure 1, which is to be understood as representing a case $s > 2$. Thus, minimal $\ell_1$-eigenvalues can be much larger than minimal ($\ell_2$-)eigenvalues.

The normalized compatibility constant $\phi(L, S)/\sqrt{s}$ is the minimal distance between the sets $A := \{f_{\beta_S} : \|\beta_S\|_1 = 1\}$ and $B := \{f_{\beta_{S^c}} : \|\beta_{S^c}\|_1 \le L\}$, that is,
$$\frac{\phi(L, S)}{\sqrt{s}} = \min\left\{ \|a - b\|_n : a \in A, \ b \in B \right\}.$$
See Figure 2 for an impression of the situation. Observe that $A$ is the boundary of the convex hull of $\{+\psi_j\}_{j \in S} \cup \{-\psi_j\}_{j \in S}$, and $B$ is the convex hull of $\{+\psi_j\}_{j \in S^c} \cup \{-\psi_j\}_{j \in S^c}$, including its interior, blown up with a factor $L$ (typically, the $\{\psi_j\}_{j \in S^c}$ form a linearly dependent system in $\mathbb{R}^n$). Furthermore, since $\{0\} \in B$,
$$\phi(L, S) \le \Lambda_{\min,1}(S).$$
This shows that when $\ell_1$-eigenvalues are small, the compatibility constant is necessarily also small. Small $\ell_2$-eigenvalues may have less of this effect.

[Figure 2. The compatibility constant $\phi(L, S)/\sqrt{s}$ as the distance between the sets $A$ and $B$.]

6.2. Eigenvalues and entropy. We now let
$$\hat\Sigma = E \Omega^2 E^T$$
be the spectral decomposition of the Gram matrix $\hat\Sigma$, with $E$ the matrix of eigenvectors ($E^T E = E E^T = I$) and $\Omega^2 = \mathrm{diag}(\omega_1^2, \dots, \omega_p^2)$ the matrix of ($\ell_2$-)eigenvalues. We assume they are in decreasing order: $\omega_1^2 \ge \cdots \ge \omega_p^2$.

Lemma 6.1. Suppose that for some strictly decreasing function $V$,
$$\omega_{j+1}^2 \le V(j), \quad j = 1, \dots, p.$$
Then for all $\delta > 0$,
$$H\!\left(2\delta, \{f_\beta : \|\beta\|_1 = 1\}, \|\cdot\|_n\right) \le V^{-1}(\delta^2) \log\left(\frac{3}{\delta}\right).$$

Example 6.1. Suppose that for some positive constants $m$ and $C$,
$$\omega_j^2 \le \frac{C^2}{j^{2m}}, \quad j = 1, \dots, p.$$
Then by Lemma 6.1,
$$H\!\left(\delta, \{f_\beta : \|\beta\|_1 = 1\}, \|\cdot\|_n\right) \le \left(\frac{2C}{\delta}\right)^{1/m} \log\left(\frac{6}{\delta}\right).$$
For $\delta \ge 1/n$ (say) we therefore have
$$H\!\left(\delta, \{f_\beta : \|\beta\|_1 = 1\}, \|\cdot\|_n\right) \le \left(\frac{2C}{\delta}\right)^{1/m} \log(6n).$$
When $m > 1/2$, one can use a minor generalization of Corollary 5.2, where the entropy bound is only required for values of $\delta > 1/n$. One then takes
$$\alpha = \frac{1}{2m}, \quad A = (C_m \log n)^{1/(2\alpha)},$$
where $C_m$ is a constant depending on $m$ and $C$. Then the value of $\lambda_0$ defined there becomes
$$\lambda_0 = \frac{4 K_0 \left(\sqrt{C_m \log n} + t\right)}{\sqrt{n}},$$
which is, for fixed $m$ and $K_0$ and a fixed (large) $t$, of order $\sqrt{\log n / n}$.
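For a concrete correlated design one can inspect the eigenvalue decay of $\hat\Sigma$ directly. The following sketch (ours; the AR(1)-type toy covariance and all parameter values are assumptions for illustration) shows the kind of strongly decreasing spectrum that the premise of Lemma 6.1 and Example 6.1 refers to.

```python
import numpy as np

# Eigenvalue decay of the Gram matrix for a strongly correlated toy design.
rng = np.random.default_rng(0)
n, p, rho = 200, 300, 0.9
cols = np.arange(p)
Sigma_pop = rho ** np.abs(cols[:, None] - cols[None, :])   # AR(1)-type covariance
X = rng.multivariate_normal(np.zeros(p), Sigma_pop, size=n)
X /= np.sqrt((X ** 2).sum(axis=0) / n)                     # normalize: ||psi_j||_n = 1

Sigma_hat = X.T @ X / n
omega2 = np.sort(np.linalg.eigvalsh(Sigma_hat))[::-1]      # omega_1^2 >= ... >= omega_p^2
print(omega2[:10])               # strongly decreasing leading eigenvalues
print(np.sum(omega2 > 1e-10))    # at most n nonzero eigenvalues
```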

6.3. Entropy based on coverings of $\{\psi_j\}$. We can consider $\{f_\beta : \|\beta\|_1 = 1\}$ as a subset of $\mathrm{conv}(\{\pm\psi_j\})$, where $\{\pm\psi_j\} := \{\psi_j\} \cup \{-\psi_j\}$ and $\mathrm{conv}(\{\pm\psi_j\})$ is its convex hull. In fact, if the $\{\psi_j\}$ form a linearly dependent system in $\mathbb{R}^n$, $\mathcal{F}$ is exactly equal to $\mathrm{conv}(\{\pm\psi_j\})$. The paper [7] gives a bound for the entropy of a convex hull for the case where the $u$-covering number of the extreme points is a polynomial in $1/u$. This result can also be found in [9]. There is a redundant log-term in these entropy bounds, see [1] and [15], but removing this log-term may result in very large constants, depending on the dimension $W$ as given in Example 6.2 (see [3] for some explicit constants). This means that when the dimension $W$ of the extreme points is large (growing with $n$, say), the simple bound with log-term we provide below in Lemma 6.2 may be better than the more involved ones. We give a bound for the entropy of $\mathcal{F}$ by balancing the $u$-covering number of $\{\psi_j\}$ and the squared radius $u^2$. The result is as in [9], with as only new element its extension to general covering numbers (i.e., not only polynomial ones). Lemma 6.2 and its proof can be found in [3].

Lemma 6.2. Let $N(u) := N(u, \{\psi_j\}, \|\cdot\|_n)$, $u > 0$. We have
$$H\!\left(2\delta, \{f_\beta : \|\beta\|_1 = 1\}, \|\cdot\|_n\right) \le \min_{0 < u < 1} \left\{ \left( N(u) + \frac{6 u^2}{\delta^2} \right) \log\left(\frac{8}{\delta}\right) \right\} + \log\left(2 N(\delta)\right).$$
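The covering number $N(u)$ entering Lemma 6.2 can be bounded from the design itself. A small greedy sketch (ours; the greedy construction is a standard device, not the paper's method):

```python
import numpy as np

def greedy_cover_size(X, u):
    """Greedy u-covering of the columns {psi_j} in the ||.||_n norm (a sketch).
    Returns the size of a u-covering set built greedily; the covering number
    N(u, {psi_j}, ||.||_n) is at most this size."""
    n, p = X.shape
    centers = []
    for j in range(p):
        psi = X[:, j]
        covered = any(np.sqrt(np.sum((psi - X[:, c]) ** 2) / n) <= u for c in centers)
        if not covered:
            centers.append(j)
    return len(centers)
```

For correlated designs with clusters of similar covariables the greedy cover is much smaller than $p$, which by Lemma 6.2 translates into a small entropy of $\mathcal{F}$.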

Example 6.2. In this example, we assume the $u$-covering numbers of $\{\psi_j\}$ are bounded by a polynomial in $1/u$. That is, we suppose that for some positive constants $W$ and $C$,
$$N(u, \{\psi_j\}, \|\cdot\|_n) \le \left(\frac{C}{u}\right)^W, \quad u > 0.$$
The constant $W$ can be thought of as the dimension of $\{\psi_j\}$. By Lemma 6.2, we can choose
$$\alpha = \frac{W}{2 + W}, \quad A = (C_W \log n)^{1/(2\alpha)},$$
and we get, as in Example 6.1,
$$\lambda_0 = \frac{4 K_0 \left(\sqrt{C_W \log n} + t\right)}{\sqrt{n}},$$
where $C_W$ is a constant depending on $W$ and $C$.

A refined analysis of the relation between compatibility constants, covering numbers and entropy is still to be carried out. We confine ourselves here to the following, rather trivial, observation (without proof).

Lemma 6.3. Consider normalized design: $\|\psi_j\|_n = 1$ for all $j$. Let $\{\psi_{j_1}, \dots, \psi_{j_N}\}$ be a maximal $u$-packing set of $\{\psi_j\}$. Then for any $S$ with $\{j_1, \dots, j_N\} \subseteq S \subsetneq \{1, \dots, p\}$ and any $L \ge 1$,
$$\phi^2(L, S) \le s u^2.$$
One may argue that, as $u$-packing sets are approximations of the original design $\{\psi_j\}$ with fewer covariables, they are good candidates for the sparsity set $S_1$ used in Theorem 4.1. Lemma 6.3 however shows that such sparsity sets will have very small compatibility constants.

6.4. Decorrelation numbers. Decorrelation numbers are closely related to packing numbers. First, define the inner product
$$\rho(\phi, \tilde\phi) := \phi^T \tilde\phi / n.$$
Note that $\hat\Sigma_{j,k} = \rho(\psi_j, \psi_k)$ and that in the case of standardized design (i.e. $\sum_{i=1}^n \psi_j(x_i) = 0$ and $\|\psi_j\|_n = 1$ for all $j$), the inner product $\rho(\psi_j, \psi_k)$ is for $j \ne k$ the (empirical) correlation between $\psi_j$ and $\psi_k$.

Definition. For $\rho > 0$, the $\rho$-decorrelation number $M(\rho)$ is the largest value of $M$ such that there exists $\{\phi_1, \dots, \phi_M\} \subseteq \{\pm\psi_j\}$ with $\rho(\phi_j, \phi_k) < \rho$ for all $j \ne k$.
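A decorrelation number can be approximated greedily from the Gram matrix. The sketch below (ours; the greedy scan is an illustration and only yields a valid witness, hence a lower bound on $M(\rho)$):

```python
import numpy as np

def decorrelation_number(X, rho):
    """Greedy lower bound on the rho-decorrelation number M(rho) (a sketch).
    Scans the signed columns {+psi_j, -psi_j} and keeps a vector whenever its
    inner product rho(.,.) with all previously kept vectors is below rho."""
    n, p = X.shape
    signed = np.hstack([X, -X])                  # the set {+psi_j, -psi_j}
    kept = []
    for k in range(signed.shape[1]):
        phi = signed[:, k]
        if all(phi @ signed[:, c] / n < rho for c in kept):
            kept.append(k)
    return len(kept)
```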

Hence, if the $\rho$-decorrelation number is small, then there are many large correlations, i.e., the design is highly correlated. It is clear that when $\|\psi_j\|_n = \|\psi_k\|_n = 1$, it holds that
$$\|\psi_j - \psi_k\|_n^2 = 2\left(1 - \rho(\psi_j, \psi_k)\right).$$
In other words, large correlations correspond to covariables that are near to each other. This can be translated into covering numbers as shown in Lemma 6.4. Its proof is straightforward and omitted.

Lemma 6.4. Consider normalized design: $\|\psi_j\|_n = 1$ for all $j$. For all $0 < u < 1$,
$$N\!\left(u, \{\pm\psi_j\}, \|\cdot\|_n\right) \le M\!\left(1 - u^2/2\right).$$

7. Conclusion. We have combined results for the prediction error of the Lasso with both compatibility conditions and entropy conditions. Small entropies of $\{f_\beta : \|\beta\|_1 = 1\}$ correspond to highly correlated design and possibly to small compatibility constants. Our analysis shows that small entropies allow for a smaller choice of the tuning parameter and possibly for a compensation of small compatibility constants. This means that the Lasso enjoys good prediction error properties, even in the case where the design is highly correlated.

8. Proofs.

Proof of Lemma 4.1. We use that for positive $u$ and $v$ and for $p \ge 1$, $q \ge 1$ with $1/p + 1/q = 1$, the conjugate inequality
$$uv \le u^p / p + v^q / q$$
holds. Taking $p = 2/(1-\alpha)$ and replacing $u$ by $u^{(1-\alpha)/2}$ gives
$$u^{(1-\alpha)/2} v \le \frac{1-\alpha}{2} u + \frac{1+\alpha}{2} v^{2/(1+\alpha)}.$$
With $p = (1+\alpha)/(2\alpha)$, and replacing $u$ by $u^{2\alpha/(1+\alpha)}$, we get
$$u^{2\alpha/(1+\alpha)} v \le \frac{2\alpha}{1+\alpha} u + \frac{1-\alpha}{1+\alpha} v^{(1+\alpha)/(1-\alpha)}.$$
Thus, writing $\lambda_0 a^{1-\alpha} b^\alpha = (a^2/2)^{(1-\alpha)/2} \left(2^{(1-\alpha)/2} \lambda_0 b^\alpha\right)$ and applying the first inequality,
$$\lambda_0 a^{1-\alpha} b^\alpha \le \frac{(1-\alpha) a^2}{4} + \frac{1+\alpha}{2} \left(2^{(1-\alpha)/2} \lambda_0\right)^{2/(1+\alpha)} b^{2\alpha/(1+\alpha)}.$$
Writing $b^{2\alpha/(1+\alpha)} = (\lambda b)^{2\alpha/(1+\alpha)} \lambda^{-2\alpha/(1+\alpha)}$ and applying the second inequality to the last term, and then bounding the resulting numerical constants by $1$, we arrive at
$$\lambda_0 a^{1-\alpha} b^\alpha \le \frac{a^2}{4} + \lambda b + \lambda_0^2 \left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)}.$$

Proof of Theorem 4.1. Since
$$\|Y - \hat f\|_n^2 + \lambda \|\hat\beta\|_1 \le \|Y - f^S\|_n^2 + \lambda \|b^S\|_1,$$
we have the Basic Inequality
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta\|_1 \le 2 \varepsilon^T (\hat f - f^S)/n + \lambda \|b^S\|_1 + \|f^S - f^0\|_n^2.$$
Hence, on $\mathcal{T}_\alpha$,
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta\|_1 \le \lambda_0 \|\hat f - f^S\|_n^{1-\alpha} \|\hat\beta - b^S\|_1^\alpha / 2 + \lambda \|b^S\|_1 + \|f^S - f^0\|_n^2.$$
Apply Lemma 4.1 (with $\lambda_0/2$ and $\lambda/2$ in place of $\lambda_0$ and $\lambda$) to find
$$\lambda_0 \|\hat f - f^S\|_n^{1-\alpha} \|\hat\beta - b^S\|_1^\alpha / 2 \le \frac{1}{4}\|\hat f - f^S\|_n^2 + \frac{\lambda}{2} \|\hat\beta - b^S\|_1 + \frac{\lambda_0^2}{4}\left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)},$$
and use
$$\frac{1}{4}\|\hat f - f^S\|_n^2 \le \frac{1}{2}\|\hat f - f^0\|_n^2 + \frac{1}{2}\|f^S - f^0\|_n^2.$$
Thus, we get on $\mathcal{T}_\alpha$,
$$\|\hat f - f^0\|_n^2 + 2\lambda \|\hat\beta\|_1 \le \lambda \|\hat\beta - b^S\|_1 + 2\lambda \|b^S\|_1 + \frac{\lambda_0^2}{2}\left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)} + 3\|f^S - f^0\|_n^2.$$
Defining $S_3 := S^c$, we rewrite this to
$$\|\hat f - f^0\|_n^2 + 2\lambda \|\hat\beta_{S_1}\|_1 + 2\lambda \|\hat\beta_{S_2 \cup S_3}\|_1 \le \lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1 + \lambda \|\hat\beta_{S_2} - (b^S)_{S_2}\|_1 + \lambda \|\hat\beta_{S_3}\|_1 + 2\lambda \|(b^S)_{S_1}\|_1 + 2\lambda \|(b^S)_{S_2}\|_1 + \frac{\lambda_0^2}{2}\left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)} + 3\|f^S - f^0\|_n^2,$$
and hence, by the triangle inequality,
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta_{S_2 \cup S_3}\|_1 \le 3\lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1 + 3\lambda \|(b^S)_{S_2}\|_1 + \frac{\lambda_0^2}{2}\left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)} + 3\|f^S - f^0\|_n^2.$$

Adding $\lambda \|(b^S)_{S_2}\|_1$ to both sides and again using the triangle inequality, we obtain
$$\|\hat f - f^0\|_n^2 + \lambda \|(\hat\beta - b^S)_{S_2 \cup S_3}\|_1 \le \underbrace{3\lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1}_{=: I} + \underbrace{4\lambda \|(b^S)_{S_2}\|_1 + \frac{\lambda_0^2}{2}\left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)} + 3\|f^S - f^0\|_n^2}_{=: II}.$$

Case i. If $I \ge II$, we arrive at
$$\|\hat f - f^0\|_n^2 + \lambda \|(\hat\beta - b^S)_{S_2 \cup S_3}\|_1 \le 6\lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1.$$
In particular, $\|(\hat\beta - b^S)_{S_1^c}\|_1 \le 6 \|(\hat\beta - b^S)_{S_1}\|_1$, so the compatibility constant $\phi(6, S_1)$ applies to $\hat\beta - b^S$. We first add a term $\lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1$ to the left and right hand side and then apply the compatibility condition to get
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 \le 7\lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1 \le \frac{7\lambda \sqrt{s_1}}{\phi(6, S_1)} \|\hat f - f^S\|_n \le \frac{1}{2}\|\hat f - f^0\|_n^2 + \frac{7}{2}\|f^S - f^0\|_n^2 + \frac{28 \lambda^2 s_1}{\phi^2(6, S_1)}.$$
Here we used $\|\hat f - f^S\|_n \le \|\hat f - f^0\|_n + \|f^S - f^0\|_n$ and the decoupling device
$$2xy \le b x^2 + y^2 / b, \quad x, y \in \mathbb{R}, \ b > 0.$$
So then
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 \le \frac{56 \lambda^2 s_1}{\phi^2(6, S_1)} + 7\|f^S - f^0\|_n^2.$$

Case ii. If $I < II$, we get
$$\|\hat f - f^0\|_n^2 + \lambda \|(\hat\beta - b^S)_{S_2 \cup S_3}\|_1 \le 2 II,$$
and hence, since also $\lambda \|(\hat\beta - b^S)_{S_1}\|_1 = I/3 < II/3$,
$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 \le \frac{7}{3} II = \frac{28}{3}\lambda \|(b^S)_{S_2}\|_1 + \frac{7}{6}\lambda_0^2\left(\frac{\lambda_0}{\lambda}\right)^{2\alpha/(1-\alpha)} + 7\|f^S - f^0\|_n^2.$$
Combining the two cases gives the statement of the theorem.

Proof of Lemma 6.1. Let $\|\beta\|_1 = 1$. Then $\|\beta\|_2 \le 1$, and hence $\|E^T \beta\|_2 \le 1$. For $N \ge V^{-1}(\delta^2)$ it holds that $\omega_{N+1}^2 \le \delta^2$ and hence
$$\sum_{j=N+1}^p \omega_j^2 (E^T \beta)_j^2 \le \omega_{N+1}^2 \sum_{j=N+1}^p (E^T \beta)_j^2 \le \delta^2.$$
We now note that $\|\beta\|_1 = 1$ implies $\|f_\beta\|_n \le 1$ and hence
$$\sum_{j=1}^N \omega_j^2 (E^T \beta)_j^2 \le 1.$$
Lemma 14.7 in [3] states that a ball with radius 1 in $N$-dimensional Euclidean space can be covered by $(3/\delta)^N$ balls with radius $\delta$ (see also Problem 2.1.6 in [15]).

References.
[1] K. Ball and A. Pajor. The entropy of convex bodies with few extreme points. London Mathematical Society Lecture Note Series, 158:25-32, 1990.
[2] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37:1705-1732, 2009.
[3] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.
[4] F. Bunea, A.B. Tsybakov, and M.H. Wegkamp. Aggregation and sparsity via l1-penalized least squares. In Proceedings of 19th Annual Conference on Learning Theory, COLT 2006, Lecture Notes in Artificial Intelligence 4005, Heidelberg, 2006. Springer Verlag.
[5] F. Bunea, A.B. Tsybakov, and M.H. Wegkamp. Aggregation for Gaussian regression. Annals of Statistics, 35:1674-1697, 2007a.
[6] F. Bunea, A. Tsybakov, and M.H. Wegkamp. Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics, 1:169-194, 2007b.
[7] R.M. Dudley. Universal Donsker classes and metric entropy. Annals of Probability, 15, 1987.
[8] V. Koltchinskii. Sparsity in penalized empirical risk minimization. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 45:7-57, 2009.
[9] D. Pollard. Empirical Processes: Theory and Applications. IMS Lecture Notes, 1990.
[10] S. van de Geer. The deterministic Lasso. The JSM Proceedings, 2007.
[11] S. van de Geer. High-dimensional generalized linear models and the Lasso. Annals of Statistics, 36:614-645, 2008.
[12] S. van de Geer. On non-asymptotic bounds for estimation in generalized linear models with highly correlated design. In Asymptotics: Particles, Processes and Inverse Problems (E.A. Cator, G. Jongbloed, C. Kraaikamp, H.P. Lopuhaä, J.A. Wellner, eds.), IMS Lecture Notes Monograph Series, volume 55, 2007.
[13] S. van de Geer and P. Bühlmann. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360-1392, 2009.
[14] S.A. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

[15] A.W. van der Vaart and J.A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer-Verlag, New York, 1996.

Seminar for Statistics
ETH Zürich
Rämistrasse
Zürich, Switzerland
