Theory for the Lasso

Recall the linear model
  Y_i = ∑_{j=1}^p β_j X_i^{(j)} + ε_i,  i = 1, ..., n,
or, in matrix notation,
  Y = Xβ + ε.
To simplify, we assume that the design X is fixed, and that ε is N(0, σ²I)-distributed. We moreover assume that the linear model holds exactly, with some true parameter value β⁰.
What is an oracle inequality?

Suppose for the moment that p ≤ n and that X has full rank p. Consider the least squares estimator in the linear model,
  β̂_LM := (XᵀX)⁻¹XᵀY.
Then the prediction error ‖X(β̂_LM − β⁰)‖₂²/σ² is χ²_p-distributed. In particular, this means that
  E‖X(β̂_LM − β⁰)‖₂²/n = σ²p/n.
In words: each parameter β_j⁰ is estimated with squared accuracy σ²/n, j = 1, ..., p. The overall squared accuracy is then (σ²/n)·p.
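As a quick sanity check (my own illustration, not part of the original slides; the dimensions, the seed and σ are arbitrary choices), the following sketch verifies by Monte Carlo that the averaged prediction error of least squares matches σ²p/n:

```python
# Simulation sketch: the rescaled least-squares prediction error is
# chi^2_p-distributed, so its mean divided by n is sigma^2 * p / n.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 10, 1.5
X = rng.standard_normal((n, p))          # fixed design, full rank p
beta0 = rng.standard_normal(p)           # true parameter beta^0

reps = 2000
errs = np.empty(reps)
for r in range(reps):
    eps = sigma * rng.standard_normal(n)
    y = X @ beta0 + eps
    beta_lm, *_ = np.linalg.lstsq(X, y, rcond=None)   # (X'X)^{-1} X'y
    errs[r] = np.sum((X @ (beta_lm - beta0)) ** 2)

print(np.mean(errs) / n)        # close to sigma^2 * p / n
print(sigma ** 2 * p / n)       # 0.1125 here
```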
Sparsity

We now turn to the situation where possibly p > n. The philosophy that will generally rescue us is to believe that in fact only a few, say s₀, of the β_j⁰ are non-zero. We use the notation
  S₀ := {j : β_j⁰ ≠ 0},
so that s₀ = |S₀|. We call S₀ the active set, and s₀ the sparsity index of β⁰.
Notation

  β_{j,S₀} := β_j 1{j ∈ S₀},  β_{j,S₀ᶜ} := β_j 1{j ∉ S₀}.
Clearly,
  β = β_{S₀} + β_{S₀ᶜ},
and
  β⁰_{S₀ᶜ} = 0.
If we knew S₀, we could simply neglect all variables X^{(j)} with j ∉ S₀. Then, by the above argument, the overall squared accuracy would be (σ²/n)·s₀.

Since S₀ is unknown, we apply the ℓ₁-penalty, i.e., the Lasso
  β̂ := arg min_β { ‖Y − Xβ‖₂²/n + λ‖β‖₁ }.

Definition: Sparsity oracle inequality. The sparsity constant φ₀ is the largest value φ₀ > 0 such that the Lasso β̂ satisfies the φ₀-sparsity oracle inequality
  ‖X(β̂ − β⁰)‖₂²/n + λ‖β̂_{S₀ᶜ}‖₁ ≤ λ²s₀/φ₀².
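In practice the Lasso above can be computed with any off-the-shelf solver. A minimal sketch with scikit-learn (an assumption on my part, the slides prescribe no implementation; note that sklearn's Lasso minimizes (1/(2n))‖y − Xβ‖₂² + α‖β‖₁, so α = λ/2 matches the objective above):

```python
# Fit the Lasso on a sparse p > n problem; lambda of the usual order
# sigma * sqrt(log p / n). All concrete sizes are arbitrary choices.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, s0, sigma = 100, 500, 5, 1.0
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s0] = 2.0                          # active set S0 = {1,...,s0}
y = X @ beta0 + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(2 * np.log(p) / n)
beta_hat = Lasso(alpha=lam / 2, fit_intercept=False).fit(X, y).coef_
print("estimated active set:", np.flatnonzero(beta_hat))
```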
A digression: the noiseless case

Let X be some measurable space, Q be a probability measure on X, and ‖·‖ be the L₂(Q) norm. Consider a fixed dictionary of functions {ψ_j}_{j=1}^p ⊂ L₂(Q), and the linear functions
  f_β(·) = ∑_{j=1}^p β_j ψ_j(·),  β ∈ ℝᵖ.
Consider moreover a fixed target
  f⁰ := ∑_{j=1}^p β_j⁰ ψ_j.
We let S₀ := {j : β_j⁰ ≠ 0} be its active set, and s₀ := |S₀| be the sparsity index of f⁰.
For some fixed λ > 0, the Lasso for the noiseless problem is
  β* := arg min_β { ‖f_β − f⁰‖² + λ‖β‖₁ },
where ‖·‖₁ is the ℓ₁-norm. We write f* := f_{β*} and let S* be the active set of β*. The Gram matrix is
  Σ := ∫ ψψᵀ dQ,  with ψ := (ψ₁, ..., ψ_p)ᵀ.
We will need certain conditions on the Gram matrix to make the theory work: we require a certain compatibility of the ℓ₁-norm with the ℓ₂-norm.

Compatibility. Let L > 0 be some constant. The compatibility constant is
  φ²_Σ(L, S₀) := φ²(L, S₀) := min{ s₀ βᵀΣβ : ‖β_{S₀}‖₁ = 1, ‖β_{S₀ᶜ}‖₁ ≤ L }.
We say that the (L, S₀)-compatibility condition is met if φ(L, S₀) > 0.
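The compatibility constant is defined through a non-convex program, but for small examples it can be approximated numerically. A crude random-search sketch (my own illustration; the Laplace proposals and the number of draws are arbitrary, and since only feasible points are sampled the returned value over-estimates the true minimum):

```python
import numpy as np

def compatibility_sq(Sigma, S0, L, n_draws=100_000, seed=0):
    """Random-search estimate (an upper bound) of phi^2(L, S0) =
    min{ s0 * b' Sigma b : ||b_S0||_1 = 1, ||b_S0c||_1 <= L }."""
    rng = np.random.default_rng(seed)
    p, s0 = Sigma.shape[0], len(S0)
    S0c = np.setdiff1d(np.arange(p), S0)
    best = np.inf
    for _ in range(n_draws):
        beta = np.zeros(p)
        a = rng.laplace(size=s0)
        beta[S0] = a / np.abs(a).sum()        # ||beta_S0||_1 = 1 exactly
        if len(S0c):
            b = rng.laplace(size=len(S0c))
            beta[S0c] = rng.uniform(0, L) * b / np.abs(b).sum()
        best = min(best, s0 * beta @ Sigma @ beta)
    return best

# Equicorrelation example from a later slide: phi^2(S0) >= 1 - rho.
p, rho = 10, 0.5
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
print(compatibility_sq(Sigma, S0=[0, 1], L=3))   # roughly >= 0.5
```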
Back to the noisy case

Lemma (Basic Inequality). We have
  ‖X(β̂ − β⁰)‖₂²/n + 2λ‖β̂‖₁ ≤ 2εᵀX(β̂ − β⁰)/n + 2λ‖β⁰‖₁.
(This follows by comparing the value of the Lasso objective at β̂ with its value at β⁰.)
We introduce the set
  T := { max_{1≤j≤p} |εᵀX^{(j)}|/n ≤ λ₀ }.
We assume that λ > λ₀, to make sure that on T we can get rid of the random part of the problem.
Let us denote the diagonal elements of the Gram matrix
  Σ̂ := XᵀX/n
by σ̂_j² := Σ̂_{j,j}, j = 1, ..., p.

Lemma. Suppose that σ² = σ̂_j² = 1 for all j. Then we have, for all t > 0 and for
  λ₀ := √( (2t + 2 log p)/n ),
that
  P(T) ≥ 1 − 2 exp[−t].
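A Monte Carlo sketch of this lemma (my own check; n, p and t are arbitrary): normalize the columns of a fixed design so that σ̂_j = 1, draw fresh N(0, 1) noise, and compare the failure frequency of T with 2 exp[−t]:

```python
# Check: P(max_j |eps' X^(j)|/n > lam0) should be at most 2*exp(-t).
import numpy as np

rng = np.random.default_rng(2)
n, p, t = 100, 200, 3.0
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))       # enforce hat-sigma_j = 1
lam0 = np.sqrt((2 * t + 2 * np.log(p)) / n)

reps, fails = 5000, 0
for _ in range(reps):
    eps = rng.standard_normal(n)          # sigma = 1
    fails += np.max(np.abs(X.T @ eps) / n) > lam0
print(fails / reps, "<=", 2 * np.exp(-t))  # ~0.01 <= ~0.10
```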
Compatibility condition (noisy case). Let L > 0 be some constant. The compatibility constant is
  φ²_{Σ̂}(L, S₀) := φ²(L, S₀) := min{ s₀ βᵀΣ̂β : ‖β_{S₀}‖₁ = 1, ‖β_{S₀ᶜ}‖₁ ≤ L }.
We say that the (L, S₀)-compatibility condition is met if φ(L, S₀) > 0.
Theorem. Suppose λ > λ₀ and that the compatibility condition holds for S₀, with
  L = (λ + λ₀)/(λ − λ₀).
Then on
  T := { max_{1≤j≤p} |εᵀX^{(j)}|/n ≤ λ₀ },
we have
  ‖X(β̂ − β⁰)‖₂²/n ≤ 4(λ + λ₀)² s₀/φ²(L, S₀),
  ‖β̂_{S₀} − β⁰‖₁ ≤ 2(λ + λ₀) s₀/φ²(L, S₀),
and
  ‖β̂_{S₀ᶜ}‖₁ ≤ 2L(λ + λ₀) s₀/φ²(L, S₀).
When does the compatibility condition hold?

[Diagram: overview of conditions implying oracle inequalities for prediction and estimation, and their relations: RIP; weak (S, 2s)-RIP; adaptive (S, 2s)-restricted regression; (S, 2s)-restricted eigenvalue; S-compatibility; coherence; adaptive (S, s)-restricted regression; (S, s)-restricted eigenvalue; weak (S, 2s)-irrepresentable; (S, 2s)-irrepresentable; (S, s)-uniform irrepresentable; with side conditions |S*∖S| ≤ s and |S*∖S| = 0.]
If Σ is non-singular, the compatibility condition holds, with φ²(S₀) ≥ Λ²_min, the latter being the smallest eigenvalue of Σ.

Example. Consider the matrix
  Σ := (1 − ρ)I + ριιᵀ,
the p × p matrix with 1's on the diagonal and ρ everywhere off the diagonal, where 0 < ρ < 1 and ι := (1, ..., 1)ᵀ is a vector of 1's. Then the smallest eigenvalue of Σ is Λ²_min = 1 − ρ, so the compatibility condition holds with φ(S₀) ≥ √(1 − ρ). (The uniform S₀-irrepresentable condition is met as well.)
Geometric interpretation

Let X_j ∈ ℝⁿ denote the j-th column of X (j = 1, ..., p). The set
  A := {Xβ_S : ‖β_S‖₁ = 1}
is the convex hull of the vectors {±X_j}_{j∈S} in ℝⁿ. Likewise, the set
  B := {Xβ_{Sᶜ} : ‖β_{Sᶜ}‖₁ ≤ L}
is the convex hull, including its interior, of the vectors {±LX_j}_{j∈Sᶜ}. The ℓ₁-eigenvalue δ(L, S) is the distance between these two sets.

[Figure: the sets A and B at distance δ(L, S).]
We note that:
- if L is large, the ℓ₁-eigenvalue will be small;
- it will also be small if the vectors in S exhibit strong correlation with those in Sᶜ;
- when the vectors {X_j}_{j∈S} are linearly dependent, it holds that {Xβ_S : ‖β_S‖₁ = 1} = {Xβ_S : ‖β_S‖₁ ≤ 1}, and hence then δ(L, S) = 0.
The difference between the compatibility constant and the squared ℓ₁-eigenvalue lies only in the normalization by the size |S| of the set S: φ²(L, S) = |S| δ²(L, S). This normalization is inspired by the orthogonal case, which we detail in the following example.

Example. Suppose that the columns of X are orthogonal, with Xⱼᵀ X_k = 0 for all j ≠ k (and normalized so that ‖X_j‖₂² = n). Then δ(L, S) = 1/√|S| and φ(L, S) = 1.
Let S_β := {j : β_j ≠ 0}. We call |S_β| the sparsity index of β. More generally, we call |S| the sparsity index of the set S.

Definition. For a set S and constant L > 0, the effective sparsity Γ²(L, S) is the inverse of the squared ℓ₁-eigenvalue, that is,
  Γ²(L, S) = 1/δ²(L, S).
Example. As a simple numerical example, let us suppose n = 2, p = 3, S = {3}, and
  X = √n ( 5/13  0  1
           12/13 1  0 ).
The ℓ₁-eigenvalue δ(L, S) is equal to the distance of X₃ to the set B, here the segment joining LX₁ and −LX₂, that is,
  δ(L, S) = max{(5 − L)/√26, 0}.
Hence, for example, for L = 3 the effective sparsity is Γ²(3, S) = 13/2. Alternatively, when
  X = √n ( 12/13 0  1
           5/13  1  0 ),
then for example δ(3, S) = 0, and hence Γ²(3, S) = ∞. This is due to the sharper angle between X₁ and X₃.
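The first case can be checked numerically: δ(L, S) is the distance, in the ‖·‖_n norm, from X₃ to the convex set B. A small sketch of this computation (my own illustration, assuming scipy is available; splitting the coefficients into positive and negative parts turns the ℓ₁ constraint into smooth bound constraints):

```python
# Distance from X_3 to B = {X beta_{S^c} : ||beta_{S^c}||_1 <= L},
# working with M = X / sqrt(n) so that ||X beta||_n = ||M beta||_2.
import numpy as np
from scipy.optimize import minimize

M = np.array([[5 / 13, 0.0, 1.0],
              [12 / 13, 1.0, 0.0]])
L = 3.0
x3, Msc = M[:, 2], M[:, :2]               # S = {3}, S^c = {1, 2}

def dist(g):                               # g = (g1+, g2+, g1-, g2-)
    gamma = g[:2] - g[2:]
    return np.linalg.norm(x3 - Msc @ gamma)

res = minimize(dist, x0=np.zeros(4), method="SLSQP",
               bounds=[(0, None)] * 4,
               constraints=[{"type": "ineq", "fun": lambda g: L - g.sum()}])
print(res.fun, "vs", (5 - L) / np.sqrt(26))   # both approx 0.392
```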
The compatibility condition is slightly weaker than the restricted eigenvalue condition of Bickel et al. [2009]. The restricted isometry property of Candes [2005] implies the restricted eigenvalue condition.
Approximating the Gram matrix

For two (positive semi-definite) matrices Σ₀ and Σ₁, we define the supremum distance
  ‖Σ₁ − Σ₀‖_∞ := max_{j,k} |(Σ₁)_{j,k} − (Σ₀)_{j,k}|.

Lemma. Assume that
  ‖β_{S₀ᶜ}‖₁ ≤ 3‖β_{S₀}‖₁  and  ‖Σ₁ − Σ₀‖_∞ ≤ λ.
Then
  | ‖f_β‖²_{Σ₁} / ‖f_β‖²_{Σ₀} − 1 | ≤ 16λs₀/φ²_compatible(Σ₀, S₀).
Corollary. We have
  φ_{Σ₁}(3, S₀) ≥ φ_{Σ₀}(3, S₀) − 4√( ‖Σ₁ − Σ₀‖_∞ s₀ ).
Example. Suppose we have a Gaussian random matrix: Σ̂ := XᵀX/n = (σ̂_{j,k}), where X = (X_{i,j}) is an n × p matrix with i.i.d. N(0, 1)-distributed entries. For all t > 0, and for
  λ(t) := √( (4t + 8 log p)/n ) + (4t + 8 log p)/n,
one has the inequality
  P( ‖Σ̂ − Σ‖_∞ ≥ λ(t) ) ≤ 2 exp[−t].
Example (continued). Hence we know, for example, that with probability at least 1 − 2 exp[−t],
  φ_compatible(Σ̂, S₀) ≥ Λ_min(Σ) − 4√( λ(t) s₀ ).
This leads to a bound on the sparsity of the form s₀ = o(1/λ(t)), which roughly says that s₀ should be of small order √(n/log p).
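A quick simulation sketch of this tail bound (my own check, with Σ = I and arbitrary n, p, t):

```python
# Check: ||Sigma_hat - Sigma||_infty rarely exceeds lambda(t).
import numpy as np

rng = np.random.default_rng(3)
n, p, t = 200, 50, 2.0
lam_t = (np.sqrt((4 * t + 8 * np.log(p)) / n)
         + (4 * t + 8 * np.log(p)) / n)

reps, fails = 2000, 0
for _ in range(reps):
    X = rng.standard_normal((n, p))
    Sigma_hat = X.T @ X / n
    fails += np.max(np.abs(Sigma_hat - np.eye(p))) > lam_t
print(fails / reps, "<=", 2 * np.exp(-t))   # 2*exp(-2) ~ 0.27
```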
Definition. We call a random variable X sub-Gaussian if, for some constants K and σ₀²,
  E exp[X²/K²] ≤ σ₀².

Theorem. Suppose X₁, ..., X_n are uniformly sub-Gaussian with constants K and σ₀². Then, for a constant η = η(K, σ₀²), it holds for all β that
  βᵀΣ̂β ≥ βᵀΣβ/3 − ((t + log p)/(nη²)) ‖β‖₁²,
with probability at least 1 − 2 exp[−t]. See Raskutti, Wainwright and Yu [2010].
General convex loss

Consider data {Z_i}_{i=1}^n, with Z_i in some space Z. Consider a linear space
  F := { f_β(·) = ∑_{j=1}^p β_j ψ_j(·) : β ∈ ℝᵖ }.
For each f ∈ F, let ρ_f : Z → ℝ be a loss function. We assume that the map f ↦ ρ_f(z) is convex for all z ∈ Z. For example, Z_i = (X_i, Y_i), and ρ is quadratic loss
  ρ_f(·, y) = (y − f(·))²,
or logistic loss
  ρ_f(·, y) = −y f(·) + log(1 + exp[f(·)]),
or minus log-likelihood loss
  ρ_f = −f + log ∫ exp[f],
etc.
We denote, for a function ρ : Z → ℝ, the empirical average by
  P_n ρ := ∑_{i=1}^n ρ(Z_i)/n,
and the theoretical mean by
  Pρ := ∑_{i=1}^n Eρ(Z_i)/n.
The Lasso is
  β̂ = arg min_β { P_n ρ_{f_β} + λ‖β‖₁ }.  (1)
We write f̂ = f_{β̂}.
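For a concrete instance of (1) beyond quadratic loss, one can take logistic loss and solve the ℓ₁-penalized problem with scikit-learn (a sketch under assumptions: sklearn's LogisticRegression minimizes ‖β‖₁ + C·∑ᵢ losses, so C = 1/(nλ) matches the averaged objective P_n ρ_{f_β} + λ‖β‖₁; the data-generating choices are arbitrary):

```python
# l1-penalized logistic regression as a convex-loss Lasso.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n, p, s0 = 200, 100, 4
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s0] = 1.5
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ beta0))).astype(int)

lam = np.sqrt(np.log(p) / n)
clf = LogisticRegression(penalty="l1", C=1 / (n * lam),
                         solver="liblinear", fit_intercept=False).fit(X, y)
print("selected:", np.flatnonzero(clf.coef_))
```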
We furthermore define the target as the minimizer of the theoretical risk,
  f⁰ := arg min_{f∈F} Pρ_f.
The excess risk is
  E(f) := P(ρ_f − ρ_{f⁰}).
Note that by definition, E(f) ≥ 0 for all f ∈ F. We will mainly examine the excess risk E(f̂) of the Lasso.
Definition. We say that the margin condition holds with strictly convex function G if
  E(f) ≥ G(‖f − f⁰‖).
In typical cases, the margin condition holds with quadratic G, that is,
  G(u) = cu², u ≥ 0,
where c is a positive constant.

Definition. Let G be a strictly convex function on [0, ∞). Its convex conjugate H is defined as
  H(v) = sup_{u≥0} {uv − G(u)}, v ≥ 0.

[Figure: the functions uv, G(u) and uv − G(u); the supremum of uv − G(u) over u is H(v).]
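As a small worked example of the conjugate (added here for completeness, not on the original slide): for G(u) = cu² with c > 0,
  H(v) = sup_{u≥0} {uv − cu²} = v²/(4c),
the supremum being attained at u = v/(2c). In particular, G(u) = u² gives H(v) = v²/4, as used in the corollary below.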
Set
  Z_M := sup_{‖β−β⁰‖₁ ≤ M} (P_n − P)(ρ_{f_β} − ρ_{f_{β⁰}}),  (2)
and
  M₀ := H( 4λ√s₀/φ(S₀) )/λ₀,  (3)
where φ(S₀) = φ_compatible(S₀). Set
  T := {Z_{M₀} ≤ λ₀M₀}.  (4)
Theorem (Oracle inequality for the Lasso). Assume the compatibility condition and the margin condition with strictly convex function G. Take λ ≥ 8λ₀. Then on the set T given in (4), we have
  E(f̂) + λ‖β̂ − β⁰‖₁ ≤ 4H( 4λ√s₀/φ(S₀) ).
Corollary. Assume quadratic margin behaviour, i.e., G(u) = u². Then H(v) = v²/4, and we obtain on T,
  E(f̂) + λ‖β̂ − β⁰‖₁ ≤ 4λ²s₀/φ²(S₀).
ℓ₂-rates

To derive rates for ‖β̂ − β⁰‖₂, we need a stronger compatibility condition.

Definition. We say that the (S₀, 2s₀)-restricted eigenvalue condition is satisfied, with constant φ = φ(S₀, 2s₀) > 0, if for all N ⊇ S₀ with |N| = 2s₀, and all β ∈ ℝᵖ satisfying
  ‖β_{S₀ᶜ}‖₁ ≤ 3‖β_{S₀}‖₁  and  |β_j| ≤ min_{k∈N∖S₀} |β_k| for all j ∉ N,
it holds that
  ‖β_N‖₂ ≤ ‖f_β‖/φ.
Lemma. Suppose the conditions of the previous theorem are met, but now with the stronger (S₀, 2s₀)-restricted eigenvalue condition, with constant φ = φ(S₀, 2s₀). Then on T,
  ‖β̂ − β⁰‖₂² ≤ 16H²( 4λ√s₀/φ² )/(λ²s₀) + λ²s₀/(4φ⁴).
In the case of quadratic margin behaviour, with G(u) = u², we then get on T,
  ‖β̂ − β⁰‖₂² ≤ 16λ²s₀/φ⁴.
Theory for ℓ₁/ℓ₂-penalties

Group Lasso:
  Y_i = ∑_{j=1}^p ( ∑_{t=1}^T X_{i,t}^{(j)} β⁰_{j,t} ) + ε_i,  i = 1, ..., n,
where the β_j⁰ := (β_{j,1}⁰, ..., β_{j,T}⁰)ᵀ have the sparsity property β_j⁰ ≡ 0 for most j.

ℓ₁/ℓ₂-penalty:
  ‖β‖_{2,1} := ∑_{j=1}^p ‖β_j‖₂.
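The ℓ₁/ℓ₂ penalty is what makes whole groups vanish: its proximal operator shrinks each group's ℓ₂ norm and sets small groups exactly to zero. A minimal sketch (my own illustration; the standard prox formula prox_{τ‖·‖₂}(x) = (1 − τ/‖x‖₂)₊ x is applied per group, everything else is an arbitrary choice):

```python
import numpy as np

def group_soft_threshold(beta, tau):
    """Prox of tau*||.||_{2,1} for beta of shape (p, T): shrink each
    row's l2 norm by tau, setting whole groups exactly to zero."""
    norms = np.linalg.norm(beta, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * beta

beta = np.array([[3.0, 4.0],    # ||.||_2 = 5   -> shrunk to norm 4
                 [0.3, 0.4]])   # ||.||_2 = 0.5 -> zeroed out entirely
print(group_soft_threshold(beta, tau=1.0))
```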
Multivariate linear model:
  Y_{i,t} = ∑_{j=1}^p X_{i,t}^{(j)} β⁰_{j,t} + ε_{i,t},  i = 1, ..., n, t = 1, ..., T,
with, for β_j⁰ := (β_{j,1}⁰, ..., β_{j,T}⁰)ᵀ, the sparsity property β_j⁰ ≡ 0 for most j.

Linear model with time-varying coefficients:
  Y_i(t) = ∑_{j=1}^p X_i^{(j)}(t) β_j⁰(t) + ε_i(t),  i = 1, ..., n, t = 1, ..., T,
where the coefficients β_j⁰(·) are smooth functions, with the sparsity property that most of the β_j⁰ ≡ 0.
High-dimensional additive model:
  Y_i = ∑_{j=1}^p f_j⁰(X_i^{(j)}) + ε_i,  i = 1, ..., n,
where the f_j⁰ are (non-parametric) smooth functions, with the sparsity property f_j⁰ ≡ 0 for most j.
Theorem. Consider the group Lasso
  β̂ = arg min_β { ‖Y − Xβ‖₂²/n + λ√T ‖β‖_{2,1} },
where λ ≥ 4λ₀, with
  λ₀ = (2/√n) √( 1 + √((4x + 4 log p)/T) + (4x + 4 log p)/T ).
Then with probability at least 1 − exp[−x], we have
  ‖Xβ̂ − f⁰‖₂²/n + λ√T ‖β̂ − β⁰‖_{2,1} ≤ 24λ²T s₀/φ₀².
Theorem. Consider the smoothed group Lasso
  β̂ := arg min_β { ‖Y − Xβ‖₂²/n + λ‖β‖_{2,1} + λ²‖Bβ‖_{2,1} },
where λ ≥ 4λ₀. Then on
  T := { 2|εᵀXβ|/n ≤ λ₀‖β‖_{2,1} + λ₀²‖Bβ‖_{2,1} for all β },
we have
  ‖f̂ − f⁰‖²_n + λ pen(β̂ − β⁰)/2 ≤ 3{ 16λ²s₀/φ₀² + 2λ²‖Bβ⁰‖_{2,1} },
where pen(β) := ‖β‖_{2,1} + λ‖Bβ‖_{2,1}.
etc.