The deterministic Lasso

Sara van de Geer

Seminar für Statistik, ETH Zürich

Abstract

We study high-dimensional generalized linear models and empirical risk minimization using the Lasso. An oracle inequality is presented, under a so-called compatibility condition. Our aim is threefold: to prove a result announced in van de Geer (2007), to provide a simple proof with simple constants, and to separate the stochastic problem from the deterministic one.

KEY WORDS: coherence, Lasso, oracle, sparsity

1. Introduction

We consider the Lasso for high-dimensional generalized linear models, introducing three new elements. Firstly, we complete the proof of an oracle inequality, which was announced in van de Geer (2007). The result is shown to hold under a so-called compatibility condition for certain norms. This condition is a weaker version of Assumption C* in van de Geer (2007), and is related to coherence conditions in, e.g., Candes and Tao (2007) and Bunea et al. (2006, 2007). Secondly, we state the result with simple constants, and give a new and rather straightforward proof. Thirdly, in Subsection 1.3, we separate the stochastic part of the problem from the deterministic part. In the subsequent sections, we then concentrate only on the deterministic part, whence the title of this paper. The advantage of this approach is that the behavior of the Lasso under various sampling schemes (e.g., dependent observations) can immediately be deduced once the stochastic problem (a probability inequality for the empirical process) is settled.

Consider data $\{Z_i\}_{i=1}^n$, with, for $i = 1, \ldots, n$, $Z_i$ in some space $\mathcal{Z}$. Let $(\bar{\mathcal{F}}, \|\cdot\|)$ be a normed real vector space, and, for each $f \in \bar{\mathcal{F}}$, let $\rho_f : \mathcal{Z} \to \mathbf{R}$ be a loss function. We assume that the map $f \mapsto \rho_f(z)$ is convex for all $z \in \mathcal{Z}$. For example, in a regression setup, one has for $i = 1, \ldots, n$, $Z_i = (X_i, Y_i)$, with covariables $X_i \in \mathcal{X}$ and response variable $Y_i \in \mathcal{Y} \subset \mathbf{R}$, and $f : \mathcal{X} \to \mathbf{R}$ is a regression function. Examples are quadratic loss,
\[ \rho_f(\cdot, y) = (y - f(\cdot))^2, \]
or logistic loss,
\[ \rho_f(\cdot, y) = -y f(\cdot) + \log(1 + \exp[f(\cdot)]), \]
etc.

1.1 The target

We denote, for a function $\rho : \mathcal{Z} \to \mathbf{R}$, the theoretical measure by
\[ P\rho := \sum_{i=1}^n \mathbf{E}\rho(Z_i)/n, \]
and the empirical measure by
\[ P_n\rho := \sum_{i=1}^n \rho(Z_i)/n. \]
Define the target
\[ f^0 := \arg\min_{f \in \bar{\mathcal{F}}} P\rho_f. \]
We assume for simplicity that the minimum is attained, and that it is unique. For $f \in \bar{\mathcal{F}}$, the excess risk is
\[ \mathcal{E}(f) := P(\rho_f - \rho_{f^0}). \]

1.2 The Lasso

Consider a given linear subspace $\mathcal{F} := \{f_\beta : \beta \in \mathbf{R}^p\} \subset \bar{\mathcal{F}}$, where $\beta \mapsto f_\beta$ is linear. The collection $\mathcal{F}$ will be the model class over which we perform empirical risk minimization. When $\mathcal{F}$ is high-dimensional, a complexity penalty can prevent overfitting. The Lasso is
\[ \hat\beta = \arg\min_\beta \bigl\{ P_n \rho_{f_\beta} + \lambda \|\beta\|_1 \bigr\}, \tag{1} \]
where
\[ \|\beta\|_1 := \sum_{j=1}^p |\beta_j|, \]
i.e., $\|\cdot\|_1$ is the $\ell_1$-norm. The parameter $\lambda$ is a smoothing, or tuning, parameter.

We examine the excess risk $\mathcal{E}(f_{\hat\beta})$ of the Lasso (1), and show that it behaves as if it knew which variables are relevant for a linear approximation of the target $f^0$. Section 2 does this under a so-called compatibility condition. Section 3 examines under what circumstances the compatibility condition is met.

In (1), the penalty weights all coefficients equally. In practice, certain terms (e.g., the constant term) will not be penalized, and a weighted version of the $\ell_1$-norm is used, with, e.g., more weight assigned to variables with large sample variance. In essence, (possibly random) weights do not alter the main issues in the theory. We therefore consider the non-weighted $\ell_1$ penalty to clarify these issues. More details on the case of empirical weights are in Bunea et al. (2006) and van de Geer (2007).
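To make the estimator (1) concrete, here is a minimal numerical sketch for the quadratic-loss case. It is not part of the paper's formal development: the proximal-gradient (ISTA) solver, the helper names (soft_threshold, lasso_ista), the synthetic design, and the choice of $\lambda$ of order $\sqrt{\log(2p)/n}$ (anticipating Subsection 2.3) are all illustrative assumptions.

\begin{verbatim}
import numpy as np

def soft_threshold(v, kappa):
    # Proximal map of kappa * ||.||_1: shrink each coordinate towards zero.
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    # Minimize P_n rho_{f_beta} + lam * ||beta||_1 with quadratic loss,
    # i.e. (1/n)||y - X beta||^2 + lam * ||beta||_1, by proximal gradient.
    n, p = X.shape
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -2.0 / n * X.T @ (y - X @ beta)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Synthetic high-dimensional example: p >> n, sparse truth.
rng = np.random.default_rng(0)
n, p, s = 100, 500, 5
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s] = 1.0
y = X @ beta_true + 0.5 * rng.standard_normal(n)
lam = 2.0 * np.sqrt(np.log(2 * p) / n)
beta_hat = lasso_ista(X, y, lam)
print("selected variables:", np.flatnonzero(np.abs(beta_hat) > 1e-6))
\end{verbatim}

On such data the estimate typically puts its largest coefficients on the first $s$ variables; this "as if it knew the relevant variables" behavior is what the oracle inequality below quantifies.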
1.3 The empirical process

Throughout, the study of the stochastic part of the problem, involving the empirical process, is set aside. It can be handled using empirical process theory. In the current paper, we simply restrict ourselves to a set $\mathcal{S}$ where the empirical process behaves well (see (6) in Lemma 2.1). This allows us to proceed with purely deterministic, and relatively straightforward, arguments.

More precisely, write for $M > 0$,
\[ Z_M := \sup_{\|\beta - \beta^*\|_1 \le M} \bigl| (P_n - P)(\rho_{f_\beta} - \rho_{f_{\beta^*}}) \bigr|. \tag{2} \]
Here, $\beta^*$ is some fixed parameter value, which will be specified in Section 2. Our results hold on the set $\mathcal{S} = \{Z_M \le \varepsilon\}$, with $M$ and $\varepsilon$ appropriately chosen. The ratio $\varepsilon/M$ will play a major role: it has to be chosen large enough so that $\mathcal{S}$ has large probability. On the other hand, $\varepsilon/M$ will form a lower bound for the smoothing parameter $\lambda$ (see (5) in Lemma 2.1).

It is shown in van de Geer (2007) that for independent observations $Z_i$, under mild conditions, and with $\varepsilon/M \asymp \sqrt{\log(2p)/n}$, the set $\mathcal{S}$ has large probability in the case of Lipschitz loss functions, that is, when $f \mapsto \rho_f(z)$ is Lipschitz for all $z$. For other loss functions (e.g., quadratic loss), similar behavior can be obtained, but only for small enough values of $M$.

In the result of Lemma 2.1, one may replace $P_n$ by any other random probability measure, say $\hat P$, in the definition (1) of $\hat\beta$, provided one also adjusts the definition (2) of $Z_M$. The result for the Lasso then goes through on $\mathcal{S}$. One should ensure that $\mathcal{S}$ contains most of the probability mass by taking $\varepsilon/M$ appropriately, and this will then put lower bounds on the smoothing parameter.

1.4 The margin behavior

We will make use of the so-called margin condition (see below), which is assumed for a neighborhood, denoted by $\mathcal{F}_\eta$, of the target $f^0$. Some of the effort goes into proving that the estimator is indeed in this neighborhood. Here, we use arguments that rely on the assumed convexity of $f \mapsto \rho_f$. These arguments also show that the stochastic part of the problem needs only to be studied locally, i.e., (2) for small values of $M$.

The margin condition requires that in the neighborhood $\mathcal{F}_\eta \subset \bar{\mathcal{F}}$ of $f^0$, the excess risk $\mathcal{E}$ is bounded from below by a strictly convex function. This is true in many particular cases for an $L_\infty$ neighborhood $\mathcal{F}_\eta = \{\|f - f^0\|_\infty \le \eta\}$, for $\bar{\mathcal{F}}$ being a class of functions.

Definition. We say that the margin condition holds with strictly convex function $G$, if for all $f \in \mathcal{F}_\eta$, we have
\[ \mathcal{E}(f) \ge G(\|f - f^0\|). \]

Indeed, in typical cases, the margin condition holds with quadratic function $G$, that is,
\[ G(u) = cu^2, \quad u \ge 0, \]
where $c$ is a positive constant.

We also need the notion of convex conjugate of a function.

Definition. Let $G$ be a strictly convex function on $[0, \infty)$, with $G(0) = 0$. The convex conjugate $H$ of $G$ is defined as
\[ H(v) = \sup_{u} \{uv - G(u)\}, \quad v \ge 0. \]

Note that when $G$ is quadratic, its convex conjugate $H$ is quadratic as well. The function $H$ will appear in the estimation error term.

2. An oracle inequality

Our goal is now to show that the estimator has oracle properties, namely, to show, for a set $B \subset \mathbf{R}^p$ as large as possible, that
\[ \mathcal{E}(f_{\hat\beta}) \le {\rm const.} \min_{\beta \in B} \Bigl\{ \mathcal{E}(f_\beta) + H\bigl(\lambda\sqrt{J_\beta}\bigr) \Bigr\}, \tag{3} \]
with large probability. Here, $H$ is the convex conjugate of the function $G$ appearing in the margin condition. The right hand side of (3), and hence the excess risk of the Lasso, is small if the target $f^0$ can be well approximated by an $f_\beta$ with only a few of the coefficients $\beta_j$ ($j = 1, \ldots, p$) non-zero, that is, by a sparse $f_\beta$. To arrive at such an inequality, we need a certain amount of compatibility between the $\ell_1$-norm and the norm $\|\cdot\|$ on the vector space $\bar{\mathcal{F}}$.
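As a quick illustration of the conjugate pair $(G, H)$: for the quadratic margin $G(u) = cu^2$ one gets $H(v) = v^2/(4c)$ in closed form, and this is easy to confirm numerically. The snippet below is only a sanity check of the definition; the grid, the value of $c$, and the helper name conjugate are arbitrary choices.

\begin{verbatim}
import numpy as np

def conjugate(G, v, u_grid):
    # H(v) = sup_u { u*v - G(u) }, approximated by a maximum over a grid.
    return np.max(u_grid * v - G(u_grid))

c = 2.0
G = lambda u: c * u ** 2
u_grid = np.linspace(0.0, 50.0, 200001)
for v in [0.5, 1.0, 3.0]:
    print(v, conjugate(G, v, u_grid), v ** 2 / (4 * c))  # the two agree
\end{verbatim}

With $G(u) = cu^2$, the estimation error term $H(2\lambda\sqrt{J_{\beta^*}}/(\varphi^*\delta))$ appearing in Subsection 2.2 is proportional to $\lambda^2 J_{\beta^*}$, which is where the $J_{\beta^*}\log(2p)/n$ rate of Subsection 2.3 comes from.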
νj > 0, if for all β R p that satisfy j / J β j 3 β j, it holds that β j J f β /φj + j / J β j / + νj 4 More details on this condition are in Section 3 22 The result Let 0 < δ < be fixed but otherwise arbitrary For β R p, set J β := {j : β j 0} and J β := J β The oracle β is defined as { } 2λ β Jβ := arg min + δef β + 2δH, β B φj δ Here, the minimum is over the set B of all β, for which the compatibility condition holds for J β Let J := {j : βj 0}, and let φ = φj and = νj Set 2λ ε Jβ := + δef β + 2δH φ δ
Lemma 2.1 (Lasso under the compatibility condition). Consider a constant $\lambda_0$ satisfying the inequality
\[ \lambda \ge 8(2 + \nu^*)\lambda_0 / \nu^*, \tag{5} \]
and let $M^* = \varepsilon^*/\lambda_0$. Assume the margin condition with strictly convex function $G$, and that the oracle $f_{\beta^*}$ is in the neighborhood $\mathcal{F}_\eta$ of the target, as well as $f_\beta \in \mathcal{F}_\eta$ for all $\|\beta - \beta^*\|_1 \le M^*$. Then on the set
\[ \mathcal{S} := \{Z_{M^*} \le \varepsilon^*\}, \tag{6} \]
we have either
\[ (1 - \delta)\mathcal{E}(f_{\hat\beta}) + \frac{\nu^*}{2 + \nu^*}\lambda\|\hat\beta - \beta^*\|_1 \le 2\varepsilon^*, \]
or
\[ \mathcal{E}(f_{\hat\beta}) + \lambda\|\hat\beta - \beta^*\|_1 \le \frac{4(2 + \nu^*)}{\nu^*}\varepsilon^*. \]

2.3 Asymptotics in $n$

In the case of independent observations, the set $\mathcal{S}$ has large probability under general conditions, for $\lambda_0 \asymp \sqrt{\log(2p)/n}$ (see van de Geer 2007). Taking $\lambda$ also of this order, assuming a quadratic margin, and assuming in addition that $\varphi^*$ stays away from zero, the estimation error $H(2\lambda\sqrt{J_{\beta^*}}/(\varphi^*\delta))$ behaves like $J_{\beta^*}\log(2p)/n$.

2.4 The case $\nu^* = \infty$

We observe that often $\lambda_0$ can be taken independently of oracle values, whereas (5) requires a lower bound on the smoothing parameter $\lambda$ which depends on the unknown oracle value $\nu^*$. The best possible situation is when $\nu^*$ is arbitrarily large. Now, the case $\nu(J) > 2$ corresponds, after a reparametrization, to the case $\nu(J) = \infty$. Indeed, if $\nu(J) > 2$ and $\sum_{j \notin J}|\beta_j| \le 3\sum_{j \in J}|\beta_j|$, then (4) implies
\[ \frac{\nu(J) - 2}{1 + \nu(J)} \sum_{j \in J} |\beta_j| \le \sqrt{|J|}\,\|f_\beta\|/\varphi(J), \]
that is, the compatibility condition then also holds with the constants $\varphi(J)(\nu(J) - 2)/(1 + \nu(J))$ and $\nu(J) = \infty$.

With $\nu^* = \infty$, Lemma 2.1 reads as follows.

Corollary 2.1. Let $\lambda_0 \le \lambda/8$, and $M^* := \varepsilon^*/\lambda_0$. Assume the conditions of Lemma 2.1, with $\nu^* = \infty$. Then on the set $\mathcal{S}$ given in (6), we have either
\[ (1 - \delta)\mathcal{E}(f_{\hat\beta}) + \lambda\|\hat\beta - \beta^*\|_1 \le 2\varepsilon^*, \]
or
\[ \mathcal{E}(f_{\hat\beta}) + \lambda\|\hat\beta - \beta^*\|_1 \le 4\varepsilon^*. \]

3. On the compatibility condition

Let $\|f_\beta\|^2 := \beta^T\Sigma\beta$, with $\Sigma$ a $p \times p$ matrix. The entries of $\Sigma$ are denoted by $\sigma_{j,k}$ ($j, k \in \{1, \ldots, p\}$). We first present a simple lemma.

Lemma 3.1. Suppose that $\Sigma$ has smallest eigenvalue $\varphi^2 > 0$. Then the compatibility condition holds for all index sets $J$, with $\varphi(J) = \varphi$ and $\nu(J) = \infty$.

3.1 The coherence lemma

For a vector $v$, and for $1 \le q \le \infty$, we shall write $\|v\|_q = (\sum_j |v_j|^q)^{1/q}$, with the usual extension for $q = \infty$. We now fix an $L \in \{1, \ldots, p\}$. Consider a partition of the form
\[ \Sigma = \begin{pmatrix} \Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{pmatrix}, \]
where $\Sigma_{1,1}$ is an $L \times L$ matrix, $\Sigma_{2,1}$ is a $(p - L) \times L$ matrix and $\Sigma_{1,2} = \Sigma_{2,1}^T$ is its transpose, and where $\Sigma_{2,2}$ is a $(p - L) \times (p - L)$ matrix. Then for any $b_1 \in \mathbf{R}^L$ and $b_2 \in \mathbf{R}^{p-L}$, and for $b^T := (b_1^T, b_2^T)$, one has
\[ b_1^T\Sigma_{1,1}b_1 \le b^T\Sigma b + 2\|\Sigma_{2,1}\|_q \|b_1\|_2 \|b_2\|_r, \]
with $1/q + 1/r = 1$, and
\[ \|\Sigma_{2,1}\|_q := \sup_{a \in \mathbf{R}^L,\ \|a\|_2 = 1} \|\Sigma_{2,1}a\|_q. \]

Consider now a subset $\mathcal{L} \subset \{1, \ldots, p\}$. We introduce the $|\mathcal{L}| \times |\mathcal{L}|$ matrix
\[ \Sigma_{1,1}(\mathcal{L}) := (\sigma_{j,k})_{j,k \in \mathcal{L}}, \]
and the $(p - |\mathcal{L}|) \times |\mathcal{L}|$ matrix
\[ \Sigma_{2,1}(\mathcal{L}) := (\sigma_{j,k})_{j \notin \mathcal{L},\, k \in \mathcal{L}}. \]
We let $\rho^2(\mathcal{L})$ be the smallest eigenvalue of $\Sigma_{1,1}(\mathcal{L})$, and take $\vartheta_q^2(\mathcal{L}) := \|\Sigma_{2,1}(\mathcal{L})\|_q$. We moreover define
\[ \varphi(J) := \min_{\mathcal{L} \supset J :\ |\mathcal{L}| = L} \rho(\mathcal{L}), \quad \theta_q^2(J) := \max_{\mathcal{L} \supset J :\ |\mathcal{L}| = L} \vartheta_q^2(\mathcal{L}). \]

Lemma 3.2 (Coherence lemma). Fix $J \subset \{1, \ldots, p\}$, with cardinality $|J| := J \le L$. Then we have for any $\beta \in \mathbf{R}^p$,
\[ \sum_{j \in J} |\beta_j| \le \sqrt{J}\,\|f_\beta\|/\varphi(J) + \sum_{j \notin J} |\beta_j|/(1 + \nu(J)), \]
with, for $1/q + 1/r = 1$,
\[ \frac{1}{1 + \nu(J)} = \begin{cases} \dfrac{2\sqrt{J}\, q^{1/r}\, \theta_q^2(J)}{(L + 1 - J)^{1/q}\, \varphi^2(J)}, & q < \infty, \\[2ex] \dfrac{2\sqrt{J}\, \theta_\infty^2(J)}{\varphi^2(J)}, & q = \infty. \end{cases} \]

The coherence lemma uses an auxiliary lemma (Lemma 4.1), which is based on ideas in Candes and Tao (2007).
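For small $p$, the constants $\varphi(J)$ and $\theta_q^2(J)$ can be computed by brute force, since the minimum and maximum run over all supersets $\mathcal{L} \supset J$ of size $L$. The sketch below (an illustration on a random test matrix, with the hypothetical helper coherence_constants) does this for $q = \infty$, the one case where the operator norm $\|\Sigma_{2,1}(\mathcal{L})\|_\infty$ over the $\ell_2$ ball has a simple closed form, namely the largest $\ell_2$-norm of a row.

\begin{verbatim}
import numpy as np
from itertools import combinations

def coherence_constants(Sigma, J, L_size):
    # phi(J): min over supersets L of sqrt(smallest eigval of Sigma_{1,1}(L));
    # theta_inf^2(J): max over supersets L of ||Sigma_{2,1}(L)||_inf,
    # which over the l2 ball equals the largest row l2-norm.
    p = Sigma.shape[0]
    rest = [j for j in range(p) if j not in J]
    phis, thetas = [], []
    for extra in combinations(rest, L_size - len(J)):
        L = sorted(set(J) | set(extra))
        out = [j for j in range(p) if j not in L]
        phis.append(np.sqrt(np.linalg.eigvalsh(Sigma[np.ix_(L, L)])[0]))
        thetas.append(np.linalg.norm(Sigma[np.ix_(out, L)], axis=1).max())
    return min(phis), max(thetas)

rng = np.random.default_rng(2)
p = 10
B = rng.standard_normal((p, 2 * p))
Sigma = B @ B.T / (2 * p)
J = (0, 1)
phi_J, theta2_J = coherence_constants(Sigma, J, L_size=4)
inv = 2 * np.sqrt(len(J)) * theta2_J / phi_J ** 2  # = 1/(1 + nu(J)), q = inf
print("phi(J) =", phi_J, " nu(J) =", 1 / inv - 1)  # useful only if nu(J) > 0
\end{verbatim}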
3.2 Relation with other coherence conditions

In the coherence lemma, the choice of $L \ge J$ and $q$ is open. They are allowed to depend on $J$, and for our purposes should be taken in such a way that $\varphi(J)$ and $\nu(J)$ are as large as possible. To gain insight into what the coherence lemma can bring us, let us again consider the partition
\[ \Sigma = \begin{pmatrix} \Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{pmatrix}. \]
Note first that $\|\Sigma_{2,1}\|_2^2$ is the largest eigenvalue of the matrix $\Sigma_{1,2}\Sigma_{2,1}$. The coherence lemma with $q = 2$ is along the lines of the coherence condition in Candes and Tao (2007). They choose $L$ not larger than $3J$.

It is easy to see that
\[ \|\Sigma_{2,1}\|_q \le \Bigl( \sum_{j=L+1}^p \Bigl( \sum_{k=1}^L \sigma_{j,k}^2 \Bigr)^{q/2} \Bigr)^{1/q}. \tag{7} \]
Thus,
\[ \|\Sigma_{2,1}\|_q \le \sqrt{L}\,\Bigl( \sum_{j=L+1}^p \max_{k \le L} |\sigma_{j,k}|^q \Bigr)^{1/q}. \]
For $q = \infty$, this reads
\[ \|\Sigma_{2,1}\|_\infty \le \sqrt{L} \max_{L+1 \le j \le p}\ \max_{k \le L} |\sigma_{j,k}|. \]
Note also that with $q = \infty$, the choice $L = J$ in the coherence lemma gives the best result. With this choice of $L$, the coherence lemma with $q = \infty$ gives a positive value for $\nu(J)$, as required in the compatibility condition, if
\[ 2J \max_{j \notin J}\ \max_{k \in J} |\sigma_{j,k}| / \varphi^2(J) < 1. \]
This is in the spirit of the maximal local coherence condition in Bunea et al. (2007). It also follows from (7) that
\[ \|\Sigma_{2,1}\|_q \le \Bigl( \sum_{j=L+1}^p \Bigl( \sum_{k=1}^L |\sigma_{j,k}| \Bigr)^q \Bigr)^{1/q}. \]
Invoking this bound with $q = 1$ and $L = J$ in the coherence lemma leads to a condition which is similar to the cumulative local coherence condition in Bunea et al. (2007).
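The chain of bounds above is easy to check numerically in the two cases where both sides are available in closed form: $q = \infty$ (largest row $\ell_2$-norm versus $\sqrt{L}$ times the largest entry) and $q = 2$, where (7) reduces to the Frobenius norm bounding the spectral norm. The snippet is only a sanity check on a random test matrix, not part of the argument.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
p, L = 12, 4
B = rng.standard_normal((p, 3 * p))
Sigma = B @ B.T / (3 * p)
S21 = Sigma[L:, :L]                          # the block Sigma_{2,1}
exact_inf = np.linalg.norm(S21, axis=1).max()  # ||Sigma_{2,1}||_inf
bound_inf = np.sqrt(L) * np.abs(S21).max()     # sqrt(L) max |sigma_{j,k}|
exact_2 = np.linalg.norm(S21, 2)               # ||Sigma_{2,1}||_2 (spectral)
bound_2 = np.linalg.norm(S21)                  # (7) with q = 2 (Frobenius)
print(exact_inf <= bound_inf, exact_2 <= bound_2)  # both bounds hold
\end{verbatim}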
4. Proofs

Proof of Lemma 2.1. We write $\beta^* = \beta^*_{\rm in} + \beta^*_{\rm out}$, with
\[ \beta^*_{j,{\rm in}} = \beta^*_j\,{\bf 1}\{j \in \mathcal{J}^*\} \quad {\rm and} \quad \beta^*_{j,{\rm out}} = \beta^*_j\,{\bf 1}\{j \notin \mathcal{J}^*\}, \]
and similarly for other vectors. Note thus that $\|\beta^*_{\rm in}\|_1 = \|\beta^*\|_1$ and $\|\beta^*_{\rm out}\|_1 \equiv 0$. Throughout the proof, we assume we are on the set $\mathcal{S}$. Let
\[ s := \frac{M^*}{M^* + \|\hat\beta - \beta^*\|_1}, \quad {\rm and} \quad \tilde\beta := s\hat\beta + (1 - s)\beta^*. \]
Write $\tilde f := f_{\tilde\beta}$, $\tilde{\mathcal{E}} := \mathcal{E}(\tilde f)$, and $f^* := f_{\beta^*}$, $\mathcal{E}^* := \mathcal{E}(f^*)$. The proof is structured as follows. We first note that $\|\tilde\beta - \beta^*\|_1 = s\|\hat\beta - \beta^*\|_1 \le M^*$, by the definition of $s$, and prove the result with $\hat\beta$ replaced by $\tilde\beta$. As a byproduct, we obtain that $\|\hat\beta - \beta^*\|_1 \le M^*$. The latter allows us to repeat the proof with $\tilde\beta$ replaced by $\hat\beta$.

By the convexity of $f \mapsto \rho_f$, and of $\|\cdot\|_1$, and since $\hat\beta$ minimizes (1),
\[ P_n\rho_{\tilde f} + \lambda\|\tilde\beta\|_1 \le s\bigl[P_n\rho_{f_{\hat\beta}} + \lambda\|\hat\beta\|_1\bigr] + (1 - s)\bigl[P_n\rho_{f^*} + \lambda\|\beta^*\|_1\bigr] \le P_n\rho_{f^*} + \lambda\|\beta^*\|_1. \tag{8} \]
Thus,
\[ \tilde{\mathcal{E}} + \lambda\|\tilde\beta\|_1 = -(P_n - P)(\rho_{\tilde f} - \rho_{f^*}) + P_n(\rho_{\tilde f} - \rho_{f^*}) + \mathcal{E}^* + \lambda\|\tilde\beta\|_1 \le -(P_n - P)(\rho_{\tilde f} - \rho_{f^*}) + \mathcal{E}^* + \lambda\|\beta^*\|_1 \le Z_{M^*} + \mathcal{E}^* + \lambda\|\beta^*\|_1. \]
So on $\mathcal{S}$,
\[ \tilde{\mathcal{E}} + \lambda\|\tilde\beta\|_1 \le \varepsilon^* + \mathcal{E}^* + \lambda\|\beta^*\|_1. \tag{9} \]
Since $\|\tilde\beta\|_1 = \|\tilde\beta_{\rm in}\|_1 + \|\tilde\beta_{\rm out}\|_1 \ge \|\beta^*\|_1 - \|\tilde\beta_{\rm in} - \beta^*\|_1 + \|\tilde\beta_{\rm out}\|_1$, and since $\mathcal{E}^* \le \varepsilon^*$, we therefore have
\[ \tilde{\mathcal{E}} + \lambda\|\tilde\beta_{\rm out}\|_1 \le \varepsilon^* + \mathcal{E}^* + \lambda\|\tilde\beta_{\rm in} - \beta^*\|_1 \le 2\varepsilon^* + \lambda\|\tilde\beta_{\rm in} - \beta^*\|_1. \tag{10} \]

Case 1. If
\[ \lambda\|\tilde\beta_{\rm in} - \beta^*\|_1 \le \frac{2 + \nu^*}{\nu^*}\varepsilon^*, \]
then (10) implies
\[ \tilde{\mathcal{E}} + \lambda\|\tilde\beta_{\rm out}\|_1 \le \Bigl(2 + \frac{2 + \nu^*}{\nu^*}\Bigr)\varepsilon^*, \]
and hence
\[ \tilde{\mathcal{E}} + \lambda\|\tilde\beta - \beta^*\|_1 \le \frac{4(2 + \nu^*)}{\nu^*}\varepsilon^*. \]
But then
\[ \|\tilde\beta - \beta^*\|_1 \le \frac{4(2 + \nu^*)}{\nu^*}\frac{\varepsilon^*}{\lambda} = \frac{4(2 + \nu^*)\lambda_0}{\nu^*\lambda}M^* \le \frac{M^*}{2}, \]
since $\lambda \ge 8(2 + \nu^*)\lambda_0/\nu^*$. By the definition of $s$, this implies that $\|\hat\beta - \beta^*\|_1 \le M^*$.

Case 2. If
\[ \lambda\|\tilde\beta_{\rm in} - \beta^*\|_1 \ge \frac{2 + \nu^*}{\nu^*}\varepsilon^* \ge \varepsilon^*, \]
we get from (10) that
\[ \lambda\|\tilde\beta_{\rm out}\|_1 \le 2\varepsilon^* + \lambda\|\tilde\beta_{\rm in} - \beta^*\|_1 \le 3\lambda\|\tilde\beta_{\rm in} - \beta^*\|_1. \]
This means that we can apply the compatibility condition to $\tilde\beta - \beta^*$. Recall that inequality (9) implies (10):
\[ \tilde{\mathcal{E}} + \lambda\|\tilde\beta_{\rm out}\|_1 \le \varepsilon^* + \mathcal{E}^* + \lambda\|\tilde\beta_{\rm in} - \beta^*\|_1. \]
We have
\[ \lambda\|\tilde\beta_{\rm out}\|_1 = \frac{2}{2 + \nu^*}\lambda\|\tilde\beta_{\rm out}\|_1 + \frac{\nu^*}{2 + \nu^*}\lambda\|\tilde\beta_{\rm out}\|_1, \]
and
\[ \|\tilde\beta_{\rm out}\|_1 = \|\tilde\beta - \beta^*\|_1 - \|\tilde\beta_{\rm in} - \beta^*\|_1. \]
So
\[ \tilde{\mathcal{E}} + \frac{2}{2 + \nu^*}\lambda\|\tilde\beta_{\rm out}\|_1 + \frac{\nu^*}{2 + \nu^*}\lambda\|\tilde\beta - \beta^*\|_1 \le \varepsilon^* + \mathcal{E}^* + \frac{2 + 2\nu^*}{2 + \nu^*}\lambda\|\tilde\beta_{\rm in} - \beta^*\|_1. \]
With the compatibility condition on $\mathcal{J}^*$, we find
\[ \|\tilde\beta_{\rm in} - \beta^*\|_1 \le \sqrt{J^*}\,\|\tilde f - f^*\|/\varphi^* + \|\tilde\beta_{\rm out}\|_1/(1 + \nu^*). \tag{11} \]
Thus, since $(2 + 2\nu^*)/((2 + \nu^*)(1 + \nu^*)) = 2/(2 + \nu^*)$,
\[ \tilde{\mathcal{E}} + \frac{2}{2 + \nu^*}\lambda\|\tilde\beta_{\rm out}\|_1 + \frac{\nu^*}{2 + \nu^*}\lambda\|\tilde\beta - \beta^*\|_1 \le \varepsilon^* + \mathcal{E}^* + \frac{2 + 2\nu^*}{2 + \nu^*}\lambda\sqrt{J^*}\,\|\tilde f - f^*\|/\varphi^* + \frac{2}{2 + \nu^*}\lambda\|\tilde\beta_{\rm out}\|_1. \]
The term with $\|\tilde\beta_{\rm out}\|_1$ cancels out. Moreover, we may use the bound $(2 + 2\nu^*)/(2 + \nu^*) \le 2$. This allows us to conclude that
\[ \tilde{\mathcal{E}} + \frac{\nu^*}{2 + \nu^*}\lambda\|\tilde\beta - \beta^*\|_1 \le \varepsilon^* + \mathcal{E}^* + 2\lambda\sqrt{J^*}\,\|\tilde f - f^*\|/\varphi^*. \]
Now, because we assumed $f_{\beta^*} \in \mathcal{F}_\eta$, and since also $f_{\tilde\beta} \in \mathcal{F}_\eta$, we can invoke the margin condition, together with the triangle inequality $\|\tilde f - f^*\| \le \|\tilde f - f^0\| + \|f^* - f^0\|$ and the conjugate inequality $uv \le G(u) + H(v)$, to arrive at
\[ 2\lambda\sqrt{J^*}\,\|\tilde f - f^*\|/\varphi^* \le 2\delta H\Bigl(\frac{2\lambda\sqrt{J^*}}{\varphi^*\delta}\Bigr) + \delta\tilde{\mathcal{E}} + \delta\mathcal{E}^*. \]
It follows that
\[ \tilde{\mathcal{E}} + \frac{\nu^*}{2 + \nu^*}\lambda\|\tilde\beta - \beta^*\|_1 \le \varepsilon^* + (1 + \delta)\mathcal{E}^* + 2\delta H\Bigl(\frac{2\lambda\sqrt{J^*}}{\varphi^*\delta}\Bigr) + \delta\tilde{\mathcal{E}} = \varepsilon^* + \varepsilon^* + \delta\tilde{\mathcal{E}} = 2\varepsilon^* + \delta\tilde{\mathcal{E}}, \]
or
\[ (1 - \delta)\tilde{\mathcal{E}} + \frac{\nu^*}{2 + \nu^*}\lambda\|\tilde\beta - \beta^*\|_1 \le 2\varepsilon^*. \tag{12} \]
This yields
\[ \|\tilde\beta - \beta^*\|_1 \le \frac{2(2 + \nu^*)}{\nu^*}\frac{\varepsilon^*}{\lambda} = \frac{2(2 + \nu^*)\lambda_0}{\nu^*\lambda}M^* \le \frac{M^*}{4}, \]
where the last inequality is ensured by the assumption $\lambda \ge 8(2 + \nu^*)\lambda_0/\nu^*$. But $\|\tilde\beta - \beta^*\|_1 \le M^*/4$ implies $\|\hat\beta - \beta^*\|_1 \le M^*/3 \le M^*$.

So in both Case 1 and Case 2, we arrive at $\|\hat\beta - \beta^*\|_1 \le M^*$. Now, repeat the above arguments with $\tilde\beta$ replaced by $\hat\beta$. Let $\hat{\mathcal{E}} := \mathcal{E}(f_{\hat\beta})$. Then either, from Case 1 with this replacement,
\[ \hat{\mathcal{E}} + \lambda\|\hat\beta - \beta^*\|_1 \le \frac{4(2 + \nu^*)}{\nu^*}\varepsilon^*, \]
or we are in Case 2 and can apply the compatibility condition, and we arrive at (12) with this replacement:
\[ (1 - \delta)\hat{\mathcal{E}} + \frac{\nu^*}{2 + \nu^*}\lambda\|\hat\beta - \beta^*\|_1 \le 2\varepsilon^*. \]

Proof of Lemma 3.1. Let $\beta \in \mathbf{R}^p$ be arbitrary. It is clear that $\|\beta\|_2^2 \le \|f_\beta\|^2/\varphi^2$. Hence,
\[ \sum_{j \in J} |\beta_j| \le \sqrt{|J|}\Bigl(\sum_{j \in J}\beta_j^2\Bigr)^{1/2} \le \sqrt{|J|}\,\|\beta\|_2 \le \sqrt{|J|}\,\|f_\beta\|/\varphi. \]

Proof of Lemma 3.2. Let $\beta = \beta_{\rm in} + \beta_{\rm out}$, with $\beta_{j,{\rm in}} := \beta_j{\bf 1}\{j \in J\}$ and $\beta_{j,{\rm out}} := \beta_j{\bf 1}\{j \notin J\}$, $j = 1, \ldots, p$. Let $\mathcal{L} \supset J$, with $\mathcal{L}\backslash J$ being the set of indices of the $L - J$ largest $|\beta_j|$ with $j \notin J$. Define $b_{j,1} := \beta_j{\bf 1}\{j \in \mathcal{L}\}$ and $b_{j,2} := \beta_j{\bf 1}\{j \notin \mathcal{L}\}$. We have
\[ b_1^T\Sigma_{1,1}(\mathcal{L})b_1 \le b^T\Sigma b + 2\theta_q^2(J)\|b_1\|_2\|b_2\|_r = \|f_\beta\|^2 + 2\theta_q^2(J)\|b_1\|_2\|b_2\|_r. \]
By the auxiliary lemma below, for $q < \infty$,
\[ \|b_2\|_r \le q^{1/r}\|\beta_{\rm out}\|_1/(L + 1 - J)^{1/q}. \tag{13} \]
This yields
\[ b_1^T\Sigma_{1,1}(\mathcal{L})b_1 \le \|f_\beta\|^2 + 2\theta_q^2(J)\|b_1\|_2\,\frac{q^{1/r}\|\beta_{\rm out}\|_1}{(L + 1 - J)^{1/q}}. \]
It follows that
\[ \|b_1\|_2^2 \le \frac{b_1^T\Sigma_{1,1}(\mathcal{L})b_1}{\varphi^2(J)} \le \frac{\|f_\beta\|^2}{\varphi^2(J)} + 2\,\frac{\theta_q^2(J)}{\varphi^2(J)}\,\|b_1\|_2\,\frac{q^{1/r}\|\beta_{\rm out}\|_1}{(L + 1 - J)^{1/q}}. \]
This implies (using that $x^2 \le a^2 + bx$, with $a, b \ge 0$, entails $x \le a + b$)
\[ \|b_1\|_2 \le \frac{\|f_\beta\|}{\varphi(J)} + 2\,\frac{\theta_q^2(J)}{\varphi^2(J)}\,\frac{q^{1/r}\|\beta_{\rm out}\|_1}{(L + 1 - J)^{1/q}}. \]
We thus arrive at
\[ \sum_{j \in J} |\beta_j| = \|\beta_{\rm in}\|_1 \le \sqrt{J}\,\|\beta_{\rm in}\|_2 \le \sqrt{J}\,\|b_1\|_2 \le \frac{\sqrt{J}\,\|f_\beta\|}{\varphi(J)} + 2\sqrt{J}\,\frac{\theta_q^2(J)}{\varphi^2(J)}\,\frac{q^{1/r}\|\beta_{\rm out}\|_1}{(L + 1 - J)^{1/q}}. \]
The result with $q = \infty$ follows in the same manner, using, instead of (13), the trivial bound $\|b_2\|_1 \le \|\beta_{\rm out}\|_1$.

Lemma 4.1 (Auxiliary lemma). Let $b_1 \ge b_2 \ge \cdots \ge 0$. For $1 < r \le \infty$, $1/q + 1/r = 1$, and $L \in \{0, 1, \ldots\}$, we have
\[ \Bigl(\sum_{j = L+1}^\infty b_j^r\Bigr)^{1/r} \le \frac{q^{1/r}}{(L + 1)^{1/q}}\sum_{j=1}^\infty b_j. \]

Proof. We have
\[ \sum_{j = L+1}^\infty j^{-r} \le (L + 1)^{-r} + \int_{L+1}^\infty x^{-r}dx = (L + 1)^{-r} + \frac{q}{r}(L + 1)^{1-r} \le q(L + 1)^{1-r}, \]
since $1/(r - 1) = q/r$ and $q + r = qr$. Moreover, for all $j$,
\[ jb_j \le \sum_{l=1}^j b_l \le \|b\|_1, \]
so that $b_j \le \|b\|_1/j$. Hence
\[ \sum_{j = L+1}^\infty b_j^r \le \|b\|_1^r\sum_{j = L+1}^\infty j^{-r} \le q\|b\|_1^r(L + 1)^{1-r}. \]
Taking the $r$-th root gives the result, as $(1 - r)/r = -1/q$.
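Lemma 4.1 is elementary but easy to get wrong when translating between $q$ and $r$; a quick Monte-Carlo check of the inequality (an illustrative test, not part of the proof) can be run as follows.

\begin{verbatim}
import numpy as np

# Check (sum_{j>L} b_j^r)^{1/r} <= q^{1/r} (L+1)^{-1/q} ||b||_1
# on random ordered vectors, for a few conjugate pairs 1/q + 1/r = 1.
rng = np.random.default_rng(4)
for _ in range(1000):
    b = np.sort(rng.exponential(size=50))[::-1]  # b_1 >= b_2 >= ... >= 0
    L = int(rng.integers(0, 20))
    for q, r in [(2.0, 2.0), (1.5, 3.0), (4.0, 4.0 / 3.0)]:
        lhs = (b[L:] ** r).sum() ** (1.0 / r)
        rhs = q ** (1.0 / r) * (L + 1) ** (-1.0 / q) * b.sum()
        assert lhs <= rhs + 1e-12
print("Lemma 4.1 bound held in all sampled cases")
\end{verbatim}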
REFERENCES

Bunea, F., Tsybakov, A.B. and Wegkamp, M.H. (2006). Sparsity oracle inequalities for the Lasso. Submitted.

Bunea, F., Tsybakov, A.B. and Wegkamp, M.H. (2007). Sparse density estimation and aggregation with $\ell_1$ penalties. COLT 2007, 530-543.

Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when $p$ is much larger than $n$. To appear in Ann. Statist.

van de Geer, S.A. (2007). High-dimensional generalized linear models and the Lasso. To appear in Ann. Statist.