arxiv:math/ v1 [math.st] 1 Aug 2006

Size: px

Start display at page:

Download "arxiv:math/ v1 [math.st] 1 Aug 2006"

Erica Evans
6 years ago
Views:

1 The Annals of Statistics 2006, Vol. 34, No. 3, DOI: / c Institute of Mathematical Statistics, 2006 arxiv:math/ v1 [math.st] 1 Aug 2006 HIGH-DIMENSIONAL GRAPHS AND VARIABLE SELECTION WITH THE LASSO By Nicolai Meinshausen and Peter Bühlmann ETH Zürich The pattern of zero entries in the inverse covariance matrix of a multivariate normal distriution corresponds to conditional independence restrictions etween variales. Covariance selection aims at estimating those structural zeros from data. We show that neighorhood selection with the Lasso is a computationally attractive alternative to standard covariance selection for sparse high-dimensional graphs. Neighorhood selection estimates the conditional independence restrictions separately for each node in the graph and is hence equivalent to variale selection for Gaussian linear models. We show that the proposed neighorhood selection scheme is consistent for sparse high-dimensional graphs. Consistency hinges on the choice of the penalty parameter. The oracle value for optimal prediction does not lead to a consistent neighorhood estimate. Controlling instead the proaility of falsely joining some distinct connectivity components of the graph, consistent estimation for sparse graphs is achieved with exponential rates), even when the numer of variales grows as the numer of oservations raised to an aritrary power. 1. Introduction. Consider the p-dimensional multivariate normal distriuted random variale X = X 1,...,X p ) Nµ,Σ). This includes Gaussian linear models where, for example, X 1 is the response variale and {X k ;2 k p} are the predictor variales. Assuming that the covariance matrix Σ is nonsingular, the conditional independence structure of the distriution can e conveniently represented y a graphical model G = Γ,E), where Γ = {1,...,p} is the set of nodes and E the set of edges in Γ Γ. A pair a,) is contained in the edge set E if and only if X a is conditionally dependent on X, given all remaining variales X Γ\{a,} = Received May 2004; revised August AMS 2000 suject classifications. Primary 62J07; secondary 62H20, 62F12. Key words and phrases. Linear regression, covariance selection, Gaussian graphical models, penalized regression. This is an electronic reprint of the original article pulished y the Institute of Mathematical Statistics in The Annals of Statistics, 2006, Vol. 34, No. 3, This reprint differs from the original in pagination and typographic detail. 1

2 2 N. MEINSHAUSEN AND P. BÜHLMANN {X k ;k Γ \ {a,}}. Every pair of variales not contained in the edge set is conditionally independent, given all remaining variales, and corresponds to a zero entry in the inverse covariance matrix [12]. Covariance selection was introduced y Dempster [3] and aims at discovering the conditional independence restrictions the graph) from a set of i.i.d. oservations. Covariance selection traditionally relies on the discrete optimization of an ojective function [5, 12]. Exhaustive search is computationally infeasile for all ut very low-dimensional models. Usually, greedy forward or ackward search is employed. In forward search, the initial estimate of the edge set is the empty set and edges are then added iteratively until a suitale stopping criterion is satisfied. The selection deletion) of a single edge in this search strategy requires an MLE fit [15] for Op 2 ) different models. The procedure is not well suited for high-dimensional graphs. The existence of the MLE is not guaranteed in general if the numer of oservations is smaller than the numer of nodes [1]. More disturingly, the complexity of the procedure renders even greedy search strategies impractical for modestly sized graphs. In contrast, neighorhood selection with the Lasso, proposed in the following, relies on optimization of a convex function, applied consecutively to each node in the graph. The method is computationally very efficient and is consistent even for the high-dimensional setting, as will e shown. Neighorhood selection is a suprolem of covariance selection. The neighorhood ne a of a node a Γ is the smallest suset of Γ\{a} so that, given all variales X nea in the neighorhood, X a is conditionally independent of all remaining variales. The neighorhood of a node a Γ consists of all nodes Γ \ {a} so that a,) E. Given n i.i.d. oservations of X, neighorhood selection aims at estimating individually) the neighorhood of any given variale or node). The neighorhood selection can e cast as a standard regression prolem and can e solved efficiently with the Lasso [16], as will e shown in this paper. The consistency of the proposed neighorhood selection will e shown for sparse high-dimensional graphs, where the numer of variales is potentially growing as any power of the numer of oservations high-dimensionality), whereas the numer of neighors of any variale is growing at most slightly slower than the numer of oservations sparsity). A numer of studies have examined the case of regression with a growing numer of parameters as sample size increases. The closest to our setting is the recent work of Greenshtein and Ritov [8], who study consistent prediction in a triangular setup very similar to ours see also [10]). However, the prolem of consistent estimation of the model structure, which is the relevant concept for graphical models, is very different and not treated in these studies.

3 VARIABLE SELECTION WITH THE LASSO 3 We study in Section 2 under which conditions, and at which rate, the neighorhood estimate with the Lasso converges to the true neighorhood. The choice of the penalty is crucial in the high-dimensional setting. The oracle penalty for optimal prediction turns out to e inconsistent for estimation of the true model. This solution might include an unounded numer of noise variales in the model. We motivate a different choice of the penalty such that the proaility of falsely connecting two or more distinct connectivity components of the graph is controlled at very low levels. Asymptotically, the proaility of estimating the correct neighorhood converges exponentially to 1, even when the numer of nodes in the graph is growing rapidly as any power of the numer of oservations. As a consequence, consistent estimation of the full edge set in a sparse high-dimensional graph is possile Section 3). Encouraging numerical results are provided in Section 4. The proposed estimate is shown to e oth more accurate than the traditional forward selection MLE strategy and computationally much more efficient. The accuracy of the forward selection MLE fit is in particular poor if the numer of nodes in the graph is comparale to the numer of oservations. In contrast, neighorhood selection with the Lasso is shown to e reasonaly accurate for estimating graphs with several thousand nodes, using only a few hundred oservations. 2. Neighorhood selection. Instead of assuming a fixed true underlying model, we adopt a more flexile approach similar to the triangular setup in [8]. Both the numer of nodes in the graphs numer of variales), denoted y pn) = Γn), and the distriution the covariance matrix) depend in general on the numer of oservations, so that Γ = Γn) and Σ = Σn). The neighorhood ne a of a node a Γn) is the smallest suset of Γn) \ {a} so that X a is conditionally independent of all remaining variales. Denote the closure of node a Γn) y cl a := ne a {a}. Then X a {X k ;k Γn) \ cl a } X nea. For details see [12]. The neighorhood depends in general on n as well. However, this dependence is notationally suppressed in the following. It is instructive to give a slightly different definition of a neighorhood. For each node a Γn), consider optimal prediction of X a, given all remaining variales. Let θ a R pn) e the vector of coefficients for optimal prediction, 1) θ a = argmine θ: θ a=0 X a k Γn) θ k X k ) 2. As a generalization of 1), which will e of use later, consider optimal prediction of X a, given only a suset of variales {X k ;k A}, where A

4 4 N. MEINSHAUSEN AND P. BÜHLMANN Γn) \ {a}. The optimal prediction is characterized y the vector θ a,a, θ a,a = argmin E X a ) 2 2) θ k X k. θ : θ k =0, k/ A k Γn) The elements of θ a are determined y the inverse covariance matrix [12]. For Γ \ {a} and Kn) = Σ 1 n), it holds that θ a = K an)/k aa n). The set of nonzero coefficients of θ a is identical to the set { Γn)\{a}:K a n) 0} of nonzero entries in the corresponding row vector of the inverse covariance matrix and defines precisely the set of neighors of node a. The est predictor for X a is thus a linear function of variales in the set of neighors of the node a only. The set of neighors of a node a Γn) can hence e written as ne a = { Γn):θ a 0}. This set corresponds to the set of effective predictor variales in regression with response variale X a and predictor variales {X k ;k Γn) \ {a}}. Given n independent oservations of X N0, Σn)), neighorhood selection tries to estimate the set of neighors of a node a Γn). As the optimal linear prediction of X a has nonzero coefficients precisely for variales in the set of neighors of the node a, it seems reasonale to try to exploit this relation Neighorhood selection with the Lasso. It is well known that the Lasso, introduced y Tishirani [16], and known as Basis Pursuit in the context of wavelet regression [2], has a parsimonious property [11]. When predicting a variale X a with all remaining variales {X k ;k Γn) \ {a}}, the vanishing Lasso coefficient estimates identify asymptotically the neighorhood of node a in the graph, as shown in the following. Let the n pn)- dimensional matrix X contain n independent oservations of X, so that the columns X a correspond for all a Γn) to the vector of n independent oservations of X a. Let, e the usual inner product on R n and 2 the corresponding norm. The Lasso estimate ˆθ a,λ of θ a is given y 3) ˆθ a,λ = argminn 1 X a Xθ λ θ 1 ), θ : θ a=0 where θ 1 = Γn) θ is the l 1 -norm of the coefficient vector. Normalization of all variales to a common empirical variance is recommended for the estimator in 3). The solution to 3) is not necessarily unique. However, if uniqueness fails, the set of solutions is still convex and all our results aout neighorhoods in particular Theorems 1 and 2) hold for any solution of 3). Other regression estimates have een proposed, which are ased on the l p - norm, where p is typically in the range [0,2] see [7]). A value of p = 2 leads to

5 VARIABLE SELECTION WITH THE LASSO 5 the ridge estimate, while p = 0 corresponds to traditional model selection. It is well known that the estimates have a parsimonious property with some components eing exactly zero) for p 1 only, while the optimization prolem in 3) is only convex for p 1. Hence l 1 -constrained empirical risk minimization occupies a unique position, as p = 1 is the only value of p for which variale selection takes place while the optimization prolem is still convex and hence feasile for high-dimensional prolems. The neighorhood estimate parameterized y λ) is defined y the nonzero coefficient estimates of the l 1 -penalized regression, ˆne λ a,λ a = { Γn): ˆθ 0}. Each choice of a penalty parameter λ specifies thus an estimate of the neighorhood ne a of node a Γn) and one is left with the choice of a suitale penalty parameter. Larger values of the penalty tend to shrink the size of the estimated set, while more variales are in general included into ˆne λ a if the value of λ is diminished The prediction-oracle solution. A seemingly useful choice of the penalty parameter is the unavailale) prediction-oracle value, λ oracle = argmine λ X a k Γn) ˆθ a,λ k X k) 2. The expectation is understood to e with respect to a new X, which is independent of the sample on which ˆθ a,λ is estimated. The prediction-oracle penalty minimizes the predictive risk among all Lasso estimates. An estimate of λ oracle is otained y the cross-validated choice λ cv. For l 0 -penalized regression it was shown y Shao [14] that the crossvalidated choice of the penalty parameter is consistent for model selection under certain conditions on the size of the validation set. The predictionoracle solution does not lead to consistent model selection for the Lasso, as shown in the following for a simple example. Proposition 1. Let the numer of variales grow to infinity, pn), for n, with pn) = on γ ) for some γ > 0. Assume that the covariance matrices Σn) are identical to the identity matrix except for some pair a,) Γn) Γn), for which Σ a n) = Σ a n) = s, for some 0 < s < 1 and all n N. The proaility of selecting the wrong neighorhood for node a converges to 1 under the prediction-oracle penalty, P ˆne λ oracle a ne a ) 1 for n.

6 6 N. MEINSHAUSEN AND P. BÜHLMANN A proof is given in the Appendix. It follows from the proof of Proposition 1 that many noise variales are included in the neighorhood estimate with the prediction-oracle solution. In fact, the proaility of including noise variales with the prediction-oracle solution does not even vanish asymptotically for a fixed numer of variales. If the penalty is chosen larger than the predictionoptimal value, consistent neighorhood selection is possile with the Lasso, as demonstrated in the following Assumptions. We make a few assumptions to prove consistency of neighorhood selection with the Lasso. We always assume availaility of n independent oservations from X N0, Σ). High-dimensionality. The numer of variales is allowed to grow as the numer of oservations n raised to an aritrarily high power. Assumption 1. There exists γ > 0, so that pn) = On γ ) for n. In particular, it is allowed for the following analysis that the numer of variales is very much larger than the numer of oservations, pn) n. Nonsingularity. We make two regularity assumptions for the covariance matrices. Assumption 2. For all a Γn) and n N, VarX a ) = 1. There exists v 2 > 0, so that for all n N and a Γn), VarX a X Γn)\{a} ) v 2. Common variance can always e achieved y appropriate scaling of the variales. A scaling to a common empirical) variance of all variales is desirale, as the solutions would otherwise depend on the chosen units or dimensions in which they are represented. The second part of the assumption explicitly excludes singular or nearly singular covariance matrices. For singular covariance matrices, edges are not uniquely defined y the distriution and it is hence not surprising that nearly singular covariance matrices are not suitale for consistent variale selection. Note, however, that the empirical covariance matrix is a.s. singular if pn) > n, which is allowed in our analysis.

7 VARIABLE SELECTION WITH THE LASSO 7 Sparsity. The main assumption is the sparsity of the graph. This entails a restriction on the size of the neighorhoods of variales. Assumption 3. There exists some 0 κ < 1 so that max ne a = On κ ) for n. a Γn) This assumption limits the maximal possile rate of growth for the size of neighorhoods. For the next sparsity condition, consider again the definition in 2) of the optimal coefficient θ,a for prediction of X, given variales in the set A Γn). Assumption 4. There exists some ϑ < so that for all neighoring nodes a, Γn) and all n N, θ a,ne \{a} 1 ϑ. This assumption is, for example, satisfied if Assumption 2 holds and the size of the overlap of neighorhoods is ounded y an aritrarily large numer from aove for neighoring nodes. That is, if there exists some m < so that for all n N, 4) max a, Γn), ne a ne a ne m for n, then Assumption 4 is satisfied. To see this, note that Assumption 2 gives a finite ound for the l 2 -norm of θ a,ne \{a}, while 4) gives a finite ound for the l 0 -norm. Taken together, Assumption 4 is implied. Magnitude of partial correlations. The next assumption ounds the magnitude of partial correlations from elow. The partial correlation π a etween variales X a and X is the correlation after having eliminated the linear effects from all remaining variales {X k ;k Γn) \ {a,}}; for details see [12]. Assumption 5. There exist a constant δ > 0 and some ξ > κ, with κ as in Assumption 3, so that for every a,) E, π a δn 1 ξ)/2. It will e shown elow that Assumption 5 cannot e relaxed in general. Note that neighorhood selection for node a Γn) is equivalent to simultaneously testing the null hypothesis of zero partial correlation etween variale X a and all remaining variales X, Γn) \ {a}. The null hypothesis of zero partial correlation etween two variales can e tested y

8 8 N. MEINSHAUSEN AND P. BÜHLMANN using the corresponding entry in the normalized inverse empirical covariance matrix. A graph estimate ased on such tests has een proposed y Drton and Perlman [4]. Such a test can only e applied, however, if the numer of variales is smaller than the numer of oservations, pn) n, as the empirical covariance matrix is singular otherwise. Even if pn) = n c for some constant c > 0, Assumption 5 would have to hold with ξ = 1 to have a positive power of rejecting false null hypotheses for such an estimate; that is, partial correlations would have to e ounded y a positive value from elow. Neighorhood staility. The last assumption is referred to as neighorhood staility. Using the definition of θ a,a in 2), define for all a, Γn), 5) S a ) := signθ a,nea k )θ,nea k. k ne a The assumption of neighorhood staility restricts the magnitude of the quantities S a ) for nonneighoring nodes a, Γn). Assumption 6. There exists some δ < 1 so that for all a, Γn) with / ne a, S a ) < δ. It is shown in Proposition 3 that this assumption cannot e relaxed. We give in the following a more intuitive condition which essentially implies Assumption 6. This will justify the term neighorhood staility. Consider the definition in 1) of the optimal coefficients θ a for prediction of X a. For η > 0, define θ a η) as the optimal set of coefficients under an additional l 1 -penalty, 6) θ a η) := argmine θ : θ a=0 X a k Γn) θ k X k ) 2 + η θ 1. The neighorhood ne a of node a was defined as the set of nonzero coefficients of θ a, ne a = {k Γn) : θ a k 0}. Define the distured neighorhood ne aη) as ne a η) := {k Γn):θ a kη) 0}. It clearly holds that ne a = ne a 0). The assumption of neighorhood staility is satisfied if there exists some infinitesimally small perturation η, which may depend on n, so that the distured neighorhood ne a η) is identical to the undistured neighorhood ne a 0).

9 VARIABLE SELECTION WITH THE LASSO 9 Proposition 2. If there exists some η > 0 so that ne a η) = ne a 0), then S a ) 1 for all Γn) \ ne a. A proof is given in the Appendix. In light of Proposition 2 it seems that Assumption 6 is a very weak condition. To give one example, Assumption 6 is automatically satisfied under the much stronger assumption that the graph does not contain cycles. We give a rief reasoning for this. Consider two nonneighoring nodes a and. If the nodes are in different connectivity components, there is nothing left to show as S a ) = 0. If they are in the same connectivity component, then there exists one node k ne a that separates from ne a \ {k}, as there is just one unique path etween any two variales in the same connectivity component if the graph does not contain cycles. Using the gloal Markov property, the random variale X is independent of X nea\{k}, given X k. The random variale EX X nea ) is thus a function of X k only. As the distriution is Gaussian, EX X nea ) = θ,nea k X k. By Assumption 2, VarX X nea ) = v 2 for some v 2 > 0. It follows that VarX ) = v 2 + θ,nea k ) 2 = 1 and hence θ,nea k = 1 v 2 < 1, which implies that Assumption 6 is indeed satisfied if the graph does not contain cycles. We mention that Assumption 6 is likewise satisfied if the inverse covariance matrices Σ 1 n) are for each n N diagonally dominant. A matrix is said to e diagonally dominant if and only if, for each row, the sum of the asolute values of the nondiagonal elements is smaller than the asolute value of the diagonal element. The proof of this is straightforward ut tedious and hence is omitted Controlling type I errors. The asymptotic properties of Lasso-type estimates in regression have een studied in detail y Knight and Fu [11] for a fixed numer of variales. Their results say that the penalty parameter λ should decay for an increasing numer of oservations at least as fast as n 1/2 to otain an n 1/2 -consistent estimate. It turns out that a slower rate is needed for consistent model selection in the high-dimensional case where pn) n. However, a rate n 1 ε)/2 with any κ < ε < ξ where κ,ξ are defined as in Assumptions 3 and 5) is sufficient for consistent neighorhood selection, even when the numer of variales is growing rapidly with the numer of oservations. Theorem 1. Let Assumptions 1 6 hold. Let the penalty parameter satisfy λ n dn 1 ε)/2 with some κ < ε < ξ and d > 0. There exists some c > 0 so that, for all a Γn), P ˆne λ a ne a) = 1 Oexp cn ε )) for n.

10 10 N. MEINSHAUSEN AND P. BÜHLMANN A proof is given in the Appendix. Theorem 1 states that the proaility of falsely) including any of the nonneighoring variales of the node a Γn) into the neighorhood estimate vanishes exponentially fast, even though the numer of nonneighoring variales may grow very rapidly with the numer of oservations. It is shown in the following that Assumption 6 cannot e relaxed. Proposition 3. If there exists some a, Γn) with / ne a and S a ) > 1, then, for λ = λ n as in Theorem 1, P ˆne λ a ne a) 0 for n. A proof is given in the Appendix. Assumption 6 of neighorhood staility is hence critical for the success of Lasso neighorhood selection Controlling type II errors. So far it has een shown that the proaility of falsely including variales into the neighorhood can e controlled y the Lasso. The question arises whether the proaility of including all neighoring variales into the neighorhood estimate converges to 1 for n. Theorem 2. Let the assumptions of Theorem 1 e satisfied. For λ = λ n as in Theorem 1, for some c > 0 Pne a ˆne λ a) = 1 Oexp cn ε )) for n. A proof is given in the Appendix. It may e of interest whether Assumption 5 could e relaxed, so that edges are detected even if the partial correlation is vanishing at a rate n 1 ξ)/2 for some ξ < κ. The following proposition says that ξ > ε and thus ξ > κ as ε > κ) is a necessary condition if a stronger version of Assumption 4 holds, which is satisfied for forests and trees, for example. Proposition 4. Let the assumptions of Theorem 1 hold with ϑ < 1 in Assumption 4, except that for a Γn), let there e some Γn) \ {a} with π a 0 and π a = On 1 ξ)/2 ) for n for some ξ < ε. Then P ˆne λ a ) 0 for n. Theorem 2 and Proposition 4 say that edges etween nodes for which partial correlation vanishes at a rate n 1 ξ)/2 are, with proaility converging to 1 for n, detected if ξ > ε and are undetected if ξ < ε. The results do not cover the case ξ = ε, which remains a challenging question for further research.

11 VARIABLE SELECTION WITH THE LASSO 11 All results so far have treated the distinction etween zero and nonzero partial correlations only. The signs of partial correlations of neighoring nodes can e estimated consistently under the same assumptions and with the same rates, as can e seen in the proofs. 3. Covariance selection. It follows from Section 2 that it is possile under certain conditions to estimate the neighorhood of each node in the graph consistently, for example, P ˆne λ a = ne a) 1 for n. The full graph is given y the set Γn) of nodes and the edge set E = En). The edge set contains those pairs a, ) Γn) Γn) for which the partial correlation etween X a and X is not zero. As the partial correlations are precisely nonzero for neighors, the edge set E Γn) Γn) is given y E = {a,):a ne ne a }. The first condition, a ne, implies in fact the second, ne a, and vice versa, so that the edge is as well identical to {a,):a ne ne a }. For an estimate of the edge set of a graph, we can apply neighorhood selection to each node in the graph. A natural estimate of the edge set is then given y Êλ, Γn) Γn), where 7) Ê λ, = {a,):a ˆne λ ˆne λ a}. Note that a ˆne λ does not necessarily imply ˆneλ a and vice versa. We can hence also define a second, less conservative, estimate of the edge set y 8) Ê λ, = {a,):a ˆne λ ˆneλ a }. The discrepancies etween the estimates 7) and 8) are quite small in our experience. Asymptotically the difference etween oth estimates vanishes, as seen in the following corollary. We refer to oth edge set estimates collectively with the generic notation Êλ, as the following result holds for oth of them. Corollary 1. Under the conditions of Theorem 2, for some c > 0, PÊλ = E) = 1 Oexp cn ε )) for n. The claim follows since Γn) 2 = pn) 2 = On 2γ ) y Assumption 1 and neighorhood selection has an exponentially fast convergence rate as descried y Theorem 2. Corollary 1 says that the conditional independence structure of a multivariate normal distriution can e estimated consistently y comining the neighorhood estimates for all variales.

12 12 N. MEINSHAUSEN AND P. BÜHLMANN Note that there are in total 2 p2 p)/2 distinct graphs for a p-dimensional variale. However, for each of the p nodes there are only 2 p 1 distinct potential neighorhoods. By reaking the graph selection prolem into a consecutive series of neighorhood selection prolems, the complexity of the search is thus reduced sustantially at the price of potential inconsistencies etween neighorhood estimates. Graph estimates that apply this strategy for complexity reduction are sometimes called dependency networks [9]. The complexity of the proposed neighorhood selection for one node with the Lasso is reduced further to Onpmin{n,p}), as the Lars procedure of Efron, Hastie, Johnstone and Tishirani [6] requires Omin{n, p}) steps, each of complexity Onp). For high-dimensional prolems as in Theorems 1 and 2, where the numer of variales grows as pn) cn γ for some c > 0 and γ > 1, this is equivalent to Op 2+2/γ ) computations for the whole graph. The complexity of the proposed method thus scales approximately quadratic with the numer of nodes for large values of γ. Before providing some numerical results, we discuss in the following the choice of the penalty parameter. Finite-sample results and significance. It was shown aove that consistent neighorhood and covariance selection is possile with the Lasso in a high-dimensional setting. However, the asymptotic considerations give little advice on how to choose a specific penalty parameter for a given prolem. Ideally, one would like to guarantee that pairs of variales which are not contained in the edge set enter the estimate of the edge set only with very low prespecified) proaility. Unfortunately, it seems very difficult to otain such a result as the proaility of falsely including a pair of variales into the estimate of the edge set depends on the exact covariance matrix, which is in general unknown. It is possile, however, to constrain the proaility of falsely) connecting two distinct connectivity components of the true graph. The connectivity component C a Γn) of a node a Γn) is the set of nodes which are connected to node a y a chain of edges. The neighorhood ne a is clearly part of the connectivity component C a. Let Ĉλ a e the connectivity component of a in the estimated graph Γ,Êλ ). For any level 0 < α < 1, consider the choice of the penalty 9) λα) = 2ˆσ a α Φ 1 n 2pn) 2 where Φ = 1 Φ [Φ is the c.d.f. of N0,1)] and ˆσ 2 a = n 1 X a,x a. The proaility of falsely joining two distinct connectivity components with the estimate of the edge set is ounded y the level α under the choice λ = λα) of the penalty parameter, as shown in the following theorem. ),

13 VARIABLE SELECTION WITH THE LASSO 13 Theorem 3. Let Assumptions 1 6 e satisfied. Using the penalty parameter λα), we have for all n N that P a Γn):Ĉλ a C a ) α. A proof is given in the Appendix. This implies that if the edge set is empty E = ), it is estimated y an empty set with high proaility, PÊλ = ) 1 α. Theorem 3 is a finite-sample result. The previous asymptotic results in Theorems 1 and 2 hold if the level α vanishes exponentially to zero for an increasing numer of oservations, leading to consistent edge set estimation. 4. Numerical examples. We use oth the Lasso estimate from Section 3 and forward selection MLE [5, 12] to estimate sparse graphs. We found it difficult to compare numerically neighorhood selection with forward selection MLE for more than 30 nodes in the graph. The high computational complexity of the forward selection MLE made the computations for such relatively low-dimensional prolems very costly already. The Lasso scheme in contrast handled with ease graphs with more than 1000 nodes, using the recent algorithm developed in [6]. Where comparison was feasile, the performance of the neighorhood selection scheme was etter. The difference was particularly pronounced if the ratio of oservations to variales was low, as can e seen in Tale 1, which will e descried in more detail elow. First we give an account of the generation of the underlying graphs which we are trying to estimate. A realization of an underlying random) graph is given in the left panel of Figure 1. The nodes of the graph are associated with spatial location and the location of each node is distriuted identically and uniformly in the two-dimensional square [0,1] 2. Every pair of nodes is included initially in the edge set with proaility ϕd/ p), where d is the Tale 1 The average numer of correctly identified edges as a function of the numer k of falsely included edges for n = 40 oservations and p = 10, 20, 30 nodes for forward selection MLE FS), Ê λ,, Ê λ, and random guessing p = 10 p = 20 p = 30 k Random FS Ê λ, Ê λ,

14 14 N. MEINSHAUSEN AND P. BÜHLMANN Fig. 1. A realization of a graph is shown on the left, generated as descried in the text. The graph consists of 1000 nodes and 1747 edges out of 449,500 distinct pairs of variales. The estimated edge set, using estimate 7) at level α = 0.05 [see 9)], is shown in the middle. There are two erroneously included edges, marked y an arrow, while 1109 edges are correctly detected. For estimate 8) and an adjusted level as descried in the text, the result is shown on the right. Again two edges are erroneously included. Not a single pair of disjoint connectivity components of the true graph has een falsely) joined y either estimate. Euclidean distance etween the pair of variales and ϕ is the density of the standard normal distriution. The maximum numer of edges connecting to each node is limited to four to achieve the desired sparsity of the graph. Edges which connect to nodes which do not satisfy this constraint are removed randomly until the constraint is satisfied for all edges. Initially all variales have identical conditional variance and the partial correlation etween neighors is set to asolute values less than 0.25 guarantee positive definiteness of the inverse covariance matrix); that is, Σ 1 aa = 1 for all nodes a Γ, Σ 1 a = if there is an edge connecting a and and Σ 1 a = 0 otherwise. The diagonal elements of the corresponding covariance matrix are in general larger than 1. To achieve constant variance, all variales are finally rescaled so that the diagonal elements of Σ are all unity. Using the Cholesky transformation of the covariance matrix, n independent samples are drawn from the corresponding Gaussian distriution. The average numer of edges which are correctly included into the estimate of the edge set is shown in Tale 1 as a function of the numer of edges which are falsely included. The accuracy of the forward selection MLE is comparale to the proposed Lasso neighorhood selection if the numer of nodes is much smaller than the numer of oservations. The accuracy of the forward selection MLE reaks down, however, if the numer of nodes is approximately equal to the numer of oservations. Forward selection MLE is only marginally etter than random guessing in this case. Computation of the forward selection MLE using MIM, [5]) on the same desktop took up to several hundred times longer than the Lasso neighorhood selection for the full graph. For more than 30 nodes, the differences are even more pronounced.

15 VARIABLE SELECTION WITH THE LASSO 15 The Lasso neighorhood selection can e applied to hundred- or thousanddimensional graphs, a realistic size, for example, iological networks. A graph with 1000 nodes following the same model as descried aove) and its estimates 7) and 8), using 600 oservations, are shown in Figure 1. A level α = 0.05 is used for the estimate Êλ,. For etter comparison, the level α was adjusted to α = for the estimate Êλ,, so that oth estimates lead to the same numer of included edges. There are two erroneous edge inclusions, while 1109 out of all 1747 edges have een correctly identified y either estimate. Of these 1109 edges, 907 are common to oth estimates while 202 are just present in either 7) or 8). To examine if results are critically dependent on the assumption of Gaussianity, long-tailed noise is added to the oservations. Instead of n i.i.d. oservations of X N0,Σ), n i.i.d. oservations of X +0.1Z are made, where the components of Z are independent and follow a t 2 -distriution. For 10 simulations with each 500 oservations), the proportion of false rejections among all rejections increases only slightly from 0.8% without long-tailed noise) to 1.4% with long tailed-noise) for Êλ, and from 4.8% to 5.2% for Ê λ,. Our limited numerical experience suggests that the properties of the graph estimator do not seem to e critically affected y deviations from Gaussianity. APPENDIX: PROOFS A.1. Notation and useful lemmas. As a generalization of 3), the Lasso estimate ˆθ a,a,λ of θ a,a, defined in 2), is given y A.1) ˆθ a,a,λ = argmin n 1 X a Xθ λ θ 1). θ : θ k =0 k/ A The notation ˆθ a,λ is thus just a shorthand notation for ˆθ a,γn)\{a},λ. Lemma A.1. Given θ R pn), let Gθ) e a pn)-dimensional vector with elements G θ) = 2n 1 X a Xθ,X. A vector ˆθ with ˆθ k = 0, k Γn) \ A is a solution to A.1) iff for all A, G ˆθ) = signˆθ )λ in case ˆθ 0 and G ˆθ) λ in case ˆθ = 0. Moreover, if the solution is not unique and G ˆθ) < λ for some solution ˆθ, then ˆθ = 0 for all solutions of A.1). Proof. Denote the sudifferential of n 1 X a Xθ λ θ 1

16 16 N. MEINSHAUSEN AND P. BÜHLMANN with respect to θ y Dθ). The vector ˆθ is a solution to A.1) iff there exists an element d Dˆθ) so that d = 0, A. Dθ) is given y {Gθ) + λe,e S}, where S R pn) is given y S := {e R pn) :e = signθ ) if θ 0 and e [ 1,1] if θ = 0}. The first part of the claim follows. The second part follows from the proof of Theorem 3.1. in [13]. Lemma A.2. Let ˆθ a,nea,λ e defined for every a Γn) as in A.1). Under the assumptions of Theorem 1, for some c > 0, for all a Γn), Psignˆθ a,nea,λ ) = signθ a ), ne a) = 1 Oexp cn ε )) for n. For the sign-function, it is understood that sign0) = 0. The lemma says, in other words, that if one could restrict the Lasso estimate to have zero coefficients for all nodes which are not in the neighorhood of node a, then the signs of the partial correlations in the neighorhood of node a are estimated consistently under the given assumptions. Proof. Using Bonferroni s inequality, and ne a = on) for n, it suffices to show that there exists some c > 0 so that for every a, Γn) with ne a, Psignˆθ a,nea,λ ) = signθ a )) = 1 Oexp cn ε )) for n. Consider the definition of ˆθ a,nea,λ in A.1), A.2) ˆθ a,nea,λ = argmin n 1 X a Xθ λ θ 1). θ : θ k =0 k/ ne a Assume now that component of this estimate is fixed at a constant value β. Denote this new estimate y θ a,,λ β), A.3) where θ a,,λ β) = argminn 1 X a Xθ λ θ 1 ), θ Θ a, β) Θ a, β) := {θ R pn) :θ = β;θ k = 0, k / ne a }. ˆθ a,nea,λ There always exists a value β namely β = ) so that θ a,,λ β) is identical to ˆθ a,nea,λ. Thus, if signˆθ a,nea,λ ) signθ a ), there would exist some β with signβ)signθ a) 0 so that θ a,,λ β) would e a solution to A.2). Using signθ a) 0 for all ne a, it is thus sufficient to show that for every β with signβ)signθ a) < 0, θ a,,λ β) cannot e a solution to A.2) with high proaility. We focus in the following on the case where θ a > 0 for notational simplicity. The case θ a < 0 follows analogously. If θa > 0, it follows y Lemma A.1

17 VARIABLE SELECTION WITH THE LASSO 17 that θ a,,λ a,,λ β) with θ β) = β 0 can e a solution to A.2) only if G θ a,,λ β)) λ. Hence it suffices to show that for some c > 0 and all ne a with θ a > 0, for n, ) A.4) P {G θ a,,λ β))} < λ = 1 Oexp cn ε )). sup β 0 Let in the following R λ aβ) e the n-dimensional vector of residuals, A.5) We can write X as A.6) X = R λ a β) := X a X θ a,,λ β). k ne a\{} θ,nea\{} k X k + W, where W is independent of {X k ;k ne a \ {}}. By straightforward calculation, using A.6), G θ a,,λ β)) = 2n 1 R λ a β),w θ,nea\{} k 2n 1 R λ a β),x k ). k ne a\{} By Lemma A.1, for all k ne a \{}, G k θ a,,λ β)) = 2n 1 R λ aβ),x k λ. This together with the equation aove yields A.7) G θ a,,λ β)) 2n 1 R λ aβ),w + λ θ,nea\{} 1. Using Assumption 4, there exists some ϑ <, so that θ,nea\{} 1 ϑ. For proving A.4) it is therefore sufficient to show that there exists for every g > 0 some c > 0 so that for all ne a with θ a > 0, for n, ) A.8) P inf β 0 {2n 1 R λ aβ),w } > gλ = 1 Oexp cn ε )). With a little ause of notation, let W R n e the at most ne a 1)- dimensional space which is spanned y the vectors {X k, k ne a \ {}} and let W e the orthogonal complement of W in R n. Split the n-dimensional vector W of oservations of W into the sum of two vectors A.9) W = W +W, where W is contained in the space W R n, while the remaining part W is chosen orthogonal to this space in the orthogonal complement W of W ). The inner product in A.8) can e written as A.10) 2n 1 R λ aβ),w = 2n 1 R λ aβ),w + 2n 1 R λ aβ),w.

18 18 N. MEINSHAUSEN AND P. BÜHLMANN By Lemma A.3 see elow), there exists for every g > 0 some c > 0 so that, for n, ) P inf β 0 {2n 1 R λ aβ),w /1 + β )} > gλ = 1 Oexp cn ε )). To show A.8), it is sufficient to prove that there exists for every g > 0 some c > 0 so that, for n, ) A.11) P inf β 0 {2n 1 R λ a β),w g1 + β )λ} > gλ = 1 Oexp cn ε )). For some random variale V a, independent of X nea, we have X a = k ne a θ a k X k + V a. Note that V a and W are independent normally distriuted random variales with variances σa 2 and σ 2, respectively. By Assumption 2, 0 < v2 σ 2,σ2 a 1. Note furthermore that W and X nea\{} are independent. Using θ a = θ a,nea and A.6), A.12) X a = θk a + θθ a,nea\{} k )X k + θw a + V a. k ne a\{} Using A.12), the definition of the residuals in A.5) and the orthogonality property of W, A.13) 2n 1 R λ a β),w = 2n 1 θ a β) W,W + 2n 1 V a,w, 2n 1 θ a β) W,W 2n 1 V a,w. The second term, 2n 1 V a,w, is stochastically smaller than 2n 1 V a,w this can e derived y conditioning on {X k ;k ne a }). Due to independence of V a and W, EV a W ) = 0. Using Bernstein s inequality Lemma in [17]), and λ dn 1 ε)/2 with ε > 0, there exists for every g > 0 some c > 0 so that A.14) P 2n 1 V a,w gλ) P 2n 1 V a,w gλ) = Oexp cn ε )). Instead of A.11), it is sufficient y A.13) and A.14) to show that there exists for every g > 0 a c > 0 so that, for n, ) A.15) P inf β 0 {2n 1 θ a β) W,W g1 + β )λ} > 2gλ = 1 Oexp cn ε )). Note that σ 2 W,W follows a χ2 n ne distriution. As ne a a = on) and σ 2 v2 y Assumption 2), it follows that there exists some k > 0 so that for n n 0 with some n 0 k) N, and any c > 0, P2n 1 W,W > k) = 1 Oexp cn ε )).

19 VARIABLE SELECTION WITH THE LASSO 19 To show A.15), it hence suffices to prove that for every k,l > 0 there exists some n 0 k,l) N so that, for all n n 0, A.16) inf β 0 {θa β)k l1 + β )λ} > 0. By Assumption 5, π a is of order at least n 1 ξ)/2. Using π a = θ a /VarX a X Γn)\{a} )VarX X Γn)\{} )) 1/2 and Assumption 2, this implies that there exists some q > 0 so that θ a qn 1 ξ)/2. As λ dn 1 ε)/2 and, y the assumptions in Theorem 1, ξ > ε, it follows that for every k,l > 0 and large enough values of n, θ a k lλ > 0. It remains to show that for any k,l > 0 there exists some n 0 k,l) so that for all n n 0, inf { βk l β λ} 0. β 0 This follows as λ 0 for n, which completes the proof. Lemma A.3. Assume the conditions of Theorem 1 hold. Let R λ a β) e defined as in A.5) and W as in A.9). For any g > 0 there exists c > 0 so that for all a, Γn), for n, P sup β R 2n 1 R λ aβ),w /1 + β ) < gλ ) = 1 Oexp cn ε )). Proof. By Schwarz s inequality, 2n 1 R λ a β),w /1 + β ) 2n 1/2 W n 1/2 R λ aβ) 2 A.17) β The sum of squares of the residuals is increasing with increasing value of λ. Thus, R λ a β) 2 2 R a β) 2 2. By definition of Rλ a in A.5), and using A.3), and hence R a β) 2 2 = X a βx 2 2, R λ aβ) β ) 2 max{ X a 2 2, X 2 2}. Therefore, for any q > 0, n 1/2 R λ a P sup β) ) 2 > q Pn 1/2 max{ X a 2, X 2 } > q). β R 1 + β

20 20 N. MEINSHAUSEN AND P. BÜHLMANN Note that oth X a 2 2 and X 2 2 are χ2 n-distriuted. Thus there exist q > 1 and c > 0 so that n 1/2 R λ a P sup β) ) 2 A.18) > q = Oexp cn ε )) for n. β R 1 + β It remains to show that for every g > 0 there exists some c > 0 so that A.19) Pn 1/2 W 2 > gλ) = Oexp cn ε )) for n. The expression σ 2 W,W is χ2 ne a 1 -distriuted. As σ 1 and ne a = On κ ), it follows that n 1/2 W 2 is for some t > 0 stochastically smaller than tn 1 κ)/2 Z/n κ ) 1/2, where Z is χ 2 nκ-distriuted. Thus, for every g > 0, Pn 1/2 W 2 > gλ) PZ/n κ ) > g/t) 2 n 1 κ) λ 2 ). As λ 1 = On 1 ε)/2 ), it follows that n 1 κ λ 2 hn ε κ for some h > 0 and sufficiently large n. By the properties of the χ 2 distriution and ε > κ, y assumption in Theorem 1, claim A.19) follows. This completes the proof. Proof of Proposition 1. All diagonal elements of the covariance matrices Σn) are equal to 1, while all off-diagonal elements vanish for all pairs except for a, Γn), where Σ a n) = s with 0 < s < 1. Assume w.l.o.g. that a corresponds to the first and to the second variale. The est vector of coefficients θ a for linear prediction of X a is given y θ a = 0, K a /K aa,0,0,...) = 0,s,0,0,...), where K = Σ 1 n). A necessary condition for ˆne λ a = ne a is that ˆθ a,λ = 0,τ,0,0,...) is the oracle Lasso solution for some τ 0. In the following, we show first that A.20) P λ,τ s: ˆθ a,λ = 0,τ,0,0,...)) 0, n. The proof is then completed y showing in addition that 0,τ,0,0,...) cannot e the oracle Lasso solution as long as τ < s. We egin y showing A.20). If ˆθ = 0,τ,0,0,...) is a Lasso solution for some value of the penalty, it follows that, using Lemma A.1 and positivity of τ, A.21) X 1 τx 2,X 2 X 1 τx 2,X k k Γn), k > 2. Under the given assumptions, X 2,X 3,... can e understood to e independently and identically distriuted, while X 1 = sx 2 + W 1, with W 1 independent of X 2,X 3,...). Sustituting X 1 = sx 2 + W 1 in A.21) yields for all k Γn) with k > 2, W 1,X 2 τ s) X 2,X 2 W 1,X k τ s) X 2,X k.

21 VARIABLE SELECTION WITH THE LASSO 21 Let U 2,U 3,...,U pn) e the random variales defined y U k = W 1,X k. Note that the random variales U k, k = 2,...,pn), are exchangeale. Let furthermore The inequality aove implies then D = X 2,X 2 max X 2,X k. k Γn),k>2 U 2 > max U k + τ s)d. k Γn),k>2 To show the claim, it thus suffices to show that ) A.22) P U 2 > max U k + τ s)d 0 for n. k Γn),k>2 Using τ s > 0, ) P U 2 > max U k + τ s)d k Γn),k>2 P U 2 > max k Γn),k>2 U k ) + PD < 0). Using the assumption that s < 1, it follows y pn) = on γ ) for some γ > 0 and a Bernstein-type inequality that PD < 0) 0 for n. Furthermore, as U 2,...,U pn) are exchangeale, ) P U 2 > max = pn) 1) 1 0 for n, k Γn),k>2 U k which shows that A.22) holds. The claim A.20) follows. It hence suffices to show that 0,τ,0,0,...) with τ < s cannot e the oracle Lasso solution. Let τ max e the maximal value of τ so that 0,τ,0,...) is a Lasso solution for some value λ > 0. By the previous assumption, τ max < s. For τ < τ max, the vector 0,τ,0,...) cannot e the oracle Lasso solution. We show in the following that 0,τ max,0,...) cannot e an oracle Lasso solution either. Suppose that 0,τ max,0,0,...) is the Lasso solution ˆθ a,λ for some λ = λ > 0. As τ max is the maximal value such that 0,τ,0,...) is a Lasso solution, there exists some k Γn) > 2, such that n 1 X 1 τ max X 2,X 2 = n 1 X 1 τ max X 2,X k, and the value of oth components G 2 and G k of the gradient is equal to λ. By appropriately reordering the variales we can assume that k = 3. Furthermore, it holds a.s. that max X 1 τ max X 2,X k < λ. k Γn),k>3

22 22 N. MEINSHAUSEN AND P. BÜHLMANN Hence, for sufficiently small δλ 0, a Lasso solution for the penalty λ δλ is given y 0,τ max + δθ 2,δθ 3,0,...). Let H n e the empirical covariance matrix of X 2,X 3 ). Assume w.l.o.g. that n 1 X 1 τ max X 2,X k > 0 and n 1 X 2,X 2 = n 1 X 3,X 3 = 1. Following, for example, Efron et al. [6], page 417), the components δθ 2,δθ 3 ) are then given y Hn 1 1,1) T, from which it follows that δθ 2 = δθ 3, which we areviate y δθ in the following one can accommodate a negative sign for n 1 X 1 τ max X 2,X k y reversing the sign of δθ 3 ). Denote y L δ the squared error loss for this solution. Then, for sufficiently small δθ, L δ L 0 = EX 1 τ max + δθ)x 2 + δθx 3 ) 2 EX 1 τ max X 2 ) 2 = s τ max + δθ)) 2 + δθ 2 s τ max ) 2 = 2s τ max )δθ + 2δθ 2. It holds that L δθ L 0 < 0 for any 0 < δθ < 1/2s τ max ), which shows that 0,τ,0,...) cannot e the oracle solution for τ < s. Together with A.20), this completes the proof. Proof of Proposition 2. The sudifferential of the argument in 6), E X a ) 2 θmx a m + η θ a 1, m Γn) with respect to θk a, k Γn) \ {a}, is given y 2E X a ) ) θmx a m X k + ηe k, m Γn) where e k [ 1,1] if θk a = 0, and e k = signθk a) if θa k 0. Using the fact that ne a η) = ne a, it follows as in Lemma A.1 that for all k ne a, 2E X a ) ) A.23) θmη)x a m X k = η signθk) a and, for / ne a, A.24) m Γn) 2E X a m Γn) θ a m η)x m A variale X with / ne a can e written as X = k X k + W, k ne a θ,nea ) X ) η.

23 VARIABLE SELECTION WITH THE LASSO 23 where W is independent of {X k ;k cl a }. Using this in A.24) yields 2 k E X a ) ) θm a η)x m X k η. k ne a θ,nea k ne a θ,nea m Γn) Using A.23) and θ a = θ a,nea, it follows that k signθ a,nea k ) 1, which completes the proof. Proof of Theorem 1. The event ˆne λ a ne a is equivalent to the event that there exists some node Γn) \cl a in the set of nonneighors of node a such that the estimated coefficient is not zero. Thus A.25) ˆθ a,λ P ˆne λ a,λ a ne a ) = 1 P Γn) \ cl a : ˆθ 0). Consider the Lasso estimate ˆθ a,nea,λ, which is y A.1) constrained to have nonzero components only in the neighorhood of node a Γn). Using ne a = On κ ) with some κ < 1, we can assume w.l.o.g. that ne a n. This in turn implies see, e.g., [13]) that ˆθ a,nea,λ is a.s. a unique solution to A.1) with A = ne a. Let E e the event max k Γn)\cl a G k ˆθ a,nea,λ ) < λ. Conditional on the event E, it follows from the first part of Lemma A.1 that ˆθ a,nea,λ is not only a solution of A.1), with A = ne a, ut as well a solution a,nea,λ of 3), where A = Γn) \ {a}. As ˆθ = 0 for all Γn) \ cl a, it follows from the second part of Lemma A.1 that = 0, Γn) \ cl a. Hence where A.26) a,λ P Γn) \ cl a : ˆθ ˆθ a,λ 0) 1 PE) = P max k Γn)\cl a G k ˆθ a,nea,λ ) λ G ˆθ a,nea,λ ) = 2n 1 X a Xˆθ a,nea,λ,x. Using Bonferroni s inequality and pn) = On γ ) for any γ > 0, it suffices to show that there exists a constant c > 0 so that for all Γn) \ cl a, A.27) P G ˆθ a,nea,λ ) λ) = Oexp cn ε )). One can write for any Γn) \ cl a, A.28) X = m X m + V, m ne a θ,ne a ),

24 24 N. MEINSHAUSEN AND P. BÜHLMANN where V N0,σ 2) for some σ2 1 and V is independent of {X m ;m cl a }. Hence G ˆθ a,nea,λ ) = 2n 1 m X a Xˆθ a,nea,λ,x m m ne a θ,nea 2n 1 X a Xˆθ a,nea,λ,v. By Lemma A.2, there exists some c > 0 so that with proaility 1 Oexp cn ε )), A.29) signˆθ a,nea,λ k ) = signθ a,nea k ) k ne a. In this case y Lemma A.1 2n 1 ) θm,nea X a Xˆθ a,nea,λ,x m = signθm a,nea )θ,nea m λ. m ne a m ne a If A.29) holds, the gradient is given y ) G ˆθ a,nea,λ ) = signθ a,ne a m )θm,nea λ A.30) m ne a 2n 1 X a Xˆθ a,nea,λ,v. Using Assumption 6 and Proposition 2, there exists some δ < 1 so that signθ a,ne a m )θm,nea δ. m ne a The asolute value of the coefficient G of the gradient in A.26) is hence ounded with proaility 1 Oexp cn ε )) y A.31) G ˆθ a,nea,λ ) δλ + 2n 1 X a Xˆθ a,nea,λ,v. Conditional on X cla = {X k ; k cl a }, the random variale X a Xˆθ a,nea,λ k,v is normally distriuted with mean zero and variance σ 2 X a Xˆθ a,nea,λ 2 2. On the one hand, σ 2 1. On the other hand, y definition of ˆθ a,nea,λ, Thus X a Xˆθ a,nea,λ 2 X a 2. 2n 1 X a Xˆθ a,nea,λ,v is stochastically smaller than or equal to 2n 1 X a,v. Using A.31), it remains to e shown that for some c > 0 and δ < 1, P 2n 1 X a,v 1 δ)λ) = Oexp cn ε )).

25 VARIABLE SELECTION WITH THE LASSO 25 As V and X a are independent, EX a V ) = 0. Using the Gaussianity and ounded variance of oth X a and V, there exists some g < so that Eexp X a V )) g. Hence, using Bernstein s inequality and the oundedness of λ, for some c > 0, for all ne a, P 2n 1 X a,v 1 δ)λ) = Oexp cnλ 2 )). The claim A.27) follows, which completes the proof. Proof of Proposition 3. Following a similar argument as in Theorem 1 up to A.27), it is sufficient to show that for every a, with Γn)\cl a and S a ) > 1, A.32) P G ˆθ a,nea,λ ) > λ) 1 for n. Using A.30) in the proof of Theorem 1, one can conclude that for some δ > 1, with proaility converging to 1 for n, A.33) G ˆθ a,nea,λ ) δλ 2n 1 X a Xˆθ a,nea,λ,v. Using the identical argument as in the proof of Theorem 1 elow A.31), for the second term, for any g > 0, P 2n 1 X a Xˆθ a,nea,λ,v > gλ) 0 for n, which together with δ > 1 in A.33) shows that A.32) holds. This completes the proof. Proof of Theorem 2. First, Pne a ˆne λ a,λ a) = 1 P ne a : ˆθ = 0). Let E e again the event A.34) max k Γn)\cl a G k ˆθ a,nea,λ ) < λ. ˆθ a,nea,λ Conditional on E, we can conclude as in the proof of Theorem 1 that and ˆθ a,λ are unique solutions to A.1) and 3), respectively, and ˆθ a,nea,λ = ˆθ a,λ. Thus a,λ a,nea,λ P ne a : ˆθ = 0) P ne a : ˆθ = 0) + PE c ). It follows from the proof of Theorem 1 that there exists some c > 0 so that PE c ) = Oexp cn ε )). Using Bonferroni s inequality, it hence remains to show that there exists some c > 0 so that for all ne a, A.35) Pˆθ a,nea,λ = 0) = Oexp cn ε )). This follows from Lemma A.2, which completes the proof. Proof of Proposition 4. The proof of Proposition 4 is to a large extent analogous to the proofs of Theorems 1 and 2. Let E e again the event A.34). Conditional on the event E, we can conclude as efore that

HIGH-DIMENSIONAL GRAPHS AND VARIABLE SELECTION WITH THE LASSO

The Annals of Statistics 2006, Vol. 34, No. 3, 1436 1462 DOI: 10.1214/009053606000000281 Institute of Mathematical Statistics, 2006 HIGH-DIMENSIONAL GRAPHS AND VARIABLE SELECTION WITH THE LASSO BY NICOLAI