arxiv:math/ v1 [math.st] 1 Aug 2006


The Annals of Statistics, 2006, Vol. 34, No. 3. © Institute of Mathematical Statistics, 2006.

HIGH-DIMENSIONAL GRAPHS AND VARIABLE SELECTION WITH THE LASSO

By Nicolai Meinshausen and Peter Bühlmann, ETH Zürich

The pattern of zero entries in the inverse covariance matrix of a multivariate normal distribution corresponds to conditional independence restrictions between variables. Covariance selection aims at estimating those structural zeros from data. We show that neighborhood selection with the Lasso is a computationally attractive alternative to standard covariance selection for sparse high-dimensional graphs. Neighborhood selection estimates the conditional independence restrictions separately for each node in the graph and is hence equivalent to variable selection for Gaussian linear models. We show that the proposed neighborhood selection scheme is consistent for sparse high-dimensional graphs. Consistency hinges on the choice of the penalty parameter. The oracle value for optimal prediction does not lead to a consistent neighborhood estimate. Controlling instead the probability of falsely joining some distinct connectivity components of the graph, consistent estimation for sparse graphs is achieved (with exponential rates), even when the number of variables grows as the number of observations raised to an arbitrary power.

1. Introduction. Consider the $p$-dimensional multivariate normal distributed random variable $X = (X_1,\dots,X_p) \sim \mathcal{N}(\mu,\Sigma)$. This includes Gaussian linear models where, for example, $X_1$ is the response variable and $\{X_k;\ 2 \le k \le p\}$ are the predictor variables. Assuming that the covariance matrix $\Sigma$ is nonsingular, the conditional independence structure of the distribution can be conveniently represented by a graphical model $G = (\Gamma, E)$, where $\Gamma = \{1,\dots,p\}$ is the set of nodes and $E$ the set of edges in $\Gamma \times \Gamma$.
A pair $(a,b)$ is contained in the edge set $E$ if and only if $X_a$ is conditionally dependent on $X_b$, given all remaining variables $X_{\Gamma\setminus\{a,b\}} = \{X_k;\ k \in \Gamma\setminus\{a,b\}\}$. Every pair of variables not contained in the edge set is conditionally independent, given all remaining variables, and corresponds to a zero entry in the inverse covariance matrix [12].

Received May 2004; revised August. AMS 2000 subject classifications. Primary 62J07; secondary 62H20, 62F12. Key words and phrases. Linear regression, covariance selection, Gaussian graphical models, penalized regression. This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2006, Vol. 34, No. 3. This reprint differs from the original in pagination and typographic detail.

Covariance selection was introduced by Dempster [3] and aims at discovering the conditional independence restrictions (the graph) from a set of i.i.d. observations. Covariance selection traditionally relies on the discrete optimization of an objective function [5, 12]. Exhaustive search is computationally infeasible for all but very low-dimensional models. Usually, greedy forward or backward search is employed. In forward search, the initial estimate of the edge set is the empty set and edges are then added iteratively until a suitable stopping criterion is satisfied. The selection (deletion) of a single edge in this search strategy requires an MLE fit [15] for $O(p^2)$ different models. The procedure is not well suited for high-dimensional graphs. The existence of the MLE is not guaranteed in general if the number of observations is smaller than the number of nodes [1]. More disturbingly, the complexity of the procedure renders even greedy search strategies impractical for modestly sized graphs. In contrast, neighborhood selection with the Lasso, proposed in the following, relies on optimization of a convex function, applied consecutively to each node in the graph. The method is computationally very efficient and is consistent even for the high-dimensional setting, as will be shown.

Neighborhood selection is a subproblem of covariance selection. The neighborhood $\mathrm{ne}_a$ of a node $a \in \Gamma$ is the smallest subset of $\Gamma\setminus\{a\}$ so that, given all variables $X_{\mathrm{ne}_a}$ in the neighborhood, $X_a$ is conditionally independent of all remaining variables. The neighborhood of a node $a \in \Gamma$ consists of all nodes $b \in \Gamma\setminus\{a\}$ so that $(a,b) \in E$. Given $n$ i.i.d. observations of $X$, neighborhood selection aims at estimating (individually) the neighborhood of any given variable (or node).
The neighborhood selection can be cast as a standard regression problem and can be solved efficiently with the Lasso [16], as will be shown in this paper. The consistency of the proposed neighborhood selection will be shown for sparse high-dimensional graphs, where the number of variables is potentially growing as any power of the number of observations (high-dimensionality), whereas the number of neighbors of any variable is growing at most slightly slower than the number of observations (sparsity). A number of studies have examined the case of regression with a growing number of parameters as sample size increases. The closest to our setting is the recent work of Greenshtein and Ritov [8], who study consistent prediction in a triangular setup very similar to ours (see also [10]). However, the problem of consistent estimation of the model structure, which is the relevant concept for graphical models, is very different and not treated in these studies.

We study in Section 2 under which conditions, and at which rate, the neighborhood estimate with the Lasso converges to the true neighborhood. The choice of the penalty is crucial in the high-dimensional setting. The oracle penalty for optimal prediction turns out to be inconsistent for estimation of the true model. This solution might include an unbounded number of noise variables in the model. We motivate a different choice of the penalty such that the probability of falsely connecting two or more distinct connectivity components of the graph is controlled at very low levels. Asymptotically, the probability of estimating the correct neighborhood converges exponentially to 1, even when the number of nodes in the graph is growing rapidly as any power of the number of observations. As a consequence, consistent estimation of the full edge set in a sparse high-dimensional graph is possible (Section 3). Encouraging numerical results are provided in Section 4. The proposed estimate is shown to be both more accurate than the traditional forward selection MLE strategy and computationally much more efficient. The accuracy of the forward selection MLE fit is in particular poor if the number of nodes in the graph is comparable to the number of observations. In contrast, neighborhood selection with the Lasso is shown to be reasonably accurate for estimating graphs with several thousand nodes, using only a few hundred observations.

2. Neighborhood selection. Instead of assuming a fixed true underlying model, we adopt a more flexible approach similar to the triangular setup in [8]. Both the number of nodes in the graphs (number of variables), denoted by $p(n) = |\Gamma(n)|$, and the distribution (the covariance matrix) depend in general on the number of observations, so that $\Gamma = \Gamma(n)$ and $\Sigma = \Sigma(n)$. The neighborhood $\mathrm{ne}_a$ of a node $a \in \Gamma(n)$ is the smallest subset of $\Gamma(n)\setminus\{a\}$ so that $X_a$ is conditionally independent of all remaining variables. Denote the closure of node $a \in \Gamma(n)$ by $\mathrm{cl}_a := \mathrm{ne}_a \cup \{a\}$.
Then $X_a \perp \{X_k;\ k \in \Gamma(n)\setminus\mathrm{cl}_a\} \mid X_{\mathrm{ne}_a}$. For details see [12]. The neighborhood depends in general on $n$ as well. However, this dependence is notationally suppressed in the following.

It is instructive to give a slightly different definition of a neighborhood. For each node $a \in \Gamma(n)$, consider optimal prediction of $X_a$, given all remaining variables. Let $\theta^a \in \mathbb{R}^{p(n)}$ be the vector of coefficients for optimal prediction,

(1) $\theta^a = \arg\min_{\theta:\,\theta_a = 0} E\Big(X_a - \sum_{k \in \Gamma(n)} \theta_k X_k\Big)^2$.

As a generalization of (1), which will be of use later, consider optimal prediction of $X_a$, given only a subset of variables $\{X_k;\ k \in A\}$, where $A \subseteq \Gamma(n)\setminus\{a\}$. The optimal prediction is characterized by the vector $\theta^{a,A}$,

(2) $\theta^{a,A} = \arg\min_{\theta:\,\theta_k = 0,\ \forall k \notin A} E\Big(X_a - \sum_{k \in \Gamma(n)} \theta_k X_k\Big)^2$.

The elements of $\theta^a$ are determined by the inverse covariance matrix [12]. For $b \in \Gamma(n)\setminus\{a\}$ and $K(n) = \Sigma^{-1}(n)$, it holds that $\theta^a_b = -K_{ab}(n)/K_{aa}(n)$. The set of nonzero coefficients of $\theta^a$ is identical to the set $\{b \in \Gamma(n)\setminus\{a\} : K_{ab}(n) \neq 0\}$ of nonzero entries in the corresponding row vector of the inverse covariance matrix and defines precisely the set of neighbors of node $a$. The best predictor for $X_a$ is thus a linear function of variables in the set of neighbors of the node $a$ only. The set of neighbors of a node $a \in \Gamma(n)$ can hence be written as

$\mathrm{ne}_a = \{b \in \Gamma(n) : \theta^a_b \neq 0\}$.

This set corresponds to the set of effective predictor variables in regression with response variable $X_a$ and predictor variables $\{X_k;\ k \in \Gamma(n)\setminus\{a\}\}$. Given $n$ independent observations of $X \sim \mathcal{N}(0,\Sigma(n))$, neighborhood selection tries to estimate the set of neighbors of a node $a \in \Gamma(n)$. As the optimal linear prediction of $X_a$ has nonzero coefficients precisely for variables in the set of neighbors of the node $a$, it seems reasonable to try to exploit this relation.

Neighborhood selection with the Lasso. It is well known that the Lasso, introduced by Tibshirani [16], and known as Basis Pursuit in the context of wavelet regression [2], has a parsimonious property [11]. When predicting a variable $X_a$ with all remaining variables $\{X_k;\ k \in \Gamma(n)\setminus\{a\}\}$, the vanishing Lasso coefficient estimates identify asymptotically the neighborhood of node $a$ in the graph, as shown in the following. Let the $n \times p(n)$-dimensional matrix $\mathbf{X}$ contain $n$ independent observations of $X$, so that the columns $\mathbf{X}_a$ correspond for all $a \in \Gamma(n)$ to the vector of $n$ independent observations of $X_a$. Let $\langle\cdot,\cdot\rangle$ be the usual inner product on $\mathbb{R}^n$ and $\|\cdot\|_2$ the corresponding norm. The Lasso estimate $\hat\theta^{a,\lambda}$ of $\theta^a$ is given by

(3) $\hat\theta^{a,\lambda} = \arg\min_{\theta:\,\theta_a = 0} \big(n^{-1}\|\mathbf{X}_a - \mathbf{X}\theta\|_2^2 + \lambda\|\theta\|_1\big)$,

where $\|\theta\|_1 = \sum_{b \in \Gamma(n)} |\theta_b|$ is the $\ell_1$-norm of the coefficient vector.
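As an illustration of the estimator in (3), the following is a minimal sketch that minimizes $n^{-1}\|\mathbf{X}_a - \mathbf{X}\theta\|_2^2 + \lambda\|\theta\|_1$ by cyclic coordinate descent and reads off the nonzero coefficients. Coordinate descent is swapped in here for simplicity and is not the LARS algorithm used in the paper; the function names `soft_threshold` and `lasso_neighborhood`, the list-of-rows data layout and the stopping rule are assumptions made for this example.

```python
def soft_threshold(x, t):
    """Soft-thresholding operator: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def lasso_neighborhood(X, a, lam, n_sweeps=200, tol=1e-10):
    """Sketch of Lasso neighborhood selection for node `a`.

    X is a list of n rows with p entries each (hypothetical layout).
    Minimizes n^{-1} * ||X_a - X theta||_2^2 + lam * ||theta||_1
    subject to theta[a] = 0, by cyclic coordinate descent.
    Returns (set of indices with nonzero coefficient, theta).
    """
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    z = [sum(v * v for v in c) / n for c in cols]  # n^{-1} ||X_j||^2
    theta = [0.0] * p
    resid = list(cols[a])  # X_a - X theta, with theta = 0 initially
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if j == a or z[j] == 0.0:
                continue
            # n^{-1} <X_j, residual with coordinate j removed>
            rho = sum(c * r for c, r in zip(cols[j], resid)) / n + z[j] * theta[j]
            # Exact minimizer of the one-dimensional subproblem.
            new = soft_threshold(rho, lam / 2.0) / z[j]
            delta = new - theta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, cols[j])]
                theta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    neighbors = {j for j in range(p) if theta[j] != 0.0}
    return neighbors, theta


# Tiny deterministic example: column 1 duplicates column 0,
# column 2 is orthogonal to both.
X_demo = [[1, 1, 1], [-1, -1, 1], [1, 1, -1], [-1, -1, -1]]
ne_small_penalty, theta_small = lasso_neighborhood(X_demo, a=0, lam=0.5)
ne_large_penalty, _ = lasso_neighborhood(X_demo, a=0, lam=2.0)
print(ne_small_penalty, ne_large_penalty)
```

In the toy data a small penalty keeps only the duplicated column in the neighborhood, while a sufficiently large penalty empties the estimate, in line with the shrinking effect of larger values of $\lambda$.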
Normalization of all variables to a common empirical variance is recommended for the estimator in (3). The solution to (3) is not necessarily unique. However, if uniqueness fails, the set of solutions is still convex and all our results about neighborhoods (in particular Theorems 1 and 2) hold for any solution of (3).

Other regression estimates have been proposed, which are based on the $\ell_p$-norm, where $p$ is typically in the range $[0,2]$ (see [7]). A value of $p = 2$ leads to the ridge estimate, while $p = 0$ corresponds to traditional model selection. It is well known that the estimates have a parsimonious property (with some components being exactly zero) for $p \le 1$ only, while the optimization problem in (3) is only convex for $p \ge 1$. Hence $\ell_1$-constrained empirical risk minimization occupies a unique position, as $p = 1$ is the only value of $p$ for which variable selection takes place while the optimization problem is still convex and hence feasible for high-dimensional problems.

The neighborhood estimate (parameterized by $\lambda$) is defined by the nonzero coefficient estimates of the $\ell_1$-penalized regression,

$\widehat{\mathrm{ne}}{}^{\lambda}_a = \{b \in \Gamma(n) : \hat\theta^{a,\lambda}_b \neq 0\}$.

Each choice of a penalty parameter $\lambda$ thus specifies an estimate of the neighborhood $\mathrm{ne}_a$ of node $a \in \Gamma(n)$, and one is left with the choice of a suitable penalty parameter. Larger values of the penalty tend to shrink the size of the estimated set, while more variables are in general included into $\widehat{\mathrm{ne}}{}^{\lambda}_a$ if the value of $\lambda$ is diminished.

The prediction-oracle solution. A seemingly useful choice of the penalty parameter is the (unavailable) prediction-oracle value,

$\lambda_{\mathrm{oracle}} = \arg\min_{\lambda} E\Big(X_a - \sum_{k \in \Gamma(n)} \hat\theta^{a,\lambda}_k X_k\Big)^2$.

The expectation is understood to be with respect to a new $X$, which is independent of the sample on which $\hat\theta^{a,\lambda}$ is estimated. The prediction-oracle penalty minimizes the predictive risk among all Lasso estimates. An estimate of $\lambda_{\mathrm{oracle}}$ is obtained by the cross-validated choice $\lambda_{\mathrm{cv}}$.

For $\ell_0$-penalized regression it was shown by Shao [14] that the cross-validated choice of the penalty parameter is consistent for model selection under certain conditions on the size of the validation set. The prediction-oracle solution does not lead to consistent model selection for the Lasso, as shown in the following for a simple example.

Proposition 1. Let the number of variables grow to infinity, $p(n) \to \infty$, for $n \to \infty$, with $p(n) = o(n^{\gamma})$ for some $\gamma > 0$.
Assume that the covariance matrices $\Sigma(n)$ are identical to the identity matrix except for some pair $(a,b) \in \Gamma(n) \times \Gamma(n)$, for which $\Sigma_{ab}(n) = \Sigma_{ba}(n) = s$, for some $0 < s < 1$ and all $n \in \mathbb{N}$. The probability of selecting the wrong neighborhood for node $a$ converges to 1 under the prediction-oracle penalty,

$P\big(\widehat{\mathrm{ne}}{}^{\lambda_{\mathrm{oracle}}}_a \neq \mathrm{ne}_a\big) \to 1$ for $n \to \infty$.

A proof is given in the Appendix. It follows from the proof of Proposition 1 that many noise variables are included in the neighborhood estimate with the prediction-oracle solution. In fact, the probability of including noise variables with the prediction-oracle solution does not even vanish asymptotically for a fixed number of variables. If the penalty is chosen larger than the prediction-optimal value, consistent neighborhood selection is possible with the Lasso, as demonstrated in the following.

Assumptions. We make a few assumptions to prove consistency of neighborhood selection with the Lasso. We always assume availability of $n$ independent observations from $X \sim \mathcal{N}(0,\Sigma)$.

High-dimensionality. The number of variables is allowed to grow as the number of observations $n$ raised to an arbitrarily high power.

Assumption 1. There exists $\gamma > 0$, so that $p(n) = O(n^{\gamma})$ for $n \to \infty$.

In particular, it is allowed for the following analysis that the number of variables is very much larger than the number of observations, $p(n) \gg n$.

Nonsingularity. We make two regularity assumptions for the covariance matrices.

Assumption 2. For all $a \in \Gamma(n)$ and $n \in \mathbb{N}$, $\mathrm{Var}(X_a) = 1$. There exists $v^2 > 0$, so that for all $n \in \mathbb{N}$ and $a \in \Gamma(n)$, $\mathrm{Var}(X_a \mid X_{\Gamma(n)\setminus\{a\}}) \ge v^2$.

Common variance can always be achieved by appropriate scaling of the variables. A scaling to a common (empirical) variance of all variables is desirable, as the solutions would otherwise depend on the chosen units or dimensions in which they are represented. The second part of the assumption explicitly excludes singular or nearly singular covariance matrices. For singular covariance matrices, edges are not uniquely defined by the distribution and it is hence not surprising that nearly singular covariance matrices are not suitable for consistent variable selection. Note, however, that the empirical covariance matrix is a.s. singular if $p(n) > n$, which is allowed in our analysis.

Sparsity. The main assumption is the sparsity of the graph. This entails a restriction on the size of the neighborhoods of variables.

Assumption 3. There exists some $0 \le \kappa < 1$ so that $\max_{a \in \Gamma(n)} |\mathrm{ne}_a| = O(n^{\kappa})$ for $n \to \infty$.

This assumption limits the maximal possible rate of growth for the size of neighborhoods. For the next sparsity condition, consider again the definition in (2) of the optimal coefficient $\theta^{b,A}$ for prediction of $X_b$, given variables in the set $A \subset \Gamma(n)$.

Assumption 4. There exists some $\vartheta < \infty$ so that for all neighboring nodes $a, b \in \Gamma(n)$ and all $n \in \mathbb{N}$, $\|\theta^{a,\mathrm{ne}_b\setminus\{a\}}\|_1 \le \vartheta$.

This assumption is, for example, satisfied if Assumption 2 holds and the size of the overlap of neighborhoods is bounded by an arbitrarily large number from above for neighboring nodes. That is, if there exists some $m < \infty$ so that for all $n \in \mathbb{N}$,

(4) $\max_{a,b \in \Gamma(n),\ b \in \mathrm{ne}_a} |\mathrm{ne}_a \cap \mathrm{ne}_b| \le m$ for $n \to \infty$,

then Assumption 4 is satisfied. To see this, note that Assumption 2 gives a finite bound for the $\ell_2$-norm of $\theta^{a,\mathrm{ne}_b\setminus\{a\}}$, while (4) gives a finite bound for the $\ell_0$-norm. Taken together, Assumption 4 is implied.

Magnitude of partial correlations. The next assumption bounds the magnitude of partial correlations from below. The partial correlation $\pi_{ab}$ between variables $X_a$ and $X_b$ is the correlation after having eliminated the linear effects from all remaining variables $\{X_k;\ k \in \Gamma(n)\setminus\{a,b\}\}$; for details see [12].

Assumption 5. There exist a constant $\delta > 0$ and some $\xi > \kappa$, with $\kappa$ as in Assumption 3, so that for every $(a,b) \in E$, $|\pi_{ab}| \ge \delta n^{-(1-\xi)/2}$.

It will be shown below that Assumption 5 cannot be relaxed in general. Note that neighborhood selection for node $a \in \Gamma(n)$ is equivalent to simultaneously testing the null hypothesis of zero partial correlation between variable $X_a$ and all remaining variables $X_b$, $b \in \Gamma(n)\setminus\{a\}$. The null hypothesis of zero partial correlation between two variables can be tested by using the corresponding entry in the normalized inverse empirical covariance matrix. A graph estimate based on such tests has been proposed by Drton and Perlman [4]. Such a test can only be applied, however, if the number of variables is smaller than the number of observations, $p(n) \le n$, as the empirical covariance matrix is singular otherwise. Even if $p(n) = n - c$ for some constant $c > 0$, Assumption 5 would have to hold with $\xi = 1$ to have a positive power of rejecting false null hypotheses for such an estimate; that is, partial correlations would have to be bounded by a positive value from below.

Neighborhood stability. The last assumption is referred to as neighborhood stability. Using the definition of $\theta^{a,A}$ in (2), define for all $a, b \in \Gamma(n)$,

(5) $S_a(b) := \sum_{k \in \mathrm{ne}_a} \mathrm{sign}\big(\theta^{a,\mathrm{ne}_a}_k\big)\,\theta^{b,\mathrm{ne}_a}_k$.

The assumption of neighborhood stability restricts the magnitude of the quantities $S_a(b)$ for nonneighboring nodes $a, b \in \Gamma(n)$.

Assumption 6. There exists some $\delta < 1$ so that for all $a, b \in \Gamma(n)$ with $b \notin \mathrm{ne}_a$, $|S_a(b)| < \delta$.

It is shown in Proposition 3 that this assumption cannot be relaxed. We give in the following a more intuitive condition which essentially implies Assumption 6. This will justify the term neighborhood stability. Consider the definition in (1) of the optimal coefficients $\theta^a$ for prediction of $X_a$. For $\eta > 0$, define $\theta^a(\eta)$ as the optimal set of coefficients under an additional $\ell_1$-penalty,

(6) $\theta^a(\eta) := \arg\min_{\theta:\,\theta_a = 0} E\Big(X_a - \sum_{k \in \Gamma(n)} \theta_k X_k\Big)^2 + \eta\|\theta\|_1$.

The neighborhood $\mathrm{ne}_a$ of node $a$ was defined as the set of nonzero coefficients of $\theta^a$, $\mathrm{ne}_a = \{k \in \Gamma(n) : \theta^a_k \neq 0\}$. Define the disturbed neighborhood $\mathrm{ne}_a(\eta)$ as

$\mathrm{ne}_a(\eta) := \{k \in \Gamma(n) : \theta^a_k(\eta) \neq 0\}$.

It clearly holds that $\mathrm{ne}_a = \mathrm{ne}_a(0)$. The assumption of neighborhood stability is satisfied if there exists some infinitesimally small perturbation $\eta$, which may depend on $n$, so that the disturbed neighborhood $\mathrm{ne}_a(\eta)$ is identical to the undisturbed neighborhood $\mathrm{ne}_a(0)$.

Proposition 2. If there exists some $\eta > 0$ so that $\mathrm{ne}_a(\eta) = \mathrm{ne}_a(0)$, then $|S_a(b)| \le 1$ for all $b \in \Gamma(n)\setminus\mathrm{ne}_a$.

A proof is given in the Appendix. In light of Proposition 2 it seems that Assumption 6 is a very weak condition. To give one example, Assumption 6 is automatically satisfied under the much stronger assumption that the graph does not contain cycles. We give a brief reasoning for this. Consider two nonneighboring nodes $a$ and $b$. If the nodes are in different connectivity components, there is nothing left to show as $S_a(b) = 0$. If they are in the same connectivity component, then there exists one node $k \in \mathrm{ne}_a$ that separates $b$ from $\mathrm{ne}_a\setminus\{k\}$, as there is just one unique path between any two variables in the same connectivity component if the graph does not contain cycles. Using the global Markov property, the random variable $X_b$ is independent of $X_{\mathrm{ne}_a\setminus\{k\}}$, given $X_k$. The random variable $E(X_b \mid X_{\mathrm{ne}_a})$ is thus a function of $X_k$ only. As the distribution is Gaussian, $E(X_b \mid X_{\mathrm{ne}_a}) = \theta^{b,\mathrm{ne}_a}_k X_k$. By Assumption 2, $\mathrm{Var}(X_b \mid X_{\mathrm{ne}_a}) = v^2$ for some $v^2 > 0$. It follows that $\mathrm{Var}(X_b) = v^2 + (\theta^{b,\mathrm{ne}_a}_k)^2 = 1$ and hence $\theta^{b,\mathrm{ne}_a}_k = \sqrt{1 - v^2} < 1$, which implies that Assumption 6 is indeed satisfied if the graph does not contain cycles.

We mention that Assumption 6 is likewise satisfied if the inverse covariance matrices $\Sigma^{-1}(n)$ are for each $n \in \mathbb{N}$ diagonally dominant. A matrix is said to be diagonally dominant if and only if, for each row, the sum of the absolute values of the nondiagonal elements is smaller than the absolute value of the diagonal element. The proof of this is straightforward but tedious and hence is omitted.

Controlling type I errors. The asymptotic properties of Lasso-type estimates in regression have been studied in detail by Knight and Fu [11] for a fixed number of variables. Their results say that the penalty parameter $\lambda$ should decay for an increasing number of observations at least as fast as $n^{-1/2}$ to obtain an $n^{1/2}$-consistent estimate.
It turns out that a slower rate is needed for consistent model selection in the high-dimensional case where $p(n) \gg n$. However, a rate $n^{-(1-\varepsilon)/2}$ with any $\kappa < \varepsilon < \xi$ (where $\kappa, \xi$ are defined as in Assumptions 3 and 5) is sufficient for consistent neighborhood selection, even when the number of variables is growing rapidly with the number of observations.

Theorem 1. Let Assumptions 1-6 hold. Let the penalty parameter satisfy $\lambda_n \sim d\,n^{-(1-\varepsilon)/2}$ with some $\kappa < \varepsilon < \xi$ and $d > 0$. There exists some $c > 0$ so that, for all $a \in \Gamma(n)$,

$P\big(\widehat{\mathrm{ne}}{}^{\lambda}_a \subseteq \mathrm{ne}_a\big) = 1 - O(\exp(-c n^{\varepsilon}))$ for $n \to \infty$.

A proof is given in the Appendix. Theorem 1 states that the probability of (falsely) including any of the nonneighboring variables of the node $a \in \Gamma(n)$ into the neighborhood estimate vanishes exponentially fast, even though the number of nonneighboring variables may grow very rapidly with the number of observations. It is shown in the following that Assumption 6 cannot be relaxed.

Proposition 3. If there exists some $a, b \in \Gamma(n)$ with $b \notin \mathrm{ne}_a$ and $|S_a(b)| > 1$, then, for $\lambda = \lambda_n$ as in Theorem 1,

$P\big(\widehat{\mathrm{ne}}{}^{\lambda}_a \subseteq \mathrm{ne}_a\big) \to 0$ for $n \to \infty$.

A proof is given in the Appendix. Assumption 6 of neighborhood stability is hence critical for the success of Lasso neighborhood selection.

Controlling type II errors. So far it has been shown that the probability of falsely including variables into the neighborhood can be controlled by the Lasso. The question arises whether the probability of including all neighboring variables into the neighborhood estimate converges to 1 for $n \to \infty$.

Theorem 2. Let the assumptions of Theorem 1 be satisfied. For $\lambda = \lambda_n$ as in Theorem 1, for some $c > 0$,

$P\big(\mathrm{ne}_a \subseteq \widehat{\mathrm{ne}}{}^{\lambda}_a\big) = 1 - O(\exp(-c n^{\varepsilon}))$ for $n \to \infty$.

A proof is given in the Appendix. It may be of interest whether Assumption 5 could be relaxed, so that edges are detected even if the partial correlation is vanishing at a rate $n^{-(1-\xi)/2}$ for some $\xi < \kappa$. The following proposition says that $\xi > \varepsilon$ (and thus $\xi > \kappa$ as $\varepsilon > \kappa$) is a necessary condition if a stronger version of Assumption 4 holds, which is satisfied for forests and trees, for example.

Proposition 4. Let the assumptions of Theorem 1 hold with $\vartheta < 1$ in Assumption 4, except that for $a \in \Gamma(n)$, let there be some $b \in \Gamma(n)\setminus\{a\}$ with $\pi_{ab} \neq 0$ and $|\pi_{ab}| = O(n^{-(1-\xi)/2})$ for $n \to \infty$ for some $\xi < \varepsilon$. Then

$P\big(b \in \widehat{\mathrm{ne}}{}^{\lambda}_a\big) \to 0$ for $n \to \infty$.

Theorem 2 and Proposition 4 say that edges between nodes for which partial correlation vanishes at a rate $n^{-(1-\xi)/2}$ are, with probability converging to 1 for $n \to \infty$, detected if $\xi > \varepsilon$ and are undetected if $\xi < \varepsilon$.
The results do not cover the case ξ = ε, which remains a challenging question for further research.

All results so far have treated the distinction between zero and nonzero partial correlations only. The signs of partial correlations of neighboring nodes can be estimated consistently under the same assumptions and with the same rates, as can be seen in the proofs.

3. Covariance selection. It follows from Section 2 that it is possible under certain conditions to estimate the neighborhood of each node in the graph consistently, for example, $P(\widehat{\mathrm{ne}}{}^{\lambda}_a = \mathrm{ne}_a) \to 1$ for $n \to \infty$. The full graph is given by the set $\Gamma(n)$ of nodes and the edge set $E = E(n)$. The edge set contains those pairs $(a,b) \in \Gamma(n) \times \Gamma(n)$ for which the partial correlation between $X_a$ and $X_b$ is not zero. As the partial correlations are precisely nonzero for neighbors, the edge set $E \subseteq \Gamma(n) \times \Gamma(n)$ is given by

$E = \{(a,b) : a \in \mathrm{ne}_b \wedge b \in \mathrm{ne}_a\}$.

The first condition, $a \in \mathrm{ne}_b$, implies in fact the second, $b \in \mathrm{ne}_a$, and vice versa, so that the edge set is as well identical to $\{(a,b) : a \in \mathrm{ne}_b \vee b \in \mathrm{ne}_a\}$. For an estimate of the edge set of a graph, we can apply neighborhood selection to each node in the graph. A natural estimate of the edge set is then given by $\hat E^{\lambda,\wedge} \subseteq \Gamma(n) \times \Gamma(n)$, where

(7) $\hat E^{\lambda,\wedge} = \{(a,b) : a \in \widehat{\mathrm{ne}}{}^{\lambda}_b \wedge b \in \widehat{\mathrm{ne}}{}^{\lambda}_a\}$.

Note that $a \in \widehat{\mathrm{ne}}{}^{\lambda}_b$ does not necessarily imply $b \in \widehat{\mathrm{ne}}{}^{\lambda}_a$ and vice versa. We can hence also define a second, less conservative, estimate of the edge set by

(8) $\hat E^{\lambda,\vee} = \{(a,b) : a \in \widehat{\mathrm{ne}}{}^{\lambda}_b \vee b \in \widehat{\mathrm{ne}}{}^{\lambda}_a\}$.

The discrepancies between the estimates (7) and (8) are quite small in our experience. Asymptotically the difference between both estimates vanishes, as seen in the following corollary. We refer to both edge set estimates collectively with the generic notation $\hat E^{\lambda}$, as the following result holds for both of them.

Corollary 1. Under the conditions of Theorem 2, for some $c > 0$,

$P(\hat E^{\lambda} = E) = 1 - O(\exp(-c n^{\varepsilon}))$ for $n \to \infty$.

The claim follows since $|\Gamma(n)|^2 = p(n)^2 = O(n^{2\gamma})$ by Assumption 1 and neighborhood selection has an exponentially fast convergence rate as described by Theorem 2.
Corollary 1 says that the conditional independence structure of a multivariate normal distribution can be estimated consistently by combining the neighborhood estimates for all variables.
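The two ways of combining per-node neighborhood estimates into an edge set, the conservative "and" rule of (7) and the less conservative "or" rule of (8), can be sketched as follows. The function name `edge_sets`, the dictionary input and the representation of an undirected edge as a sorted pair are assumptions made for this illustration.

```python
def edge_sets(neighborhoods):
    """Combine estimated neighborhoods into two edge-set estimates.

    `neighborhoods` maps each node to the set of its estimated
    neighbors (hypothetical input format). Undirected edges are
    stored as sorted pairs (a, b) with a < b.
    Returns (e_and, e_or):
      e_and keeps (a, b) only if a is in ne_b AND b is in ne_a,
      e_or  keeps (a, b) if a is in ne_b OR  b is in ne_a.
    """
    e_and, e_or = set(), set()
    for a, ne_a in neighborhoods.items():
        for b in ne_a:
            edge = (min(a, b), max(a, b))
            e_or.add(edge)
            # Symmetric agreement is required for the "and" rule.
            if a in neighborhoods.get(b, set()):
                e_and.add(edge)
    return e_and, e_or


# Node 0 selects nodes 1 and 2; only node 1 selects node 0 back.
estimated = {0: {1, 2}, 1: {0}, 2: set()}
e_and, e_or = edge_sets(estimated)
print(sorted(e_and), sorted(e_or))
```

By construction the "and" estimate is always a subset of the "or" estimate, so any disagreement between the two shows exactly where the individual neighborhood estimates are asymmetric.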

Note that there are in total $2^{(p^2-p)/2}$ distinct graphs for a $p$-dimensional variable. However, for each of the $p$ nodes there are only $2^{p-1}$ distinct potential neighborhoods. By breaking the graph selection problem into a consecutive series of neighborhood selection problems, the complexity of the search is thus reduced substantially at the price of potential inconsistencies between neighborhood estimates. Graph estimates that apply this strategy for complexity reduction are sometimes called dependency networks [9]. The complexity of the proposed neighborhood selection for one node with the Lasso is reduced further to $O(np\min\{n,p\})$, as the Lars procedure of Efron, Hastie, Johnstone and Tibshirani [6] requires $O(\min\{n,p\})$ steps, each of complexity $O(np)$. For high-dimensional problems as in Theorems 1 and 2, where the number of variables grows as $p(n) \sim c n^{\gamma}$ for some $c > 0$ and $\gamma > 1$, this is equivalent to $O(p^{2+2/\gamma})$ computations for the whole graph. The complexity of the proposed method thus scales approximately quadratically with the number of nodes for large values of $\gamma$.

Before providing some numerical results, we discuss in the following the choice of the penalty parameter.

Finite-sample results and significance. It was shown above that consistent neighborhood and covariance selection is possible with the Lasso in a high-dimensional setting. However, the asymptotic considerations give little advice on how to choose a specific penalty parameter for a given problem. Ideally, one would like to guarantee that pairs of variables which are not contained in the edge set enter the estimate of the edge set only with very low (prespecified) probability. Unfortunately, it seems very difficult to obtain such a result as the probability of falsely including a pair of variables into the estimate of the edge set depends on the exact covariance matrix, which is in general unknown.
It is possible, however, to constrain the probability of (falsely) connecting two distinct connectivity components of the true graph. The connectivity component $C_a \subseteq \Gamma(n)$ of a node $a \in \Gamma(n)$ is the set of nodes which are connected to node $a$ by a chain of edges. The neighborhood $\mathrm{ne}_a$ is clearly part of the connectivity component $C_a$. Let $\hat C^{\lambda}_a$ be the connectivity component of $a$ in the estimated graph $(\Gamma, \hat E^{\lambda})$. For any level $0 < \alpha < 1$, consider the choice of the penalty

(9) $\lambda(\alpha) = \dfrac{2\hat\sigma_a}{\sqrt{n}}\,\tilde\Phi^{-1}\Big(\dfrac{\alpha}{2p(n)^2}\Big)$,

where $\tilde\Phi = 1 - \Phi$ [$\Phi$ is the c.d.f. of $\mathcal{N}(0,1)$] and $\hat\sigma_a^2 = n^{-1}\langle \mathbf{X}_a, \mathbf{X}_a\rangle$. The probability of falsely joining two distinct connectivity components with the estimate of the edge set is bounded by the level $\alpha$ under the choice $\lambda = \lambda(\alpha)$ of the penalty parameter, as shown in the following theorem.
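A minimal sketch of how the penalty in (9) could be evaluated for one node, using only the Python standard library. The function name `lambda_alpha` and its argument layout are assumptions made for this example; the upper-tail quantile $\tilde\Phi^{-1}(q)$ is computed as $-\Phi^{-1}(q)$, which is the same number by symmetry of the standard normal and avoids forming $1 - q$ for very small $q$.

```python
from math import sqrt
from statistics import NormalDist  # available from Python 3.8


def lambda_alpha(x_a, p, alpha):
    """Penalty lambda(alpha) for one node, following (9).

    x_a   : list of the n observations of variable X_a
    p     : number of nodes p(n) in the graph
    alpha : level in (0, 1)
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n = len(x_a)
    # sigma_hat_a^2 = n^{-1} <X_a, X_a>
    sigma_hat = sqrt(sum(v * v for v in x_a) / n)
    q = alpha / (2.0 * p ** 2)
    # Upper-tail quantile: (1 - Phi)^{-1}(q) = -Phi^{-1}(q).
    upper_quantile = -NormalDist().inv_cdf(q)
    return 2.0 * sigma_hat / sqrt(n) * upper_quantile


x_demo = [1.0, -1.0, 1.0, -1.0]
print(lambda_alpha(x_demo, p=10, alpha=0.05))
```

The penalty grows as the level $\alpha$ is lowered or as the number of nodes increases, and for fixed $\hat\sigma_a$ it shrinks with $n$ through the factor $1/\sqrt{n}$.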

Theorem 3. Let Assumptions 1-6 be satisfied. Using the penalty parameter $\lambda(\alpha)$, we have for all $n \in \mathbb{N}$ that

$P\big(\exists\, a \in \Gamma(n) : \hat C^{\lambda}_a \not\subseteq C_a\big) \le \alpha$.

A proof is given in the Appendix. This implies that if the edge set is empty ($E = \emptyset$), it is estimated by an empty set with high probability, $P(\hat E^{\lambda} = \emptyset) \ge 1 - \alpha$. Theorem 3 is a finite-sample result. The previous asymptotic results in Theorems 1 and 2 hold if the level $\alpha$ vanishes exponentially to zero for an increasing number of observations, leading to consistent edge set estimation.

4. Numerical examples. We use both the Lasso estimate from Section 3 and forward selection MLE [5, 12] to estimate sparse graphs. We found it difficult to compare numerically neighborhood selection with forward selection MLE for more than 30 nodes in the graph. The high computational complexity of the forward selection MLE made the computations for such relatively low-dimensional problems very costly already. The Lasso scheme in contrast handled with ease graphs with more than 1000 nodes, using the recent algorithm developed in [6]. Where comparison was feasible, the performance of the neighborhood selection scheme was better. The difference was particularly pronounced if the ratio of observations to variables was low, as can be seen in Table 1, which will be described in more detail below.

First we give an account of the generation of the underlying graphs which we are trying to estimate. A realization of an underlying (random) graph is given in the left panel of Figure 1. The nodes of the graph are associated with spatial location and the location of each node is distributed identically and uniformly in the two-dimensional square $[0,1]^2$. Every pair of nodes is included initially in the edge set with probability $\varphi(d/\sqrt{p})$, where $d$ is the Euclidean distance between the pair of variables and $\varphi$ is the density of the standard normal distribution. The maximum number of edges connecting to each node is limited to four to achieve the desired sparsity of the graph. Edges which connect to nodes which do not satisfy this constraint are removed randomly until the constraint is satisfied for all edges. Initially all variables have identical conditional variance and the partial correlation between neighbors is set to absolute values less than 0.25 (guaranteeing positive definiteness of the inverse covariance matrix); that is, $\Sigma^{-1}_{aa} = 1$ for all nodes $a \in \Gamma$, $\Sigma^{-1}_{ab} =$ if there is an edge connecting $a$ and $b$ and $\Sigma^{-1}_{ab} = 0$ otherwise. The diagonal elements of the corresponding covariance matrix are in general larger than 1. To achieve constant variance, all variables are finally rescaled so that the diagonal elements of $\Sigma$ are all unity. Using the Cholesky transformation of the covariance matrix, $n$ independent samples are drawn from the corresponding Gaussian distribution.

Table 1. The average number of correctly identified edges as a function of the number $k$ of falsely included edges for $n = 40$ observations and $p = 10, 20, 30$ nodes for forward selection MLE (FS), $\hat E^{\lambda,\wedge}$, $\hat E^{\lambda,\vee}$ and random guessing. (Column groups: $p = 10$, $p = 20$, $p = 30$; rows: $k$, Random, FS, and the two Lasso estimates. The numerical entries are not present in this transcription.)

Fig. 1. A realization of a graph is shown on the left, generated as described in the text. The graph consists of 1000 nodes and 1747 edges out of 499,500 distinct pairs of variables. The estimated edge set, using estimate (7) at level $\alpha = 0.05$ [see (9)], is shown in the middle. There are two erroneously included edges, marked by an arrow, while 1109 edges are correctly detected. For estimate (8) and an adjusted level as described in the text, the result is shown on the right. Again two edges are erroneously included. Not a single pair of disjoint connectivity components of the true graph has been (falsely) joined by either estimate.

The average number of edges which are correctly included into the estimate of the edge set is shown in Table 1 as a function of the number of edges which are falsely included. The accuracy of the forward selection MLE is comparable to the proposed Lasso neighborhood selection if the number of nodes is much smaller than the number of observations. The accuracy of the forward selection MLE breaks down, however, if the number of nodes is approximately equal to the number of observations. Forward selection MLE is only marginally better than random guessing in this case. Computation of the forward selection MLE (using MIM, [5]) on the same desktop took up to several hundred times longer than the Lasso neighborhood selection for the full graph. For more than 30 nodes, the differences are even more pronounced.

The Lasso neighborhood selection can be applied to hundred- or thousand-dimensional graphs, a realistic size for, for example, biological networks. A graph with 1000 nodes (following the same model as described above) and its estimates (7) and (8), using 600 observations, are shown in Figure 1. A level $\alpha = 0.05$ is used for estimate (7). For better comparison, the level $\alpha$ was adjusted for estimate (8), so that both estimates lead to the same number of included edges. There are two erroneous edge inclusions, while 1109 out of all 1747 edges have been correctly identified by either estimate. Of these 1109 edges, 907 are common to both estimates while 202 are just present in either (7) or (8).

To examine if results are critically dependent on the assumption of Gaussianity, long-tailed noise is added to the observations. Instead of $n$ i.i.d. observations of $X \sim \mathcal{N}(0,\Sigma)$, $n$ i.i.d. observations of $X + 0.1Z$ are made, where the components of $Z$ are independent and follow a $t_2$-distribution. For 10 simulations (with each 500 observations), the proportion of false rejections among all rejections increases only slightly from 0.8% (without long-tailed noise) to 1.4% (with long-tailed noise) for one of the two edge-set estimates and from 4.8% to 5.2% for the other. Our limited numerical experience suggests that the properties of the graph estimator do not seem to be critically affected by deviations from Gaussianity.

APPENDIX: PROOFS

A.1. Notation and useful lemmas. As a generalization of (3), the Lasso estimate $\hat\theta^{a,A,\lambda}$ of $\theta^{a,A}$, defined in (2), is given by

$$\hat\theta^{a,A,\lambda} = \operatorname*{arg\,min}_{\theta :\, \theta_k = 0\ \forall k \notin A} \bigl( n^{-1}\|X_a - X\theta\|_2^2 + \lambda\|\theta\|_1 \bigr). \tag{A.1}$$

The notation $\hat\theta^{a,\lambda}$ is thus just a shorthand notation for $\hat\theta^{a,\Gamma(n)\setminus\{a\},\lambda}$.

Lemma A.1. Given $\theta \in \mathbb{R}^{p(n)}$, let $G(\theta)$ be a $p(n)$-dimensional vector with elements

$$G_b(\theta) = -2n^{-1}\langle X_a - X\theta, X_b\rangle.$$

A vector $\hat\theta$ with $\hat\theta_k = 0$, $\forall k \in \Gamma(n)\setminus A$, is a solution to (A.1) iff for all $b \in A$, $G_b(\hat\theta) = -\operatorname{sign}(\hat\theta_b)\lambda$ in case $\hat\theta_b \neq 0$ and $|G_b(\hat\theta)| \le \lambda$ in case $\hat\theta_b = 0$.
Moreover, if the solution is not unique and $|G_b(\hat\theta)| < \lambda$ for some solution $\hat\theta$, then $\hat\theta_b = 0$ for all solutions of (A.1).

Proof. Denote the subdifferential of

$$n^{-1}\|X_a - X\theta\|_2^2 + \lambda\|\theta\|_1$$

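with respect to $\theta$ by $D(\theta)$. The vector $\hat\theta$ is a solution to (A.1) iff there exists an element $d \in D(\hat\theta)$ so that $d_b = 0$, $\forall b \in A$. $D(\theta)$ is given by $\{G(\theta) + \lambda e,\ e \in S\}$, where $S \subset \mathbb{R}^{p(n)}$ is given by $S := \{e \in \mathbb{R}^{p(n)} : e_b = \operatorname{sign}(\theta_b) \text{ if } \theta_b \neq 0 \text{ and } e_b \in [-1,1] \text{ if } \theta_b = 0\}$. The first part of the claim follows. The second part follows from the proof of Theorem 3.1 in [13].

Lemma A.2. Let $\hat\theta^{a,\mathrm{ne}_a,\lambda}$ be defined for every $a \in \Gamma(n)$ as in (A.1). Under the assumptions of Theorem 1, for some $c > 0$, for all $a \in \Gamma(n)$,

$$P\bigl(\operatorname{sign}(\hat\theta_b^{a,\mathrm{ne}_a,\lambda}) = \operatorname{sign}(\theta_b^a),\ \forall b \in \mathrm{ne}_a\bigr) = 1 - O(\exp(-cn^{\varepsilon})) \quad \text{for } n \to \infty.$$

For the sign-function, it is understood that $\operatorname{sign}(0) = 0$. The lemma says, in other words, that if one could restrict the Lasso estimate to have zero coefficients for all nodes which are not in the neighborhood of node $a$, then the signs of the partial correlations in the neighborhood of node $a$ are estimated consistently under the given assumptions.

Proof. Using Bonferroni's inequality, and $|\mathrm{ne}_a| = o(n)$ for $n \to \infty$, it suffices to show that there exists some $c > 0$ so that for every $a, b \in \Gamma(n)$ with $b \in \mathrm{ne}_a$,

$$P\bigl(\operatorname{sign}(\hat\theta_b^{a,\mathrm{ne}_a,\lambda}) = \operatorname{sign}(\theta_b^a)\bigr) = 1 - O(\exp(-cn^{\varepsilon})) \quad \text{for } n \to \infty.$$

Consider the definition of $\hat\theta^{a,\mathrm{ne}_a,\lambda}$ in (A.1),

$$\hat\theta^{a,\mathrm{ne}_a,\lambda} = \operatorname*{arg\,min}_{\theta :\, \theta_k = 0\ \forall k \notin \mathrm{ne}_a} \bigl( n^{-1}\|X_a - X\theta\|_2^2 + \lambda\|\theta\|_1 \bigr). \tag{A.2}$$

Assume now that component $b$ of this estimate is fixed at a constant value $\beta$. Denote this new estimate by $\tilde\theta^{a,b,\lambda}(\beta)$,

$$\tilde\theta^{a,b,\lambda}(\beta) = \operatorname*{arg\,min}_{\theta \in \Theta_{a,b}(\beta)} \bigl( n^{-1}\|X_a - X\theta\|_2^2 + \lambda\|\theta\|_1 \bigr), \tag{A.3}$$

where

$$\Theta_{a,b}(\beta) := \{\theta \in \mathbb{R}^{p(n)} : \theta_b = \beta;\ \theta_k = 0,\ \forall k \notin \mathrm{ne}_a\}.$$

There always exists a value $\beta$ (namely $\beta = \hat\theta_b^{a,\mathrm{ne}_a,\lambda}$) so that $\tilde\theta^{a,b,\lambda}(\beta)$ is identical to $\hat\theta^{a,\mathrm{ne}_a,\lambda}$. Thus, if $\operatorname{sign}(\hat\theta_b^{a,\mathrm{ne}_a,\lambda}) \neq \operatorname{sign}(\theta_b^a)$, there would exist some $\beta$ with $\operatorname{sign}(\beta)\operatorname{sign}(\theta_b^a) \le 0$ so that $\tilde\theta^{a,b,\lambda}(\beta)$ would be a solution to (A.2). Using $\operatorname{sign}(\theta_b^a) \neq 0$ for all $b \in \mathrm{ne}_a$, it is thus sufficient to show that for every $\beta$ with $\operatorname{sign}(\beta)\operatorname{sign}(\theta_b^a) < 0$, $\tilde\theta^{a,b,\lambda}(\beta)$ cannot be a solution to (A.2) with high probability.

We focus in the following on the case where $\theta_b^a > 0$ for notational simplicity. The case $\theta_b^a < 0$ follows analogously. If $\theta_b^a > 0$, it follows by Lemma A.1 that $\tilde\theta^{a,b,\lambda}(\beta)$ with $\tilde\theta_b^{a,b,\lambda}(\beta) = \beta \le 0$ can be a solution to (A.2) only if $G_b(\tilde\theta^{a,b,\lambda}(\beta)) \ge -\lambda$. Hence it suffices to show that for some $c > 0$ and all $b \in \mathrm{ne}_a$ with $\theta_b^a > 0$, for $n \to \infty$,

$$P\Bigl(\sup_{\beta \le 0}\{G_b(\tilde\theta^{a,b,\lambda}(\beta))\} < -\lambda\Bigr) = 1 - O(\exp(-cn^{\varepsilon})). \tag{A.4}$$

Let in the following $R_a^{\lambda}(\beta)$ be the $n$-dimensional vector of residuals,

$$R_a^{\lambda}(\beta) := X_a - X\tilde\theta^{a,b,\lambda}(\beta). \tag{A.5}$$

We can write $X_b$ as

$$X_b = \sum_{k \in \mathrm{ne}_a\setminus\{b\}} \theta_k^{b,\mathrm{ne}_a\setminus\{b\}} X_k + W_b, \tag{A.6}$$

where $W_b$ is independent of $\{X_k;\ k \in \mathrm{ne}_a\setminus\{b\}\}$. By straightforward calculation, using (A.6),

$$G_b(\tilde\theta^{a,b,\lambda}(\beta)) = -2n^{-1}\langle R_a^{\lambda}(\beta), W_b\rangle - \sum_{k \in \mathrm{ne}_a\setminus\{b\}} \theta_k^{b,\mathrm{ne}_a\setminus\{b\}} \bigl(2n^{-1}\langle R_a^{\lambda}(\beta), X_k\rangle\bigr).$$

By Lemma A.1, for all $k \in \mathrm{ne}_a\setminus\{b\}$, $|G_k(\tilde\theta^{a,b,\lambda}(\beta))| = |2n^{-1}\langle R_a^{\lambda}(\beta), X_k\rangle| \le \lambda$. This together with the equation above yields

$$G_b(\tilde\theta^{a,b,\lambda}(\beta)) \le -2n^{-1}\langle R_a^{\lambda}(\beta), W_b\rangle + \lambda\|\theta^{b,\mathrm{ne}_a\setminus\{b\}}\|_1. \tag{A.7}$$

Using Assumption 4, there exists some $\vartheta < \infty$, so that $\|\theta^{b,\mathrm{ne}_a\setminus\{b\}}\|_1 \le \vartheta$. For proving (A.4) it is therefore sufficient to show that there exists for every $g > 0$ some $c > 0$ so that for all $b \in \mathrm{ne}_a$ with $\theta_b^a > 0$, for $n \to \infty$,

$$P\Bigl(\inf_{\beta \le 0}\{2n^{-1}\langle R_a^{\lambda}(\beta), W_b\rangle\} > g\lambda\Bigr) = 1 - O(\exp(-cn^{\varepsilon})). \tag{A.8}$$

With a little abuse of notation, let $W^{\parallel} \subseteq \mathbb{R}^n$ be the at most $(|\mathrm{ne}_a| - 1)$-dimensional space which is spanned by the vectors $\{X_k,\ k \in \mathrm{ne}_a\setminus\{b\}\}$ and let $W^{\perp}$ be the orthogonal complement of $W^{\parallel}$ in $\mathbb{R}^n$. Split the $n$-dimensional vector $W_b$ of observations of $W_b$ into the sum of two vectors

$$W_b = W_b^{\perp} + W_b^{\parallel}, \tag{A.9}$$

where $W_b^{\parallel}$ is contained in the space $W^{\parallel} \subseteq \mathbb{R}^n$, while the remaining part $W_b^{\perp}$ is chosen orthogonal to this space (in the orthogonal complement $W^{\perp}$ of $W^{\parallel}$). The inner product in (A.8) can be written as

$$2n^{-1}\langle R_a^{\lambda}(\beta), W_b\rangle = 2n^{-1}\langle R_a^{\lambda}(\beta), W_b^{\perp}\rangle + 2n^{-1}\langle R_a^{\lambda}(\beta), W_b^{\parallel}\rangle. \tag{A.10}$$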

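By Lemma A.3 (see below), there exists for every $g > 0$ some $c > 0$ so that, for $n \to \infty$,

$$P\Bigl(\inf_{\beta \le 0}\{2n^{-1}\langle R_a^{\lambda}(\beta), W_b^{\parallel}\rangle/(1 + |\beta|)\} > -g\lambda\Bigr) = 1 - O(\exp(-cn^{\varepsilon})).$$

To show (A.8), it is sufficient to prove that there exists for every $g > 0$ some $c > 0$ so that, for $n \to \infty$,

$$P\Bigl(\inf_{\beta \le 0}\{2n^{-1}\langle R_a^{\lambda}(\beta), W_b^{\perp}\rangle - g(1 + |\beta|)\lambda\} > g\lambda\Bigr) = 1 - O(\exp(-cn^{\varepsilon})). \tag{A.11}$$

For some random variable $V_a$, independent of $X_{\mathrm{ne}_a}$, we have

$$X_a = \sum_{k \in \mathrm{ne}_a} \theta_k^a X_k + V_a.$$

Note that $V_a$ and $W_b$ are independent normally distributed random variables with variances $\sigma_a^2$ and $\sigma_b^2$, respectively. By Assumption 2, $0 < v^2 \le \sigma_b^2, \sigma_a^2 \le 1$. Note furthermore that $W_b$ and $X_{\mathrm{ne}_a\setminus\{b\}}$ are independent. Using $\theta^a = \theta^{a,\mathrm{ne}_a}$ and (A.6),

$$X_a = \sum_{k \in \mathrm{ne}_a\setminus\{b\}} \bigl(\theta_k^a + \theta_b^a\,\theta_k^{b,\mathrm{ne}_a\setminus\{b\}}\bigr) X_k + \theta_b^a W_b + V_a. \tag{A.12}$$

Using (A.12), the definition of the residuals in (A.5) and the orthogonality property of $W_b^{\perp}$,

$$\begin{aligned} 2n^{-1}\langle R_a^{\lambda}(\beta), W_b^{\perp}\rangle &= 2n^{-1}(\theta_b^a - \beta)\langle W_b^{\perp}, W_b^{\perp}\rangle + 2n^{-1}\langle V_a, W_b^{\perp}\rangle \\ &\ge 2n^{-1}(\theta_b^a - \beta)\langle W_b^{\perp}, W_b^{\perp}\rangle - |2n^{-1}\langle V_a, W_b^{\perp}\rangle|. \end{aligned} \tag{A.13}$$

The second term, $|2n^{-1}\langle V_a, W_b^{\perp}\rangle|$, is stochastically smaller than $|2n^{-1}\langle V_a, W_b\rangle|$ (this can be derived by conditioning on $\{X_k;\ k \in \mathrm{ne}_a\}$). Due to independence of $V_a$ and $W_b$, $E(V_a W_b) = 0$. Using Bernstein's inequality (a lemma in [17]), and $\lambda \sim dn^{-(1-\varepsilon)/2}$ with $\varepsilon > 0$, there exists for every $g > 0$ some $c > 0$ so that

$$P\bigl(|2n^{-1}\langle V_a, W_b^{\perp}\rangle| \ge g\lambda\bigr) \le P\bigl(|2n^{-1}\langle V_a, W_b\rangle| \ge g\lambda\bigr) = O(\exp(-cn^{\varepsilon})). \tag{A.14}$$

Instead of (A.11), it is sufficient by (A.13) and (A.14) to show that there exists for every $g > 0$ a $c > 0$ so that, for $n \to \infty$,

$$P\Bigl(\inf_{\beta \le 0}\{2n^{-1}(\theta_b^a - \beta)\langle W_b^{\perp}, W_b^{\perp}\rangle - g(1 + |\beta|)\lambda\} > 2g\lambda\Bigr) = 1 - O(\exp(-cn^{\varepsilon})). \tag{A.15}$$

Note that $\sigma_b^{-2}\langle W_b^{\perp}, W_b^{\perp}\rangle$ follows a $\chi^2_{n-|\mathrm{ne}_a|}$ distribution. As $|\mathrm{ne}_a| = o(n)$ and $\sigma_b^2 \ge v^2$ (by Assumption 2), it follows that there exists some $k > 0$ so that for $n \ge n_0$ with some $n_0(k) \in \mathbb{N}$, and any $c > 0$,

$$P\bigl(2n^{-1}\langle W_b^{\perp}, W_b^{\perp}\rangle > k\bigr) = 1 - O(\exp(-cn^{\varepsilon})).$$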

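To show (A.15), it hence suffices to prove that for every $k, l > 0$ there exists some $n_0(k,l) \in \mathbb{N}$ so that, for all $n \ge n_0$,

$$\inf_{\beta \le 0}\{(\theta_b^a - \beta)k - l(1 + |\beta|)\lambda\} > 0. \tag{A.16}$$

By Assumption 5, $|\pi_{ab}|$ is of order at least $n^{-(1-\xi)/2}$. Using

$$\pi_{ab} = \theta_b^a \Big/ \bigl(\operatorname{Var}(X_a \mid X_{\Gamma(n)\setminus\{a\}}) / \operatorname{Var}(X_b \mid X_{\Gamma(n)\setminus\{b\}})\bigr)^{1/2}$$

and Assumption 2, this implies that there exists some $q > 0$ so that $\theta_b^a \ge qn^{-(1-\xi)/2}$. As $\lambda \sim dn^{-(1-\varepsilon)/2}$ and, by the assumptions in Theorem 1, $\xi > \varepsilon$, it follows that for every $k, l > 0$ and large enough values of $n$,

$$\theta_b^a k - l\lambda > 0.$$

It remains to show that for any $k, l > 0$ there exists some $n_0(k,l)$ so that for all $n \ge n_0$,

$$\inf_{\beta \le 0}\{-\beta k - l|\beta|\lambda\} \ge 0.$$

This follows as $\lambda \to 0$ for $n \to \infty$, which completes the proof.

Lemma A.3. Assume the conditions of Theorem 1 hold. Let $R_a^{\lambda}(\beta)$ be defined as in (A.5) and $W_b^{\parallel}$ as in (A.9). For any $g > 0$ there exists $c > 0$ so that for all $a, b \in \Gamma(n)$, for $n \to \infty$,

$$P\Bigl(\sup_{\beta \in \mathbb{R}} |2n^{-1}\langle R_a^{\lambda}(\beta), W_b^{\parallel}\rangle|/(1 + |\beta|) < g\lambda\Bigr) = 1 - O(\exp(-cn^{\varepsilon})).$$

Proof. By Schwarz's inequality,

$$|2n^{-1}\langle R_a^{\lambda}(\beta), W_b^{\parallel}\rangle|/(1 + |\beta|) \le 2n^{-1/2}\|W_b^{\parallel}\|_2\, \frac{n^{-1/2}\|R_a^{\lambda}(\beta)\|_2}{1 + |\beta|}. \tag{A.17}$$

The sum of squares of the residuals is increasing with increasing value of $\lambda$. Thus, $\|R_a^{\lambda}(\beta)\|_2^2 \le \|R_a^{\infty}(\beta)\|_2^2$. By definition of $R_a^{\lambda}$ in (A.5), and using (A.3),

$$\|R_a^{\infty}(\beta)\|_2^2 = \|X_a - \beta X_b\|_2^2,$$

and hence

$$\|R_a^{\lambda}(\beta)\|_2^2 \le (1 + |\beta|)^2 \max\{\|X_a\|_2^2, \|X_b\|_2^2\}.$$

Therefore, for any $q > 0$,

$$P\Bigl(\sup_{\beta \in \mathbb{R}} \frac{n^{-1/2}\|R_a^{\lambda}(\beta)\|_2}{1 + |\beta|} > q\Bigr) \le P\bigl(n^{-1/2}\max\{\|X_a\|_2, \|X_b\|_2\} > q\bigr).$$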

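Note that both $\|X_a\|_2^2$ and $\|X_b\|_2^2$ are $\chi^2_n$-distributed. Thus there exist $q > 1$ and $c > 0$ so that

$$P\Bigl(\sup_{\beta \in \mathbb{R}} \frac{n^{-1/2}\|R_a^{\lambda}(\beta)\|_2}{1 + |\beta|} > q\Bigr) = O(\exp(-cn^{\varepsilon})) \quad \text{for } n \to \infty. \tag{A.18}$$

It remains to show that for every $g > 0$ there exists some $c > 0$ so that

$$P\bigl(n^{-1/2}\|W_b^{\parallel}\|_2 > g\lambda\bigr) = O(\exp(-cn^{\varepsilon})) \quad \text{for } n \to \infty. \tag{A.19}$$

The expression $\sigma_b^{-2}\langle W_b^{\parallel}, W_b^{\parallel}\rangle$ is $\chi^2_{|\mathrm{ne}_a|-1}$-distributed. As $\sigma_b \le 1$ and $|\mathrm{ne}_a| = O(n^{\kappa})$, it follows that $n^{-1/2}\|W_b^{\parallel}\|_2$ is for some $t > 0$ stochastically smaller than

$$tn^{-(1-\kappa)/2}(Z/n^{\kappa})^{1/2},$$

where $Z$ is $\chi^2_{n^{\kappa}}$-distributed. Thus, for every $g > 0$,

$$P\bigl(n^{-1/2}\|W_b^{\parallel}\|_2 > g\lambda\bigr) \le P\bigl((Z/n^{\kappa}) > (g/t)^2 n^{(1-\kappa)}\lambda^2\bigr).$$

As $\lambda^{-1} = O(n^{(1-\varepsilon)/2})$, it follows that $n^{1-\kappa}\lambda^2 \ge hn^{\varepsilon-\kappa}$ for some $h > 0$ and sufficiently large $n$. By the properties of the $\chi^2$ distribution and $\varepsilon > \kappa$, by assumption in Theorem 1, claim (A.19) follows. This completes the proof.

Proof of Proposition 1. All diagonal elements of the covariance matrices $\Sigma(n)$ are equal to 1, while all off-diagonal elements vanish for all pairs except for $a, b \in \Gamma(n)$, where $\Sigma_{ab}(n) = s$ with $0 < s < 1$. Assume w.l.o.g. that $a$ corresponds to the first and $b$ to the second variable. The best vector of coefficients $\theta^a$ for linear prediction of $X_a$ is given by $\theta^a = (0, -K_{ab}/K_{aa}, 0, 0, \ldots) = (0, s, 0, 0, \ldots)$, where $K = \Sigma^{-1}(n)$. A necessary condition for $\hat{\mathrm{ne}}_a^{\lambda} = \mathrm{ne}_a$ is that $\hat\theta^{a,\lambda} = (0, \tau, 0, 0, \ldots)$ is the oracle Lasso solution for some $\tau \neq 0$. In the following, we show first that

$$P\bigl(\exists \lambda, \tau \ge s : \hat\theta^{a,\lambda} = (0, \tau, 0, 0, \ldots)\bigr) \to 0, \quad n \to \infty. \tag{A.20}$$

The proof is then completed by showing in addition that $(0, \tau, 0, 0, \ldots)$ cannot be the oracle Lasso solution as long as $\tau < s$.

We begin by showing (A.20). If $\hat\theta = (0, \tau, 0, 0, \ldots)$ is a Lasso solution for some value of the penalty, it follows that, using Lemma A.1 and positivity of $\tau$,

$$\langle X_1 - \tau X_2, X_2\rangle \ge |\langle X_1 - \tau X_2, X_k\rangle| \quad \forall k \in \Gamma(n),\ k > 2. \tag{A.21}$$

Under the given assumptions, $X_2, X_3, \ldots$ can be understood to be independently and identically distributed, while $X_1 = sX_2 + W_1$, with $W_1$ independent of $(X_2, X_3, \ldots)$. Substituting $X_1 = sX_2 + W_1$ in (A.21) yields for all $k \in \Gamma(n)$ with $k > 2$,

$$\langle W_1, X_2\rangle - (\tau - s)\langle X_2, X_2\rangle \ge |\langle W_1, X_k\rangle - (\tau - s)\langle X_2, X_k\rangle|.$$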
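The substitution can be spelled out term by term. The short derivation below uses only $X_1 = sX_2 + W_1$ and bilinearity of the inner product; it is added here as a reading aid and is not part of the original proof.

```latex
\begin{align*}
X_1 - \tau X_2 &= (sX_2 + W_1) - \tau X_2 = W_1 - (\tau - s)X_2, \\
\langle X_1 - \tau X_2, X_k\rangle &= \langle W_1, X_k\rangle - (\tau - s)\langle X_2, X_k\rangle, \qquad k \ge 2.
\end{align*}
```

Taking $k = 2$ gives the left-hand side of the inequality, and $k > 2$ gives the expression inside the absolute value on the right-hand side.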

Let $U_2, U_3, \ldots, U_{p(n)}$ be the random variables defined by $U_k = \langle W_1, X_k\rangle$. Note that the random variables $U_k$, $k = 2, \ldots, p(n)$, are exchangeable. Let furthermore

$$D = \langle X_2, X_2\rangle - \max_{k \in \Gamma(n),\, k > 2} |\langle X_2, X_k\rangle|.$$

The inequality above implies then

$$U_2 > \max_{k \in \Gamma(n),\, k > 2} U_k + (\tau - s)D.$$

To show the claim, it thus suffices to show that

$$P\Bigl(U_2 > \max_{k \in \Gamma(n),\, k > 2} U_k + (\tau - s)D\Bigr) \to 0 \quad \text{for } n \to \infty. \tag{A.22}$$

Using $\tau - s > 0$,

$$P\Bigl(U_2 > \max_{k \in \Gamma(n),\, k > 2} U_k + (\tau - s)D\Bigr) \le P\Bigl(U_2 > \max_{k \in \Gamma(n),\, k > 2} U_k\Bigr) + P(D < 0).$$

Using the assumption that $s < 1$, it follows by $p(n) = o(n^{\gamma})$ for some $\gamma > 0$ and a Bernstein-type inequality that

$$P(D < 0) \to 0 \quad \text{for } n \to \infty.$$

Furthermore, as $U_2, \ldots, U_{p(n)}$ are exchangeable,

$$P\Bigl(U_2 > \max_{k \in \Gamma(n),\, k > 2} U_k\Bigr) = (p(n) - 1)^{-1} \to 0 \quad \text{for } n \to \infty,$$

which shows that (A.22) holds. The claim (A.20) follows.

It hence suffices to show that $(0, \tau, 0, 0, \ldots)$ with $\tau < s$ cannot be the oracle Lasso solution. Let $\tau_{\max}$ be the maximal value of $\tau$ so that $(0, \tau, 0, \ldots)$ is a Lasso solution for some value $\lambda > 0$. By the previous assumption, $\tau_{\max} < s$. For $\tau < \tau_{\max}$, the vector $(0, \tau, 0, \ldots)$ cannot be the oracle Lasso solution. We show in the following that $(0, \tau_{\max}, 0, \ldots)$ cannot be an oracle Lasso solution either. Suppose that $(0, \tau_{\max}, 0, 0, \ldots)$ is the Lasso solution $\hat\theta^{a,\lambda}$ for some $\lambda = \tilde\lambda > 0$. As $\tau_{\max}$ is the maximal value such that $(0, \tau, 0, \ldots)$ is a Lasso solution, there exists some $k \in \Gamma(n)$, $k > 2$, such that

$$|n^{-1}\langle X_1 - \tau_{\max}X_2, X_2\rangle| = |n^{-1}\langle X_1 - \tau_{\max}X_2, X_k\rangle|,$$

and the value of both components $G_2$ and $G_k$ of the gradient is equal to $\tilde\lambda$. By appropriately reordering the variables we can assume that $k = 3$. Furthermore, it holds a.s. that

$$\max_{k \in \Gamma(n),\, k > 3} |\langle X_1 - \tau_{\max}X_2, X_k\rangle| < \tilde\lambda.$$

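Hence, for sufficiently small $\delta\lambda \ge 0$, a Lasso solution for the penalty $\tilde\lambda - \delta\lambda$ is given by

$$(0, \tau_{\max} + \delta\theta_2, \delta\theta_3, 0, \ldots).$$

Let $H_n$ be the empirical covariance matrix of $(X_2, X_3)$. Assume w.l.o.g. that $n^{-1}\langle X_1 - \tau_{\max}X_2, X_k\rangle > 0$ and $n^{-1}\langle X_2, X_2\rangle = n^{-1}\langle X_3, X_3\rangle = 1$. Following, for example, Efron et al. ([6], page 417), the components $(\delta\theta_2, \delta\theta_3)$ are then given by $H_n^{-1}(1,1)^T$, from which it follows that $\delta\theta_2 = \delta\theta_3$, which we abbreviate by $\delta\theta$ in the following (one can accommodate a negative sign for $n^{-1}\langle X_1 - \tau_{\max}X_2, X_k\rangle$ by reversing the sign of $\delta\theta_3$). Denote by $L_{\delta}$ the squared error loss for this solution. Then, for sufficiently small $\delta\theta$,

$$\begin{aligned} L_{\delta} - L_0 &= E\bigl(X_1 - (\tau_{\max} + \delta\theta)X_2 - \delta\theta X_3\bigr)^2 - E(X_1 - \tau_{\max}X_2)^2 \\ &= (s - (\tau_{\max} + \delta\theta))^2 + \delta\theta^2 - (s - \tau_{\max})^2 \\ &= -2(s - \tau_{\max})\delta\theta + 2\delta\theta^2. \end{aligned}$$

It holds that $L_{\delta\theta} - L_0 < 0$ for any $0 < \delta\theta < \tfrac{1}{2}(s - \tau_{\max})$, which shows that $(0, \tau, 0, \ldots)$ cannot be the oracle solution for $\tau < s$. Together with (A.20), this completes the proof.

Proof of Proposition 2. The subdifferential of the argument in (6),

$$E\Bigl(X_a - \sum_{m \in \Gamma(n)} \theta_m^a X_m\Bigr)^2 + \eta\|\theta^a\|_1,$$

with respect to $\theta_k^a$, $k \in \Gamma(n)\setminus\{a\}$, is given by

$$-2E\Bigl(\Bigl(X_a - \sum_{m \in \Gamma(n)} \theta_m^a X_m\Bigr)X_k\Bigr) + \eta e_k,$$

where $e_k \in [-1,1]$ if $\theta_k^a = 0$, and $e_k = \operatorname{sign}(\theta_k^a)$ if $\theta_k^a \neq 0$. Using the fact that $\mathrm{ne}_a(\eta) = \mathrm{ne}_a$, it follows as in Lemma A.1 that for all $k \in \mathrm{ne}_a$,

$$2E\Bigl(\Bigl(X_a - \sum_{m \in \Gamma(n)} \theta_m^a(\eta)X_m\Bigr)X_k\Bigr) = \eta\operatorname{sign}(\theta_k^a) \tag{A.23}$$

and, for $b \notin \mathrm{ne}_a$,

$$\Bigl|2E\Bigl(\Bigl(X_a - \sum_{m \in \Gamma(n)} \theta_m^a(\eta)X_m\Bigr)X_b\Bigr)\Bigr| \le \eta. \tag{A.24}$$

A variable $X_b$ with $b \notin \mathrm{ne}_a$ can be written as

$$X_b = \sum_{k \in \mathrm{ne}_a} \theta_k^{b,\mathrm{ne}_a}X_k + W_b,$$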

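where $W_b$ is independent of $\{X_k;\ k \in \mathrm{cl}_a\}$. Using this in (A.24) yields

$$\Bigl|2\sum_{k \in \mathrm{ne}_a} \theta_k^{b,\mathrm{ne}_a} E\Bigl(\Bigl(X_a - \sum_{m \in \Gamma(n)} \theta_m^a(\eta)X_m\Bigr)X_k\Bigr)\Bigr| \le \eta.$$

Using (A.23) and $\theta^a = \theta^{a,\mathrm{ne}_a}$, it follows that

$$\Bigl|\sum_{k \in \mathrm{ne}_a} \theta_k^{b,\mathrm{ne}_a}\operatorname{sign}(\theta_k^{a,\mathrm{ne}_a})\Bigr| \le 1,$$

which completes the proof.

Proof of Theorem 1. The event $\hat{\mathrm{ne}}_a^{\lambda} \nsubseteq \mathrm{ne}_a$ is equivalent to the event that there exists some node $b \in \Gamma(n)\setminus\mathrm{cl}_a$ in the set of nonneighbors of node $a$ such that the estimated coefficient $\hat\theta_b^{a,\lambda}$ is not zero. Thus

$$P(\hat{\mathrm{ne}}_a^{\lambda} \subseteq \mathrm{ne}_a) = 1 - P\bigl(\exists b \in \Gamma(n)\setminus\mathrm{cl}_a : \hat\theta_b^{a,\lambda} \neq 0\bigr). \tag{A.25}$$

Consider the Lasso estimate $\hat\theta^{a,\mathrm{ne}_a,\lambda}$, which is by (A.1) constrained to have nonzero components only in the neighborhood of node $a \in \Gamma(n)$. Using $|\mathrm{ne}_a| = O(n^{\kappa})$ with some $\kappa < 1$, we can assume w.l.o.g. that $|\mathrm{ne}_a| \le n$. This in turn implies (see, e.g., [13]) that $\hat\theta^{a,\mathrm{ne}_a,\lambda}$ is a.s. a unique solution to (A.1) with $A = \mathrm{ne}_a$. Let $E$ be the event

$$\max_{k \in \Gamma(n)\setminus\mathrm{cl}_a} |G_k(\hat\theta^{a,\mathrm{ne}_a,\lambda})| < \lambda.$$

Conditional on the event $E$, it follows from the first part of Lemma A.1 that $\hat\theta^{a,\mathrm{ne}_a,\lambda}$ is not only a solution of (A.1), with $A = \mathrm{ne}_a$, but as well a solution of (3), where $A = \Gamma(n)\setminus\{a\}$. As $\hat\theta_b^{a,\mathrm{ne}_a,\lambda} = 0$ for all $b \in \Gamma(n)\setminus\mathrm{cl}_a$, it follows from the second part of Lemma A.1 that $\hat\theta_b^{a,\lambda} = 0$, $\forall b \in \Gamma(n)\setminus\mathrm{cl}_a$. Hence

$$P\bigl(\exists b \in \Gamma(n)\setminus\mathrm{cl}_a : \hat\theta_b^{a,\lambda} \neq 0\bigr) \le 1 - P(E) = P\Bigl(\max_{k \in \Gamma(n)\setminus\mathrm{cl}_a} |G_k(\hat\theta^{a,\mathrm{ne}_a,\lambda})| \ge \lambda\Bigr),$$

where

$$G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda}) = -2n^{-1}\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda}, X_b\rangle. \tag{A.26}$$

Using Bonferroni's inequality and $p(n) = O(n^{\gamma})$ for any $\gamma > 0$, it suffices to show that there exists a constant $c > 0$ so that for all $b \in \Gamma(n)\setminus\mathrm{cl}_a$,

$$P\bigl(|G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda})| \ge \lambda\bigr) = O(\exp(-cn^{\varepsilon})). \tag{A.27}$$

One can write for any $b \in \Gamma(n)\setminus\mathrm{cl}_a$,

$$X_b = \sum_{m \in \mathrm{ne}_a} \theta_m^{b,\mathrm{ne}_a}X_m + V_b, \tag{A.28}$$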

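where $V_b \sim \mathcal{N}(0, \sigma_b^2)$ for some $\sigma_b^2 \le 1$ and $V_b$ is independent of $\{X_m;\ m \in \mathrm{cl}_a\}$. Hence

$$G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda}) = -2n^{-1}\sum_{m \in \mathrm{ne}_a} \theta_m^{b,\mathrm{ne}_a}\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda}, X_m\rangle - 2n^{-1}\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda}, V_b\rangle.$$

By Lemma A.2, there exists some $c > 0$ so that with probability $1 - O(\exp(-cn^{\varepsilon}))$,

$$\operatorname{sign}(\hat\theta_k^{a,\mathrm{ne}_a,\lambda}) = \operatorname{sign}(\theta_k^{a,\mathrm{ne}_a}) \quad \forall k \in \mathrm{ne}_a. \tag{A.29}$$

In this case by Lemma A.1

$$2n^{-1}\sum_{m \in \mathrm{ne}_a} \theta_m^{b,\mathrm{ne}_a}\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda}, X_m\rangle = \Bigl(\sum_{m \in \mathrm{ne}_a} \operatorname{sign}(\theta_m^{a,\mathrm{ne}_a})\theta_m^{b,\mathrm{ne}_a}\Bigr)\lambda.$$

If (A.29) holds, the gradient is given by

$$G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda}) = -\Bigl(\sum_{m \in \mathrm{ne}_a} \operatorname{sign}(\theta_m^{a,\mathrm{ne}_a})\theta_m^{b,\mathrm{ne}_a}\Bigr)\lambda - 2n^{-1}\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda}, V_b\rangle. \tag{A.30}$$

Using Assumption 6 and Proposition 2, there exists some $\delta < 1$ so that

$$\Bigl|\sum_{m \in \mathrm{ne}_a} \operatorname{sign}(\theta_m^{a,\mathrm{ne}_a})\theta_m^{b,\mathrm{ne}_a}\Bigr| \le \delta.$$

The absolute value of the coefficient $G_b$ of the gradient in (A.26) is hence bounded with probability $1 - O(\exp(-cn^{\varepsilon}))$ by

$$|G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda})| \le \delta\lambda + |2n^{-1}\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda}, V_b\rangle|. \tag{A.31}$$

Conditional on $X_{\mathrm{cl}_a} = \{X_k;\ k \in \mathrm{cl}_a\}$, the random variable

$$\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda}, V_b\rangle$$

is normally distributed with mean zero and variance $\sigma_b^2\|X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda}\|_2^2$. On the one hand, $\sigma_b^2 \le 1$. On the other hand, by definition of $\hat\theta^{a,\mathrm{ne}_a,\lambda}$,

$$\|X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda}\|_2 \le \|X_a\|_2.$$

Thus

$$|2n^{-1}\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda}, V_b\rangle|$$

is stochastically smaller than or equal to $|2n^{-1}\langle X_a, V_b\rangle|$. Using (A.31), it remains to be shown that for some $c > 0$ and $\delta < 1$,

$$P\bigl(|2n^{-1}\langle X_a, V_b\rangle| \ge (1 - \delta)\lambda\bigr) = O(\exp(-cn^{\varepsilon})).$$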

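As $V_b$ and $X_a$ are independent, $E(X_a V_b) = 0$. Using the Gaussianity and bounded variance of both $X_a$ and $V_b$, there exists some $g < \infty$ so that $E(\exp(|X_a V_b|)) \le g$. Hence, using Bernstein's inequality and the boundedness of $\lambda$, for some $c > 0$, for all $b \in \Gamma(n)\setminus\mathrm{cl}_a$,

$$P\bigl(|2n^{-1}\langle X_a, V_b\rangle| \ge (1 - \delta)\lambda\bigr) = O(\exp(-cn\lambda^2)).$$

The claim (A.27) follows, which completes the proof.

Proof of Proposition 3. Following a similar argument as in Theorem 1 up to (A.27), it is sufficient to show that for every $a, b$ with $b \in \Gamma(n)\setminus\mathrm{cl}_a$ and $|S_a(b)| > 1$,

$$P\bigl(|G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda})| > \lambda\bigr) \to 1 \quad \text{for } n \to \infty. \tag{A.32}$$

Using (A.30) in the proof of Theorem 1, one can conclude that for some $\delta > 1$, with probability converging to 1 for $n \to \infty$,

$$|G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda})| \ge \delta\lambda - |2n^{-1}\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda}, V_b\rangle|. \tag{A.33}$$

Using the identical argument as in the proof of Theorem 1 below (A.31), for the second term, for any $g > 0$,

$$P\bigl(|2n^{-1}\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda}, V_b\rangle| > g\lambda\bigr) \to 0 \quad \text{for } n \to \infty,$$

which together with $\delta > 1$ in (A.33) shows that (A.32) holds. This completes the proof.

Proof of Theorem 2. First,

$$P(\mathrm{ne}_a \subseteq \hat{\mathrm{ne}}_a^{\lambda}) = 1 - P\bigl(\exists b \in \mathrm{ne}_a : \hat\theta_b^{a,\lambda} = 0\bigr).$$

Let $E$ be again the event

$$\max_{k \in \Gamma(n)\setminus\mathrm{cl}_a} |G_k(\hat\theta^{a,\mathrm{ne}_a,\lambda})| < \lambda. \tag{A.34}$$

Conditional on $E$, we can conclude as in the proof of Theorem 1 that $\hat\theta^{a,\mathrm{ne}_a,\lambda}$ and $\hat\theta^{a,\lambda}$ are unique solutions to (A.1) and (3), respectively, and $\hat\theta^{a,\mathrm{ne}_a,\lambda} = \hat\theta^{a,\lambda}$. Thus

$$P\bigl(\exists b \in \mathrm{ne}_a : \hat\theta_b^{a,\lambda} = 0\bigr) \le P\bigl(\exists b \in \mathrm{ne}_a : \hat\theta_b^{a,\mathrm{ne}_a,\lambda} = 0\bigr) + P(E^c).$$

It follows from the proof of Theorem 1 that there exists some $c > 0$ so that $P(E^c) = O(\exp(-cn^{\varepsilon}))$. Using Bonferroni's inequality, it hence remains to show that there exists some $c > 0$ so that for all $b \in \mathrm{ne}_a$,

$$P\bigl(\hat\theta_b^{a,\mathrm{ne}_a,\lambda} = 0\bigr) = O(\exp(-cn^{\varepsilon})). \tag{A.35}$$

This follows from Lemma A.2, which completes the proof.

Proof of Proposition 4. The proof of Proposition 4 is to a large extent analogous to the proofs of Theorems 1 and 2. Let $E$ be again the event (A.34). Conditional on the event $E$, we can conclude as before that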

The Annals of Statistics 2006, Vol. 34, No. 3, 1436–1462. DOI: 10.1214/009053606000000281. Institute of Mathematical Statistics, 2006.


More information

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables LIB-MA, FSSM Cadi Ayyad University (Morocco) COMPSTAT 2010 Paris, August 22-27, 2010 Motivations Fan and Li (2001), Zou and Li (2008)

More information

Extended Bayesian Information Criteria for Gaussian Graphical Models

Extended Bayesian Information Criteria for Gaussian Graphical Models Extended Bayesian Information Criteria for Gaussian Graphical Models Rina Foygel University of Chicago rina@uchicago.edu Mathias Drton University of Chicago drton@uchicago.edu Abstract Gaussian graphical

More information

Analysis of Greedy Algorithms

Analysis of Greedy Algorithms Analysis of Greedy Algorithms Jiahui Shen Florida State University Oct.26th Outline Introduction Regularity condition Analysis on orthogonal matching pursuit Analysis on forward-backward greedy algorithm

More information

LASSO-TYPE RECOVERY OF SPARSE REPRESENTATIONS FOR HIGH-DIMENSIONAL DATA

LASSO-TYPE RECOVERY OF SPARSE REPRESENTATIONS FOR HIGH-DIMENSIONAL DATA The Annals of Statistics 2009, Vol. 37, No. 1, 246 270 DOI: 10.1214/07-AOS582 Institute of Mathematical Statistics, 2009 LASSO-TYPE RECOVERY OF SPARSE REPRESENTATIONS FOR HIGH-DIMENSIONAL DATA BY NICOLAI

More information

Sparsity in Underdetermined Systems

Sparsity in Underdetermined Systems Sparsity in Underdetermined Systems Department of Statistics Stanford University August 19, 2005 Classical Linear Regression Problem X n y p n 1 > Given predictors and response, y Xβ ε = + ε N( 0, σ 2

More information

Luis Manuel Santana Gallego 100 Investigation and simulation of the clock skew in modern integrated circuits. Clock Skew Model

Luis Manuel Santana Gallego 100 Investigation and simulation of the clock skew in modern integrated circuits. Clock Skew Model Luis Manuel Santana Gallego 100 Appendix 3 Clock Skew Model Xiaohong Jiang and Susumu Horiguchi [JIA-01] 1. Introduction The evolution of VLSI chips toward larger die sizes and faster clock speeds makes

More information

CPM: A Covariance-preserving Projection Method

CPM: A Covariance-preserving Projection Method CPM: A Covariance-preserving Projection Method Jieping Ye Tao Xiong Ravi Janardan Astract Dimension reduction is critical in many areas of data mining and machine learning. In this paper, a Covariance-preserving

More information

Advanced Statistics I : Gaussian Linear Model (and beyond)

Advanced Statistics I : Gaussian Linear Model (and beyond) Advanced Statistics I : Gaussian Linear Model (and beyond) Aurélien Garivier CNRS / Telecom ParisTech Centrale Outline One and Two-Sample Statistics Linear Gaussian Model Model Reduction and model Selection

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

The properties of L p -GMM estimators

The properties of L p -GMM estimators The properties of L p -GMM estimators Robert de Jong and Chirok Han Michigan State University February 2000 Abstract This paper considers Generalized Method of Moment-type estimators for which a criterion

More information

Upper Bounds for Stern s Diatomic Sequence and Related Sequences

Upper Bounds for Stern s Diatomic Sequence and Related Sequences Upper Bounds for Stern s Diatomic Sequence and Related Sequences Colin Defant Department of Mathematics University of Florida, U.S.A. cdefant@ufl.edu Sumitted: Jun 18, 01; Accepted: Oct, 016; Pulished:

More information

High-dimensional regression with unknown variance

High-dimensional regression with unknown variance High-dimensional regression with unknown variance Christophe Giraud Ecole Polytechnique march 2012 Setting Gaussian regression with unknown variance: Y i = f i + ε i with ε i i.i.d. N (0, σ 2 ) f = (f

More information

Cross-Validation with Confidence

Cross-Validation with Confidence Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation

More information

On Model Selection Consistency of Lasso

On Model Selection Consistency of Lasso On Model Selection Consistency of Lasso Peng Zhao Department of Statistics University of Berkeley 367 Evans Hall Berkeley, CA 94720-3860, USA Bin Yu Department of Statistics University of Berkeley 367

More information

Divide-and-Conquer. Reading: CLRS Sections 2.3, 4.1, 4.2, 4.3, 28.2, CSE 6331 Algorithms Steve Lai

Divide-and-Conquer. Reading: CLRS Sections 2.3, 4.1, 4.2, 4.3, 28.2, CSE 6331 Algorithms Steve Lai Divide-and-Conquer Reading: CLRS Sections 2.3, 4.1, 4.2, 4.3, 28.2, 33.4. CSE 6331 Algorithms Steve Lai Divide and Conquer Given an instance x of a prolem, the method works as follows: divide-and-conquer

More information

1 Hoeffding s Inequality

1 Hoeffding s Inequality Proailistic Method: Hoeffding s Inequality and Differential Privacy Lecturer: Huert Chan Date: 27 May 22 Hoeffding s Inequality. Approximate Counting y Random Sampling Suppose there is a ag containing

More information

Estimation of large dimensional sparse covariance matrices

Estimation of large dimensional sparse covariance matrices Estimation of large dimensional sparse covariance matrices Department of Statistics UC, Berkeley May 5, 2009 Sample covariance matrix and its eigenvalues Data: n p matrix X n (independent identically distributed)

More information

Modifying Shor s algorithm to compute short discrete logarithms

Modifying Shor s algorithm to compute short discrete logarithms Modifying Shor s algorithm to compute short discrete logarithms Martin Ekerå Decemer 7, 06 Astract We revisit Shor s algorithm for computing discrete logarithms in F p on a quantum computer and modify

More information

Weak bidders prefer first-price (sealed-bid) auctions. (This holds both ex-ante, and once the bidders have learned their types)

Weak bidders prefer first-price (sealed-bid) auctions. (This holds both ex-ante, and once the bidders have learned their types) Econ 805 Advanced Micro Theory I Dan Quint Fall 2007 Lecture 9 Oct 4 2007 Last week, we egan relaxing the assumptions of the symmetric independent private values model. We examined private-value auctions

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

arxiv: v3 [stat.me] 10 Mar 2016

arxiv: v3 [stat.me] 10 Mar 2016 Submitted to the Annals of Statistics ESTIMATING THE EFFECT OF JOINT INTERVENTIONS FROM OBSERVATIONAL DATA IN SPARSE HIGH-DIMENSIONAL SETTINGS arxiv:1407.2451v3 [stat.me] 10 Mar 2016 By Preetam Nandy,,

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information

Essential Maths 1. Macquarie University MAFC_Essential_Maths Page 1 of These notes were prepared by Anne Cooper and Catriona March.

Essential Maths 1. Macquarie University MAFC_Essential_Maths Page 1 of These notes were prepared by Anne Cooper and Catriona March. Essential Maths 1 The information in this document is the minimum assumed knowledge for students undertaking the Macquarie University Masters of Applied Finance, Graduate Diploma of Applied Finance, and

More information

Linear Models for Regression. Sargur Srihari

Linear Models for Regression. Sargur Srihari Linear Models for Regression Sargur srihari@cedar.buffalo.edu 1 Topics in Linear Regression What is regression? Polynomial Curve Fitting with Scalar input Linear Basis Function Models Maximum Likelihood

More information

Lasso-type recovery of sparse representations for high-dimensional data

Lasso-type recovery of sparse representations for high-dimensional data Lasso-type recovery of sparse representations for high-dimensional data Nicolai Meinshausen and Bin Yu Department of Statistics, UC Berkeley December 5, 2006 Abstract The Lasso (Tibshirani, 1996) is an

More information

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:

More information

A note on the group lasso and a sparse group lasso

A note on the group lasso and a sparse group lasso A note on the group lasso and a sparse group lasso arxiv:1001.0736v1 [math.st] 5 Jan 2010 Jerome Friedman Trevor Hastie and Robert Tibshirani January 5, 2010 Abstract We consider the group lasso penalty

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Online Supplementary Appendix B

Online Supplementary Appendix B Online Supplementary Appendix B Uniqueness of the Solution of Lemma and the Properties of λ ( K) We prove the uniqueness y the following steps: () (A8) uniquely determines q as a function of λ () (A) uniquely

More information

Log-mean linear regression models for binary responses with an application to multimorbidity

Log-mean linear regression models for binary responses with an application to multimorbidity Log-mean linear regression models for inary responses with an application to multimoridity arxiv:1410.0580v3 [stat.me] 16 May 2016 Monia Lupparelli and Alerto Roverato March 30, 2016 Astract In regression

More information

A strongly polynomial algorithm for linear systems having a binary solution

A strongly polynomial algorithm for linear systems having a binary solution A strongly polynomial algorithm for linear systems having a binary solution Sergei Chubanov Institute of Information Systems at the University of Siegen, Germany e-mail: sergei.chubanov@uni-siegen.de 7th

More information

The Mean Version One way to write the One True Regression Line is: Equation 1 - The One True Line

The Mean Version One way to write the One True Regression Line is: Equation 1 - The One True Line Chapter 27: Inferences for Regression And so, there is one more thing which might vary one more thing aout which we might want to make some inference: the slope of the least squares regression line. The

More information

Linear Methods for Regression. Lijun Zhang

Linear Methods for Regression. Lijun Zhang Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song Presenter: Jiwei Zhao Department of Statistics University of Wisconsin Madison April

More information

Gaussian Graphical Models and Graphical Lasso

Gaussian Graphical Models and Graphical Lasso ELE 538B: Sparsity, Structure and Inference Gaussian Graphical Models and Graphical Lasso Yuxin Chen Princeton University, Spring 2017 Multivariate Gaussians Consider a random vector x N (0, Σ) with pdf

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Proximity-Based Anomaly Detection using Sparse Structure Learning

Proximity-Based Anomaly Detection using Sparse Structure Learning Proximity-Based Anomaly Detection using Sparse Structure Learning Tsuyoshi Idé (IBM Tokyo Research Lab) Aurelie C. Lozano, Naoki Abe, and Yan Liu (IBM T. J. Watson Research Center) 2009/04/ SDM 2009 /

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Linear inversion methods and generalized cross-validation

Linear inversion methods and generalized cross-validation Online supplement to Using generalized cross-validation to select parameters in inversions for regional caron fluxes Linear inversion methods and generalized cross-validation NIR Y. KRAKAUER AND TAPIO

More information

Summary and discussion of: Controlling the False Discovery Rate via Knockoffs

Summary and discussion of: Controlling the False Discovery Rate via Knockoffs Summary and discussion of: Controlling the False Discovery Rate via Knockoffs Statistics Journal Club, 36-825 Sangwon Justin Hyun and William Willie Neiswanger 1 Paper Summary 1.1 Quick intuitive summary

More information

The variance for partial match retrievals in k-dimensional bucket digital trees

The variance for partial match retrievals in k-dimensional bucket digital trees The variance for partial match retrievals in k-dimensional ucket digital trees Michael FUCHS Department of Applied Mathematics National Chiao Tung University January 12, 21 Astract The variance of partial

More information

Asymptotic efficiency of simple decisions for the compound decision problem

Asymptotic efficiency of simple decisions for the compound decision problem Asymptotic efficiency of simple decisions for the compound decision problem Eitan Greenshtein and Ya acov Ritov Department of Statistical Sciences Duke University Durham, NC 27708-0251, USA e-mail: eitan.greenshtein@gmail.com

More information

CHAPTER V MULTIPLE SCALES..? # w. 5?œ% 0 a?ß?ß%.?.? # %?œ!.>#.>

CHAPTER V MULTIPLE SCALES..? # w. 5?œ% 0 a?ß?ß%.?.? # %?œ!.>#.> CHAPTER V MULTIPLE SCALES This chapter and the next concern initial value prolems of oscillatory type on long intervals of time. Until Chapter VII e ill study autonomous oscillatory second order initial

More information

Lecture 6 January 15, 2014

Lecture 6 January 15, 2014 Advanced Graph Algorithms Jan-Apr 2014 Lecture 6 January 15, 2014 Lecturer: Saket Sourah Scrie: Prafullkumar P Tale 1 Overview In the last lecture we defined simple tree decomposition and stated that for

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Chapter 2 Canonical Correlation Analysis

Chapter 2 Canonical Correlation Analysis Chapter 2 Canonical Correlation Analysis Canonical correlation analysis CCA, which is a multivariate analysis method, tries to quantify the amount of linear relationships etween two sets of random variales,

More information

1. Define the following terms (1 point each): alternative hypothesis

1. Define the following terms (1 point each): alternative hypothesis 1 1. Define the following terms (1 point each): alternative hypothesis One of three hypotheses indicating that the parameter is not zero; one states the parameter is not equal to zero, one states the parameter

More information

Statistical Learning with the Lasso, spring The Lasso

Statistical Learning with the Lasso, spring The Lasso Statistical Learning with the Lasso, spring 2017 1 Yeast: understanding basic life functions p=11,904 gene values n number of experiments ~ 10 Blomberg et al. 2003, 2010 The Lasso fmri brain scans function

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los

More information

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 18.466 Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 1. MLEs in exponential families Let f(x,θ) for x X and θ Θ be a likelihood function, that is, for present purposes,

More information

BTRY 4090: Spring 2009 Theory of Statistics

BTRY 4090: Spring 2009 Theory of Statistics BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)

More information