The Annals of Statistics (2006)
1. High dimensional graphs and variable selection with the Lasso
Nicolai Meinshausen and Peter Bühlmann, The Annals of Statistics (2006)
Presented by Jee Young Moon, Feb. 19
2. Inverse covariance matrix
A zero entry in $\Sigma^{-1}$ = conditional independence ($X_a \perp X_b$ given all the remaining variables) = no edge between these two variables (nodes).
Traditionally: Dempster (1972) introduced covariance selection, i.e., discovering the conditional independence restrictions. Forward search: edges are added iteratively, with an MLE fit (Speed and Kiiveri, 1986) for $O(p^2)$ different models. But the existence of the MLE is not guaranteed in general if the number of observations is smaller than the number of nodes (Buhl, 1993).
Neighborhood selection with the Lasso (this paper): optimization of a convex function, applied consecutively to each node in the graph.
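The correspondence between zeros of $\Sigma^{-1}$ and missing edges can be checked numerically. A minimal numpy sketch, using a hypothetical 3-node chain graph (not an example from the paper): the precision matrix has $\Omega_{02} = 0$, so $X_0$ and $X_2$ are conditionally independent given $X_1$ and their partial correlation is exactly zero.

```python
import numpy as np

# Precision (inverse covariance) matrix with Omega[0, 2] = 0:
# X0 and X2 are conditionally independent given X1, so the
# Gaussian graphical model has no edge between nodes 0 and 2.
Omega = np.array([[2.0, 1.0, 0.0],
                  [1.0, 2.0, 1.0],
                  [0.0, 1.0, 2.0]])
Sigma = np.linalg.inv(Omega)

def partial_corr(Omega, a, b):
    """Partial correlation of (X_a, X_b) given the rest, read off
    the precision matrix: -Omega[a,b] / sqrt(Omega[a,a] * Omega[b,b])."""
    return -Omega[a, b] / np.sqrt(Omega[a, a] * Omega[b, b])

print(partial_corr(Omega, 0, 2))             # exactly zero: no edge 0-2
print(round(partial_corr(Omega, 0, 1), 3))   # -0.5: edge between 0 and 1
```

The matrix above is positive definite, so it is a valid Gaussian precision matrix.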
3. Neighborhood
The neighborhood $\mathrm{ne}_a$ of a node $a \in \Gamma$ is the smallest subset of $\Gamma \setminus \{a\}$ so that, given $X_{\mathrm{ne}_a}$, $X_a$ is conditionally independent of all remaining variables:
$$\mathrm{ne}_a = \{b \in \Gamma \setminus \{a\} : (a,b) \in E\}.$$
4. Notation
$p(n) = |\Gamma(n)|$: the number of nodes (the number of variables); $n$: the number of observations.
Optimal prediction of $X_a$ given all remaining variables:
$$\theta^a = \arg\min_{\theta : \theta_a = 0} E\Big(X_a - \sum_{k \in \Gamma(n)} \theta_k X_k\Big)^2.$$
Optimal prediction $\theta^{a,A}$ restricted to an active set $A \subseteq \Gamma(n) \setminus \{a\}$:
$$\theta^{a,A} = \arg\min_{\theta : \theta_k = 0,\ \forall k \notin A} E\Big(X_a - \sum_{k \in \Gamma(n)} \theta_k X_k\Big)^2.$$
Relation to conditional independence: $\theta^a_b = -\Sigma^{-1}_{ab} / \Sigma^{-1}_{aa}$, so that
$$\mathrm{ne}_a = \{b \in \Gamma(n) : \theta^a_b \neq 0\}.$$
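The identity $\theta^a_b = -\Sigma^{-1}_{ab}/\Sigma^{-1}_{aa}$ can be verified directly: the population best-linear-predictor coefficients $\Sigma_{A,A}^{-1}\Sigma_{A,a}$ (with $A = \Gamma\setminus\{a\}$) coincide with the rescaled row of the precision matrix. A small numpy check on a hypothetical 3-variable precision matrix:

```python
import numpy as np

Omega = np.array([[2.0, 1.0, 0.0],
                  [1.0, 2.0, 1.0],
                  [0.0, 1.0, 2.0]])
Sigma = np.linalg.inv(Omega)

a = 0
rest = [1, 2]
# Population coefficients of the best linear predictor of X_a from the
# remaining variables: Sigma_{rest,rest}^{-1} Sigma_{rest,a}.
theta = np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[rest, a])

# The same coefficients read off the precision matrix: -Omega[a,b] / Omega[a,a].
theta_from_Omega = np.array([-Omega[a, b] / Omega[a, a] for b in rest])

print(np.allclose(theta, theta_from_Omega))  # True
```

Note that the coefficient on $X_2$ is zero, matching $\Omega_{02} = 0$: the neighborhood of node 0 is $\{1\}$.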
5. Neighborhood selection with the Lasso
Lasso estimate $\hat\theta^{a,\lambda}$ of $\theta^a$:
$$\hat\theta^{a,\lambda} = \arg\min_{\theta : \theta_a = 0} \big(n^{-1}\|X_a - X\theta\|^2 + \lambda\|\theta\|_1\big). \tag{3}$$
Neighborhood estimate:
$$\hat{\mathrm{ne}}^\lambda_a = \{b \in \Gamma(n) : \hat\theta^{a,\lambda}_b \neq 0\}.$$
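A minimal sketch of this per-node regression with scikit-learn's `Lasso` (assumed available), on data simulated from a hypothetical sparse chain graph; the fixed penalty `lam` is an illustrative choice, not the paper's rate. Note the objective scaling: sklearn minimizes $(2n)^{-1}\|y - Z\theta\|^2 + \alpha\|\theta\|_1$, while (3) uses $n^{-1}\|\cdot\|^2 + \lambda\|\cdot\|_1$, so `alpha = lam / 2`.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Sample from a Gaussian with a sparse precision matrix (chain graph 0-1-2).
Omega = np.array([[2.0, 1.0, 0.0],
                  [1.0, 2.0, 1.0],
                  [0.0, 1.0, 2.0]])
Sigma = np.linalg.inv(Omega)
n = 2000
X = rng.multivariate_normal(np.zeros(3), Sigma, size=n)

def neighborhood(X, a, lam):
    """Estimated Lasso neighborhood of node a: regress X_a on all other
    variables and keep the indices with nonzero coefficients."""
    others = [k for k in range(X.shape[1]) if k != a]
    fit = Lasso(alpha=lam / 2, fit_intercept=False).fit(X[:, others], X[:, a])
    return {others[j] for j, c in enumerate(fit.coef_) if c != 0.0}

print(neighborhood(X, 0, lam=0.1))  # with these settings, typically {1}
```

Repeating this for every node $a$ and combining the estimated neighborhoods (slide 20) yields the estimated graph.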
6. The (unavailable) prediction-oracle penalty
$$\lambda_{\text{oracle}} = \arg\min_{\lambda} E\Big(X_a - \sum_{k \in \Gamma(n)} \hat\theta^{a,\lambda}_k X_k\Big)^2.$$
Proposition 1. Let the number of variables grow to infinity, $p(n) \to \infty$ for $n \to \infty$, with $p(n) = o(n^\gamma)$ for some $\gamma > 0$. Assume that the covariance matrices $\Sigma(n)$ are identical to the identity matrix, except for some pair $(a,b) \in \Gamma(n) \times \Gamma(n)$ for which $\Sigma_{ab}(n) = \Sigma_{ba}(n) = s$, for some $0 < s < 1$ and all $n \in \mathbb{N}$. Then the probability of selecting the wrong neighborhood for node $a$ converges to 1 under the prediction-oracle penalty:
$$P(\hat{\mathrm{ne}}^{\lambda_{\text{oracle}}}_a \neq \mathrm{ne}_a) \to 1 \quad \text{for } n \to \infty.$$
7. Proof of Proposition 1
Taking $a = 1$, $b = 2$ and writing $K = \Sigma^{-1}$, here $\theta^a = (0, -K_{ab}/K_{aa}, 0, 0, \ldots) = (0, s, 0, 0, \ldots)$. For $\hat{\mathrm{ne}}^\lambda_a = \mathrm{ne}_a$, the oracle Lasso solution must be of the form $\hat\theta^{a,\lambda} = (0, \tau, 0, 0, \ldots)$ for some $\tau \neq 0$. It therefore suffices to show:
1. $P(\exists\, \lambda, \tau \geq s : \hat\theta^{a,\lambda} = (0, \tau, 0, 0, \ldots)) \to 0$ as $n \to \infty$; and
2. $(0, \tau, 0, 0, \ldots)$ cannot be the oracle Lasso solution as long as $\tau < s$.
Part 1. If $\hat\theta = (0, \tau, 0, \ldots)$ is a Lasso solution, then from Lemma 1 and positivity of $\tau$,
$$\langle X_1 - \tau X_2, X_2 \rangle \geq |\langle X_1 - \tau X_2, X_k \rangle| \quad \forall k \in \Gamma(n),\ k > 2.$$
Substituting $X_1 = sX_2 + W_1$ yields
$$\langle W_1, X_2 \rangle - (\tau - s)\langle X_2, X_2 \rangle \geq \langle W_1, X_k \rangle - (\tau - s)\langle X_2, X_k \rangle.$$
Let $U_k = \langle W_1, X_k \rangle$; the $U_k$, $k = 2, \ldots, p(n)$, are exchangeable. Let
$$D = \langle X_2, X_2 \rangle - \max_{k \in \Gamma(n),\, k > 2} \langle X_2, X_k \rangle.$$
It is sufficient to show
$$P\big(U_2 > \max_{k \in \Gamma(n),\, k > 2} U_k + (\tau - s)D\big) \to 0 \quad \text{for } n \to \infty.$$
8. Since $\tau - s > 0$,
$$P\big(U_2 > \max_{k > 2} U_k + (\tau - s)D\big) \leq P\big(U_2 > \max_{k > 2} U_k\big) \quad \text{when } D \geq 0,$$
while the left-hand side is trivially at most 1 when $D < 0$. Hence
$$P\big(U_2 > \max_{k > 2} U_k + (\tau - s)D\big) \leq P\big(U_2 > \max_{k > 2} U_k\big) + P(D < 0).$$
By Bernstein's inequality and $p(n) = o(n^\gamma)$, $P(D < 0) \to 0$ for $n \to \infty$. Since $U_2, \ldots, U_{p(n)}$ are exchangeable,
$$P\big(U_2 > \max_{k > 2} U_k\big) = P\big(U_3 > \max_{k \neq 3} U_k\big) = \cdots = P\big(U_{p(n)} > \max_{k < p(n)} U_k\big),$$
and these probabilities sum to 1. Therefore
$$P\big(U_2 > \max_{k > 2} U_k\big) = (p(n) - 1)^{-1} \to 0 \quad \text{for } n \to \infty.$$
9. Part 2: $(0, \tau, 0, \ldots)$ with $\tau < s$ cannot be the oracle Lasso solution.
Suppose $(0, \tau_{\max}, 0, 0, \ldots)$ is the Lasso solution $\hat\theta^{a,\lambda}$ for some $\lambda = \lambda^* > 0$ with $\tau_{\max} < s$. Since $\tau_{\max}$ is the maximal value such that $(0, \tau, 0, \ldots)$ is a Lasso solution, there exists some $k \in \Gamma(n)$, $k > 2$, such that
$$n^{-1}\langle X_1 - \tau_{\max} X_2, X_2 \rangle = n^{-1}|\langle X_1 - \tau_{\max} X_2, X_k \rangle| = \lambda^*.$$
For sufficiently small $\delta\lambda \geq 0$, a Lasso solution for the penalty $\lambda^* - \delta\lambda$ is given by $(0, \tau_{\max} + \delta\theta_2, \delta\theta_3, 0, \ldots)$, and from LARS, $|\delta\theta_2| = |\delta\theta_3| =: \delta\theta$. Comparing the squared prediction errors of these solutions,
$$L(\delta\theta) - L(0) = -2(s - \tau_{\max})\delta\theta + 2\delta\theta^2 < 0 \quad \text{for any } 0 < \delta\theta < \tfrac{1}{2}(s - \tau_{\max}),$$
so decreasing the penalty slightly below $\lambda^*$ strictly improves prediction while bringing the noise variable $X_k$ into the model; the prediction-oracle solution therefore cannot stop at $\tau < s$.
10. Lemma 1
The Lasso estimate $\hat\theta^{a,A,\lambda}$ of $\theta^{a,A}$ is given by
$$\hat\theta^{a,A,\lambda} = \arg\min_{\theta : \theta_k = 0,\ \forall k \notin A} \big(n^{-1}\|X_a - X\theta\|^2 + \lambda\|\theta\|_1\big). \tag{10}$$
Lemma 1. Given $\theta \in \mathbb{R}^{p(n)}$, let $G(\theta)$ be the $p(n)$-dimensional vector with elements
$$G_b(\theta) = -2n^{-1}\langle X_a - X\theta, X_b \rangle.$$
A vector $\hat\theta$ with $\hat\theta_k = 0$ for all $k \in \Gamma(n)\setminus A$ is a solution of the above iff, for all $b \in A$: $G_b(\hat\theta) = -\operatorname{sign}(\hat\theta_b)\lambda$ in case $\hat\theta_b \neq 0$, and $|G_b(\hat\theta)| \leq \lambda$ in case $\hat\theta_b = 0$. Moreover, if the solution is not unique and $|G_b(\hat\theta)| < \lambda$ for some solution $\hat\theta$, then $\hat\theta_b = 0$ for all solutions of the above.
11. Proof.
Let $D(\theta)$ be the subdifferential of $n^{-1}\|X_a - X\theta\|^2 + \lambda\|\theta\|_1$ with respect to $\theta$:
$$D(\theta) = \{G(\theta) + \lambda e,\ e \in S\},$$
where $S \subset \mathbb{R}^{p(n)}$ is given by
$$S = \{e \in \mathbb{R}^{p(n)} : e_b = \operatorname{sign}(\theta_b) \text{ if } \theta_b \neq 0,\ e_b \in [-1, 1] \text{ if } \theta_b = 0\}.$$
Then $\hat\theta$ is a solution of the above iff there exists $d \in D(\hat\theta)$ with $d_b = 0$ for all $b \in A$.
12. Assumptions
$X \sim N(0, \Sigma)$.
High-dimensionality. Assumption 1: there exists $\gamma > 0$ so that $p(n) = O(n^\gamma)$ for $n \to \infty$.
Non-singularity. Assumption 2: (a) for all $a \in \Gamma(n)$ and $n \in \mathbb{N}$, $\operatorname{Var}(X_a) = 1$; (b) there exists $v^2 > 0$ so that for all $n \in \mathbb{N}$ and $a \in \Gamma(n)$, $\operatorname{Var}(X_a \mid X_{\Gamma(n)\setminus\{a\}}) \geq v^2$. [This excludes singular or nearly singular covariance matrices.]
Sparsity. Assumption 3: there exists some $0 \leq \kappa < 1$ so that $\max_{a \in \Gamma(n)} |\mathrm{ne}_a| = O(n^\kappa)$ for $n \to \infty$. [A restriction on the size of the neighborhoods.]
Assumption 4: there exists some $\vartheta < \infty$ so that for all neighboring nodes $a, b \in \Gamma(n)$ and all $n \in \mathbb{N}$, $\|\theta^{a,\,\mathrm{ne}_b\setminus\{a\}}\|_1 \leq \vartheta$. [This is fulfilled if Assumption 2 holds and the size of the overlap of neighborhoods is bounded from above by an arbitrarily large number.]
13. Magnitude of partial correlations. Assumption 5: there exist a constant $\delta > 0$ and some $\xi > \kappa$ so that for every $(a,b) \in E$, $|\pi_{ab}| \geq \delta n^{-(1-\xi)/2}$.
Neighborhood stability. Define
$$S_a(b) := \sum_{k \in \mathrm{ne}_a} \operatorname{sign}(\theta^{a,\mathrm{ne}_a}_k)\,\theta^{b,\mathrm{ne}_a}_k.$$
Assumption 6: there exists some $\delta < 1$ so that for all $a, b \in \Gamma(n)$ with $b \notin \mathrm{ne}_a$, $|S_a(b)| < \delta$.
14. Theorem 1: controlling type I error
Theorem 1. Let Assumptions 1-6 be fulfilled. Let the penalty parameter satisfy $\lambda_n \sim d n^{-(1-\varepsilon)/2}$ with some $\kappa < \varepsilon < \xi$ and $d > 0$. There exists some $c > 0$ so that, for all $a \in \Gamma(n)$,
$$P(\hat{\mathrm{ne}}^\lambda_a \subseteq \mathrm{ne}_a) = 1 - O(\exp(-cn^\varepsilon)) \quad \text{for } n \to \infty.$$
That is, the probability of falsely including any of the non-neighboring variables vanishes exponentially fast. Proposition 3 says that Assumption 6 cannot be relaxed:
Proposition 3. If there exist some $a, b \in \Gamma(n)$ with $b \notin \mathrm{ne}_a$ and $|S_a(b)| > 1$, then $P(\hat{\mathrm{ne}}^\lambda_a \subseteq \mathrm{ne}_a) \to 0$ for $n \to \infty$.
15. Proof of Theorem 1
Writing $\mathrm{cl}_a = \mathrm{ne}_a \cup \{a\}$ for the closure of node $a$,
$$P(\hat{\mathrm{ne}}^\lambda_a \subseteq \mathrm{ne}_a) = 1 - P(\exists\, b \in \Gamma(n)\setminus\mathrm{cl}_a : \hat\theta^{a,\lambda}_b \neq 0).$$
Consider the Lasso estimate $\hat\theta^{a,\mathrm{ne}_a,\lambda}$, which is constrained to have nonzero components only in $\mathrm{ne}_a$. Let $\mathcal{E}$ be the event
$$\max_{k \in \Gamma(n)\setminus\mathrm{cl}_a} |G_k(\hat\theta^{a,\mathrm{ne}_a,\lambda})| < \lambda.$$
On this event, by Lemma 1, $\hat\theta^{a,\mathrm{ne}_a,\lambda}$ is a solution of (3) with $A = \Gamma(n)\setminus\{a\}$ as well as a solution of (10). Hence
$$P(\exists\, b \in \Gamma(n)\setminus\mathrm{cl}_a : \hat\theta^{a,\lambda}_b \neq 0) \leq 1 - P(\mathcal{E}) = P\big(\max_{k \in \Gamma(n)\setminus\mathrm{cl}_a} |G_k(\hat\theta^{a,\mathrm{ne}_a,\lambda})| \geq \lambda\big).$$
It is sufficient to show that there exists a constant $c > 0$ so that for all $b \in \Gamma(n)\setminus\mathrm{cl}_a$,
$$P(|G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda})| \geq \lambda) = O(\exp(-cn^\varepsilon)).$$
16. Proof of Theorem 1, continued
One can write, for any $b \in \Gamma(n)\setminus\mathrm{cl}_a$,
$$X_b = \sum_{m \in \mathrm{ne}_a} \theta^{b,\mathrm{ne}_a}_m X_m + V_b,$$
where $V_b \sim N(0, \sigma_b^2)$ for some $\sigma_b^2 \leq 1$, and $V_b$ is independent of $\{X_m : m \in \mathrm{cl}_a\}$. Plugging this into the gradient,
$$G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda}) = -2n^{-1}\sum_{m \in \mathrm{ne}_a} \theta^{b,\mathrm{ne}_a}_m \langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda}, X_m \rangle - 2n^{-1}\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda}, V_b \rangle.$$
By Lemma 2, there exists some $c > 0$ so that, with probability $1 - O(\exp(-cn^\varepsilon))$,
$$\operatorname{sign}(\hat\theta^{a,\mathrm{ne}_a,\lambda}_k) = \operatorname{sign}(\theta^{a,\mathrm{ne}_a}_k) \quad \forall k \in \mathrm{ne}_a.$$
17. Proof of Theorem 1, continued
With Lemma 1 and Assumption 6 we get, with probability $1 - O(\exp(-cn^\varepsilon))$ and some $\delta < 1$,
$$|G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda})| \leq \delta\lambda + 2n^{-1}|\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda}, V_b \rangle|.$$
It then remains to be shown that
$$P\big(2n^{-1}|\langle X_a, V_b \rangle| \geq (1 - \delta)\lambda\big) = O(\exp(-cn^\varepsilon)).$$
18. Theorem 2
Theorem 2. Let the assumptions of Theorem 1 be fulfilled. For $\lambda = \lambda_n$ as in Theorem 1, it holds for some $c > 0$ that
$$P(\mathrm{ne}_a \subseteq \hat{\mathrm{ne}}^\lambda_a) = 1 - O(\exp(-cn^\varepsilon)) \quad \text{for } n \to \infty.$$
Proposition 4 says that Assumption 5 cannot be relaxed:
Proposition 4. Let the assumptions of Theorem 1 be fulfilled with $\vartheta < 1$ in Assumption 4. For $a \in \Gamma(n)$, let there be some $b \in \Gamma(n)\setminus\{a\}$ with $\pi_{ab} \neq 0$ and $\pi_{ab} = O(n^{-(1-\xi)/2})$ for $n \to \infty$, for some $\xi < \varepsilon$. Then $P(b \in \hat{\mathrm{ne}}^\lambda_a) \to 0$ for $n \to \infty$.
19. Proof of Theorem 2
$$P(\mathrm{ne}_a \subseteq \hat{\mathrm{ne}}^\lambda_a) = 1 - P(\exists\, b \in \mathrm{ne}_a : \hat\theta^{a,\lambda}_b = 0).$$
Let $\mathcal{E}$ be the event
$$\max_{k \in \Gamma(n)\setminus\mathrm{cl}_a} |G_k(\hat\theta^{a,\mathrm{ne}_a,\lambda})| < \lambda.$$
As in the proof of Theorem 1, on $\mathcal{E}$ the estimate $\hat\theta^{a,\mathrm{ne}_a,\lambda}$ is a solution of (3). Then
$$P(\exists\, b \in \mathrm{ne}_a : \hat\theta^{a,\lambda}_b = 0) \leq P(\exists\, b \in \mathrm{ne}_a : \hat\theta^{a,\mathrm{ne}_a,\lambda}_b = 0) + P(\mathcal{E}^c).$$
Here $P(\mathcal{E}^c) = O(\exp(-cn^\varepsilon))$ by the proof of Theorem 1, and $P(\exists\, b \in \mathrm{ne}_a : \hat\theta^{a,\mathrm{ne}_a,\lambda}_b = 0) = O(\exp(-cn^\varepsilon))$ by Lemma 2.
20. Edge set
Ideally, the edge set would be given by
$$E = \{(a,b) : a \in \mathrm{ne}_b \wedge b \in \mathrm{ne}_a\}.$$
An estimate of the edge set is
$$\hat{E}^{\lambda,\wedge} = \{(a,b) : a \in \hat{\mathrm{ne}}^\lambda_b \wedge b \in \hat{\mathrm{ne}}^\lambda_a\} \quad \text{(AND rule)},$$
or
$$\hat{E}^{\lambda,\vee} = \{(a,b) : a \in \hat{\mathrm{ne}}^\lambda_b \vee b \in \hat{\mathrm{ne}}^\lambda_a\} \quad \text{(OR rule)}.$$
Corollary 1. Under the conditions of Theorem 2, for some $c > 0$,
$$P(\hat{E}^\lambda = E) = 1 - O(\exp(-cn^\varepsilon)) \quad \text{for } n \to \infty.$$
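Combining per-node neighborhoods into an edge set is a simple set operation; a short sketch of the AND and OR rules, with hypothetical (asymmetric) estimated neighborhoods:

```python
def edges(neighborhoods, rule="and"):
    """Combine estimated neighborhoods into an undirected edge set.

    neighborhoods: dict mapping node -> set of selected neighbors.
    rule: "and" keeps (a, b) only if each node selected the other;
          "or" keeps (a, b) if either node selected the other.
    """
    E = set()
    for a in neighborhoods:
        for b in neighborhoods[a]:
            if rule == "or" or a in neighborhoods.get(b, set()):
                E.add((min(a, b), max(a, b)))  # undirected edge
    return E

# Selection is not symmetric for the pair (0, 2): 0 picked 2, but not vice versa.
ne = {0: {1, 2}, 1: {0}, 2: set()}
print(sorted(edges(ne, "and")))  # [(0, 1)]
print(sorted(edges(ne, "or")))   # [(0, 1), (0, 2)]
```

Both rules coincide asymptotically under the conditions of Corollary 1, since then every neighborhood is recovered exactly.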
21. How to choose the penalty $\lambda$?
For any level $0 < \alpha < 1$, take the penalty
$$\lambda(\alpha) = \frac{2\hat\sigma_a}{\sqrt{n}}\,\tilde\Phi^{-1}\Big(\frac{\alpha}{2p(n)^2}\Big),$$
where $\hat\sigma_a^2 = n^{-1}\langle X_a, X_a \rangle$ and $\tilde\Phi = 1 - \Phi$ is the upper tail of the standard Gaussian distribution.
Theorem 3. Let Assumptions 1-6 be fulfilled. Using the penalty $\lambda(\alpha)$, it holds for all $n \in \mathbb{N}$ that
$$P(\exists\, a \in \Gamma(n) : \hat{C}^\lambda_a \not\subseteq C_a) \leq \alpha,$$
where $C_a$ is the connectivity component of node $a$. This constrains the probability of (falsely) connecting two distinct connectivity components of the true graph.
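The penalty $\lambda(\alpha)$ is a one-line computation; a sketch using scipy's upper-tail Gaussian quantile `norm.isf` for $\tilde\Phi^{-1}$ (the data here are hypothetical standard-normal draws, used only to exercise the formula):

```python
import numpy as np
from scipy.stats import norm

def lambda_alpha(X, a, alpha):
    """Penalty lambda(alpha) = (2 * sigma_hat_a / sqrt(n))
    * tilde-Phi^{-1}(alpha / (2 p^2)), with sigma_hat_a^2 = n^{-1} <X_a, X_a>
    and tilde-Phi the upper Gaussian tail (norm.isf)."""
    n, p = X.shape
    sigma_a = np.sqrt(np.mean(X[:, a] ** 2))
    return 2 * sigma_a / np.sqrt(n) * norm.isf(alpha / (2 * p ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))
print(lambda_alpha(X, 0, alpha=0.05))
```

Note how the penalty grows only logarithmically in $p(n)$ through the Gaussian quantile, which is what makes the level guarantee of Theorem 3 feasible in high dimensions.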
More informationMaster 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique
Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some
More informationProperties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation
Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana
More informationComputing the MLE and the EM Algorithm
ECE 830 Fall 0 Statistical Signal Processing instructor: R. Nowak Computing the MLE and the EM Algorithm If X p(x θ), θ Θ, then the MLE is the solution to the equations logp(x θ) θ 0. Sometimes these equations
More informationOn the uncertainty of the minimal distance between two confocal Keplerian orbits
On the uncertainty of the minimal distance between two confocal Keplerian orbits Giovanni F. Gronchi and Giacomo Tommei Dipartimento di Matematica, Università di Pisa Largo B. Pontecorvo 5, 56127 Pisa,
More informationRATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University
Submitted to the Annals of Statistics arxiv: arxiv:0000.0000 RATE-OPTIMAL GRAPHON ESTIMATION By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Network analysis is becoming one of the most active
More informationMATH 151, FINAL EXAM Winter Quarter, 21 March, 2014
Time: 3 hours, 8:3-11:3 Instructions: MATH 151, FINAL EXAM Winter Quarter, 21 March, 214 (1) Write your name in blue-book provided and sign that you agree to abide by the honor code. (2) The exam consists
More informationReconstruction from Anisotropic Random Measurements
Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013
More informationFrequentist-Bayesian Model Comparisons: A Simple Example
Frequentist-Bayesian Model Comparisons: A Simple Example Consider data that consist of a signal y with additive noise: Data vector (N elements): D = y + n The additive noise n has zero mean and diagonal
More informationArchitectures and Algorithms for Distributed Generation Control of Inertia-Less AC Microgrids
Architectures and Algorithms for Distributed Generation Control of Inertia-Less AC Microgrids Alejandro D. Domínguez-García Coordinated Science Laboratory Department of Electrical and Computer Engineering
More informationThe Iterated Lasso for High-Dimensional Logistic Regression
The Iterated Lasso for High-Dimensional Logistic Regression By JIAN HUANG Department of Statistics and Actuarial Science, 241 SH University of Iowa, Iowa City, Iowa 52242, U.S.A. SHUANGE MA Division of
More informationAn Adaptive LASSO-Penalized BIC
An Adaptive LASSO-Penalized BIC Sakyajit Bhattacharya and Paul D. McNicholas arxiv:1406.1332v1 [stat.me] 5 Jun 2014 Dept. of Mathematics and Statistics, University of uelph, Canada. Abstract Mixture models
More informationGRAPHICAL MODELS FOR HIGH DIMENSIONAL GENOMIC DATA. Min Jin Ha
GRAPHICAL MODELS FOR HIGH DIMENSIONAL GENOMIC DATA Min Jin Ha A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the
More informationSparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda
Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic
More informationSUPPLEMENT TO SLOPE IS ADAPTIVE TO UNKNOWN SPARSITY AND ASYMPTOTICALLY MINIMAX. By Weijie Su and Emmanuel Candès Stanford University
arxiv: arxiv:503.08393 SUPPLEMENT TO SLOPE IS ADAPTIVE TO UNKNOWN SPARSITY AND ASYMPTOTICALLY MINIMAX By Weijie Su and Emmanuel Candès Stanford University APPENDIX A: PROOFS OF TECHNICAL RESULTS As is
More informationf(x θ)dx with respect to θ. Assuming certain smoothness conditions concern differentiating under the integral the integral sign, we first obtain
0.1. INTRODUCTION 1 0.1 Introduction R. A. Fisher, a pioneer in the development of mathematical statistics, introduced a measure of the amount of information contained in an observaton from f(x θ). Fisher
More informationPrimal-dual IPM with Asymmetric Barrier
Primal-dual IPM with Asymmetric Barrier Yurii Nesterov, CORE/INMA (UCL) September 29, 2008 (IFOR, ETHZ) Yu. Nesterov Primal-dual IPM with Asymmetric Barrier 1/28 Outline 1 Symmetric and asymmetric barriers
More informationNetwork Infusion to Infer Information Sources in Networks Soheil Feizi, Ken Duffy, Manolis Kellis, and Muriel Medard
Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-214-28 December 2, 214 Network Infusion to Infer Information Sources in Networks Soheil Feizi, Ken Duffy, Manolis Kellis,
More informationSTAT 730 Chapter 4: Estimation
STAT 730 Chapter 4: Estimation Timothy Hanson Department of Statistics, University of South Carolina Stat 730: Multivariate Analysis 1 / 23 The likelihood We have iid data, at least initially. Each datum
More information