ELE 538B: Sparsity, Structure and Inference
Gaussian Graphical Models and Graphical Lasso
Yuxin Chen, Princeton University, Spring 2017
Multivariate Gaussians

Consider a random vector $x \sim \mathcal{N}(0, \Sigma)$ with pdf
\[
f(x) = \frac{1}{(2\pi)^{p/2} \det(\Sigma)^{1/2}} \exp\left( -\frac{1}{2} x^\top \Sigma^{-1} x \right)
     = \frac{\det(\Theta)^{1/2}}{(2\pi)^{p/2}} \exp\left( -\frac{1}{2} x^\top \Theta x \right),
\]
where $\Sigma = \mathbb{E}[x x^\top] \succ 0$ is the covariance matrix, and $\Theta = \Sigma^{-1}$ is the inverse covariance matrix / precision matrix.
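As a quick sanity check, the two parameterizations can be compared numerically; a minimal sketch (the covariance `Sigma` below is an arbitrary illustrative choice):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative positive definite covariance (any such matrix works)
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
Theta = np.linalg.inv(Sigma)    # precision matrix
p = Sigma.shape[0]
x = np.array([0.2, -1.0, 0.7])

# Density written in terms of the precision matrix Theta
f_theta = np.sqrt(np.linalg.det(Theta)) / (2 * np.pi) ** (p / 2) \
          * np.exp(-0.5 * x @ Theta @ x)

# Reference value parameterized by Sigma
f_sigma = multivariate_normal(mean=np.zeros(p), cov=Sigma).pdf(x)

assert np.isclose(f_theta, f_sigma)   # the two forms of the pdf agree
```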
Undirected graphical models

• Represent a collection of variables $x = [x_1, \ldots, x_p]^\top$ by a vertex set $V = \{1, \ldots, p\}$
• Encode conditional independence by a set $E$ of edges: for any pair of vertices $u$ and $v$,
\[
(u, v) \notin E \iff x_u \perp x_v \mid x_{V \setminus \{u, v\}}
\]

[Figure: an example graph in which $x_1 \perp x_4 \mid \{x_2, x_3, x_5, x_6, x_7, x_8\}$]
Gaussian graphical models

Fact 5.1 (Homework)
Consider a Gaussian vector $x \sim \mathcal{N}(0, \Sigma)$. For any $u$ and $v$,
\[
x_u \perp x_v \mid x_{V \setminus \{u, v\}} \iff \Theta_{u,v} = 0,
\]
where $\Theta = \Sigma^{-1}$.

conditional independence $\iff$ sparsity
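A small numerical illustration of Fact 5.1, with a hand-picked sparse precision matrix (the values are arbitrary):

```python
import numpy as np

# Theta[0, 2] = 0, so by Fact 5.1 x_0 and x_2 should be conditionally
# independent given x_1
Theta = np.array([[2.0, 0.6, 0.0],
                  [0.6, 2.0, 0.6],
                  [0.0, 0.6, 2.0]])
Sigma = np.linalg.inv(Theta)

# Marginally, x_0 and x_2 are still correlated: Sigma[0, 2] != 0
print(Sigma[0, 2])
# But their partial correlation given the rest is exactly zero, which
# for Gaussians is equivalent to conditional independence
print(-Theta[0, 2] / np.sqrt(Theta[0, 0] * Theta[2, 2]))  # 0.0
```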
Gaussian graphical models

[Figure: sparsity pattern of the precision matrix $\Theta$; most off-diagonal entries are zero, each zero corresponding to a missing edge in the graph]
Likelihoods for Gaussian models

Draw $n$ i.i.d. samples $x^{(1)}, \ldots, x^{(n)} \sim \mathcal{N}(0, \Sigma)$; the log-likelihood (up to an additive constant) is
\[
\ell(\Theta) = \frac{1}{n} \sum_{i=1}^{n} \log f\big(x^{(i)}\big)
            = \frac{1}{2} \log\det(\Theta) - \frac{1}{2n} \sum_{i=1}^{n} x^{(i)\top} \Theta x^{(i)}
            = \frac{1}{2} \log\det(\Theta) - \frac{1}{2} \langle S, \Theta \rangle,
\]
where $S := \frac{1}{n} \sum_{i=1}^{n} x^{(i)} x^{(i)\top}$ is the sample covariance and $\langle S, \Theta \rangle = \mathrm{tr}(S\Theta)$.

Maximum likelihood estimation:
\[
\text{maximize}_{\Theta \succ 0} \quad \log\det(\Theta) - \langle S, \Theta \rangle
\]
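When $n > p$, $S$ is typically invertible and the MLE has the closed form $\hat{\Theta} = S^{-1}$ (set the gradient $\Theta^{-1} - S$ to zero); a minimal sketch (the true covariance below is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 1000
Sigma_true = np.eye(p) + 0.3 * np.ones((p, p))                # illustrative covariance
X = rng.multivariate_normal(np.zeros(p), Sigma_true, size=n)  # rows are x^(i)

S = X.T @ X / n     # sample covariance

def log_lik(Theta, S):
    """(1/2) log det(Theta) - (1/2) <S, Theta>, up to an additive constant."""
    return 0.5 * np.linalg.slogdet(Theta)[1] - 0.5 * np.trace(S @ Theta)

Theta_mle = np.linalg.inv(S)   # stationary point of log det(Theta) - <S, Theta>
# Any other positive definite candidate attains a smaller likelihood:
assert log_lik(Theta_mle, S) >= log_lik(np.eye(p), S)
```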
Challenge in the high-dimensional regime

• Classical theory says the MLE converges to the truth as the sample size $n \to \infty$
• In practice, we are often in the regime where the sample size is small ($n < p$)
• In this regime, $S$ is rank-deficient, and the MLE does not even exist: the objective $\log\det(\Theta) - \langle S, \Theta \rangle$ is unbounded above when $S$ is singular (see the sketch below)
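A quick numerical check of this rank deficiency (a minimal sketch; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 20                          # fewer samples than dimensions
X = rng.standard_normal((n, p))
S = X.T @ X / n                        # sample covariance

print(np.linalg.matrix_rank(S))        # at most n = 20 < p = 50
# S is singular, so Theta = S^{-1} is undefined and
# log det(Theta) - <S, Theta> can be driven to +infinity
```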
Graphical lasso (Friedman, Hastie, & Tibshirani '08)

• Many pairs of variables are conditionally independent $\Longrightarrow$ many missing links in the graphical model (sparsity)
• Key idea: apply the lasso by treating each node as a response variable
\[
\text{maximize}_{\Theta \succ 0} \quad \log\det(\Theta) - \langle S, \Theta \rangle - \underbrace{\lambda \|\Theta\|_1}_{\text{lasso penalty}}
\]
• It is a convex program! (homework)
• First-order optimality condition:
\[
0 \in \Theta^{-1} - S - \lambda \underbrace{\partial \|\Theta\|_1}_{\text{subgradient}} \tag{5.1}
\]
\[
\Longrightarrow \quad \big(\Theta^{-1}\big)_{i,i} = S_{i,i} + \lambda, \qquad 1 \le i \le p
\]
(on the diagonal, $\Theta_{i,i} > 0$ for any $\Theta \succ 0$, so the subgradient of $|\Theta_{i,i}|$ equals $1$)
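For experimentation, scikit-learn ships a graphical lasso solver; a minimal usage sketch (note that scikit-learn's penalty convention may differ in details, e.g. the diagonal is typically not penalized, so its solution need not satisfy (5.1) verbatim):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
# Ground-truth sparse precision matrix: a chain graph
p = 10
Theta_true = np.eye(p) + np.diag(0.4 * np.ones(p - 1), 1) \
                       + np.diag(0.4 * np.ones(p - 1), -1)
Sigma_true = np.linalg.inv(Theta_true)

X = rng.multivariate_normal(np.zeros(p), Sigma_true, size=200)

model = GraphicalLasso(alpha=0.1).fit(X)       # alpha plays the role of lambda
Theta_hat = model.precision_                   # sparse estimate of Theta
print((np.abs(Theta_hat) > 1e-4).astype(int))  # recovered sparsity pattern
```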
Blockwise coordinate descent

Idea: repeatedly cycle through all columns/rows and, in each step, optimize over only a single column/row.

Notation: use $W$ to denote the working version of $\Theta^{-1}$. Partition all matrices into 1 column/row vs. the rest:
\[
\Theta = \begin{bmatrix} \Theta_{11} & \theta_{12} \\ \theta_{12}^\top & \theta_{22} \end{bmatrix}, \qquad
S = \begin{bmatrix} S_{11} & s_{12} \\ s_{12}^\top & s_{22} \end{bmatrix}, \qquad
W = \begin{bmatrix} W_{11} & w_{12} \\ w_{12}^\top & w_{22} \end{bmatrix}
\]
Blockwise coordinate descent

Blockwise step: suppose we fix all but the last row/column. It follows from (5.1) that
\[
0 \in W_{11} \beta - s_{12} - \lambda \, \partial \|\theta_{12}\|_1 = W_{11} \beta - s_{12} + \lambda \, \partial \|\beta\|_1 \tag{5.2}
\]
where $\beta = -\theta_{12}/\theta_{22}$, so that $w_{12} = W_{11}\beta$, by the matrix inverse formula
\[
\begin{bmatrix} \Theta_{11} & \theta_{12} \\ \theta_{12}^\top & \theta_{22} \end{bmatrix}^{-1}
= \begin{bmatrix} \cdot & \; -\Theta_{11}^{-1} \theta_{12} \, \tilde{\theta}_{22}^{-1} \\ \cdot & \tilde{\theta}_{22}^{-1} \end{bmatrix}
\qquad \text{with } \tilde{\theta}_{22} := \theta_{22} - \theta_{12}^\top \Theta_{11}^{-1} \theta_{12} > 0;
\]
since $\theta_{22} > 0$, $\mathrm{sign}(\beta) = -\mathrm{sign}(\theta_{12})$, which turns $-\lambda \, \partial\|\theta_{12}\|_1$ into $+\lambda \, \partial\|\beta\|_1$. This coincides with the optimality condition for the lasso problem
\[
\text{minimize}_\beta \quad \frac{1}{2} \left\| W_{11}^{1/2} \beta - W_{11}^{-1/2} s_{12} \right\|_2^2 + \lambda \|\beta\|_1 \tag{5.3}
\]
Blockwise coordinate descent

Algorithm 5.1 (Block coordinate descent for graphical lasso)
Initialize $W = S + \lambda I$ and fix its diagonal entries $\{w_{i,i}\}$. Repeat until convergence: for $j = 1, \ldots, p$:
(i) Partition $W$ (resp. $S$) into 4 blocks, where the upper-left block consists of all but the $j$th row/column
(ii) Solve
\[
\text{minimize}_\beta \quad \frac{1}{2} \left\| W_{11}^{1/2} \beta - W_{11}^{-1/2} s_{12} \right\|_2^2 + \lambda \|\beta\|_1
\]
(iii) Update $w_{12} = W_{11} \hat{\beta}$; set $\hat{\theta}_{12} = -\hat{\theta}_{22} \hat{\beta}$ with $\hat{\theta}_{22} = 1/\big(w_{22} - w_{12}^\top \hat{\beta}\big)$
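A didactic Python sketch of Algorithm 5.1 (the helper names are my own; the inner lasso is solved by plain coordinate descent with soft-thresholding, and fixed sweep counts stand in for a proper convergence test):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(W11, s12, lam, n_iter=100):
    """Coordinate descent for: minimize_b  0.5 b'W11 b - b's12 + lam*||b||_1,
    which is problem (5.3) with the square expanded."""
    beta = np.zeros(len(s12))
    for _ in range(n_iter):
        for j in range(len(s12)):
            # partial residual excluding coordinate j
            r = s12[j] - W11[j] @ beta + W11[j, j] * beta[j]
            beta[j] = soft_threshold(r, lam) / W11[j, j]
    return beta

def graphical_lasso_bcd(S, lam, n_sweeps=20):
    """Blockwise coordinate descent (Algorithm 5.1)."""
    p = S.shape[0]
    W = S + lam * np.eye(p)            # initialization; diagonal stays fixed
    Theta = np.zeros((p, p))
    for _ in range(n_sweeps):
        for j in range(p):
            idx = np.arange(p) != j    # all but the j-th row/column
            W11 = W[np.ix_(idx, idx)]
            s12 = S[idx, j]
            beta = lasso_cd(W11, s12, lam)           # step (ii)
            w12 = W11 @ beta                         # step (iii)
            W[idx, j] = W[j, idx] = w12
            theta22 = 1.0 / (W[j, j] - w12 @ beta)
            Theta[j, j] = theta22
            Theta[idx, j] = Theta[j, idx] = -theta22 * beta
    return Theta, W
```

Each outer sweep solves $p$ lasso problems of dimension $p - 1$; this is the structure the original glasso implementation of Friedman et al. exploits.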
Blockwise coordinate descent

The only remaining thing is to ensure $W \succ 0$. This is automatically satisfied:

Lemma 5.2 (Mazumder & Hastie '12)
If we start with $W \succ 0$ satisfying $\|W - S\|_\infty \le \lambda$, then every row/column update maintains positive definiteness of $W$.

• If we start with $W^{(0)} = S + \lambda I$, then $W^{(t)}$ will always be positive definite
Proof of Lemma 5.2

A key observation for the proof of Lemma 5.2:

Fact 5.3 (Lemma 2, Mazumder & Hastie '12)
Solving (5.3) is equivalent to solving
\[
\text{minimize}_\gamma \quad (s_{12} + \gamma)^\top W_{11}^{-1} (s_{12} + \gamma) \qquad \text{s.t.} \quad \|\gamma\|_\infty \le \lambda \tag{5.4}
\]
where the solutions to the two problems are related by $\hat{\beta} = W_{11}^{-1} (s_{12} + \hat{\gamma})$.

• Check that the optimality condition of (5.3) and that of (5.4) match (see the sketch below)
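A sketch of this check, filling in the intermediate step: the optimality condition of (5.3) says $W_{11}\hat{\beta} - s_{12} + \lambda g = 0$ for some $g \in \partial\|\hat{\beta}\|_1$; setting $\hat{\gamma} := -\lambda g$ gives $\|\hat{\gamma}\|_\infty \le \lambda$ and $\hat{\beta} = W_{11}^{-1}(s_{12} + \hat{\gamma})$. Conversely, the gradient of the objective in (5.4) is $2 W_{11}^{-1}(s_{12} + \gamma) = 2\beta$, and optimality over the box requires $\beta_j = 0$ whenever $|\hat{\gamma}_j| < \lambda$, and $\hat{\gamma}_j = -\lambda \, \mathrm{sign}(\beta_j)$ whenever $\beta_j \neq 0$; in other words $\hat{\gamma} = -\lambda g$ for some $g \in \partial\|\hat{\beta}\|_1$, which is exactly the condition above.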
Proof of Lemma 5.2

Suppose in the $t$th iteration one has $\|W^{(t)} - S\|_\infty \le \lambda$ and $W^{(t)} \succ 0$, i.e.,
\[
W_{11}^{(t)} \succ 0 \qquad \text{and} \qquad w_{22} - w_{12}^{(t)\top} \big( W_{11}^{(t)} \big)^{-1} w_{12}^{(t)} > 0 \quad \text{(Schur complement)}
\]
We only update $w_{12}$, so it suffices to show
\[
w_{22} - w_{12}^{(t+1)\top} \big( W_{11}^{(t)} \big)^{-1} w_{12}^{(t+1)} > 0 \tag{5.5}
\]
Recall that $w_{12}^{(t+1)} = W_{11}^{(t)} \beta^{(t+1)} = s_{12} + \hat{\gamma}$. It follows from Fact 5.3 that
\[
\big\| w_{12}^{(t+1)} - s_{12} \big\|_\infty \le \lambda
\]
and, since $\gamma = w_{12}^{(t)} - s_{12}$ is feasible for (5.4) while $\hat{\gamma}$ is its minimizer,
\[
w_{12}^{(t+1)\top} \big( W_{11}^{(t)} \big)^{-1} w_{12}^{(t+1)} \le w_{12}^{(t)\top} \big( W_{11}^{(t)} \big)^{-1} w_{12}^{(t)}.
\]
Since $w_{22} = s_{22} + \lambda$ remains unchanged, we establish (5.5).
References

[1] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, 2008.
[2] R. Mazumder and T. Hastie, "The graphical lasso: new insights and alternatives," Electronic Journal of Statistics, 2012.
[3] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity: The Lasso and Generalizations, 2015.