MATH 829: Introduction to Data Mining and Analysis Graphical Models III - Gaussian Graphical Models (cont.)
(1/12)
MATH 829: Introduction to Data Mining and Analysis
Graphical Models III - Gaussian Graphical Models (cont.)
Dominique Guillot
Departments of Mathematical Sciences, University of Delaware
May 6, 2016
Estimating the conditional independence structure of a GGM (2/12)

During the last lecture, we have shown that when $X \sim N(\mu, \Sigma)$:

1. $X_i \perp X_j \iff \Sigma_{ij} = 0$.
2. $X_i \perp X_j \mid \text{rest} \iff (\Sigma^{-1})_{ij} = 0$.

To discover the conditional independence structure of $X$, we need to estimate the structure of zeros of the precision matrix $\Omega = \Sigma^{-1}$. We will proceed in a way that is similar to the lasso.

Suppose $x^{(1)}, \dots, x^{(n)} \in \mathbb{R}^p$ are iid observations of $X$. The associated log-likelihood of $(\mu, \Sigma)$ is given by
$$l(\mu, \Sigma) := -\frac{n}{2} \log\det\Sigma - \frac{1}{2} \sum_{i=1}^n (x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) - \frac{np}{2}\log(2\pi).$$

Classical result: the MLE of $(\mu, \Sigma)$ is given by
$$\hat\mu := \frac{1}{n} \sum_{i=1}^n x^{(i)}, \qquad S := \frac{1}{n} \sum_{i=1}^n (x^{(i)} - \hat\mu)(x^{(i)} - \hat\mu)^T.$$
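As a quick numerical check (not from the slides), the following NumPy sketch computes $\hat\mu$ and $S$ for a simulated sample and verifies the log-likelihood formula against SciPy's Gaussian density; note that at the MLE the quadratic term collapses to $\frac{1}{2}\sum_i (x^{(i)}-\hat\mu)^T S^{-1}(x^{(i)}-\hat\mu) = \frac{1}{2}\operatorname{Tr}(S^{-1} \cdot nS) = np/2$.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n, p = 500, 4
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))  # correlated iid sample

# MLE: sample mean and the (1/n-normalized) sample covariance
mu_hat = X.mean(axis=0)
S = (X - mu_hat).T @ (X - mu_hat) / n

# l(mu_hat, S): the quadratic term collapses to np/2
sign, logdet = np.linalg.slogdet(S)
ll = -n / 2 * logdet - n * p / 2 - n * p / 2 * np.log(2 * np.pi)

# Agrees with summing the Gaussian log-density over the sample
ll_scipy = multivariate_normal(mu_hat, S).logpdf(X).sum()
assert np.allclose(ll, ll_scipy)
```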
Estimating the CI structure of a GGM (cont.) (3/12)

Using $\hat\mu$ and $S$, we can conveniently rewrite the log-likelihood as:
$$l(\mu, \Sigma) = -\frac{n}{2}\log\det\Sigma - \frac{n}{2}\operatorname{Tr}(S\Sigma^{-1}) - \frac{np}{2}\log(2\pi) - \frac{n}{2}\operatorname{Tr}\big(\Sigma^{-1}(\hat\mu - \mu)(\hat\mu - \mu)^T\big)$$
(use the identity $x^T A x = \operatorname{Tr}(Axx^T)$).

Note that the last trace term is minimized when $\mu = \hat\mu$ (independently of $\Sigma$) since
$$\operatorname{Tr}\big(\Sigma^{-1}(\hat\mu - \mu)(\hat\mu - \mu)^T\big) = (\hat\mu - \mu)^T \Sigma^{-1} (\hat\mu - \mu) \geq 0.$$
(The last inequality holds since $\Sigma^{-1}$ is positive definite.) Hence the log-likelihood is maximized in $\mu$ at $\mu = \hat\mu$, whatever $\Sigma$ is.

Therefore the log-likelihood of $\Omega := \Sigma^{-1}$ is
$$l(\Omega) = \log\det\Omega - \operatorname{Tr}(S\Omega)$$
(up to an additive constant and a positive factor of $n/2$).
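The trace identity used above is easy to verify numerically; both sides equal $\sum_{i,j} A_{ij} x_i x_j$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(5)
A = rng.standard_normal((5, 5))

# x^T A x = Tr(A x x^T): both collapse to sum_{i,j} A_ij x_i x_j
lhs = x @ A @ x
rhs = np.trace(A @ np.outer(x, x))
assert np.isclose(lhs, rhs)
```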
The Graphical Lasso (4/12)

The Graphical Lasso (glasso) algorithm (Friedman, Hastie, Tibshirani, 2007; Banerjee et al., 2007) solves the penalized likelihood problem:
$$\hat\Omega_\rho = \operatorname*{argmax}_{\Omega \succ 0} \; \log\det\Omega - \operatorname{Tr}(S\Omega) - \rho \|\Omega\|_1,$$
where $\|\Omega\|_1 := \sum_{i,j=1}^p |\Omega_{ij}|$, and $\rho > 0$ is a fixed regularization parameter.

Idea: Make a trade-off between maximizing the likelihood and having a sparse $\Omega$.

Just like in the lasso problem, using a 1-norm tends to introduce many zeros into $\Omega$.

The regularization parameter $\rho$ can be chosen by cross-validation.

The above problem can be efficiently solved for problems with up to a few thousand variables (see e.g. ESL, Algorithm 17.2).
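In practice one rarely codes glasso from scratch; a sketch using scikit-learn's `GraphicalLasso` (its `alpha` plays the role of $\rho$; the chain-structured example precision is ours, not from the slides):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Chain-structured ground truth: only neighboring variables are
# conditionally dependent, so the true precision is tridiagonal
p = 5
Omega_true = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma_true = np.linalg.inv(Omega_true)

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(p), Sigma_true, size=2000)

# Larger alpha (= rho) gives a sparser estimated precision matrix
model = GraphicalLasso(alpha=0.1).fit(X)
Omega_hat = model.precision_

# Non-neighbor entries should be shrunk to (near) zero
print(np.round(Omega_hat, 2))
```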
The Graphical Lasso (cont.) (5/12)

We need to maximize
$$F(\Omega) := \log\det\Omega - \operatorname{Tr}(S\Omega) - \rho \sum_{i,j=1}^p |\Omega_{ij}|.$$

Since $F$ is concave, we can use the sub-gradient to identify optimal points of $F$ (to be really rigorous, we should be working with $-F$, since sub-gradients are defined for convex functions, but the derivation is the same).

We have
$$\frac{\partial}{\partial\Omega} \log\det\Omega = \Omega^{-1}, \qquad \frac{\partial}{\partial\Omega} \operatorname{Tr}(S\Omega) = S.$$

Also,
$$\frac{\partial}{\partial\Omega} \sum_{i,j=1}^p |\Omega_{ij}| = \operatorname{Sign}(\Omega), \qquad \text{where } \operatorname{Sign}(\Omega)_{ij} = \begin{cases} 1 & \text{if } \Omega_{ij} > 0, \\ -1 & \text{if } \Omega_{ij} < 0, \\ [-1, 1] & \text{if } \Omega_{ij} = 0. \end{cases}$$
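The gradient formula $\partial \log\det\Omega / \partial\Omega = \Omega^{-1}$ can be checked entrywise with central finite differences (a numerical sanity check, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
Omega = A @ A.T + 4 * np.eye(4)  # positive definite

# Central finite differences of Omega -> log det Omega, entry by entry
eps = 1e-6
grad_fd = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        E = np.zeros((4, 4))
        E[i, j] = eps
        grad_fd[i, j] = (np.linalg.slogdet(Omega + E)[1]
                         - np.linalg.slogdet(Omega - E)[1]) / (2 * eps)

# The gradient is Omega^{-T} = Omega^{-1} since Omega is symmetric
assert np.allclose(grad_fd, np.linalg.inv(Omega), atol=1e-5)
```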
The Graphical Lasso (cont.) (6/12)

Putting everything together, we get
$$\nabla F = \Omega^{-1} - S - \rho \operatorname{Sign}(\Omega).$$

Just like for the lasso problem, we will derive a coordinate-wise approach to solve the glasso problem.

Let $W = \Omega^{-1}$. Write $W$ and $\Omega$ in block form
$$W = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^T & w_{22} \end{pmatrix}, \qquad \Omega = \begin{pmatrix} \Omega_{11} & \omega_{12} \\ \omega_{12}^T & \omega_{22} \end{pmatrix},$$
where $W_{11}, \Omega_{11} \in \mathbb{R}^{(p-1)\times(p-1)}$.

We will cyclically optimize $F$, one column/row at a time.

Note that since $W\Omega = I$, we have
$$\begin{pmatrix} W_{11}\Omega_{11} + w_{12}\omega_{12}^T & W_{11}\omega_{12} + w_{12}\omega_{22} \\ w_{12}^T \Omega_{11} + w_{22}\omega_{12}^T & w_{12}^T \omega_{12} + w_{22}\omega_{22} \end{pmatrix} = \begin{pmatrix} I_{(p-1)\times(p-1)} & 0_{p-1} \\ 0_{p-1}^T & 1 \end{pmatrix}.$$
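The upper-right block identity is easy to confirm numerically by partitioning a random precision matrix and its inverse (a sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
Omega = A @ A.T + np.eye(4)  # positive definite precision matrix
W = np.linalg.inv(Omega)

# Partition both matrices around the last row/column
W11, w12, w22 = W[:3, :3], W[:3, 3], W[3, 3]
O11, o12, o22 = Omega[:3, :3], Omega[:3, 3], Omega[3, 3]

# Upper-right block of W @ Omega = I:  W11 o12 + w12 o22 = 0 ...
assert np.allclose(W11 @ o12 + w12 * o22, 0)
# ... so w12 = -W11 o12 / o22 = W11 beta, with beta := -o12 / o22
beta = -o12 / o22
assert np.allclose(w12, W11 @ beta)
```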
The Graphical Lasso (cont.) (7/12)

In particular, we have $W_{11}\omega_{12} + w_{12}\omega_{22} = 0$, i.e.,
$$w_{12} = -W_{11}\frac{\omega_{12}}{\omega_{22}} = W_{11}\beta, \qquad \text{where } \beta := -\omega_{12}/\omega_{22}.$$

Now, the upper-right block of $\Omega^{-1} - S - \rho \operatorname{Sign}(\Omega)$ is equal to
$$w_{12} - s_{12} + \rho \operatorname{Sign}(\beta),$$
since $\omega_{22} > 0$ (so $\operatorname{Sign}(\omega_{12}) = -\operatorname{Sign}(\beta)$).

We need to choose $w_{12}$ such that
$$0 \in w_{12} - s_{12} + \rho \operatorname{Sign}(\beta) \iff 0 \in W_{11}\beta - s_{12} + \rho \operatorname{Sign}(\beta).$$

Observation: in the lasso problem $\min_\beta \frac{1}{2}\|y - Z\beta\|^2 + \rho\|\beta\|_1$, we have
$$\partial\left(\frac{1}{2}\|y - Z\beta\|^2 + \rho\|\beta\|_1\right) = Z^T Z\beta - Z^T y + \rho \operatorname{Sign}(\beta).$$
The Graphical Lasso (cont.) (8/12)

So, we have the two optimality conditions:

Glasso update: $0 \in W_{11}\beta - s_{12} + \rho \operatorname{Sign}(\beta)$.
Lasso problem: $0 \in Z^T Z\beta - Z^T y + \rho \operatorname{Sign}(\beta)$.

Now, let $Z := W_{11}^{1/2}$ and $y := W_{11}^{-1/2} s_{12}$, so that $Z^T Z = W_{11}$ and $Z^T y = s_{12}$.

The glasso update is thus equivalent to solving the lasso problem:
$$\min_\beta \frac{1}{2}\left\|W_{11}^{-1/2} s_{12} - W_{11}^{1/2}\beta\right\|^2 + \rho \|\beta\|_1.$$

We can therefore solve the glasso problem by cycling through the rows/columns of $W$, and updating them by solving a lasso problem!
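Since the lasso optimality condition involves only the inner products $Z^T Z = W_{11}$ and $Z^T y = s_{12}$, the column update can be solved by coordinate descent without ever forming $W_{11}^{1/2}$. A minimal sketch (the function names `soft` and `glasso_column_update` are ours), with a check of the sub-gradient condition at the solution:

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding operator: sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def glasso_column_update(W11, s12, rho, n_sweeps=200):
    """Coordinate descent for min_b (1/2) b^T W11 b - s12^T b + rho ||b||_1,
    i.e. the lasso written in terms of Z^T Z = W11 and Z^T y = s12."""
    beta = np.zeros(len(s12))
    for _ in range(n_sweeps):
        for j in range(len(s12)):
            # Partial residual: s12[j] minus contributions of the other coords
            r = s12[j] - W11[j] @ beta + W11[j, j] * beta[j]
            beta[j] = soft(r, rho) / W11[j, j]
    return beta

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))
W11 = A @ A.T + np.eye(4)  # positive definite (p-1) x (p-1) block
s12 = rng.standard_normal(4)
rho = 0.3

beta = glasso_column_update(W11, s12, rho)

# KKT check: 0 in W11 beta - s12 + rho * Sign(beta)
g = W11 @ beta - s12
for j in range(4):
    if beta[j] != 0:
        assert np.isclose(g[j], -rho * np.sign(beta[j]), atol=1e-6)
    else:
        assert abs(g[j]) <= rho + 1e-6
```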
The Graphical Lasso (cont.) (9/12)

We therefore have the following algorithm to solve the glasso problem: see ESL, Algorithm 17.2.
MLE estimation of a GGM (10/12)

From the glasso solution, one infers a conditional independence graph for $X = (X_1, \dots, X_p)$ by examining the zeros in the estimated inverse covariance matrix.

Given a graph $G = (V, E)$ with $p$ vertices, let
$$\mathcal{P}_G := \{A \in \mathcal{P}_p : A_{ij} = 0 \text{ if } (i,j) \notin E\},$$
where $\mathcal{P}_p$ denotes the set of $p \times p$ positive definite matrices.

We can now estimate the optimal covariance matrix with the given graph structure by solving:
$$\hat\Sigma_G := \operatorname*{argmax}_{\Sigma :\, \Omega = \Sigma^{-1} \in \mathcal{P}_G} l(\Sigma),$$
where $l(\Sigma)$ denotes the log-likelihood of $\Sigma$.

Note: Instead of maximizing the log-likelihood over all positive definite matrices as in the classical case, we restrict ourselves to the matrices having the right conditional independence structure.

The graphical MLE problem can be solved efficiently for up to a few thousand variables (see e.g. ESL, Algorithm 17.1).
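Reading the graph off the estimated precision matrix amounts to listing the nonzero off-diagonal entries. A small sketch (the precision matrix here is a made-up illustration):

```python
import numpy as np

# Hypothetical estimated precision matrix (zeros = missing edges)
Omega_hat = np.array([[ 2.0, -0.5,  0.0,  0.0],
                      [-0.5,  2.0, -0.3,  0.0],
                      [ 0.0, -0.3,  2.0, -0.4],
                      [ 0.0,  0.0, -0.4,  2.0]])

# (i, j) is an edge of the conditional independence graph
# iff Omega_hat[i, j] != 0, for i < j
edges = [(i, j) for i in range(4) for j in range(i + 1, 4)
         if Omega_hat[i, j] != 0]
print(edges)  # chain graph: [(0, 1), (1, 2), (2, 3)]
```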
MLE estimation of a GGM (cont.) (11/12)

Computing the Gaussian MLE of a multivariate normal random vector with known conditional independence graph $G$: ESL, Algorithm 17.1. The derivation of the algorithm is similar to the derivation of the glasso algorithm (see ESL).
Application (12/12)

Example: Estimating the conditional independencies in temperature fields (Guillot et al., 2015).

Reconstructing climate fields using paleoclimate proxies:
- Estimate the conditional independence graph on the instrumental period.
- Use an EM algorithm with an embedded graphical model.
- The resulting algorithm is called GraphEM.

See Guillot et al. (2015) for more details.
MATH5745 Multivariate Methods Lecture 07 Tests of hypothesis on covariance matrix March 16, 2018 MATH5745 Multivariate Methods Lecture 07 March 16, 2018 1 / 39 Test on covariance matrices: Introduction
More informationUncertainty quantification and visualization for functional random variables
Uncertainty quantification and visualization for functional random variables MascotNum Workshop 2014 S. Nanty 1,3 C. Helbert 2 A. Marrel 1 N. Pérot 1 C. Prieur 3 1 CEA, DEN/DER/SESI/LSMR, F-13108, Saint-Paul-lez-Durance,
More informationMATH 829: Introduction to Data Mining and Analysis Support vector machines
1/10 MATH 829: Introduction to Data Mining and Analysis Support vector machines Dominique Guillot Departments of Mathematical Sciences University of Delaware March 11, 2016 Hyperplanes 2/10 Recall: A hyperplane
More informationProbabilistic Graphical Models
School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04
More informationA direct formulation for sparse PCA using semidefinite programming
A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon
More informationThe Absolute Value Equation
The Absolute Value Equation The absolute value of number is its distance from zero on the number line. The notation for absolute value is the presence of two vertical lines. 5 This is asking for the absolute
More informationComputing the MLE and the EM Algorithm
ECE 830 Fall 0 Statistical Signal Processing instructor: R. Nowak Computing the MLE and the EM Algorithm If X p(x θ), θ Θ, then the MLE is the solution to the equations logp(x θ) θ 0. Sometimes these equations
More informationUniversity of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians
Engineering Part IIB: Module F Statistical Pattern Processing University of Cambridge Engineering Part IIB Module F: Statistical Pattern Processing Handout : Multivariate Gaussians. Generative Model Decision
More informationThe Expectation-Maximization Algorithm
1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable
More informationGroup Sparse Priors for Covariance Estimation
Group Sparse Priors for Covariance Estimation Benjamin M. Marlin, Mark Schmidt, and Kevin P. Murphy Department of Computer Science University of British Columbia Vancouver, Canada Abstract Recently it
More informationThe lasso: some novel algorithms and applications
1 The lasso: some novel algorithms and applications Newton Institute, June 25, 2008 Robert Tibshirani Stanford University Collaborations with Trevor Hastie, Jerome Friedman, Holger Hoefling, Gen Nowak,
More informationMath 152. Rumbos Fall Solutions to Assignment #12
Math 52. umbos Fall 2009 Solutions to Assignment #2. Suppose that you observe n iid Bernoulli(p) random variables, denoted by X, X 2,..., X n. Find the LT rejection region for the test of H o : p p o versus
More informationDecomposable and Directed Graphical Gaussian Models
Decomposable Decomposable and Directed Graphical Gaussian Models Graphical Models and Inference, Lecture 13, Michaelmas Term 2009 November 26, 2009 Decomposable Definition Basic properties Wishart density
More informationCopyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and
Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and private study only. The thesis may not be reproduced elsewhere
More informationNotes on Random Vectors and Multivariate Normal
MATH 590 Spring 06 Notes on Random Vectors and Multivariate Normal Properties of Random Vectors If X,, X n are random variables, then X = X,, X n ) is a random vector, with the cumulative distribution
More informationExtended Bayesian Information Criteria for Gaussian Graphical Models
Extended Bayesian Information Criteria for Gaussian Graphical Models Rina Foygel University of Chicago rina@uchicago.edu Mathias Drton University of Chicago drton@uchicago.edu Abstract Gaussian graphical
More informationLecture 3. Linear Regression II Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More informationA direct formulation for sparse PCA using semidefinite programming
A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley A. d Aspremont, INFORMS, Denver,
More informationEstimating Covariance Structure in High Dimensions
Estimating Covariance Structure in High Dimensions Ashwini Maurya Michigan State University East Lansing, MI, USA Thesis Director: Dr. Hira L. Koul Committee Members: Dr. Yuehua Cui, Dr. Hyokyoung Hong,
More informationGaussian Mixture Models
Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin (One) bad case for K- means Clusters may overlap Some
More informationLecture 4 September 15
IFT 6269: Probabilistic Graphical Models Fall 2017 Lecture 4 September 15 Lecturer: Simon Lacoste-Julien Scribe: Philippe Brouillard & Tristan Deleu 4.1 Maximum Likelihood principle Given a parametric
More informationRegularized Estimation of High Dimensional Covariance Matrices. Peter Bickel. January, 2008
Regularized Estimation of High Dimensional Covariance Matrices Peter Bickel Cambridge January, 2008 With Thanks to E. Levina (Joint collaboration, slides) I. M. Johnstone (Slides) Choongsoon Bae (Slides)
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Multivariate Gaussians Mark Schmidt University of British Columbia Winter 2019 Last Time: Multivariate Gaussian http://personal.kenyon.edu/hartlaub/mellonproject/bivariate2.html
More informationMultivariate Gaussian Analysis
BS2 Statistical Inference, Lecture 7, Hilary Term 2009 February 13, 2009 Marginal and conditional distributions For a positive definite covariance matrix Σ, the multivariate Gaussian distribution has density
More informationInf2b Learning and Data
Inf2b Learning and Data Lecture 13: Review (Credit: Hiroshi Shimodaira Iain Murray and Steve Renals) Centre for Speech Technology Research (CSTR) School of Informatics University of Edinburgh http://www.inf.ed.ac.uk/teaching/courses/inf2b/
More informationK-Means and Gaussian Mixture Models
K-Means and Gaussian Mixture Models David Rosenberg New York University October 29, 2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016 1 / 42 K-Means Clustering K-Means Clustering David
More informationHigh Dimensional Discriminant Analysis
High Dimensional Discriminant Analysis Charles Bouveyron 1,2, Stéphane Girard 1, and Cordelia Schmid 2 1 LMC IMAG, BP 53, Université Grenoble 1, 38041 Grenoble cedex 9 France (e-mail: charles.bouveyron@imag.fr,
More informationF & B Approaches to a simple model
A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 215 http://www.astro.cornell.edu/~cordes/a6523 Lecture 11 Applications: Model comparison Challenges in large-scale surveys
More informationCS281A/Stat241A Lecture 17
CS281A/Stat241A Lecture 17 p. 1/4 CS281A/Stat241A Lecture 17 Factor Analysis and State Space Models Peter Bartlett CS281A/Stat241A Lecture 17 p. 2/4 Key ideas of this lecture Factor Analysis. Recall: Gaussian
More informationMath 181B Homework 1 Solution
Math 181B Homework 1 Solution 1. Write down the likelihood: L(λ = n λ X i e λ X i! (a One-sided test: H 0 : λ = 1 vs H 1 : λ = 0.1 The likelihood ratio: where LR = L(1 L(0.1 = 1 X i e n 1 = λ n X i e nλ
More informationBig & Quic: Sparse Inverse Covariance Estimation for a Million Variables
for a Million Variables Cho-Jui Hsieh The University of Texas at Austin NIPS Lake Tahoe, Nevada Dec 8, 2013 Joint work with M. Sustik, I. Dhillon, P. Ravikumar and R. Poldrack FMRI Brain Analysis Goal:
More informationSTATS306B STATS306B. Discriminant Analysis. Jonathan Taylor Department of Statistics Stanford University. June 3, 2010
STATS306B Discriminant Analysis Jonathan Taylor Department of Statistics Stanford University June 3, 2010 Spring 2010 Classification Given K classes in R p, represented as densities f i (x), 1 i K classify
More information