MATH 829: Introduction to Data Mining and Analysis Graphical Models II - Gaussian Graphical Models


MATH 829: Introduction to Data Mining and Analysis
Graphical Models II - Gaussian Graphical Models

Dominique Guillot
Department of Mathematical Sciences
University of Delaware

May 4, 2016

Recall

An undirected graphical model is a set of random variables $\{X_1, \dots, X_p\}$ satisfying a Markov property.

Let $G = (V, E)$ be a graph on $\{1, \dots, p\}$.

The pairwise Markov property: $X_i \perp X_j \mid \text{rest}$ whenever $(i, j) \notin E$.

If the density of $X = (X_1, \dots, X_p)$ is continuous and positive, then pairwise $\Leftrightarrow$ local $\Leftrightarrow$ global.

The Hammersley-Clifford theorem provides a necessary and sufficient condition for a random vector to have a Markov random field structure with respect to a given graph $G$.

We will now turn our attention to the special case of a random vector with a multivariate Gaussian distribution.

Recall: Multivariate Gaussian/normal distribution

Recall: $X = (X_1, \dots, X_p) \sim N(\mu, \Sigma)$, where $\mu \in \mathbb{R}^p$ and $\Sigma = (\sigma_{ij}) \in \mathbb{R}^{p \times p}$ is positive definite, if
$$P(X \in A) = \frac{1}{\sqrt{(2\pi)^p \det \Sigma}} \int_A e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu)} \, dx_1 \cdots dx_p.$$

Bivariate case: (figure of a bivariate normal density, omitted)

We have $E(X) = \mu$ and $\operatorname{Cov}(X_i, X_j) = \sigma_{ij}$.

If $Y = c + BX$, where $c \in \mathbb{R}^m$ and $B \in \mathbb{R}^{m \times p}$, then $Y \sim N(c + B\mu, B\Sigma B^T)$.

Note: $\Omega := \Sigma^{-1}$ is called the precision matrix or the concentration matrix of the distribution.
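The affine-transformation property is easy to check empirically. A minimal NumPy sketch (the particular $\mu$, $\Sigma$, $B$, $c$ below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up 3-dimensional Gaussian.
mu = np.array([1.0, -2.0, 0.5])
G = rng.normal(size=(3, 3))
Sigma = G @ G.T + 3 * np.eye(3)            # positive definite by construction

X = rng.multivariate_normal(mu, Sigma, size=200_000)

# Affine transformation Y = c + BX with B in R^{2x3}, c in R^2.
c = np.array([0.0, 1.0])
B = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0]])
Y = c + X @ B.T

# Empirical mean/covariance should be close to c + B mu and B Sigma B^T.
print(np.abs(Y.mean(axis=0) - (c + B @ mu)).max())              # ~ 0
print(np.abs(np.cov(Y, rowvar=False) - B @ Sigma @ B.T).max())  # ~ 0
```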

The Schur complement

Let
$$M := \begin{pmatrix} A & B \\ C & D \end{pmatrix},$$
where $A \in \mathbb{R}^{m \times m}$, $B \in \mathbb{R}^{m \times n}$, $C \in \mathbb{R}^{n \times m}$, and $D \in \mathbb{R}^{n \times n}$.

Assuming $D$ is invertible, the Schur complement of $D$ in $M$ is
$$M/D := A - BD^{-1}C.$$

Important properties:
1. $\det M = \det D \cdot \det(M/D)$.
2. $M \in P_{m+n}$ if and only if $D \in P_n$ and $M/D \in P_m$, where $P_k$ denotes the cone of $k \times k$ real symmetric positive semidefinite matrices.

Proof:
$$M = \begin{pmatrix} I_m & BD^{-1} \\ 0 & I_n \end{pmatrix} \begin{pmatrix} A - BD^{-1}C & 0 \\ 0 & D \end{pmatrix} \begin{pmatrix} I_m & 0 \\ D^{-1}C & I_n \end{pmatrix}.$$
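Both properties are straightforward to verify numerically. A small NumPy sketch on a randomly generated positive definite matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 2, 3

# A random symmetric positive definite M, partitioned into blocks.
G = rng.normal(size=(m + n, m + n))
M = G @ G.T + (m + n) * np.eye(m + n)
A, B = M[:m, :m], M[:m, m:]
C, D = M[m:, :m], M[m:, m:]

schur = A - B @ np.linalg.solve(D, C)      # M/D = A - B D^{-1} C

# Property 1: det M = det D * det(M/D).
print(np.isclose(np.linalg.det(M), np.linalg.det(D) * np.linalg.det(schur)))

# Property 2: M is positive definite here, so D and M/D should be too.
print(np.linalg.eigvalsh(D).min() > 0, np.linalg.eigvalsh(schur).min() > 0)
```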

Multivariate Gaussian/normal distribution (cont.)

Conditional distribution: if $\{A, B\}$ is a partition of $\{1, \dots, p\}$, then
$$X_A \mid X_B = x_B \sim N(\mu_{A|B}, \Sigma_{A|B}),$$
with
$$\mu_{A|B} := \mu_A + \Sigma_{AB} \Sigma_{BB}^{-1} (x_B - \mu_B)$$
and
$$\Sigma_{A|B} := \Sigma_{AA} - \Sigma_{AB} \Sigma_{BB}^{-1} \Sigma_{BA}.$$

Marginals: to obtain the joint distribution of $(X_i, X_j)$ (say $(i, j) = (1, 2)$; the general case follows by permuting coordinates), note that
$$(X_i, X_j)^T = B(X_1, \dots, X_p)^T,$$
where
$$B = \begin{pmatrix} I_2 & 0_{2 \times (p-2)} \end{pmatrix} \in \mathbb{R}^{2 \times p}.$$
Therefore
$$(X_i, X_j)^T \sim N(B\mu, B\Sigma B^T), \qquad B\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad B\Sigma B^T = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}.$$
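These block formulas translate directly into a few lines of NumPy. A minimal sketch (the covariance below is randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# A made-up 4-dimensional Gaussian; A = {0, 1}, B = {2, 3} in 0-indexed form.
G = rng.normal(size=(4, 4))
Sigma = G @ G.T + 4 * np.eye(4)
mu = np.array([0.0, 1.0, -1.0, 2.0])

a, b = [0, 1], [2, 3]
x_b = np.array([0.5, 1.5])                 # observed value of X_B

Sab = Sigma[np.ix_(a, b)]
Sbb = Sigma[np.ix_(b, b)]

# mu_{A|B} = mu_A + Sigma_AB Sigma_BB^{-1} (x_B - mu_B)
mu_cond = mu[a] + Sab @ np.linalg.solve(Sbb, x_b - mu[b])
# Sigma_{A|B} = Sigma_AA - Sigma_AB Sigma_BB^{-1} Sigma_BA
Sigma_cond = Sigma[np.ix_(a, a)] - Sab @ np.linalg.solve(Sbb, Sab.T)

print(mu_cond, Sigma_cond, sep="\n")
```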

Multivariate Gaussian/normal distribution (cont.)

Now, suppose $X \sim N(\mu, \Sigma)$ with $\mu \in \mathbb{R}^p$ and $\Sigma = (\sigma_{ij}) \in \mathbb{R}^{p \times p}$ positive definite.

Claim:
1. $X_i \perp X_j$ iff $\sigma_{ij} = 0$.
2. $X_i \perp X_j \mid \text{rest}$ iff $(\Sigma^{-1})_{ij} = 0$.

Proof of (1): $X_i \perp X_j$ iff $X_i \mid X_j = x_j \overset{\mathcal{L}}{=} X_i$ for every $x_j$. Now
$$X_i \mid X_j = x_j \sim N\!\left(\mu_i + \sqrt{\frac{\sigma_{ii}}{\sigma_{jj}}}\, \rho\, (x_j - \mu_j), \; (1 - \rho^2)\, \sigma_{ii}\right),$$
where $\rho = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}$ is the correlation coefficient between $X_i$ and $X_j$.

Therefore $X_i \perp X_j$ iff $\rho = 0$ iff $\sigma_{ij} = 0$.

Multivariate Gaussian/normal distribution (cont.)

Proof of (2): Without loss of generality, assume $(i, j) = (1, 2)$. Write $\mu$, $\Sigma$ in block form according to the partition $A = \{1, 2\}$, $B = \{3, \dots, p\}$:
$$\mu = (\mu_A, \mu_B)^T, \qquad \Sigma = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix}.$$

Now
$$(X_1, X_2)^T \mid \text{rest} = x_B \sim N(\mu_{A|B}, \Sigma_{A|B}),$$
where
$$\mu_{A|B} := \mu_A + \Sigma_{AB} \Sigma_{BB}^{-1} (x_B - \mu_B) \quad \text{and} \quad \Sigma_{A|B} := \Sigma_{AA} - \Sigma_{AB} \Sigma_{BB}^{-1} \Sigma_{BA}.$$

By part (1), $X_1 \perp X_2 \mid \text{rest}$ iff $(\Sigma_{A|B})_{12} = 0$.

The inverse of a block matrix

Computing the inverse of a block matrix (Ref.: Petersen and Pedersen, The Matrix Cookbook): if $M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$ with $D$ and $M/D$ invertible, then the top-left block of the inverse is
$$(M^{-1})_{11} = (A - BD^{-1}C)^{-1} = (M/D)^{-1}.$$

Applying this to $\Sigma$ with the partition $A = \{1, 2\}$, $B = \{3, \dots, p\}$, it follows that
$$\Sigma_{A|B}^{-1} = (\Sigma^{-1})_{1:2,1:2}.$$
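A quick numerical confirmation of this identity on a random positive definite $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 5
G = rng.normal(size=(p, p))
Sigma = G @ G.T + p * np.eye(p)

a = [0, 1]                                 # A = {1, 2} in the slide's notation
b = list(range(2, p))                      # B = {3, ..., p}
Sab, Sbb = Sigma[np.ix_(a, b)], Sigma[np.ix_(b, b)]
Sigma_cond = Sigma[np.ix_(a, a)] - Sab @ np.linalg.solve(Sbb, Sab.T)

# (Sigma^{-1})_{1:2,1:2} should equal Sigma_{A|B}^{-1}.
Omega = np.linalg.inv(Sigma)
print(np.allclose(Omega[np.ix_(a, a)], np.linalg.inv(Sigma_cond)))
```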

Multivariate Gaussian/normal distribution (cont.)

We have shown
$$\Sigma_{A|B}^{-1} = (\Sigma^{-1})_{1:2,1:2}.$$

Also, we have
$$\begin{pmatrix} a & b \\ b & c \end{pmatrix}^{-1} = \frac{1}{ac - b^2} \begin{pmatrix} c & -b \\ -b & a \end{pmatrix}.$$

Finally,
$$(\Sigma_{A|B})_{12} = 0 \iff (\Sigma_{A|B}^{-1})_{12} = 0 \iff (\Sigma^{-1})_{12} = 0.$$

Therefore, $X_i \perp X_j \mid \text{rest}$ iff $(\Sigma^{-1})_{ij} = 0$.
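The contrast between claims (1) and (2) is easy to see numerically: a zero entry of $\Omega$ encodes conditional, not marginal, independence. A minimal sketch (the precision matrix below is made up for illustration):

```python
import numpy as np

# A made-up precision matrix with Omega_12 = 0: X_1 and X_2 are conditionally
# independent given X_3, yet still marginally correlated.
Omega = np.array([[2.0, 0.0, 0.8],
                  [0.0, 2.0, 0.8],
                  [0.8, 0.8, 2.0]])
Sigma = np.linalg.inv(Omega)

print(Omega[0, 1])   # 0.0     -> X_1 and X_2 conditionally independent
print(Sigma[0, 1])   # nonzero -> X_1 and X_2 are not marginally independent
```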

Estimating the conditional independence structure of a GGM

We have shown that when $X \sim N(\mu, \Sigma)$:
1. $X_i \perp X_j$ iff $\sigma_{ij} = 0$.
2. $X_i \perp X_j \mid \text{rest}$ iff $(\Sigma^{-1})_{ij} = 0$.

To discover the conditional independence structure of $X$, we need to estimate the structure of zeros of the precision matrix $\Omega = \Sigma^{-1}$. We will proceed in a way that is similar to the lasso.

Suppose $x^{(1)}, \dots, x^{(n)} \in \mathbb{R}^p$ are iid observations of $X$. The associated log-likelihood of $(\mu, \Sigma)$ is given by
$$l(\mu, \Sigma) := -\frac{n}{2} \log \det \Sigma - \frac{1}{2} \sum_{i=1}^n (x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) - \frac{np}{2} \log(2\pi).$$

Classical result: the MLE of $(\mu, \Sigma)$ is given by
$$\hat{\mu} := \frac{1}{n} \sum_{i=1}^n x^{(i)}, \qquad S := \frac{1}{n} \sum_{i=1}^n (x^{(i)} - \hat{\mu})(x^{(i)} - \hat{\mu})^T.$$
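Both estimators are one line each in NumPy. A small sketch (the function name is ours, for illustration):

```python
import numpy as np

def gaussian_mle(X):
    """MLE of (mu, Sigma) from an (n, p) matrix of iid Gaussian rows:
    the sample mean and the 1/n-normalized sample covariance."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    S = centered.T @ centered / n          # note 1/n, not 1/(n-1)
    return mu_hat, S

# Example usage on simulated data:
rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 1.0], [[2.0, 0.5], [0.5, 1.0]], size=5000)
mu_hat, S = gaussian_mle(X)
```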

Estimating the CI structure of a GGM (cont.)

Using $\hat{\mu}$ and $S$, we can conveniently rewrite the log-likelihood as
$$l(\mu, \Sigma) = -\frac{n}{2} \log \det \Sigma - \frac{n}{2} \operatorname{Tr}(S \Sigma^{-1}) - \frac{np}{2} \log(2\pi) - \frac{n}{2} \operatorname{Tr}\big(\Sigma^{-1} (\hat{\mu} - \mu)(\hat{\mu} - \mu)^T\big)$$
(use the identity $x^T A x = \operatorname{Tr}(A x x^T)$).

Note that the last term vanishes when $\mu = \hat{\mu}$, and this choice maximizes the log-likelihood independently of $\Sigma$, since
$$\operatorname{Tr}\big(\Sigma^{-1} (\hat{\mu} - \mu)(\hat{\mu} - \mu)^T\big) = (\hat{\mu} - \mu)^T \Sigma^{-1} (\hat{\mu} - \mu) \geq 0.$$
(The inequality holds since $\Sigma^{-1}$ is positive definite.)

Therefore, up to additive constants and a factor of $n/2$, the log-likelihood of $\Omega := \Sigma^{-1}$ is
$$l(\Omega) = \log \det \Omega - \operatorname{Tr}(S\Omega).$$
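This objective is cheap to evaluate with a log-determinant. A minimal sketch:

```python
import numpy as np

def loglik_precision(Omega, S):
    """Profile log-likelihood l(Omega) = log det(Omega) - Tr(S Omega),
    up to constants, for a positive definite precision matrix Omega."""
    sign, logdet = np.linalg.slogdet(Omega)
    if sign <= 0:
        return -np.inf                     # Omega must be positive definite
    return logdet - np.trace(S @ Omega)

# Sanity check: without a penalty, Omega = S^{-1} maximizes l(Omega).
S = np.array([[2.0, 0.5], [0.5, 1.0]])
print(loglik_precision(np.linalg.inv(S), S) >= loglik_precision(np.eye(2), S))
```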

The Graphical Lasso

The Graphical Lasso (glasso) algorithm (Friedman, Hastie, Tibshirani, 2007; Banerjee et al., 2007) solves the penalized likelihood problem
$$\hat{\Omega}_\rho = \operatorname*{argmax}_{\Omega \text{ psd}} \; \log \det \Omega - \operatorname{Tr}(S\Omega) - \rho \|\Omega\|_1,$$
where $\|\Omega\|_1 := \sum_{i,j=1}^p |\Omega_{ij}|$ and $\rho > 0$ is a fixed regularization parameter.

Idea: Make a trade-off between maximizing the likelihood and having a sparse $\Omega$.

Just like in the lasso problem, using a 1-norm tends to introduce many zeros into $\hat{\Omega}_\rho$.

The regularization parameter $\rho$ can be chosen by cross-validation.

The above problem can be efficiently solved for problems with up to a few thousand variables (see e.g. ESL, Algorithm 17.2).
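For illustration, here is a minimal sketch using scikit-learn's GraphicalLasso on simulated data. (Note that scikit-learn penalizes only the off-diagonal entries of $\Omega$, a common variant of the objective above; its alpha plays the role of $\rho$.)

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Simulate data from a Gaussian whose precision matrix has a known zero.
rng = np.random.default_rng(5)
Omega_true = np.array([[2.0, 0.0, 0.8],
                       [0.0, 2.0, 0.8],
                       [0.8, 0.8, 2.0]])
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(Omega_true), size=2000)

est = GraphicalLasso(alpha=0.1).fit(X)
Omega_hat = est.precision_

# Infer the conditional independence graph from the zero pattern.
edges = [(i, j) for i in range(3) for j in range(i + 1, 3)
         if abs(Omega_hat[i, j]) > 1e-8]
print(edges)                               # ideally [(0, 2), (1, 2)]
```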

MLE estimation of a GGM

From the glasso solution, one infers a conditional independence graph $G$ for $X = (X_1, \dots, X_p)$.

Given a graph $G = (V, E)$ with $p$ vertices, let
$$P_G := \{A \in P_p : A_{ij} = 0 \text{ if } (i, j) \notin E \text{ and } i \neq j\}.$$

We can now estimate the optimal covariance matrix with the given graph structure by solving
$$\hat{\Sigma}_G := \operatorname*{argmax}_{\Sigma \,:\, \Omega = \Sigma^{-1} \in P_G} l(\Sigma),$$
where $l(\Sigma)$ denotes the log-likelihood of $\Sigma$.

Note: Instead of maximizing the log-likelihood over all possible psd matrices as in the classical case, we restrict ourselves to the matrices having the right conditional independence structure.

The graphical MLE problem can be solved efficiently for up to a few thousand variables (see e.g. ESL, Algorithm 17.1).
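ESL's Algorithm 17.1 is one way to solve this problem; a simpler (much slower) classical alternative is iterative proportional scaling, which repeatedly uses the Schur complement identity from earlier to match $\hat{\Sigma}$ to $S$ on every vertex and edge. A rough sketch, assuming $S$ is positive definite (the function name and iteration count are illustrative, not from the slides):

```python
import numpy as np

def ggm_mle_ips(S, edges, n_iter=500):
    """Graph-constrained Gaussian MLE via iterative proportional scaling:
    adjust blocks of K = Omega so that (K^{-1})_cc = S_cc on every vertex
    and edge; zeros of K outside E are preserved by construction."""
    p = S.shape[0]
    K = np.diag(1.0 / np.diag(S))          # start at the independence model
    blocks = [[i] for i in range(p)] + [list(e) for e in edges]
    for _ in range(n_iter):
        for c in blocks:
            r = [k for k in range(p) if k not in c]
            if r:
                # Schur complement: (K^{-1})_cc = (K_cc - K_cr K_rr^{-1} K_rc)^{-1}
                adj = K[np.ix_(c, r)] @ np.linalg.solve(K[np.ix_(r, r)],
                                                        K[np.ix_(r, c)])
            else:
                adj = 0.0
            K[np.ix_(c, c)] = np.linalg.inv(S[np.ix_(c, c)]) + adj
    return np.linalg.inv(K)                # Sigma_hat_G (K itself is Omega_hat_G)
```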
