Structure estimation for Gaussian graphical models


Faculty of Science
Structure estimation for Gaussian graphical models
Steffen Lauritzen, University of Copenhagen, Department of Mathematical Sciences
Minikurs TUM 2016, Lecture 3

Overview of lectures

Lecture 1: Markov Properties and the Multivariate Gaussian Distribution
Lecture 2: Likelihood Analysis of Gaussian Graphical Models
Lecture 3: Structure Estimation for Gaussian Graphical Models

For reference, if nothing else is mentioned, see Lauritzen (1996), Chapters 3 and 4.

Gaussian graphical model

$\mathcal{S}(G)$ denotes the symmetric matrices $A$ with $a_{\alpha\beta} = 0$ unless $\alpha \sim \beta$, and $\mathcal{S}^+(G)$ their positive definite elements.

A Gaussian graphical model for $X$ specifies $X$ as multivariate normal with $K \in \mathcal{S}^+(G)$ and otherwise unknown.

The likelihood function based on a sample of size $n$ is
$$L(K) \propto (\det K)^{n/2} e^{-\operatorname{tr}(KW)/2},$$
where $W$ is the (Wishart) matrix of sums of squares and products and $\Sigma^{-1} = K \in \mathcal{S}^+(G)$.
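As a small illustration of this likelihood, the log-likelihood (up to an additive constant) can be evaluated directly from $K$ and $W$; the following is a minimal numpy sketch with hypothetical function and variable names.

```python
import numpy as np

def gaussian_graphical_loglik(K, W, n):
    """log L(K) = (n/2) log det K - tr(K W)/2, up to an additive constant,
    for concentration matrix K, Wishart matrix W of sums of squares and
    products, and sample size n."""
    sign, logdet = np.linalg.slogdet(K)
    if sign <= 0:
        raise ValueError("K must be positive definite")
    return 0.5 * n * logdet - 0.5 * np.trace(K @ W)
```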

Representation via basis matrices

Define the matrices $T^u$, $u \in V \cup E$, as those with elements
$$T^u_{ij} = \begin{cases} 1 & \text{if } u \in V \text{ and } i = j = u, \\ 1 & \text{if } u \in E \text{ and } u = \{i, j\}, \\ 0 & \text{otherwise.} \end{cases}$$
Then $T^u$, $u \in V \cup E$, forms a basis for the linear space $\mathcal{S}(G)$ of symmetric matrices over $V$ which have zero entries $ij$ whenever $i$ and $j$ are non-adjacent in $G$.

We can then identify the family as a (regular and canonical) exponential family with $-\operatorname{tr}(T^u W)/2$, $u \in V \cup E$, as canonical sufficient statistics. This yields the likelihood equations
$$\operatorname{tr}(T^u w) = n \operatorname{tr}(T^u \Sigma), \quad u \in V \cup E.$$
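For concreteness, the basis matrices for a small graph can be built explicitly; this is a minimal sketch with hypothetical names.

```python
import numpy as np

def basis_matrices(p, edges):
    """Basis matrices T^u, u in V ∪ E, of S(G) for vertices 0..p-1 and an
    edge list of pairs (i, j); returned as a dict keyed by vertex or edge."""
    Ts = {}
    for v in range(p):                 # vertex matrices: a single diagonal 1
        T = np.zeros((p, p))
        T[v, v] = 1.0
        Ts[v] = T
    for (i, j) in edges:               # edge matrices: symmetric off-diagonal 1s
        T = np.zeros((p, p))
        T[i, j] = T[j, i] = 1.0
        Ts[(i, j)] = T
    return Ts

# The likelihood equations tr(T^u w) = n tr(T^u Sigma) say that the fitted
# Sigma matches w/n on the diagonal and on the edges of G.
```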

Iterative Proportional Scaling

For $K \in \mathcal{S}^+(G)$ and $c \in \mathcal{C}$, define the operation of adjusting the $c$-marginal as follows: let $a = V \setminus c$ and
$$M_c K = \begin{pmatrix} n(w_{cc})^{-1} + K_{ca}(K_{aa})^{-1}K_{ac} & K_{ca} \\ K_{ac} & K_{aa} \end{pmatrix}. \quad (1)$$
This operation is clearly well defined if $w_{cc}$ is positive definite.

Next we choose any ordering $(c_1, \ldots, c_k)$ of the cliques in $G$. Choose further $K_0 = I$ and define for $r = 0, 1, \ldots$
$$K_{r+1} = (M_{c_1} \cdots M_{c_k})K_r.$$
Then we have
$$\hat K = \lim_{r \to \infty} K_r,$$
provided the maximum likelihood estimate $\hat K$ of $K$ exists.
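A minimal numpy sketch of the marginal adjustment (1) and the resulting IPS iteration follows; it runs a fixed number of sweeps rather than testing convergence, and all names are hypothetical.

```python
import numpy as np

def adjust_marginal(K, w, c, n):
    """One operation M_c: reset the c-block of K so that the implied
    marginal covariance on the clique c equals w_cc / n."""
    p = K.shape[0]
    c = list(c)
    a = [i for i in range(p) if i not in c]
    K_new = K.copy()
    K_ca = K[np.ix_(c, a)]
    K_aa_inv = np.linalg.inv(K[np.ix_(a, a)])
    w_cc = w[np.ix_(c, c)]
    K_new[np.ix_(c, c)] = n * np.linalg.inv(w_cc) + K_ca @ K_aa_inv @ K_ca.T
    return K_new

def ips(w, cliques, n, sweeps=100):
    """Iterative proportional scaling started at the identity matrix."""
    K = np.eye(w.shape[0])
    for _ in range(sweeps):
        for c in cliques:
            K = adjust_marginal(K, w, c, n)
    return K
```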

Characterizing decomposable graphs

A graph is chordal if all cycles of length $\geq 4$ have chords. The following are equivalent for any undirected graph $G$:
(i) $G$ is chordal;
(ii) $G$ is decomposable;
(iii) all maximal prime subgraphs of $G$ are cliques.
There are also many other useful characterizations of chordal graphs and algorithms that identify them. Trees are chordal graphs and thus decomposable.

If the graph $G$ is chordal, we say that the graphical model is decomposable. In this case, the IPS algorithm converges in a finite number of steps. We also have the factorization of densities
$$f(x \mid \Sigma) = \frac{\prod_{C \in \mathcal{C}} f(x_C \mid \Sigma_C)}{\prod_{S \in \mathcal{S}} f(x_S \mid \Sigma_S)^{\nu(S)}}, \quad (2)$$
where $\nu(S)$ is the number of times $S$ appears as an intersection between neighbouring cliques of a junction tree for $\mathcal{C}$. Similar factorizations naturally hold for the maximum likelihood estimate $\hat\Sigma$.

Structure estimation

Advances in computing have put focus on estimation of structure:
- Model selection (e.g. subset selection in regression)
- System identification (engineering)
- Structural learning (AI or machine learning)
Graphical models describe conditional independence structures, so they are a good case for formal analysis.
Methods must scale well with data size, as many structures and huge collections of data are to be considered.

Why estimation of structure?

- Parallel to e.g. density estimation
- Obtain a quick overview of relations between variables in complex systems
- Data mining
- Gene regulatory networks
- Reconstructing family trees from DNA information
- General interest in sparsity.

[Figure: Markov mesh model]

[Figure: PC algorithm. Crudest algorithm (HUGIN), simulated cases]

[Figure: Bayesian GES. Crudest algorithm (WinMine), simulated cases]

[Figure: Tree model. PC algorithm, simulated cases, correct reconstruction]

[Figure: Bayesian GES on tree]

[Figure: Chest clinic]

[Figure: PC algorithm, simulated cases]

[Figure: Bayesian GES]

[Figure: SNPs and gene expressions, min BIC forest]

Methods for structure identification in graphical models can be classified into three types:
- score-based methods: for example optimizing a penalized likelihood by convex programming, e.g. glasso;
- Bayesian methods: identifying posterior distributions over graphs; one can also use the posterior probability as a score;
- constraint-based methods: querying conditional independences and identifying compatible independence structures, for example PC, PC*, NPC, IC, CI, FCI, SIN, QP, ...

Penalized likelihood

Methods based on pure maximum likelihood are not feasible when the dimension of the parameter space varies.

Instead, trade off goodness-of-fit, measured by the maximized likelihood, against the complexity of the model:
$$\mathrm{IC}_\kappa(G) = -2 \log L_G(\hat\theta_G) + \kappa \dim(G),$$
where $\hat\theta_G$ is the MLE, $\dim(G)$ is the number of free parameters, and $\kappa$ is a constant that gives the exchange rate for trading fit against parameters.

$\kappa$ may depend on the number $n$ of observations, but is constant over the set of possible graphs $G$.

Akaike's Information Criterion

The criterion AIC has $\kappa = 2$ independently of the number of observations. It is meant to optimize the prediction error for predicting the next observation.

AIC is not consistent for $n \to \infty$, as it will tend to include too many parameters. Hence, used for a Gaussian graphical model with large $n$, the selected model will tend not to be sparse.

Bayesian Information Criterion

An asymptotic Bayesian argument leads to BIC, which has $\kappa = \log n$, where $n$ is the number of observations.

The BIC ensures consistent estimation of the graph. However, the true structure can be identified faster if, say,
$$\kappa_n = C \log\log n$$
for some $C > 1$.
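The generic criterion and its AIC and BIC special cases are one-liners; a hedged sketch with hypothetical function names, where max_log_lik is the maximized log-likelihood and dim_G the number of free parameters.

```python
import numpy as np

def ic(max_log_lik, dim_G, kappa):
    """IC_kappa(G) = -2 log L_G(theta_hat) + kappa * dim(G)."""
    return -2.0 * max_log_lik + kappa * dim_G

def aic(max_log_lik, dim_G):
    return ic(max_log_lik, dim_G, kappa=2.0)          # kappa = 2

def bic(max_log_lik, dim_G, n):
    return ic(max_log_lik, dim_G, kappa=np.log(n))    # kappa = log n
```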

Other penalized methods

Other penalized likelihood methods use criteria of the form
$$l_\kappa(G, \theta_G) = -2 \log L_G(\theta_G) + \kappa \|\theta_G\|,$$
where $\|\theta_G\|$ measures the size of the parameter, for example using a vector space norm.

An example of this for Gaussian graphical models is the so-called graphical lasso, based on minimizing
$$l_\kappa(K) = -2 \log L(K) + \kappa \|K\|_1,$$
where now the graph $G$ is only implicitly represented through $K$ itself.

This is a convex optimization problem, and in some sense $l_\kappa$ is a convex variant of the IC criteria.

Bayesian methods

A full Bayesian approach will use suitable prior distributions, in the Gaussian case known as hyper Markov Wishart and hyper Markov inverse Wishart prior distributions.

One then writes
$$f(x \mid G) = \int_{K \in \mathcal{S}^+(G)} f(x \mid K)\, \pi_G(dK)$$
and further
$$\pi(G \mid x) \propto f(x \mid G)\,\pi(G).$$

Attempting, say, to maximize $\pi(G \mid x)$ over $G$ leads to the MAP estimate of $G$. Asymptotically for large $n$ this would be equivalent to BIC.

Estimating trees and forests

The simplest case to consider is that where the unknown conditional independence structure is a tree $T \in \mathcal{T}(V)$; since a tree is decomposable, any distribution $P$ which factorizes w.r.t. $T = (V, E)$ has a density of the form
$$f(x) = \frac{\prod_{e \in E} f_e(x_e)}{\prod_{v \in V} f_v(x_v)^{d(v)-1}} = \prod_{uv \in E} \frac{f_{uv}(x_{uv})}{f_u(x_u)f_v(x_v)} \prod_{v \in V} f_v(x_v). \quad (3)$$

Maximum likelihood trees

Next we consider the situation where we have a sample $x = (x^1, \ldots, x^n)$ from a distribution $P$ of $X = X_V$ which is Gaussian and known to factorize according to a tree $T \in \mathcal{T}(V)$, but where both $P$ and $T$ are otherwise unknown. In other words, we assume the unknown concentration matrix $K$ satisfies
$$K \in \bigcup_{T \in \mathcal{T}(V)} \mathcal{S}^+(T).$$

To maximize the likelihood function over this parameter space, we first maximize for a fixed tree to get the profile likelihood $\hat L(T \mid x)$, where
$$\hat L(T) = \hat L(T \mid x) = \sup_{K \in \mathcal{S}^+(T)} L(K \mid x);$$
we then further maximize $\hat L(T)$ over all trees $T \in \mathcal{T}(V)$.

Since a tree is decomposable, the profile likelihood satisfies
$$\hat L(T \mid x) = f(x \mid \hat K_T) = \frac{\prod_{e \in E} \hat f_{[e]}(x_e)}{\prod_{v \in V} \hat f_{[v]}(x_v)^{d(v)-1}} = \prod_{uv \in E} \frac{\hat f_{[uv]}(x_{uv})}{\hat f_{[u]}(x_u)\hat f_{[v]}(x_v)} \prod_{v \in V} \hat f_{[v]}(x_v).$$
Here $\hat f_{[A]}(x)$ denotes the maximized likelihood for the marginal distribution of $X_A$ based on the data $x_A$ only and using the saturated model for $X_A$.

More precisely, for $x = (x^1, \ldots, x^n)$ we have
$$\hat f_{[uv]}(x) = (2\pi)^{-n} \det(W_{\{uv\}}/n)^{-n/2} e^{-\operatorname{tr}\{n (W_{\{uv\}})^{-1} W_{\{uv\}}\}/2} = n^n (2\pi)^{-n} (w_{uu}w_{vv} - w_{uv}^2)^{-n/2} \exp(-n)$$
and
$$\hat f_{[v]}(x) = (2\pi)^{-n/2}(W_{vv}/n)^{-n/2}\exp(-n/2) = n^{n/2}(2\pi)^{-n/2}(w_{vv})^{-n/2}\exp(-n/2),$$
where $W = \{w_{uv},\, u, v \in V\}$ is the Wishart matrix of sums of squares and products.

Thus we get in particular
$$\log \frac{\hat f_{[uv]}(x_{uv})}{\hat f_{[u]}(x_u)\hat f_{[v]}(x_v)} = -\frac{n}{2}\log\frac{w_{uu}w_{vv} - w_{uv}^2}{w_{uu}w_{vv}} = -\frac{n}{2}\log(1 - r_{uv}^2),$$
where $r_{uv}$ is the empirical correlation coefficient $r_{uv} = w_{uv}/\sqrt{w_{uu}w_{vv}}$.

Define the empirical correlation weight $\omega_{uv}$ of the edge $uv$ as
$$\omega_{uv} = -\frac{n}{2}\log(1 - r_{uv}^2)$$
and let $\omega(T) = \sum_{uv \in E(T)} \omega_{uv}$ denote the total empirical weight of the tree $T$. The matrix $\Omega = \{\omega_{uv}\}$ is the correlation weight matrix.
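The weight matrix $\Omega$ is easy to compute from the Wishart matrix; a minimal sketch with hypothetical names.

```python
import numpy as np

def correlation_weights(w, n):
    """Omega with omega_uv = -(n/2) log(1 - r_uv^2), where r_uv is the
    empirical correlation built from the Wishart matrix w."""
    d = np.sqrt(np.diag(w))
    R = w / np.outer(d, d)            # empirical correlation matrix
    np.fill_diagonal(R, 0.0)          # the diagonal is not used as a weight
    return -0.5 * n * np.log(1.0 - R ** 2)
```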

Further let $\hat L(\emptyset)$ denote the maximized likelihood under independence:
$$\hat L(\emptyset) = \prod_{v \in V} \hat f_{[v]}(x_v).$$
Then, clearly, it holds that
$$\log \hat L(T) - \log \hat L(\emptyset) = \omega(T) = \sum_{uv \in E(T)} \omega_{uv}. \quad (4)$$

We say that $\hat T$ is a maximum likelihood tree based on a sample $x = (x^1, \ldots, x^n)$ if $\hat T$ satisfies
$$\hat L(\hat T) = \sup_{T \in \mathcal{T}(V)} \hat L(T).$$

A spanning tree $T$ of a connected graph $G = (V, E)$ is a subgraph $T = (V, E_T)$ of $G$ which has the same vertex set and is a tree; that is, $E_T \subseteq E(G)$.

We then have the following result:

Theorem. A tree $\hat T \in \mathcal{T}(V)$ is a maximum likelihood tree if and only if it is a maximum weight spanning tree (MWST) of the complete graph with vertex set $V$ for the weight matrix $\Omega$ with $\omega_{uv} = -\frac{n}{2}\log(1 - r_{uv}^2)$, that is,
$$\hat T = \arg\max_{T \in \mathcal{T}(V)} \hat L(T) \iff \hat T = \arg\max_{T \in \mathcal{T}(V)} \omega(T).$$
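Given $\Omega$, the maximum likelihood tree can therefore be obtained with any maximum weight spanning tree routine; this is a sketch assuming the networkx library is available, with hypothetical wrapper names.

```python
import networkx as nx

def maximum_likelihood_tree(Omega):
    """MWST of the complete graph on the p variables with edge weights
    Omega[u, v]; returns the edge list of the estimated tree."""
    p = Omega.shape[0]
    G = nx.Graph()
    G.add_nodes_from(range(p))
    for u in range(p):
        for v in range(u + 1, p):
            G.add_edge(u, v, weight=Omega[u, v])
    T = nx.maximum_spanning_tree(G, weight="weight")
    return sorted(T.edges())
```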

Kruskal's algorithm

This runs as follows and outputs a MWST.

Order all off-diagonal elements in the matrix $\Omega$ from largest to smallest, so that for $E = \{e_1, \ldots, e_k\}$, where $k = |V|(|V|-1)/2$, we have $\omega_{e_i} \leq \omega_{e_j}$ whenever $i > j$.
1. Let $F = (V, \emptyset)$.
2. For $i = 1, 2, \ldots$ until $F$ is a spanning tree do:
3. If $E(F) \cup e_i$ is a forest, let $E(F) = E(F) \cup e_i$; else let $E(F) = E(F)$.
4. Return $F$.
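A direct implementation of these steps, with the forest check done through a small union-find structure; a sketch with hypothetical names.

```python
def kruskal_mwst(Omega):
    """Kruskal's algorithm: maximum weight spanning tree for weight matrix Omega."""
    p = Omega.shape[0]
    edges = sorted(((Omega[u, v], u, v)
                    for u in range(p) for v in range(u + 1, p)), reverse=True)
    parent = list(range(p))

    def find(x):                      # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                  # adding the edge keeps F a forest
            parent[ru] = rv
            tree.append((u, v))
            if len(tree) == p - 1:    # F is now a spanning tree
                break
    return tree
```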

Penalized likelihood forests

If we instead wish to estimate an unknown forest, i.e. assume that $K \in \mathcal{S}^+(F)$ where $F$ is unknown, we use a penalized form of the likelihood:
$$\mathrm{IC}_\kappa(F) = -2 \log \hat L(F) + \kappa\{|V| + |E(F)|\},$$
since $|V| + |E(F)|$ is the dimension of the model determined by $F$.

Using (4) yields
$$\mathrm{IC}_\kappa(F) = -2\{\omega(F) - \kappa|E(F)|/2\} + \text{const} = -2\sum_{uv \in E(F)}(\omega_{uv} - \kappa/2) + \text{const}.$$

Thus we can minimize the IC score by using a modification of Kruskal's algorithm on $\Omega^\kappa$, where $\omega^\kappa_{uv} = \omega_{uv} - \kappa/2$:

Discard all negative off-diagonal elements in the matrix $\Omega^\kappa$ and order the remaining ones from largest to smallest.
1. Let $F = (V, \emptyset)$.
2. For $i = 1, 2, \ldots$ until the list of edges is exhausted (or $F$ is a spanning tree) do:
3. If $E(F) \cup e_i$ is a forest, let $E(F) = E(F) \cup e_i$; else let $E(F) = E(F)$.
4. Return $F$.
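Relative to the Kruskal sketch above, the only changes are the shifted weights and the discarding of non-positive penalized weights; again a sketch with hypothetical names, where taking kappa = log n gives the min-BIC forest.

```python
def min_ic_forest(Omega, kappa):
    """Maximum weight forest for the penalized weights omega_uv - kappa/2,
    using only edges whose penalized weight is positive."""
    p = Omega.shape[0]
    edges = sorted(((Omega[u, v] - kappa / 2.0, u, v)
                    for u in range(p) for v in range(u + 1, p)), reverse=True)
    parent = list(range(p))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for w, u, v in edges:
        if w <= 0:                    # remaining edges can only worsen the score
            break
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v))
    return forest
```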

[Figure: BodyFat, min BIC forest. Data in Højsgaard et al. (2012): measurements of body parts interesting for prediction of body fat (Age, BodyFat, Wrist, Height, Abdomen, Chest, Neck, Weight, Knee, Forearm, Biceps, Ankle, Hip, Thigh).]

[Figure: SNPs and gene expressions, min BIC forest]

Random graphs for posterior analysis

A Bayesian approach to graphical model analysis implies setting up a prior distribution over a class of graphs, say undirected trees, and then finding the posterior distribution.

For example, if the prior is uniform over trees and the parameters are hyper Markov (Dawid and Lauritzen, 1993), the posterior distribution based on data $x$ is
$$p(\tau \mid x) \propto w(\tau) = \prod_{e \in E(\tau)} \mathrm{BF}_e,$$
where $\mathrm{BF}_e$ is the Bayes factor for independence among the variables at the endpoints of $e$.

The unknown normalization constant $\sum_\tau w(\tau)$ can be found as a determinant using the matrix tree theorem.
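By the weighted matrix tree theorem, the normalizing constant $\sum_\tau w(\tau)$ equals any cofactor of the weighted Laplacian of the complete graph with edge weights $\mathrm{BF}_e$; a small sketch with a hypothetical name for the Bayes-factor matrix.

```python
import numpy as np

def tree_normalizing_constant(BF):
    """Sum over all spanning trees of the product of edge weights BF[u, v],
    computed as a cofactor of the weighted Laplacian (matrix tree theorem)."""
    A = np.array(BF, dtype=float)
    np.fill_diagonal(A, 0.0)
    L = np.diag(A.sum(axis=1)) - A          # weighted Laplacian
    return np.linalg.det(L[1:, 1:])         # delete any one row and column
```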

For chordal graphs $G$ the similar expression becomes
$$p(G \mid x) \propto \frac{\prod_{C \in \mathcal{C}(G)} w(C \mid x)}{\prod_{S \in \mathcal{S}(G)} w(S \mid x)^{\nu_G(S)}}, \quad (5)$$
where $\mathcal{C}(G)$ are the maximal cliques of $G$, $\mathcal{S}(G)$ the minimal complete separators, and $\nu_G(S)$ are certain graph invariants.

How can posterior distributions of this form be represented and/or simulated, and what are the properties of such distributions? Even the case where the graphs considered are all forests is difficult. Recent progress concerning structural Markov properties of distributions of the form (5) has been made by Byrne and Dawid (2015).

Summary for trees and forests

- Direct likelihood methods (ignoring penalty terms) lead to sensible results.
- The (bootstrap) sampling distribution of the tree MLE can be simulated.
- Penalty terms are additive along branches, so the highest AIC or BIC scoring tree (forest) is also available using a MWST algorithm.
- Tree methods scale extremely well with both sample size and number of variables.
- Pairwise marginal counts are sufficient statistics for the tree problem (the empirical covariance matrix in the Gaussian case). Note that sufficiency holds despite the parameter space being very different from an open subset of $\mathbb{R}^k$.

Graphical lasso

Consider an undirected Gaussian graphical model and the $\ell_1$-penalized log-likelihood function
$$2\, l_{\mathrm{pen}}(K) = \log\det K - \operatorname{tr}(K\bar W) - \kappa\|K\|_1.$$
The penalty $\|K\|_1 = \sum_{u,v}|k_{uv}|$ is essentially a convex relaxation of the number of edges in the graph, and optimization of the penalized likelihood will typically lead to several $k_{uv} = 0$, thus in effect estimating a particular graph.

This penalized likelihood can be maximized efficiently (Banerjee et al., 2008), as implemented in the graphical lasso (Friedman et al., 2008). Beware: the method is not scale-invariant!
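One widely used implementation is GraphicalLasso in scikit-learn; the following is a hedged usage sketch, assuming scikit-learn is installed, with a hypothetical wrapper name. Its alpha plays the role of the penalty $\kappa$, and standardizing first addresses the lack of scale invariance.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def glasso_graph(X, alpha=0.1):
    """Fit the graphical lasso to an (n, p) data matrix X and read off the
    estimated concentration matrix and the implied edge set."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize: penalty is not scale-invariant
    model = GraphicalLasso(alpha=alpha).fit(Xs)
    K = model.precision_                        # estimated concentration matrix
    p = K.shape[0]
    edges = [(u, v) for u in range(p) for v in range(u + 1, p)
             if abs(K[u, v]) > 1e-10]
    return K, edges
```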

[Figure: glasso for bodyfat (Weight, Knee, Thigh, Hip, Ankle, Biceps, Forearm, Abdomen, Height, Wrist, Neck, Chest, Age, BodyFat)]

Optimizing the convex glasso problem

We shall maximize the penalized likelihood function
$$l(K) = \log\det K - \operatorname{tr}(\bar W K) - \kappa\|K\|_1.$$
This has subgradient equation $\nabla l(K) = 0$, where
$$\nabla l(K) = \Sigma - \bar W - \kappa\Gamma$$
and $\Gamma = \operatorname{sign}(K)$, with $\operatorname{sign}(k_{uv}) = 1$ if $k_{uv} > 0$, $\operatorname{sign}(k_{uv}) = -1$ if $k_{uv} < 0$, and $\operatorname{sign}(k_{uv}) \in [-1, 1]$ if $k_{uv} = 0$.

Hence the glasso estimate $\check\Sigma$ of $\Sigma$ satisfies
$$\check\Sigma = \bar W + \kappa\Gamma.$$
Compare to the MLE
$$\hat\Sigma = \bar W + \Gamma,$$
where $\gamma_{uv} = 0$ whenever $u \sim v$.

Blocking the subgradient equation

Writing the subgradient equation in block matrix form with $S = \bar W$, the lower right corner being $1 \times 1$, we get
$$\begin{pmatrix} S_{11} & s_{12} \\ s_{12}^\top & s_{22} \end{pmatrix} - \begin{pmatrix} \Sigma_{11} & \sigma_{12} \\ \sigma_{12}^\top & \sigma_{22} \end{pmatrix} + \kappa \begin{pmatrix} \Gamma_{11} & \gamma_{12} \\ \gamma_{12}^\top & 1 \end{pmatrix} = 0.$$

Focusing on the upper right block of this equation we get
$$s_{12} - \sigma_{12} + \kappa\gamma_{12} = 0.$$
Using the identity $(\Sigma_{11})^{-1}\sigma_{12} = -k_{22}^{-1}k_{12} = \beta$, and thus $\operatorname{sign}(k_{12}) = -\operatorname{sign}(\beta)$, we can rewrite this equation as
$$\Sigma_{11}\beta - s_{12} + \kappa\operatorname{sign}(\beta) = 0.$$

Lasso regression

The lasso regression problem is to minimize
$$(y - Z\beta)^\top(y - Z\beta)/2 + \kappa\|\beta\|_1.$$
The subgradient equation for this problem becomes
$$Z^\top Z\beta - Z^\top y + \kappa\operatorname{sign}(\beta) = 0.$$
Compare this to the subgradient equation for the graphical lasso:
$$\Sigma_{11}\beta - s_{12} + \kappa\operatorname{sign}(\beta) = 0.$$
There is a simple iterative cyclic descent algorithm for solving the first equation, and this can of course be used to solve the second equation as well.
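The cyclic descent for the lasso subgradient equation amounts to soft-thresholding one coordinate at a time; a minimal sketch with hypothetical names and a fixed number of sweeps instead of a convergence test.

```python
import numpy as np

def soft_threshold(x, t):
    """T(x, t) = sign(x) (|x| - t)_+."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(Z, y, kappa, sweeps=100):
    """Cyclic coordinate descent for (y - Z b)'(y - Z b)/2 + kappa ||b||_1."""
    n, p = Z.shape
    beta = np.zeros(p)
    col_sq = (Z ** 2).sum(axis=0)
    for _ in range(sweeps):
        for j in range(p):
            r_j = y - Z @ beta + Z[:, j] * beta[j]     # residual excluding coordinate j
            beta[j] = soft_threshold(Z[:, j] @ r_j, kappa) / col_sq[j]
    return beta
```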

Cyclic descent algorithm for graphical lasso

Define the soft threshold function $T(x, t) = \operatorname{sign}(x)(|x| - t)_+$.

Input: empirical covariance matrix $S$; penalty parameter $\kappa$.
Output: glasso estimate $\hat K_\kappa$; concentration graph $\hat G_\kappa$.

1. Initialize $\Sigma \leftarrow S + \kappa I$; $\beta_{uv} \leftarrow 0$, $u, v \in V$.
2. Repeat for $v \in V$ until convergence:
   1. For $u \in V \setminus \{v\}$ until convergence: $\beta_{uv} \leftarrow T\bigl(s_{uv} - \sum_{w \neq u,v} \sigma_{uw}\beta_{wv},\, \kappa\bigr)/\sigma_{uu}$;
   2. For $u \in V \setminus \{v\}$ do $\sigma_{uv} \leftarrow \sum_{w \neq v} \sigma_{uw}\beta_{wv}$.
3. For $v \in V$ do:
   1. $\hat k_{vv} \leftarrow 1/(\sigma_{vv} - \sum_{w \neq v} \sigma_{vw}\beta_{wv})$;
   2. For $u \in V \setminus \{v\}$ do $\hat k_{uv} \leftarrow -\beta_{uv}\hat k_{vv}$.
4. Return $\hat K$ and the incidence graph of $\hat K$.
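Putting the steps together gives the sketch below; it follows the algorithm above with fixed sweep counts instead of convergence tests, reuses the soft threshold from the lasso sketch, and all names are hypothetical rather than any library's API.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def glasso_cyclic_descent(S, kappa, outer_sweeps=50, inner_sweeps=50):
    """Graphical lasso by cyclic descent: returns (K_hat, edge list)."""
    p = S.shape[0]
    Sigma = S + kappa * np.eye(p)                   # step 1: Sigma <- S + kappa I
    B = np.zeros((p, p))                            # beta_uv, one column per target v
    for _ in range(outer_sweeps):                   # step 2
        for v in range(p):
            idx = [u for u in range(p) if u != v]
            for _ in range(inner_sweeps):           # 2.1: lasso for column v
                for u in idx:
                    others = [w for w in idx if w != u]
                    r = S[u, v] - Sigma[u, others] @ B[others, v]
                    B[u, v] = soft_threshold(r, kappa) / Sigma[u, u]
            # 2.2: update the off-diagonal part of column v of Sigma
            Sigma[idx, v] = Sigma[np.ix_(idx, idx)] @ B[idx, v]
            Sigma[v, idx] = Sigma[idx, v]
    K = np.zeros((p, p))
    for v in range(p):                              # step 3: recover K from Sigma and B
        idx = [u for u in range(p) if u != v]
        K[v, v] = 1.0 / (Sigma[v, v] - Sigma[v, idx] @ B[idx, v])
        K[idx, v] = -B[idx, v] * K[v, v]
    K = (K + K.T) / 2.0                             # symmetrize
    edges = [(u, v) for u in range(p)
             for v in range(u + 1, p) if abs(K[u, v]) > 1e-10]
    return K, edges                                 # step 4
```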

An alternative algorithm

This algorithm updates $2 \times 2$ submatrices of $K$ and resembles the IPS algorithm, but also in some sense Kruskal's algorithm.

Consider the restricted convex optimization problem: minimize
$$-\log\det(K) + \operatorname{tr}(KS) + \kappa\|K\|_1$$
subject to $k_{ij} = k^*_{ij}$ for $i \notin \{u,v\}$ or $j \notin \{u,v\}$.

Using Schur complements, the objective function becomes equivalent to
$$-\log\det(K_{cc} - K_{ca}K_{aa}^{-1}K_{ac}) + \operatorname{tr}(K_{cc}S_{cc}) + \kappa\|K_{cc}\|_1,$$
where $c = \{u, v\}$ and $a = V \setminus \{u, v\}$.

This problem is trivial to solve without iteration. Iterating through edges in order of decreasing unexplained correlation should give a very efficient algorithm.

References

Banerjee, O., El Ghaoui, L., and d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516.
Byrne, S. and Dawid, A. P. (2015). Structural Markov graph laws for Bayesian model uncertainty. The Annals of Statistics, 43:1647–1681.
Dawid, A. P. and Lauritzen, S. L. (1993). Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics, 21:1272–1317.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.
Højsgaard, S., Edwards, D., and Lauritzen, S. (2012). Graphical Models with R. Springer-Verlag, New York.
Lauritzen, S. L. (1996). Graphical Models. Clarendon Press, Oxford, United Kingdom.


More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

Graphical Gaussian Models with Edge and Vertex Symmetries

Graphical Gaussian Models with Edge and Vertex Symmetries Graphical Gaussian Models with Edge and Vertex Symmetries Søren Højsgaard Aarhus University, Denmark Steffen L. Lauritzen University of Oxford, United Kingdom Summary. In this paper we introduce new types

More information

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning Professor Erik Sudderth Brown University Computer Science October 4, 2016 Some figures and materials courtesy

More information

Fast Regularization Paths via Coordinate Descent

Fast Regularization Paths via Coordinate Descent August 2008 Trevor Hastie, Stanford Statistics 1 Fast Regularization Paths via Coordinate Descent Trevor Hastie Stanford University joint work with Jerry Friedman and Rob Tibshirani. August 2008 Trevor

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

Bayesian Inference of Multiple Gaussian Graphical Models

Bayesian Inference of Multiple Gaussian Graphical Models Bayesian Inference of Multiple Gaussian Graphical Models Christine Peterson,, Francesco Stingo, and Marina Vannucci February 18, 2014 Abstract In this paper, we propose a Bayesian approach to inference

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

Posterior convergence rates for estimating large precision. matrices using graphical models

Posterior convergence rates for estimating large precision. matrices using graphical models Biometrika (2013), xx, x, pp. 1 27 C 2007 Biometrika Trust Printed in Great Britain Posterior convergence rates for estimating large precision matrices using graphical models BY SAYANTAN BANERJEE Department

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

CSC 412 (Lecture 4): Undirected Graphical Models

CSC 412 (Lecture 4): Undirected Graphical Models CSC 412 (Lecture 4): Undirected Graphical Models Raquel Urtasun University of Toronto Feb 2, 2016 R Urtasun (UofT) CSC 412 Feb 2, 2016 1 / 37 Today Undirected Graphical Models: Semantics of the graph:

More information

Example: multivariate Gaussian Distribution

Example: multivariate Gaussian Distribution School of omputer Science Probabilistic Graphical Models Representation of undirected GM (continued) Eric Xing Lecture 3, September 16, 2009 Reading: KF-chap4 Eric Xing @ MU, 2005-2009 1 Example: multivariate

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

3 : Representation of Undirected GM

3 : Representation of Undirected GM 10-708: Probabilistic Graphical Models 10-708, Spring 2016 3 : Representation of Undirected GM Lecturer: Eric P. Xing Scribes: Longqi Cai, Man-Chia Chang 1 MRF vs BN There are two types of graphical models:

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

More information

Bayesian Model Comparison

Bayesian Model Comparison BS2 Statistical Inference, Lecture 11, Hilary Term 2009 February 26, 2009 Basic result An accurate approximation Asymptotic posterior distribution An integral of form I = b a e λg(y) h(y) dy where h(y)

More information

High-dimensional graphical model selection: Practical and information-theoretic limits

High-dimensional graphical model selection: Practical and information-theoretic limits 1 High-dimensional graphical model selection: Practical and information-theoretic limits Martin Wainwright Departments of Statistics, and EECS UC Berkeley, California, USA Based on joint work with: John

More information

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014 Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 Recitation 3 1 Gaussian Graphical Models: Schur s Complement Consider

More information

Bayesian (conditionally) conjugate inference for discrete data models. Jon Forster (University of Southampton)

Bayesian (conditionally) conjugate inference for discrete data models. Jon Forster (University of Southampton) Bayesian (conditionally) conjugate inference for discrete data models Jon Forster (University of Southampton) with Mark Grigsby (Procter and Gamble?) Emily Webb (Institute of Cancer Research) Table 1:

More information

Robust and sparse Gaussian graphical modelling under cell-wise contamination

Robust and sparse Gaussian graphical modelling under cell-wise contamination Robust and sparse Gaussian graphical modelling under cell-wise contamination Shota Katayama 1, Hironori Fujisawa 2 and Mathias Drton 3 1 Tokyo Institute of Technology, Japan 2 The Institute of Statistical

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 5, 06 Reading: See class website Eric Xing @ CMU, 005-06

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

A Local Inverse Formula and a Factorization

A Local Inverse Formula and a Factorization A Local Inverse Formula and a Factorization arxiv:60.030v [math.na] 4 Oct 06 Gilbert Strang and Shev MacNamara With congratulations to Ian Sloan! Abstract When a matrix has a banded inverse there is a

More information

Uncertainty quantification and visualization for functional random variables

Uncertainty quantification and visualization for functional random variables Uncertainty quantification and visualization for functional random variables MascotNum Workshop 2014 S. Nanty 1,3 C. Helbert 2 A. Marrel 1 N. Pérot 1 C. Prieur 3 1 CEA, DEN/DER/SESI/LSMR, F-13108, Saint-Paul-lez-Durance,

More information

Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Fisher Information in Gaussian Graphical Models

Fisher Information in Gaussian Graphical Models Fisher Information in Gaussian Graphical Models Jason K. Johnson September 21, 2006 Abstract This note summarizes various derivations, formulas and computational algorithms relevant to the Fisher information

More information

Sparse Inverse Covariance Estimation for a Million Variables

Sparse Inverse Covariance Estimation for a Million Variables Sparse Inverse Covariance Estimation for a Million Variables Inderjit S. Dhillon Depts of Computer Science & Mathematics The University of Texas at Austin SAMSI LDHD Opening Workshop Raleigh, North Carolina

More information