Structure estimation for Gaussian graphical models


Faculty of Science
Structure estimation for Gaussian graphical models
Steffen Lauritzen, University of Copenhagen, Department of Mathematical Sciences
Minikurs TUM 2016, Lecture 3

Overview of lectures

Lecture 1: Markov Properties and the Multivariate Gaussian Distribution
Lecture 2: Likelihood Analysis of Gaussian Graphical Models
Lecture 3: Structure Estimation for Gaussian Graphical Models

For reference, if nothing else is mentioned, see Lauritzen (1996), Chapters 3 and 4.

Gaussian graphical model

$\mathcal{S}(G)$ denotes the symmetric matrices $A$ with $a_{\alpha\beta} = 0$ unless $\alpha \sim \beta$, and $\mathcal{S}^+(G)$ their positive definite elements.

A Gaussian graphical model for $X$ specifies $X$ as multivariate normal with $K \in \mathcal{S}^+(G)$ and otherwise unknown.

The likelihood function based on a sample of size $n$ is
$$L(K) \propto (\det K)^{n/2} e^{-\operatorname{tr}(KW)/2},$$
where $W$ is the (Wishart) matrix of sums of squares and products and $\Sigma^{-1} = K \in \mathcal{S}^+(G)$.
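As a small illustration of this likelihood, the log-likelihood (up to an additive constant) can be evaluated directly from $K$ and $W$; the following is a minimal numpy sketch with hypothetical function and variable names.

```python
import numpy as np

def gaussian_graphical_loglik(K, W, n):
    """log L(K) = (n/2) log det K - tr(K W)/2, up to an additive constant,
    for concentration matrix K, Wishart matrix W of sums of squares and
    products, and sample size n."""
    sign, logdet = np.linalg.slogdet(K)
    if sign <= 0:
        raise ValueError("K must be positive definite")
    return 0.5 * n * logdet - 0.5 * np.trace(K @ W)
```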

Representation via basis matrices

Define the matrices $T^u$, $u \in V \cup E$, as those with elements
$$T^u_{ij} = \begin{cases} 1 & \text{if } u \in V \text{ and } i = j = u, \\ 1 & \text{if } u \in E \text{ and } u = \{i, j\}, \\ 0 & \text{otherwise.} \end{cases}$$
Then $T^u$, $u \in V \cup E$, forms a basis for the linear space $\mathcal{S}(G)$ of symmetric matrices over $V$ which have zero entries $ij$ whenever $i$ and $j$ are non-adjacent in $G$.

We can then identify the family as a (regular and canonical) exponential family with $-\operatorname{tr}(T^u W)/2$, $u \in V \cup E$, as canonical sufficient statistics. This yields the likelihood equations
$$\operatorname{tr}(T^u w) = n \operatorname{tr}(T^u \Sigma), \quad u \in V \cup E.$$
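For concreteness, the basis matrices for a small graph can be built explicitly; this is a minimal sketch with hypothetical names.

```python
import numpy as np

def basis_matrices(p, edges):
    """Basis matrices T^u, u in V ∪ E, of S(G) for vertices 0..p-1 and an
    edge list of pairs (i, j); returned as a dict keyed by vertex or edge."""
    Ts = {}
    for v in range(p):                 # vertex matrices: a single diagonal 1
        T = np.zeros((p, p))
        T[v, v] = 1.0
        Ts[v] = T
    for (i, j) in edges:               # edge matrices: symmetric off-diagonal 1s
        T = np.zeros((p, p))
        T[i, j] = T[j, i] = 1.0
        Ts[(i, j)] = T
    return Ts

# The likelihood equations tr(T^u w) = n tr(T^u Sigma) say that the fitted
# Sigma matches w/n on the diagonal and on the edges of G.
```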

Iterative Proportional Scaling

For $K \in \mathcal{S}^+(G)$ and $c \in \mathcal{C}$, define the operation of adjusting the $c$-marginal as follows: let $a = V \setminus c$ and
$$M_c K = \begin{pmatrix} n(w_{cc})^{-1} + K_{ca}(K_{aa})^{-1}K_{ac} & K_{ca} \\ K_{ac} & K_{aa} \end{pmatrix}. \quad (1)$$
This operation is clearly well defined if $w_{cc}$ is positive definite.

Next we choose any ordering $(c_1, \ldots, c_k)$ of the cliques in $G$. Choose further $K_0 = I$ and define for $r = 0, 1, \ldots$
$$K_{r+1} = (M_{c_1} \cdots M_{c_k})K_r.$$
Then we have
$$\hat K = \lim_{r \to \infty} K_r,$$
provided the maximum likelihood estimate $\hat K$ of $K$ exists.
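A minimal numpy sketch of the marginal adjustment (1) and the resulting IPS iteration follows; it runs a fixed number of sweeps rather than testing convergence, and all names are hypothetical.

```python
import numpy as np

def adjust_marginal(K, w, c, n):
    """One operation M_c: reset the c-block of K so that the implied
    marginal covariance on the clique c equals w_cc / n."""
    p = K.shape[0]
    c = list(c)
    a = [i for i in range(p) if i not in c]
    K_new = K.copy()
    K_ca = K[np.ix_(c, a)]
    K_aa_inv = np.linalg.inv(K[np.ix_(a, a)])
    w_cc = w[np.ix_(c, c)]
    K_new[np.ix_(c, c)] = n * np.linalg.inv(w_cc) + K_ca @ K_aa_inv @ K_ca.T
    return K_new

def ips(w, cliques, n, sweeps=100):
    """Iterative proportional scaling started at the identity matrix."""
    K = np.eye(w.shape[0])
    for _ in range(sweeps):
        for c in cliques:
            K = adjust_marginal(K, w, c, n)
    return K
```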

Characterizing decomposable graphs

A graph is chordal if all cycles of length $\geq 4$ have chords. The following are equivalent for any undirected graph $G$:
(i) $G$ is chordal;
(ii) $G$ is decomposable;
(iii) all maximal prime subgraphs of $G$ are cliques.
There are also many other useful characterizations of chordal graphs and algorithms that identify them. Trees are chordal graphs and thus decomposable.

If the graph $G$ is chordal, we say that the graphical model is decomposable. In this case, the IPS algorithm converges in a finite number of steps. We also have the factorization of densities
$$f(x \mid \Sigma) = \frac{\prod_{C \in \mathcal{C}} f(x_C \mid \Sigma_C)}{\prod_{S \in \mathcal{S}} f(x_S \mid \Sigma_S)^{\nu(S)}}, \quad (2)$$
where $\nu(S)$ is the number of times $S$ appears as an intersection between neighbouring cliques of a junction tree for $\mathcal{C}$. Similar factorizations naturally hold for the maximum likelihood estimate $\hat\Sigma$.

Structure estimation

Advances in computing have put focus on estimation of structure:
- Model selection (e.g. subset selection in regression)
- System identification (engineering)
- Structural learning (AI or machine learning)
Graphical models describe conditional independence structures, so they are a good case for formal analysis.
Methods must scale well with data size, as many structures and huge collections of data are to be considered.

Why estimation of structure?

- Parallel to e.g. density estimation
- Obtain a quick overview of relations between variables in complex systems
- Data mining
- Gene regulatory networks
- Reconstructing family trees from DNA information
- General interest in sparsity.

[Figure: Markov mesh model]

[Figure: PC algorithm. Crudest algorithm (HUGIN), simulated cases]

[Figure: Bayesian GES. Crudest algorithm (WinMine), simulated cases]

[Figure: Tree model. PC algorithm, simulated cases, correct reconstruction]

[Figure: Bayesian GES on tree]

[Figure: Chest clinic]

[Figure: PC algorithm, simulated cases]

[Figure: Bayesian GES]

[Figure: SNPs and gene expressions, min BIC forest]

Methods for structure identification in graphical models can be classified into three types:
- score-based methods: for example optimizing a penalized likelihood by convex programming, e.g. glasso;
- Bayesian methods: identifying posterior distributions over graphs; one can also use the posterior probability as a score;
- constraint-based methods: querying conditional independences and identifying compatible independence structures, for example PC, PC*, NPC, IC, CI, FCI, SIN, QP, ...

Penalized likelihood

Methods based on pure maximum likelihood are not feasible when the dimension of the parameter space varies.

Instead, trade off goodness-of-fit, measured by the maximized likelihood, against the complexity of the model:
$$\mathrm{IC}_\kappa(G) = -2 \log L_G(\hat\theta_G) + \kappa \dim(G),$$
where $\hat\theta_G$ is the MLE, $\dim(G)$ is the number of free parameters, and $\kappa$ is a constant that gives the exchange rate for trading fit against parameters.

$\kappa$ may depend on the number $n$ of observations, but is constant over the set of possible graphs $G$.

Akaike's Information Criterion

The criterion AIC has $\kappa = 2$ independently of the number of observations. It is meant to optimize the prediction error for predicting the next observation.

AIC is not consistent for $n \to \infty$, as it will tend to include too many parameters. Hence, used for a Gaussian graphical model with large $n$, the selected model will tend not to be sparse.

Bayesian Information Criterion

An asymptotic Bayesian argument leads to BIC, which has $\kappa = \log n$, where $n$ is the number of observations.

The BIC ensures consistent estimation of the graph. However, the true structure can be identified faster if, say,
$$\kappa_n = C \log\log n$$
for some $C > 1$.
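The generic criterion and its AIC and BIC special cases are one-liners; a hedged sketch with hypothetical function names, where max_log_lik is the maximized log-likelihood and dim_G the number of free parameters.

```python
import numpy as np

def ic(max_log_lik, dim_G, kappa):
    """IC_kappa(G) = -2 log L_G(theta_hat) + kappa * dim(G)."""
    return -2.0 * max_log_lik + kappa * dim_G

def aic(max_log_lik, dim_G):
    return ic(max_log_lik, dim_G, kappa=2.0)          # kappa = 2

def bic(max_log_lik, dim_G, n):
    return ic(max_log_lik, dim_G, kappa=np.log(n))    # kappa = log n
```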

Other penalized methods

Other penalized likelihood methods use criteria of the form
$$l_\kappa(G, \theta_G) = -2 \log L_G(\theta_G) + \kappa \|\theta_G\|,$$
where $\|\theta_G\|$ measures the size of the parameter, for example using a vector space norm.

An example of this for Gaussian graphical models is the so-called graphical lasso, based on minimizing
$$l_\kappa(K) = -2 \log L(K) + \kappa \|K\|_1,$$
where now the graph $G$ is only implicitly represented through $K$ itself.

This is a convex optimization problem, and in some sense $l_\kappa$ is a convex variant of the IC criteria.

Bayesian methods

A full Bayesian approach will use suitable prior distributions, in the Gaussian case known as hyper Markov Wishart and hyper Markov inverse Wishart prior distributions.

One then writes
$$f(x \mid G) = \int_{K \in \mathcal{S}^+(G)} f(x \mid K)\, \pi_G(dK)$$
and further
$$\pi(G \mid x) \propto f(x \mid G)\,\pi(G).$$

Attempting, say, to maximize $\pi(G \mid x)$ over $G$ leads to the MAP estimate of $G$. Asymptotically for large $n$ this would be equivalent to BIC.

Estimating trees and forests

The simplest case to consider is that where the unknown conditional independence structure is a tree $T \in \mathcal{T}(V)$; since a tree is decomposable, any distribution $P$ which factorizes w.r.t. $T = (V, E)$ has a density of the form
$$f(x) = \frac{\prod_{e \in E} f_e(x_e)}{\prod_{v \in V} f_v(x_v)^{d(v)-1}} = \prod_{uv \in E} \frac{f_{uv}(x_{uv})}{f_u(x_u)f_v(x_v)} \prod_{v \in V} f_v(x_v). \quad (3)$$

Maximum likelihood trees

Next we consider the situation where we have a sample $x = (x^1, \ldots, x^n)$ from a distribution $P$ of $X = X_V$ which is Gaussian and known to factorize according to a tree $T \in \mathcal{T}(V)$, but where both $P$ and $T$ are otherwise unknown. In other words, we assume the unknown concentration matrix $K$ satisfies
$$K \in \bigcup_{T \in \mathcal{T}(V)} \mathcal{S}^+(T).$$

To maximize the likelihood function over this parameter space, we first maximize for a fixed tree to get the profile likelihood $\hat L(T \mid x)$, where
$$\hat L(T) = \hat L(T \mid x) = \sup_{K \in \mathcal{S}^+(T)} L(K \mid x);$$
we then further maximize $\hat L(T)$ over all trees $T \in \mathcal{T}(V)$.

Since a tree is decomposable, the profile likelihood satisfies
$$\hat L(T \mid x) = f(x \mid \hat K_T) = \frac{\prod_{e \in E} \hat f_{[e]}(x_e)}{\prod_{v \in V} \hat f_{[v]}(x_v)^{d(v)-1}} = \prod_{uv \in E} \frac{\hat f_{[uv]}(x_{uv})}{\hat f_{[u]}(x_u)\hat f_{[v]}(x_v)} \prod_{v \in V} \hat f_{[v]}(x_v).$$
Here $\hat f_{[A]}(x)$ denotes the maximized likelihood for the marginal distribution of $X_A$ based on the data $x_A$ only and using the saturated model for $X_A$.

More precisely, for $x = (x^1, \ldots, x^n)$ we have
$$\hat f_{[uv]}(x) = (2\pi)^{-n} \det(W_{\{uv\}}/n)^{-n/2} e^{-\operatorname{tr}\{n (W_{\{uv\}})^{-1} W_{\{uv\}}\}/2} = n^n (2\pi)^{-n} (w_{uu}w_{vv} - w_{uv}^2)^{-n/2} \exp(-n)$$
and
$$\hat f_{[v]}(x) = (2\pi)^{-n/2}(W_{vv}/n)^{-n/2}\exp(-n/2) = n^{n/2}(2\pi)^{-n/2}(w_{vv})^{-n/2}\exp(-n/2),$$
where $W = \{w_{uv},\, u, v \in V\}$ is the Wishart matrix of sums of squares and products.

Thus we get in particular
$$\log \frac{\hat f_{[uv]}(x_{uv})}{\hat f_{[u]}(x_u)\hat f_{[v]}(x_v)} = -\frac{n}{2}\log\frac{w_{uu}w_{vv} - w_{uv}^2}{w_{uu}w_{vv}} = -\frac{n}{2}\log(1 - r_{uv}^2),$$
where $r_{uv}$ is the empirical correlation coefficient $r_{uv} = w_{uv}/\sqrt{w_{uu}w_{vv}}$.

Define the empirical correlation weight $\omega_{uv}$ of the edge $uv$ as
$$\omega_{uv} = -\frac{n}{2}\log(1 - r_{uv}^2)$$
and let $\omega(T) = \sum_{uv \in E(T)} \omega_{uv}$ denote the total empirical weight of the tree $T$. The matrix $\Omega = \{\omega_{uv}\}$ is the correlation weight matrix.
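The weight matrix $\Omega$ is easy to compute from the Wishart matrix; a minimal sketch with hypothetical names.

```python
import numpy as np

def correlation_weights(w, n):
    """Omega with omega_uv = -(n/2) log(1 - r_uv^2), where r_uv is the
    empirical correlation built from the Wishart matrix w."""
    d = np.sqrt(np.diag(w))
    R = w / np.outer(d, d)            # empirical correlation matrix
    np.fill_diagonal(R, 0.0)          # the diagonal is not used as a weight
    return -0.5 * n * np.log(1.0 - R ** 2)
```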

Further let $\hat L(\emptyset)$ denote the maximized likelihood under independence:
$$\hat L(\emptyset) = \prod_{v \in V} \hat f_{[v]}(x_v).$$
Then, clearly, it holds that
$$\log \hat L(T) - \log \hat L(\emptyset) = \omega(T) = \sum_{uv \in E(T)} \omega_{uv}. \quad (4)$$

We say that $\hat T$ is a maximum likelihood tree based on a sample $x = (x^1, \ldots, x^n)$ if $\hat T$ satisfies
$$\hat L(\hat T) = \sup_{T \in \mathcal{T}(V)} \hat L(T).$$

A spanning tree $T$ of a connected graph $G = (V, E)$ is a subgraph $T = (V, E_T)$ of $G$ which has the same vertex set and is a tree; that is, $E_T \subseteq E(G)$.

We then have the following result:

Theorem. A tree $\hat T \in \mathcal{T}(V)$ is a maximum likelihood tree if and only if it is a maximum weight spanning tree (MWST) of the complete graph with vertex set $V$ for the weight matrix $\Omega$ with $\omega_{uv} = -\frac{n}{2}\log(1 - r_{uv}^2)$, that is,
$$\hat T = \arg\max_{T \in \mathcal{T}(V)} \hat L(T) \iff \hat T = \arg\max_{T \in \mathcal{T}(V)} \omega(T).$$
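Given $\Omega$, the maximum likelihood tree can therefore be obtained with any maximum weight spanning tree routine; this is a sketch assuming the networkx library is available, with hypothetical wrapper names.

```python
import networkx as nx

def maximum_likelihood_tree(Omega):
    """MWST of the complete graph on the p variables with edge weights
    Omega[u, v]; returns the edge list of the estimated tree."""
    p = Omega.shape[0]
    G = nx.Graph()
    G.add_nodes_from(range(p))
    for u in range(p):
        for v in range(u + 1, p):
            G.add_edge(u, v, weight=Omega[u, v])
    T = nx.maximum_spanning_tree(G, weight="weight")
    return sorted(T.edges())
```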

Kruskal's algorithm

This runs as follows and outputs a MWST.

Order all off-diagonal elements in the matrix $\Omega$ from largest to smallest, so that for $E = \{e_1, \ldots, e_k\}$, where $k = |V|(|V|-1)/2$, we have $\omega_{e_i} \leq \omega_{e_j}$ whenever $i > j$.
1. Let $F = (V, \emptyset)$.
2. For $i = 1, 2, \ldots$ until $F$ is a spanning tree do:
3. If $E(F) \cup e_i$ is a forest, let $E(F) = E(F) \cup e_i$; else let $E(F) = E(F)$.
4. Return $F$.
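A direct implementation of these steps, with the forest check done through a small union-find structure; a sketch with hypothetical names.

```python
def kruskal_mwst(Omega):
    """Kruskal's algorithm: maximum weight spanning tree for weight matrix Omega."""
    p = Omega.shape[0]
    edges = sorted(((Omega[u, v], u, v)
                    for u in range(p) for v in range(u + 1, p)), reverse=True)
    parent = list(range(p))

    def find(x):                      # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                  # adding the edge keeps F a forest
            parent[ru] = rv
            tree.append((u, v))
            if len(tree) == p - 1:    # F is now a spanning tree
                break
    return tree
```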

Penalized likelihood forests

If we instead wish to estimate an unknown forest, i.e. assume that $K \in \mathcal{S}^+(F)$ where $F$ is unknown, we use a penalized form of the likelihood:
$$\mathrm{IC}_\kappa(F) = -2 \log \hat L(F) + \kappa\{|V| + |E(F)|\},$$
since $|V| + |E(F)|$ is the dimension of the model determined by $F$.

Using (4) yields
$$\mathrm{IC}_\kappa(F) = -2\{\omega(F) - \kappa|E(F)|/2\} + \text{const} = -2\sum_{uv \in E(F)}(\omega_{uv} - \kappa/2) + \text{const}.$$

Thus we can minimize the IC score by using a modification of Kruskal's algorithm on $\Omega^\kappa$, where $\omega^\kappa_{uv} = \omega_{uv} - \kappa/2$:

Discard all negative off-diagonal elements in the matrix $\Omega^\kappa$ and order the remaining ones from largest to smallest.
1. Let $F = (V, \emptyset)$.
2. For $i = 1, 2, \ldots$ until the list of edges is exhausted (or $F$ is a spanning tree) do:
3. If $E(F) \cup e_i$ is a forest, let $E(F) = E(F) \cup e_i$; else let $E(F) = E(F)$.
4. Return $F$.
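Relative to the Kruskal sketch above, the only changes are the shifted weights and the discarding of non-positive penalized weights; again a sketch with hypothetical names, where taking kappa = log n gives the min-BIC forest.

```python
def min_ic_forest(Omega, kappa):
    """Maximum weight forest for the penalized weights omega_uv - kappa/2,
    using only edges whose penalized weight is positive."""
    p = Omega.shape[0]
    edges = sorted(((Omega[u, v] - kappa / 2.0, u, v)
                    for u in range(p) for v in range(u + 1, p)), reverse=True)
    parent = list(range(p))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for w, u, v in edges:
        if w <= 0:                    # remaining edges can only worsen the score
            break
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v))
    return forest
```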

[Figure: BodyFat, min BIC forest. Data in Højsgaard et al. (2012): measurements of body parts interesting for prediction of body fat (Age, BodyFat, Wrist, Height, Abdomen, Chest, Neck, Weight, Knee, Forearm, Biceps, Ankle, Hip, Thigh).]

[Figure: SNPs and gene expressions, min BIC forest]

Random graphs for posterior analysis

A Bayesian approach to graphical model analysis implies setting up a prior distribution over a class of graphs, say undirected trees, and then finding the posterior distribution.

For example, if the prior is uniform over trees and the parameters are hyper Markov (Dawid and Lauritzen, 1993), the posterior distribution based on data $x$ is
$$p(\tau \mid x) \propto w(\tau) = \prod_{e \in E(\tau)} \mathrm{BF}_e,$$
where $\mathrm{BF}_e$ is the Bayes factor for independence among the variables at the endpoints of $e$.

The unknown normalization constant $\sum_\tau w(\tau)$ can be found as a determinant using the matrix tree theorem.
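By the weighted matrix tree theorem, the normalizing constant $\sum_\tau w(\tau)$ equals any cofactor of the weighted Laplacian of the complete graph with edge weights $\mathrm{BF}_e$; a small sketch with a hypothetical name for the Bayes-factor matrix.

```python
import numpy as np

def tree_normalizing_constant(BF):
    """Sum over all spanning trees of the product of edge weights BF[u, v],
    computed as a cofactor of the weighted Laplacian (matrix tree theorem)."""
    A = np.array(BF, dtype=float)
    np.fill_diagonal(A, 0.0)
    L = np.diag(A.sum(axis=1)) - A          # weighted Laplacian
    return np.linalg.det(L[1:, 1:])         # delete any one row and column
```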

For chordal graphs $G$ the similar expression becomes
$$p(G \mid x) \propto \frac{\prod_{C \in \mathcal{C}(G)} w(C \mid x)}{\prod_{S \in \mathcal{S}(G)} w(S \mid x)^{\nu_G(S)}}, \quad (5)$$
where $\mathcal{C}(G)$ are the maximal cliques of $G$, $\mathcal{S}(G)$ the minimal complete separators, and $\nu_G(S)$ are certain graph invariants.

How can posterior distributions of this form be represented and/or simulated, and what are the properties of such distributions? Even the case where the graphs considered are all forests is difficult. Recent progress concerning structural Markov properties of distributions of the form (5) has been made by Byrne and Dawid (2015).

Summary for trees and forests

- Direct likelihood methods (ignoring penalty terms) lead to sensible results.
- The (bootstrap) sampling distribution of the tree MLE can be simulated.
- Penalty terms are additive along branches, so the highest AIC or BIC scoring tree (forest) is also available using a MWST algorithm.
- Tree methods scale extremely well with both sample size and number of variables.
- Pairwise marginal counts are sufficient statistics for the tree problem (the empirical covariance matrix in the Gaussian case). Note that sufficiency holds despite the parameter space being very different from an open subset of $\mathbb{R}^k$.

Graphical lasso

Consider an undirected Gaussian graphical model and the $\ell_1$-penalized log-likelihood function
$$2\, l_{\mathrm{pen}}(K) = \log\det K - \operatorname{tr}(K\bar W) - \kappa\|K\|_1.$$
The penalty $\|K\|_1 = \sum_{u,v}|k_{uv}|$ is essentially a convex relaxation of the number of edges in the graph, and optimization of the penalized likelihood will typically lead to several $k_{uv} = 0$, thus in effect estimating a particular graph.

This penalized likelihood can be maximized efficiently (Banerjee et al., 2008), as implemented in the graphical lasso (Friedman et al., 2008). Beware: the method is not scale-invariant!
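One widely used implementation is GraphicalLasso in scikit-learn; the following is a hedged usage sketch, assuming scikit-learn is installed, with a hypothetical wrapper name. Its alpha plays the role of the penalty $\kappa$, and standardizing first addresses the lack of scale invariance.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def glasso_graph(X, alpha=0.1):
    """Fit the graphical lasso to an (n, p) data matrix X and read off the
    estimated concentration matrix and the implied edge set."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize: penalty is not scale-invariant
    model = GraphicalLasso(alpha=alpha).fit(Xs)
    K = model.precision_                        # estimated concentration matrix
    p = K.shape[0]
    edges = [(u, v) for u in range(p) for v in range(u + 1, p)
             if abs(K[u, v]) > 1e-10]
    return K, edges
```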

[Figure: glasso for bodyfat (Weight, Knee, Thigh, Hip, Ankle, Biceps, Forearm, Abdomen, Height, Wrist, Neck, Chest, Age, BodyFat)]

Optimizing the convex glasso problem

We shall maximize the penalized likelihood function
$$l(K) = \log\det K - \operatorname{tr}(\bar W K) - \kappa\|K\|_1.$$
This has subgradient equation $\nabla l(K) = 0$, where
$$\nabla l(K) = \Sigma - \bar W - \kappa\Gamma$$
and $\Gamma = \operatorname{sign}(K)$, with $\operatorname{sign}(k_{uv}) = 1$ if $k_{uv} > 0$, $\operatorname{sign}(k_{uv}) = -1$ if $k_{uv} < 0$, and $\operatorname{sign}(k_{uv}) \in [-1, 1]$ if $k_{uv} = 0$.

Hence the glasso estimate $\check\Sigma$ of $\Sigma$ satisfies
$$\check\Sigma = \bar W + \kappa\Gamma.$$
Compare to the MLE
$$\hat\Sigma = \bar W + \Gamma,$$
where $\gamma_{uv} = 0$ whenever $u \sim v$.

Blocking the subgradient equation

Writing the subgradient equation in block matrix form with $S = \bar W$, the lower right corner being $1 \times 1$, we get
$$\begin{pmatrix} S_{11} & s_{12} \\ s_{12}^\top & s_{22} \end{pmatrix} - \begin{pmatrix} \Sigma_{11} & \sigma_{12} \\ \sigma_{12}^\top & \sigma_{22} \end{pmatrix} + \kappa \begin{pmatrix} \Gamma_{11} & \gamma_{12} \\ \gamma_{12}^\top & 1 \end{pmatrix} = 0.$$

Focusing on the upper right block of this equation we get
$$s_{12} - \sigma_{12} + \kappa\gamma_{12} = 0.$$
Using the identity $(\Sigma_{11})^{-1}\sigma_{12} = -k_{22}^{-1}k_{12} = \beta$, and thus $\operatorname{sign}(k_{12}) = -\operatorname{sign}(\beta)$, we can rewrite this equation as
$$\Sigma_{11}\beta - s_{12} + \kappa\operatorname{sign}(\beta) = 0.$$

Lasso regression

The lasso regression problem is to minimize
$$(y - Z\beta)^\top(y - Z\beta)/2 + \kappa\|\beta\|_1.$$
The subgradient equation for this problem becomes
$$Z^\top Z\beta - Z^\top y + \kappa\operatorname{sign}(\beta) = 0.$$
Compare this to the subgradient equation for the graphical lasso:
$$\Sigma_{11}\beta - s_{12} + \kappa\operatorname{sign}(\beta) = 0.$$
There is a simple iterative cyclic descent algorithm for solving the first equation, and this can of course be used to solve the second equation as well.
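The cyclic descent for the lasso subgradient equation amounts to soft-thresholding one coordinate at a time; a minimal sketch with hypothetical names and a fixed number of sweeps instead of a convergence test.

```python
import numpy as np

def soft_threshold(x, t):
    """T(x, t) = sign(x) (|x| - t)_+."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(Z, y, kappa, sweeps=100):
    """Cyclic coordinate descent for (y - Z b)'(y - Z b)/2 + kappa ||b||_1."""
    n, p = Z.shape
    beta = np.zeros(p)
    col_sq = (Z ** 2).sum(axis=0)
    for _ in range(sweeps):
        for j in range(p):
            r_j = y - Z @ beta + Z[:, j] * beta[j]     # residual excluding coordinate j
            beta[j] = soft_threshold(Z[:, j] @ r_j, kappa) / col_sq[j]
    return beta
```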

Cyclic descent algorithm for graphical lasso

Define the soft threshold function $T(x, t) = \operatorname{sign}(x)(|x| - t)_+$.

Input: empirical covariance matrix $S$; penalty parameter $\kappa$.
Output: glasso estimate $\hat K_\kappa$; concentration graph $\hat G_\kappa$.

1. Initialize $\Sigma \leftarrow S + \kappa I$; $\beta_{uv} \leftarrow 0$, $u, v \in V$.
2. Repeat for $v \in V$ until convergence:
   1. For $u \in V \setminus \{v\}$ until convergence: $\beta_{uv} \leftarrow T\bigl(s_{uv} - \sum_{w \neq u,v} \sigma_{uw}\beta_{wv},\, \kappa\bigr)/\sigma_{uu}$;
   2. For $u \in V \setminus \{v\}$ do $\sigma_{uv} \leftarrow \sum_{w \neq v} \sigma_{uw}\beta_{wv}$.
3. For $v \in V$ do:
   1. $\hat k_{vv} \leftarrow 1/(\sigma_{vv} - \sum_{w \neq v} \sigma_{vw}\beta_{wv})$;
   2. For $u \in V \setminus \{v\}$ do $\hat k_{uv} \leftarrow -\beta_{uv}\hat k_{vv}$.
4. Return $\hat K$ and the incidence graph of $\hat K$.
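Putting the steps together gives the sketch below; it follows the algorithm above with fixed sweep counts instead of convergence tests, reuses the soft threshold from the lasso sketch, and all names are hypothetical rather than any library's API.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def glasso_cyclic_descent(S, kappa, outer_sweeps=50, inner_sweeps=50):
    """Graphical lasso by cyclic descent: returns (K_hat, edge list)."""
    p = S.shape[0]
    Sigma = S + kappa * np.eye(p)                   # step 1: Sigma <- S + kappa I
    B = np.zeros((p, p))                            # beta_uv, one column per target v
    for _ in range(outer_sweeps):                   # step 2
        for v in range(p):
            idx = [u for u in range(p) if u != v]
            for _ in range(inner_sweeps):           # 2.1: lasso for column v
                for u in idx:
                    others = [w for w in idx if w != u]
                    r = S[u, v] - Sigma[u, others] @ B[others, v]
                    B[u, v] = soft_threshold(r, kappa) / Sigma[u, u]
            # 2.2: update the off-diagonal part of column v of Sigma
            Sigma[idx, v] = Sigma[np.ix_(idx, idx)] @ B[idx, v]
            Sigma[v, idx] = Sigma[idx, v]
    K = np.zeros((p, p))
    for v in range(p):                              # step 3: recover K from Sigma and B
        idx = [u for u in range(p) if u != v]
        K[v, v] = 1.0 / (Sigma[v, v] - Sigma[v, idx] @ B[idx, v])
        K[idx, v] = -B[idx, v] * K[v, v]
    K = (K + K.T) / 2.0                             # symmetrize
    edges = [(u, v) for u in range(p)
             for v in range(u + 1, p) if abs(K[u, v]) > 1e-10]
    return K, edges                                 # step 4
```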

An alternative algorithm

This algorithm updates $2 \times 2$ submatrices of $K$ and resembles the IPS algorithm, but also in some sense Kruskal's algorithm.

Consider the restricted convex optimization problem: minimize
$$-\log\det(K) + \operatorname{tr}(KS) + \kappa\|K\|_1$$
subject to $k_{ij} = k^*_{ij}$ for $i \notin \{u,v\}$ or $j \notin \{u,v\}$.

Using Schur complements, the objective function becomes equivalent to
$$-\log\det(K_{cc} - K_{ca}K_{aa}^{-1}K_{ac}) + \operatorname{tr}(K_{cc}S_{cc}) + \kappa\|K_{cc}\|_1,$$
where $c = \{u, v\}$ and $a = V \setminus \{u, v\}$.

This problem is trivial to solve without iteration. Iterating through edges in order of decreasing unexplained correlation should give a very efficient algorithm.

References

Banerjee, O., El Ghaoui, L., and d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516.
Byrne, S. and Dawid, A. P. (2015). Structural Markov graph laws for Bayesian model uncertainty. The Annals of Statistics, 43:1647–1681.
Dawid, A. P. and Lauritzen, S. L. (1993). Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics, 21:1272–1317.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.
Højsgaard, S., Edwards, D., and Lauritzen, S. (2012). Graphical Models with R. Springer-Verlag, New York.
Lauritzen, S. L. (1996). Graphical Models. Clarendon Press, Oxford, United Kingdom.


More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

Graphical Gaussian Models with Edge and Vertex Symmetries

Graphical Gaussian Models with Edge and Vertex Symmetries Graphical Gaussian Models with Edge and Vertex Symmetries Søren Højsgaard Aarhus University, Denmark Steffen L. Lauritzen University of Oxford, United Kingdom Summary. In this paper we introduce new types

More information

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning Professor Erik Sudderth Brown University Computer Science October 4, 2016 Some figures and materials courtesy

More information

Fast Regularization Paths via Coordinate Descent

Fast Regularization Paths via Coordinate Descent August 2008 Trevor Hastie, Stanford Statistics 1 Fast Regularization Paths via Coordinate Descent Trevor Hastie Stanford University joint work with Jerry Friedman and Rob Tibshirani. August 2008 Trevor

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

Bayesian Inference of Multiple Gaussian Graphical Models

Bayesian Inference of Multiple Gaussian Graphical Models Bayesian Inference of Multiple Gaussian Graphical Models Christine Peterson,, Francesco Stingo, and Marina Vannucci February 18, 2014 Abstract In this paper, we propose a Bayesian approach to inference

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

Posterior convergence rates for estimating large precision. matrices using graphical models

Posterior convergence rates for estimating large precision. matrices using graphical models Biometrika (2013), xx, x, pp. 1 27 C 2007 Biometrika Trust Printed in Great Britain Posterior convergence rates for estimating large precision matrices using graphical models BY SAYANTAN BANERJEE Department

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

CSC 412 (Lecture 4): Undirected Graphical Models

CSC 412 (Lecture 4): Undirected Graphical Models CSC 412 (Lecture 4): Undirected Graphical Models Raquel Urtasun University of Toronto Feb 2, 2016 R Urtasun (UofT) CSC 412 Feb 2, 2016 1 / 37 Today Undirected Graphical Models: Semantics of the graph:

More information

Example: multivariate Gaussian Distribution

Example: multivariate Gaussian Distribution School of omputer Science Probabilistic Graphical Models Representation of undirected GM (continued) Eric Xing Lecture 3, September 16, 2009 Reading: KF-chap4 Eric Xing @ MU, 2005-2009 1 Example: multivariate

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

3 : Representation of Undirected GM

3 : Representation of Undirected GM 10-708: Probabilistic Graphical Models 10-708, Spring 2016 3 : Representation of Undirected GM Lecturer: Eric P. Xing Scribes: Longqi Cai, Man-Chia Chang 1 MRF vs BN There are two types of graphical models:

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

More information

Bayesian Model Comparison

Bayesian Model Comparison BS2 Statistical Inference, Lecture 11, Hilary Term 2009 February 26, 2009 Basic result An accurate approximation Asymptotic posterior distribution An integral of form I = b a e λg(y) h(y) dy where h(y)

More information

High-dimensional graphical model selection: Practical and information-theoretic limits

High-dimensional graphical model selection: Practical and information-theoretic limits 1 High-dimensional graphical model selection: Practical and information-theoretic limits Martin Wainwright Departments of Statistics, and EECS UC Berkeley, California, USA Based on joint work with: John

More information

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014 Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 Recitation 3 1 Gaussian Graphical Models: Schur s Complement Consider

More information

Bayesian (conditionally) conjugate inference for discrete data models. Jon Forster (University of Southampton)

Bayesian (conditionally) conjugate inference for discrete data models. Jon Forster (University of Southampton) Bayesian (conditionally) conjugate inference for discrete data models Jon Forster (University of Southampton) with Mark Grigsby (Procter and Gamble?) Emily Webb (Institute of Cancer Research) Table 1:

More information

Robust and sparse Gaussian graphical modelling under cell-wise contamination

Robust and sparse Gaussian graphical modelling under cell-wise contamination Robust and sparse Gaussian graphical modelling under cell-wise contamination Shota Katayama 1, Hironori Fujisawa 2 and Mathias Drton 3 1 Tokyo Institute of Technology, Japan 2 The Institute of Statistical

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 5, 06 Reading: See class website Eric Xing @ CMU, 005-06

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

A Local Inverse Formula and a Factorization

A Local Inverse Formula and a Factorization A Local Inverse Formula and a Factorization arxiv:60.030v [math.na] 4 Oct 06 Gilbert Strang and Shev MacNamara With congratulations to Ian Sloan! Abstract When a matrix has a banded inverse there is a

More information

Uncertainty quantification and visualization for functional random variables

Uncertainty quantification and visualization for functional random variables Uncertainty quantification and visualization for functional random variables MascotNum Workshop 2014 S. Nanty 1,3 C. Helbert 2 A. Marrel 1 N. Pérot 1 C. Prieur 3 1 CEA, DEN/DER/SESI/LSMR, F-13108, Saint-Paul-lez-Durance,

More information

Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Fisher Information in Gaussian Graphical Models

Fisher Information in Gaussian Graphical Models Fisher Information in Gaussian Graphical Models Jason K. Johnson September 21, 2006 Abstract This note summarizes various derivations, formulas and computational algorithms relevant to the Fisher information

More information

Sparse Inverse Covariance Estimation for a Million Variables

Sparse Inverse Covariance Estimation for a Million Variables Sparse Inverse Covariance Estimation for a Million Variables Inderjit S. Dhillon Depts of Computer Science & Mathematics The University of Texas at Austin SAMSI LDHD Opening Workshop Raleigh, North Carolina

More information