Likelihood Analysis of Gaussian Graphical Models


Faculty of Science Likelihood Analysis of Gaussian Graphical Models Steffen Lauritzen Department of Mathematical Sciences Minikurs TUM 2016 Lecture 2 Slide 1/43

Overview of lectures
Lecture 1: Markov Properties and the Multivariate Gaussian Distribution
Lecture 2: Likelihood Analysis of Gaussian Graphical Models
Lecture 3: Gaussian Graphical Models with Additional Restrictions; structure identification.
For reference, if nothing else is mentioned, see Lauritzen (1996), Chapters 3 and 4.
Slide 2/43

Fundamental properties
For random variables X, Y, Z, and W it holds that
(C1) if X ⊥⊥ Y | Z then Y ⊥⊥ X | Z;
(C2) if X ⊥⊥ Y | Z and U = g(Y), then X ⊥⊥ U | Z;
(C3) if X ⊥⊥ Y | Z and U = g(Y), then X ⊥⊥ Y | (Z, U);
(C4) if X ⊥⊥ Y | Z and X ⊥⊥ W | (Y, Z), then X ⊥⊥ (Y, W) | Z.
If there is a density w.r.t. a product measure with f(x, y, z, w) > 0, then also
(C5) if X ⊥⊥ Y | (Z, W) and X ⊥⊥ Z | (Y, W), then X ⊥⊥ (Y, Z) | W.
Slide 3/43

Semi-graphoid
An independence model (Studený, 2005) ⊥⊥ is a ternary relation over subsets of a finite set V. It is a graphoid if for all disjoint subsets A, B, C, D:
(S1) if A ⊥⊥ B | C then B ⊥⊥ A | C (symmetry);
(S2) if A ⊥⊥ (B ∪ D) | C then A ⊥⊥ B | C and A ⊥⊥ D | C (decomposition);
(S3) if A ⊥⊥ (B ∪ D) | C then A ⊥⊥ B | (C ∪ D) (weak union);
(S4) if A ⊥⊥ B | C and A ⊥⊥ D | (B ∪ C), then A ⊥⊥ (B ∪ D) | C (contraction);
(S5) if A ⊥⊥ B | (C ∪ D) and A ⊥⊥ C | (B ∪ D), then A ⊥⊥ (B ∪ C) | D (intersection).
It is a semigraphoid if only (S1)–(S4) hold. It is compositional if also
(S6) if A ⊥⊥ B | C and A ⊥⊥ D | C then A ⊥⊥ (B ∪ D) | C (composition).
Slide 4/43

Separation in undirected graphs
Let G = (V, E) be a finite and simple undirected graph (no self-loops, no multiple edges). For subsets A, B, S of V, let A ⊥_G B | S denote that S separates A from B in G, i.e. that all paths from A to B intersect S.
Fact: The relation ⊥_G on subsets of V is a compositional graphoid.
This fact is the reason for choosing the name graphoid for such independence models.
Slide 5/43

Probabilistic Independence Model
For a system V of labelled random variables X_v, v ∈ V, we use
A ⊥⊥ B | C ⟺ X_A ⊥⊥ X_B | X_C,
where X_A = (X_v, v ∈ A) denotes the variables with labels in A.
The properties (C1)–(C4) imply that ⊥⊥ satisfies the semi-graphoid axioms, and the graphoid axioms if the joint density of the variables is strictly positive.
A regular multivariate Gaussian distribution defines a compositional graphoid independence model, as we shall see later.
Slide 6/43

Markov properties for undirected graphs
Let G = (V, E) be a simple undirected graph. An independence model ⊥⊥ satisfies
(P) the pairwise Markov property if α ≁ β implies α ⊥⊥ β | V \ {α, β};
(L) the local Markov property if for all α ∈ V: α ⊥⊥ V \ cl(α) | bd(α);
(G) the global Markov property if A ⊥_G B | S implies A ⊥⊥ B | S.
Slide 7/43

Structural relations among Markov properties
For any semigraphoid it holds that (G) ⇒ (L) ⇒ (P).
If ⊥⊥ satisfies the graphoid axioms, it further holds that (P) ⇒ (G), so that in the graphoid case (G) ⇔ (L) ⇔ (P).
The latter holds in particular for ⊥⊥ when f(x) > 0.
Slide 8/43

The multivariate Gaussian
A d-dimensional random vector X = (X_1, ..., X_d) has a multivariate Gaussian distribution, or normal distribution, on R^d if there is a vector ξ ∈ R^d and a d × d matrix Σ such that
λ^⊤ X ∼ N(λ^⊤ ξ, λ^⊤ Σ λ) for all λ ∈ R^d.  (1)
We then write X ∼ N_d(ξ, Σ). Then X_i ∼ N(ξ_i, σ_ii) and Cov(X_i, X_j) = σ_ij, hence ξ is the mean vector and Σ the covariance matrix of the distribution.
A multivariate Gaussian distribution is determined by its mean vector and covariance matrix.
Slide 9/43

Density of the multivariate Gaussian
If Σ is positive definite, i.e. if λ^⊤ Σ λ > 0 for λ ≠ 0, the distribution has density on R^d
f(x | ξ, Σ) = (2π)^{-d/2} (det K)^{1/2} e^{-(x - ξ)^⊤ K (x - ξ)/2},  (2)
where K = Σ^{-1} is the concentration matrix of the distribution.
Since a positive semidefinite matrix is positive definite if and only if it is invertible, we then also say that Σ is regular.
Slide 10/43

Adding two independent Gaussians yields a Gaussian:
If X_1 ∼ N_d(ξ_1, Σ_1) and X_2 ∼ N_d(ξ_2, Σ_2) with X_1 ⊥⊥ X_2, then
X_1 + X_2 ∼ N_d(ξ_1 + ξ_2, Σ_1 + Σ_2).
Affine transformations preserve multivariate normality:
If L is an r × d matrix, b ∈ R^r and X ∼ N_d(ξ, Σ), then
Y = LX + b ∼ N_r(Lξ + b, L Σ L^⊤).
Slide 11/43

Marginal and conditional distributions
Partition X into X_A and X_B, where X_A ∈ R^A and X_B ∈ R^B with A ∪ B = V. Partition the mean vector, concentration and covariance matrix accordingly as
ξ = (ξ_A, ξ_B),  K = ( K_AA  K_AB ; K_BA  K_BB ),  Σ = ( Σ_AA  Σ_AB ; Σ_BA  Σ_BB ).
Then, if X ∼ N(ξ, Σ), it holds that X_B ∼ N_B(ξ_B, Σ_BB).
Also X_A | X_B = x_B ∼ N_A(ξ_{A|B}, Σ_{A|B}).
Slide 12/43

Here
ξ_{A|B} = ξ_A + Σ_AB Σ_BB^{-1} (x_B − ξ_B)  and  Σ_{A|B} = Σ_AA − Σ_AB Σ_BB^{-1} Σ_BA.
Using the matrix identities
K_AA^{-1} = Σ_AA − Σ_AB Σ_BB^{-1} Σ_BA  (3)
and
K_AA^{-1} K_AB = −Σ_AB Σ_BB^{-1},  (4)
it follows that
ξ_{A|B} = ξ_A − K_AA^{-1} K_AB (x_B − ξ_B)  and  K_{A|B} = K_AA.
Note that the marginal covariance is simply expressed in terms of Σ, whereas the conditional concentration is simply expressed in terms of K.
Slide 13/43
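As a small numerical check of these identities (my own illustration, not part of the slides), the following numpy sketch computes the conditional mean and covariance from a partitioned Σ and verifies that the conditional concentration equals K_AA; the helper name cond_gaussian and the random test matrix are assumptions made for the example.

```python
import numpy as np

def cond_gaussian(xi, Sigma, A, B, x_B):
    """Mean and covariance of X_A | X_B = x_B when X ~ N(xi, Sigma)."""
    S_AB = Sigma[np.ix_(A, B)]
    S_BB = Sigma[np.ix_(B, B)]
    xi_AgB = xi[A] + S_AB @ np.linalg.solve(S_BB, x_B - xi[B])
    Sigma_AgB = Sigma[np.ix_(A, A)] - S_AB @ np.linalg.solve(S_BB, S_AB.T)
    return xi_AgB, Sigma_AgB

rng = np.random.default_rng(0)
L = rng.standard_normal((4, 4))
Sigma = L @ L.T + 4 * np.eye(4)     # an arbitrary regular covariance matrix
K = np.linalg.inv(Sigma)            # concentration matrix
xi = np.zeros(4)
A, B = [0, 1], [2, 3]

_, Sigma_AgB = cond_gaussian(xi, Sigma, A, B, x_B=np.ones(2))
# The conditional concentration equals K_AA, i.e. Sigma_{A|B}^{-1} = K_AA:
assert np.allclose(np.linalg.inv(Sigma_AgB), K[np.ix_(A, A)])
```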

Further, since ξ_{A|B} = ξ_A − K_AA^{-1} K_AB (x_B − ξ_B) and K_{A|B} = K_AA, X_A and X_B are independent if and only if K_AB = 0, giving K_AB = 0 if and only if Σ_AB = 0.
More generally, if we partition X into X_A, X_B, X_C, the conditional concentration of X_{A∪B} given X_C = x_C is
K_{A∪B | C} = ( K_AA  K_AB ; K_BA  K_BB ),
so X_A ⊥⊥ X_B | X_C ⟺ K_AB = 0.
It follows that a Gaussian independence model is a compositional graphoid.
Slide 14/43

Gaussian graphical model
S(G) denotes the symmetric V × V matrices A with a_{αβ} = 0 unless α = β or {α, β} ∈ E, and S⁺(G) their positive definite elements.
A Gaussian graphical model for X specifies X as multivariate normal with K ∈ S⁺(G) and otherwise unknown.
Note that the density then factorizes as
log f(x) = constant − (1/2) ∑_{α∈V} k_{αα} x_α² − ∑_{{α,β}∈E} k_{αβ} x_α x_β,
hence no interaction terms involve more than pairs.
Slide 15/43

Likelihood with restrictions
The likelihood function based on a sample of size n is
L(K) ∝ (det K)^{n/2} e^{−tr(Kw)/2},
where w is the (Wishart) matrix of sums of squares and products and Σ^{-1} = K ∈ S⁺(G).
Define the matrices T^u, u ∈ V ∪ E, as those with elements
T^u_{ij} = 1 if u ∈ V and i = j = u; 1 if u ∈ E and u = {i, j}; 0 otherwise.
Then T^u, u ∈ V ∪ E, forms a basis for the linear space S(G) of symmetric matrices over V which have zero entries in position ij whenever i and j are non-adjacent in G.
Slide 16/43
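The basis matrices T^u are straightforward to construct explicitly. The sketch below (an illustration under my own conventions, using numpy and a 4-cycle as example graph) builds them and checks that a linear combination of them has zeros exactly at the non-adjacent pairs, i.e. lies in S(G).

```python
import numpy as np

V = [0, 1, 2, 3]
E = [(0, 1), (1, 2), (2, 3), (0, 3)]    # edges of a 4-cycle

def T(u, d=len(V)):
    """Basis matrix T^u for a vertex u or an edge u = (i, j)."""
    M = np.zeros((d, d))
    if isinstance(u, tuple):
        i, j = u
        M[i, j] = M[j, i] = 1.0
    else:
        M[u, u] = 1.0
    return M

# Any K in S(G) is a linear combination of the T^u (cf. (5) on the next slide):
k = {0: 2.0, 1: 2.0, 2: 2.0, 3: 2.0,
     (0, 1): 0.5, (1, 2): 0.5, (2, 3): 0.5, (0, 3): 0.5}
K = sum(k[u] * T(u) for u in V + E)
assert K[0, 2] == 0.0 and K[1, 3] == 0.0   # zeros at the non-adjacent pairs
```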

Further, as K ∈ S(G), we have
K = ∑_{v∈V} k_v T^v + ∑_{e∈E} k_e T^e  (5)
and hence
tr(Kw) = ∑_{v∈V} k_v tr(T^v w) + ∑_{e∈E} k_e tr(T^e w),
leading to the log-likelihood function
l(K) = log L(K) = (n/2) log(det K) − tr(Kw)/2 = (n/2) log(det K) − ∑_{v∈V} k_v tr(T^v w)/2 − ∑_{e∈E} k_e tr(T^e w)/2.
Slide 17/43

Hence we can identify the family as a (regular and canonical) exponential family with −tr(T^u W)/2, u ∈ V ∪ E, as canonical sufficient statistics.
The likelihood equations can be obtained from this fact or by differentiation, combining the fact that
∂ log det(K) / ∂k_u = tr(T^u Σ)
with (5). This eventually yields the likelihood equations
tr(T^u w) = n tr(T^u Σ̂), u ∈ V ∪ E.
Slide 18/43

The likelihood equations tr(T^u w) = n tr(T^u Σ̂), u ∈ V ∪ E, can also be expressed as
n σ̂_vv = w_vv for v ∈ V,  and  n σ̂_{αβ} = w_{αβ} for {α, β} ∈ E.
Remember the model restriction K̂ = Σ̂^{-1} ∈ S⁺(G).
This fits variances and covariances along nodes and edges in G, so we can write the equations as
n Σ̂_cc = w_cc for all cliques c ∈ C(G).
The general theory of exponential families ensures that the solution is unique, provided it exists.
Slide 19/43

Iterative Proportional Scaling
For K ∈ S⁺(G) and c ∈ C, define the operation of adjusting the c-marginal as follows. Let a = V \ c and
M_c K = ( n(w_cc)^{-1} + K_ca (K_aa)^{-1} K_ac   K_ca ; K_ac   K_aa ).  (6)
This operation is clearly well defined if w_cc is positive definite.
Recall the identity K_AA^{-1} = Σ_AA − Σ_AB Σ_BB^{-1} Σ_BA. Switching the roles of K and Σ yields
Σ_AA = (K^{-1})_AA = (K_AA − K_AB K_BB^{-1} K_BA)^{-1}.
Slide 20/43

Hence
Σ_cc = (K^{-1})_cc = (K_cc − K_ca (K_aa)^{-1} K_ac)^{-1}.
Thus the c-marginal covariance Σ̃_cc corresponding to the adjusted concentration matrix becomes
Σ̃_cc = {(M_c K)^{-1}}_cc = {n(w_cc)^{-1} + K_ca (K_aa)^{-1} K_ac − K_ca (K_aa)^{-1} K_ac}^{-1} = w_cc / n,
hence M_c K does indeed adjust the marginals.
From (6) it is seen that the pattern of zeros in K is preserved under the operation M_c, and it stays positive definite. In fact, M_c scales proportionally in the sense that
f(x | (M_c K)^{-1}) = f(x | K^{-1}) f(x_c | w_cc/n) / f(x_c | Σ_cc).
Slide 21/43

Next we choose any ordering (c_1, ..., c_k) of the cliques in G. Choose further K_0 = I and define for r = 0, 1, ...
K_{r+1} = (M_{c_1} ··· M_{c_k}) K_r.
Then we have: Consider a sample from a covariance selection model with graph G. Then K̂ = lim_{r→∞} K_r, provided the maximum likelihood estimate K̂ of K exists.
This algorithm is also known as Iterative Proportional Scaling or Iterative Marginal Fitting.
Slide 22/43
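A compact implementation of the IPS iteration is sketched below, assuming numpy; the clique ordering, the stopping rule (max_iter, tol) and the function name ips are illustrative choices, and the code presumes the MLE exists so that the iteration converges.

```python
import numpy as np

def ips(w, n, cliques, max_iter=1000, tol=1e-10):
    """Iterative Proportional Scaling: MLE of K in S+(G) from the SSP matrix w,
    the sample size n and the cliques of G (given as lists of vertex indices)."""
    d = w.shape[0]
    K = np.eye(d)                          # K_0 = I
    for _ in range(max_iter):
        K_old = K.copy()
        for c in cliques:                  # one sweep M_{c_1} ... M_{c_k}
            a = [v for v in range(d) if v not in c]
            # c-block of M_c K: n (w_cc)^{-1} + K_ca (K_aa)^{-1} K_ac, cf. (6)
            Kcc_new = n * np.linalg.inv(w[np.ix_(c, c)])
            if a:
                Kca = K[np.ix_(c, a)]
                Kaa = K[np.ix_(a, a)]
                Kcc_new += Kca @ np.linalg.solve(Kaa, Kca.T)
            K[np.ix_(c, c)] = Kcc_new
        if np.max(np.abs(K - K_old)) < tol:
            break
    return K
```

For a decomposable graph the sweep terminates after finitely many steps (cf. Slide 32); in general the loop runs until the change between sweeps falls below tol.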

Factorization
Assume a density f w.r.t. a product measure on X.
For a ⊆ V, ψ_a(x) denotes a function which depends on x_a only, i.e. x_a = y_a ⇒ ψ_a(x) = ψ_a(y). We can then write ψ_a(x) = ψ_a(x_a) without ambiguity.
Definition: The distribution of X factorizes w.r.t. G, or satisfies (F), if
f(x) = ∏_{a∈A} ψ_a(x),
where A is a collection of complete subsets of G. Complete subsets of a graph are sets with all elements pairwise neighbours.
Slide 23/43

[Figure: an undirected graph on vertices 1–7 with edges 1–2, 1–3, 2–4, 2–5, 3–5, 3–6, 5–6, 4–7, 5–7 and 6–7.]
The cliques of this graph are the maximal complete subsets {1, 2}, {1, 3}, {2, 4}, {2, 5}, {3, 5, 6}, {4, 7}, and {5, 6, 7}. A complete set is any subset of these sets.
The graph above corresponds to a factorization as
f(x) = ψ_12(x_1, x_2) ψ_13(x_1, x_3) ψ_24(x_2, x_4) ψ_25(x_2, x_5) ψ_356(x_3, x_5, x_6) ψ_47(x_4, x_7) ψ_567(x_5, x_6, x_7).
Slide 24/43

Theorem: Let (F) denote the property that f factorizes w.r.t. G, and let (G), (L) and (P) denote the Markov properties for ⊥⊥. It then holds that (F) ⇒ (G). If f is continuous and f(x) > 0 for all x, then (P) ⇒ (F).
The former of these is a simple direct consequence of the factorization, whereas the second implication is more subtle.
Thus in the case of a positive density (but typically only then), all the properties coincide:
(F) ⇔ (G) ⇔ (L) ⇔ (P).
Slide 25/43

Graph decomposition
Consider an undirected graph G = (V, E). A partitioning of V into a triple (A, B, S) of subsets of V forms a decomposition of G if A ⊥_G B | S and S is complete.
The decomposition is proper if A ≠ ∅ and B ≠ ∅. The components of G are the induced subgraphs G_{A∪S} and G_{B∪S}. A graph is prime if no proper decomposition exists.
Slide 26/43

Example
[Figure: two graphs on seven vertices. The graph to the left is prime; the other graph admits a decomposition with A = {1, 3}, B = {4, 6, 7} and S = {2, 5}, shown split into its components G_{A∪S} and G_{B∪S}.]
Slide 27/43

Decomposability
Any graph can be recursively decomposed into its maximal prime subgraphs:
[Figure: a graph recursively decomposed into its maximal prime subgraphs.]
A graph is decomposable (or rather fully decomposable) if it is complete or admits a proper decomposition into decomposable subgraphs. The definition is recursive. Alternatively, this means that all maximal prime subgraphs are cliques.
Slide 28/43

Factorization of Markov distributions
Suppose P satisfies (F) w.r.t. G and (A, B, S) is a decomposition. Then
(i) P_{A∪S} and P_{B∪S} satisfy (F) w.r.t. G_{A∪S} and G_{B∪S} respectively;
(ii) f(x) f_S(x_S) = f_{A∪S}(x_{A∪S}) f_{B∪S}(x_{B∪S}).
The converse also holds in the sense that if (i) and (ii) hold, and (A, B, S) is a decomposition of G, then P factorizes w.r.t. G.
Slide 29/43

Recursive decomposition of a decomposable graph into cliques yields the formula
f(x) ∏_{S∈S} f_S(x_S)^{ν(S)} = ∏_{C∈C} f_C(x_C).
Here S is the set of minimal complete separators occurring in the decomposition process and ν(S) is the number of times such a separator appears in this process.
Slide 30/43

Characterizing decomposable graphs
A graph is chordal if all cycles of length ≥ 4 have chords.
The following are equivalent for any undirected graph G:
(i) G is chordal;
(ii) G is decomposable;
(iii) all maximal prime subgraphs of G are cliques.
There are also many other useful characterizations of chordal graphs and algorithms that identify them. Trees are chordal graphs and thus decomposable.
Slide 31/43
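For experimentation it is convenient to test chordality with standard graph software. The short sketch below assumes the networkx package (an assumption; the lecture does not prescribe any software); its is_chordal and chordal_graph_cliques functions implement recognition algorithms of the kind referred to above.

```python
import networkx as nx

G = nx.cycle_graph(4)                    # the 4-cycle: not chordal
print(nx.is_chordal(G))                  # False
G.add_edge(0, 2)                         # adding a chord makes it chordal
print(nx.is_chordal(G))                  # True
print(list(nx.chordal_graph_cliques(G))) # cliques {0, 1, 2} and {0, 2, 3}
```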

If the graph G is chordal, we say that the graphical model is decomposable. In this case, the IPS algorithm converges in a finite number of steps. We also have the factorization of densities
f(x | Σ) = ∏_{C∈C} f(x_C | Σ_C) / ∏_{S∈S} f(x_S | Σ_S)^{ν(S)},  (7)
where ν(S) is the number of times S appears as an intersection between neighbouring cliques of a junction tree for C.
Slide 32/43

Relations for trace and determinant
Using the factorization (7) we can for example match the expressions for the trace and determinant:
tr(KW) = ∑_{C∈C} tr(K_C W_C) − ∑_{S∈S} ν(S) tr(K_S W_S)
and further
det Σ = {det(K)}^{-1} = ∏_{C∈C} det(Σ_C) / ∏_{S∈S} {det(Σ_S)}^{ν(S)}.
These are some of many relations that can be derived using the decomposition property of chordal graphs.
Slide 33/43
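As a quick numerical sanity check of the determinant identity (my own example, not from the slides), the numpy sketch below builds an arbitrary concentration matrix with the clique structure {1, 2, 3}, {3, 4, 5} and separator {3} (written 0-indexed) and compares both sides.

```python
import numpy as np

K = np.eye(5)
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]:
    K[i, j] = K[j, i] = 0.3              # nonzero only within the two cliques
Sigma = np.linalg.inv(K)

C1, C2, S = [0, 1, 2], [2, 3, 4], [2]
lhs = np.linalg.det(Sigma)
rhs = (np.linalg.det(Sigma[np.ix_(C1, C1)]) * np.linalg.det(Sigma[np.ix_(C2, C2)])
       / np.linalg.det(Sigma[np.ix_(S, S)]))
assert np.isclose(lhs, rhs)   # det(Sigma) = det(Sigma_C1) det(Sigma_C2) / det(Sigma_S)
```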

The same factorization clearly holds for the maximum likelihood estimates:
f(x | Σ̂) = ∏_{C∈C} f(x_C | Σ̂_C) / ∏_{S∈S} f(x_S | Σ̂_S)^{ν(S)}.  (8)
Moreover, it follows from the general likelihood equations that Σ̂_A = W_A / n whenever A is complete. Exploiting this, we can obtain an explicit formula for the maximum likelihood estimate in the case of a chordal graph.
Slide 34/43

For subsets d, e ⊆ V and a d × e matrix A = {a_{γμ}}_{γ∈d, μ∈e}, we let [A]^V denote the matrix obtained from A by filling up with zero entries to obtain full dimension V × V, i.e.
[A]^V_{γμ} = a_{γμ} if γ ∈ d and μ ∈ e, and 0 otherwise.
The maximum likelihood estimate exists if and only if n ≥ |C| for all C ∈ C. Then the following simple formula holds for the maximum likelihood estimate of K:
K̂ = n { ∑_{C∈C} [(w_C)^{-1}]^V − ∑_{S∈S} ν(S) [(w_S)^{-1}]^V }.
Slide 35/43
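The explicit formula translates directly into code. The sketch below (numpy; the helper names pad and mle_chordal are mine) assembles K̂ from the clique and separator blocks of w; the clique list, separator list and multiplicities ν(S) must be supplied, e.g. from a junction tree of the chordal graph.

```python
import numpy as np

def pad(M, idx, d):
    """[M]^V: embed the |idx| x |idx| matrix M into a d x d zero matrix."""
    out = np.zeros((d, d))
    out[np.ix_(idx, idx)] = M
    return out

def mle_chordal(w, n, cliques, separators, nu):
    """Explicit MLE of K for a chordal graph, assuming n is large enough."""
    d = w.shape[0]
    K_hat = sum(pad(np.linalg.inv(w[np.ix_(C, C)]), C, d) for C in cliques)
    K_hat -= sum(nu_S * pad(np.linalg.inv(w[np.ix_(S, S)]), S, d)
                 for S, nu_S in zip(separators, nu))
    return n * K_hat
```

For the Mathematics marks graph on the next slide (with 0-indexed vertices) this would be called as mle_chordal(w, n, cliques=[[0, 1, 2], [2, 3, 4]], separators=[[2]], nu=[1]).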

Mathematics marks
[Figure: graph on the variables 1: Mechanics, 2: Vectors, 3: Algebra, 4: Analysis, 5: Statistics, with Algebra separating {Mechanics, Vectors} from {Analysis, Statistics}.]
This graph is chordal with cliques {1, 2, 3} and {3, 4, 5} and separator S = {3}, having ν({3}) = 1.
Slide 36/43

Since one degree of freedom is lost by subtracting the average, we get in this example
K̂ = 87 ×
( w^11[123]  w^12[123]  w^13[123]                                0           0
  w^21[123]  w^22[123]  w^23[123]                                0           0
  w^31[123]  w^32[123]  w^33[123] + w^33[345] − 1/w_33     w^34[345]   w^35[345]
  0          0          w^43[345]                           w^44[345]   w^45[345]
  0          0          w^53[345]                           w^54[345]   w^55[345] ),
where w^ij[123] is the ij-th element of the inverse of
W[123] = ( w_11  w_12  w_13 ; w_21  w_22  w_23 ; w_31  w_32  w_33 )
and so on.
Slide 37/43

Existence of the MLE
The IPS algorithm converges to the maximum likelihood estimate K̂ of K provided that the likelihood function does attain its maximum. The question of existence is non-trivial.
A chordal cover of G is a chordal graph (no cycles without chords) G' of which G is a subgraph. Let n' = max_{C∈C'} |C|, where C' is the set of cliques in G', and let n⁺ denote the smallest possible value of n'. The quantity τ(G) = n⁺ − 1 is known as the treewidth of G (Halin, 1976; Robertson and Seymour, 1984).
The condition n > τ(G) is sufficient for the existence of the MLE.
Slide 38/43

[Figure: the Mathematics marks graph on Mechanics, Vectors, Algebra, Analysis, Statistics.]
This graph has treewidth τ(G) = 2 since it is itself chordal and the largest clique has size 3. Hence n = 3 observations are sufficient for the existence of the MLE.
Slide 39/43

[Figure: the four-cycle with vertices L1, L2, B1, B2.]
This graph also has treewidth τ(G) = 2 since a chordal cover can be obtained by adding a diagonal edge. Hence also here n = 3 observations are sufficient for the existence of the MLE.
Slide 40/43

Determining the treewidth τ(G) is a difficult combinatorial problem (Robertson and Seymour, 1986), but for any n it can be decided with complexity O(|V|) whether τ(G) < n (Bodlaender, 1997).
If we let n⁻ denote the maximal clique size of G, a necessary condition is that n ≥ n⁻. For n⁻ ≤ n ≤ τ(G) the situation is unclear.
Slide 41/43
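Since determining τ(G) exactly is hard, heuristics that give an upper bound are often useful in practice. The sketch below assumes the networkx package (an assumption, not mentioned in the lecture); its treewidth_min_degree heuristic returns an upper bound on the treewidth together with a tree decomposition, and the 4-cycle example matches Slide 40.

```python
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

G = nx.cycle_graph(4)                         # the 4-cycle from Slide 40
width, decomposition = treewidth_min_degree(G)
print(width)                                  # 2, so n = 3 observations suffice
```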

Buhl (1993) shows that for a p-cycle we have n⁻ = 2 and τ(G) = 2. If now n = 2, the probability that the MLE exists is strictly between 0 and 1. In fact,
P{MLE exists | K = I} = 1 − 2/(p − 1)!.
Similar results hold for the bipartite graphs K_{2,m} (Uhler, 2012) and other special cases, but the general case is unclear.
Recently there has been considerable progress (Gross and Sullivant, 2015); for example, it can be shown that n = 4 observations suffice for any planar graph, an interesting parallel to the four-colour theorem.
Slide 42/43

The r-core of a graph G is obtained by repeatedly removing all vertices with fewer than r neighbours. It then holds (Gross and Sullivant, 2015) that if the r-core of G is empty, then n = r observations are enough.
Slide 43/43
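The r-core is easy to compute by exactly the removal procedure just described. The sketch below is a plain-Python illustration on an adjacency-set representation (the function name r_core is mine); the 4-cycle example is consistent with Slide 40, where n = 3 observations suffice.

```python
def r_core(adj, r):
    """Return the vertex set of the r-core of the graph given by adj:
    a dict mapping each vertex to the set of its neighbours."""
    adj = {v: set(nb) for v, nb in adj.items()}      # work on a copy
    changed = True
    while changed:
        changed = False
        for v in [v for v, nb in adj.items() if len(nb) < r]:
            for u in adj[v]:                         # detach v from its neighbours
                adj[u].discard(v)
            del adj[v]
            changed = True
    return set(adj)

# The 4-cycle has an empty 3-core, so n = 3 observations are enough:
cycle4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(r_core(cycle4, 3))    # set()
```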

Bodlaender, H. L. (1997). Treewidth: Algorithmic techniques and results. In Prívara, I. and Ružička, P., editors, Mathematical Foundations of Computer Science 1997, volume 1295 of Lecture Notes in Computer Science, pages 19–36. Springer, Berlin Heidelberg.
Buhl, S. L. (1993). On the existence of maximum likelihood estimators for graphical Gaussian models. Scandinavian Journal of Statistics, 20:263–270.
Gross, E. and Sullivant, S. (2015). The maximum likelihood threshold of a graph. arXiv:1404.6989.
Halin, R. (1976). S-functions for graphs. Journal of Geometry, 8:171–186.
Lauritzen, S. L. (1996). Graphical Models. Clarendon Press, Oxford, United Kingdom.
Robertson, N. and Seymour, P. D. (1984). Graph minors III. Planar tree-width. Journal of Combinatorial Theory, Series B, 36:49–64.
Slide 43/43

Robertson, N. and Seymour, P. D. (1986). Graph minors II. Algorithmic aspects of tree-width. Journal of Algorithms, 7:309–322.
Studený, M. (2005). Probabilistic Conditional Independence Structures. Information Science and Statistics. Springer-Verlag, London.
Uhler, C. (2012). Geometry of maximum likelihood estimation in Gaussian graphical models. Annals of Statistics, 40:238–261.
Slide 43/43