Decomposable and Directed Graphical Gaussian Models


Graphical Models and Inference, Lecture 13, Michaelmas Term 2009. November 26, 2009

Definition, basic properties, Wishart density

The Wishart distribution is the sampling distribution of the matrix of sums of squares and products. More precisely: a random $d \times d$ matrix $W$ has a $d$-dimensional Wishart distribution with parameter $\Sigma$ and $n$ degrees of freedom if
\[ W \overset{\mathcal{D}}{=} \sum_{\nu=1}^{n} X^{\nu} (X^{\nu})^{\top}, \]
where $X^{1}, \ldots, X^{n}$ are independent with $X^{\nu} \sim \mathcal{N}_d(0, \Sigma)$. We then write $W \sim \mathcal{W}_d(n, \Sigma)$.

The Wishart is the multivariate analogue of the $\chi^2$: $\mathcal{W}_1(n, \sigma^2) = \sigma^2 \chi^2(n)$. If $W \sim \mathcal{W}_d(n, \Sigma)$, its mean is $E(W) = n\Sigma$.

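If $W_1$ and $W_2$ are independent with $W_i \sim \mathcal{W}_d(n_i, \Sigma)$, then $W_1 + W_2 \sim \mathcal{W}_d(n_1 + n_2, \Sigma)$.

If $A$ is an $r \times d$ matrix and $W \sim \mathcal{W}_d(n, \Sigma)$, then $A W A^{\top} \sim \mathcal{W}_r(n, A \Sigma A^{\top})$.

For $r = 1$ we get that when $W \sim \mathcal{W}_d(n, \Sigma)$ and $\lambda \in \mathbb{R}^d$,
\[ \lambda^{\top} W \lambda \sim \sigma_\lambda^2 \, \chi^2(n), \]
where $\sigma_\lambda^2 = \lambda^{\top} \Sigma \lambda$.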

If $W \sim \mathcal{W}_d(n, \Sigma)$, where $\Sigma$ is regular, then $W$ is regular with probability one if and only if $n \geq d$.

When $n \geq d$ the Wishart distribution has density
\[ f_d(w \mid n, \Sigma) = c(d, n)^{-1} (\det \Sigma)^{-n/2} (\det w)^{(n-d-1)/2} e^{-\operatorname{tr}(\Sigma^{-1} w)/2} \]
for $w$ positive definite, and $0$ otherwise. The Wishart constant $c(d, n)$ is
\[ c(d, n) = 2^{nd/2} (2\pi)^{d(d-1)/4} \prod_{i=1}^{d} \Gamma\{(n + 1 - i)/2\}. \]

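Definition and likelihood function, iterative proportional scaling

Consider $X = (X_v, v \in V) \sim \mathcal{N}_V(0, \Sigma)$ with $\Sigma$ regular and $K = \Sigma^{-1}$. The concentration matrix of the conditional distribution of $(X_\alpha, X_\beta)$ given $X_{V \setminus \{\alpha, \beta\}}$ is
\[ K_{\{\alpha,\beta\}} = \begin{pmatrix} k_{\alpha\alpha} & k_{\alpha\beta} \\ k_{\beta\alpha} & k_{\beta\beta} \end{pmatrix}. \]
Hence
\[ \alpha \perp\!\!\!\perp \beta \mid V \setminus \{\alpha, \beta\} \iff k_{\alpha\beta} = 0. \]
Thus the dependence graph $G(K)$ of a regular Gaussian distribution is given by
\[ \alpha \not\sim \beta \iff k_{\alpha\beta} = 0. \]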
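The statement about the conditional concentration matrix can be verified numerically. The sketch below assumes NumPy; the matrix `K` is an arbitrary positive definite example with $k_{02} = k_{13} = 0$ (vertices are numbered from 0 here), and `conditional_cov` is a hypothetical helper name.

```python
import numpy as np

# Illustrative concentration matrix (strictly diagonally dominant, hence
# positive definite); the zero entries are K[0, 2] and K[1, 3].
K = np.array([[ 2.0, -0.8,  0.0,  0.5],
              [-0.8,  1.5,  0.4,  0.0],
              [ 0.0,  0.4,  1.0, -0.3],
              [ 0.5,  0.0, -0.3,  1.2]])
Sigma = np.linalg.inv(K)

def conditional_cov(Sigma, a, b):
    """Covariance of X_a given X_b, computed as a Schur complement."""
    S_ab = Sigma[np.ix_(a, b)]
    S_bb = Sigma[np.ix_(b, b)]
    return Sigma[np.ix_(a, a)] - S_ab @ np.linalg.solve(S_bb, S_ab.T)

pair, rest = [0, 2], [1, 3]
cond_cov = conditional_cov(Sigma, pair, rest)
cond_conc = np.linalg.inv(cond_cov)

# The conditional concentration equals the {alpha, beta} block of K,
# and k_02 = 0 shows up as a zero conditional covariance.
print(np.round(cond_conc, 6))
print(np.round(cond_cov[0, 1], 12))
```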

$\mathcal{S}(G)$ denotes the symmetric matrices $A$ with $a_{\alpha\beta} = 0$ unless $\alpha \sim \beta$ (for $\alpha \neq \beta$), and $\mathcal{S}^+(G)$ their positive definite elements.

A Gaussian graphical model for $X$ specifies $X$ as multivariate normal with $K \in \mathcal{S}^+(G)$ and otherwise unknown.

The likelihood function based on a sample of size $n$ is
\[ L(K) \propto (\det K)^{n/2} e^{-\operatorname{tr}(KW)/2}, \]
where $W$ is the Wishart matrix of sums of squares and products, $W \sim \mathcal{W}_{|V|}(n, \Sigma)$ with $\Sigma^{-1} = K \in \mathcal{S}^+(G)$.

Define the matrices $A^u$, $u \in V \cup E$, as those with elements
\[ a^u_{ij} = \begin{cases} 1 & \text{if } u \in V \text{ and } i = j = u, \\ 1 & \text{if } u \in E \text{ and } u = \{i, j\}, \\ 0 & \text{otherwise.} \end{cases} \]
Then, as $K \in \mathcal{S}(G)$,
\[ K = \sum_{v \in V} k_v A^v + \sum_{e \in E} k_e A^e \tag{1} \]
and hence
\[ \operatorname{tr}(KW) = \sum_{v \in V} k_v \operatorname{tr}(A^v W) + \sum_{e \in E} k_e \operatorname{tr}(A^e W). \]
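As a small illustration of the expansion (1), the sketch below builds the matrices $A^u$ for a four-cycle and checks both identities numerically. It assumes NumPy; `basis_matrix`, the graph, `K` and the data behind `W` are illustrative choices, with vertices numbered from 0.

```python
import numpy as np

def basis_matrix(u, size):
    """A^u for a vertex u (an int) or an edge u = (i, j)."""
    A = np.zeros((size, size))
    if isinstance(u, int):
        A[u, u] = 1.0
    else:
        i, j = u
        A[i, j] = A[j, i] = 1.0
    return A

vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]       # a four-cycle
K = np.array([[ 2.0, -0.8,  0.0,  0.5],
              [-0.8,  1.5,  0.4,  0.0],
              [ 0.0,  0.4,  1.0, -0.3],
              [ 0.5,  0.0, -0.3,  1.2]])         # zero pattern matches the graph

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 4))
W = x.T @ x                                     # sums of squares and products

# Equation (1): K as a linear combination of the A^u
K_rebuilt = sum(K[v, v] * basis_matrix(v, 4) for v in vertices) \
    + sum(K[i, j] * basis_matrix((i, j), 4) for i, j in edges)

# tr(KW) split into vertex and edge contributions
trace_terms = sum(K[v, v] * np.trace(basis_matrix(v, 4) @ W) for v in vertices) \
    + sum(K[i, j] * np.trace(basis_matrix((i, j), 4) @ W) for i, j in edges)

print(np.allclose(K_rebuilt, K), np.isclose(trace_terms, np.trace(K @ W)))
```

Note that $\operatorname{tr}(A^e W) = 2 w_{ij}$ for an edge $e = \{i, j\}$, because $A^e$ has ones in both symmetric positions.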

Hence we can identify the family as a (regular and canonical) exponential family with
\[ -\operatorname{tr}(A^u W)/2, \quad u \in V \cup E, \]
as canonical sufficient statistics. This yields the likelihood equations
\[ \operatorname{tr}(A^u W) = n \operatorname{tr}(A^u \Sigma), \quad u \in V \cup E, \]
which can also be expressed as
\[ n\hat\sigma_{vv} = w_{vv}, \quad n\hat\sigma_{\alpha\beta} = w_{\alpha\beta}, \quad v \in V, \ \{\alpha, \beta\} \in E, \]
or, equivalently,
\[ n\hat\Sigma_{cc} = w_{cc} \quad \text{for all cliques } c \in \mathcal{C}(G). \]
We should remember the model restriction $\Sigma^{-1} \in \mathcal{S}^+(G)$.

For $K \in \mathcal{S}^+(G)$ and $c \in \mathcal{C}$, define the operation of adjusting the $c$-marginal as follows. Let $a = V \setminus c$ and
\[ T_c K = \begin{pmatrix} n(w_{cc})^{-1} + K_{ca}(K_{aa})^{-1}K_{ac} & K_{ca} \\ K_{ac} & K_{aa} \end{pmatrix}. \tag{2} \]
The $c$-marginal covariance $\Sigma_{cc}$ corresponding to the adjusted concentration matrix becomes
\[ \Sigma_{cc} = \{(T_c K)^{-1}\}_{cc} = \left\{ n(w_{cc})^{-1} + K_{ca}(K_{aa})^{-1}K_{ac} - K_{ca}(K_{aa})^{-1}K_{ac} \right\}^{-1} = w_{cc}/n, \]
hence $T_c K$ does indeed adjust the marginals. From (2) it is seen that the pattern of zeros in $K$ is preserved under the operation $T_c$, and it can also be seen to stay positive definite.

Next we choose any ordering $(c_1, \ldots, c_k)$ of the cliques in $G$. Choose further $K_0 = I$ and define for $r = 0, 1, \ldots$
\[ K_{r+1} = (T_{c_1} \cdots T_{c_k}) K_r. \]
Then we have:

Consider a sample from a covariance selection model with graph $G$. Then
\[ \hat K = \lim_{r \to \infty} K_r, \]
provided the maximum likelihood estimate $\hat K$ of $K$ exists.
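The iteration can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the lecture's own code: the function names `adjust_marginal` and `ips`, the stopping tolerance, the four-cycle graph and the synthetic data are all choices made for the example, and vertices are numbered from 0.

```python
import numpy as np

def adjust_marginal(K, W, n, c):
    """One application of T_c: replace the cc-block of K as in equation (2)."""
    c = list(c)
    a = [v for v in range(K.shape[0]) if v not in c]
    K_new = K.copy()
    block = n * np.linalg.inv(W[np.ix_(c, c)])
    if a:  # add K_ca K_aa^{-1} K_ac when the complement is non-empty
        K_ca = K[np.ix_(c, a)]
        block = block + K_ca @ np.linalg.solve(K[np.ix_(a, a)], K_ca.T)
    K_new[np.ix_(c, c)] = block
    return K_new

def ips(W, n, cliques, tol=1e-12, max_sweeps=10000):
    """Iterate K_{r+1} = (T_c1 ... T_ck) K_r from K_0 = I until K stabilises."""
    K = np.eye(W.shape[0])
    for _ in range(max_sweeps):
        K_prev = K
        for c in reversed(cliques):      # the composition applies T_ck first
            K = adjust_marginal(K, W, n, c)
        if np.allclose(K, K_prev, rtol=0.0, atol=tol):
            break
    return K

# Synthetic data on a four-cycle, whose cliques are its edges
rng = np.random.default_rng(0)
mix = np.eye(4) + 0.5 * np.eye(4, k=1)
x = rng.standard_normal((50, 4)) @ mix
n, W = x.shape[0], x.T @ x
cliques = [(0, 1), (1, 2), (2, 3), (0, 3)]

K_hat = ips(W, n, cliques)
Sigma_hat = np.linalg.inv(K_hat)
print(np.round(K_hat, 4))
```

At convergence the fitted clique marginals `Sigma_hat[c, c]` should agree with `W[c, c] / n`, and the entries of `K_hat` for the non-edges (0, 2) and (1, 3) stay exactly zero because no update ever touches them.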

Basic factorizations, maximum likelihood estimates, an example

If the graph $G$ is chordal, we say that the graphical model is decomposable. In this case, the IPS algorithm converges in a finite number of steps, as in the discrete case.

We also have the familiar factorization of densities
\[ f(x \mid \Sigma) = \frac{\prod_{C \in \mathcal{C}} f(x_C \mid \Sigma_C)}{\prod_{S \in \mathcal{S}} f(x_S \mid \Sigma_S)^{\nu(S)}} \tag{3} \]
where $\nu(S)$ is the number of times $S$ appears as an intersection between neighbouring cliques of a junction tree for $\mathcal{C}$.

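Relations for trace and determinant

Using the factorization (3) we can, for example, match the expressions for the trace and determinant of $\Sigma$:
\[ \operatorname{tr}(KW) = \sum_{C \in \mathcal{C}} \operatorname{tr}(K_C W_C) - \sum_{S \in \mathcal{S}} \nu(S) \operatorname{tr}(K_S W_S) \]
and further
\[ \det \Sigma = \{\det(K)\}^{-1} = \frac{\prod_{C \in \mathcal{C}} \det\{\Sigma_C\}}{\prod_{S \in \mathcal{S}} \{\det(\Sigma_S)\}^{\nu(S)}}. \]
Here $K_A$ is to be read as $(\Sigma_A)^{-1}$, the concentration matrix of the $A$-marginal, since the factors in (3) are marginal densities. These are some of many relations that can be derived using the decomposition property of chordal graphs.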
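Both relations are easy to check numerically. The sketch below assumes NumPy and reads $K_A$ as the inverse of the marginal covariance $\Sigma_A$; the chordal graph (cliques {0,1,2} and {2,3,4} with separator {2}, numbered from 0), the matrix `K` and the data behind `W` are illustrative assumptions.

```python
import numpy as np

cliques = [[0, 1, 2], [2, 3, 4]]
separators = [([2], 1)]                 # pairs (S, nu(S))

# A positive definite K with zeros outside the two cliques
K = np.array([[2.0, 0.5, 0.3, 0.0, 0.0],
              [0.5, 2.0, 0.4, 0.0, 0.0],
              [0.3, 0.4, 2.0, 0.6, 0.2],
              [0.0, 0.0, 0.6, 2.0, 0.5],
              [0.0, 0.0, 0.2, 0.5, 2.0]])
Sigma = np.linalg.inv(K)

rng = np.random.default_rng(0)
x = rng.standard_normal((20, 5))
W = x.T @ x                             # any symmetric matrix works for the trace identity

def sub(M, idx):
    """Square submatrix of M on the index set idx."""
    return M[np.ix_(idx, idx)]

# Determinant relation
det_rhs = np.prod([np.linalg.det(sub(Sigma, C)) for C in cliques]) \
    / np.prod([np.linalg.det(sub(Sigma, S)) ** nu for S, nu in separators])

# Trace relation, with K_A = (Sigma_A)^{-1}
tr_rhs = sum(np.trace(np.linalg.inv(sub(Sigma, C)) @ sub(W, C)) for C in cliques) \
    - sum(nu * np.trace(np.linalg.inv(sub(Sigma, S)) @ sub(W, S)) for S, nu in separators)

print(np.linalg.det(Sigma), det_rhs)
print(np.trace(K @ W), tr_rhs)
```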

The same factorization clearly holds for the maximum likelihood estimates:
\[ f(x \mid \hat\Sigma) = \frac{\prod_{C \in \mathcal{C}} f(x_C \mid \hat\Sigma_C)}{\prod_{S \in \mathcal{S}} f(x_S \mid \hat\Sigma_S)^{\nu(S)}} \tag{4} \]
Moreover, it follows from the general likelihood equations that
\[ \hat\Sigma_A = W_A / n \quad \text{whenever } A \text{ is complete.} \]
Exploiting this, we can obtain an explicit formula for the maximum likelihood estimate in the case of a chordal graph.

For a $|d| \times |e|$ matrix $A = \{a_{\gamma\mu}\}_{\gamma \in d, \mu \in e}$ we let $[A]^V$ denote the matrix obtained from $A$ by filling up with zero entries to obtain full dimension $|V| \times |V|$, i.e.
\[ \left([A]^V\right)_{\gamma\mu} = \begin{cases} a_{\gamma\mu} & \text{if } \gamma \in d, \ \mu \in e, \\ 0 & \text{otherwise.} \end{cases} \]
The maximum likelihood estimate exists if and only if $n \geq |C|$ for all $C \in \mathcal{C}$. Then the following simple formula holds for the maximum likelihood estimate of $K$:
\[ \hat K = n \left\{ \sum_{C \in \mathcal{C}} \left[(w_C)^{-1}\right]^V - \sum_{S \in \mathcal{S}} \nu(S) \left[(w_S)^{-1}\right]^V \right\}. \]

Mathematics marks

[Figure: graph on the vertices 1: Mechanics, 2: Vectors, 3: Algebra, 4: Analysis, 5: Statistics]

This graph is chordal with cliques $\{1, 2, 3\}$, $\{3, 4, 5\}$ and separator $S = \{3\}$ having $\nu(\{3\}) = 1$.

Since one degree of freedom is lost by subtracting the average, we get in this example
\[ \hat K = 87 \begin{pmatrix}
w^{11}_{[123]} & w^{12}_{[123]} & w^{13}_{[123]} & 0 & 0 \\
w^{21}_{[123]} & w^{22}_{[123]} & w^{23}_{[123]} & 0 & 0 \\
w^{31}_{[123]} & w^{32}_{[123]} & w^{33}_{[123]} + w^{33}_{[345]} - 1/w_{33} & w^{34}_{[345]} & w^{35}_{[345]} \\
0 & 0 & w^{43}_{[345]} & w^{44}_{[345]} & w^{45}_{[345]} \\
0 & 0 & w^{53}_{[345]} & w^{54}_{[345]} & w^{55}_{[345]}
\end{pmatrix} \]
where $w^{ij}_{[123]}$ is the $ij$th element of the inverse of
\[ W_{[123]} = \begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{pmatrix} \]
and so on.
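The explicit formula for $\hat K$ translates directly into code. The sketch below assumes NumPy and uses the clique structure of the example (numbered from 0, so the cliques are {0,1,2} and {2,3,4} with separator {2}). The marks data themselves are not reproduced here, so synthetic data with 88 rows stand in for them; that substitution, and the helper names `pad` and `decomposable_mle`, are assumptions of this illustration.

```python
import numpy as np

def pad(M, idx, size):
    """[M]^V: embed M into a size x size zero matrix at rows/columns idx."""
    out = np.zeros((size, size))
    out[np.ix_(idx, idx)] = M
    return out

def decomposable_mle(W, n, cliques, separators):
    """K_hat = n { sum_C [(w_C)^-1]^V - sum_S nu(S) [(w_S)^-1]^V }."""
    size = W.shape[0]
    K = np.zeros((size, size))
    for C in cliques:
        K += pad(np.linalg.inv(W[np.ix_(C, C)]), C, size)
    for S, nu in separators:
        K -= nu * pad(np.linalg.inv(W[np.ix_(S, S)]), S, size)
    return n * K

# Synthetic stand-in data: 88 observations on 5 variables
rng = np.random.default_rng(2)
raw = rng.standard_normal((88, 5)) @ (np.eye(5) + 0.4 * np.ones((5, 5)))
centered = raw - raw.mean(axis=0)
W = centered.T @ centered
n = raw.shape[0] - 1                    # one degree of freedom lost to the mean

cliques = [[0, 1, 2], [2, 3, 4]]
separators = [([2], 1)]
K_hat = decomposable_mle(W, n, cliques, separators)
Sigma_hat = np.linalg.inv(K_hat)
print(np.round(K_hat, 3))
```

A quick sanity check is that `Sigma_hat` restricted to each clique reproduces `W / n` on that clique, as the likelihood equations require, while the entries of `K_hat` linking {0, 1} with {3, 4} are zero.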

Definition, a simple example, more general systems, maximum likelihood estimation

Consider a directed acyclic graph $D$ and associate with every vertex $v$ a random variable $X_v$. Consider now the equation system
\[ X_v \leftarrow \alpha_v^{\top} X_{\mathrm{pa}(v)} + \beta_v + U_v, \quad v \in V \tag{5} \]
where $U_v$, $v \in V$, are independent random disturbances with $U_v \sim \mathcal{N}(0, \sigma_v^2)$.

Such an equation system is known as a recursive structural equation system. Structural equation systems are used heavily in the social sciences and in economics. The term structural refers to the fact that the equations are assumed to be stable under intervention, so that fixing a value of $x_v$ would change the system only by removing the line in the equation system (5) defining $x_v$.

A recursive structural equation system defines a multivariate Gaussian distribution which satisfies the directed Markov property of $D$, since the joint density becomes
\[ f(x \mid \alpha, \sigma) = \prod_{v} (2\pi)^{-1/2} \sigma_v^{-1} \exp\left\{ -\frac{(x_v - \alpha_v^{\top} x_{\mathrm{pa}(v)} - \beta_v)^2}{2\sigma_v^2} \right\} = (2\pi)^{-|V|/2} \left( \prod_{v} \sigma_v^{-1} \right) \exp\left\{ -\sum_{v} \frac{(x_v - \alpha_v^{\top} x_{\mathrm{pa}(v)} - \beta_v)^2}{2\sigma_v^2} \right\}, \]
from which the joint concentration matrix $K$ can easily be derived.

Consider the system
\[ \begin{aligned} X_1 &\leftarrow U_1 \\ X_2 &\leftarrow U_2 \\ X_3 &\leftarrow \alpha_{31} X_1 + U_3 \\ X_4 &\leftarrow \alpha_{42} X_2 + \alpha_{43} X_3 + U_4. \end{aligned} \]
The quadratic expression in the exponent becomes
\[ \frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2} + \frac{(x_3 - \alpha_{31} x_1)^2}{\sigma_3^2} + \frac{(x_4 - \alpha_{42} x_2 - \alpha_{43} x_3)^2}{\sigma_4^2}. \]

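Expanding the squares and identifying terms yields the concentration matrix
\[ K = \begin{pmatrix}
\dfrac{1}{\sigma_1^2} + \dfrac{\alpha_{31}^2}{\sigma_3^2} & 0 & -\dfrac{\alpha_{31}}{\sigma_3^2} & 0 \\
0 & \dfrac{1}{\sigma_2^2} + \dfrac{\alpha_{42}^2}{\sigma_4^2} & \dfrac{\alpha_{42}\alpha_{43}}{\sigma_4^2} & -\dfrac{\alpha_{42}}{\sigma_4^2} \\
-\dfrac{\alpha_{31}}{\sigma_3^2} & \dfrac{\alpha_{42}\alpha_{43}}{\sigma_4^2} & \dfrac{1}{\sigma_3^2} + \dfrac{\alpha_{43}^2}{\sigma_4^2} & -\dfrac{\alpha_{43}}{\sigma_4^2} \\
0 & -\dfrac{\alpha_{42}}{\sigma_4^2} & -\dfrac{\alpha_{43}}{\sigma_4^2} & \dfrac{1}{\sigma_4^2}
\end{pmatrix}. \]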

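The covariance matrix can in principle be found by inverting the above. However, it is easier to express $X$ in terms of the $U$s as
\[ \begin{aligned} X_1 &= U_1 \\ X_2 &= U_2 \\ X_3 &= \alpha_{31} U_1 + U_3 \\ X_4 &= \alpha_{43}\alpha_{31} U_1 + \alpha_{42} U_2 + \alpha_{43} U_3 + U_4 \end{aligned} \]
and then calculate the covariances directly to obtain
\[ \Sigma = \begin{pmatrix}
\sigma_1^2 & 0 & \alpha_{31}\sigma_1^2 & \alpha_{43}\alpha_{31}\sigma_1^2 \\
0 & \sigma_2^2 & 0 & \alpha_{42}\sigma_2^2 \\
\alpha_{31}\sigma_1^2 & 0 & \sigma_3^2 + \alpha_{31}^2\sigma_1^2 & \alpha_{43}\alpha_{31}^2\sigma_1^2 + \alpha_{43}\sigma_3^2 \\
\alpha_{43}\alpha_{31}\sigma_1^2 & \alpha_{42}\sigma_2^2 & \alpha_{43}\alpha_{31}^2\sigma_1^2 + \alpha_{43}\sigma_3^2 & \omega_4^2
\end{pmatrix}, \]
where $\omega_4^2 = \alpha_{43}^2\alpha_{31}^2\sigma_1^2 + \alpha_{42}^2\sigma_2^2 + \alpha_{43}^2\sigma_3^2 + \sigma_4^2$.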
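Both matrices can be cross-checked numerically by writing the system as $X = BX + U$, so that $K = (I - B)^{\top} D^{-1} (I - B)$ and $\Sigma = (I - B)^{-1} D (I - B)^{-\top}$ with $D = \operatorname{diag}(\sigma_v^2)$. The sketch below assumes NumPy; the matrix notation $B$, $D$ and the numerical parameter values are illustrative choices rather than part of the lecture.

```python
import numpy as np

# Illustrative parameter values (assumptions)
a31, a42, a43 = 0.7, -0.4, 1.2
s2 = np.array([1.0, 2.0, 0.5, 1.5])     # sigma_1^2, ..., sigma_4^2

# Coefficient matrix: row v holds the coefficients of the parents of X_v
B = np.zeros((4, 4))
B[2, 0] = a31
B[3, 1] = a42
B[3, 2] = a43
I = np.eye(4)

# Quadratic form sum_v (x_v - B_v x)^2 / sigma_v^2 = x^T K x
K = (I - B).T @ np.diag(1.0 / s2) @ (I - B)

# X = (I - B)^{-1} U, so Sigma = T D T^T with T = (I - B)^{-1}
T = np.linalg.inv(I - B)
Sigma = T @ np.diag(s2) @ T.T

omega4_sq = a43**2 * a31**2 * s2[0] + a42**2 * s2[1] + a43**2 * s2[2] + s2[3]
print(np.round(K, 4))
print(np.round(Sigma, 4))
print(np.allclose(K @ Sigma, I), np.isclose(Sigma[3, 3], omega4_sq))
```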

Systems of structural equations of the type considered are called recursive, in contrast to feedback systems of equations, where directed cycles in the corresponding graph are allowed.

It is also customary to allow correlations between the disturbance terms. This leads to violation of the directed Markov property, so other types of graphs and Markov properties must be considered to deal correctly with such models.

This type of model and analysis goes back to the geneticist Sewall Wright, who coined the term path analysis for the calculus of effects based on this kind of model. This is one of the early precursors of modern graphical modelling.

A directed graphical Gaussian model generated by a linear recursive structural equation system is equivalent to a corresponding undirected graphical model if and only if $D$ is perfect, in which case its skeleton $\sigma(D)$ is chordal.

Still, even for non-perfect DAGs the MLE is easy to find for linear recursive structural equation systems: for each $v$, linear regression of the data $X_v^{\nu}$, $\nu = 1, \ldots, N$, onto the parents $X_{\mathrm{pa}(v)}$ is performed, and the corresponding regression coefficients $\hat\alpha_v$ and residual variance estimates $\hat\sigma_v^2$ are calculated in the usual way.
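The node-by-node regressions can be sketched as follows. This is a minimal illustration assuming NumPy: the DAG is the four-variable system used earlier (numbered from 0), the true parameters, seed and sample size are arbitrary, `fit_node` is a hypothetical helper, and the residual variance is taken as the residual sum of squares divided by $N$, which is one reading of "the usual way".

```python
import numpy as np

parents = {0: [], 1: [], 2: [0], 3: [1, 2]}
true_alpha = {2: np.array([0.7]), 3: np.array([-0.4, 1.2])}
true_sd = [1.0, 1.4, 0.7, 1.2]

# Simulate from the recursive system; vertices are in a topological order
rng = np.random.default_rng(1)
N = 20000
X = np.zeros((N, 4))
for v in range(4):
    mean = X[:, parents[v]] @ true_alpha[v] if parents[v] else 0.0
    X[:, v] = mean + true_sd[v] * rng.standard_normal(N)

def fit_node(X, v, pa):
    """Least squares of X_v on its parents plus an intercept.

    Returns (alpha_hat, beta_hat, sigma2_hat), with sigma2_hat = RSS / N.
    """
    N = X.shape[0]
    design = np.column_stack([X[:, pa], np.ones(N)])
    coef, *_ = np.linalg.lstsq(design, X[:, v], rcond=None)
    resid = X[:, v] - design @ coef
    return coef[:-1], coef[-1], resid @ resid / N

estimates = {v: fit_node(X, v, pa) for v, pa in parents.items()}
for v, (alpha_hat, beta_hat, sigma2_hat) in estimates.items():
    print(v, np.round(alpha_hat, 3), round(float(beta_hat), 3), round(float(sigma2_hat), 3))
```

Each regression uses only a vertex and its parents, so no chordality or perfectness of the DAG is needed for this step.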

Decomposable Graphical Gaussian Models

CIMPA Summerschool, Hammamet 2011, Tunisia. September 12, 2011

Basic algorithm

This simple algorithm has complexity $O(|V| + |E|)$:
1. Choose $v_0 \in V$ arbitrary and let $v_0 = 1$;
2. When vertices $\{1, 2, \ldots, j\}$