Decomposable Graphical Gaussian Models

1 CIMPA Summer School, Hammamet 2011, Tunisia. September 12, 2011.

2 Basic algorithm (maximum cardinality search)

This simple algorithm has complexity $O(|V| + |E|)$:

1. Choose $v_0 \in V$ arbitrarily and let $v_0 = 1$;
2. When vertices $\{1, 2, \dots, j\}$ have been numbered, choose $v = j + 1$ among $V \setminus \{1, 2, \dots, j\}$ with the highest cardinality of numbered neighbours;
3. If $\mathrm{bd}(j + 1) \cap \{1, 2, \dots, j\}$ is not complete, $G$ is not chordal;
4. Repeat from 2;
5. If the algorithm continues until no vertices are left, the graph is chordal and the numbering is perfect.
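To make the steps concrete, here is a minimal Python sketch of maximum cardinality search (not from the slides; it assumes the graph is given as a dict mapping each vertex to its set of neighbours):

```python
def mcs_perfect_numbering(adj):
    """Return a perfect numbering (list of vertices in MCS order)
    if the graph is chordal, otherwise None."""
    unnumbered, numbered, order = set(adj), set(), []
    while unnumbered:
        # step 2: pick the vertex with most already-numbered neighbours
        v = max(unnumbered, key=lambda u: len(adj[u] & numbered))
        B = adj[v] & numbered
        # step 3: the numbered boundary must induce a complete subgraph
        if any(b2 not in adj[b1] for b1 in B for b2 in B if b1 != b2):
            return None  # G is not chordal
        order.append(v)
        numbered.add(v)
        unnumbered.remove(v)
    return order

# a 4-cycle with a chord is chordal; the bare 4-cycle is not
chordal = {1: {2, 4}, 2: {1, 3, 4}, 3: {2, 4}, 4: {1, 2, 3}}
cycle4 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
print(mcs_perfect_numbering(chordal))  # e.g. [1, 2, 4, 3]
print(mcs_perfect_numbering(cycle4))   # None
```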

3 Finding the cliques of a chordal graph

From an MCS numbering $V = \{1, \dots, |V|\}$, let $B_\lambda = \mathrm{bd}(\lambda) \cap \{1, \dots, \lambda - 1\}$ and $\pi_\lambda = |B_\lambda|$. Call $\lambda$ a ladder vertex if $\lambda = |V|$ or if $\pi_{\lambda+1} < \pi_\lambda + 1$, and let $\Lambda$ be the set of ladder vertices (in the example from the slides, the sequence $\pi_\lambda$ is $0, 1, 2, 2, 2, 1, 1$). The cliques are $C_\lambda = \{\lambda\} \cup B_\lambda$, $\lambda \in \Lambda$.
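A sketch of the clique extraction, reusing mcs_perfect_numbering and the example graph from the previous snippet; vertices are relabelled by their MCS numbers:

```python
def mcs_cliques(adj):
    """Cliques of a chordal graph, returned in MCS-number labels."""
    order = mcs_perfect_numbering(adj)
    if order is None:
        raise ValueError("graph is not chordal")
    number = {v: i + 1 for i, v in enumerate(order)}
    n = len(order)
    # B[lam] = neighbours of lam with a smaller MCS number
    B = {number[v]: {number[u] for u in adj[v] if number[u] < number[v]}
         for v in order}
    pi = {lam: len(B[lam]) for lam in B}
    ladder = [lam for lam in range(1, n + 1)
              if lam == n or pi[lam + 1] < pi[lam] + 1]
    return [B[lam] | {lam} for lam in ladder]

print(mcs_cliques(chordal))  # e.g. [{1, 2, 3}, {2, 3, 4}]
```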

4 Junction trees

Let $\mathcal{A}$ be a collection of finite subsets of a set $V$. A junction tree $T$ of the sets in $\mathcal{A}$ is an undirected tree with $\mathcal{A}$ as its vertex set, satisfying the junction tree property: if $A, B \in \mathcal{A}$ and $C$ is on the unique path in $T$ between $A$ and $B$, then $A \cap B \subseteq C$. If the sets in an arbitrary $\mathcal{A}$ are pairwise incomparable, they can be arranged in a junction tree if and only if $\mathcal{A} = \mathcal{C}$, where $\mathcal{C}$ is the set of cliques of a chordal graph.
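The defining property is easy to check mechanically. A small sketch under an assumed representation: the tree is a dict mapping each set (as a frozenset) to its set of neighbouring frozensets:

```python
from itertools import combinations

def is_junction_tree(tree):
    """Check A & B <= C for every C on the unique path between A and B."""
    def path(A, B):
        # unique path in a tree via DFS with parent tracking
        stack, prev = [A], {A: None}
        while stack:
            node = stack.pop()
            if node == B:
                out = []
                while node is not None:
                    out.append(node)
                    node = prev[node]
                return out
            for nb in tree[node]:
                if nb not in prev:
                    prev[nb] = node
                    stack.append(nb)
        return []
    return all(A & B <= C
               for A, B in combinations(tree, 2)
               for C in path(A, B))

T = {frozenset({1, 2, 3}): {frozenset({2, 3, 4})},
     frozenset({2, 3, 4}): {frozenset({1, 2, 3})}}
print(is_junction_tree(T))  # True
```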

5 Characterizing chordal graphs

The following are equivalent for any undirected graph $G$:
(i) $G$ is chordal;
(ii) $G$ is decomposable;
(iii) all prime components of $G$ are cliques;
(iv) $G$ admits a perfect numbering;
(v) every minimal $(\alpha, \beta)$-separator is complete;
(vi) the cliques of $G$ can be arranged in a junction tree.

6 Construction of the junction tree

The junction tree can be constructed directly from the MCS ordering of the cliques $C_\lambda$, $\lambda \in \Lambda$: since the MCS numbering is perfect, every $C_\lambda$ with $\lambda > \lambda_{\min}$ satisfies, for some $\lambda' < \lambda$,
$$C_\lambda \cap \Bigl( \bigcup_{\lambda'' < \lambda} C_{\lambda''} \Bigr) = C_\lambda \cap C_{\lambda'} = S_\lambda.$$
A junction tree is now easily constructed by attaching $C_\lambda$ to any $C_{\lambda'}$ satisfying the above. Although $\lambda'$ may not be uniquely determined, $S_\lambda$ is. Indeed, the sets $S_\lambda$ are the minimal complete separators, and the multiplicities $\nu(S)$ are $\nu(S) = |\{\lambda \in \Lambda : S_\lambda = S\}|$.
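A sketch of this attachment step, continuing the snippets above (cliques are in MCS-number labels, so max(C) is the ladder vertex of C):

```python
def junction_tree(cliques):
    """Edges (parent, separator, clique) of a junction tree."""
    cliques = sorted(cliques, key=max)            # order by ladder vertex
    placed, seen, edges = [cliques[0]], set(cliques[0]), []
    for C in cliques[1:]:
        S = C & seen                              # the separator S_lam
        # attach C to any earlier clique containing S
        parent = next(D for D in placed if S <= D)
        edges.append((parent, S, C))
        placed.append(C)
        seen |= C
    return edges

for D, S, C in junction_tree(mcs_cliques(chordal)):
    print(sorted(D), "-", sorted(S), "-", sorted(C))
```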

7 A chordal graph

[Figure: an example chordal graph, with vertices numbered by maximum cardinality search.]

8 Junction tree

[Figure: junction tree for the example graph.] The cliques of the graph are arranged into a tree with $C_1 \cap C_2 \subseteq D$ for all cliques $D$ on the path between $C_1$ and $C_2$.

9 Junction trees of prime components

In general, the prime components of any undirected graph can be arranged in a junction tree in a similar way. Every pair of neighbours $(C, D)$ in the junction tree then represents a decomposition of $G$ into $G_{\tilde C}$ and $G_{\tilde D}$, where $\tilde C$ is the set of vertices in prime components connected to $C$ but separated from $D$ in the junction tree, and similarly for $\tilde D$. The corresponding algorithm is based on a slightly more sophisticated method known as lexicographic search (LEX), which runs in $O(|V|^2)$ time.

10 Basic factorizations

If the graph $G$ is chordal, we say that the graphical model is decomposable. In this case, the IPS algorithm converges in a finite number of steps. We also have the familiar factorization of densities
$$f(x \mid \Sigma) = \frac{\prod_{C \in \mathcal{C}} f(x_C \mid \Sigma_C)}{\prod_{S \in \mathcal{S}} f(x_S \mid \Sigma_S)^{\nu(S)}} \qquad (1)$$
where $\nu(S)$ is the number of times $S$ appears as an intersection between neighbouring cliques of a junction tree for $\mathcal{C}$.

11 Relations for trace and determinant

Using the factorization (1) we can, for example, match the expressions for the trace and determinant of $\Sigma$:
$$\mathrm{tr}(KW) = \sum_{C \in \mathcal{C}} \mathrm{tr}(K_C W_C) - \sum_{S \in \mathcal{S}} \nu(S)\, \mathrm{tr}(K_S W_S)$$
and further
$$\det \Sigma = \{\det(K)\}^{-1} = \frac{\prod_{C \in \mathcal{C}} \det(\Sigma_C)}{\prod_{S \in \mathcal{S}} \{\det(\Sigma_S)\}^{\nu(S)}}.$$
These are some of many relations that can be derived using the decomposition property of chordal graphs.
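A quick numerical sanity check of the determinant identity, assuming numpy. We build a covariance matrix that is Markov w.r.t. the two-clique graph with cliques {0, 1, 2} and {2, 3, 4} (0-based indices) by combining clique-marginal inverses, then compare the two sides:

```python
import numpy as np

def embed(M, idx, p):
    """Pad M with zeros to a p x p matrix supported on idx ([A]^V)."""
    out = np.zeros((p, p))
    out[np.ix_(idx, idx)] = M
    return out

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
full = A @ A.T + 5 * np.eye(5)        # an arbitrary positive definite matrix

C1, C2, S = [0, 1, 2], [2, 3, 4], [2]
K = (embed(np.linalg.inv(full[np.ix_(C1, C1)]), C1, 5)
     + embed(np.linalg.inv(full[np.ix_(C2, C2)]), C2, 5)
     - embed(np.linalg.inv(full[np.ix_(S, S)]), S, 5))
Sigma = np.linalg.inv(K)              # covariance Markov w.r.t. the graph

lhs = np.linalg.det(Sigma)
rhs = (np.linalg.det(Sigma[np.ix_(C1, C1)])
       * np.linalg.det(Sigma[np.ix_(C2, C2)])
       / np.linalg.det(Sigma[np.ix_(S, S)]))
print(np.isclose(lhs, rhs))           # True
```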

12 Maximum likelihood estimates

The same factorization clearly holds for the maximum likelihood estimates:
$$f(x \mid \hat\Sigma) = \frac{\prod_{C \in \mathcal{C}} f(x_C \mid \hat\Sigma_C)}{\prod_{S \in \mathcal{S}} f(x_S \mid \hat\Sigma_S)^{\nu(S)}} \qquad (2)$$
Moreover, it follows from the general likelihood equations that $\hat\Sigma_A = W_A / n$ whenever $A$ is complete. Exploiting this, we can obtain an explicit formula for the maximum likelihood estimate in the case of a chordal graph.

13 For a $d \times e$ matrix $A = \{a_{\gamma\mu}\}_{\gamma \in d, \mu \in e}$, we let $[A]^V$ denote the matrix obtained from $A$ by filling up with zero entries to obtain full dimension $|V| \times |V|$, i.e.
$$\bigl([A]^V\bigr)_{\gamma\mu} = \begin{cases} a_{\gamma\mu} & \text{if } \gamma \in d,\ \mu \in e \\ 0 & \text{otherwise.} \end{cases}$$
The maximum likelihood estimate exists if and only if $n \geq |C|$ for all $C \in \mathcal{C}$. Then the following simple formula holds for the maximum likelihood estimate of $K$:
$$\hat K = n \left\{ \sum_{C \in \mathcal{C}} \bigl[(w_C)^{-1}\bigr]^V - \sum_{S \in \mathcal{S}} \nu(S) \bigl[(w_S)^{-1}\bigr]^V \right\}.$$
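This formula translates directly into code. A self-contained sketch, assuming numpy, with W the matrix of sums of squares, cliques given as index lists, and separators as (index list, multiplicity) pairs:

```python
import numpy as np

def mle_concentration(W, n, cliques, separators):
    """Explicit MLE of K in a decomposable Gaussian graphical model."""
    p = W.shape[0]
    K_hat = np.zeros((p, p))
    for C in cliques:
        K_hat[np.ix_(C, C)] += np.linalg.inv(W[np.ix_(C, C)])
    for S, nu in separators:
        K_hat[np.ix_(S, S)] -= nu * np.linalg.inv(W[np.ix_(S, S)])
    return n * K_hat

# For the mathematics-marks graph below (0-based indices):
# K_hat = mle_concentration(W, n=87, cliques=[[0, 1, 2], [2, 3, 4]],
#                           separators=[([2], 1)])
```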

14 An example: mathematics marks

[Figure: graph on the five variables 1: Mechanics, 2: Vectors, 3: Algebra, 4: Analysis, 5: Statistics.] This graph is chordal with cliques $\{1, 2, 3\}$ and $\{3, 4, 5\}$, with separator $S = \{3\}$ having $\nu(\{3\}) = 1$.

15 Since one degree of freedom is lost by subtracting the average, we get in this example
$$\hat K = 87 \begin{pmatrix}
w^{11}_{[123]} & w^{12}_{[123]} & w^{13}_{[123]} & 0 & 0 \\
w^{21}_{[123]} & w^{22}_{[123]} & w^{23}_{[123]} & 0 & 0 \\
w^{31}_{[123]} & w^{32}_{[123]} & w^{33}_{[123]} + w^{33}_{[345]} - 1/w_{33} & w^{34}_{[345]} & w^{35}_{[345]} \\
0 & 0 & w^{43}_{[345]} & w^{44}_{[345]} & w^{45}_{[345]} \\
0 & 0 & w^{53}_{[345]} & w^{54}_{[345]} & w^{55}_{[345]}
\end{pmatrix}$$
where $w^{ij}_{[123]}$ is the $ij$th element of the inverse of
$$W_{[123]} = \begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{pmatrix}$$
and so on.

16 Distribution of the MLE in a Gaussian graphical model

The formula
$$\hat\Sigma = n^{-1} \left\{ \sum_{C \in \mathcal{C}} \bigl[(W_C)^{-1}\bigr]^V - \sum_{S \in \mathcal{S}} \nu(S) \bigl[(W_S)^{-1}\bigr]^V \right\}^{-1}$$
specifies $\hat\Sigma$ as a random matrix. The distribution of this random Wishart-type matrix partly reflects the Markov properties of the graph $G$. This is also true of the distribution of $\hat\Sigma$ for a non-chordal graph $G$, but not to the same degree. Before we delve further into this, we shall need some more terminology.

17 Laws and distributions

Generally we think of $\mathcal{P} = \{P_\theta, \theta \in \Theta\}$ and sometimes identify $\Theta$ with $\mathcal{P}$, which is justified when the parametrization $\theta \mapsto P_\theta$ is one-to-one and onto. In a Gaussian graphical model, $\theta = K \in \mathcal{S}^+(G)$ uniquely identifies any regular Gaussian distribution satisfying the Markov properties w.r.t. $G$.

18 In any case, any probability measure on $\mathcal{P}$ (or on $\Theta$) represents a random element of $\mathcal{P}$, i.e. a random distribution. The sampling distribution of the MLE $\hat\Sigma$ is an example of such a measure. To keep heads straight, we refer to a probability measure on $\mathcal{P}$ as a law, whereas a distribution is a probability measure on $\mathcal{X}$. Thus we shall, for example, speak of the Wishart law as we think of it as specifying a distribution of $f(\cdot \mid \Sigma)$.

19 We identify $\theta \in \Theta$ and $P_\theta \in \mathcal{P}$, so that for example $\theta_A$ for $A \subseteq V$ denotes the distribution of $X_A$ under $P_\theta$, and $\theta_{A \mid B}$ the family of conditional distributions of $X_A$ given $X_B$, etc. For a law $\mathcal{L}$ on $\Theta$ we write
$$A \perp_{\mathcal{L}} B \mid S \iff \theta_{A \cup S} \perp\!\!\perp \theta_{B \cup S} \mid \theta_S.$$
A law $\mathcal{L}$ on $\Theta$ is hyper Markov w.r.t. $G$ if
(i) all $\theta \in \Theta$ are globally Markov w.r.t. $G$;
(ii) $A \perp_{\mathcal{L}} B \mid S$ whenever $S$ is complete and $A \perp_G B \mid S$.
Note that the conditional independence is only required to hold for graph decompositions.

20 Hyper Markov property

[Figure: example graph omitted.] If $\theta$ follows a hyper Markov law for this graph, it holds for example that $\theta_{1235}$ is conditionally independent of the margin on the other side of the separator given $\theta_{25}$. This is indeed true for $\hat\Sigma$ in the graphical model with this graph, i.e. the corresponding conditional independence holds for $\hat\Sigma_{1235}$ given $\hat\Sigma_{25}$.

21 Consequences of the hyper Markov property

Clearly, if $A \perp_{\mathcal{L}} B \mid S$, we also have, for example (using property (C2) of conditional independence), $\theta_A \perp\!\!\perp \theta_B \mid \theta_S$, since $\theta_A$ and $\theta_B$ are functions of $\theta_{A \cup S}$ and $\theta_{B \cup S}$ respectively. But the converse is false: $\theta_A \perp\!\!\perp \theta_B \mid \theta_S$ does not imply $\theta_{A \cup S} \perp\!\!\perp \theta_{B \cup S} \mid \theta_S$, since $\theta_{A \cup S}$ is not a function of $(\theta_A, \theta_S)$. In contrast, $X_{A \cup B}$ is indeed a (one-to-one) function of $(X_A, X_B)$. However, it generally holds that
$$A \perp_{\mathcal{L}} B \mid S \iff \theta_{A \mid S} \perp\!\!\perp \theta_{B \mid S} \mid \theta_S.$$

22 Chordal graphs

If $G$ is chordal and $\mathcal{L}$ is hyper Markov on $G$, it holds that
$$A \perp_G B \mid S \implies A \perp_{\mathcal{L}} B \mid S,$$
i.e. it is not necessary to require that $S$ be a complete separator to obtain the relevant conditional independence. This follows essentially because for a chordal graph,
$$A \perp_G B \mid S \implies \exists\, S' \subseteq S : A \perp_G B \mid S' \text{ with } S' \text{ complete.}$$
If $G$ is not chordal, we can form $G'$ by completing all prime components of $G$. If $\mathcal{L}$ is hyper Markov on $G$ it is also hyper Markov on $G'$, and thus $A \perp_{G'} B \mid S \implies A \perp_{\mathcal{L}} B \mid S$. But the similar result is false for an arbitrary chordal cover of $G$.

23 Directed hyper Markov property

We have similar notions and results in the directed case. Say $\mathcal{L} = \mathcal{L}(\theta)$ is directed hyper Markov w.r.t. a DAG $D$ if $\theta$ is directed Markov on $D$ for all $\theta \in \Theta$ and
$$\theta_{v \mid \mathrm{pa}(v)} \perp\!\!\perp \theta_{\mathrm{nd}(v)} \mid \theta_{\mathrm{pa}(v)},$$
or equivalently $\theta_{v \cup \mathrm{pa}(v)} \perp\!\!\perp \theta_{\mathrm{nd}(v)} \mid \theta_{\mathrm{pa}(v)}$, or equivalently, for a well-ordering,
$$\theta_{v \mid \mathrm{pa}(v)} \perp\!\!\perp \theta_{\mathrm{pr}(v)} \mid \theta_{\mathrm{pa}(v)}.$$
In general there is no similar statement corresponding to the global property and d-separation. However, if $D$ is perfect, $\mathcal{L}$ is directed hyper Markov w.r.t. $D$ if and only if $\mathcal{L}$ is hyper Markov w.r.t. $G = \sigma(D) = D^m$.

24 Canonical construction of hyper Markov laws

The distributions of maximum likelihood estimators are important examples of hyper Markov laws, but for chordal graphs there is also a canonical construction of such laws. Let $\mathcal{C}$ be the cliques of a chordal graph $G$ and let $\mathcal{L}_C$, $C \in \mathcal{C}$, be a family of laws over $\Theta_C \subseteq \mathcal{P}(\mathcal{X}_C)$. The family of laws is hyperconsistent if for any $C$ and $D$ with $C \cap D = S$, $\mathcal{L}_C$ and $\mathcal{L}_D$ induce the same law for $\theta_S$. If $\mathcal{L}_C$, $C \in \mathcal{C}$, are hyperconsistent, there is a unique hyper Markov law $\mathcal{L}$ over $G$ with $\mathcal{L}(\theta_C) = \mathcal{L}_C$, $C \in \mathcal{C}$.

25 Strong hyper Markov laws

In some cases it is of interest to consider a stronger version of the hyper Markov property. A hyper Markov law is strongly hyper Markov if
$$\theta_{A \mid S} \perp\!\!\perp \theta_S$$
for all complete separators $S$. A directed hyper Markov law is strongly directed hyper Markov if $\theta_{v \mid \mathrm{pa}(v)} \perp\!\!\perp \theta_{\mathrm{pa}(v)}$ for all $v \in V$.

26 Basic setup

Parameter $\theta \in \Theta$, data $X = x$, likelihood
$$L(\theta \mid x) \propto p(x \mid \theta) = \frac{dP_\theta}{d\mu}(x).$$
Express knowledge about $\theta$ through a prior law $\pi$ on $\Theta$; we also use $\pi$ to denote the density of the prior law w.r.t. some measure $\nu$ on $\Theta$. Inference about $\theta$ from $x$ is then represented through the posterior law $\pi^*(\theta) = p(\theta \mid x)$. From Bayes' formula,
$$\pi^*(\theta) = p(x \mid \theta)\pi(\theta)/p(x) \propto L(\theta \mid x)\pi(\theta),$$
so the likelihood function is equal to the density of the posterior w.r.t. the prior, modulo a constant.

27 Conjugate families

A family $\mathcal{P}$ of laws on $\Theta$ is said to be stable under sampling from $x$ if $\pi \in \mathcal{P} \implies \pi^* \in \mathcal{P}$. If the family of priors is parametrised, $\mathcal{P} = \{P_\alpha, \alpha \in A\}$, we sometimes say that $\alpha$ is a hyperparameter. Bayesian inference can then be made by just updating hyperparameters.

28 Stability of hyper Markov properties

If $\mathcal{L}$ is a prior law over $\Theta$ and $X = x$ is an observation from $\theta$, let $\mathcal{L}^* = \mathcal{L}(\theta \mid X = x)$ denote the posterior law over $\Theta$. If $\mathcal{L}$ is hyper Markov w.r.t. $G$, so is $\mathcal{L}^*$; if $\mathcal{L}$ is strongly hyper Markov w.r.t. $G$, so is $\mathcal{L}^*$. In the latter case, the update of $\mathcal{L}$ is local to prime components, i.e.
$$\mathcal{L}^*(\theta_Q) = \mathcal{L}_Q^*(\theta_Q) = \mathcal{L}_Q(\theta_Q \mid X_Q = x_Q),$$
and the marginal distribution $p$ of $X$ is globally Markov w.r.t. $G$, where
$$p(x) = \int_\Theta P(X = x \mid \theta)\, \mathcal{L}(d\theta).$$

29 Conjugate exponential families

For a $k$-dimensional exponential family
$$p(x \mid \theta) = b(x)\, e^{\theta^\top t(x) - \psi(\theta)},$$
the standard conjugate family is given as
$$\pi(\theta \mid a, \kappa) \propto e^{\theta^\top a - \kappa \psi(\theta)}$$
for $(a, \kappa) \in A \subseteq \mathbb{R}^k \times \mathbb{R}_+$, where $A$ is determined so that the normalisation constant is finite. The conjugate family is stable under sampling, and posterior updating from $(x_1, \dots, x_n)$ with $t = \sum_i t(x_i)$ is then made as
$$(a^*, \kappa^*) = (a + t, \kappa + n).$$
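The update rule is a one-liner. A sketch using the Poisson family as an illustration (an assumption, not from the slides): with $\theta = \log\lambda$, $t(x) = x$ and $\psi(\theta) = e^\theta$, the conjugate prior with hyperparameters $(a, \kappa)$ corresponds to a Gamma$(a, \kappa)$ law for $\lambda$:

```python
def conjugate_update(a, kappa, data, t=lambda x: x):
    """Posterior hyperparameters (a + sum_i t(x_i), kappa + n)."""
    return a + sum(t(x) for x in data), kappa + len(data)

a, kappa = 2.0, 1.0                  # prior: lambda ~ Gamma(2, 1)
data = [3, 0, 2, 4, 1]
a_post, kappa_post = conjugate_update(a, kappa, data)
print(a_post, kappa_post)            # 12.0 6.0 : posterior Gamma(12, 6)
```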

30 Hyper inverse Wishart laws

Gaussian graphical models are canonical exponential families. The standard family of conjugate priors has densities
$$\pi(K \mid \Phi, \delta) \propto (\det K)^{\delta/2}\, e^{-\mathrm{tr}(K\Phi)}, \qquad K \in \mathcal{S}^+(G).$$
These laws are termed hyper inverse Wishart laws, as $\Sigma$ follows an inverse Wishart law for complete graphs. For chordal graphs, each marginal law $\mathcal{L}_C$ of $\Sigma_C$ is inverse Wishart.
