Lecture 17: May 29, 2002

EE596 Pat. Recog. II: Introduction to Graphical Models
University of Washington, Spring 2000
Dept. of Electrical Engineering
Lecture 17: May 29, 2002
Lecturer: Jeff Bilmes
Scribe: Kurt Partridge, Salvador Ruiz Correa

17.1 Formal Markov Properties on Directed Graphs

We previously discussed two algorithms for determining independence properties from directed graphs: Bayes-Ball and d-separation. We now show that a third mechanism, directed factorization, is also equivalent.

Definition (Directed Factorization). A probability distribution $P$ admits a directed factorization (DF) (also called recursive factorization) with respect to a DAG $D$ if

$$P(X) = \prod_{v \in V} P(X_v \mid X_{pa(v)})$$

where $pa(v)$ are the parents of $v$.

Lemma. If $P$ has (DF) with respect to $D$, then it factorizes according to $D^m$ (the moralized graph) and therefore obeys (G) with respect to $D^m$.

Proof. In the moralized graph $D^m$, the subsets $\{v\} \cup pa(v)$ are complete. Define:

$$\psi_{\{v\} \cup pa(v)}(X) = P(X_v \mid X_{pa(v)})$$

Therefore $P$ factorizes according to complete sets of $D^m$, that is, it satisfies (F) with respect to $D^m$, and since (F) $\Rightarrow$ (G), $P$ obeys (G) with respect to $D^m$.

The consequence of this lemma is that any inference we do on the moralized graph $D^m$ will be valid for the original graph $D$. We can see that this is true because $D^m$ corresponds to a larger family of probability models: $D^m$ has more edges, and it has erased all the independence statements made by the V-structures of the original model. (DF) can thus be seen to be a directed version of (G).

On the other hand, recall that (G) implies (L), which states that $\alpha \perp\!\!\!\perp V \setminus cl(\alpha) \mid bd(\alpha)$. Next, we note that (L) in $D^m$ is equivalent to the following statement for directed graphs: $\alpha \perp\!\!\!\perp V \setminus bl(\alpha) \mid bl(\alpha)$. This definition uses the concept of the Markov blanket (bl), which is analogous to the concept of boundary for undirected graphs. The Markov blanket for node $\alpha$ is defined as:

$$bl(\alpha) = pa(\alpha) \cup ch(\alpha) \cup \{w : ch(w) \cap ch(\alpha) \neq \emptyset\}$$
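The blanket definition above can be computed directly from a parent list. The following is a minimal sketch, not part of the original notes: the dictionary encoding of the DAG, the function name `markov_blanket`, and the four-node example graph are all assumptions made for illustration. Read literally, the set $\{w : ch(w) \cap ch(\alpha) \neq \emptyset\}$ contains $\alpha$ itself whenever $\alpha$ has children; the sketch removes it, which is presumably the intent.

```python
def markov_blanket(parents, alpha):
    """Return bl(alpha): parents, children, and co-parents of alpha's children.

    `parents` maps every node to the set of its parents (a hypothetical
    encoding of the DAG chosen for this sketch).
    """
    # ch(alpha): every node that lists alpha among its parents
    children = {w for w, pa in parents.items() if alpha in pa}
    # {w : ch(w) and ch(alpha) intersect}: all parents of alpha's children
    co_parents = set()
    for c in children:
        co_parents |= set(parents[c])
    # alpha is a parent of its own children, so drop it explicitly
    return (set(parents[alpha]) | children | co_parents) - {alpha}


# Example DAG (assumed for illustration): a -> c <- b, c -> d
dag = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}}
for node in sorted(dag):
    print(node, sorted(markov_blanket(dag, node)))
```

In this example the blanket of `a` picks up `b` only because the two share the child `c`, which is exactly the edge that moralization adds.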

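Figure 17.1: $D^m$ (right) corresponds to a larger class of models that has fewer independence statements than $D$. (Nodes A, B, C.)

Informally, the Markov blanket is $\alpha$'s parents, children, and the parents of $\alpha$'s children. It serves the same purpose as the boundary in undirected graphs: it delineates variable dependencies. For example, all nodes in the graph below except $\alpha$ are in $bl(\alpha)$. Conditioning on the blanket isolates $\alpha$ from other nodes like $\gamma$.

Figure 17.2: Markov blanket for node $\alpha$ (left). Moralized graph (right).

Recall that the set $A$ is ancestral if $bd(\alpha) \subseteq A$ for all $\alpha \in A$. We let $An(A)$ denote the smallest ancestral set containing $A$.

Theorem. If $P$ has the DF property over $D$ and $A$ is an ancestral set, then the marginal $P_A$ over $A$ has DF over $D_A$.

Proof.

$$P(X) = \prod_{v \in V} P(X_v \mid X_{pa(v)}) = \underbrace{\prod_{v \in A} P(X_v \mid X_{pa(v)})}_{\text{nodes in } A} \; \underbrace{\prod_{v \in V \setminus A} P(X_v \mid X_{pa(v)})}_{\text{nodes not in } A}$$

Because $A$ is ancestral, no factor in the first product depends on a variable outside $A$. Summing over $X_{V \setminus A}$ therefore leaves the first product untouched, while the second product sums to one (eliminate the nodes of $V \setminus A$ in reverse topological order; each conditional sums to one). We see that the portion of nodes consisting just of $A$ factorizes appropriately.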
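The theorem can be checked numerically on a small example. The sketch below is illustrative only: the four-node binary DAG, the conditional probability values, and the helper names (`ancestral_closure`, `cpt`, `product_over`) are all made up for this purpose. It computes $An(\{c\})$, marginalizes the full joint onto that set by brute force, and compares the result with the product of the conditionals of the nodes in the set.

```python
import itertools


def ancestral_closure(parents, nodes):
    """Smallest ancestral set An(A): A together with all of its ancestors."""
    closed, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in closed:
            closed.add(v)
            stack.extend(parents[v])
    return closed


# Assumed example DAG: a -> c <- b, c -> d (all variables binary)
parents = {"a": (), "b": (), "c": ("a", "b"), "d": ("c",)}

# p1[v][parent values] = P(X_v = 1 | X_pa(v)); the numbers are made up
p1 = {
    "a": {(): 0.3},
    "b": {(): 0.6},
    "c": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9},
    "d": {(0,): 0.2, (1,): 0.8},
}


def cpt(v, x):
    """P(X_v = x[v] | X_pa(v) = x[pa(v)]) for an assignment dict x."""
    q = p1[v][tuple(x[u] for u in parents[v])]
    return q if x[v] == 1 else 1.0 - q


def product_over(nodes, x):
    """Product of the conditionals of the given nodes at assignment x."""
    out = 1.0
    for v in nodes:
        out *= cpt(v, x)
    return out


A = sorted(ancestral_closure(parents, {"c"}))   # the ancestral set An({c})
rest = [v for v in parents if v not in A]       # nodes to sum out

max_gap = 0.0
for xa in itertools.product((0, 1), repeat=len(A)):
    x = dict(zip(A, xa))
    # Marginal P_A(x_A): sum the full joint over the nodes outside A
    marginal = 0.0
    for xr in itertools.product((0, 1), repeat=len(rest)):
        full = {**x, **dict(zip(rest, xr))}
        marginal += product_over(parents, full)
    # DF over D_A: product of the conditionals of the nodes in A only
    max_gap = max(max_gap, abs(marginal - product_over(A, x)))

print(A, max_gap)
```

The gap should be zero up to floating-point rounding. Repeating the experiment with a non-ancestral set such as `{"c", "d"}` would not work with this code as written, because the conditional of `c` would then refer to variables that have been summed out.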

Directed global Markov property

Theorem. Let $P$ have (DF) over $D$. Then $A \perp\!\!\!\perp B \mid S$ whenever $A$ and $B$ are separated by $S$ in $(D_{An(A \cup B \cup S)})^m$, the moral graph of the smallest ancestral set containing $A \cup B \cup S$.

Proof. $P$ has (DF), which implies that $P_{An(A \cup B \cup S)}$ also has (DF). On the other hand, $P_{An(A \cup B \cup S)}$ obeys (G) in $(D_{An(A \cup B \cup S)})^m$. This implies that $A \perp\!\!\!\perp B \mid S$ if $S$ separates $A$ and $B$.

This independence property is called the directed global Markov property (DG). It is in fact equivalent to directed factorization (DF). Here's an example:

Figure 17.9: Directed and Moralized Graph (nodes a, b, X, Y).

Is $a \perp\!\!\!\perp b \mid \{X, Y\}$ in the graph at left? Yes, because in the corresponding moralized graph shown to the right, vertex $Y \in S$ blocks the path between $a$ and $b$; thus $S$ separates $a$ and $b$, and the independence holds.

d-separation can also be interpreted this way. A trail $\pi$ from $a$ to $b$ is blocked in a directed graph $D$ by $S$ if $\pi$ contains a node $\gamma \in \pi$ such that either

a) $\gamma \in S$ and $\rightarrow \gamma \rightarrow$ or $\leftarrow \gamma \rightarrow$, or
b) $\gamma \notin S$ and $\rightarrow \gamma \leftarrow$ and no descendants of $\gamma$ are in $S$.

$A$ and $B$ are d-separated by $S$ if all trails from $A$ to $B$ are blocked, in which case $A \perp\!\!\!\perp B \mid S$. The cases a) and b) are illustrated in Fig 17.3. The figure also shows that if descendants of $\gamma$ are in $S$, then the trail is not blocked (Fig 17.3 c)).

A probability measure $P$ is said to obey the ordered directed local (DO) Markov property if any node in the graph is independent of its predecessors (pr) in some well ordering of the graph, given its parents (pa):

$$v \perp\!\!\!\perp pr(v) \mid pa(v), \quad \forall v \qquad (17.1)$$

Sets $A$ and $B$ are d-separated by $S$ if all trails from $A$ to $B$ are blocked by $S$ (we discussed blocked trails and d-separation earlier).

Theorem. Let $A$, $B$ and $S$ be disjoint subsets of a DAG $D$. Then

$$S \text{ d-separates } A \text{ from } B \iff S \text{ separates } A \text{ from } B \text{ in } (D_{An(A \cup B \cup S)})^m$$

It turns out that all these directed Markov properties on directed graphs are equivalent (the proofs of many of them can be found in Lauritzen, Proposition 3.25).
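The last theorem gives a purely graph-based test for d-separation: restrict to the smallest ancestral set, moralize, delete $S$, and check reachability. Here is a minimal sketch under stated assumptions: the parent-dictionary encoding, the function names, and the example DAG are hypothetical, and $A$, $B$, $S$ are assumed to be disjoint as in the theorem.

```python
from itertools import combinations


def ancestral_closure(parents, nodes):
    """Smallest ancestral set containing `nodes` (the nodes plus all ancestors)."""
    closed, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in closed:
            closed.add(v)
            stack.extend(parents[v])
    return closed


def d_separated(parents, A, B, S):
    """True if S separates A from B in the moral graph of D on An(A | B | S)."""
    A, B, S = set(A), set(B), set(S)
    keep = ancestral_closure(parents, A | B | S)

    # Moralize the induced sub-DAG: drop directions and marry co-parents.
    nbrs = {v: set() for v in keep}
    for v in keep:
        pa = list(parents[v])  # all parents are in `keep` because it is ancestral
        for u in pa:
            nbrs[u].add(v)
            nbrs[v].add(u)
        for u, w in combinations(pa, 2):
            nbrs[u].add(w)
            nbrs[w].add(u)

    # Depth-first search from A that never enters S; reaching B means a path exists.
    seen, stack = set(), list(A)
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        if v in B:
            return False
        stack.extend(n for n in nbrs[v] - S if n not in seen)
    return True


# Assumed example DAG: a -> c <- b, c -> d
dag = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}}
print(d_separated(dag, {"a"}, {"b"}, set()))   # collider c unobserved
print(d_separated(dag, {"a"}, {"b"}, {"d"}))   # descendant of the collider observed
print(d_separated(dag, {"a"}, {"d"}, {"c"}))   # chain a -> c -> d blocked at c
```

Restricting to the ancestral set before moralizing is what makes the collider case come out right: with $S = \emptyset$ the node `c` is not an ancestor of anything in $A \cup B \cup S$, so the parents `a` and `b` are never married.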

Figure 17.3: S must not have descendants. (Panels a), b), c).)

Theorem. Let $D$ be a DAG. For a probability distribution $P$ over $D$, we have the following equivalences:

DF $\iff$ DG $\iff$ DL $\iff$ DO $\iff$ d-separation $\iff$ un-reachability by the Bayes-ball algorithm.

Thus, by Theorem 13.6, the conditional independence properties of a DAG can be read off using any of the above tools. For causal models, for example, it is often not obvious which independencies are implied by the graph; these results give you a variety of tools to find them. All the algorithms produce the same list of independence statements.

General inference on graphs

So far in this course, we have talked about various graphical models (e.g., mixture models, factor analysis models, HMMs, LG-HMMs, etc.) and derived inference formulas that often took the form of recursions (e.g., the $\alpha$, $\beta$ and $\gamma$ recursions) to do efficient probabilistic inference (i.e., for HMMs and LG-HMMs). These recursion formulas are in fact special instances of more general algorithms, which we will be discussing in this lecture and the next one.

We will concentrate on the Junction Tree (JT), a data structure that exists when we have triangulated (or chordal, or decomposable, etc.) graphs. You might recall the elimination algorithm, which we saw produces triangulated graphs and which therefore has a JT (in fact elimination is a special case of the JT algorithm). The JT is a data structure with important properties that allow us to do inference efficiently. The key property that makes a JT suitable for probabilistic inference on a graph is that certain properties, when achieved locally, are also achieved globally. Why is locality important? In general, local operations on a graph are far less computationally intensive than doing inference on the whole graph.

The key aspects of JT inference all concern locality: 1) if we perform local operations in the right order on the right data structure, then we will have a globally correct solution; 2) the JT is the key to why local operations imply global correctness; 3) local operations are much less expensive computationally.

Here's an outline of the key issues we will be discussing today and in the next lecture.

- From DGMs to UGMs via moralization (we saw most of this already).
- How to do inference correctly using local operations on a graph in the right order and on the right data structure. This is called achieving local consistency.
- JT: the key for ensuring correctness, that local consistency implies global consistency.
- Existence of a JT $\iff$ the graph being triangulated. This is the reason we need to work with triangulated graphs. See Fig 17.4.
- Complexity: complexity is exponential in the cardinality of the cliques. So, in general, we would like to have the smallest possible clique sizes. However, finding a triangulation with the smallest possible cliques is difficult (NP-hard).

Figure 17.4: Directed, Undirected, Decomposable.

Moralization

Recall the directed factorization (DF) property of a DGM and the factorization (F) property of a UGM.

$$p(x) = \prod_{v} p(x_v \mid x_{pa(v)}) \qquad \text{(DF)}$$

$$p(x) = \prod_{c \in C} \psi_c(x) \qquad \text{(F)}$$

where the $\psi_c$ represent the clique potentials. The sequence of steps to go from a DGM to a UGM is:

- Initialize all clique potentials to unity; i.e., set $\psi_c(x) = 1, \ \forall c$.
- For each $p(x_v \mid x_{pa(v)})$, choose a clique $c$ such that $x_v \cup x_{pa(v)}$ is contained within the clique. Update the clique potential as follows:

$$\psi_c^{(u)}(x) = \psi_c(x) \, p(x_v \mid x_{pa(v)}) \qquad (17.2)$$

where $\psi_c^{(u)}(x)$ is the updated version of $\psi_c(x)$.

Let's now work out an example to illustrate how the clique potentials can be used to represent the joint probability distribution on a graph.

Figure 17.1: Six node DAG (nodes A, B, C, D, E, F).

After we moralize and triangulate, the cliques of the above graph might be $(ABC)$, $(CDE)$ and $(DEF)$, and the associated clique potentials are:

$$\psi_{ABC} = p(A) \, p(B) \, p(C \mid A, B)$$
$$\psi_{CDE} = p(D \mid C) \, p(E \mid C)$$
$$\psi_{DEF} = p(F \mid D, E)$$

From the graph, the joint distribution of the nodes can be written as:

$$p(ABCDEF) = p(A) \, p(B) \, p(C \mid A, B) \, p(D \mid C) \, p(E \mid C) \, p(F \mid D, E) = \psi_{ABC} \, \psi_{CDE} \, \psi_{DEF}$$

In general, clique potentials are not unique. Introducing some scaling constants, we can modify the clique potentials such that the global distribution $p(x)$ is maintained. That is, we can write:

$$p(x) = \prod_c \psi_c(x) = \prod_c \psi'_c(x) \qquad (17.3)$$

where the $\psi'_c$ are the rescaled potentials. In fact, we could allow the clique functions to take arbitrary positive values, and as long as we normalize to ensure that the probability sums to one, we can have:

$$p(x) = \frac{1}{Z} \prod_c \psi_c(x) \qquad (17.4)$$

where

$$Z = \int \prod_c \psi_c(x) \, dx$$

Note that the normalization constant could be absorbed by one of the clique potential functions. The main point is that the clique potentials can, in the general case, be arbitrary. We will continue with this in the next lecture.
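The six-node example can be reproduced numerically. The sketch below is an illustration under assumptions, not the lecture's own code: all variables are taken to be binary, the conditional probability values are invented, and the names (`cpt`, `assignment`, `potential`, `scale`) are hypothetical. It applies the initialize-and-multiply procedure of (17.2), confirms that the product of potentials equals the joint, and then rescales the potentials arbitrarily to show that dividing by $Z$ as in (17.4) recovers the same distribution.

```python
import itertools

# DAG of the six-node example: A, B -> C; C -> D, E; D, E -> F
parents = {"A": (), "B": (), "C": ("A", "B"),
           "D": ("C",), "E": ("C",), "F": ("D", "E")}
cliques = [("A", "B", "C"), ("C", "D", "E"), ("D", "E", "F")]

# p1[v][parent values] = P(X_v = 1 | X_pa(v)); binary variables, made-up numbers
p1 = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9},
    "D": {(0,): 0.2, (1,): 0.8},
    "E": {(0,): 0.4, (1,): 0.6},
    "F": {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.5, (1, 1): 0.95},
}


def cpt(v, x):
    """p(x_v | x_pa(v)) for a full assignment dict x."""
    q = p1[v][tuple(x[u] for u in parents[v])]
    return q if x[v] == 1 else 1.0 - q


# Assign each conditional to the first clique containing {v} and pa(v).
# A clique with no assigned conditional keeps its initial potential of 1.
assignment = {c: [] for c in cliques}
for v in parents:
    family = {v, *parents[v]}
    home = next(c for c in cliques if family <= set(c))
    assignment[home].append(v)


def potential(c, x):
    """psi_c(x): product of the conditionals multiplied into clique c."""
    out = 1.0
    for v in assignment[c]:
        out *= cpt(v, x)
    return out


# Arbitrary positive rescaling of each potential (illustrates 17.3 and 17.4)
scale = {cliques[0]: 2.0, cliques[1]: 3.0, cliques[2]: 0.5}

nodes = list(parents)
states = [dict(zip(nodes, vals))
          for vals in itertools.product((0, 1), repeat=len(nodes))]

joint, unnorm = [], []
for x in states:
    p, u = 1.0, 1.0
    for v in nodes:
        p *= cpt(v, x)                  # directed factorization (DF)
    for c in cliques:
        u *= scale[c] * potential(c, x)  # product of rescaled potentials
    joint.append(p)
    unnorm.append(u)

Z = sum(unnorm)                          # sum replaces the integral for discrete x
max_gap = max(abs(u / Z - p) for u, p in zip(unnorm, joint))
print(assignment)
print(Z, max_gap)
```

With these scale factors the normalizer is just their product, since the unscaled potentials already multiply to a proper joint. Folding $1/Z$ into any single clique's scale factor would give potentials that satisfy (17.3) directly.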
