Undirected Graphical Models 2

Bayesian networks: use d-separation to read off independencies in a Bayesian network. Takes a bit of effort!
Markov networks: use separation to determine independencies. Really easy!

Formally: let H be a Markov network structure, and let X_1 - ... - X_k be a path in H. Let Z ⊆ 𝒳 be a set of observed variables. The path X_1 - ... - X_k is active given Z if none of the X_i, i = 1, ..., k, is in Z.

Example: the path X_1 - X_2 - X_3 is not active if X_2 ∈ Z; X_2 then separates X_1 and X_3.
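As a concrete reading of this definition, here is a minimal sketch in Python (function and variable names are illustrative, not from the slides): a path is active given Z exactly when no node on it is observed.

```python
# Minimal sketch of the active-path test for Markov networks.
# A path X_1 - ... - X_k is active given Z iff no X_i lies in Z.

def path_is_active(path, Z):
    """path: sequence of node names; Z: set of observed nodes."""
    return not any(node in Z for node in path)

# Example: the path X1 - X2 - X3 is blocked once X2 is observed.
print(path_is_active(["X1", "X2", "X3"], Z=set()))    # True
print(path_is_active(["X1", "X2", "X3"], Z={"X2"}))   # False
```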
A set of nodes Z separates X and Y in H, denoted sep_H(X; Y | Z), if there is no active path between any node X ∈ X and any node Y ∈ Y given Z.

Separation is monotonic in Z, i.e., if sep_H(X; Y | Z) then sep_H(X; Y | Z′) for any Z′ ⊇ Z. We define the global independencies associated with H to be:

I(H) = {(X ⊥ Y | Z) : sep_H(X; Y | Z)}

Example: the chain X_1 - X_2 - X_3 - X_4.

Note: we can't encode non-monotonic independence relations with separation in a Markov network (more on this later).

We will show:
Soundness: if a distribution P factorizes over a Markov network H, then separation in H gives valid independencies of P.
Completeness: separation in H finds almost all conditional independencies in P.

Soundness

Theorem 4.1: Let P be a distribution over 𝒳, and H a Markov network structure over 𝒳. If P is a Gibbs distribution that factorizes over H, then H is an I-map for P. Informally: distribution P factorizes over H ⇒ separation in H implies independence in P.

Recall: a distribution P_Φ with Φ = {φ_1(D_1), ..., φ_K(D_K)} factorizes over a Markov network H if each D_k (k = 1, ..., K) is a complete subgraph of H and

P_Φ(X_1, ..., X_n) = (1/Z_Φ) ∏_{k=1}^{K} φ_k(D_k).
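Separation itself is plain graph reachability: delete the observed nodes Z and ask whether any node of X can still reach a node of Y. A sketch, assuming an adjacency-dict graph representation (names illustrative):

```python
from collections import deque

def separates(H, X, Y, Z):
    """sep_H(X; Y | Z): no active path from any node in X to any node in Y.

    H: dict mapping each node to a set of neighbors (undirected graph).
    Equivalent to removing Z from the graph and testing reachability.
    """
    X, Y, Z = set(X), set(Y), set(Z)
    frontier = deque(X - Z)
    visited = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in Y:
            return False          # found an active path into Y
        for nbr in H[node] - Z:   # observed nodes block every path
            if nbr not in visited:
                visited.add(nbr)
                frontier.append(nbr)
    return True

# Chain X1 - X2 - X3 - X4: X2 separates X1 from X3 and X4.
H = {"X1": {"X2"}, "X2": {"X1", "X3"}, "X3": {"X2", "X4"}, "X4": {"X3"}}
print(separates(H, {"X1"}, {"X3", "X4"}, {"X2"}))  # True
```

Monotonicity is immediate in this view: enlarging Z only removes more nodes, so no blocked path can become active.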
Soundness

Proof: Let X, Y, Z be any three disjoint subsets of 𝒳 such that Z separates X and Y in H. We want to show: P ⊨ (X ⊥ Y | Z).

Case 1: X ∪ Y ∪ Z = 𝒳. Since Z separates X from Y, no clique of H contains nodes from both X and Y, so every clique D_k lies entirely within X ∪ Z or entirely within Y ∪ Z. Grouping the factors into those whose cliques involve only X ∪ Z and those whose cliques involve only Y ∪ Z,

P(𝒳) = (1/Z_Φ) f(X, Z) g(Y, Z).

Therefore P ⊨ (X ⊥ Y | Z).

Case 2: X ∪ Y ∪ Z ⊊ 𝒳. Let U = 𝒳 − (X ∪ Y ∪ Z). We can partition U into two disjoint sets U_1 and U_2 such that Z separates X ∪ U_1 from Y ∪ U_2 in H (each node of U can be placed on the X side or the Y side, since an active path from it to both would give an active path from X to Y). By Case 1,

P ⊨ (X ∪ U_1 ⊥ Y ∪ U_2 | Z).

Therefore, by the decomposition property, P ⊨ (X ⊥ Y | Z).

Decomposition property: (X ⊥ {Y, W} | Z) ⇒ (X ⊥ Y | Z).
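Theorem 4.1 can be sanity-checked numerically on the chain X1 - X2 - X3: build a Gibbs distribution from arbitrary positive pairwise potentials and confirm (X1 ⊥ X3 | X2). A sketch with binary variables (the potential values are arbitrary random choices):

```python
import numpy as np

rng = np.random.default_rng(0)
phi12 = rng.uniform(0.5, 2.0, size=(2, 2))   # potential on clique {X1, X2}
phi23 = rng.uniform(0.5, 2.0, size=(2, 2))   # potential on clique {X2, X3}

# Gibbs distribution P(x1, x2, x3) ∝ phi12[x1, x2] * phi23[x2, x3]
P = np.array([[[phi12[a, b] * phi23[b, c] for c in range(2)]
               for b in range(2)] for a in range(2)])
P /= P.sum()                                  # normalize (1/Z_Phi)

# Verify (X1 ⊥ X3 | X2): P(x1, x3 | x2) must factor for every value of X2.
for b in range(2):
    joint = P[:, b, :] / P[:, b, :].sum()     # P(x1, x3 | X2 = b)
    prod = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    assert np.allclose(joint, prod)
print("(X1 ⊥ X3 | X2) holds, as Theorem 4.1 predicts")
```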
Soundness

So far, we've shown: distribution P factorizes over H ⇒ separation in H implies independence in P. The other direction is the Hammersley-Clifford theorem:

Hammersley-Clifford Theorem: Let P be a positive distribution over 𝒳, and H a Markov network graph over 𝒳. If H is an I-map for P, then P is a Gibbs distribution that factorizes over H. (Proof omitted)

Completeness

Strong version: every pair of nodes X and Y that are not separated in H are dependent in every distribution that factorizes over H. This is not true; a weaker version is needed.

Theorem 4.3: Let H be a Markov network structure. If X and Y are not separated given Z in H, then X and Y are dependent given Z in some distribution P that factorizes over H. (Proof omitted here)

Thus, for almost all distributions P that factorize over H, we have I(P) = I(H).

We had two definitions of independencies in Bayesian networks:
1. Global independencies: d-separation
2. Local independencies: (X_i ⊥ NonDescendants(X_i) | Parents(X_i))
We can do the same thing with Markov networks:
1. Global independencies: separation
2. Local independencies:
a) Pairwise independencies
b) Local independencies (Markov blanket)

Pairwise

Intuitively: when two variables are not directly connected, we can make them conditionally independent by observing all other, mediating, variables.

Let H be a Markov network. We define the pairwise independencies associated with H to be:

I_p(H) = {(X ⊥ Y | 𝒳 − {X, Y}) : X–Y ∉ H}

Local (Markov Blanket)

Intuitively: block all influences on a node by conditioning on its immediate neighbors. (In the slide figure, the grey nodes are the Markov blanket.)

Formally: for a given graph H, we define the Markov blanket of X in H, denoted MB_H(X), to be the neighbors of X in H. We define the local independencies associated with H to be:

I_l(H) = {(X ⊥ 𝒳 − {X} − MB_H(X) | MB_H(X)) : X ∈ 𝒳}

Note: for general distributions, I_p(H) is strictly weaker than I_l(H), which is strictly weaker than I(H).
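Reading I_p(H) and the Markov blankets off a graph is mechanical. A sketch, using the same adjacency-dict representation as above (function names are illustrative):

```python
from itertools import combinations

def pairwise_independencies(H):
    """I_p(H): (X ⊥ Y | everything else) for every non-adjacent pair X, Y."""
    nodes = set(H)
    return [(x, y, nodes - {x, y})
            for x, y in combinations(sorted(nodes), 2)
            if y not in H[x]]

def markov_blanket(H, x):
    """MB_H(X) is simply the set of neighbors of X in the graph."""
    return set(H[x])

# Chain X1 - X2 - X3: one pairwise independence, (X1 ⊥ X3 | {X2}).
H = {"X1": {"X2"}, "X2": {"X1", "X3"}, "X3": {"X2"}}
print(pairwise_independencies(H))   # [('X1', 'X3', {'X2'})]
print(markov_blanket(H, "X2"))      # {'X1', 'X3'}
```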
Proposition 4.3: For any Markov network H and any distribution P, if P ⊨ I_l(H) then P ⊨ I_p(H).

Proposition 4.4: For any Markov network H and any distribution P, if P ⊨ I(H) then P ⊨ I_l(H).

Note: the converses of both propositions hold only for positive distributions.

Theorem 4.4: Let P be a positive distribution. If P satisfies I_p(H), then P satisfies I(H). (Proof omitted here)

Corollary 4.1: The following three statements are equivalent for a positive distribution P:
1. P ⊨ I_l(H)
2. P ⊨ I_p(H)
3. P ⊨ I(H)

Why only positive distributions P? In a nonpositive distribution, we can satisfy one of the weaker properties while violating a stronger one. (Counterexamples follow.)

Counterexample #1: Let P′ be any distribution over 𝒳 = {X_1, ..., X_n}; let 𝒳* = {X*_1, ..., X*_n}. Construct a distribution P over (𝒳, 𝒳*) whose marginal over X_1, ..., X_n is the same as P′, and where each X*_i is deterministically equal to X_i. [Note: this is where the nonpositivity comes in.] Let H be a Markov network over (𝒳, 𝒳*) that contains no edges other than X_i - X*_i.
In P, conditioning on X*_i makes X_i deterministic, so (X_i ⊥ everything else | X*_i) holds, and symmetrically for each X*_i given X_i. Thus P satisfies the local independencies for every node in the network. But H is not an I-map for P: H makes many independence assertions regarding the X_i that do not hold in P (or in P′), e.g., that the X_i are mutually independent.

Counterexample #2: Let P′ be any distribution over 𝒳 = {X_1, ..., X_n}. Consider two auxiliary sets of variables 𝒳* and 𝒳**, and construct a distribution P whose marginal over X_1, ..., X_n is the same as P′, and where X*_i and X**_i are both deterministically equal to X_i. [Note: this is where the nonpositivity comes in.]

Let H be the empty Markov network over (𝒳, 𝒳*, 𝒳**), e.g., the edgeless graph over X_1, X_2, X*_1, X*_2, X**_1, X**_2.

P satisfies the pairwise assumptions for every pair of nodes in the network, i.e., (X ⊥ Y | 𝒳 ∪ 𝒳* ∪ 𝒳** − {X, Y}) for all pairs X, Y. Why? E.g., X_i and X*_i are rendered independent because the conditioning set contains X**_i, and knowing X**_i tells you everything about X_i and X*_i. Similarly, X_i and X_j are rendered independent because the conditioning set contains X*_i, which determines X_i.

Thus P satisfies the pairwise independencies, but not the local or global independencies, as the sketch below verifies.
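Both counterexamples can be reproduced numerically on minimal instances. In the sketch below (the tiny distributions and index choices are illustrative, not from the slides), counterexample #1 is instantiated with n = 2 and X1, X2 perfectly correlated, and counterexample #2 with a single variable and its two copies, which already exhibits the mechanism:

```python
from itertools import product

def marginal(P, idx):
    """Marginal of a dict-encoded distribution over the given index tuple."""
    out = {}
    for assign, p in P.items():
        key = tuple(assign[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def cond_indep(P, A, B, C):
    """Check (A ⊥ B | C) via P(a,b,c)·P(c) = P(a,c)·P(b,c) everywhere."""
    pabc, pc = marginal(P, A + B + C), marginal(P, C)
    pac, pbc = marginal(P, A + C), marginal(P, B + C)
    for assign in product([0, 1], repeat=len(A + B + C)):
        a = assign[:len(A)]
        b = assign[len(A):len(A) + len(B)]
        c = assign[len(A) + len(B):]
        if abs(pabc.get(assign, 0.0) * pc.get(c, 0.0)
               - pac.get(a + c, 0.0) * pbc.get(b + c, 0.0)) > 1e-12:
            return False
    return True

# Counterexample #1, n = 2: variables (X1, X2, X1*, X2*), X_i* a copy of X_i,
# with P' making X1, X2 perfectly correlated (a nonpositive distribution).
P1 = {(0, 0, 0, 0): 0.5, (1, 1, 1, 1): 0.5}
# Local independencies of H (edges only X_i - X_i*) hold: X1* determines X1.
print(cond_indep(P1, A=(0,), B=(1, 3), C=(2,)))   # True
# But the global assertion (X1 ⊥ X2) made by H fails:
print(cond_indep(P1, A=(0,), B=(1,), C=()))       # False

# Counterexample #2, n = 1: variables (X1, X1*, X1**), all three equal,
# with H the empty graph. Every pairwise statement conditions on a third
# variable that determines the other two, so it holds vacuously:
P2 = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
print(all(cond_indep(P2, (a,), (b,), (c,))
          for a, b, c in [(0, 1, 2), (0, 2, 1), (1, 2, 0)]))  # True
# The local statement (X1 ⊥ {X1*, X1**} | ∅) fails:
print(cond_indep(P2, A=(0,), B=(1, 2), C=()))     # False
```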
How do we encode the independencies in a distribution P using a graph structure? We need to construct a minimal I-map, using either of the following two constructions:
Theorem 4.5: uses pairwise independencies
Theorem 4.6: uses Markov blankets

Theorem 4.5: Let P be a positive distribution, and let H be defined by introducing an edge {X, Y} for all X, Y for which P ⊭ (X ⊥ Y | 𝒳 − {X, Y}). Then the Markov network H is the unique minimal I-map for P. [Proof omitted]

Theorem 4.6: Let P be a positive distribution. For each node X, let MB_P(X) be a minimal set of nodes U that form a Markov blanket, i.e., such that (X ⊥ 𝒳 − {X} − U | U) ∈ I(P). We define a graph H by introducing an edge {X, Y} for all X and all Y ∈ MB_P(X). Then the Markov network H is the unique minimal I-map for P. [Proof omitted]
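Given an independence oracle for a positive distribution P, Theorem 4.5's construction is a double loop over pairs of variables. A sketch; the oracle indep(X, Y, Z) is assumed given (for a small discrete P it could be the numeric check used above), and the toy oracle here hard-codes the independencies of a chain-structured P:

```python
from itertools import combinations

def minimal_imap_pairwise(nodes, indep):
    """Theorem 4.5 construction: connect X and Y unless
    (X ⊥ Y | all other variables) holds in P.

    indep(X, Y, Z) -> bool is an independence oracle for P.
    """
    nodes = set(nodes)
    H = {x: set() for x in nodes}
    for x, y in combinations(sorted(nodes), 2):
        if not indep({x}, {y}, nodes - {x, y}):
            H[x].add(y)
            H[y].add(x)
    return H

# Toy oracle for a P that factorizes over the chain X1 - X2 - X3
# with no extra independencies: only non-adjacent pairs are independent.
chain_edges = {frozenset({"X1", "X2"}), frozenset({"X2", "X3"})}
indep = lambda X, Y, Z: frozenset(X | Y) not in chain_edges
print(minimal_imap_pairwise({"X1", "X2", "X3"}, indep))
# -> recovers the chain: edges X1-X2 and X2-X3
```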
Counterexample for nonpositive distributions: consider a nonpositive distribution P over four binary variables A, B, C, D that assigns nonzero probability only to cases where all four variables take exactly the same value, e.g., P(A=1, B=1, C=1, D=1) = 0.5 and P(A=0, B=0, C=0, D=0) = 0.5.

What happens when we apply the Markov blanket construction from Theorem 4.6? [Figure: one possible resulting graph over A, B, C, D; note it is not an I-map for the distribution.] For example, P ⊨ (A ⊥ {C, D} | B), so {B} is a legal choice for MB_P(A).

What happens when we apply the pairwise independence construction from Theorem 4.5? For example, no edge is placed between A and B because P ⊨ (A ⊥ B | {C, D}). The resulting network is not an I-map of P.

Does every distribution have a perfect map? No, not even for positive distributions. Example: we can't use a Markov network to represent the distribution of a Bayesian network with the v-structure D → G ← I. Here I is dependent on D given G, so the minimal I-map is the fully connected graph, which doesn't capture (I ⊥ D). It is therefore not a perfect map.
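The v-structure claim is easy to verify numerically. The sketch below picks a concrete CPD, G = D XOR I with D and I fair independent coins (an illustrative choice; any non-degenerate CPD for G behaves the same qualitatively), and confirms that (D ⊥ I) holds marginally but fails given G:

```python
import numpy as np

# v-structure D -> G <- I with D, I fair coins and G = D XOR I.
P = np.zeros((2, 2, 2))                    # indexed [d, g, i]
for d in range(2):
    for i in range(2):
        P[d, d ^ i, i] = 0.25

# Marginally, D and I are independent ...
PdPi = np.outer(P.sum(axis=(1, 2)), P.sum(axis=(0, 1)))
print(np.allclose(P.sum(axis=1), PdPi))    # True: (D ⊥ I)

# ... but given G they become dependent (explaining away):
for g in range(2):
    cond = P[:, g, :] / P[:, g, :].sum()   # P(d, i | g)
    prod = np.outer(cond.sum(axis=1), cond.sum(axis=0))
    print(np.allclose(cond, prod))         # False for both values of g
# So a Markov network I-map needs every edge, and (D ⊥ I) is lost.
```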