Conditional Independence and Factorization
Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr
Graphical Models: Efficient Representation

Consider a set of (discrete) random variables {X_1, ..., X_n}, where each X_i takes on one of r different values x_i. A direct representation of p(x_1, ..., x_n) requires an n-dimensional table of r^n values, one for each of the r^n possible configurations of X_1, ..., X_n.

Graphical models provide an efficient means of representing joint probability distributions over many random variables in situations where each random variable is conditionally independent of all but a handful of the n random variables.

A graphical model can be thought of as a probabilistic database, a machine that can answer queries regarding the values of sets of random variables. We build up the database in pieces, using probability theory to ensure that the pieces have a consistent overall interpretation. Probability theory also justifies the inferential machinery that allows the pieces to be put together on the fly to answer queries.
The chain rule gives

p(x_1, x_2, x_3, x_4) = p(x_4 | x_3, x_2, x_1) p(x_3 | x_2, x_1) p(x_2 | x_1) p(x_1).

[Figure: two directed graphs over X_1, X_2, X_3, X_4 — the fully connected GM implied by the chain rule, and the sparse GM of a Markov chain.]
Outline: Directed Graphical Models

- Directed graphs and joint probabilities
- Conditional independence and d-separation
- Three canonical graphs
- Bayes ball algorithm
- Characterization of directed graphical models
Notations

- Given a set of random variables {X_1, ..., X_n}, let x_i represent a realization of random variable X_i.
- The probability mass function p(x_1, ..., x_n) is defined as P(X_1 = x_1, ..., X_n = x_n).
- Use X to stand for {X_1, ..., X_n} and x to stand for {x_1, ..., x_n}.
- X_A is shorthand for a subset, e.g., X_A = {X_1, X_2} if A = {1, 2}.
Directed Graphs

A directed graph is a pair G(V, E), where V is a set of nodes (vertices) and E is a set of oriented edges. Assume that G is acyclic.

- Nodes: associated with random variables, with a one-to-one mapping from nodes to random variables. π_i denotes the set of parents of node i.
- Edges: represent conditional dependence.
Conditional Independence

X_A and X_B are independent, X_A ⊥ X_B, if

p(x_A, x_B) = p(x_A) p(x_B).

X_A and X_C are conditionally independent given X_B, X_A ⊥ X_C | X_B, if

p(x_A, x_C | x_B) = p(x_A | x_B) p(x_C | x_B),  or equivalently  p(x_A | x_B, x_C) = p(x_A | x_B),

for all x_B such that p(x_B) > 0.
An Example of DAG

[Figure: a DAG over X_1, ..., X_6 with edges X_1 → X_2, X_1 → X_3, X_2 → X_4, X_3 → X_5, X_2 → X_6, X_5 → X_6.]

p(x_1, x_2, x_3, x_4, x_5, x_6) = p(x_1) p(x_2 | x_1) p(x_3 | x_1) p(x_4 | x_2) p(x_5 | x_3) p(x_6 | x_2, x_5)
Factorization of Joint Probability Distributions

Use the locality defined by the parent-child relationship to construct an economical representation of joint probability distributions.

Associate a function f_i(x_i, x_{π_i}) with each node i ∈ V, satisfying the properties of conditional probability distributions: nonnegativity and sum-to-one (over x_i).

Given a set of functions {f_i(x_i, x_{π_i}) : i ∈ V} for V = {1, 2, ..., n}, we define a joint probability distribution as follows:

p(x_1, x_2, ..., x_n) ≜ ∏_{i=1}^n f_i(x_i, x_{π_i}).

Given that the f_i(x_i, x_{π_i}) are conditional probabilities, we write them in terms of p(x_i | x_{π_i}):

p(x_1, x_2, ..., x_n) = ∏_{i=1}^n p(x_i | x_{π_i}).
An Example of DAG: Revisited

[Figure: the same six-node DAG as before.]

p(x_1, x_2, x_3, x_4, x_5, x_6) = p(x_1) p(x_2 | x_1) p(x_3 | x_1) p(x_4 | x_2) p(x_5 | x_3) p(x_6 | x_2, x_5)
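To make the factorization concrete, here is a minimal sketch (not from the slides) that builds the joint for this six-node DAG as a product of local conditional probability tables and checks that it sums to one; the CPT values are randomly generated, and only the graph structure is taken from the example.

import itertools
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(num_parents, r=2):
    # random table p(x_i | x_parents), normalized over the last axis (x_i)
    table = rng.random((r,) * num_parents + (r,))
    return table / table.sum(axis=-1, keepdims=True)

# parents[i] lists the parents of node i (0-based: node i here is X_{i+1} on the slide)
parents = {0: [], 1: [0], 2: [0], 3: [1], 4: [2], 5: [1, 4]}
cpts = {i: random_cpt(len(pa)) for i, pa in parents.items()}

def joint(x):
    # p(x) = prod_i p(x_i | x_{pi_i}) for a full binary assignment x
    p = 1.0
    for i, pa in parents.items():
        p *= cpts[i][tuple(x[j] for j in pa) + (x[i],)]
    return p

total = sum(joint(x) for x in itertools.product([0, 1], repeat=6))
print(f"sum over all 2^6 assignments = {total:.6f}")  # 1.000000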
Economical Representation?

Consider n discrete random variables {X_1, X_2, ..., X_n}, with each variable X_i ranging over r values.

- Naive approach: an n-dimensional table of size r^n.
- PGM: for each node X_i, an (m_i + 1)-dimensional table of size r^(m_i + 1), where m_i is the number of parents of node X_i.
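Counting the entries for the six-node example (a small sketch, assuming r = 2; the parent counts are read off the example DAG):

r, n = 2, 6
num_parents = [0, 1, 1, 1, 1, 2]   # m_i for nodes X1..X6 in the example DAG

naive = r ** n
pgm = sum(r ** (m + 1) for m in num_parents)
print(f"naive joint table: {naive} entries")   # 64
print(f"DAG factorization: {pgm} entries")     # 2 + 4 + 4 + 4 + 4 + 8 = 26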
Conditional Independence in DAG

Two different factorizations of a probability distribution:

- The chain rule leads to  p(x_1, x_2, ..., x_n) = ∏_{i=1}^n p(x_i | x_1, ..., x_{i-1}).
- A DAG leads to  p(x_1, x_2, ..., x_n) = ∏_{i=1}^n p(x_i | x_{π_i}).

Comparing these expressions, we might interpret that missing variables in the local conditional probability functions correspond to missing edges (conditional independence) in the underlying graph.
Basic Conditional Independence Statements

An ordering I of the nodes in a graph G is said to be topological if, for every node i ∈ V, the nodes in π_i appear before i in I. (I = (1, 2, 3, 4, 5, 6) is a topological ordering in our example.)

Let ν_i denote the set of all nodes that appear earlier than i in I, excluding π_i. (For example, ν_5 = {1, 2, 4}.)

Given a topological ordering I, the set of basic conditional independence statements is

{X_i ⊥ X_{ν_i} | X_{π_i}},  for i ∈ V.

These statements can be verified by algebraic calculations. Example: X_4 ⊥ {X_1, X_3} | X_2:

p(x_4 | x_1, x_2, x_3) = p(x_1, x_2, x_3, x_4) / p(x_1, x_2, x_3)
                       = [p(x_1) p(x_2 | x_1) p(x_3 | x_1) p(x_4 | x_2)] / [p(x_1) p(x_2 | x_1) p(x_3 | x_1)]
                       = p(x_4 | x_2)
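The same statement can also be checked numerically by brute-force marginalization. A minimal sketch, assuming binary variables and randomly generated CPTs (only the graph structure comes from the example); because the check passes for arbitrary CPTs, it illustrates that the independence follows from the factorization alone.

import itertools
import numpy as np

rng = np.random.default_rng(1)
parents = {0: [], 1: [0], 2: [0], 3: [1], 4: [2], 5: [1, 4]}

def random_cpt(k, r=2):
    t = rng.random((r,) * k + (r,))
    return t / t.sum(axis=-1, keepdims=True)

cpts = {i: random_cpt(len(pa)) for i, pa in parents.items()}

def joint(x):
    p = 1.0
    for i, pa in parents.items():
        p *= cpts[i][tuple(x[j] for j in pa) + (x[i],)]
    return p

def marg(fixed):
    # sum the joint over all variables not fixed; `fixed` maps index -> value
    free = [i for i in range(6) if i not in fixed]
    total = 0.0
    for vals in itertools.product([0, 1], repeat=len(free)):
        assignment = {**fixed, **dict(zip(free, vals))}
        total += joint([assignment[i] for i in range(6)])
    return total

for x1, x2, x3, x4 in itertools.product([0, 1], repeat=4):
    lhs = marg({0: x1, 1: x2, 2: x3, 3: x4}) / marg({0: x1, 1: x2, 2: x3})
    rhs = marg({1: x2, 3: x4}) / marg({1: x2})
    assert np.isclose(lhs, rhs)  # p(x4 | x1, x2, x3) == p(x4 | x2)
print("X4 is independent of {X1, X3} given X2 under this factorization")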
Markov Chain

[Figure: the chain X → Y → Z, and the statement X ⊥ Z | Y.]

p(z | x, y) = p(x, y, z) / p(x, y) = [p(x) p(y | x) p(z | y)] / [p(x) p(y | x)] = p(z | y)

There are no other conditional independencies.

Asserted conditional independencies always hold for the family of distributions associated with a given graph. Non-asserted conditional independencies sometimes fail to hold but sometimes do hold.
Hidden Cause

[Figure: the fork X ← Y → Z, and the statement X ⊥ Z | Y.]

p(x, z | y) = p(x, y, z) / p(y) = [p(y) p(x | y) p(z | y)] / p(y) = p(x | y) p(z | y)

We do not necessarily assume that X and Z are dependent.
Explaining-Away

[Figure: the v-structure X → Y ← Z.]

From the factorization p(x, y, z) = p(x) p(z) p(y | x, z):

p(y | x, z) = p(x) p(z) p(y | x, z) / p(x, z),  and marginalizing over y gives  p(x, z) = p(x) p(z).

So X ⊥ Z holds marginally, but X ⊥ Z | Y does not hold in general: conditioning on the common child Y couples X and Z.
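A small numeric illustration (the numbers are assumptions, not from the slides): X and Z are independent binary causes and Y is a noisy-OR-like effect; observing Z "explains away" the evidence that Y provides about X.

p_x, p_z = 0.1, 0.1                          # independent priors P(X=1), P(Z=1)

def p_y1_given(x, z):                        # P(Y=1 | x, z): noisy-OR-like effect
    return 0.01 if (x == 0 and z == 0) else 0.9

def joint(x, y, z):
    py1 = p_y1_given(x, z)
    return ((p_x if x else 1 - p_x) * (p_z if z else 1 - p_z)
            * (py1 if y else 1 - py1))

p_y1 = sum(joint(x, 1, z) for x in (0, 1) for z in (0, 1))
p_x1_given_y1 = sum(joint(1, 1, z) for z in (0, 1)) / p_y1
p_y1_z1 = sum(joint(x, 1, 1) for x in (0, 1))
p_x1_given_y1_z1 = joint(1, 1, 1) / p_y1_z1

print(f"p(X=1 | Y=1)      = {p_x1_given_y1:.3f}")     # about 0.50
print(f"p(X=1 | Y=1, Z=1) = {p_x1_given_y1_z1:.3f}")  # about 0.10: Z explains Y away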
Bayes Ball Algorithm

We wish to decide whether a given conditional independence statement, X_A ⊥ X_B | X_C, is true for a directed graph G.

- A reachability algorithm
- A d-separation test

Shade the nodes in X_C. Place balls at each node in X_A, let them bounce around according to some rules, and then ask whether any of the balls reach any of the nodes in X_B.
Bayes Ball Algorithm - Rules 1, 2

[Figure: ball-passing rules for the chain (X → Y → Z) and the fork (X ← Y → Z), with the middle node Y unshaded vs. shaded.]
Bayes Ball Algorithm - Rules 3, 4, 5

[Figure: ball-passing rules for the v-structure (X → Y ← Z) and the boundary cases on a single edge X → Y, with the relevant node unshaded vs. shaded.]
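One common way to implement the Bayes ball rules is the standard reachability formulation of d-separation sketched below; this is an illustrative implementation, not code from the slides.

def d_separated(parents, A, B, C):
    # True iff X_A ⊥ X_B | X_C in the DAG given by parents[i] = list of parents of i
    children = {i: [] for i in parents}
    for i, pa in parents.items():
        for p in pa:
            children[p].append(i)

    # Ancestors of the conditioning set C (including C itself): used to decide
    # whether a head-to-head (collider) node is activated by the evidence.
    ancestors, stack = set(), list(C)
    while stack:
        node = stack.pop()
        if node not in ancestors:
            ancestors.add(node)
            stack.extend(parents[node])

    # Traverse (node, direction) states: 'up' = entered from a child (or a start
    # node), 'down' = entered from a parent.
    visited, frontier = set(), [(a, 'up') for a in A]
    while frontier:
        node, direction = frontier.pop()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node not in C and node in B:
            return False                                      # an active trail reaches B
        if direction == 'up' and node not in C:
            frontier += [(p, 'up') for p in parents[node]]    # chain going up
            frontier += [(c, 'down') for c in children[node]] # diverging at node
        elif direction == 'down':
            if node not in C:                                 # chain going down
                frontier += [(c, 'down') for c in children[node]]
            if node in ancestors:                             # activated collider
                frontier += [(p, 'up') for p in parents[node]]
    return True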
Example 1

[Figure: the six-node DAG, with X_2 shaded.]

X_4 ⊥ {X_1, X_3} | X_2 is true.
Example 2

[Figure: the six-node DAG, with X_2 and X_3 shaded.]

X_1 ⊥ X_6 | {X_2, X_3} is true.
Example 3

[Figure: the six-node DAG, with X_1 and X_6 shaded.]

X_2 ⊥ X_3 | {X_1, X_6} is not true.
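Using the d_separated sketch above, the three example queries can be checked directly (0-based indices, so node i here is X_{i+1} on the slides).

parents = {0: [], 1: [0], 2: [0], 3: [1], 4: [2], 5: [1, 4]}

print(d_separated(parents, {3}, {0, 2}, {1}))   # X4 ⊥ {X1, X3} | X2   -> True
print(d_separated(parents, {0}, {5}, {1, 2}))   # X1 ⊥ X6 | {X2, X3}   -> True
print(d_separated(parents, {1}, {2}, {0, 5}))   # X2 ⊥ X3 | {X1, X6}   -> False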
Characterization of Directed Graphical Models

A graphical model is associated with a family of probability distributions, and this family can be characterized in two equivalent ways.

D_1 = { p(x) : p(x) = ∏_{i=1}^n p(x_i | x_{π_i}) }.

D_2 = { p(x) subject to X_i ⊥ X_{ν_i} | X_{π_i} for all i }. (The family of probability distributions associated with G that includes all p(x_V) satisfying every conditional independence statement associated with G.)

D_1 = D_2 (details will be discussed later). This provides a strong and important link between graph theory and probability theory.
Markov Blanket

Markov blanket: V is a Markov blanket for X iff X ⊥ Y | V for all Y ∉ V ∪ {X}.

In a DAG, the Markov blanket of node i is composed of its parents, its children, and its children's parents (the co-parents).

The Markov blanket identifies all the variables that shield the node from the rest of the network. This means that the Markov blanket of a node is the only knowledge needed to predict the behaviour of that node. (The term was coined by J. Pearl.)
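A small sketch of the DAG Markov blanket (parents, children, and the children's other parents), reusing the parents-dictionary convention of the earlier examples:

def markov_blanket(parents, i):
    children = [c for c, pa in parents.items() if i in pa]
    co_parents = {p for c in children for p in parents[c]}
    return (set(parents[i]) | set(children) | co_parents) - {i}

parents = {0: [], 1: [0], 2: [0], 3: [1], 4: [2], 5: [1, 4]}
print(markov_blanket(parents, 1))   # blanket of X2: {0, 3, 4, 5} = {X1, X4, X5, X6}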
Outline: Undirected Graphical Models

- Markov networks
- Clique potentials
- Characterization of undirected graphical models
Markov Networks (Undirected Graphical Models)

[Figure: an undirected graph over X_1, ..., X_5.]

Examples of Markov networks:

- Boltzmann machines
- Markov random fields

Semantics: every node is conditionally independent of its non-neighbors, given its neighbors.

Markov blanket: V is a Markov blanket for X iff X ⊥ Y | V for all Y ∉ V ∪ {X}.

Markov boundary: minimal Markov blanket.
Boltzmann Machines

A Markov network over a vector of binary variables, x_i ∈ {0, 1}, where some variables may be hidden (x_H) and some may be observed (visible, x_V).

Learning for Boltzmann machines involves adjusting the weights W such that the generative model

p(x | W) = (1/Z) exp{ (1/2) xᵀ W x }

is well-matched to a set of examples {x^(t)}, t = 1, ..., N. The learning is done via EM!
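A minimal sketch of the fully visible case (no learning, just the density): p(x | W) = (1/Z) exp((1/2) xᵀ W x) over a few binary units, with Z computed by brute-force enumeration; the random symmetric W is an assumption for illustration.

import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.normal(size=(n, n))
W = (W + W.T) / 2.0          # symmetric weights
np.fill_diagonal(W, 0.0)     # no self-connections

def unnormalized(x):
    x = np.asarray(x, dtype=float)
    return np.exp(0.5 * x @ W @ x)

states = list(itertools.product([0, 1], repeat=n))
Z = sum(unnormalized(x) for x in states)                      # partition function
probs = [unnormalized(x) / Z for x in states]
print(f"Z = {Z:.4f}, total probability = {sum(probs):.4f}")   # total = 1.0000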
Conditional Independence

For undirected graphs, the conditional independence semantics is the more intuitive one:

Every path from a node in X_A to a node in X_C includes at least one node in X_B  ⟹  X_A ⊥ X_C | X_B.

[Figure: a graph in which the set X_B separates X_A from X_C.]
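This separation criterion is just graph reachability after deleting the conditioning set; a small sketch (the three-node chain used to try it out is a made-up example):

def separated(neighbors, A, B, C):
    # True iff every path from A to C passes through B: A cannot reach C once
    # the nodes in B are removed from the undirected graph.
    visited, stack = set(), [a for a in A if a not in B]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        if node in C:
            return False
        stack.extend(v for v in neighbors[node] if v not in B)
    return True

chain = {0: {1}, 1: {0, 2}, 2: {1}}        # a 3-node chain X1 - X2 - X3
print(separated(chain, {0}, {1}, {2}))     # True: X1 ⊥ X3 | X2
print(separated(chain, {0}, set(), {2}))   # False: X1 and X3 are connected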
Comparative Semantics

[Figure: an undirected four-node cycle over W, X, Y, Z (X and Y non-adjacent, W and Z non-adjacent) and a directed v-structure X → Z ← Y.]

- The undirected cycle asserts X ⊥ Y | {W, Z} and W ⊥ Z | {X, Y}, but not X ⊥ Y; no directed graph expresses exactly these independencies.
- The v-structure asserts X ⊥ Y, but not X ⊥ Y | Z; no undirected graph expresses exactly these independencies.
Clique Potentials

Clique: a fully connected subgraph (usually maximal). Denote by x_{C_i} the set of variables in clique C_i.

[Figure: an undirected graph over X_1, ..., X_5 with its cliques indicated.]

For each clique C_i, we assign a nonnegative function (potential function) ψ_{C_i}(x_{C_i}), which measures compatibility (agreement, constraint, or energy):

p(x) = (1/Z) ∏_{C_i ∈ C} ψ_{C_i}(x_{C_i}),

where Z = Σ_x ∏_{C_i ∈ C} ψ_{C_i}(x_{C_i}) is the normalization constant.
Example

[Figure: an undirected graph over x_1, ..., x_6 whose edges carry pairwise potentials, e.g. over (x_1, x_2), (x_2, x_4), (x_3, x_5), (x_5, x_6); each 2x2 table has 1.5 on the diagonal and 0.2 off the diagonal, favoring agreement between neighboring variables.]
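A sketch that uses the 2x2 agreement potential from this example; the exact edge set in the figure is hard to recover, so the edges below are an assumption, and the point is the computation of Z and p(x).

import itertools
import numpy as np

psi = np.array([[1.5, 0.2],
                [0.2, 1.5]])                        # psi(x_i, x_j): rewards agreement
edges = [(0, 1), (1, 3), (1, 2), (2, 4), (4, 5)]    # assumed pairwise cliques
n = 6

def unnormalized(x):
    return np.prod([psi[x[i], x[j]] for i, j in edges])

states = list(itertools.product([0, 1], repeat=n))
Z = sum(unnormalized(x) for x in states)
x0 = (0,) * n
print(f"Z = {Z:.3f}, p(all zeros) = {unnormalized(x0) / Z:.4f}")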
Potential Functions: Parameterization

The potential functions must be nonnegative, so we may write

ψ_C(x_C) = exp{ -H_C(x_C) }.

Therefore, the joint probability for undirected models can be written as

p(x) = (1/Z) exp{ -Σ_{C_i ∈ C} H_{C_i}(x_{C_i}) }.

The sum in the exponent is generally referred to as the energy:

H(x) ≜ Σ_{C_i ∈ C} H_{C_i}(x_{C_i}).

Finally, the joint probability can be represented as a Boltzmann distribution:

p(x) = (1/Z) exp{ -H(x) }.
Potential Functions, again!

The factorization of the joint probability distribution in a Markov network is given by

p(x) = (1/Z) ∏_{C_i ∈ C} ψ_{C_i}(x_{C_i}).

If all potential functions are strictly positive, then we have

p(x) = exp{ Σ_{C_i ∈ C} log ψ_{C_i}(x_{C_i}) - log Z },

where the first term plays the role of Σ_i θ_i T_i(x) and log Z plays the role of A(θ). In some cases, we have

p(x) = exp{ Σ_i θ_i T_i(x) - A(θ) },

which is known as the exponential family. In statistical physics, log Z is referred to as the free energy.
Markov Random Fields

Let X = {X_1, ..., X_n} be a family of random variables defined on the set S, in which each random variable X_i takes a value in L.

Definition. The family X is called a random field.

Definition. X is said to be a Markov random field (MRF) with respect to a neighborhood system N if and only if the following two conditions are satisfied:

1. Positivity: P(x) > 0 for all configurations x.
2. Markovianity: P(x_i | x_{S \ {i}}) = P(x_i | x_{N_i}).
Gibbs Random Fields

Definition. A set of random variables X is said to be a Gibbs random field (GRF) if and only if its configurations obey the Gibbs distribution, which has the form

p(x) = (1/Z) exp{ -(1/T) E(x) },  where E(x) = Σ_C ψ_C(x_C).

We often consider cliques of size up to 2:

E(x) = Σ_{i ∈ S} ψ_1(x_i) + Σ_{i ∈ S} Σ_{j ∈ N_i} ψ_2(x_i, x_j).

Theorem (Hammersley-Clifford). X is an MRF on S with respect to N if and only if X is a GRF on S with respect to N.
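A minimal sketch of a Gibbs distribution with single-site and pairwise energy terms on a tiny 2x2 grid; the particular unary and pairwise energies are made-up illustrative choices.

import itertools
import numpy as np

T = 1.0
sites = [(0, 0), (0, 1), (1, 0), (1, 1)]
pairs = [((0, 0), (0, 1)), ((0, 0), (1, 0)),
         ((0, 1), (1, 1)), ((1, 0), (1, 1))]

def energy(x):
    # E(x) = sum_i psi1(x_i) + sum over neighboring pairs psi2(x_i, x_j)
    unary = sum(0.5 * x[s] for s in sites)                          # psi1: bias toward 0
    pairwise = sum(0.0 if x[i] == x[j] else 1.0 for i, j in pairs)  # psi2: smoothness
    return unary + pairwise

states = [dict(zip(sites, vals)) for vals in itertools.product([0, 1], repeat=4)]
Z = sum(np.exp(-energy(x) / T) for x in states)
probs = {tuple(x[s] for s in sites): np.exp(-energy(x) / T) / Z for x in states}
print("most probable configuration:", max(probs, key=probs.get))   # (0, 0, 0, 0)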
Characterization of Undirected Graphical Models

U_1: the family of probability distributions obtained by ranging over all possible choices of positive potential functions on the maximal cliques of the graph,

p(x) = (1/Z) exp{ -H(x) }.

U_2: the family of probability distributions defined via the conditional independence assertions (X_A ⊥ X_B | X_C) associated with G.

U_1 = U_2 by the Hammersley-Clifford theorem (proof, p. 85).