Directed and Undirected Graphical Models
Davide Bacciu
Dipartimento di Informatica, Università di Pisa
bacciu@di.unipi.it
Machine Learning: Neural Networks and Advanced Models (AA2)
Directed Graphical Models (Bayesian Networks)
- Directed Acyclic Graph (DAG) G = (V, E)
- Nodes v ∈ V represent random variables (shaded = observed, empty = un-observed)
- Edges e ∈ E describe the conditional independence relationships
- Conditional Probability Tables (CPT), local to each node, describe the probability distribution of the node given its parents

P(Y_1, ..., Y_N) = ∏_{i=1}^{N} P(Y_i | pa(Y_i))
Plate Notation
- A compact representation of replication in graphical models
- Boxes denote replication for a number of times denoted by the letter in the corner
- Shaded nodes are observed variables
- Empty nodes denote un-observed latent variables
- Black seeds (optional) identify model parameters
Graphical Models
A graph whose nodes (vertices) are random variables and whose edges (links) represent probabilistic relationships between the variables. Different classes of graphs:
- Directed models: directed edges express causal relationships
- Undirected models: undirected edges express soft constraints
- Dynamic models: structure changes to reflect dynamic processes
Directed Models - Local Markov Property
A variable Y_v is independent of its non-descendants given its parents and only its parents, i.e. Y_v ⊥ Y_{nd(v)} | Y_{pa(v)}, where nd(v) denotes the non-descendants of v.
- Party and Study are marginally independent: Party ⊥ Study
- However, the local Markov property does not support Party ⊥ Study | Headache
- But Party and Tabs are independent given Headache: Tabs ⊥ Party | Headache
Joint Probability Factorization
An application of the chain rule and the Local Markov Property:
1. Pick a topological ordering of the nodes
2. Apply the chain rule following the order
3. Use the conditional independence assumptions

P(PA, S, H, T, C)
= P(PA) P(S | PA) P(H | S, PA) P(T | H, S, PA) P(C | T, H, S, PA)
= P(PA) P(S) P(H | S, PA) P(T | H) P(C | H)
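The factorized joint above can be evaluated directly once the CPTs are known. A minimal Python sketch, assuming binary variables and made-up CPT values (the lecture does not specify any numbers):

```python
# Hypothetical CPTs for the Party/Study/Headache/Tabs/C example.
# All probability values are made up for illustration.
P_PA = 0.3                                                   # P(PA = 1)
P_S = 0.5                                                    # P(S = 1)
P_H = {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.7, (0, 0): 0.1}   # P(H = 1 | S, PA)
P_T = {1: 0.8, 0: 0.05}                                      # P(T = 1 | H)
P_C = {1: 0.6, 0: 0.2}                                       # P(C = 1 | H)

def bern(p_true, value):
    """Probability of a binary value under a Bernoulli(p_true)."""
    return p_true if value else 1.0 - p_true

def joint(pa, s, h, t, c):
    """P(PA,S,H,T,C) = P(PA) P(S) P(H|S,PA) P(T|H) P(C|H)."""
    return (bern(P_PA, pa) * bern(P_S, s) * bern(P_H[(s, pa)], h)
            * bern(P_T[h], t) * bern(P_C[h], c))
```

Because each factor is a normalized conditional, the 32 joint probabilities sum to one, which is a quick sanity check on any hand-built CPT set.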
Sampling from a Bayesian Network
A BN describes a generative process for observations:
1. Pick a topological ordering of the nodes
2. Generate data by sampling from the local conditional probabilities following this order

Generate the i-th sample for each variable PA, S, H, T, C:
1. pa_i ~ P(PA)
2. s_i ~ P(S)
3. h_i ~ P(H | S = s_i, PA = pa_i)
4. t_i ~ P(T | H = h_i)
5. c_i ~ P(C | H = h_i)
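The ancestral sampling steps above can be sketched directly; the CPT numbers below are illustrative assumptions, not values from the lecture:

```python
import random

# Made-up CPTs for the slide's example (binary variables).
P_H = {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.7, (0, 0): 0.1}   # P(H = 1 | S, PA)
P_T = {1: 0.8, 0: 0.05}                                      # P(T = 1 | H)
P_C = {1: 0.6, 0: 0.2}                                       # P(C = 1 | H)

def sample(rng=random):
    """Draw one joint sample following the topological order PA, S, H, T, C."""
    pa = int(rng.random() < 0.3)           # pa_i ~ P(PA)
    s = int(rng.random() < 0.5)            # s_i  ~ P(S)
    h = int(rng.random() < P_H[(s, pa)])   # h_i  ~ P(H | S = s, PA = pa)
    t = int(rng.random() < P_T[h])         # t_i  ~ P(T | H = h)
    c = int(rng.random() < P_C[h])         # c_i  ~ P(C | H = h)
    return pa, s, h, t, c
```

Averaging many samples recovers the marginals, e.g. the empirical frequency of PA = 1 converges to 0.3.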
Basic Structures of a Bayesian Network
There exist 3 basic substructures that determine the conditional independence relationships in a Bayesian network:
- Tail-to-tail (common cause): Y_1 ← Y_2 → Y_3
- Head-to-tail (causal effect): Y_1 → Y_2 → Y_3
- Head-to-head (common effect): Y_1 → Y_2 ← Y_3
Tail-to-Tail Connections (Y_1 ← Y_2 → Y_3)
Corresponds to P(Y_1, Y_3 | Y_2) = P(Y_1 | Y_2) P(Y_3 | Y_2)
- If Y_2 is unobserved then Y_1 and Y_3 are marginally dependent: Y_1 ⊥̸ Y_3
- If Y_2 is observed then Y_1 and Y_3 are conditionally independent: Y_1 ⊥ Y_3 | Y_2
- When observed, Y_2 is said to block the path from Y_1 to Y_3
Head-to-Tail Connections (Y_1 → Y_2 → Y_3)
Corresponds to P(Y_1, Y_3 | Y_2) = P(Y_1) P(Y_2 | Y_1) P(Y_3 | Y_2) / P(Y_2) = P(Y_1 | Y_2) P(Y_3 | Y_2)
- If Y_2 is unobserved then Y_1 and Y_3 are marginally dependent: Y_1 ⊥̸ Y_3
- If Y_2 is observed then Y_1 and Y_3 are conditionally independent: Y_1 ⊥ Y_3 | Y_2
- An observed Y_2 blocks the path from Y_1 to Y_3
Head-to-Head Connections (Y_1 → Y_2 ← Y_3)
Corresponds to P(Y_1, Y_2, Y_3) = P(Y_1) P(Y_3) P(Y_2 | Y_1, Y_3)
- If Y_2 is unobserved then Y_1 and Y_3 are marginally independent: Y_1 ⊥ Y_3
- If Y_2 is observed then Y_1 and Y_3 become conditionally dependent: Y_1 ⊥̸ Y_3 | Y_2
- If any descendant of Y_2 is observed, it also unblocks the path
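The head-to-head behavior (explaining away) can be checked numerically by brute-force enumeration over the joint P(Y_1) P(Y_3) P(Y_2 | Y_1, Y_3); the CPT values below are hypothetical:

```python
from itertools import product

# Made-up parameters for Y1 -> Y2 <- Y3, all binary.
pY1 = 0.4
pY3 = 0.5
pY2 = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.95}  # P(Y2=1 | Y1, Y3)

def joint(y1, y2, y3):
    """P(Y1, Y2, Y3) = P(Y1) P(Y3) P(Y2 | Y1, Y3)."""
    p = (pY1 if y1 else 1 - pY1) * (pY3 if y3 else 1 - pY3)
    q = pY2[(y1, y3)]
    return p * (q if y2 else 1 - q)

def p_y1_given(y2=None, y3=None):
    """P(Y1 = 1 | evidence) by enumerating all consistent states."""
    num = den = 0.0
    for y1, v2, v3 in product((0, 1), repeat=3):
        if y2 is not None and v2 != y2:
            continue
        if y3 is not None and v3 != y3:
            continue
        pr = joint(y1, v2, v3)
        den += pr
        if y1 == 1:
            num += pr
    return num / den
```

With Y_2 unobserved, conditioning on Y_3 leaves P(Y_1) unchanged (marginal independence); once Y_2 = 1 is observed, also observing Y_3 = 1 lowers the belief in Y_1 = 1 (Y_3 "explains away" Y_1).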
Derived Conditional Independence Relationships
A Bayesian Network represents the local relationships encoded by the 3 basic structures plus the derived relationships. Consider the chain Y_1 → Y_2 → Y_3 → Y_4:
- Local Markov relationships: Y_1 ⊥ Y_3 | Y_2 and Y_4 ⊥ {Y_1, Y_2} | Y_3
- Derived relationship: Y_1 ⊥ Y_4 | Y_2
d-separation
Definition (d-separation): let r be an undirected path between Y_1 and Y_2; then r is d-separated by Z if there exists at least one node Y_c on r that blocks the path. In other words, d-separation holds if at least one of the following holds:
- r contains a head-to-tail structure Y_i → Y_c → Y_j (or Y_i ← Y_c ← Y_j) and Y_c ∈ Z
- r contains a tail-to-tail structure Y_i ← Y_c → Y_j and Y_c ∈ Z
- r contains a head-to-head structure Y_i → Y_c ← Y_j and neither Y_c nor its descendants are in Z
Markov Blanket and d-separation
Definition (node d-separation): two nodes Y_i and Y_j in a BN G are said to be d-separated by Z ⊆ V (denoted by Dsep_G(Y_i, Y_j | Z)) if and only if all undirected paths between Y_i and Y_j are d-separated by Z.
Definition (Markov blanket): the Markov blanket Mb(Y) is the minimal set of nodes which d-separates a node Y from all other nodes (i.e. it makes Y conditionally independent of all other nodes in the BN):
Mb(Y) = pa(Y) ∪ ch(Y) ∪ pa(ch(Y))
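The Markov blanket formula can be read off a DAG mechanically. A small sketch, assuming the DAG is encoded as a dict mapping each node to its set of parents (an encoding chosen here for illustration):

```python
def markov_blanket(node, parents):
    """Mb(Y) = pa(Y) ∪ ch(Y) ∪ pa(ch(Y)), i.e. parents, children,
    and co-parents (other parents of the node's children)."""
    children = {c for c, ps in parents.items() if node in ps}
    coparents = {p for c in children for p in parents[c]} - {node}
    return set(parents[node]) | children | coparents
```

On the lecture's example DAG (PA, S → H; H → T; H → C), the blanket of H is all four other nodes, while the blanket of PA is its child H plus the co-parent S.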
Are Directed Models Enough?
- Bayesian Networks are used to model asymmetric dependencies (e.g. causal)
- What if we want to model symmetric dependencies? Bidirectional effects, e.g. spatial dependencies, need undirected approaches
- Directed models cannot represent some (bidirectional) dependencies in the distributions
- What if we want to represent Y_1 ⊥ Y_3 | {Y_2, Y_4} and, at the same time, Y_2 ⊥ Y_4 | {Y_1, Y_3}?
- This cannot be done in a BN! We need an undirected model (the 4-cycle Y_1 - Y_2 - Y_3 - Y_4 - Y_1)
Markov Random Fields
- Undirected graph G = (V, E) (a.k.a. Markov Networks)
- Nodes v ∈ V represent random variables X_v (shaded = observed, empty = un-observed)
- Edges e ∈ E describe bi-directional dependencies between variables (constraints)
- Often arranged in a structure that is coherent with the data/constraints we want to model
Image Processing
- Often used in image processing to impose spatial constraints (e.g. smoothness)
- Image de-noising example: lattice Markov network (Ising model)
  - Y_i: observed value of the noisy pixel
  - X_i: unknown (unobserved) noise-free pixel value
- Can use more expressive structures, but the complexity of inference and learning can become relevant
Conditional Independence
What is the undirected equivalent of d-separation in directed models? Again it is based on node separation, although it is much simpler:
- Node subsets A, B ⊆ V are conditionally independent given C ⊆ V \ {A, B} if all paths between nodes in A and B pass through at least one of the nodes in C
- The Markov blanket of a node includes all and only its neighbors
Joint Probability Factorization
What is the undirected equivalent of the conditional probability factorization in directed models?
- We seek a product of functions defined over sets of nodes associated with some local property of the graph
- The Markov blanket tells us that nodes that are not neighbors are conditionally independent given the remainder of the nodes:
P(X_v, X_i | X_{V \ {v,i}}) = P(X_v | X_{V \ {v,i}}) P(X_i | X_{V \ {v,i}})
- The factorization should be chosen in such a way that nodes X_v and X_i are not in the same factor
- What is a well-known graph structure that includes only nodes that are pairwise connected?
Cliques
Definition (clique): a subset of nodes C in graph G such that G contains an edge between every pair of nodes in C.
Definition (maximal clique): a clique C that cannot include any further node of the graph without ceasing to be a clique.
Maximal Clique Factorization
Define X = X_1, ..., X_N as the RVs associated to the N nodes in the undirected graph G:

P(X) = (1/Z) ∏_C ψ(X_C)

- X_C: RVs associated with the nodes in the maximal clique C
- ψ(X_C): potential function over the maximal clique C
- Z: partition function ensuring normalization,
  Z = Σ_X ∏_C ψ(X_C)

The partition function is the computational bottleneck of undirected models: e.g. O(K^N) for N discrete RVs with K distinct values.
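The exponential cost of Z can be made concrete with a brute-force sketch that enumerates all K^N joint configurations; clique potentials are passed in as plain Python callables (an illustration of the definition, not a practical implementation):

```python
from itertools import product

def partition_function(cliques, K):
    """Brute-force Z = sum over all K^N states of prod_C psi(x_C).
    `cliques` is a list of (node_indices, psi) pairs, where psi maps
    the tuple of clique values to a non-negative potential.
    Runtime is O(K^N): this IS the bottleneck the slide mentions."""
    nodes = sorted({v for idx, _ in cliques for v in idx})
    Z = 0.0
    for x in product(range(K), repeat=len(nodes)):   # all K^N states
        state = dict(zip(nodes, x))
        w = 1.0
        for idx, psi in cliques:
            w *= psi(tuple(state[v] for v in idx))
        Z += w
    return Z
```

For instance, a single pairwise clique over two binary nodes with a constant potential of 2 gives Z = 4 states × 2 = 8.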
Potential Functions
- Potential functions ψ(X_C) are not probabilities! They express which configurations of the local variables are preferred
- If we restrict ourselves to strictly positive potential functions, the Hammersley-Clifford theorem provides guarantees on the distributions that can be represented by the clique factorization
Definition (Boltzmann distribution): a convenient and widely used strictly positive representation of the potential functions is
ψ(X_C) = exp{-E(X_C)}
where E(X_C) is called the energy function.
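As a concrete case, the lattice (Ising) model mentioned for image de-noising uses pairwise Boltzmann potentials; a sketch with a hypothetical coupling strength beta:

```python
import math

def ising_pair_potential(x_i, x_j, beta=1.0):
    """Boltzmann potential psi(X_C) = exp(-E(X_C)) for a pairwise clique,
    with energy E(x_i, x_j) = -beta * x_i * x_j and spins x in {-1, +1}.
    Equal neighbouring values get low energy, hence high potential:
    this is how the model encodes a smoothness preference."""
    energy = -beta * x_i * x_j
    return math.exp(-energy)
```

Note the potential is strictly positive for any finite energy, as required by Hammersley-Clifford, yet it is not a probability: it only becomes one after dividing by the partition function Z.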
Directed vs. Undirected Models: Long Story Short
From Directed To Undirected
- Straightforward in some cases
- Requires a little bit of thinking for v-structures
- Moralization, a.k.a. "marrying the parents": add an undirected edge between the parents of each node, then drop the direction of all edges
Structure Learning in Bayesian Networks

The BN Structure Learning Problem
- Observations are given for a set of fixed random variables (a data table of samples of Y_1, ..., Y_6)
- But the structure of the Bayesian Network is not specified
- How do we determine which arcs exist in the network (causal relationships)?
- Determining causal relationships between variables entails: deciding on arc presence, and directing edges
Structure Finding Approaches
- Search and score: model selection approach, searching in the space of graphs
- Constraint based: use tests of conditional independence to constrain the network
- Hybrid: model selection over constrained structures
Constraint-based Models
- Tests of conditional independence I(X_i, X_j | Z) determine edge presence (network skeleton)
- Estimate the mutual information MI(X_i, X_j | Z) and assume conditional independence if it is below a threshold, e.g. I(X_i, X_j | Z) ⟺ MI(X_i, X_j | Z) < α_cut
- Testing order is the fundamental choice for avoiding super-exponential complexity
- Level-wise testing: tests I(X_i, X_j | Z) are performed in order of increasing size of the conditioning set Z (PC algorithm by Spirtes, 1995); nodes that enter Z are chosen in the neighborhood of X_i and X_j
- Markovian dependencies determine edge orientation (DAG): deterministic rules based on the 3 basic substructures seen previously
PC Algorithm - Skeleton Identification
1. Initialize a fully connected graph G = (V, E)
2. For each edge (Y_i, Y_j) ∈ E: if I(Y_i, Y_j) then prune (Y_i, Y_j)
3. K ← 1
4. For each test of order K = |Z|:
   - For each edge (Y_i, Y_j) ∈ E:
     - Z ← set of conditioning sets of K-th order for Y_i, Y_j
     - If I(Y_i, Y_j | z) for any z ∈ Z, then prune (Y_i, Y_j)
   - K ← K + 1
5. Return G
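The loop structure above can be sketched as follows; `indep(i, j, z)` stands in for the MI-threshold test of order |z| (here an oracle you would replace with a statistical test on data):

```python
from itertools import combinations

def pc_skeleton(nodes, indep):
    """PC skeleton search (sketch): start fully connected, then prune
    edges using conditional independence tests of increasing order k."""
    adj = {v: set(nodes) - {v} for v in nodes}
    k = 0
    # Keep testing while some node still has enough neighbours to
    # build a conditioning set of size k.
    while any(len(adj[i] - {j}) >= k for i in nodes for j in adj[i]):
        for i in nodes:
            for j in list(adj[i]):
                if j not in adj[i]:
                    continue  # edge already pruned from the other side
                # Conditioning sets of order k drawn from the neighborhood
                for z in combinations(sorted(adj[i] - {j}), k):
                    if indep(i, j, set(z)):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        break
        k += 1
    return adj
```

For the chain Y_1 → Y_2 → Y_3, an oracle encoding the single independence Y_1 ⊥ Y_3 | Y_2 recovers the skeleton Y_1 - Y_2 - Y_3: the Y_1 - Y_3 edge survives the order-0 tests and is pruned at order 1.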
PC Algorithm - Order 0 Tests
- Step 1: initialize the fully connected graph
- Step 2: check unconditional independence I(Y_i, Y_j)
- Step 3: repeat the unconditional tests for all edges
PC Algorithm - Order 1 Tests
- Step 4: select an edge (Y_i, Y_j)
- Step 5: add the neighbors to the conditioning set Z
- Step 6: check independence for each z ∈ Z
- Step 7: iterate until convergence
Take Home Messages
Directed graphical models
- Represent asymmetric (causal) relationships between variables and provide a compact representation of conditional probabilities
- Difficult to assess conditional independence relationships (v-structures)
- Straightforward to incorporate prior knowledge and to interpret
Undirected graphical models
- Represent bi-directional relationships between variables (e.g. constraints)
- Factorization in terms of generic potential functions which, however, are typically not probabilities
- Easy to assess conditional independence, but difficult to interpret the encoded knowledge
- Serious computational issues associated with the computation of the normalization factor
Next Lecture
Inference in graphical models:
- Exact inference: inference on a chain, inference in tree-structured models, sum-product algorithm
- Elements of approximate inference: variational algorithms, sampling-based methods