Message Passing and Junction Tree Algorithms Kayhan Batmanghelich 1
Review 2
Review 3
Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here 1 of me 11 here (= 7+3+1) Slides adapted from Matt Gormley (2016) adapted from MacKay (2003) textbook 4
Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here (= 3+3+1) 3 here 5
Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 11 here (= 7+3+1) 7 here 3 here 6
Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here 3 here Belief: Must be 14 of us 7
Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here 3 here Belief: Must be 14 of us This wouldn't work correctly with a 'loopy' (cyclic) graph 8
Review Message from one clique C1 to another C2: multiply all incoming messages with the local factor and sum over the variables that are not shared, e.g. m_E(a, c, d) = Σ_e p(e | c, d) m_G(e) m_F(a, e). [Figure: elimination cliques with messages m_b, ..., m_h.] Eric Xing @ CMU, 2005-2015 9
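The rule on this slide can be sketched in a few lines of NumPy. Everything here is illustrative (binary variables, randomly filled toy tables, hypothetical names `p_e_given_cd`, `m_g`, `m_f`); the point is that "multiply all incoming messages with the local factor and sum over the non-shared variable" is a single `einsum`:

```python
import numpy as np

# Toy sketch of  m_E(a, c, d) = sum_e p(e | c, d) * m_g(e) * m_f(a, e)
# All variables binary; tables are random placeholders.
rng = np.random.default_rng(0)
p_e_given_cd = rng.random((2, 2, 2))   # local factor, indexed [e, c, d]
m_g = rng.random(2)                    # incoming message over e
m_f = rng.random((2, 2))               # incoming message over (a, e)

# Multiply all incoming messages with the local factor, then sum out e,
# the variable not shared with the receiving clique.
m_E = np.einsum('ecd,e,ae->acd', p_e_given_cd, m_g, m_f)

# Sanity check against an explicit loop over all assignments.
expected = np.zeros((2, 2, 2))
for a in range(2):
    for c in range(2):
        for d in range(2):
            for e in range(2):
                expected[a, c, d] += p_e_given_cd[e, c, d] * m_g[e] * m_f[a, e]
assert np.allclose(m_E, expected)
```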
Message passing (Belief Propagation) on singly connected graph 10
Remember this: Factor Graph? Random variables x1, x2, x3, x4; factors f(x1,x2), f(x1,x3), f(x2,x4). A factor graph is a graphical-model representation that unifies directed and undirected models. It is an undirected bipartite graph with two kinds of nodes: round nodes represent variables, square nodes represent factors, and there is an edge from each variable to every factor that mentions it. We are going to study message passing between nodes. 11
How General Are Factor Graphs? Factor graphs can be used to describe Markov Random Fields (undirected graphical models), i.e., log-linear models over a tuple of variables; Conditional Random Fields; and Bayesian Networks (directed graphical models). Inference treats all of these interchangeably: convert your model to a factor graph first. Pearl (1988) gave key strategies for exact inference: belief propagation, for inference on acyclic graphs, and the junction tree algorithm, for making any graph acyclic (by merging variables and factors, which blows up the runtime).
Factor Graph Notation Variables: X1, ..., X9. Factors: unary potentials such as ψ{1}, ψ{2}, ψ{3} on single variables, and higher-order potentials such as ψ{1,2}, ψ{2,3}, ψ{3,4}, ψ{1,8,9}, ψ{2,7,8}, ψ{3,6,7} on sets of variables. The joint distribution is proportional to the product of all factors. [Figure: factor graph for the sentence "time flies like an arrow".] 13
Factors are Tensors A unary factor over one variable is a vector (e.g., values for the tags v, n, p, d); a binary factor over two variables is a matrix (e.g., rows and columns over v, n, p, d); a ternary factor over three variables is a 3rd-order tensor (e.g., over the constituent labels s, vp, pp). [Figure: factor graph for "time flies like an arrow" with example factor tables.] 14
An Inference Example 15
Message Passing in Belief Propagation The variable's message (v 1, n 6, a 3) says "my other factors think I'm a noun"; the factor Ψ's message (v 6, n 1, a 3) says "but my other variables and I think you're a verb." Both of these messages judge the possible values of variable X. Their product (v 6, n 6, a 9) = belief at X = product of all messages to X. 16
Sum-Product Belief Propagation [Figure: the four quantities of BP — variable beliefs and factor beliefs, variable-to-factor messages and factor-to-variable messages — illustrated on ψ1, ψ2, ψ3 and X1, X2, X3.] 17
Sum-Product Belief Propagation Variable Belief X1 receives messages (v 0.1, n 3, p 1), (v 1, n 2, p 2), and (v 4, n 1, p 0) from its three neighboring factors ψ1, ψ2, ψ3; its belief is their pointwise product (v 0.4, n 6, p 0). 18
Sum-Product Belief Propagation Variable Message The message from X1 to ψ1 is the pointwise product of the messages from X1's other neighbors ψ2 and ψ3: (v 0.1, n 3, p 1) × (v 1, n 2, p 2) = (v 0.1, n 6, p 2). 19
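The variable-to-factor message on this slide is just a pointwise product; a minimal sketch with the slide's own numbers (ordering v, n, p):

```python
import numpy as np

# Message from X1 to psi_1: pointwise product of the messages X1
# received from its *other* neighboring factors (psi_2 and psi_3).
msg_from_psi2 = np.array([0.1, 3.0, 1.0])   # over (v, n, p)
msg_from_psi3 = np.array([1.0, 2.0, 2.0])

msg_to_psi1 = msg_from_psi2 * msg_from_psi3
# Matches the slide's (v 0.1, n 6, p 2).
assert np.allclose(msg_to_psi1, [0.1, 6.0, 2.0])
```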
Sum-Product Belief Propagation Factor Belief The belief of factor ψ1 over (X1, X3) is its table (rows p, d, n; columns v, n: 0.1, 8 / 3, 0 / 1, 1) multiplied pointwise by the incoming messages (p 4, d 1, n 0) from X1 and (v 8, n 0.2) from X3, giving 3.2, 6.4 / 24, 0 / 0, 0. 20
Sum-Product Belief Propagation Factor Message The message from ψ1 to X1 multiplies the factor table (rows p, d, n; columns v, n: 0.1, 8 / 3, 0 / 1, 1) by the incoming message (v 8, n 0.2) from X3 and sums out X3: p: 0.8 + 1.6; d: 24 + 0; n: 8 + 0.2. 22
Sum-Product Belief Propagation Factor Message For a binary factor, the factor-to-variable message is a matrix-vector product. 23
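Using the numbers from the preceding slide, the matrix-vector view of the factor-to-variable message is a one-liner:

```python
import numpy as np

# Factor->variable message: multiply the factor table by the message
# coming in on the other edge, then sum out that variable. For a
# binary factor this is exactly a matrix-vector product.
psi1 = np.array([[0.1, 8.0],    # rows: X1 in (p, d, n)
                 [3.0, 0.0],    # cols: X3 in (v, n)
                 [1.0, 1.0]])
msg_from_x3 = np.array([8.0, 0.2])   # message X3 -> psi_1

msg_to_x1 = psi1 @ msg_from_x3
# p: 0.1*8 + 8*0.2 = 2.4,  d: 3*8 + 0 = 24,  n: 1*8 + 1*0.2 = 8.2
assert np.allclose(msg_to_x1, [2.4, 24.0, 8.2])
```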
Summary of the Messages 24
Sum-Product Belief Propagation Input: a factor graph with no cycles. Output: exact marginals for each variable and factor. Algorithm: 1. Initialize the messages to the uniform distribution. 2. Choose a root node. 3. Send messages from the leaves to the root, then from the root to the leaves. 4. Compute the beliefs (unnormalized marginals). 5. Normalize beliefs and return the exact marginals. 25
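The whole algorithm fits in a short sketch on a tiny chain X1 - f12 - X2 - f23 - X3 (the factor tables are made up for illustration), with the resulting marginals checked against brute-force enumeration:

```python
import numpy as np

# Sum-product BP on a 3-variable chain, all variables binary.
f12 = np.array([[1.0, 2.0], [3.0, 1.0]])   # factor over (x1, x2)
f23 = np.array([[2.0, 1.0], [1.0, 4.0]])   # factor over (x2, x3)

# Leaves-to-root pass (root = X3). A leaf variable's message is uniform.
m_x1_f12 = np.ones(2)
m_f12_x2 = f12.T @ m_x1_f12          # sum out x1
m_x2_f23 = m_f12_x2                  # X2 forwards its only other message
m_f23_x3 = f23.T @ m_x2_f23          # sum out x2

# Root-to-leaves pass.
m_x3_f23 = np.ones(2)
m_f23_x2 = f23 @ m_x3_f23
m_x2_f12 = m_f23_x2
m_f12_x1 = f12 @ m_x2_f12

# Beliefs = product of all incoming messages; normalize to get marginals.
def normalize(v):
    return v / v.sum()

p_x1 = normalize(m_f12_x1)
p_x2 = normalize(m_f12_x2 * m_f23_x2)
p_x3 = normalize(m_f23_x3)

# Brute force over all 8 joint configurations agrees exactly.
joint = np.einsum('ab,bc->abc', f12, f23)
joint /= joint.sum()
assert np.allclose(p_x1, joint.sum(axis=(1, 2)))
assert np.allclose(p_x2, joint.sum(axis=(0, 2)))
assert np.allclose(p_x3, joint.sum(axis=(0, 1)))
```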
Sum-Product Belief Propagation [Figure, repeated for summary: variable beliefs, factor beliefs, variable-to-factor messages, and factor-to-variable messages on ψ1, ψ2, ψ3 and X1, X2, X3.] 26
(Acyclic) Belief Propagation In a factor graph with no cycles: 1. Pick any node to serve as the root. 2. Send messages from the leaves to the root. 3. Send messages from the root to the leaves. A node computes an outgoing message along an edge only after it has received incoming messages along all its other edges. [Figure: message schedule on the factor graph for "time flies like an arrow".] 28
A note on the implementation To avoid numerical precision issues, use log messages: 30
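In the log domain, products of messages become sums, and the marginalization step becomes a log-sum-exp, computed stably by shifting by the max. A minimal sketch (the `logsumexp` helper and toy tables are illustrative; a tiny entry 0.0001 stands in for 0 to keep the logarithm finite):

```python
import numpy as np

def logsumexp(a, axis):
    """Stable log(sum(exp(a))) along an axis: shift by the max first."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

# Factor->variable message in the log domain: instead of
#   mu(x) = sum_y psi(x, y) * m_in(y)
# compute  log mu(x) = logsumexp_y( log psi(x, y) + log m_in(y) ).
psi = np.array([[0.1, 8.0], [3.0, 0.0001], [1.0, 1.0]])
m_in = np.array([8.0, 0.2])

log_msg = logsumexp(np.log(psi) + np.log(m_in), axis=1)

# Agrees with the direct (linear-domain) matrix-vector product.
assert np.allclose(np.exp(log_msg), psi @ m_in)
```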
How about other queries? (MAP assignment, Evidence) 31
Example 32
The Max Product Algorithm 33
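Max-product replaces the sums in sum-product with maxes, and keeps argmax back-pointers so the most probable configuration can be decoded. A minimal sketch on a toy binary chain (factor tables made up for illustration), checked against brute force:

```python
import numpy as np

# Max-product on the chain X1 - f12 - X2 - f23 - X3.
f12 = np.array([[1.0, 2.0], [3.0, 1.0]])   # factor over (x1, x2)
f23 = np.array([[2.0, 1.0], [1.0, 4.0]])   # factor over (x2, x3)

# Forward pass toward X3, keeping argmax pointers at each max step.
m12 = f12.max(axis=0)                # best score over x1, for each x2
back_x1 = f12.argmax(axis=0)
scores = m12[:, None] * f23          # combined score over (x2, x3)
m23 = scores.max(axis=0)             # best score over x2, for each x3
back_x2 = scores.argmax(axis=0)

# Decode: pick the best x3, then follow the pointers back.
x3 = int(m23.argmax())
x2 = int(back_x2[x3])
x1 = int(back_x1[x2])

# The decoded configuration attains the maximum of the joint.
joint = np.einsum('ab,bc->abc', f12, f23)
assert joint[x1, x2, x3] == joint.max()
```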
Can I use BP in a multiply connected graph? 34
Loops: the trouble makers One needs to account for the fact that the structure of the graph changes. The junction tree algorithm deals with this by combining variables into clusters, producing a new singly connected graph whose structure remains singly connected under variable elimination. 35
Clique Graph Def (Clique Graph): A clique graph consists of a set of potentials ψ_1(X_1), ..., ψ_n(X_n), each defined on a set of variables X_i. For neighboring cliques on the graph, defined on sets of variables X_i and X_j, the intersection X_s = X_i ∩ X_j is called the separator and has a corresponding potential ψ_s(X_s). A clique graph represents the function ∏_i ψ_i(X_i) / ∏_s ψ_s(X_s). Example 36
Clique Graph Don't confuse it with a Factor Graph! Example 37
Example: probability density 38
Junction Tree Idea: form a new representation of the graph in which variables are clustered together, resulting in a graph that is singly connected over the cluster variables. Insight: the distribution can be written as a product of marginal distributions, divided by a product of the marginals of their intersections. This is not a remedy for intractability. 39
Let's learn by an example. [Figure: the student network — Coherence, Difficulty, Intelligence, Grades, SAT, Happy, Letter, Job — before and after moralization.] Moralization 40
Let's learn by an example. [Figure: the moralized student network.] Let's pick an ordering for the variable elimination: C, D, I, H, G, S, L 41
Let's learn by an example. [Figure: the graph after eliminating the first few variables in the ordering C, D, I, H, G, S, L.] The rest is obvious. 46
OK, what have we got so far? [Figure: the original student network on the left; the moralized and triangulated graph on the right.] We started with the directed graph; we now have it Moralized and Triangulated. 47
OK, what have we got so far? Def: An undirected graph is triangulated if every loop of length 4 or more has a chord. Equivalent terms are decomposable and chordal. From this definition, one may show that an undirected graph is triangulated if and only if its clique graph has a junction tree. [Figure: the moralized and triangulated student network.] 48
Let's build the Junction Tree [Figure: junction tree with cliques C,D / D,I,G / G,I,S / G,J,S,L / H,G,J and separators D; G,I; G,S; G,J.] The ordering: C, D, I, H, G, S, L 49
Pass the messages on the JT [Figure: the same junction tree — cliques C,D / D,I,G / G,I,S / G,J,S,L / H,G,J with separators D; G,I; G,S; G,J.] The ordering: C, D, I, H, G, S, L. Can you figure out the directionality? 50
The message passing protocol Cluster B is allowed to send a message to a neighbor C only after it has received messages from all neighbors except C. COLLECT DISTRIBUTE root 51
Message from one Clique to another (The Shafer-Shenoy Algorithm) μ_{B→C}(u) = Σ_{v ∈ B∖C} ψ_B(u ∪ v) ∏_{(A,B)∈E, A≠C} μ_{A→B}(u_A ∪ v_A) 52
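A minimal sketch of a Shafer-Shenoy message (toy potentials; clique B over variables (x, y), neighbor C sharing only the separator variable y): B multiplies its own potential by the messages from its other neighbors, then sums out everything not in the separator.

```python
import numpy as np

# Clique B holds a potential over (x, y); y is the separator with C.
psi_B = np.array([[0.5, 2.0],    # rows: x in (0, 1)
                  [1.5, 1.0]])   # cols: y in (0, 1)
mu_A_to_B = np.array([2.0, 1.0])   # message from another neighbor A, over x

# Multiply in the other neighbors' messages, then sum out x (not in
# the separator), leaving a message over y.
mu_B_to_C = (psi_B * mu_A_to_B[:, None]).sum(axis=0)
assert np.allclose(mu_B_to_C, [0.5 * 2 + 1.5 * 1, 2.0 * 2 + 1.0 * 1])
```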
Formal Algorithm Moralisation: marry the parents (only for directed distributions). Triangulation: ensure that every loop of length 4 or more has a chord. Junction Tree: form a junction tree from the cliques of the triangulated graph, removing any unnecessary links in a loop on the cluster graph. Algorithmically, this can be achieved by finding a maximal-weight spanning tree, with weight given by the number of variables in the separator between cliques. Alternatively, given a clique elimination order (with the lowest cliques eliminated first), one may connect each clique to a single neighboring clique later in the order. Potential Assignment: assign potentials to junction tree cliques and set the separator potentials to unity. Message Propagation. 53
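The junction-tree step can be sketched as a maximum-weight spanning tree over the cliques, with edge weight equal to separator size. Here is a small Kruskal-style sketch using the cliques from the lecture's example (the code structure is illustrative, not a full junction-tree implementation):

```python
from itertools import combinations

# Cliques of the triangulated student-network example.
cliques = [frozenset('CD'), frozenset('DIG'), frozenset('GIS'),
           frozenset('GJSL'), frozenset('HGJ')]

# Candidate edges between cliques with nonempty separators,
# heaviest separators first.
edges = sorted(((len(a & b), a, b) for a, b in combinations(cliques, 2)
                if a & b), key=lambda e: e[0], reverse=True)

# Union-find for Kruskal's algorithm.
parent = {c: c for c in cliques}
def find(c):
    while parent[c] != c:
        c = parent[c]
    return c

tree = []
for w, a, b in edges:
    ra, rb = find(a), find(b)
    if ra != rb:            # adding this edge creates no cycle
        parent[ra] = rb
        tree.append((a, b, w))

assert len(tree) == len(cliques) - 1   # a spanning tree over the cliques
```

On this example the result reproduces the lecture's junction tree: C,D joins D,I,G over separator D, and the chain D,I,G / G,I,S / G,J,S,L / H,G,J is linked by the two-variable separators.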
Some Facts about BP BP is exact on trees. If BP converges, it has reached a local minimum of an objective function (the Bethe free energy; Yedidia et al. 2000, Heskes 2002), which is often a good approximation. If it converges, convergence is fast near the fixed point. Many exciting applications: error-correcting decoding (MacKay, Yedidia, McEliece, Frey), vision (Freeman, Weiss), bioinformatics (Weiss), constraint satisfaction problems (Dechter), game theory (Kearns), ... 54
Summary of the Network Zoo [Figure: example factor graph with variables x1, ..., x4 and factors f(x1,x2), f(x1,x3), f(x2,x4).] UGM and DGM: used to represent families of probability distributions, with a clear definition of arrows and circles. Factor Graph: a way to represent the factorization of both UGMs and DGMs; a bipartite graph; more like a data structure; not for reading off independencies. Clique graph / Junction Tree: a data structure used for exact inference and message passing; nodes are clusters of variables; not for reading off independencies. 55