Chapter 04: Exact Inference in Bayesian Networks


1 LEARNING AND INFERENCE IN GRAPHICAL MODELS Chapter 04: Exact Inference in Bayesian Networks Dr. Martin Lauer, University of Freiburg, Machine Learning Lab / Karlsruhe Institute of Technology, Institute of Measurement and Control Systems. Learning and Inference in Graphical Models. Chapter 04 p. 1/23

2 References for this chapter
- Christopher M. Bishop, Pattern Recognition and Machine Learning, ch. 8, Springer, 2006
- Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, ch. 14, Prentice Hall, 2003
- Judea Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1989
- Steffen L. Lauritzen and David J. Spiegelhalter, Local computations with probabilities on graphical structures and their applications to expert systems, Journal of the Royal Statistical Society, Series B, vol. 50, no. 2, pp. 157-224, 1988
- Brendan J. Frey and David J. C. MacKay, A Revolution: Belief Propagation in Graphs with Cycles, in: Advances in Neural Information Processing Systems (NIPS), vol. 10, pp. 479-485, 1997, http://books.nips.cc/papers/files/nips10/0479.pdf

Learning and Inference in Graphical Models. Chapter 04 p. 2/23

3 Inference

Given a graphical model with unobserved nodes U and observed nodes O, we want to draw conclusions about the distribution of the unobserved nodes:
- calculate the single-node marginal distribution $p(X \mid O)$ with $X \in U$
- calculate the maximum-a-posteriori estimator $\arg\max_u p(U = u \mid O = o)$

Exact inference is not always possible. We focus on the easier cases: polytrees. A polytree is a directed acyclic graph whose underlying undirected graph is acyclic.

[Figure: an example and a counterexample of a polytree]

Learning and Inference in Graphical Models. Chapter 04 p. 3/23
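
Since everything that follows assumes a polytree, it can be handy to check the property programmatically. A minimal sketch (my own illustration, not lecture code): treat the DAG's edges as undirected and detect a cycle with union-find; the edge list below is my reading of the network drawn on the following slides.

```python
def is_polytree(num_nodes, directed_edges):
    """A DAG is a polytree iff its underlying undirected graph is
    acyclic. Union-find: an edge whose endpoints already lie in the
    same component would close an undirected cycle."""
    parent = list(range(num_nodes))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in directed_edges:  # edge direction is irrelevant here
        ra, rb = find(a), find(b)
        if ra == rb:
            return False         # undirected cycle found
        parent[ra] = rb
    return True

# Nodes U1,U2,U3,X,U4,U5,O1,U6,O2,U7 as indices 0..9 (assumed reading
# of the slides' figure):
edges = [(0, 1), (1, 3), (2, 3), (3, 5), (4, 5), (4, 6), (5, 7), (5, 8), (6, 9)]
print(is_polytree(10, edges))  # True: the example network is a polytree
```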

4 Marginalization

Calculate $p(X = x \mid O_1 = o_1, O_2 = o_2)$ for the network on the right.

$$p(X = x \mid O_1 = o_1, O_2 = o_2) = \frac{p(X = x, O_1 = o_1, O_2 = o_2)}{p(O_1 = o_1, O_2 = o_2)}$$

with

$$p(O_1 = o_1, O_2 = o_2) = \int p(X = x, O_1 = o_1, O_2 = o_2) \, dx$$

$$p(X = x, O_1 = o_1, O_2 = o_2) = \int \cdots \int p(X = x, O_1 = o_1, O_2 = o_2, U_1 = u_1, \dots, U_7 = u_7) \, du_1 \, du_2 \, du_3 \, du_4 \, du_5 \, du_6 \, du_7$$

[Figure: Bayesian network with nodes $U_1, U_2, U_3, X, U_4, U_5, O_1, U_6, O_2, U_7$]

continue on blackboard

Learning and Inference in Graphical Models. Chapter 04 p. 4/23
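
For discrete variables the integrals become sums, and the marginal can be computed by brute-force enumeration of the joint. A toy sketch of that idea (my own example with a hypothetical three-node chain U -> X -> O and made-up CPT values, not the network on the slide):

```python
# Hypothetical binary chain U -> X -> O; the CPT values are made up
# purely to illustrate marginalization by enumeration.
p_u = {0: 0.6, 1: 0.4}                    # p(U = u)
p_x_given_u = {(0, 0): 0.9, (1, 0): 0.1,  # p(X = x | U = u), keyed (x, u)
               (0, 1): 0.3, (1, 1): 0.7}
p_o_given_x = {(0, 0): 0.8, (1, 0): 0.2,  # p(O = o | X = x), keyed (o, x)
               (0, 1): 0.25, (1, 1): 0.75}

def joint(u, x, o):
    return p_u[u] * p_x_given_u[(x, u)] * p_o_given_x[(o, x)]

o_obs = 1  # observed evidence O = 1
# p(X = x, O = 1): sum the joint over the unobserved variable U
p_x_and_o = {x: sum(joint(u, x, o_obs) for u in (0, 1)) for x in (0, 1)}
# normalize by p(O = 1) to obtain p(X = x | O = 1)
p_o = sum(p_x_and_o.values())
print({x: v / p_o for x, v in p_x_and_o.items()})
```

This is exactly the computation that belief propagation organizes efficiently along the tree; the exhaustive sum grows exponentially with the number of unobserved nodes.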

5 Factor graph

A factor graph is a bipartite graph with two kinds of nodes:
- variable nodes that model random variables, like in a Bayesian network
- factor nodes that model a probabilistic relationship between variable nodes; each factor node is assigned a factor, i.e. a function that models the stochastic relationship

Variable nodes and factor nodes are connected by undirected links. For each Bayesian polytree we can create a factor graph as follows:
- the set of variable nodes is taken from the nodes of the Bayesian polytree
- for each factor $p(X \mid \mathrm{Pred}(X))$ in the Bayesian network we create a new factor node $f$
- we connect $X$ and $\mathrm{Pred}(X)$ with $f$
- we assign $f(x, y_1, \dots, y_n) = p(X = x \mid \mathrm{Pred}(X) = (y_1, \dots, y_n))$

Hence, the joint probability of the Bayesian polytree is equal to the product of all factors of the factor tree.

Learning and Inference in Graphical Models. Chapter 04 p. 5/23
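
A sketch of how this construction can look in code (data layout of my own choosing, not from the lecture): each factor stores the variables it touches and a function over their values, and the product of all factors gives the joint.

```python
class FactorGraph:
    """Bipartite graph: variable names on one side, factors on the
    other. Each factor is a (variables, function) pair."""

    def __init__(self):
        self.variables = set()
        self.factors = []          # list of (vars_tuple, fn) pairs

    def add_factor(self, variables, fn):
        self.variables.update(variables)
        self.factors.append((tuple(variables), fn))

    def joint(self, assignment):
        """Product of all factors = joint probability of the polytree."""
        result = 1.0
        for variables, fn in self.factors:
            result *= fn(*(assignment[v] for v in variables))
        return result

# One factor per node X of the Bayesian network, with
# f(x, y1, ..., yn) = p(X = x | Pred(X) = (y1, ..., yn)).
# Hypothetical two-node example U1 -> U2 with made-up numbers:
fg = FactorGraph()
fg.add_factor(["u1"], lambda u1: 0.3 if u1 == 1 else 0.7)   # f1 = p(U1)
fg.add_factor(["u1", "u2"],                                 # f2 = p(U2 | U1)
              lambda u1, u2: [[0.9, 0.1], [0.2, 0.8]][u1][u2])
print(fg.joint({"u1": 1, "u2": 0}))  # 0.3 * 0.2 = 0.06
```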

6 Factor graphs

Example: factor graph for the Bayesian network on the right

$$f_1(u_1) = p(U_1 = u_1)$$
$$f_2(u_1, u_2) = p(U_2 = u_2 \mid U_1 = u_1)$$
$$f_3(u_3) = p(U_3 = u_3)$$
$$f_4(u_2, u_3, x) = p(X = x \mid U_2 = u_2, U_3 = u_3)$$
$$f_5(u_4) = p(U_4 = u_4)$$
$$f_6(x, u_4, u_5) = p(U_5 = u_5 \mid X = x, U_4 = u_4)$$
$$f_7(u_4, o_1) = p(O_1 = o_1 \mid U_4 = u_4)$$
$$f_8(u_5, u_6) = p(U_6 = u_6 \mid U_5 = u_5)$$
$$f_9(u_5, o_2) = p(O_2 = o_2 \mid U_5 = u_5)$$
$$f_{10}(o_1, u_7) = p(U_7 = u_7 \mid O_1 = o_1)$$

[Figure: the factor graph with variable nodes $U_1, U_2, U_3, X, U_4, U_5, O_1, U_6, O_2, U_7$ and factor nodes $f_1, \dots, f_{10}$]

Learning and Inference in Graphical Models. Chapter 04 p. 6/23

7 Marginalization on factor graphs

Task: calculate

$$I(x) = \int \cdots \int f_1(u_1) f_2(u_1, u_2) f_3(u_3) f_4(u_2, u_3, x) f_5(u_4) f_6(x, u_4, u_5) f_7(u_4, o_1) f_8(u_5, u_6) f_9(u_5, o_2) f_{10}(o_1, u_7) \, du_1 \, du_2 \, du_3 \, du_4 \, du_5 \, du_6 \, du_7$$

Observations:
- the factor graph is a tree with root $X$
- $I(x)$ can be split into two large factors

$$m_{f_4 \to X}(x) = \int f_1(u_1) f_2(u_1, u_2) f_3(u_3) f_4(u_2, u_3, x) \, du_1 \, du_2 \, du_3$$

$$m_{f_6 \to X}(x) = \int f_5(u_4) f_6(x, u_4, u_5) f_7(u_4, o_1) f_8(u_5, u_6) f_9(u_5, o_2) f_{10}(o_1, u_7) \, du_4 \, du_5 \, du_6 \, du_7$$

[Figure: the factor graph]

Learning and Inference in Graphical Models. Chapter 04 p. 7/23

8 Marginalization on factor graphs

In $m_{f_4 \to X}$ we can factor out $f_3$ and $f_4$:

$$m_{f_4 \to X}(x) = \int \int \underbrace{\left( \int f_1(u_1) f_2(u_1, u_2) \, du_1 \right)}_{=: m_{f_2 \to U_2}(u_2)} \underbrace{f_3(u_3)}_{=: m_{f_3 \to U_3}(u_3)} f_4(u_2, u_3, x) \, du_2 \, du_3$$

and rewrite $m_{f_2 \to U_2}(u_2)$ as

$$m_{f_2 \to U_2}(u_2) = \int \underbrace{f_1(u_1)}_{=: m_{f_1 \to U_1}(u_1)} f_2(u_1, u_2) \, du_1$$

Observations:
- the calculation can be split along the branches of the tree
- the leaf nodes can serve as starting points for the calculation
- only multiplication and integration/summation occur
- intermediate results can be interpreted as messages sent from one node to its neighbors

[Figure: the branch of the factor graph above $X$, with nodes $f_1, U_1, f_2, U_2, f_3, U_3, f_4$]

Learning and Inference in Graphical Models. Chapter 04 p. 8/23

9 Marginalization on factor graphs

$m_{f_6 \to X}(x)$ can be split in a similar manner:

$$m_{f_6 \to X}(x) = \int \int f_6(x, u_4, u_5) \, m_{U_4 \to f_6}(u_4) \, m_{U_5 \to f_6}(u_5) \, du_4 \, du_5$$
$$m_{U_4 \to f_6}(u_4) = m_{f_5 \to U_4}(u_4) \cdot m_{f_7 \to U_4}(u_4)$$
$$m_{U_5 \to f_6}(u_5) = m_{f_8 \to U_5}(u_5) \cdot m_{f_9 \to U_5}(u_5)$$
$$m_{f_8 \to U_5}(u_5) = \int f_8(u_5, u_6) \, m_{U_6 \to f_8}(u_6) \, du_6$$
$$m_{U_6 \to f_8}(u_6) = 1$$
$$m_{f_9 \to U_5}(u_5) = f_9(u_5, o_2) \quad \text{with observed } o_2$$

If we want to extend the procedure to observed nodes, we could also argue

$$m_{f_9 \to U_5}(u_5) = \int f_9(u_5, o_2') \, m_{O_2 \to f_9}(o_2') \, do_2'$$
$$m_{O_2 \to f_9}(o_2') = \delta(o_2' - o_2)$$

where $\delta$ is the Dirac distribution.

[Figure: the branch of the factor graph below $X$, with nodes $f_5, U_4, f_6, f_7, U_5, O_1, f_8, f_9, f_{10}, U_6, O_2, U_7$]

Learning and Inference in Graphical Models. Chapter 04 p. 9/23

10 Side topic: Dirac distribution

The Dirac $\delta$ is a distribution that can be used to model discrete distributions in continuous space:

$$\delta(x) = \begin{cases} 0 & \text{if } x \neq 0 \\ \infty & \text{if } x = 0 \end{cases} \quad \text{so that} \quad \int \delta(x) \, dx = 1$$

Examples:
- if $X$ is distributed w.r.t. the Dirac distribution, $X$ can only take the value 0
- if $Y$ is distributed w.r.t. $0.3 \, \delta(y - 2) + 0.7 \, \delta(y + 5.1)$, $Y$ will take the value 2 with probability 0.3 and $-5.1$ with probability 0.7

Learning and Inference in Graphical Models. Chapter 04 p. 10/23
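
The property of $\delta$ that the observed-node messages rely on is the standard sifting identity (not spelled out on the slide); it shows why the message $m_{O_2 \to f_9}(o_2') = \delta(o_2' - o_2)$ from slide 9 collapses the integral:

```latex
\int f(x)\,\delta(x - a)\,dx = f(a)
\quad\Longrightarrow\quad
m_{f_9 \to U_5}(u_5)
  = \int f_9(u_5, o_2')\,\delta(o_2' - o_2)\,do_2'
  = f_9(u_5, o_2)
```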

11 Belief propagation

The example motivates a generic algorithm known as the sum-product algorithm or belief propagation (Pearl, 1989; Lauritzen and Spiegelhalter, 1988):
- factor nodes $f$ generate messages and send them to variable nodes $V$: $m_{f \to V}(v)$
- variable nodes $V$ generate messages and send them to factor nodes $f$: $m_{V \to f}(v)$
- messages are like distributions (but not necessarily normalized)
- a message from a node $n$ to a neighboring node $n'$ can be generated as soon as $n$ has received messages from all its neighbors except $n'$
- hence, the method can start at the leaf nodes and follow the branches of the tree until the node of interest is met (dynamic programming principle)

Learning and Inference in Graphical Models. Chapter 04 p. 11/23

12 Belief propagation

How are the messages created?

Messages from unobserved variable nodes to factor nodes:
$$m_{X \to f}(x) = \prod_i m_{f_i \to X}(x)$$
if $f, f_1, \dots, f_n$ are the neighbors of $X$.

Messages from observed variable nodes to factor nodes:
$$m_{X \to f}(x) = \delta(x - x^*) \prod_i m_{f_i \to X}(x)$$
if $f, f_1, \dots, f_n$ are the neighbors of $X$ and $x^*$ is the observed value at $X$.

Messages from factor nodes to variable nodes:
$$m_{f \to X}(x) = \int \cdots \int f(x, y_1, \dots, y_n) \prod_i m_{Y_i \to f}(y_i) \, dy_1 \cdots dy_n$$
if $X, Y_1, \dots, Y_n$ are the neighbors of $f$.

Learning and Inference in Graphical Models. Chapter 04 p. 12/23
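
A compact sketch of these message rules for the categorical case (my own implementation, not code from the lecture): messages are dictionaries over a variable's states, variable-to-factor messages multiply incoming factor messages, factor-to-variable messages sum the factor times incoming messages over all other variables, and evidence is realized by an indicator in place of the Dirac $\delta$.

```python
import itertools

def sum_product_marginal(domains, factors, query, evidence=None):
    """Sum-product on a tree-structured factor graph.
    domains:  {var: list of states}
    factors:  list of (vars_tuple, table), table maps a tuple of
              states (in vars_tuple order) to the factor value
    query:    variable whose marginal we want
    evidence: {var: observed state}
    Messages are computed recursively from the leaves toward the
    query node (the dynamic-programming schedule of the slides)."""
    evidence = evidence or {}

    def msg_var_to_factor(x, f):
        # product of messages from all neighboring factors except f,
        # times an indicator (discrete Dirac) if x is observed
        out = {s: 1.0 for s in domains[x]}
        for g in factors:
            if g is not f and x in g[0]:
                m = msg_factor_to_var(g, x)
                out = {s: out[s] * m[s] for s in domains[x]}
        if x in evidence:
            out = {s: (out[s] if s == evidence[x] else 0.0) for s in domains[x]}
        return out

    def msg_factor_to_var(f, x):
        # sum over all other variables of factor * incoming messages
        fvars, table = f
        others = [v for v in fvars if v != x]
        incoming = {v: msg_var_to_factor(v, f) for v in others}
        out = {s: 0.0 for s in domains[x]}
        for combo in itertools.product(*(domains[v] for v in others)):
            assign = dict(zip(others, combo))
            for s in domains[x]:
                assign[x] = s
                term = table[tuple(assign[v] for v in fvars)]
                for v in others:
                    term *= incoming[v][assign[v]]
                out[s] += term
        return out

    # marginal = product of all messages arriving at the query node
    unnorm = {s: 1.0 for s in domains[query]}
    for f in factors:
        if query in f[0]:
            m = msg_factor_to_var(f, query)
            unnorm = {s: unnorm[s] * m[s] for s in domains[query]}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

# Usage on the hypothetical chain U -> X -> O from the earlier sketch;
# it reproduces the brute-force posterior p(X | O = 1).
domains = {"u": [0, 1], "x": [0, 1], "o": [0, 1]}
factors = [(("u",), {(0,): 0.6, (1,): 0.4}),
           (("u", "x"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.7}),
           (("x", "o"), {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.25, (1, 1): 0.75})]
print(sum_product_marginal(domains, factors, "x", evidence={"o": 1}))
```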

13 Belief propagation

Example ("exam problem"): $C$ is the level of comprehension, $M$ the mental mood, $E$ the result of the exam, with prior tables for $P(C = c)$ and $P(M = m)$ and

$$P(E = e \mid C = c, M = m) = \begin{cases} \dots & \text{if } e = c + m \\ \dots & \text{if } e = c \\ \dots & \text{if } e = c - m \\ 0 & \text{otherwise} \end{cases}$$

[Tables: the numeric entries of $P(C = c)$, $P(M = m)$ and the three case probabilities are not recoverable from the transcription]

Apply belief propagation to calculate $P(C \mid E)$ for $E = 2$. → blackboard

Learning and Inference in Graphical Models. Chapter 04 p. 13/23
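
To make the blackboard exercise concrete, here is a tiny brute-force computation of $P(C \mid E = 2)$ under assumed numbers (the slide's actual tables are lost, so the uniform priors and the case probabilities 1/2, 1/4, 1/4 below are hypothetical stand-ins):

```python
# Hypothetical stand-in for the exam problem: assumed uniform priors on
# C and M, and an assumed conditional that puts mass 1/2 on e = c + m
# and 1/4 each on e = c and e = c - m.
C_states, M_states = [1, 2, 3], [-1, 0, 1]
p_c = {c: 1.0 / 3 for c in C_states}   # assumed prior P(C = c)
p_m = {m: 1.0 / 3 for m in M_states}   # assumed prior P(M = m)

def p_e_given_cm(e, c, m):
    p = 0.0
    if e == c + m: p += 0.5
    if e == c:     p += 0.25
    if e == c - m: p += 0.25
    return p                           # the cases coincide when m == 0

e_obs = 2
# p(C = c, E = 2): marginalize the mood M out of the joint
joint_c = {c: sum(p_c[c] * p_m[m] * p_e_given_cm(e_obs, c, m)
                  for m in M_states) for c in C_states}
z = sum(joint_c.values())              # = p(E = 2)
print({c: round(v / z, 4) for c, v in joint_c.items()})
# {1: 0.25, 2: 0.5, 3: 0.25} under the assumed tables
```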

14 Belief propagation

Belief propagation works if either
- all distributions are categorical (e.g. the exam example), or
- all distributions are conjugate, or
- all distributions are Gaussian and the variables depend linearly, i.e.

$$X \mid Y_1, \dots, Y_n \sim \mathcal{N}(c_1 Y_1 + \dots + c_n Y_n, \sigma^2)$$

with fixed values $c_1, \dots, c_n, \sigma^2$. Otherwise, the integrals might become intractable analytically.

Gauss-linear example: the exam problem with

$$C \sim \mathcal{N}(3, 4), \quad M \sim \mathcal{N}(0, 1), \quad E \mid C, M \sim \mathcal{N}(C + M, 1)$$

→ blackboard/homework

Learning and Inference in Graphical Models. Chapter 04 p. 14/23
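
In the Gauss-linear case the messages stay Gaussian, so the homework result can be checked in closed form. A sketch of my own worked solution under the stated model (not the lecture's derivation): $E = C + M + \varepsilon$ with independent Gaussian terms, hence $(C, E)$ is jointly Gaussian and conditioning on $E = 2$ is a one-line formula.

```python
# Gauss-linear exam problem: C ~ N(3, 4), M ~ N(0, 1), E | C, M ~ N(C + M, 1).
mu_c, var_c = 3.0, 4.0           # prior on C
var_m, var_noise = 1.0, 1.0      # mood variance, exam noise variance

mu_e = mu_c + 0.0                # E[E] = E[C] + E[M]
var_e = var_c + var_m + var_noise  # Var(E) = 4 + 1 + 1 = 6
cov_ce = var_c                   # Cov(C, E) = Var(C), since M and the noise
                                 # are independent of C

e_obs = 2.0
post_mean = mu_c + cov_ce / var_e * (e_obs - mu_e)  # 3 + (4/6)(2 - 3) = 7/3
post_var = var_c - cov_ce ** 2 / var_e              # 4 - 16/6 = 4/3
print(post_mean, post_var)       # posterior p(C | E = 2) = N(7/3, 4/3)
```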

15 MAP estimator

Second task: calculate the maximum-a-posteriori estimator

$$\arg\max_u p(U = u \mid O = o)$$

Again, we focus on polytrees only. Calculate the MAP estimator for the model on the right:

$$\arg\max_u p(U = u \mid O = o) = \arg\max_u \frac{p(U = u, O = o)}{p(O = o)} = \arg\max_u p(U = u, O = o) = \arg\max_u \log p(U = u, O = o)$$

[Figure: Bayesian network with nodes $U_1, \dots, U_8$, $O_1$, $O_2$]

Learning and Inference in Graphical Models. Chapter 04 p. 15/23

16 MAP estimator logp(u = u,o = o) = log ( f(u i,pred(u i )) i f(o j,pred(o j )) ) = i j logf(u i,pred(u i ))+ logf(o j,pred(o j )) j choose one node as root node (e.g. U 8 ): max logp(u = u,o = o) u,...,u 8 ( = max mf4 U 8 (u 8 )+m f6 U 8 (u 8 ) ) u 8 wherem f4 U 8 contains all terms reated tof,...,f 4 andm f6 U 8 contains all terms related tof 5,...,f 0 f U f 2 f 3 U 2 U 3 f 4 f 5 U 8 U 4 f 6 f 7 U 6 U 5 O f 8 f 9 f 0 O 2 U 7 Learning and Inference in Graphical Models. Chapter 04 p. 6/23

17 MAP estimator

$$m_{f_6 \to U_8}(u_8) = \max_{u_4, u_5} \big( \log f_6(u_4, u_5, u_8) + m_{U_4 \to f_6}(u_4) + m_{U_5 \to f_6}(u_5) \big)$$
$$m_{U_5 \to f_6}(u_5) = m_{f_8 \to U_5}(u_5) + m_{f_9 \to U_5}(u_5)$$
$$m_{f_8 \to U_5}(u_5) = \max_{u_6} \big( \log f_8(u_5, u_6) + m_{U_6 \to f_8}(u_6) \big)$$
$$m_{U_6 \to f_8}(u_6) = 0$$
$$m_{f_9 \to U_5}(u_5) = \log f_9(u_5, o_2) = \max_{o_2'} \big( \log f_9(u_5, o_2') + m_{O_2 \to f_9}(o_2') \big)$$
$$m_{O_2 \to f_9}(o_2') = \begin{cases} 0 & \text{if } o_2' = o_2 \\ -\infty & \text{otherwise} \end{cases}$$

[Figure: the branch of the factor graph below $U_8$, with nodes $f_5, U_4, f_6, f_7, U_5, O_1, f_8, f_9, f_{10}, U_6, O_2, U_7$]

Learning and Inference in Graphical Models. Chapter 04 p. 17/23

18 MAP estimator

$$m_{U_4 \to f_6}(u_4) = m_{f_5 \to U_4}(u_4) + m_{f_7 \to U_4}(u_4)$$
$$m_{f_5 \to U_4}(u_4) = \log f_5(u_4)$$
$$m_{f_7 \to U_4}(u_4) = \max_{o_1'} \big( \log f_7(u_4, o_1') + m_{O_1 \to f_7}(o_1') \big) = \log f_7(u_4, o_1) + m_{f_{10} \to O_1}(o_1)$$
$$m_{O_1 \to f_7}(o_1') = m_{f_{10} \to O_1}(o_1') + \begin{cases} 0 & \text{if } o_1' = o_1 \\ -\infty & \text{otherwise} \end{cases}$$
$$m_{f_{10} \to O_1}(o_1') = \max_{u_7} \big( \log f_{10}(o_1', u_7) + m_{U_7 \to f_{10}}(u_7) \big)$$
$$m_{U_7 \to f_{10}}(u_7) = 0$$

[Figure: the branch of the factor graph around $U_4$, $O_1$, and $U_7$]

Learning and Inference in Graphical Models. Chapter 04 p. 18/23

19 Max-sum algorithm

The example motivates a generic algorithm to calculate $\max_u \log p(U = u, O = o)$, known as the max-sum algorithm:
- factor nodes $f$ generate messages and send them to variable nodes $V$: $m_{f \to V}(v)$
- variable nodes $V$ generate messages and send them to factor nodes $f$: $m_{V \to f}(v)$
- messages are functions of one variable
- a message from a node $n$ to a neighboring node $n'$ can be generated as soon as $n$ has received messages from all its neighbors except $n'$
- hence, the method can start at the leaf nodes and follow the branches of the tree until the node of interest is met (dynamic programming principle)

Learning and Inference in Graphical Models. Chapter 04 p. 19/23

20 Max-sum algorithm

How are the messages created?

Messages from unobserved variable nodes to factor nodes:
$$m_{X \to f}(x) = \sum_i m_{f_i \to X}(x) + 0$$
if $f, f_1, \dots, f_n$ are the neighbors of $X$ (the constant 0 takes the place of the evidence term in the observed case below).

Messages from observed variable nodes to factor nodes:
$$m_{X \to f}(x) = \sum_i m_{f_i \to X}(x) + \begin{cases} 0 & \text{if } x = x^* \\ -\infty & \text{otherwise} \end{cases}$$
if $f, f_1, \dots, f_n$ are the neighbors of $X$ and $x^*$ is the observed value at $X$.

Messages from factor nodes to variable nodes:
$$m_{f \to X}(x) = \max_{y_1, \dots, y_n} \Big( \log f(x, y_1, \dots, y_n) + \sum_i m_{Y_i \to f}(y_i) \Big)$$
if $X, Y_1, \dots, Y_n$ are the neighbors of $f$.

Learning and Inference in Graphical Models. Chapter 04 p. 20/23

21 Max-sum algorithm

How do we calculate $\arg\max_u \log p(U = u, O = o)$ with the max-sum algorithm? Basic idea: backtracking.
- in each maximization step we record the maximizing value of each variable
- after having calculated the maximum over the whole tree, we backtrack through all branches following the recorded values

Example: chain $U_1 - f_1 - U_2 - f_2 - U_3$ with binary variables $U_1, U_2, U_3$ and factor tables $f_1(u_1, u_2)$ and $f_2(u_2, u_3)$ [the numeric table entries are not recoverable from the transcription]. → blackboard

Learning and Inference in Graphical Models. Chapter 04 p. 21/23
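
A sketch of max-sum with backtracking on this binary chain (my own code; since the slide's table entries are lost, the factor values below are made-up stand-ins): messages flow from $U_3$ toward the root $U_1$, each max step records its argmax, and backtracking reads the maximizing configuration off those records.

```python
import math

# Binary chain U1 - f1 - U2 - f2 - U3; table values are hypothetical
# stand-ins for the lost slide entries, indexed as f[u][u_next].
f1 = [[0.4, 0.1], [0.2, 0.3]]    # f1(u1, u2)
f2 = [[0.25, 0.5], [0.15, 0.1]]  # f2(u2, u3)

# Message m_{f2 -> U2}(u2) = max_{u3} log f2(u2, u3), recording argmax u3.
m_f2_u2, best_u3 = [], []
for u2 in (0, 1):
    scores = [math.log(f2[u2][u3]) for u3 in (0, 1)]
    m_f2_u2.append(max(scores))
    best_u3.append(scores.index(max(scores)))

# Message m_{f1 -> U1}(u1) = max_{u2} (log f1(u1, u2) + m_{f2 -> U2}(u2)).
m_f1_u1, best_u2 = [], []
for u1 in (0, 1):
    scores = [math.log(f1[u1][u2]) + m_f2_u2[u2] for u2 in (0, 1)]
    m_f1_u1.append(max(scores))
    best_u2.append(scores.index(max(scores)))

# Maximize at the root, then backtrack along the recorded argmax values.
u1 = m_f1_u1.index(max(m_f1_u1))
u2 = best_u2[u1]
u3 = best_u3[u2]
print((u1, u2, u3), math.exp(max(m_f1_u1)))  # MAP configuration, its probability
```

With these stand-in tables the result is the configuration $(0, 0, 1)$ with joint value $0.4 \cdot 0.5 = 0.2$, which matches exhaustive enumeration of all eight configurations.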

22 Non-polytrees

Can we apply max-sum or sum-product also in non-polytree structures? In general: no. Alternatives:
- loopy belief propagation (Frey and MacKay, 1998)
- EM/ECM algorithm
- variational methods
- Monte Carlo methods

Learning and Inference in Graphical Models. Chapter 04 p. 22/23

23 Summary
- sum-product algorithm (belief propagation)
- max-sum algorithm

Learning and Inference in Graphical Models. Chapter 04 p. 23/23
