Lecture 4: Probability Propagation in Singly Connected Bayesian Networks


Singly Connected Networks

So far we have looked at several cases of probability propagation in networks. We now want to formalise the methods to produce a general purpose algorithm. We will restrict the treatment to singly connected networks, which are those that have at most one (undirected) path between any two variables. The reason for this restriction, which will become obvious in due course, is to ensure that the algorithm terminates. As a further restriction we will allow nodes to have at most two parents. This is primarily to keep the algebra simple; it is possible to extend the treatment to cope with any number of parents.

Multiple Parents

(Figure: a node with multiple parents, a singly connected network and a multiply connected network.)

Up until now our networks have been trees, but in general they need not be. In particular, we need to cope with the possibility of multiple parents. Multiple parents can be thought of as representing different possible causes of an outcome. For example, the eyes in a picture could be caused by other image features, such as other animals with similarly shaped eyes, or by other objects entirely sharing the same geometric model that we are using. Suppose that in a given application we have a database containing pictures of owls and cats. The intrinsic characteristics of the eyes may be similar, and so there could be two causes for the E variable.

Unfortunately, with multiple parents our conditional probabilities become more complex. For the eye node we have to consider the conditional distribution P(E|W&C), which is a conditional probability table with an individual probability value for each possible combination of w_j & c_i and each e_k. The link matrix for multiple parents now takes the form:

    P(E|W&C) = | P(e_1|w_1&c_1)  P(e_1|w_1&c_2)  P(e_1|w_2&c_1)  P(e_1|w_2&c_2) |
               | P(e_2|w_1&c_1)  P(e_2|w_1&c_2)  P(e_2|w_2&c_1)  P(e_2|w_2&c_2) |
               | P(e_3|w_1&c_1)  P(e_3|w_1&c_2)  P(e_3|w_2&c_1)  P(e_3|w_2&c_2) |

We would, of course, not expect any cases to occur in which both W and C are true in this particular example. However, we can still treat C and W as independent events for which the special case w_1&c_1, implying both an owl and a cat, never occurs. Assuming that w_1&c_1 never occurs in our data, we need to set the values P(e_1|w_1&c_1), P(e_2|w_1&c_1) and P(e_3|w_1&c_1) equal to 1/3, though the exact choice is not important as these entries will never affect any λ or π message.
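To make the shape of this link matrix concrete, here is a minimal sketch in Python. The numbers are invented for illustration; only the structure comes from the notes: one row per state of E, one column per joint parent state, and every column a distribution over E.

```python
import numpy as np

# Link matrix P(E | W & C) for a child E with 3 states and two binary parents.
# Columns are indexed by the joint parent states in the order
# (w1,c1), (w1,c2), (w2,c1), (w2,c2); rows by e1, e2, e3.
# The numbers below are purely illustrative.
P_E_given_WC = np.array([
    [1/3, 0.8, 0.7, 0.1],   # P(e1 | w, c)
    [1/3, 0.1, 0.2, 0.3],   # P(e2 | w, c)
    [1/3, 0.1, 0.1, 0.6],   # P(e3 | w, c)
])

# Each column is a distribution over E for one joint parent state, so every
# column must sum to 1. The "impossible" state (w1,c1) gets the
# uninformative column [1/3, 1/3, 1/3] as in the notes.
assert np.allclose(P_E_given_WC.sum(axis=0), 1.0)
```

Storing the table with one column per joint parent state keeps the π and λ calculations that follow as ordinary matrix-vector products.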

π messages

We noted in the last lecture that it is important to ensure that when we pass evidence around a tree we do not count it more than once. When we calculate the π messages that will be sent to E from C and W, we do not use the λ messages that E has sent to C and W. Thus, for example, if π_E(C) is the π message that C sends to E, it only takes into account the prior evidence for C and the λ message sent to C from the node F. We can see that it would be wrong to use the λ message from E: if we did so we would include it twice in calculating the overall evidence for E, and this would bias our decision in favour (in this case) of the nodes S and D. Of course, if we want to calculate the probability of an owl or a cat using all the evidence available, then we do need to use the λ evidence from E to calculate the final probability distribution over W and/or C.

To make this quite clear, we reiterate the notational conventions introduced in the last lectures. P(C) will be used only for the final probability distribution over node C after all evidence has been propagated. π_E(C) is the evidence for node C excluding any λ evidence from E; it is called the π message from C to E. Remembering that π messages need not be normalised, we can write the scalar equations:

    P(C) = α π_E(C) λ_E(C)
    π_E(C) = P(C) / λ_E(C)

In order to send a π message in the case of multiple parents, we must calculate a joint distribution of the evidence over W&C. To do this we assume that the variables C and W are independent, so that we can write:

    π_E(W&C) = π_E(W) π_E(C)

which is a scalar equation parameterised by the states of W and C. This independence assumption is fine in the case of singly connected networks (those without loops), since there cannot be any other path between C and W through which evidence could be propagated. However, if there were, for example, a common ancestor of W and C, the assumption would no longer hold. We will discuss this point further later in the course.

The joint evidence over C and W can be written as a vector of dimension four:

    π_E(W&C) = [π_E(w_1&c_1), π_E(w_1&c_2), π_E(w_2&c_1), π_E(w_2&c_2)]
             = [π_E(w_1)π_E(c_1), π_E(w_1)π_E(c_2), π_E(w_2)π_E(c_1), π_E(w_2)π_E(c_2)]

The π evidence for the node E can then be found by multiplying this joint distribution by the link matrix:

    [π(e_1), π(e_2), π(e_3)] = P(E|W&C) [π_E(w_1&c_1), π_E(w_1&c_2), π_E(w_2&c_1), π_E(w_2&c_2)]

or, in vector form:

    π(E) = P(E|W&C) π_E(W&C)

We can now finally compute a probability distribution over the states of E. As before, all we do is multiply the λ and π evidence together and normalise:

    P(e_i) = α λ(e_i) π(e_i)

where α is the normalising constant.
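As a numerical sketch (all message values below are invented, and the link matrix is the illustrative one from the previous block), the joint π evidence is an outer product of the two incoming π messages, and π(E) is then a single matrix-vector product:

```python
import numpy as np

# Same illustrative link matrix as in the previous sketch.
P_E_given_WC = np.array([
    [1/3, 0.8, 0.7, 0.1],
    [1/3, 0.1, 0.2, 0.3],
    [1/3, 0.1, 0.1, 0.6],
])

# Hypothetical pi messages from the parents to E (need not be normalised)
# and lambda evidence on E (all ones = no evidence on E yet).
pi_E_W = np.array([0.2, 0.8])        # pi_E(w1), pi_E(w2)
pi_E_C = np.array([0.6, 0.4])        # pi_E(c1), pi_E(c2)
lam_E  = np.array([1.0, 1.0, 1.0])   # lambda(e1), lambda(e2), lambda(e3)

# Joint pi evidence over W & C in the column order of the link matrix:
# (w1,c1), (w1,c2), (w2,c1), (w2,c2).
pi_E_WC = np.outer(pi_E_W, pi_E_C).reshape(-1)

# pi(E) = P(E | W & C) pi_E(W & C), then the posterior over E.
pi_E = P_E_given_WC @ pi_E_WC
post_E = lam_E * pi_E
post_E /= post_E.sum()
print(post_E)
```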

λ messages

Calculating the λ evidence to send to multiple parents is harder than in the single parent case. Before we introduced the W node, the λ message from node E to C was calculated as:

    λ_E(c_1) = λ(e_1)P(e_1|c_1) + λ(e_2)P(e_2|c_1) + λ(e_3)P(e_3|c_1)
    λ_E(c_2) = λ(e_1)P(e_1|c_2) + λ(e_2)P(e_2|c_2) + λ(e_3)P(e_3|c_2)

We can write this as a scalar equation parameterised by i, the state of C:

    λ_E(c_i) = Σ_j P(e_j|c_i) λ(e_j)

The subscript indicates that λ_E is the part of the λ evidence that comes from node E; it is called the λ message from E to C. The full λ evidence for C takes node F into account as well: λ(C) = λ_E(C) λ_F(C). We also found it convenient to write the λ message in vector form:

    λ_E(C) = λ(E) P(E|C)

However, things are now more complex, since we no longer have a simple link matrix P(E|C), and the λ message to C needs to take into account the node W. One way to deal with the problem is to calculate a joint λ message for W&C. We can do this using an equation similar to the vector form given above:

    λ_E(W&C) = λ(E) P(E|W&C)

If we do this we have a λ message for each joint state of C and W:

    λ_E(W&C) = [λ_E(w_1&c_1), λ_E(w_1&c_2), λ_E(w_2&c_1), λ_E(w_2&c_2)]

We then need to use the evidence for W to separate out the evidence for C from the joint λ message:

    λ_E(c_i) = Σ_j π_E(w_j) λ_E(w_j&c_i)

The identical computation can be made by estimating the link matrix P(E|C) from P(E|W&C), given the evidence for W excluding E, namely π_E(W). This estimate is only valid for the current state of the network. Each element of the matrix can be computed using the scalar equation:

    P(e_i|c_j) = P(e_i|c_j&w_1) π_E(w_1) + P(e_i|c_j&w_2) π_E(w_2)

or, more generally:

    P(e_i|c_j) = Σ_k P(e_i|c_j&w_k) π_E(w_k)

where k ranges over the states of W. The λ message can also be written directly as a single scalar equation:

    λ_E(c_j) = Σ_k π_E(w_k) Σ_i P(e_i|c_j&w_k) λ(e_i)
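The same computation in Python (again with invented numbers), showing both the joint λ message and its separation into a message for C:

```python
import numpy as np

# Same illustrative link matrix P(E | W & C) as before, columns ordered
# (w1,c1), (w1,c2), (w2,c1), (w2,c2).
P_E_given_WC = np.array([
    [1/3, 0.8, 0.7, 0.1],
    [1/3, 0.1, 0.2, 0.3],
    [1/3, 0.1, 0.1, 0.6],
])

lam_E  = np.array([0.9, 0.05, 0.05])   # hypothetical lambda evidence on E
pi_E_W = np.array([0.2, 0.8])          # pi message from W to E

# Joint lambda message over the parent states: lambda_E(W&C) = lambda(E) P(E|W&C)
lam_E_WC = lam_E @ P_E_given_WC        # length 4, one entry per joint state

# Separate out the message to C using the evidence for W:
# lambda_E(c_i) = sum_j pi_E(w_j) * lambda_E(w_j & c_i)
lam_E_WC = lam_E_WC.reshape(2, 2)      # rows index w_j, columns index c_i
lam_E_C = pi_E_W @ lam_E_WC            # [lambda_E(c1), lambda_E(c2)]
print(lam_E_C)
```

Reshaping the length-4 joint message into a 2x2 array makes the weighted sum over the states of W a single matrix product.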

Instantiation and Dependence

We have already noted that instantiating a node with children means that any λ message sent by a child node is ignored. This means that all the children of an instantiated node are independent: a change to the probability distribution of any one of them cannot change the others. For example, when C is instantiated, any change in F will not affect E. We call this property conditional independence, since the children are independent given that the parent is in a known state (or condition). However, if the parent node is not instantiated, then a change to one of its children will change all the other children, since it changes the π message from the parent to each child. Thus, if C is not instantiated, a change in F will change the value of E.

An initially surprising result is that for a child with multiple parents the opposite is true. That is to say, if a node with multiple parents is not instantiated (and has no λ evidence), then its parents are independent. However, if a node with multiple parents is instantiated, then there is dependence between its parents. This becomes clear when we look at the equation for the λ message to joint parents:

    λ_E(c_j) = Σ_k π_E(w_k) Σ_i P(e_i|c_j&w_k) λ(e_i)

For the case where E has no evidence we can write λ(e_i) = 1 for all i. The second sum then evaluates to 1 for any j and k, since we are summing the columns of the link matrix, and the equation reduces to:

    λ_E(c_j) = Σ_k π_E(w_k)

Thus λ_E(c_j) evaluates to the same number for every j, and hence expresses no evidence. This is not so if we have λ evidence for E.
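A short sketch of this effect (the formula above reappears below as Operating Equation 1; the probabilities and message values here are invented):

```python
import numpy as np

def lambda_message_to_A(P_C_given_AB, pi_C_B, lam_C):
    """Lambda message from a child C to parent A, for a child with two
    parents A and B (the formula that appears as Operating Equation 1).

    P_C_given_AB[k, i, j] = P(c_k | a_i & b_j)
    pi_C_B[j]             = pi message from the other parent B to C
    lam_C[k]              = lambda evidence on C
    """
    inner = np.einsum('kij,k->ij', P_C_given_AB, lam_C)  # sum over k
    return inner @ pi_C_B                                # sum over j

# Illustrative 3-state child with two binary parents; the 3x4 link matrix
# has columns ordered (a1,b1), (a1,b2), (a2,b1), (a2,b2).
P = np.array([[1/3, 0.8, 0.7, 0.1],
              [1/3, 0.1, 0.2, 0.3],
              [1/3, 0.1, 0.1, 0.6]]).reshape(3, 2, 2)
pi_B = np.array([0.3, 0.7])

# No lambda evidence on the child: the message is the same for every state
# of A, so it carries no evidence and the parents remain independent.
print(lambda_message_to_A(P, pi_B, np.array([1.0, 1.0, 1.0])))   # [1. 1.]

# With lambda evidence on the child, the message differs across the states
# of A, so the parents become coupled ("explaining away").
print(lambda_message_to_A(P, pi_B, np.array([0.9, 0.05, 0.05])))
```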

The Operating Equations for Probability Propagation

In order to propagate probabilities in Bayesian networks we need just five equations.

Operating Equation 1: The λ message

The λ message from C to A, where C has parents A and B, is given by:

    λ_C(a_i) = Σ_{j=1..m} π_C(b_j) Σ_{k=1..n} P(c_k|a_i&b_j) λ(c_k)

In the case where we have a single parent A this reduces to:

    λ_C(a_i) = Σ_{k=1..n} P(c_k|a_i) λ(c_k)

and for the single parent case we can use the simpler matrix form:

    λ_C(A) = λ(C) P(C|A)

The matrix form for multiple parents relates to the joint states of the parents:

    λ_C(A&B) = λ(C) P(C|A&B)

It is then necessary to separate out the λ evidence for the individual parents with a scalar equation of the form:

    λ_C(a_i) = Σ_j π_C(b_j) λ_C(a_i&b_j)

Operating Equation 2: The π message

If C is a child of A, the π message from A to C is given by:

    π_C(a_i) = 1                     if A is instantiated for a_i
             = 0                     if A is instantiated, but not for a_i
             = P(a_i) / λ_C(a_i)     if A is not instantiated

Operating Equation 3: The λ evidence

If C is a node with n children D_1, D_2, ..., D_n, then the λ evidence for C is:

    λ(c_k) = 1                       if C is instantiated for c_k
           = 0                       if C is instantiated, but not for c_k
           = Π_i λ_{D_i}(c_k)        if C is not instantiated

Operating Equation 4: The π evidence

If C is a child of two parents A and B, the π evidence for C is given by:

    π(c_k) = Σ_{i=1..l} Σ_{j=1..m} P(c_k|a_i&b_j) π_C(a_i) π_C(b_j)

This can be written in matrix form as:

    π(C) = P(C|A&B) π_C(A&B),    where π_C(a_i&b_j) = π_C(a_i) π_C(b_j)

The single parent matrix equation is:

    π(C) = P(C|A) π_C(A)

Operating Equation 5: The posterior probability

If C is a variable, the (posterior) probability of C based on the evidence received is:

    P(c_k) = α λ(c_k) π(c_k)

where α is chosen to make Σ_k P(c_k) = 1.

Message Passing Summary

New evidence enters a network when a variable is instantiated, i.e. when it receives a new value from the outside world. When this happens the posterior probabilities of every node in the whole network must be re-calculated. This is achieved by message passing. When the λ or π evidence for a node changes, it must inform some of its parents and children as follows:

For each parent to be updated:
    Update the parent's λ message array
    Set a flag to indicate that the parent must be re-calculated
For each child to be updated:
    Update the child's π message array
    Set a flag to indicate that the child must be re-calculated

The network reaches a steady state when there are no more nodes to be re-calculated. This condition will always be reached in singly connected networks; multiply connected networks will not necessarily reach a steady state.

We are now in a position to set out the algorithm. There are four steps required.

1. Initialisation

The network is initialised to a state where no nodes are instantiated. That is to say, all we have in the way of information is the prior probabilities of the root nodes and the conditional probability link matrices.

    1. All λ messages and λ values are set to 1 (there is no λ evidence).
    2. All π messages are set to 1.
    3. For all root nodes the π values are set to the prior probabilities, e.g. for all states of R: π(r_i) = P(r_i).
    4. Post and propagate the π messages from the root nodes using downward propagation.
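The operating equations and the initialisation steps suggest a simple per-node data structure. The following is a minimal sketch of what each node might store and of the initialisation step; the class layout and field names are my own invention and are not prescribed by the notes.

```python
import numpy as np

class Node:
    """Minimal record for one variable in a singly connected network.
    Field names are illustrative only."""
    def __init__(self, name, n_states, prior=None, link=None, parents=()):
        self.name = name
        self.n_states = n_states
        self.prior = prior            # prior P(R) if this is a root node
        self.link = link              # link matrix, one column per joint parent state
        self.parents = list(parents)
        self.children = []
        self.instantiated = None      # index of the observed state, or None
        # Initialisation steps 1 and 2: all lambda values/messages and
        # all pi messages are set to 1 (no evidence anywhere).
        self.lam = np.ones(n_states)
        self.lam_msgs = {}            # lambda messages received from children
        self.pi_msgs = {}             # pi messages received from parents
        self.pi = np.ones(n_states)
        for p in self.parents:
            p.children.append(self)

def initialise(nodes):
    """Steps 3 and 4: roots take their priors as pi values, then the root
    pi messages are posted and propagated downwards (downward propagation
    itself is sketched in the next section)."""
    for node in nodes:
        if not node.parents:
            node.pi = np.asarray(node.prior, dtype=float).copy()
    # posting/propagating the root pi messages would happen here
```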

2. Upward Propagation

This is the way evidence is passed up the tree. If a node C receives a λ message from one child, and C is not instantiated:

    1. Compute the new λ(C) value (Operating Equation 3).
    2. Compute the new posterior probability P(C) (Operating Equation 5).
    3. Post a λ message to all of C's parents.
    4. Post a π message to C's other children.

Notice that if a node is instantiated, then we know its value exactly, and the evidence from its children has no effect.

(Figure 1: Message passing. (a) Upward propagation, (b) downward propagation, (c) instantiation.)

3. Downward Propagation

If a variable C receives a π message from one parent:

    If C is not instantiated:
        1. Compute a new value for the array π(C) (Operating Equation 4).
        2. Compute a new value for P(C) (Operating Equation 5).
        3. Post a π message to each child.
    If there is λ evidence in C:
        Post a λ message to the other parents.

4. Instantiation

If variable C is instantiated for state c_k:

    1. For all j ≠ k set P(c_j) = 0.
    2. Set P(c_k) = 1.
    3. Compute λ(C) (Operating Equation 3).
    4. Post a λ message to each parent of C.
    5. Post a π message to each child of C.
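These three procedures can be rendered schematically in Python. This is only a sketch of the control flow: it assumes the Node fields from the earlier sketch, plus hypothetical helpers (compute_lambda, compute_pi, compute_posterior, has_lambda_evidence, post_lambda_message, post_pi_message) implementing Operating Equations 2-5 and the message posting described above.

```python
import numpy as np

def on_lambda_message(C, from_child):
    """Upward propagation: C has received a lambda message from one child."""
    if C.instantiated is not None:
        return                                 # children's evidence is ignored
    C.lam = compute_lambda(C)                  # Operating Equation 3
    C.posterior = compute_posterior(C)         # Operating Equation 5
    for parent in C.parents:
        post_lambda_message(C, parent)
    for child in C.children:
        if child is not from_child:
            post_pi_message(C, child)

def on_pi_message(C, from_parent):
    """Downward propagation: C has received a pi message from one parent."""
    if C.instantiated is None:
        C.pi = compute_pi(C)                   # Operating Equation 4
        C.posterior = compute_posterior(C)     # Operating Equation 5
        for child in C.children:
            post_pi_message(C, child)
    if has_lambda_evidence(C):
        for parent in C.parents:
            if parent is not from_parent:
                post_lambda_message(C, parent)

def instantiate(C, k):
    """New evidence: variable C is observed in state k."""
    C.instantiated = k
    C.posterior = np.zeros(C.n_states)
    C.posterior[k] = 1.0
    C.lam = compute_lambda(C)                  # Operating Equation 3
    for parent in C.parents:
        post_lambda_message(C, parent)
    for child in C.children:
        post_pi_message(C, child)
```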

Blocked Paths

If a node is instantiated it will block some message passing between other nodes. For example, there is clearly no sense in sending a λ or π message to an instantiated node: the evidence will be ignored, since we already know the state of the node. This is the case for diverging paths.

However, converging paths (multiple parents) behave differently. Converging paths are blocked when there is no λ evidence on the child node, but unblocked when there is λ evidence or the node is instantiated. This becomes clear when we look at Operating Equation 1:

    λ_C(a_i) = Σ_{j=1..m} π_C(b_j) Σ_{k=1..n} P(c_k|a_i&b_j) λ(c_k)

Given that node C has no λ evidence, we can write λ(C) = [1, 1, ..., 1], and substituting this into Operating Equation 1 we get:

    λ_C(a_i) = Σ_{j=1..m} π_C(b_j) Σ_{k=1..n} P(c_k|a_i&b_j)

The second summation now evaluates to 1 for any value of i and j, since we are summing a probability distribution. Thus:

    λ_C(a_i) = Σ_{j=1..m} π_C(b_j)

This sum is independent of i, and hence the value of λ_C(a_i) is the same for all values of i; that is to say, no λ evidence is sent to A.

Finally, we give an example of a sequence of message passing in our owl and pussy cat example.

1. Initialisation. The only evidence initially is the prior probabilities of the roots. The priors become π messages and propagate downwards.

2. Initialisation (continued). E has no λ evidence, so it just sends a π message to its children. Then the network reaches a steady state.

3. Instantiation. A new measurement is made for node S. We instantiate it and λ messages propagate upwards.

4. Propagation. E receives the λ message from S and recomputes its λ evidence and its π message to D. It sends messages everywhere except to S. W and C are updated and their posterior probabilities are calculated. The π message from C to F has changed, so F will be updated too.

5. Propagation (continued). The π message is sent to F, which updates its posterior probability, and the network reaches a steady state.

6. More evidence. C is now instantiated and sends π messages to its children.

7. Propagation. E has λ evidence, so it will send a λ message to W. The π message it sends to D will require D to update its posterior probability, but the π message sent to S will be blocked, as S has been instantiated and will not change.
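As a check on this qualitative behaviour, the sketch below computes the corresponding posteriors by brute-force enumeration of the joint distribution rather than by message passing, using a small W, C → E → S fragment of the owl/cat network with invented CPTs. It shows the converging-path effect described above: observing S raises the belief in W, and then instantiating C lowers it again ("explaining away").

```python
import itertools

# Invented binary CPTs for the owl (W) / cat (C) fragment of the example:
# E (eyes) depends on W and C; S (a feature of the eyes) depends on E.
P_W_true = 0.1                                            # P(W = true)
P_C_true = 0.2                                            # P(C = true)
P_E_true = {(True, True): 0.99, (True, False): 0.90,
            (False, True): 0.80, (False, False): 0.05}    # P(E = true | W, C)
P_S_true = {True: 0.70, False: 0.10}                      # P(S = true | E)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(w, c, e, s):
    return (bern(P_W_true, w) * bern(P_C_true, c) *
            bern(P_E_true[(w, c)], e) * bern(P_S_true[e], s))

def posterior_W(**evidence):
    """P(W = true | evidence) by brute-force enumeration of the joint."""
    num = den = 0.0
    for w, c, e, s in itertools.product([True, False], repeat=4):
        assignment = {'w': w, 'c': c, 'e': e, 's': s}
        if any(assignment[k] != v for k, v in evidence.items()):
            continue
        p = joint(w, c, e, s)
        den += p
        if w:
            num += p
    return num / den

print(posterior_W())                 # prior belief in an owl
print(posterior_W(s=True))           # S observed: belief in an owl rises
print(posterior_W(s=True, c=True))   # C also observed: the cat partly
                                     # "explains away" the evidence for W
```

Enumeration like this is exponential in the number of variables, which is exactly the cost that the message-passing algorithm avoids in singly connected networks.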

2. Initialisation (cont.) E has no λ evidence so it just sends a π message to its children. Then the network reaches a steady state. 3. Instantiation A new measurement is made for node S. We instantiate it and λ message propagate upwards. 4. Propagation 5. Propagation (cont.) E receives the λ message from S and recomputes its λ evidence and it s π message to D. It sends messages everywhere except to S. W and C are updated. Their posterior probabilities are calculated. The π message from C to F has changed, so that will be updated too. The π message is sent to F, it updates it s posterior probability and the network reaches a steady state. 6. More Evidence C is now instantiated and sends π messages to its children. 7. Propagation E has λ evidence so it will send a λ message to W. The π message it sends to D will require D to update its posterior probability, but the π message sent to S will be blocked as S has been instantiated, and will not change. DOC493: Intelligent Data Analysis and Probabilistic Inference Lecture 4 8