Probabilistic Graphical Models: Representation and Inference


Probabilistic Graphical Models: Representation and Inference
Aaron C. Courville, Université de Montréal
Note: Material for the slides is taken directly from a presentation prepared by Andrew Moore.

Overview
- Graphs and probabilities: conditional independence, directed graphs, examples
- Inference in a directed graphical model: belief propagation
- Learning: maximum likelihood I (fully observable case), maximum likelihood II (Expectation Maximization)
- Approximate inference: variational methods, Monte Carlo methods

Probabilistic Graphical Models
Graphs endowed with a probability distribution: nodes represent random variables and the edges encode conditional independence assumptions.
A graphical model expresses sets of conditional independences via its graph structure (and conditional independence is useful).
The graph structure plus the associated parameters define the joint probability distribution over the set of nodes/variables.
Probability theory + graph theory = probabilistic graphical models.

Graphical Models
Graphical models come in two main flavors:
1. Directed graphical models (a.k.a. Bayes nets, belief networks):
- consist of a set of nodes with arrows (directed edges) between some of the nodes
- arrows encode factorized conditional probability distributions
2. Undirected graphical models (a.k.a. Markov random fields):
- consist of a set of nodes with undirected edges between some of the nodes
- edges (or, more accurately, the lack of edges) encode conditional independence
Today, we will focus exclusively on directed graphs.

Probability Review: Marginal Independence
Definition: X is marginally independent of Y if, for all (i, j),
P(X = x_i, Y = y_j) = P(X = x_i) P(Y = y_j), i.e. P(X, Y) = P(X) P(Y).
This is equivalent to the expressions P(X | Y) = P(X) and P(Y | X) = P(Y).
Why? Recall the probability product rule:
P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X) = P(X) P(Y).
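A minimal sketch (not from the slides) that checks this definition numerically for two binary variables. The joint table is made up so that X and Y are in fact independent; `joint`, `p_x`, `p_y` are illustrative names.

```python
# Check marginal independence P(X, Y) = P(X) P(Y) from a joint table.
import numpy as np

# Hypothetical joint with P(X=1) = 0.3 and P(Y=1) = 0.6, built to be independent.
joint = np.array([[0.28, 0.42],   # X = 0: P(X=0,Y=0), P(X=0,Y=1)
                  [0.12, 0.18]])  # X = 1: P(X=1,Y=0), P(X=1,Y=1)

p_x = joint.sum(axis=1)  # marginal P(X)
p_y = joint.sum(axis=0)  # marginal P(Y)

# Independence holds iff every cell equals the product of the marginals.
print(np.allclose(joint, np.outer(p_x, p_y)))  # True
```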

Probability Review: Conditional Independence
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z: for all (i, j, k),
P(X = x_i, Y = y_j | Z = z_k) = P(X = x_i | Z = z_k) P(Y = y_j | Z = z_k), i.e. P(X, Y | Z) = P(X | Z) P(Y | Z).
Or equivalently (by the product rule): P(X | Y, Z) = P(X | Z) and P(Y | X, Z) = P(Y | Z).
Why? Recall the probability product rule:
P(X, Y, Z) = P(X | Y, Z) P(Y | Z) P(Z) = P(X | Z) P(Y | Z) P(Z).
Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning).

Directed Graphical Models
A set of nodes with arrows (directed edges) between some of the nodes:
- nodes represent random variables
- arrows encode factorized conditional probability distributions
Consider an arbitrary joint distribution P(A, B, C). By application of the product rule:
P(A, B, C) = P(A) P(B, C | A) = P(A) P(B | A) P(C | A, B)
What if C is conditionally independent of B given A?
P(A, B, C) = P(A) P(B | A) P(C | A)
E.g., P(Rain, Lightning | StormCloud) = P(Rain | StormCloud) P(Lightning | StormCloud).
[Graph: StormCloud with arrows to Lightning and to Rain.]

The General Case
Consider the arbitrary joint distribution P(X_1, ..., X_d). By successive application of the product rule:
P(X_1, ..., X_d) = P(X_1) P(X_2 | X_1) ... P(X_d | X_1, ..., X_{d-1})
This can be represented by a graph in which each node has links from all lower-numbered nodes (i.e. a fully connected graph).
[Graph: a fully connected DAG over X_1, ..., X_7.]
Directed acyclic graph: NO DIRECTED CYCLES. We can number the nodes so that there are no links from higher-numbered to lower-numbered nodes.

Directed Acyclic Graphs (DAGs)
General factorization:
P(X_1, ..., X_d) = ∏_{i=1}^{d} P(X_i | Pa_i)
where Pa_i denotes the immediate parents of node i, i.e. all the nodes with arrows leading to node i.
Node X_i is conditionally independent of its non-descendents given its immediate parents (missing links imply conditional independence).
Some terminology:
- Parents of node i = nodes with arrows leading to node i.
- Antecedents = parents, parents of parents, etc.
- Children of node i = nodes with arrows leading from node i.
- Descendents = children, children of children, etc.
[Graph: an example DAG over X_1, ..., X_7.]
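A minimal sketch (not from the slides) of evaluating this factorization for binary variables. The three-node network A -> B, A -> C and all CPD numbers are made up for illustration.

```python
# Evaluate P(X_1, ..., X_d) = prod_i P(X_i | Pa_i) for a small binary DAG.
parents = {"A": [], "B": ["A"], "C": ["A"]}          # A -> B, A -> C
cpds = {
    "A": {(): 0.4},                                  # P(A = 1)
    "B": {(0,): 0.1, (1,): 0.8},                     # P(B = 1 | A)
    "C": {(0,): 0.3, (1,): 0.6},                     # P(C = 1 | A)
}

def joint_probability(assignment):
    """P(assignment) = product over nodes of P(x_i | pa_i)."""
    p = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p1 = cpds[var][pa_values]                    # P(var = 1 | parents)
        p *= p1 if assignment[var] == 1 else 1.0 - p1
    return p

print(joint_probability({"A": 1, "B": 1, "C": 0}))   # 0.4 * 0.8 * 0.4 = 0.128
```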

Importance of Ordering
Since every joint probability distribution can be decomposed as
P(X_1, ..., X_d) = ∏_{i=1}^{d} P(X_i | Pa_i)
every joint probability can be represented as a graphical model, with an arbitrary ordering over the variables; i.e. any decomposition will generate a valid graph. However, not all decompositions are created equal.
[Figure: two graphs over Battery, Fuel, Fuel gauge, Engine turns over, and Start, corresponding to two different variable orderings.]

DAG Example I: Bayes Net
A conditional probability distribution (CPD) is associated with each node.
- Defines P(X_i | Pa_i).
- Specifies how the parents interact.
[Graph: Storm Clouds -> Lightning, Storm Clouds -> Rain, Lightning -> Thunder, and {Lightning, Rain} -> Wind Surf.]
CPD for node Lightning:
  Pa     P(L | Pa)   P(~L | Pa)
  SC       0.7         0.3
  ~SC      0           1.0
CPD for node Wind Surf:
  Pa       P(WS | Pa)   P(~WS | Pa)
  L, R       0            1.0
  L, ~R      0            1.0
  ~L, R      0.2          0.8
  ~L, ~R     0.9          0.1
P(SC, L, R, T, WS) = ∏_i P(node_i | Pa_i) = P(SC) P(L | SC) P(R | SC) P(T | L) P(WS | L, R)

Example I: Bayes Net (cont.)
Let's count the parameters:
- In the full joint distribution: 2^5 - 1 = 31.
- Taking advantage of conditional independences: 11 (1 for SC, 2 each for L | SC, R | SC, and T | L, and 4 for WS | L, R).
(Same network and CPD tables as on the previous slide.)
P(SC, L, R, T, WS) = P(SC) P(L | SC) P(R | SC) P(T | L) P(WS | L, R)
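A small sketch evaluating this factorization. The CPD values for Lightning and Wind Surf are taken from the slide; the slide does not give P(SC), P(R | SC), or P(T | L), so the values used for `p_sc`, `p_r_given_sc`, and `p_t_given_l` below are made-up placeholders.

```python
# Evaluate P(SC, L, R, T, WS) = P(SC) P(L|SC) P(R|SC) P(T|L) P(WS|L,R).
p_sc = 0.5                                  # assumed P(StormClouds = true)
p_l_given_sc = {True: 0.7, False: 0.0}      # P(Lightning | SC), from the slide
p_r_given_sc = {True: 0.8, False: 0.1}      # assumed P(Rain | SC)
p_t_given_l = {True: 0.95, False: 0.05}     # assumed P(Thunder | L)
p_ws_given_lr = {(True, True): 0.0, (True, False): 0.0,
                 (False, True): 0.2, (False, False): 0.9}  # P(WindSurf | L, R), from the slide

def prob(p, value):
    # P(variable = value) when p is the probability the variable is true.
    return p if value else 1.0 - p

def joint(sc, l, r, t, ws):
    return (prob(p_sc, sc)
            * prob(p_l_given_sc[sc], l)
            * prob(p_r_given_sc[sc], r)
            * prob(p_t_given_l[l], t)
            * prob(p_ws_given_lr[(l, r)], ws))

print(joint(sc=True, l=False, r=True, t=False, ws=True))   # 0.5*0.3*0.8*0.95*0.2
```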

Example: Univariate Gaussian
Consider a univariate Gaussian random variable as a (simple) graphical model:
p(X = x) = N(x | mu, sigma^2) = 1 / (sigma * sqrt(2*pi)) * exp( -(x - mu)^2 / (2 * sigma^2) )
Graphical model: a single node X.

Example II: Mixture of Gaussians
Now let's consider a random variable distributed according to a mixture of Gaussians.
Conditional distributions:
P(I = i) = w_i
p(X = x | I = i) = N(x | mu_i, sigma_i^2) = 1 / (sigma_i * sqrt(2*pi)) * exp( -(x - mu_i)^2 / (2 * sigma_i^2) )
where I is an index over the Gaussian components in the mixture and the mixing proportion, w_i, is the marginal probability that X is generated by mixture component i.
Marginal distribution:
p(X = x) = ∑_i p(X = x | I = i) P(I = i) = ∑_i w_i N(x | mu_i, sigma_i^2)
Graphical model: I -> X.
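A minimal sketch (not from the slides) evaluating this marginal density for a made-up two-component mixture; the weights, means, and standard deviations are illustrative.

```python
# p(x) = sum_i w_i N(x | mu_i, sigma_i^2), marginalizing out the component I.
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical mixture parameters.
weights = [0.3, 0.7]
means = [-1.0, 2.0]
sigmas = [0.5, 1.0]

def mixture_pdf(x):
    # Sum over components of P(I = i) * p(x | I = i).
    return sum(w * gaussian_pdf(x, mu, s) for w, mu, s in zip(weights, means, sigmas))

print(mixture_pdf(0.0))
```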


Example III: Naïve Bayes
Consider the classification problem: given training data of pairs (X, Y) and a new example X_new, we want to determine P(Y | X_new) so that we can classify Y_new as argmax_j P(Y = j | X_new).
One way to do this is to use Bayes' rule:
P(Y | X) = P(X | Y) P(Y) / P(X)

Example III: Naïve Bayes (cont.)
Using Bayes' rule, to compute P(Y | X) we need to represent P(X | Y) and P(Y).
- For even moderately large X, the full joint P(X_1, X_2, ..., X_n | Y) is impractical: we would require a parameter for each unique combination of X_1, ..., X_n.
Naïve Bayes assumption:
P(X | Y) = P(X_1, ..., X_n | Y) = ∏_i P(X_i | Y)
i.e. X_i and X_j (i ≠ j) are conditionally independent given Y.
[Graph: Y with arrows to X_1, ..., X_n.]
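A minimal sketch (not from the slides) of classification under this assumption, for three binary features. The prior and per-feature conditionals would normally be estimated from training data; the numbers here are made-up placeholders.

```python
# Naive Bayes: argmax_y P(Y=y) * prod_i P(X_i = x_i | Y=y); P(X) cancels in the argmax.
import math

p_y = {0: 0.6, 1: 0.4}                       # class prior P(Y)
p_xi_given_y = {0: [0.2, 0.5, 0.7],          # P(X_i = 1 | Y = 0), assumed values
                1: [0.8, 0.4, 0.1]}          # P(X_i = 1 | Y = 1), assumed values

def log_posterior_unnormalized(x, y):
    # log P(Y=y) + sum_i log P(X_i = x_i | Y=y), using the naive Bayes factorization.
    logp = math.log(p_y[y])
    for xi, p1 in zip(x, p_xi_given_y[y]):
        logp += math.log(p1 if xi == 1 else 1.0 - p1)
    return logp

def classify(x):
    return max(p_y, key=lambda y: log_posterior_unnormalized(x, y))

print(classify([1, 0, 0]))   # most probable class under the assumed CPDs
```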

Conditioning on Evidence
Much of the point of Bayes nets (and graphical models more generally) is to use data, or observed variables, to infer the values of hidden (or latent) variables. We can condition on the observed variables and compute conditional distributions over the latent variables.
[Figure: the car network, with Battery and Fuel hidden (marked "?"), and the observations Fuel gauge = empty and Start = no.]
Note: hidden variables may have specific interpretations (such as Battery and Fuel), or may be introduced to permit a richer class of distributions (as in neural network hidden units).

Markov Properties
We have said that the conditional independence properties are represented directly by the graph. The pattern of conditional independence turns out to be extremely important for determining how to calculate the conditional distributions over variables of interest.
We will now consider three simple independence structures that form the basic building blocks of all independence relationships representable by directed graphical models:
A -> B -> C (head-to-tail), A <- B -> C (tail-to-tail), and A -> B <- C (head-to-head).

Markov Properties I
Consider the joint distribution P(A, B, C) specified by the graph A -> B -> C. Note the missing edge from A to C. Node B is head-to-tail with respect to the path A-B-C.
Joint distribution:
P(A, B, C) = P(A) P(B | A) P(C | A, B) = P(A) P(B | A) P(C | B)

Markov Properties I (cont.)
Suppose we condition on the node B. Then A and C are conditionally independent given B:
P(A, C | B) = P(A, B, C) / P(B)                          [by definition of the joint]
            = P(A) P(B | A) P(C | B) / P(B)
            = P(A | B) P(C | B)                          [by Bayes' rule]
Note that if B is not observed, A and C are in general dependent:
P(A, C) = ∑_B P(A, C | B) P(B) = ∑_B P(A | B) P(C | B) P(B) ≠ P(A) P(C)
Informally: observing B blocks the path from A to C.

Markov Properties II
Consider the graph A <- B -> C. Again, note the missing edge from A to C. Node B is tail-to-tail with respect to the path A-B-C.
Joint distribution:
P(A, B, C) = P(B) P(A | B) P(C | A, B) = P(B) P(A | B) P(C | B)

Markov Properties II (cont.)
Again, let's condition on node B. A and C are conditionally independent given B:
P(A, C | B) = P(A | B) P(C | B)
Again, if B is not observed:
P(A, C) = ∑_B P(A, C | B) P(B) = ∑_B P(A | B) P(C | B) P(B) ≠ P(A) P(C)
Informally: observing B blocks the path from A to C.

Markov Properties III
Consider the graph A -> B <- C. Once again, note the missing edge from A to C. Node B is head-to-head with respect to the path A-B-C.
Joint distribution:
P(A, B, C) = P(A) P(C) P(B | A, C)
If B is not observed, we have A independent of C:
P(A, C) = ∑_B P(A) P(C) P(B | A, C) = P(A) P(C) ∑_B P(B | A, C) = P(A) P(C)

Markov Properties III (cont.): The Famous Explaining-Away Effect
Suppose, once again, we condition on node B. A and C are NOT conditionally independent given B:
P(A, C | B) = P(A | B, C) P(C | B) = P(C | B, A) P(A | B) ≠ P(A | B) P(C | B)
Informally, an unobserved head-to-head node B blocks the path from A to C, but once B is observed the path is unblocked.
Important note: observation of any descendent of B also unblocks the path.

Explaining Away: An Illustration
Consider the Burglar Alarm problem, with graph Burglar -> Alarm <- Earthquake and Alarm -> Phone Call.
Your house has a twitchy burglar alarm that is sometimes triggered by earthquakes. The Earth doesn't care whether your house is currently being burgled. While on vacation, one of your neighbours calls and tells you that your home's burglar alarm is ringing!

The plot thickens...
But now suppose you learn that there was a small earthquake in your neighbourhood. Whew! Probably not a burglar after all. The earthquake "explains away" the hypothetical burglar.
So Burglar must not be conditionally independent of Earthquake given Alarm (or given Phone Call).
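A sketch (not from the slides) that makes the effect numeric by brute-force enumeration on the Burglar -> Alarm <- Earthquake fragment. All CPT values are made up; only the qualitative drop in P(Burglar | evidence) matters.

```python
# Explaining away: P(B | A=1) is fairly high, P(B | A=1, E=1) is much lower.
import itertools

p_b, p_e = 0.01, 0.02                      # priors P(Burglar), P(Earthquake)
p_alarm = {(0, 0): 0.001, (0, 1): 0.3,     # assumed P(Alarm = 1 | B, E)
           (1, 0): 0.9, (1, 1): 0.95}

def joint(b, e, a):
    return ((p_b if b else 1 - p_b) * (p_e if e else 1 - p_e)
            * (p_alarm[(b, e)] if a else 1 - p_alarm[(b, e)]))

def posterior_burglar(a, e=None):
    # P(B = 1 | evidence) by enumerating all assignments consistent with the evidence.
    num = den = 0.0
    for b, eq in itertools.product([0, 1], repeat=2):
        if e is not None and eq != e:
            continue
        p = joint(b, eq, a)
        den += p
        num += p * b
    return num / den

print(posterior_burglar(a=1))        # P(Burglar | Alarm) -- around 0.57 here
print(posterior_burglar(a=1, e=1))   # P(Burglar | Alarm, Earthquake) -- drops sharply
```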

d-separation
In a general DAG, consider three groups of nodes A, B, C. To determine whether A is conditionally independent of C given B, consider all possible paths from any node in A to any node in C.
Any such path is blocked if there is a node X on the path which is head-to-tail or tail-to-tail with respect to the path and X is in B, or if the node is head-to-head with respect to the path and neither the node nor any of its descendents is in B.
Note that a particular node may, for example, be head-to-head with respect to one path and head-to-tail with respect to a different path.
Theorem [Verma & Pearl, 1988]: If, given B, all possible paths from A to C are blocked (we say that B d-separates A and C), then A is conditionally independent of C given B.
Note: variables may actually be independent even when they are not d-separated.
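A sketch (not from the slides) of this blocking test, by brute-force enumeration of all undirected paths between two single nodes (the slide's definition is over sets; restricting to node pairs keeps the code short). It is only meant for small graphs; practical implementations use the more efficient "Bayes ball" procedure. The example graph is the burglar-alarm network from the previous slides.

```python
# d-separation check by enumerating undirected paths and testing the blocking rule.

def descendants(graph, node):
    """All descendants of `node` in the DAG `graph` (dict: node -> list of children)."""
    out, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

def all_paths(graph, start, goal):
    """All undirected simple paths from start to goal."""
    adj = {}
    for u, kids in graph.items():
        for v in kids:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    paths, stack = [], [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == goal:
            paths.append(path)
            continue
        for nxt in adj.get(path[-1], ()):
            if nxt not in path:
                stack.append(path + [nxt])
    return paths

def d_separated(graph, x, y, given):
    """True if every undirected path from x to y is blocked by the observed set `given`."""
    given = set(given)
    for path in all_paths(graph, x, y):
        blocked = False
        for prev, node, nxt in zip(path, path[1:], path[2:]):
            head_to_head = node in graph.get(prev, []) and node in graph.get(nxt, [])
            if head_to_head:
                # Blocks unless the node or one of its descendants is observed.
                if node not in given and not (descendants(graph, node) & given):
                    blocked = True
            elif node in given:
                # Head-to-tail or tail-to-tail: blocks if the node is observed.
                blocked = True
        if not blocked:
            return False
    return True

# Burglar alarm network: B -> A <- E, A -> P.
g = {"B": ["A"], "E": ["A"], "A": ["P"], "P": []}
print(d_separated(g, "B", "E", given=set()))   # True: marginally independent
print(d_separated(g, "B", "E", given={"A"}))   # False: explaining away
print(d_separated(g, "B", "E", given={"P"}))   # False: a descendant of A unblocks the path
```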

d-separation Example
[Figure: a DAG over nodes A, B, C, D, E, F, G, H, I, J.]
Conditional independence queries:
- C independent of D?
- C independent of D given A?
- C independent of D given A, B?
- C independent of D given A, B, J?
- C independent of D given A, B, E, J?

Inference in a DAG
Inference: calculate P(X | Y) for some variable or set of variables of interest X and an observed variable or set of variables Y.
If Z is the set of remaining variables in the graph:
P(X | Y) = ∑_Z P(X | Z, Y) P(Z | Y)
The bad news: exact inference in Bayes nets (DAGs) is NP-hard!
But exact inference is still tractable in some cases. Let's look at a special class of networks: trees / forests, in which each node has at most one parent.

Example: Markov Chain
Consider the chain-structured graph X_1 -> X_2 -> ... -> X_{L-1} -> X_L (a Markov chain):
P(X_1, ..., X_L) = P(X_1) P(X_2 | X_1) ... P(X_L | X_{L-1})
What if we want to find P(X_1 | X_L)? Direct evaluation gives:
P(X_1 | X_L) = [ ∑_{X_2, ..., X_{L-1}} P(X_1, ..., X_L) ] / [ ∑_{X_1, ..., X_{L-1}} P(X_1, ..., X_L) ]
For variables having M states, the denominator involves summing over M^{L-1} terms (exponential in the length of the chain).

Example: Markov Chain (cont.)
Using the conditional independence structure, we can reorder the summations in the denominator to give
∑_{X_1, ..., X_{L-1}} P(X_1, ..., X_L) = ∑_{X_{L-1}} P(X_L | X_{L-1}) ... ∑_{X_2} P(X_3 | X_2) ∑_{X_1} P(X_2 | X_1) P(X_1)
which involves on the order of L M^2 summations (linear in the length of the chain); similarly for the numerator.
This is known as variable elimination, and can also be interpreted as a sequence of messages being passed.
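A minimal sketch (not from the slides) of this reordering on a chain with M states. For simplicity it computes the marginal P(X_L) rather than the conditional P(X_1 | X_L), but it shows the same trick of pushing each sum inward and carrying an M-dimensional message; the transition matrices and initial distribution are randomly generated examples.

```python
# Variable elimination on a Markov chain: O(L * M^2) instead of O(M^L).
import numpy as np

M, L = 3, 6
rng = np.random.default_rng(0)
p_x1 = np.full(M, 1.0 / M)                    # P(X_1), uniform for illustration
trans = rng.dirichlet(np.ones(M), size=M)     # trans[i, j] = P(X_{t+1} = j | X_t = i)

message = p_x1
for _ in range(L - 1):
    # message_new[j] = sum_i message[i] * P(X_{t+1} = j | X_t = i): one M x M product per step.
    message = message @ trans
print(message)                                # P(X_L)
```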

Belief Propagation: Decomposing Probabilities
Suppose we want to compute P(X_i | E) where E is some set of evidence variables.
- Evidence variables are those that are observable, offering evidence with respect to the values of the latent variables.
Let's split E into two parts:
i. E_i^- is the part consisting of assignments to variables in the subtree rooted at X_i.
ii. E_i^+ is the rest of it.
Then:
P(X_i | E) = P(X_i | E_i^-, E_i^+)
           = P(E_i^- | X_i, E_i^+) P(X_i | E_i^+) / P(E_i^- | E_i^+)     [by Bayes' rule]
           = P(E_i^- | X_i) P(X_i | E_i^+) / P(E_i^- | E_i^+)            [by conditional independence]
           = α π(X_i) λ(X_i)
where π(X_i) = P(X_i | E_i^+), λ(X_i) = P(E_i^- | X_i), and α is a constant independent of X_i.

Computing λ(X_i) for Leaves
Starting at the leaves of the tree, recursively compute λ(X_i) = P(E_i^- | X_i) for all X_i.
If X_i is a leaf:
- If X_i is in E (X_i is observed): λ(X_i) = 1 if X_i matches E, 0 otherwise.
- If X_i is not in E: E_i^- is the null set, so λ(X_i) = 1 (constant).
For simplicity, and without loss of generality, we will assume that all variables in E (the evidence set) are leaves in the tree.
- How can we do this? In the case of a tree, conditioning on E d-separates the graph at that point.

Computing λ(X_i) for Non-Leaves I
Suppose X_i has one child, X_c. Then
λ(X_i) = P(E_i^- | X_i)
       = ∑_j P(E_i^-, X_c = j | X_i)
       = ∑_j P(X_c = j | X_i) P(E_i^- | X_i, X_c = j)
       = ∑_j P(X_c = j | X_i) P(E_i^- | X_c = j)
       = ∑_j P(X_c = j | X_i) λ(X_c = j)
Note: λ is a function of X_i (for discrete X_i, λ is a vector).

Computing λ(X_i) for Non-Leaves II
Now suppose X_i has a set of children C. Since X_i d-separates each of its subtrees, the contribution of each subtree to λ(X_i) is independent:
λ(X_i) = P(E_i^- | X_i) = ∏_{X_j ∈ C} λ_j(X_i) = ∏_{X_j ∈ C} [ ∑_k P(X_j = k | X_i) λ(X_j = k) ]
where λ_j(X_i) is the contribution to λ(X_i) of the evidence lying in the subtree rooted at one of X_i's children, X_j.
We now have a way to recursively compute all the λ(X_i)'s. Each node in the network passes a little λ message to its parent.
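A sketch (not from the slides) of this recursion on a tiny tree of binary variables: a root with two leaf children, one of which is observed. The tree, CPTs, and evidence are made up for illustration.

```python
# lambda(X_i) = prod_{children c} sum_j P(X_c = j | X_i) lambda(X_c = j)
import numpy as np

children = {"R": ["A", "B"], "A": [], "B": []}      # root R with two leaf children
# cpt[c][i, j] = P(X_c = j | parent = i), assumed values
cpt = {"A": np.array([[0.9, 0.1], [0.2, 0.8]]),
       "B": np.array([[0.7, 0.3], [0.4, 0.6]])}
evidence = {"A": 1}                                 # leaf A observed to be 1; B unobserved

def lam(node):
    """Return the vector lambda(X_node) = P(evidence below node | X_node)."""
    if not children[node]:                          # leaf case
        if node in evidence:
            vec = np.zeros(2)
            vec[evidence[node]] = 1.0               # indicator of the observed value
            return vec
        return np.ones(2)                           # no evidence below: lambda = 1
    out = np.ones(2)
    for c in children[node]:
        out *= cpt[c] @ lam(c)                      # sum_j P(X_c = j | X_node) lambda(X_c = j)
    return out

print(lam("R"))   # lambda at the root: here [P(A=1 | R=0), P(A=1 | R=1)]
```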

The Other Half of the Problem
Remember P(X_i | E) = α π(X_i) λ(X_i). Now that we have all the λ(X_i)'s, what about the π(X_i)'s?
- Recall: π(X_i) = P(X_i | E_i^+).
At the root node X_r, E_r^+ is the null set, so π(X_r) = P(X_r).
- Since we also know λ(X_r), we can compute the final P(X_r | E)!
So for an arbitrary X_i with parent X_p, we can recursively compute π(X_i) from π(X_p) and/or P(X_p | E). How do we do this?

Computing π(X_i)
For a node X_i with parent X_p:
π(X_i) = P(X_i | E_i^+)
       = ∑_j P(X_i, X_p = j | E_i^+)
       = ∑_j P(X_i | X_p = j, E_i^+) P(X_p = j | E_i^+)
       = ∑_j P(X_i | X_p = j) P(X_p = j | E_i^+)
       = ∑_j P(X_i | X_p = j) P(X_p = j | E) / λ_i(X_p = j)
       = ∑_j P(X_i | X_p = j) π_i(X_p = j)
where π_i(X_p) is defined as P(X_p | E) / λ_i(X_p).

Belief Propagation Wrap-Up
Using a top-down strategy, we can compute, for all i, P(X_i | E) and, in turn, π(X_j) for the children X_j of X_i.
Nodes can be thought of as autonomous processors passing λ and π messages to their neighbors.
What if you want a conjunctive distribution, e.g., P(A, B | C), instead of just the marginal distributions P(A | C) and P(B | C)?
- Use the chain rule: P(A, B | C) = P(A | C) P(B | A, C).
- Each of these latter probabilities can be computed using belief propagation.

Generalizing Belief Propagation
Great! For tree-structured graphs, we have an inference algorithm with time complexity linear in the number of nodes.
All this is fine, but what if you don't have a tree-structured graph? The method we discussed can be generalized to polytrees:
- the undirected version of the graph is still a tree, but nodes can have more than one parent
- polytrees do not contain any undirected cycles

Dealing with Cycles
We can deal with undirected cycles in a graph by the junction tree algorithm. An arbitrary directed acyclic graph can be transformed, via a graph-theoretic procedure, into a tree structure whose nodes contain clusters of variables (cliques). Belief propagation can then be performed on the new tree.
[Figure: a network over nodes A, S, T, L, B, E, X, D and the corresponding junction tree with clique nodes AT, TLE, SBL, BLE, XE, DBE.]
Note: in the worst case, the junction tree nodes must take on exponentially many combinations of values, but the method can work well in practice.

Approximate Inference Methods
For densely connected graphs, exact inference may be intractable. There are three widely used approximation schemes:
- Loopy belief propagation: pretend the graph is a tree and pass messages anyway.
- Markov chain Monte Carlo (MCMC): simulate lots of values of the variables and use the sample frequencies to approximate probabilities.
- Variational inference: approximate the graph with a simplified graph (with no cycles) and fit a set of variational parameters such that the probabilities computed with the simple structure are close to those in the original graph.
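A sketch (not from the slides) of the simplest sampling-based approximation: rejection sampling rather than full MCMC, on the burglar-alarm fragment, estimating P(Burglar | Alarm = 1). The CPT values are the made-up ones used in the earlier explaining-away sketch.

```python
# Ancestral sampling plus rejection: keep only samples consistent with the evidence.
import random

random.seed(0)
p_b, p_e = 0.01, 0.02
p_alarm = {(0, 0): 0.001, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.95}

def sample_joint():
    # Sample parents before children (ancestral ordering).
    b = int(random.random() < p_b)
    e = int(random.random() < p_e)
    a = int(random.random() < p_alarm[(b, e)])
    return b, e, a

kept = burglar_count = 0
for _ in range(200_000):
    b, e, a = sample_joint()
    if a == 1:                     # reject samples that contradict the evidence
        kept += 1
        burglar_count += b
print(burglar_count / kept)        # approximates P(Burglar = 1 | Alarm = 1)
```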

Acknowledgments
The material presented here is taken from tutorials, notes and lecture slides from Yoshua Bengio, Christopher Bishop, Andrew Moore, Tom Mitchell and Scott Davies.
Note to other teachers and users of these slides: Andrew and Scott would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.