Bayesian networks: approximate inference


Bayesian networks: approximate inference. Machine Intelligence. Thomas D. Nielsen. September 2008. (Approximative inference, 25 slides.)

Motivation
Because of the (worst-case) intractability of exact inference in Bayesian networks, we look for more efficient approximate inference techniques: instead of computing the exact posterior P(A | E = e), compute an approximation P̂(A | E = e) with P̂(A | E = e) ≈ P(A | E = e).

Absolute/Relative Error
For p, p̂ ∈ [0, 1]:
- p̂ approximates p with absolute error ε if |p − p̂| ≤ ε, i.e. p̂ ∈ [p − ε, p + ε].
- p̂ approximates p with relative error ε if |1 − p̂/p| ≤ ε, i.e. p̂ ∈ [p(1 − ε), p(1 + ε)].
This definition is not always fully satisfactory: it is not symmetric in p and p̂, and it is not invariant under the transition p ↦ 1 − p, p̂ ↦ 1 − p̂. Use with care!
- When p̂1, p̂2 approximate p1, p2 with absolute error ε, no error bounds follow for p̂1/p̂2 as an approximation for p1/p2.
- When p̂1, p̂2 approximate p1, p2 with relative error ε, then p̂1/p̂2 approximates p1/p2 with relative error 2ε/(1 + ε).
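The bound for the ratio of two relative-error approximations can be checked numerically. A minimal sketch, with hypothetical values for p1, p2 and ε (any choice in (0, 1) behaves the same way); the worst case arises when p1 is underestimated and p2 overestimated by the full relative error:

```python
def relative_error(p, p_hat):
    """|1 - p_hat/p|, the relative error of the approximation p_hat."""
    return abs(1.0 - p_hat / p)

p1, p2, eps = 0.3, 0.6, 0.1   # hypothetical values for illustration

# Worst case for the ratio: p1 underestimated, p2 overestimated.
p1_hat = p1 * (1.0 - eps)
p2_hat = p2 * (1.0 + eps)

ratio_error = relative_error(p1 / p2, p1_hat / p2_hat)
bound = 2.0 * eps / (1.0 + eps)

# In this worst case the error attains the bound exactly:
assert abs(ratio_error - bound) < 1e-12
```

The algebra behind the assertion: p̂1/p̂2 = (p1/p2) · (1 − ε)/(1 + ε), and 1 − (1 − ε)/(1 + ε) = 2ε/(1 + ε).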

Randomized Methods
Most methods for approximate inference are randomized algorithms that compute approximations P̂ from random samples of instantiations. We shall consider:
- Forward sampling
- Likelihood weighting
- Gibbs sampling
- The Metropolis-Hastings algorithm

Forward Sampling
Observation: a Bayesian network can be used as a random generator that produces full instantiations V = v according to the distribution P(V).
Example: network A → B with

  P(A):     A = t: 0.2   A = f: 0.8
  P(B | A): A = t: B = t: 0.7, B = f: 0.3
            A = f: B = t: 0.4, B = f: 0.6

- Generate random numbers r1, r2 uniformly from [0, 1].
- Set A = t if r1 ≤ 0.2, and A = f otherwise.
- Depending on the value of A and r2, set B to t or f.
Generation of one random instantiation is linear in the size of the network.

Sampling Algorithm
Thus, we have a randomized algorithm S that produces possible outputs from sp(V) according to the distribution P(V). Given the outputs s_1, ..., s_N of N runs of S, define

  P̂(A = a | E = e) := |{i ∈ {1, ..., N} : E = e, A = a in s_i}| / |{i ∈ {1, ..., N} : E = e in s_i}|

Forward Sampling: Illustration
(Figure: the samples are partitioned into those with E ≠ e, those with E = e but A ≠ a, and those with E = e, A = a. The approximation for P(A = a | E = e) is the number of samples with E = e, A = a divided by the number of samples with E = e.)
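As a sketch (not from the slides), forward sampling and the counting estimate above can be put together for the two-node example network; the CPT numbers are those from the forward-sampling slide:

```python
import random

# Forward sampling on the two-node network A -> B from the example,
# followed by the counting estimate of P(A = t | B = t).
P_A = {'t': 0.2, 'f': 0.8}                      # P(A)
P_B_given_A = {'t': {'t': 0.7, 'f': 0.3},       # P(B | A = t)
               'f': {'t': 0.4, 'f': 0.6}}       # P(B | A = f)

def sample_once(rng):
    """One full instantiation (a, b), sampled in topological order."""
    a = 't' if rng.random() <= P_A['t'] else 'f'
    b = 't' if rng.random() <= P_B_given_A[a]['t'] else 'f'
    return a, b

def estimate(n_samples, seed=0):
    """P_hat(A = t | B = t): counts over samples consistent with the evidence."""
    rng = random.Random(seed)
    n_evidence = n_match = 0
    for _ in range(n_samples):
        a, b = sample_once(rng)
        if b == 't':                # discard samples with E != e
            n_evidence += 1
            n_match += (a == 't')
    return n_match / n_evidence

# Exact value: P(A=t, B=t) / P(B=t) = 0.2*0.7 / (0.2*0.7 + 0.8*0.4)
print(estimate(100_000))
```

Note how samples with B ≠ t are simply thrown away; this waste is exactly the problem the next slides address.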

Sampling from the Conditional Distribution
Problem with forward sampling: samples with E ≠ e are useless!
Idea: find a sampling algorithm S_c that produces outputs from sp(V) according to the distribution P(V | E = e).
A tempting approach: fix the variables in E to e and sample only the non-evidence variables!
Problem: only evidence from a variable's ancestors is taken into account!

Likelihood Weighting
Writing pa(X) for the parents of X outside E and pa_E(X) for the parents of X in E, we would like to sample from

  P(U, e) = ∏_{X ∈ U\E} P(X | pa(X), pa_E(X) = e) · ∏_{X ∈ E} P(X = e | pa(X), pa_E(X) = e),

but by applying forward sampling with fixed E we actually sample from:

  Sampling distribution = ∏_{X ∈ U\E} P(X | pa(X), pa_E(X) = e).

Solution: instead of letting each sample count as 1, weight it by

  w(x, e) = ∏_{X ∈ E} P(X = e | pa(X), pa_E(X) = e).

Likelihood Weighting: Example
Same network A → B as before (P(A = t) = 0.2; P(B = t | A = t) = 0.7, P(B = t | A = f) = 0.4).
- Assume evidence B = t.
- Generate a random number r uniformly from [0, 1].
- Set A = t if r ≤ 0.2, and A = f otherwise.
- If A = t, let the sample count as w(t, t) = 0.7; otherwise as w(f, t) = 0.4.
With N samples a_1, ..., a_N we get

  P̂(A = t | B = t) = Σ_{i : a_i = t} w(a_i, e) / Σ_{i=1}^{N} w(a_i, e).
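A minimal Python sketch of this example (the CPT numbers are the ones from the slide): with evidence B = t only A is sampled, and each sample contributes its likelihood weight P(B = t | A = a) instead of a count of 1.

```python
import random

# Likelihood weighting for the example network A -> B with evidence B = t.
P_A = {'t': 0.2, 'f': 0.8}           # P(A)
P_Bt_given_A = {'t': 0.7, 'f': 0.4}  # P(B = t | A)

def lw_estimate(n_samples, seed=0):
    """P_hat(A = t | B = t) via likelihood weighting."""
    rng = random.Random(seed)
    weight = {'t': 0.0, 'f': 0.0}    # accumulated weight per value of A
    for _ in range(n_samples):
        a = 't' if rng.random() <= P_A['t'] else 'f'
        weight[a] += P_Bt_given_A[a]   # w(a, e) = P(B = t | A = a)
    return weight['t'] / (weight['t'] + weight['f'])

# Exact posterior: 0.2*0.7 / (0.2*0.7 + 0.8*0.4)
print(lw_estimate(100_000))
```

Every sample is used here: none is discarded, in contrast to the rejection-style estimate of forward sampling.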

Gibbs Sampling
For notational convenience, assume from now on that for some l: E = {V_{l+1}, V_{l+2}, ..., V_n}. Write W for {V_1, ..., V_l}.
Principle: obtain a new sample from the previous sample by randomly changing the value of only one selected variable.

Procedure Gibbs sampling:
  v_0 = (v_{0,1}, ..., v_{0,l}) := arbitrary instantiation of W
  i := 1
  repeat forever:
    choose V_k ∈ W                        # deterministic or randomized
    sample v_{i,k} according to the distribution
      P(V_k | V_1 = v_{i-1,1}, ..., V_{k-1} = v_{i-1,k-1}, V_{k+1} = v_{i-1,k+1}, ..., V_l = v_{i-1,l}, E = e)
    set v_i = (v_{i-1,1}, ..., v_{i-1,k-1}, v_{i,k}, v_{i-1,k+1}, ..., v_{i-1,l})
    i := i + 1

Illustration
The process of Gibbs sampling can be understood as a random walk in the space of all instantiations with E = e. Reachable in one step: instantiations that differ from the current one in the value assignment of at most one variable (assuming randomized choice of the variable V_k).

The Sampling Step
The implementation of the sampling step

  sample v_{i,k} according to P(V_k | V_1 = v_{i-1,1}, ..., V_{k-1} = v_{i-1,k-1}, V_{k+1} = v_{i-1,k+1}, ..., V_l = v_{i-1,l}, E = e)

requires sampling from a conditional distribution. In this special case (all but one variable are instantiated) this is easy: just compute for each v ∈ sp(V_k) the probability

  P(V_1 = v_{i-1,1}, ..., V_{k-1} = v_{i-1,k-1}, V_k = v, V_{k+1} = v_{i-1,k+1}, ..., V_l = v_{i-1,l}, E = e)

(linear in the network size), and choose v_{i,k} according to these probabilities (normalized).
This can be simplified further by computing the distribution on sp(V_k) only in the Markov blanket of V_k, i.e. the subnetwork consisting of V_k, its parents, its children, and the parents of its children.

Convergence of Gibbs Sampling
Under certain conditions, the distribution of samples converges to the posterior distribution P(W | E = e):

  lim_{i→∞} P(v_i = v) = P(W = v | E = e)   for all v ∈ sp(W).

Sufficient conditions are:
- in the repeat loop of the Gibbs sampler, the variable V_k is selected randomly (with non-zero selection probability for all V_k ∈ W), and
- the Bayesian network has no zero entries in its CPTs.

Approximate Inference Using Gibbs Sampling
1. Start Gibbs sampling with some starting configuration v_0.
2. Run the sampler for N steps ("burn-in").
3. Run the sampler for M additional steps; use the relative frequency of state v in these M samples as an estimate for P(W = v | E = e).
Problems:
- How large must N be chosen? It is difficult to say how long the Gibbs sampler takes to converge!
- Even when sampling from the stationary distribution, the samples are not independent. As a result, the error cannot be bounded as a function of M using Chebyshev's inequality (or related methods).

Effect of Dependence
(Figure: a random walk v_0, ..., v_N, ..., v_{N+M} through the space of instantiations, with the region of instantiations satisfying A = a highlighted.)
If P(v_N = v) is close to P(W = v | E = e), then the probability that v_N lies in the highlighted region is close to P(A = a | E = e). This does not guarantee that the fraction of the samples v_N, v_{N+1}, ..., v_{N+M} lying in that region yields a good approximation to P(A = a | E = e)!

Multiple Starting Points
In practice, one tries to counteract these difficulties by restarting the Gibbs sampler several times, often with different starting points, and combining the samples from the separate runs.

Metropolis-Hastings Algorithm
Another way of constructing a random walk on sp(W): let {q(v, v′) | v, v′ ∈ sp(W)} be a set of transition probabilities over sp(W), i.e. q(v, ·) is a probability distribution for each v ∈ sp(W). The q(v, v′) are called proposal probabilities. Define

  α(v, v′) := min{ 1, [P(W = v′ | E = e) q(v′, v)] / [P(W = v | E = e) q(v, v′)] }
            = min{ 1, [P(W = v′, E = e) q(v′, v)] / [P(W = v, E = e) q(v, v′)] }

α(v, v′) is called the acceptance probability for the transition from v to v′.

Procedure Metropolis-Hastings sampling:
  v_0 = (v_{0,1}, ..., v_{0,l}) := arbitrary instantiation of W
  i := 1
  repeat forever:
    sample v′ according to the distribution q(v_{i-1}, ·)
    set accept to true with probability α(v_{i-1}, v′)
    if accept then v_i := v′ else v_i := v_{i-1}
    i := i + 1
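The procedure above can be sketched on a small discrete state space. The target is specified only up to normalization, matching the second form of α on the previous slide (the normalizing constant cancels); the proposal is uniform over all states, hence symmetric, so q cancels as well. The state names and unnormalized weights are invented for illustration:

```python
import random

states = ['00', '01', '10', '11']
unnorm = {'00': 1.0, '01': 2.0, '10': 3.0, '11': 4.0}  # P(W=v, E=e) up to a constant

def mh(n_steps, burn_in, seed=0):
    """Empirical state frequencies from a Metropolis-Hastings random walk."""
    rng = random.Random(seed)
    v = states[0]                       # arbitrary starting instantiation
    counts = {s: 0 for s in states}
    for i in range(n_steps):
        v_prime = rng.choice(states)    # proposal q(v, .) uniform, so q cancels
        alpha = min(1.0, unnorm[v_prime] / unnorm[v])   # acceptance probability
        if rng.random() <= alpha:
            v = v_prime                 # accept the transition
        # on rejection, v_i := v_{i-1}: the repeated state still counts
        if i >= burn_in:
            counts[v] += 1
    total = n_steps - burn_in
    return {s: counts[s] / total for s in states}

# Stationary distribution: weights / 10, i.e. 0.1, 0.2, 0.3, 0.4
print(mh(200_000, burn_in=5_000))
```

Note that a rejected proposal still produces a sample (the old state is repeated); forgetting this is a common implementation bug.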

Convergence of Metropolis-Hastings Sampling
Under certain conditions, the distribution of samples converges to the posterior distribution P(W | E = e). A sufficient condition is: q(v, v′) > 0 for all v, v′.
To obtain good performance, q should be chosen so as to obtain high acceptance probabilities, i.e. the quotients

  [P(W = v′ | E = e) q(v′, v)] / [P(W = v | E = e) q(v, v′)]

should be close to 1. Optimal (but usually not feasible): q(v, v′) = P(W = v′ | E = e). Generally: try to approximate the target distribution P(W | E = e) with q.

Loopy Belief Propagation
- A message-passing algorithm like junction tree propagation.
- Works directly on the Bayesian network structure (rather than on the junction tree).

Loopy Belief Propagation: Message Passing
A node sends a message to a neighbor by
- multiplying the incoming messages from all other neighbors onto the potential it holds, and
- marginalizing the result down to the separator.
(Figure: node C holds P(C | A, B); parents A and B send messages φ_A and φ_B, children D and E send φ_D and φ_E; C sends π_E(C) to E and λ_C(A) to A.)

  π_E(C) = φ_D · Σ_{A,B} P(C | A, B) φ_A φ_B
  λ_C(A) = Σ_{B,C} P(C | A, B) φ_B φ_D φ_E
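The message π_E(C) above can be computed numerically. The potentials below are invented for illustration (the slide gives no numbers); all variables are binary:

```python
# Incoming messages at node C, as tables over the respective variable.
phi_A = {0: 0.6, 1: 0.4}   # message from parent A
phi_B = {0: 0.3, 1: 0.7}   # message from parent B
phi_D = {0: 0.5, 1: 0.5}   # message from child D (a function of C)

# P(C | A, B), indexed as P_C[a][b][c]; each row is normalized over c.
P_C = {0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}},
       1: {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}}

# pi_E(C) = phi_D(C) * sum_{A,B} P(C | A, B) phi_A(A) phi_B(B):
# multiply in all incoming messages except the one from E, then
# marginalize out A and B.
pi_E = {}
for c in (0, 1):
    s = sum(P_C[a][b][c] * phi_A[a] * phi_B[b]
            for a in (0, 1) for b in (0, 1))
    pi_E[c] = phi_D[c] * s

print(pi_E)
```

With these numbers the inner sums over C add to 1 (φ_A and φ_B are normalized and P(C | A, B) is a CPT), so the two entries of π_E sum to the constant value of φ_D.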

Loopy Belief Propagation: Error Sources
A few observations:
- When calculating P(E), we treat C and D as being independent (in the junction tree, C and D would appear in the same separator).
- Evidence on a converging connection may cause the error to cycle.

Loopy Belief Propagation: In General
There is no guarantee of convergence, nor, in case of convergence, that the method converges to the correct distribution. However, it converges to the correct distribution surprisingly often! If the network is singly connected, convergence is guaranteed.

Literature
- R. M. Neal: Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993. http://omega.albany.edu:8008/neal.pdf
- P. Dagum, M. Luby: Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence 60, 1993.