Inference in Bayesian Networks


Lecture 7: Inference in Bayesian Networks
Marco Chiarandini, Department of Mathematics & Computer Science, University of Southern Denmark
Slides by Stuart Russell and Peter Norvig

Course Overview
- Introduction: Artificial Intelligence, Intelligent Agents
- Search: Uninformed Search, Heuristic Search
- Uncertain Knowledge and Reasoning: Probability and the Bayesian approach, Bayesian Networks, Hidden Markov Chains, Kalman Filters
- Learning: Supervised Learning (Bayesian Networks, Neural Networks), Unsupervised Learning (EM Algorithm), Reinforcement Learning
- Games and Adversarial Search: Minimax search and alpha-beta pruning, Multiagent search
- Knowledge Representation and Reasoning: Propositional logic, First-order logic, Inference, Planning

Bayesian networks, résumé
Bayesian networks encode local conditional independences:
Pr(X_i | X_1, ..., X_{i-1}) = Pr(X_i | Parents(X_i))
Thus the global semantics simplifies to a joint-probability factorization:
Pr(X_1, ..., X_n) = ∏_{i=1}^n Pr(X_i | X_1, ..., X_{i-1})   (chain rule)
                  = ∏_{i=1}^n Pr(X_i | Parents(X_i))   (by construction)
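As a quick numeric illustration of this factorization (not in the slides; the CPT values are the standard AIMA burglary-network numbers), the probability of one full assignment is just a product of CPT entries:

```python
# Sketch: the burglary network's joint probability factorizes into CPT entries.
# CPT numbers are the standard AIMA burglary-network values.
P_b = 0.001           # P(Burglary)
P_e = 0.002           # P(Earthquake)
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
P_j = {True: 0.90, False: 0.05}                      # P(JohnCalls | A)
P_m = {True: 0.70, False: 0.01}                      # P(MaryCalls | A)

# P(b, ~e, a, j, m) = P(b) P(~e) P(a | b, ~e) P(j | a) P(m | a)
joint = P_b * (1 - P_e) * P_a[(True, False)] * P_j[True] * P_m[True]
print(round(joint, 8))  # ≈ 0.00059102
```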


Inference tasks
Simple queries: compute the posterior marginal Pr(X_i | E = e), e.g., P(NoGas | Gauge = empty, Lights = on, Starts = false)
Conjunctive queries: Pr(X_i, X_j | E = e) = Pr(X_i | E = e) Pr(X_j | X_i, E = e)
Optimal decisions: decision networks include utility information; probabilistic inference is required for P(outcome | action, evidence)
Value of information: which evidence to seek next?
Sensitivity analysis: which probability values are most critical?
Explanation: why do I need a new starter motor?

Inference by enumeration
Sum out variables from the joint without actually constructing its explicit representation.
Simple query on the burglary network (B → A ← E, A → J, A → M):
Pr(B | j, m) = Pr(B, j, m)/P(j, m) = α Pr(B, j, m) = α Σ_e Σ_a Pr(B, e, a, j, m)
Rewrite full joint entries using products of CPT entries:
Pr(B | j, m) = α Σ_e Σ_a Pr(B) P(e) Pr(a | B, e) P(j | a) P(m | a)
             = α Pr(B) Σ_e P(e) Σ_a Pr(a | B, e) P(j | a) P(m | a)
Recursive depth-first enumeration: O(n) space, O(d^n) time.

Enumeration algorithm

function Enumeration-Ask(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayesian network with variables {X} ∪ E ∪ Y
  Q(X) ← a distribution over X, initially empty
  for each value x_i of X do
    Q(x_i) ← Enumerate-All(bn.Vars, e ∪ {X = x_i})
  return Normalize(Q(X))

function Enumerate-All(vars, e) returns a real number
  if Empty?(vars) then return 1.0
  Y ← First(vars)
  if Y has value y in e
    then return P(y | parents(Y)) × Enumerate-All(Rest(vars), e)
    else return Σ_y P(y | parents(Y)) × Enumerate-All(Rest(vars), e ∪ {Y = y})
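The pseudocode above can be sketched as runnable Python for the burglary network. This is an illustrative specialization, not the book's implementation: the CPTs are hard-coded (standard AIMA numbers) instead of being passed in as bn.

```python
def p(var, value, e):
    """P(var = value | parents(var)), read off the burglary-network CPTs."""
    if var == 'B':
        pt = 0.001
    elif var == 'E':
        pt = 0.002
    elif var == 'A':
        pt = {(True, True): 0.95, (True, False): 0.94,
              (False, True): 0.29, (False, False): 0.001}[(e['B'], e['E'])]
    elif var == 'J':
        pt = 0.90 if e['A'] else 0.05
    else:  # 'M'
        pt = 0.70 if e['A'] else 0.01
    return pt if value else 1.0 - pt

VARS = ['B', 'E', 'A', 'J', 'M']   # topological order

def enumerate_all(vars_, e):
    if not vars_:
        return 1.0
    Y, rest = vars_[0], vars_[1:]
    if Y in e:
        return p(Y, e[Y], e) * enumerate_all(rest, e)
    return sum(p(Y, y, {**e, Y: y}) * enumerate_all(rest, {**e, Y: y})
               for y in (True, False))

def enumeration_ask(X, e):
    q = {x: enumerate_all(VARS, {**e, X: x}) for x in (True, False)}
    z = sum(q.values())
    return {x: q[x] / z for x in q}

print(enumeration_ask('B', {'J': True, 'M': True}))
# Pr(B | j, m) ≈ ⟨0.284, 0.716⟩
```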

Evaluation tree
[Tree for Pr(B | j, m): branch on P(b), then P(e)/P(¬e), then P(a | b, e)/P(¬a | b, e), with leaves P(j | a) P(m | a) along each branch.]
Enumeration is inefficient because of repeated computation: e.g., it computes P(j | a) P(m | a) for each value of e.

Inference by variable elimination
Variable elimination: carry out summations right-to-left, storing intermediate results (factors) to avoid recomputation.
Pr(B | j, m) = α Pr(B) Σ_e P(e) Σ_a Pr(a | B, e) P(j | a) P(m | a)
             = α Pr(B) Σ_e P(e) Σ_a Pr(a | B, e) P(j | a) f_M(a)
             = α Pr(B) Σ_e P(e) Σ_a Pr(a | B, e) f_J(a) f_M(a)
             = α Pr(B) Σ_e P(e) Σ_a f_A(a, b, e) f_J(a) f_M(a)
             = α Pr(B) Σ_e P(e) f_ĀJM(b, e)   (sum out A)
             = α Pr(B) f_ĒĀJM(b)   (sum out E)
             = α f_B(b) × f_ĒĀJM(b)

Variable elimination: basic operations
1. Summing out a variable from a product of factors: move any constant factors outside the summation,
   Σ_x f_1 × ... × f_k = f_1 × ... × f_i Σ_x f_{i+1} × ... × f_k = f_1 × ... × f_i × f_X̄
   assuming f_1, ..., f_i do not depend on X.
2. Add up submatrices in the pointwise product of the remaining factors:
   f_1(x_1, ..., x_j, y_1, ..., y_k) × f_2(y_1, ..., y_k, z_1, ..., z_l) = f(x_1, ..., x_j, y_1, ..., y_k, z_1, ..., z_l)
   E.g., f_1(a, b) × f_2(b, c) = f(a, b, c).
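A minimal sketch of these two operations in Python, with a factor represented as a (variables, table) pair over boolean values; the representation and function names are illustrative, not a library API:

```python
from itertools import product

def pointwise_product(f1, f2):
    """Multiply two factors, aligning them on their shared variables."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + tuple(v for v in vars2 if v not in vars1)
    table = {}
    for vals in product([True, False], repeat=len(out_vars)):
        a = dict(zip(out_vars, vals))
        table[vals] = (t1[tuple(a[v] for v in vars1)] *
                       t2[tuple(a[v] for v in vars2)])
    return out_vars, table

def sum_out(var, f):
    """Eliminate var by adding up the entries that agree elsewhere."""
    vars_, t = f
    out_vars = tuple(v for v in vars_ if v != var)
    i = vars_.index(var)
    table = {}
    for vals, p in t.items():
        key = vals[:i] + vals[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return out_vars, table

# f1(a, b) × f2(b, c) = f(a, b, c); summing out B leaves a factor on (A, C)
f1 = (('A', 'B'), {(True, True): 0.3, (True, False): 0.7,
                   (False, True): 0.9, (False, False): 0.1})
f2 = (('B', 'C'), {(True, True): 0.2, (True, False): 0.8,
                   (False, True): 0.6, (False, False): 0.4})
f_ac = sum_out('B', pointwise_product(f1, f2))
print(f_ac[0])  # ('A', 'C')
```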

Irrelevant variables
Consider the query P(JohnCalls | Burglary = true):
P(J | b) = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(j | a) Σ_m P(m | a)
The sum over m is identically 1, so M is irrelevant to the query.
Theorem: Y is irrelevant unless Y ∈ Ancestors({X} ∪ E).
Here, X = JohnCalls, E = {Burglary}, and Ancestors({X} ∪ E) = {Alarm, Earthquake}, so MaryCalls is irrelevant.

Irrelevant variables contd.
Defn: the moral graph of a DAG Bayes net is obtained by marrying all parents and dropping the arrows.
Defn: A is m-separated from B by C iff A is separated from B by C in the moral graph.
Theorem: Y is irrelevant if it is m-separated from X by E.
For P(JohnCalls | Alarm = true), both Burglary and Earthquake are irrelevant.

Complexity of exact inference
Singly connected networks (or polytrees): any two nodes are connected by at most one (undirected) path; time and space cost (with variable elimination) are O(d^k n), hence linear in n when k is bounded by a constant.
Multiply connected networks: one can reduce 3SAT to exact inference ⇒ NP-hard; equivalent to counting 3SAT models ⇒ #P-complete.
Proof of this in one of the exercises for Thursday.

Inference by stochastic simulation
Basic idea:
- Draw N samples from a sampling distribution S
- Compute an approximate posterior probability P̂
- Show this converges to the true probability P
Outline:
- Sampling from an empty network
- Rejection sampling: reject samples disagreeing with evidence
- Likelihood weighting: use evidence to weight samples
- Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior

Sampling from an empty network

function Prior-Sample(bn) returns an event sampled from bn
  inputs: bn, a belief network specifying joint distribution Pr(X_1, ..., X_n)
  x ← an event with n elements
  for i = 1 to n do
    x_i ← a random sample from Pr(X_i | Parents(X_i)) given the values of Parents(X_i) in x
  return x

Also known as ancestor sampling.

Example: the sprinkler network
Structure: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass.
P(C) = 0.50
P(S | C): C = T: 0.10; C = F: 0.50
P(R | C): C = T: 0.80; C = F: 0.20
P(W | S, R): S,R = T,T: 0.99; T,F: 0.90; F,T: 0.90; F,F: 0.01
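A minimal sketch of Prior-Sample specialized to this network (the generic bn-driven loop is replaced by hard-coded CPT lookups in topological order):

```python
import random

def prior_sample(rng=random):
    """One ancestor sample from the sprinkler network, in topological order."""
    c = rng.random() < 0.5
    s = rng.random() < (0.10 if c else 0.50)
    r = rng.random() < (0.80 if c else 0.20)
    p_w = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.01}[(s, r)]
    w = rng.random() < p_w
    return {'Cloudy': c, 'Sprinkler': s, 'Rain': r, 'WetGrass': w}

# The long-run frequency of any event matches its prior probability;
# e.g. P(Cloudy = true) should come out near 0.5.
random.seed(0)
n = 10_000
count = sum(prior_sample()['Cloudy'] for _ in range(n))
print(count / n)  # ≈ 0.5
```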

Sampling from an empty network contd.
The probability that Prior-Sample generates a particular event is the true prior probability:
S_PS(x_1, ..., x_n) = ∏_{i=1}^n P(x_i | parents(X_i)) = P(x_1, ..., x_n)
E.g., S_PS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P(t, f, t, t)
Proof: let N_PS(x_1, ..., x_n) be the number of samples generated for event x_1, ..., x_n. Then
lim_{N→∞} P̂(x_1, ..., x_n) = lim_{N→∞} N_PS(x_1, ..., x_n)/N = S_PS(x_1, ..., x_n) = ∏_{i=1}^n P(x_i | parents(X_i)) = P(x_1, ..., x_n)
That is, estimates derived from Prior-Sample are consistent.
Shorthand: P̂(x_1, ..., x_n) ≈ P(x_1, ..., x_n)

Rejection sampling
P̂r(X | e) is estimated from the samples agreeing with e.

function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X | e)
  local variables: N, a vector of counts over X, initially zero
  for j = 1 to N do
    x ← Prior-Sample(bn)
    if x is consistent with e then
      N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N[X])

E.g., estimate Pr(Rain | Sprinkler = true) using 100 samples:
27 samples have Sprinkler = true; of these, 8 have Rain = true and 19 have Rain = false.
P̂r(Rain | Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩
Similar to a basic real-world empirical estimation procedure.
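A sketch of this procedure on the sprinkler network (CPTs hard-coded for illustration); even with a single evidence variable, about 70% of the samples are rejected here:

```python
import random

def prior_sample(rng):
    """Ancestor sample from the sprinkler network."""
    c = rng.random() < 0.5
    s = rng.random() < (0.10 if c else 0.50)
    r = rng.random() < (0.80 if c else 0.20)
    w = rng.random() < {(True, True): 0.99, (True, False): 0.90,
                        (False, True): 0.90, (False, False): 0.01}[(s, r)]
    return {'Cloudy': c, 'Sprinkler': s, 'Rain': r, 'WetGrass': w}

def rejection_sampling(query, evidence, n, rng):
    counts = {True: 0, False: 0}
    for _ in range(n):
        x = prior_sample(rng)
        if all(x[v] == val for v, val in evidence.items()):
            counts[x[query]] += 1
        # samples inconsistent with the evidence are simply discarded
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

rng = random.Random(42)
est = rejection_sampling('Rain', {'Sprinkler': True}, 100_000, rng)
print(est[True])  # exact answer is 0.3
```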

Analysis of rejection sampling
Rejection sampling returns consistent posterior estimates.
Proof:
P̂r(X | e) = α N_PS(X, e)   (algorithm defn.)
          = N_PS(X, e)/N_PS(e)   (normalized by N_PS(e))
          ≈ Pr(X, e)/P(e)   (property of Prior-Sample)
          = Pr(X | e)   (defn. of conditional probability)
Problem: hopelessly expensive if P(e) is small, and P(e) drops off exponentially with the number of evidence variables!

Likelihood weighting
Idea: fix the evidence variables, sample only the nonevidence variables, and weight each sample by the likelihood it accords the evidence.

function Likelihood-Weighting(X, e, bn, N) returns an estimate of P(X | e)
  local variables: W, a vector of weighted counts over X, initially zero
  for j = 1 to N do
    x, w ← Weighted-Sample(bn, e)
    W[x] ← W[x] + w where x is the value of X in x
  return Normalize(W[X])

function Weighted-Sample(bn, e) returns an event and a weight
  x ← an event with n elements; w ← 1
  for i = 1 to n do
    if X_i has a value x_i in e
      then w ← w × P(X_i = x_i | parents(X_i))
      else x_i ← a random sample from Pr(X_i | parents(X_i))
  return x, w
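The two functions can be sketched for the specific query Pr(Rain | Sprinkler = true, WetGrass = true) on the sprinkler network; the network is hard-coded here rather than passed in as bn, so this is an illustrative specialization of the pseudocode:

```python
import random

def weighted_sample(evidence, rng):
    """Sample nonevidence variables; weight by the evidence likelihoods."""
    w = 1.0
    c = rng.random() < 0.5
    s = evidence['Sprinkler']          # evidence: fixed, not sampled
    w *= (0.10 if c else 0.50) if s else (0.90 if c else 0.50)
    r = rng.random() < (0.80 if c else 0.20)
    wg = evidence['WetGrass']          # evidence: fixed, not sampled
    p_w = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.01}[(s, r)]
    w *= p_w if wg else 1.0 - p_w
    return {'Cloudy': c, 'Rain': r}, w

def likelihood_weighting(query, evidence, n, rng):
    W = {True: 0.0, False: 0.0}
    for _ in range(n):
        x, w = weighted_sample(evidence, rng)
        W[x[query]] += w
    z = W[True] + W[False]
    return {v: W[v] / z for v in W}

rng = random.Random(7)
est = likelihood_weighting('Rain', {'Sprinkler': True, 'WetGrass': True},
                           100_000, rng)
print(est[True])  # exact posterior ≈ 0.320
```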

Likelihood weighting example
P(Rain | Sprinkler = true, WetGrass = true) on the sprinkler network, with the CPTs as before: P(C) = 0.50; P(S | C) = 0.10/0.50; P(R | C) = 0.80/0.20; P(W | S, R) = 0.99/0.90/0.90/0.01.

Likelihood weighting analysis
Likelihood weighting returns consistent estimates.
The sampling probability for Weighted-Sample is
S_WS(z, e) = ∏_{i=1}^l P(z_i | parents(Z_i))
It pays attention to evidence in ancestors only, so it lies somewhere in between the prior and the posterior distribution.
The weight for a given sample z, e is
w(z, e) = ∏_{i=1}^m P(e_i | parents(E_i))
The weighted sampling probability is
S_WS(z, e) w(z, e) = ∏_{i=1}^l P(z_i | parents(Z_i)) × ∏_{i=1}^m P(e_i | parents(E_i)) = P(z, e)
But performance still degrades with many evidence variables, because a few samples then carry nearly all the total weight.

Summary
Approximate inference by LW:
- LW does poorly when there is lots of (late-in-the-order) evidence
- LW is generally insensitive to topology
- Convergence can be very slow with probabilities close to 1 or 0
- Can handle arbitrary combinations of discrete and continuous variables

Approximate inference using MCMC
State of network = current assignment to all variables. Generate the next state by sampling one variable given its Markov blanket. Sample each variable in turn, keeping the evidence fixed.

function MCMC-Ask(X, e, bn, N) returns an estimate of P(X | e)
  local variables: N[X], a vector of counts over X, initially zero
                   Z, the nonevidence variables in bn (hidden + query)
                   x, the current state of the network, initially copied from e
  initialize x with random values for the variables in Z
  for j = 1 to N do
    N[x] ← N[x] + 1 where x is the value of X in x
    for each Z_i in Z do
      sample the value of Z_i in x from Pr(Z_i | mb(Z_i)) given the values of MB(Z_i) in x
  return Normalize(N[X])

Can also choose a variable to sample at random each time.
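A sketch of MCMC-Ask specialized to Pr(Rain | Sprinkler = true, WetGrass = true) on the sprinkler network; the Markov-blanket conditionals are worked out by hand from the CPTs rather than computed generically, so treat this as an illustration of the algorithm, not a general implementation:

```python
import random

P_S = {True: 0.10, False: 0.50}                      # P(s | C)
P_R = {True: 0.80, False: 0.20}                      # P(r | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}    # P(w | S, R)

def p_cloudy_given_mb(r):
    # P(C | s, r) ∝ P(C) P(s | C) P(r | C), with Sprinkler = true fixed
    t = 0.5 * P_S[True] * (P_R[True] if r else 1 - P_R[True])
    f = 0.5 * P_S[False] * (P_R[False] if r else 1 - P_R[False])
    return t / (t + f)

def p_rain_given_mb(c):
    # P(R | c, s, w) ∝ P(R | c) P(w | s, R), with s = w = true fixed
    t = P_R[c] * P_W[(True, True)]
    f = (1 - P_R[c]) * P_W[(True, False)]
    return t / (t + f)

def gibbs(n, rng):
    c, r = True, True                  # arbitrary initial state
    count = 0
    for _ in range(n):                 # resample each variable in turn
        c = rng.random() < p_cloudy_given_mb(r)
        r = rng.random() < p_rain_given_mb(c)
        count += r
    return count / n

rng = random.Random(1)
est = gibbs(200_000, rng)
print(est)  # exact posterior ≈ 0.320
```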

The Markov chain
With Sprinkler = true and WetGrass = true fixed, there are four states, one for each assignment to Cloudy and Rain.
Wander about for a while, average what you see: a probabilistic finite state machine.

MCMC example contd.
Estimate Pr(Rain | Sprinkler = true, WetGrass = true): sample Cloudy or Rain given its Markov blanket, repeat; count the number of times Rain is true and false in the samples.
E.g., visit 100 states; 31 have Rain = true, 69 have Rain = false.
P̂r(Rain | Sprinkler = true, WetGrass = true) = Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩
Theorem: the Markov chain approaches a stationary distribution: the long-run fraction of time spent in each state is exactly proportional to its posterior probability.

Markov blanket sampling
The Markov blanket of Cloudy is {Sprinkler, Rain}; the Markov blanket of Rain is {Cloudy, Sprinkler, WetGrass}.
Main computational problems:
1. Difficult to tell if convergence has been achieved.
2. Can be wasteful if the Markov blanket is large: P(X_i | mb(X_i)) won't change much (law of large numbers).

Local semantics and Markov blanket
Local semantics: each node is conditionally independent of its nondescendants given its parents.
Each node is also conditionally independent of all others given its Markov blanket: parents + children + children's parents.

MCMC analysis: outline
- Transition probability q(x → x′)
- Occupancy probability π_t(x) at time t
- The equilibrium condition on π_t defines the stationary distribution π(x); note that the stationary distribution depends on the choice of q(x → x′)
- Pairwise detailed balance on states guarantees equilibrium
- Gibbs sampling transition probability: sample each variable given the current values of all the others ⇒ detailed balance with the true posterior
- For Bayesian networks, Gibbs sampling reduces to sampling conditioned on each variable's Markov blanket

Stationary distribution
π_t(x) = probability of being in state x at time t
π_{t+1}(x′) = probability of being in state x′ at time t + 1
π_{t+1} in terms of π_t and q(x → x′):
π_{t+1}(x′) = Σ_x π_t(x) q(x → x′)
Stationary distribution: π_t = π_{t+1} = π, i.e.,
π(x′) = Σ_x π(x) q(x → x′) for all x′
If π exists, it is unique (specific to q(x → x′)).
In equilibrium, expected outflow = expected inflow.

Detailed balance
Outflow = inflow for each pair of states:
π(x) q(x → x′) = π(x′) q(x′ → x) for all x, x′
Detailed balance ⇒ stationarity:
Σ_x π(x) q(x → x′) = Σ_x π(x′) q(x′ → x) = π(x′) Σ_x q(x′ → x) = π(x′)
MCMC algorithms are typically constructed by designing a transition probability q that is in detailed balance with the desired π.
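The implication can be checked numerically on a toy two-state chain (the numbers below are illustrative, chosen so that detailed balance holds by construction):

```python
import itertools

pi = [0.25, 0.75]          # target distribution over states 0 and 1
# Transition matrix built to satisfy pi[x] q[x][x'] = pi[x'] q[x'][x]:
q = [[0.4, 0.6],
     [0.2, 0.8]]

# Detailed balance holds for every pair of states...
for x, x2 in itertools.product(range(2), repeat=2):
    assert abs(pi[x] * q[x][x2] - pi[x2] * q[x2][x]) < 1e-12

# ...and therefore pi is stationary: pi(x') = sum_x pi(x) q(x -> x')
for x2 in range(2):
    inflow = sum(pi[x] * q[x][x2] for x in range(2))
    assert abs(inflow - pi[x2]) < 1e-12

print("detailed balance and stationarity hold")
```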

Gibbs sampling
Sample each variable in turn, given all the other variables.
When sampling X_i, let X̄_i be all the other nonevidence variables; the current values are x_i and x̄_i, and e is fixed. The transition probability is
q(x → x′) = q((x_i, x̄_i) → (x′_i, x̄_i)) = P(x′_i | x̄_i, e)
This gives detailed balance with the true posterior P(x | e):
π(x) q(x → x′) = P(x | e) P(x′_i | x̄_i, e) = P(x_i, x̄_i | e) P(x′_i | x̄_i, e)
               = P(x_i | x̄_i, e) P(x̄_i | e) P(x′_i | x̄_i, e)   (chain rule)
               = P(x_i | x̄_i, e) P(x′_i, x̄_i | e)   (chain rule backwards)
               = q(x′ → x) π(x′) = π(x′) q(x′ → x)

Summary
Exact inference by variable elimination:
- polytime on polytrees, NP-hard on general graphs
- space = time, very sensitive to topology
Approximate inference by LW, MCMC:
- Prior-Sampling and Rejection-Sampling become unusable as evidence grows
- LW does poorly when there is lots of (late-in-the-order) evidence
- LW and MCMC are generally insensitive to topology
- Convergence can be very slow with probabilities close to 1 or 0
- Can handle arbitrary combinations of discrete and continuous variables