Belief Networks for Probabilistic Inference

Liliana Mamani Sanchez, lmamanis@tcd.ie, October 27, 2015

Background
Last lecture we saw that using joint distributions for probabilistic inference presented problems:
- High time and space complexity: $O(d^n)$
- Knowledge acquisition bottleneck
Now we will introduce Bayesian networks, a mechanism that lowers complexity by exploiting conditional independence, and see how they can be used to perform:
- Exact inference
- Approximate inference

Independence
Two random variables A and B are (absolutely) independent iff
$P(A \mid B) = P(A)$ (1)
or
$P(A, B) = P(A \mid B) P(B) = P(A) P(B)$ (2)
e.g., A and B are two coin tosses.
If n Boolean variables are independent, the full joint is
$P(X_1, \ldots, X_n) = \prod_i P(X_i)$ (3)
hence can be specified by just n numbers.
Absolute independence is a very strong requirement, seldom met.

Conditional independence
Consider the dentist problem with three random variables: Toothache, Cavity, Catch (steel probe catches in my tooth).
The full joint distribution has $2^3 - 1 = 7$ independent entries.
If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:
$P(Catch \mid Toothache, Cavity) = P(Catch \mid Cavity)$ (4)
i.e., Catch is conditionally independent of Toothache given Cavity.
The same independence holds if I haven't got a cavity:
$P(Catch \mid Toothache, \neg Cavity) = P(Catch \mid \neg Cavity)$ (5)

Conditional independence contd.
Equivalent statements to (4):
$P(Toothache \mid Catch, Cavity) = P(Toothache \mid Cavity)$ (6) Why??
$P(Toothache, Catch \mid Cavity) = P(Toothache \mid Cavity) P(Catch \mid Cavity)$ (7) Why??
Full joint distribution can now be written as
$P(Toothache, Catch, Cavity) = P(Toothache, Catch \mid Cavity) P(Cavity)$
$= P(Toothache \mid Cavity) P(Catch \mid Cavity) P(Cavity)$
i.e., 2 + 2 + 1 = 5 independent numbers (equations (4) and (5) remove 2 of the 7 entries).
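
The factorisation of the dentist joint can be checked numerically. The sketch below builds all 8 joint entries from 5 numbers and then recovers the conditional independence of Catch and Toothache given Cavity from the joint alone. The probability values are illustrative assumptions, not figures from the lecture.

```python
from itertools import product

# Illustrative numbers (assumptions, not taken from the lecture).
p_cavity = 0.2
p_toothache_given = {True: 0.6, False: 0.1}  # P(Toothache=true | Cavity)
p_catch_given = {True: 0.9, False: 0.2}      # P(Catch=true | Cavity)

def bernoulli(p_true, value):
    """Probability of `value` for a Boolean variable with P(true) = p_true."""
    return p_true if value else 1.0 - p_true

# Full joint P(Toothache, Catch, Cavity) built from the 2 + 2 + 1 numbers.
joint = {}
for toothache, catch, cavity in product([True, False], repeat=3):
    joint[(toothache, catch, cavity)] = (
        bernoulli(p_toothache_given[cavity], toothache)
        * bernoulli(p_catch_given[cavity], catch)
        * bernoulli(p_cavity, cavity)
    )

def prob(predicate):
    """Sum the joint over all events satisfying the predicate."""
    return sum(p for event, p in joint.items() if predicate(*event))

def catch_given(toothache, cavity):
    """P(Catch=true | Toothache=toothache, Cavity=cavity), read off the joint."""
    num = prob(lambda t, c, cav: c and t == toothache and cav == cavity)
    den = prob(lambda t, c, cav: t == toothache and cav == cavity)
    return num / den

total = sum(joint.values())
print(catch_given(True, True), catch_given(False, True))
```

Both printed values agree with P(Catch | Cavity), whatever the value of Toothache, which is the content of equation (4).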

Belief networks
A simple, graphical notation for conditional independence assertions and hence for compact specification of full joint distributions.
Syntax:
- a set of nodes, one per variable
- a directed, acyclic graph (link ≈ "directly influences")
- a conditional distribution for each node given its parents: $P(X_i \mid Parents(X_i))$
In the simplest case, the conditional distribution is represented as a conditional probability table (CPT).

Example
I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
Network topology reflects causal knowledge:
[Figure: Burglary and Earthquake point to Alarm; Alarm points to JohnCalls and MaryCalls]
P(B) = .001
P(E) = .002
Alarm:
B  E  P(A)
T  T  .95
T  F  .94
F  T  .29
F  F  .001
JohnCalls:
A  P(J)
T  .90
F  .05
MaryCalls:
A  P(M)
T  .70
F  .01

Semantics
Global semantics defines the full joint distribution as the product of the local conditional distributions:
$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Parents(X_i))$
e.g., $P(J \wedge M \wedge A \wedge \neg B \wedge \neg E)$ is given by
$P(\neg B) P(\neg E) P(A \mid \neg B \wedge \neg E) P(J \mid A) P(M \mid A)$
Local semantics: each node is conditionally independent of its nondescendants given its parents.
Theorem: Local semantics $\Leftrightarrow$ global semantics
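
The global semantics can be tried out directly on the burglary network. This is a minimal sketch that multiplies the CPT entries from the example slide to get one joint entry. The helper names `pr` and `joint` are made up for illustration.

```python
from itertools import product

# CPTs of the burglary network, as given on the example slide.
P_B = 0.001                                    # P(Burglary=true)
P_E = 0.002                                    # P(Earthquake=true)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(Alarm=true | B, E)
P_J = {True: 0.90, False: 0.05}                # P(JohnCalls=true | Alarm)
P_M = {True: 0.70, False: 0.01}                # P(MaryCalls=true | Alarm)

def pr(p_true, value):
    """Probability of `value` for a Boolean variable with P(true) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """One entry of the full joint: product of the local conditionals."""
    return (pr(P_B, b) * pr(P_E, e) * pr(P_A[(b, e)], a)
            * pr(P_J[a], j) * pr(P_M[a], m))

# P(J and M and A and not B and not E)
p = joint(b=False, e=False, a=True, j=True, m=True)
# Sanity check: the 32 entries of the joint sum to 1.
total = sum(joint(*values) for values in product([True, False], repeat=5))
print(p, total)
```

The product 0.999 × 0.998 × 0.001 × 0.90 × 0.70 comes to roughly 0.00063.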

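Markov blanket
Each node is conditionally independent of all others given its Markov blanket: parents + children + children's parents
[Figure: node X with parents $U_1, \ldots, U_m$, children $Y_1, \ldots, Y_n$, and the children's other parents $Z_{1j}, \ldots, Z_{nj}$]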

Constructing belief networks
Need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics.
Choose an ordering of variables $X_1, \ldots, X_n$
for i = 1..n
  add $X_i$ to the network
  select parents from $X_1, \ldots, X_{i-1}$ such that
    $P(X_i \mid Parents(X_i)) = P(X_i \mid X_1, \ldots, X_{i-1})$
This choice of parents guarantees the global semantics:
$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1})$ (chain rule)
$= \prod_{i=1}^{n} P(X_i \mid Parents(X_i))$ (by construction)

Example
Suppose we choose the ordering M, J, A, B, E.
$P(J \mid M) = P(J)$? No
$P(A \mid J, M) = P(A)$? $P(A \mid J, M) = P(A \mid J)$? No
$P(B \mid A, J, M) = P(B)$? No
$P(B \mid A, J, M) = P(B \mid A)$? Yes
$P(E \mid B, A, J, M) = P(E)$? No
$P(E \mid B, A, J, M) = P(E \mid A)$? No
$P(E \mid B, A, J, M) = P(E \mid A, B)$? Yes
[Figure: the resulting network over MaryCalls, JohnCalls, Alarm, Burglary, Earthquake]
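
One way to see the cost of a poor ordering is to count CPT entries. The sketch below assumes Boolean variables, so a node with k parents needs 2^k independent numbers. The parent sets are read off the answers above and off the CPTs of the original example; the dictionary and function names are made up for illustration.

```python
def cpt_size(parents):
    """Independent numbers for one Boolean node: one per parent assignment."""
    return 2 ** len(parents)

def total_parameters(network):
    """Sum of CPT sizes over all nodes of a network given as node -> parents."""
    return sum(cpt_size(parents) for parents in network.values())

# Causal ordering B, E, A, J, M (the original example network).
causal = {
    "Burglary": [],
    "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"],
    "MaryCalls": ["Alarm"],
}

# Ordering M, J, A, B, E, with parents implied by the Yes/No answers.
mjabe = {
    "MaryCalls": [],
    "JohnCalls": ["MaryCalls"],
    "Alarm": ["JohnCalls", "MaryCalls"],
    "Burglary": ["Alarm"],
    "Earthquake": ["Alarm", "Burglary"],
}

full_joint = 2 ** len(causal) - 1  # independent entries of the full joint

print(total_parameters(causal), total_parameters(mjabe), full_joint)
```

Under these assumptions the causal ordering needs 10 numbers, the ordering M, J, A, B, E needs 13, and the unstructured joint needs 31.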

Example: Car diagnosis
Initial evidence: engine won't start
Testable variables (thin ovals), diagnosis variables (thick ovals)
Hidden variables (shaded) ensure sparse structure, reduce parameters
[Figure: network with nodes battery age, alternator broken, fanbelt broken, battery dead, no charging, battery flat, no oil, no gas, fuel line blocked, starter broken, lights, oil light, gas gauge, engine won't start]

Example: Car insurance
Predict claim costs (medical, liability, property) given data on application form (other unshaded nodes)
[Figure: network with nodes Age, SeniorTrain, GoodStudent, RiskAversion, SocioEcon, Mileage, VehicleYear, ExtraCar, DrivingSkill, MakeModel, DrivingHist, Antilock, DrivQuality, Airbag, CarValue, HomeBase, AntiTheft, Ruggedness, Accident, OwnDamage, Theft, Cushioning, OtherCost, OwnCost, MedicalCost, LiabilityCost, PropertyCost]

Compact conditional distributions
- CPT grows exponentially with the number of parents
- CPT becomes infinite with continuous-valued parent or child
Solution: canonical distributions that are defined compactly
Deterministic nodes are the simplest case: X = f(Parents(X)) for some function f
E.g., Boolean functions: $NorthAmerican \Leftrightarrow Canadian \vee US \vee Mexican$
E.g., numerical relationships among continuous variables: $\partial Level / \partial t = inflow + precipitation - outflow - evaporation$

CPTs: the discrete case
Noisy-OR models multiple noninteracting causes
1) Parents $U_1 \ldots U_k$ include all causes (can add leak node)
2) Independent failure probability $q_i$ for each cause alone
$\Rightarrow P(X \mid U_1 \ldots U_j, \neg U_{j+1} \ldots \neg U_k) = 1 - \prod_{i=1}^{j} q_i$
Cold  Flu  Malaria  P(Fever)  P(¬Fever)
F     F    F        0.0       1.0
F     F    T        0.9       0.1
F     T    F        0.8       0.2
F     T    T        0.98      0.02 = 0.2 × 0.1
T     F    F        0.4       0.6
T     F    T        0.94      0.06 = 0.6 × 0.1
T     T    F        0.88      0.12 = 0.6 × 0.2
T     T    T        0.988     0.012 = 0.6 × 0.2 × 0.1
Number of parameters linear in number of parents
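
The whole Fever table can be regenerated from the three failure probabilities. A minimal sketch, using the $q_i$ values that appear in the table (0.6 for Cold, 0.2 for Flu, 0.1 for Malaria); the function name `noisy_or` is made up for illustration.

```python
from itertools import product

# Failure probabilities q_i: P(no fever | only this cause is present).
q = {"Cold": 0.6, "Flu": 0.2, "Malaria": 0.1}

def noisy_or(active_causes, q):
    """P(effect = true) given the list of causes that are present."""
    p_all_fail = 1.0
    for cause in active_causes:
        p_all_fail *= q[cause]   # each active cause fails independently
    return 1.0 - p_all_fail

causes = list(q)                 # fixed column order: Cold, Flu, Malaria
cpt = {}
for values in product([False, True], repeat=len(causes)):
    active = [c for c, present in zip(causes, values) if present]
    cpt[values] = noisy_or(active, q)

for values, p_fever in cpt.items():
    print(values, round(p_fever, 3), round(1.0 - p_fever, 3))
```

Three numbers generate all eight rows, which is the sense in which the parameter count is linear in the number of parents.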

Hybrid (discrete+continuous) networks
Discrete (Subsidy? and Buys?); continuous (Harvest and Cost)
[Figure: network with nodes Subsidy?, Harvest, Cost, Buys?]
Options:
1. discretization: possibly large errors, large CPTs
2. finitely parameterized canonical families:
   2.1 Continuous variable, discrete+continuous parents (e.g., Cost)
   2.2 Discrete variable, continuous parents (e.g., Buys?)

Continuous child variables
Need one conditional density function for child variable given continuous parents, for each possible assignment to discrete parents
Most common is the linear Gaussian model, e.g.:
$P(Cost = c \mid Harvest = h, Subsidy? = true) = N(a_t h + b_t, \sigma_t)(c) = \frac{1}{\sigma_t \sqrt{2\pi}} \exp\left( -\frac{1}{2} \left( \frac{c - (a_t h + b_t)}{\sigma_t} \right)^2 \right)$
Mean Cost varies linearly with Harvest, variance is fixed
Linear variation is unreasonable over the full range but works OK if the likely range of Harvest is narrow
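
The linear Gaussian density is short enough to write out. In the sketch below the parameter values `a`, `b` and `sigma` are hypothetical, chosen only so the example runs; the lecture does not give numbers for $a_t$, $b_t$ or $\sigma_t$.

```python
import math

def linear_gaussian_pdf(c, h, a, b, sigma):
    """Density of Cost = c given Harvest = h: Gaussian with mean a*h + b."""
    mean = a * h + b
    z = (c - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical parameters for the Subsidy? = true case (assumptions).
a, b, sigma = -0.5, 10.0, 1.5
h = 4.0
mean = a * h + b                  # mean cost at this harvest level

peak = linear_gaussian_pdf(mean, h, a, b, sigma)

# Crude Riemann sum over mean +/- 8 sigma to check the density integrates to 1.
step = 0.001
n_steps = int(16 * sigma / step)
area = sum(linear_gaussian_pdf(mean - 8 * sigma + i * step, h, a, b, sigma)
           for i in range(n_steps)) * step
print(mean, peak, area)
```

A hybrid network would keep one such parameter triple per assignment of the discrete parents, for example one for Subsidy? = true and another for Subsidy? = false.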

Continuous child variables
[Figure: three surface plots over Cost and Harvest: P(Cost | Harvest, Subsidy? = true), P(Cost | Harvest, Subsidy? = false), and P(Cost | Harvest)]
All-continuous network with LG distributions $\Rightarrow$ full joint is a multivariate Gaussian
Discrete+continuous LG network is a conditional Gaussian network, i.e., a multivariate Gaussian over all continuous variables for each combination of discrete variable values

Discrete variable w/ continuous parents
Probability of Buys? given Cost should be a soft threshold:
[Figure: P(Buys? = false | Cost = c) plotted against Cost c]
Probit distribution uses integral of Gaussian:
$\Phi(x) = \int_{-\infty}^{x} N(0, 1)(x) \, dx$
$P(Buys? = true \mid Cost = c) = \Phi((-c + \mu)/\sigma)$
(Can view as hard threshold whose location is subject to noise)

Discrete variable contd.
Sigmoid (or logit) distribution also used in neural networks:
$P(Buys? = true \mid Cost = c) = \frac{1}{1 + \exp\left(-2 \frac{-c + \mu}{\sigma}\right)}$
Sigmoid has similar shape to probit but much longer tails:
[Figure: P(Buys? = false | Cost = c) plotted against Cost c for the sigmoid]
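
The probit and sigmoid soft thresholds can be compared side by side. This is a sketch with hypothetical values for `mu` and `sigma`; the standard normal CDF is computed from `math.erf`.

```python
import math

def std_normal_cdf(x):
    """Phi(x), the integral of the standard Gaussian up to x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def probit_buys(c, mu, sigma):
    """P(Buys? = true | Cost = c) under the probit model."""
    return std_normal_cdf((-c + mu) / sigma)

def sigmoid_buys(c, mu, sigma):
    """P(Buys? = true | Cost = c) under the sigmoid (logit) model."""
    return 1.0 / (1.0 + math.exp(-2.0 * (-c + mu) / sigma))

mu, sigma = 6.0, 1.0   # hypothetical threshold location and width

for c in (2.0, 6.0, 10.0):
    print(c, probit_buys(c, mu, sigma), sigmoid_buys(c, mu, sigma))
```

At c = mu both models give 0.5. Four widths above the threshold the sigmoid still assigns a noticeably larger probability of buying than the probit, which illustrates the longer tails.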

Back to Inference algorithms...
Typical inference tasks:
- Simple queries: compute posterior marginal $P(X_i \mid E = e)$, e.g., P(NoGas | Gauge = empty, Lights = on, Starts = false)
- Conjunctive queries: $P(X_i, X_j \mid E = e) = P(X_i \mid E = e) P(X_j \mid X_i, E = e)$
- Optimal decisions: decision networks include utility information; probabilistic inference required for P(outcome | action, evidence)
- Value of information: which evidence to seek next?
- Sensitivity analysis: which probability values are most critical?
- Explanation: why do I need a new starter motor?

Using BN structure to simplify enumeration
The algorithm we saw in the last lecture enumerated the entries of the full joint distribution in order to compute the conditional:
$P(X \mid e) = \alpha \sum_{y} P(X, e, y)$ (8)
... but according to their global semantics, belief nets give a complete representation of the joint distribution:
$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid parents(X_i))$ (9)
Therefore, any query can be answered in a BN by computing the sums of products of conditional probabilities of the network.

Inference by enumeration (in Bayesian nets)
Slightly better way to sum out variables from the joint without actually constructing its explicit representation.
Simple query on the burglary network:
$P(B \mid J = true, M = true) = P(B, J = true, M = true) / P(J = true, M = true)$
$= \alpha P(B, J = true, M = true)$
$= \alpha \sum_e \sum_a P(B, e, a, J = true, M = true)$
[Figure: the burglary network with its CPTs, as in the earlier example]
Now, rewrite full joint entries using product of CPT entries:
$P(B = true \mid J = true, M = true) = \alpha \sum_e \sum_a P(B = true) P(e) P(a \mid B = true, e) P(J = true \mid a) P(M = true \mid a)$
$= \alpha P(B = true) \sum_e P(e) \sum_a P(a \mid B = true, e) P(J = true \mid a) P(M = true \mid a)$

BN Enumeration algorithm
Exhaustive depth-first enumeration: O(n) space, $O(d^n)$ time
function BN-Enumeration(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, evidence specified as an event
          bn, a belief network specifying joint distribution $P(X_1, \ldots, X_n)$
  Q(X) ← a distribution over X
  for each value $x_i$ of X do
    extend e with value $x_i$ for X
    Q($x_i$) ← EnumerateAll(Vars[bn], e)
  return Normalize(Q(X))

function EnumerateAll(vars, e) returns a real number
  if Empty?(vars) then return 1.0
  else do
    Y ← First(vars)
    if Y has value y in e
      then return P(y | Pa(Y)) × EnumerateAll(Rest(vars), e)
      else return $\sum_y$ P(y | Pa(Y)) × EnumerateAll(Rest(vars), $e_y$)
           where $e_y$ is e extended with Y = y
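
The pseudocode translates almost line for line into Python. The sketch below hard-codes the burglary network from the example slide and restricts itself to Boolean variables. The names `NETWORK`, `ORDER`, `prob`, `enumerate_all` and `enumeration_ask` are choices made here, not part of the lecture.

```python
# Each variable maps to (parents, table of P(var=true | parent values)).
NETWORK = {
    "B": ((), {(): 0.001}),
    "E": ((), {(): 0.002}),
    "A": (("B", "E"), {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    "J": (("A",), {(True,): 0.90, (False,): 0.05}),
    "M": (("A",), {(True,): 0.70, (False,): 0.01}),
}
ORDER = ["B", "E", "A", "J", "M"]   # parents always precede children

def prob(var, value, event):
    """P(var = value | parents(var)), with parent values read from event."""
    parents, table = NETWORK[var]
    p_true = table[tuple(event[p] for p in parents)]
    return p_true if value else 1.0 - p_true

def enumerate_all(variables, event):
    """Sum of joint entries consistent with event, by depth-first recursion."""
    if not variables:
        return 1.0
    first, rest = variables[0], variables[1:]
    if first in event:
        return prob(first, event[first], event) * enumerate_all(rest, event)
    # Hidden variable: sum over both of its values.
    return sum(prob(first, v, event) * enumerate_all(rest, {**event, first: v})
               for v in (True, False))

def enumeration_ask(query, evidence):
    """Posterior distribution over the Boolean query variable."""
    dist = {v: enumerate_all(ORDER, {**evidence, query: v})
            for v in (True, False)}
    total = sum(dist.values())
    return {v: p / total for v, p in dist.items()}

posterior = enumeration_ask("B", {"J": True, "M": True})
print(posterior)
```

For the query P(B | J = true, M = true) this gives a probability of burglary of roughly 0.284.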

BN Enumeration algorithm's search tree
[Figure: evaluation tree for P(b | j, m), with branch values P(b) = .001; P(e) = .002, P(¬e) = .998; P(a | b, e) = .95, P(¬a | b, e) = .05, P(a | b, ¬e) = .94, P(¬a | b, ¬e) = .06; P(j | a) = .90, P(j | ¬a) = .05; P(m | a) = .70, P(m | ¬a) = .01]
Source of inefficiency: P(J | A) P(M | A) get computed twice, once for each value of e.

Inference by variable elimination
Basic idea: carry out summations right-to-left, storing intermediate results (factors) to avoid recomputation. E.g.:
$P(B \mid J = true, M = true) = \alpha \underbrace{P(B)}_{B} \sum_e \underbrace{P(e)}_{E} \sum_a \underbrace{P(a \mid B, e)}_{A} \underbrace{P(j \mid a)}_{J} \underbrace{P(m \mid a)}_{M}$
$= \alpha P(B) \sum_e P(e) \sum_a P(a \mid B, e) P(j \mid a) f_M(a)$
$= \alpha P(B) \sum_e P(e) \sum_a P(a \mid B, e) f_J(a) f_M(a)$
$= \alpha P(B) \sum_e P(e) \sum_a f_A(a, b, e) f_J(a) f_M(a)$
$= \alpha P(B) \sum_e P(e) f_{\bar{A}JM}(b, e)$ (sum out A)
$= \alpha P(B) f_{\bar{E}\bar{A}JM}(b)$ (sum out E)
$= \alpha f_B(b) \times f_{\bar{E}\bar{A}JM}(b)$
(NB: J = true is abbreviated as j, etc.)

Variable elimination: Basic operations
Pointwise product of factors $f_1$ and $f_2$:
$f_1(x_1, \ldots, x_j, y_1, \ldots, y_k) \times f_2(y_1, \ldots, y_k, z_1, \ldots, z_l) = f(x_1, \ldots, x_j, y_1, \ldots, y_k, z_1, \ldots, z_l)$
E.g., $f_1(a, b) \times f_2(b, c) = f(a, b, c)$
Summing out a variable from a product of factors: move any constant factors outside the summation:
$\sum_x f_1 \times \cdots \times f_k = f_1 \times \cdots \times f_i \sum_x f_{i+1} \times \cdots \times f_k = f_1 \times \cdots \times f_i \times f_{\bar{X}}$
assuming $f_1, \ldots, f_i$ do not depend on X
Note that vars which are not ancestors of query variables or evidence variables are irrelevant!

Variable elimination algorithm
function EliminationAsk(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, evidence specified as an event
          bn, a belief network specifying joint distribution $P(X_1, \ldots, X_n)$
  if X ∈ e then return observed point distribution for X
  factors ← []; vars ← Reverse(Vars[bn])
  for each var in vars do
    factors ← [MakeFactor(var, e) | factors]
    if var is a hidden variable then factors ← SumOut(var, factors)
  return Normalize(PointwiseProduct(factors))
NB: The order in which the variables are incorporated into factors matters. It is impractical to search for an optimal ordering, so heuristics are often employed.
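
The two factor operations are enough to replay the burglary query by hand. The sketch below is not the general EliminationAsk. It hard-codes the elimination steps for P(B | j, m) and represents a factor as a pair (variables, table); that representation and the helper names are assumptions made for illustration.

```python
from itertools import product

def make_factor(variables, fn):
    """Tabulate fn over all Boolean assignments to `variables`."""
    variables = tuple(variables)
    table = {vals: fn(dict(zip(variables, vals)))
             for vals in product((True, False), repeat=len(variables))}
    return variables, table

def pointwise_product(f1, f2):
    """Factor over the union of variables; entries are products."""
    (vars1, t1), (vars2, t2) = f1, f2
    variables = vars1 + tuple(v for v in vars2 if v not in vars1)
    return make_factor(variables, lambda s: t1[tuple(s[v] for v in vars1)]
                                            * t2[tuple(s[v] for v in vars2)])

def sum_out(var, factor):
    """Factor over the remaining variables, summing over both values of var."""
    variables, table = factor
    rest = tuple(v for v in variables if v != var)
    def value(assign):
        total = 0.0
        for x in (True, False):
            full = {**assign, var: x}
            total += table[tuple(full[v] for v in variables)]
        return total
    return make_factor(rest, value)

def pr(p_true, value):
    return p_true if value else 1.0 - p_true

P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
f_B = make_factor(("B",), lambda s: pr(0.001, s["B"]))
f_E = make_factor(("E",), lambda s: pr(0.002, s["E"]))
f_A = make_factor(("A", "B", "E"), lambda s: pr(P_A[(s["B"], s["E"])], s["A"]))
f_J = make_factor(("A",), lambda s: 0.90 if s["A"] else 0.05)  # evidence j
f_M = make_factor(("A",), lambda s: 0.70 if s["A"] else 0.01)  # evidence m

# Sum out A, then E, then multiply by the prior on B and normalize.
f_AJM = sum_out("A", pointwise_product(pointwise_product(f_A, f_J), f_M))
f_EAJM = sum_out("E", pointwise_product(f_E, f_AJM))
_, unnormalized = pointwise_product(f_B, f_EAJM)
total = sum(unnormalized.values())
posterior = {b: p / total for (b,), p in unnormalized.items()}
print(posterior)
```

The result matches plain enumeration (about 0.284 for burglary), but the product of the J and M factors is tabulated once instead of once per value of e.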

Complexity of exact inference
Singly connected networks (or polytrees):
- any two nodes are connected by at most one (undirected) path
- time and space cost of variable elimination are $O(d^k n)$
Multiply connected networks:
- can reduce 3SAT to exact inference $\Rightarrow$ NP-hard
- equivalent to counting 3SAT models $\Rightarrow$ #P-complete
[Figure: 3SAT reduction: root nodes A, B, C, D, each with prior 0.5; clause nodes 1. A ∨ B ∨ C, 2. C ∨ D ∨ ¬A, 3. B ∨ C ∨ ¬D; an AND node combining the three clauses]

Inference by stochastic simulation
Basic idea:
1. Draw N samples from a sampling distribution S
2. Compute an approximate posterior probability $\hat{P}$
3. Show this converges to the true probability P
Alternative ways of implementing it:
- Sampling from an empty network
- Rejection sampling: reject samples disagreeing with evidence
- Likelihood weighting: use evidence to weight samples
- MCMC: sample from a stochastic process whose stationary distribution is the true posterior

Sampling from an empty network
function PriorSample(bn) returns an event sampled from $P(X_1, \ldots, X_n)$ specified by bn
  x ← an event with n elements
  for i = 1 to n do
    $x_i$ ← a random sample from $P(X_i \mid Parents(X_i))$
  return x
[Figure: sprinkler network: Cloudy points to Sprinkler and Rain; Sprinkler and Rain point to WetGrass]
P(C) = .5
Sprinkler:
C  P(S)
T  .10
F  .50
Rain:
C  P(R)
T  .80
F  .20
WetGrass:
S  R  P(W)
T  T  .99
T  F  .90
F  T  .90
F  F  .00
Example run:
P(Cloudy) = ⟨0.5, 0.5⟩, sample true
P(Sprinkler | Cloudy = true) = ⟨0.1, 0.9⟩, sample false
P(Rain | Cloudy = true) = ⟨0.8, 0.2⟩, sample true
P(WetGrass | ¬Sprinkler, Rain) = ⟨0.9, 0.1⟩, sample true
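
PriorSample for the sprinkler network fits in a few lines. A minimal sketch using the CPT values from the slide; the explicit `random.Random` generator and the fixed seed are choices made here so that runs are repeatable.

```python
import random

# P(WetGrass=true | Sprinkler, Rain), as on the slide.
P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.90, (False, False): 0.00}

def prior_sample(rng):
    """Sample every variable in topological order, given its sampled parents."""
    cloudy = rng.random() < 0.5
    sprinkler = rng.random() < (0.10 if cloudy else 0.50)
    rain = rng.random() < (0.80 if cloudy else 0.20)
    wet_grass = rng.random() < P_WET[(sprinkler, rain)]
    return {"Cloudy": cloudy, "Sprinkler": sprinkler,
            "Rain": rain, "WetGrass": wet_grass}

rng = random.Random(0)            # fixed seed, for repeatability only
n = 20000
samples = [prior_sample(rng) for _ in range(n)]

# Estimate a prior marginal by counting.
p_rain = sum(s["Rain"] for s in samples) / n
print(p_rain)
```

The exact value is P(Rain) = 0.5 × 0.8 + 0.5 × 0.2 = 0.5, and the estimate approaches it as n grows.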

Sampling from an empty network contd.
Probability that PriorSample generates a particular event:
$S_{PS}(x_1 \ldots x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(X_i)) = P(x_1 \ldots x_n)$ (10)
i.e., the true prior probability.
Let $N_{PS}(Y = y)$ be the number of samples generated for which Y = y, for any set of variables Y.
Then $\hat{P}(Y = y) = N_{PS}(Y = y)/N$ and
$\lim_{N \to \infty} \hat{P}(Y = y) = \sum_h S_{PS}(Y = y, H = h) = \sum_h P(Y = y, H = h) = P(Y = y)$
That is, estimates derived from PriorSample are consistent.

Rejection sampling
$\hat{P}(X \mid e)$ estimated from samples agreeing with e
function RejectionSampling(X, e, bn, N) returns an approximation to P(X | e)
  N[X] ← a vector of counts over X, initially zero
  for j = 1 to N do
    x ← PriorSample(bn)
    if x is consistent with e then
      N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N[X])
E.g., estimate P(Rain | Sprinkler = true) using 100 samples:
27 samples have Sprinkler = true
Of these, 8 have Rain = true and 19 have Rain = false.
$\hat{P}$(Rain | Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩
Similar to a basic real-world empirical estimation procedure
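
Rejection sampling adds one filter on top of prior sampling. The sketch below repeats the sprinkler CPTs so that it runs on its own; the sample size and seed are arbitrary choices.

```python
import random

P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.90, (False, False): 0.00}

def prior_sample(rng):
    """One event sampled from the sprinkler network's prior."""
    cloudy = rng.random() < 0.5
    sprinkler = rng.random() < (0.10 if cloudy else 0.50)
    rain = rng.random() < (0.80 if cloudy else 0.20)
    wet_grass = rng.random() < P_WET[(sprinkler, rain)]
    return {"Cloudy": cloudy, "Sprinkler": sprinkler,
            "Rain": rain, "WetGrass": wet_grass}

def rejection_sampling(query, evidence, n, rng):
    """Estimate P(query | evidence) from the samples that agree with evidence."""
    counts = {True: 0, False: 0}
    for _ in range(n):
        sample = prior_sample(rng)
        if all(sample[var] == val for var, val in evidence.items()):
            counts[sample[query]] += 1      # keep only consistent samples
    accepted = counts[True] + counts[False]
    if accepted == 0:
        raise ValueError("no sample agreed with the evidence")
    return {v: c / accepted for v, c in counts.items()}, accepted

estimate, accepted = rejection_sampling(
    "Rain", {"Sprinkler": True}, 20000, random.Random(0))
print(estimate, accepted)
```

The exact answer for this network is P(Rain | Sprinkler = true) = 0.09 / 0.30 = 0.3. Only about 30% of the samples are kept here, and the kept fraction shrinks with P(e).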

Analysis of rejection sampling
$\hat{P}(X \mid e) = \alpha N_{PS}(X, e)$ (algorithm defn.)
$= N_{PS}(X, e) / N_{PS}(e)$ (normalized by $N_{PS}(e)$)
$\approx P(X, e) / P(e)$ (property of PriorSample)
$= P(X \mid e)$ (defn. of conditional probability)
Hence rejection sampling returns consistent posterior estimates.
Problem: hopelessly expensive if P(e) is small.

Approximate inference using MCMC
State of network = current assignment to all variables
Generate next state by sampling one variable given Markov blanket
Sample each variable in turn, keeping evidence fixed
function MCMC-Ask(X, e, bn, N) returns an approximation to P(X | e)
  local variables: N[X], a vector of counts over X, initially zero
                   Y, the nonevidence variables in bn
                   x, the current state of the network, initially copied from e
  initialize x with random values for the variables in Y
  for j = 1 to N do
    N[x] ← N[x] + 1 where x is the value of X in x
    for each $Y_i$ in Y do
      sample the value of $Y_i$ in x from $P(Y_i \mid MB(Y_i))$ given the values of $MB(Y_i)$ in x
  return Normalize(N[X])
Approaches stationary distribution: long-run fraction of time spent in each state is exactly proportional to its posterior probability.

MCMC Example
Estimate P(Rain | Sprinkler = true, WetGrass = true)
Sample Cloudy then Rain, repeat. Count number of times Rain is true and false in the samples.
Markov blanket of Cloudy is Sprinkler and Rain
Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass
[Figure: sprinkler network with Sprinkler = true and WetGrass = true fixed as evidence]
P(C) = .5
Sprinkler:
C  P(S)
T  .10
F  .50
Rain:
C  P(R)
T  .80
F  .20
WetGrass:
S  R  P(W)
T  T  .99
T  F  .90
F  T  .90
F  F  .00

MCMC example contd.
Random initial state: Cloudy = true and Rain = false
1. P(Cloudy | MB(Cloudy)) = P(Cloudy | Sprinkler, ¬Rain), sample false
2. P(Rain | MB(Rain)) = P(Rain | ¬Cloudy, Sprinkler, WetGrass), sample true
Visit 100 states: 31 have Rain = true, 69 have Rain = false
$\hat{P}$(Rain | Sprinkler = true, WetGrass = true) = Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩
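
The same query can be run as a small Gibbs sampler. The sketch below specialises MCMC-Ask to this one query, with Sprinkler = true and WetGrass = true fixed and Cloudy and Rain resampled in turn from their Markov blankets. The function names, seed and number of steps are choices made here.

```python
import random

P_S = {True: 0.10, False: 0.50}   # P(Sprinkler=true | Cloudy)
P_R = {True: 0.80, False: 0.20}   # P(Rain=true | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.00}  # P(WetGrass=true | S, R)

def pr(p_true, value):
    return p_true if value else 1.0 - p_true

def sample_cloudy(rain, rng):
    """Draw Cloudy from P(Cloudy | Sprinkler=true, Rain=rain)."""
    # Proportional to P(Cloudy) P(sprinkler | Cloudy) P(rain | Cloudy).
    w = {c: 0.5 * P_S[c] * pr(P_R[c], rain) for c in (True, False)}
    return rng.random() < w[True] / (w[True] + w[False])

def sample_rain(cloudy, rng):
    """Draw Rain from P(Rain | Cloudy, Sprinkler=true, WetGrass=true)."""
    # Proportional to P(Rain | Cloudy) P(wet | sprinkler, Rain).
    w = {r: pr(P_R[cloudy], r) * P_W[(True, r)] for r in (True, False)}
    return rng.random() < w[True] / (w[True] + w[False])

def gibbs_rain(n, rng):
    """Fraction of visited states with Rain = true."""
    cloudy, rain = True, False    # initial state used on the slide
    rain_count = 0
    for _ in range(n):
        rain_count += rain        # count the current state, then move on
        cloudy = sample_cloudy(rain, rng)
        rain = sample_rain(cloudy, rng)
    return rain_count / n

estimate = gibbs_rain(50000, random.Random(0))
print(estimate)
```

For these CPTs the exact posterior is 0.0891 / 0.2781, about 0.32, so a long run should land close to the ⟨0.31, 0.69⟩ reported for 100 states.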

MCMC analysis: Outline
- Transition probability $q(y \to y')$
- Occupancy probability $\pi_t(y)$ at time t
- Equilibrium condition on $\pi_t$ defines stationary distribution $\pi(y)$
  Note: stationary distribution depends on choice of $q(y \to y')$
- Pairwise detailed balance on states guarantees equilibrium
- Gibbs sampling transition probability: sample each variable given current values of all others $\Rightarrow$ detailed balance with the true posterior
- For Bayesian networks, Gibbs sampling reduces to sampling conditioned on each variable's Markov blanket

Further reading This lecture is based on [Russell and Norvig, 2003], where further details on Bayesian Networks can be found. See also [Bishop, 2006], chapter 8 in particular, for more on graphical models and their uses.

References
Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer.
Russell, S. J. and Norvig, P. (2003). Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, 2nd edition.