Lecture 7 Inference in Bayesian Networks Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Slides by Stuart Russell and Peter Norvig
Course Overview
  Introduction: Artificial Intelligence; Intelligent Agents
  Search: Uninformed Search; Heuristic Search
  Uncertain Knowledge and Reasoning: Probability and the Bayesian approach; Bayesian Networks; Hidden Markov Models; Kalman Filters
  Learning: Supervised Learning (Bayesian Networks, Neural Networks); Unsupervised Learning (EM Algorithm); Reinforcement Learning
  Games and Adversarial Search: Minimax search and Alpha-beta pruning; Multiagent search
  Knowledge Representation and Reasoning: Propositional logic; First-order logic; Inference; Planning
Bayesian networks, Resume
Encode local conditional independences:
  Pr(X_i | X_1, …, X_{i−1}) = Pr(X_i | Parents(X_i))
Thus the global semantics simplifies to the joint probability factorization:
  Pr(X_1, …, X_n) = ∏_{i=1}^n Pr(X_i | X_1, …, X_{i−1})   (chain rule)
                  = ∏_{i=1}^n Pr(X_i | Parents(X_i))       (by construction)
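As a quick numeric check of the factorization, here is a sketch computing one full joint entry of the textbook burglary network. The CPT values P(a | ¬b, ¬e) = 0.001 and the priors are the standard textbook numbers, not given on this slide.

```python
# Joint entry via the factored semantics (burglary network, textbook CPTs):
#   P(j, m, a, ¬b, ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
p_b, p_e = 0.001, 0.002              # priors P(b), P(e)
p_a_given_not_b_not_e = 0.001        # textbook value, not shown on the slide
p_j_given_a, p_m_given_a = 0.90, 0.70

joint = (p_j_given_a * p_m_given_a * p_a_given_not_b_not_e
         * (1 - p_b) * (1 - p_e))
# joint ≈ 0.000628
```

One multiplication per node, rather than a lookup in a table of size 2^5.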
Inference tasks
Simple queries: compute the posterior marginal Pr(X_i | E = e)
  e.g., P(NoGas | Gauge = empty, Lights = on, Starts = false)
Conjunctive queries: Pr(X_i, X_j | E = e) = Pr(X_i | E = e) Pr(X_j | X_i, E = e)
Optimal decisions: decision networks include utility information; probabilistic inference is required for P(outcome | action, evidence)
Value of information: which evidence to seek next?
Sensitivity analysis: which probability values are most critical?
Explanation: why do I need a new starter motor?
Inference by enumeration
Sum out variables from the joint without actually constructing its explicit representation.
Simple query on the burglary network:
  Pr(B | j, m) = Pr(B, j, m)/P(j, m) = α Pr(B, j, m) = α Σ_e Σ_a Pr(B, e, a, j, m)
Rewrite full joint entries using products of CPT entries:
  Pr(B | j, m) = α Σ_e Σ_a Pr(B) P(e) Pr(a | B, e) P(j | a) P(m | a)
               = α Pr(B) Σ_e P(e) Σ_a Pr(a | B, e) P(j | a) P(m | a)
Recursive depth-first enumeration: O(n) space, O(d^n) time
Enumeration algorithm

function Enumeration-Ask(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayesian network with variables {X} ∪ E ∪ Y
  Q(X) ← a distribution over X, initially empty
  for each value x_i of X do
    Q(x_i) ← Enumerate-All(bn.Vars, e ∪ {X = x_i})
  return Normalize(Q(X))

function Enumerate-All(vars, e) returns a real number
  if Empty?(vars) then return 1.0
  Y ← First(vars)
  if Y has value y in e
    then return P(y | parents(Y)) × Enumerate-All(Rest(vars), e)
    else return Σ_y P(y | parents(Y)) × Enumerate-All(Rest(vars), e ∪ {Y = y})
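A minimal sketch of the enumeration algorithm on the burglary network. The representation (dicts of CPTs keyed by parent-value tuples) is one illustrative choice among many; the CPT entries not visible on these slides (P(a | ¬b, e) = 0.29, P(a | ¬b, ¬e) = 0.001) are the standard textbook values.

```python
# Burglary network in topological order; CPTs store P(var = true | parents).
vars_order = ['B', 'E', 'A', 'J', 'M']
parents = {'B': [], 'E': [], 'A': ['B', 'E'], 'J': ['A'], 'M': ['A']}
cpt = {
    'B': {(): 0.001},
    'E': {(): 0.002},
    'A': {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},  # textbook values
    'J': {(True,): 0.90, (False,): 0.05},
    'M': {(True,): 0.70, (False,): 0.01},
}

def prob(var, value, event):
    """P(var = value | parents(var)), parents read from the partial event."""
    p_true = cpt[var][tuple(event[p] for p in parents[var])]
    return p_true if value else 1.0 - p_true

def enumerate_all(variables, event):
    if not variables:
        return 1.0
    Y, rest = variables[0], variables[1:]
    if Y in event:  # observed: multiply in its CPT entry and recurse
        return prob(Y, event[Y], event) * enumerate_all(rest, event)
    # hidden: sum out both values of Y
    return sum(prob(Y, y, event) * enumerate_all(rest, {**event, Y: y})
               for y in (True, False))

def enumeration_ask(X, evidence):
    q = {x: enumerate_all(vars_order, {**evidence, X: x})
         for x in (True, False)}
    norm = sum(q.values())
    return {x: v / norm for x, v in q.items()}

posterior = enumeration_ask('B', {'J': True, 'M': True})
# posterior[True] ≈ 0.284, i.e. P(B | j, m) ≈ <0.284, 0.716>
```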
Evaluation tree
[Figure: depth-first evaluation tree for P(b | j, m), branching on E and then A, with leaf factors P(j | a) P(m | a)]
Enumeration is inefficient: repeated computation
  e.g., it computes P(j | a) P(m | a) for each value of e
Inference by variable elimination
Variable elimination: carry out summations right-to-left, storing intermediate results (factors) to avoid recomputation
  Pr(B | j, m) = α Pr(B) Σ_e P(e) Σ_a Pr(a | B, e) P(j | a) P(m | a)
               = α Pr(B) Σ_e P(e) Σ_a Pr(a | B, e) P(j | a) f_M(a)
               = α Pr(B) Σ_e P(e) Σ_a Pr(a | B, e) f_J(a) f_M(a)
               = α Pr(B) Σ_e P(e) Σ_a f_A(a, b, e) f_J(a) f_M(a)
               = α Pr(B) Σ_e P(e) f_ĀJM(b, e)    (sum out A)
               = α Pr(B) f_ĒĀJM(b)               (sum out E)
               = α f_B(b) × f_ĒĀJM(b)
Variable elimination: Basic operations
Summing out a variable from a product of factors:
1. move any constant factors outside the summation:
     Σ_x f_1 × ⋯ × f_k = f_1 × ⋯ × f_i Σ_x f_{i+1} × ⋯ × f_k = f_1 × ⋯ × f_i × f_X̄
   assuming f_1, …, f_i do not depend on X
2. add up submatrices in the pointwise product of the remaining factors:
     f_1(x_1, …, x_j, y_1, …, y_k) × f_2(y_1, …, y_k, z_1, …, z_l) = f(x_1, …, x_j, y_1, …, y_k, z_1, …, z_l)
   E.g., f_1(a, b) × f_2(b, c) = f(a, b, c)
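The two basic operations can be sketched directly. This is an illustrative representation (a factor as a pair of a variable list and a table keyed by Boolean value tuples), not the book's pseudocode; the example numbers are made up for the check at the end.

```python
from itertools import product

def pointwise_product(f1, f2):
    """f1(X,Y) × f2(Y,Z) -> f(X,Y,Z), multiplying matching entries."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    table = {}
    for vals in product([True, False], repeat=len(out_vars)):
        asg = dict(zip(out_vars, vals))
        k1 = tuple(asg[v] for v in vars1)
        k2 = tuple(asg[v] for v in vars2)
        table[vals] = t1[k1] * t2[k2]
    return (out_vars, table)

def sum_out(var, f):
    """Sum a variable out of a factor by adding the matching sub-entries."""
    vars_, t = f
    i = vars_.index(var)
    out_vars = vars_[:i] + vars_[i + 1:]
    table = {}
    for key, p in t.items():
        new_key = key[:i] + key[i + 1:]
        table[new_key] = table.get(new_key, 0.0) + p
    return (out_vars, table)

# f1(A,B) × f2(B,C) = f(A,B,C); then sum out B (illustrative numbers)
f1 = (['A', 'B'], {(True, True): 0.3, (True, False): 0.7,
                   (False, True): 0.9, (False, False): 0.1})
f2 = (['B', 'C'], {(True, True): 0.2, (True, False): 0.8,
                   (False, True): 0.6, (False, False): 0.4})
f3 = pointwise_product(f1, f2)
g = sum_out('B', f3)
# g(a=T, c=T) = 0.3*0.2 + 0.7*0.6 = 0.48
```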
Irrelevant variables
Consider the query P(JohnCalls | Burglary = true):
  P(J | b) = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(j | a) Σ_m P(m | a)
The sum over m is identically 1; M is irrelevant to the query.

Theorem: Y is irrelevant unless Y ∈ Ancestors({X} ∪ E)
Here, X = JohnCalls, E = {Burglary}, and Ancestors({X} ∪ E) = {Alarm, Earthquake}, so MaryCalls is irrelevant.
Irrelevant variables contd.
Defn: the moral graph of a DAG Bayes net: marry all parents and drop arrows
Defn: A is m-separated from B by C iff A is separated from B by C in the moral graph

Theorem: Y is irrelevant if m-separated from X by E
For P(JohnCalls | Alarm = true), both Burglary and Earthquake are irrelevant.
Complexity of exact inference
Singly connected networks (or polytrees):
  any two nodes are connected by at most one (undirected) path
  time and space cost (with variable elimination) are O(d^k n)
  hence time and space are linear in n when k is bounded by a constant
Multiply connected networks:
  can reduce 3SAT to exact inference ⇒ NP-hard
  equivalent to counting 3SAT models ⇒ #P-complete
Proof of this in one of the exercises for Thursday.
Inference by stochastic simulation
Basic idea:
  Draw N samples from a sampling distribution S
  Compute an approximate posterior probability P̂
  Show this converges to the true probability P
Outline:
  Sampling from an empty network
  Rejection sampling: reject samples disagreeing with evidence
  Likelihood weighting: use evidence to weight samples
  Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior
Sampling from an empty network

function Prior-Sample(bn) returns an event sampled from bn
  inputs: bn, a belief network specifying joint distribution Pr(X_1, …, X_n)
  x ← an event with n elements
  for i = 1 to n do
    x_i ← a random sample from Pr(X_i | Parents(X_i)) given the values of Parents(X_i) in x
  return x

Ancestor sampling
Example
Sprinkler network: Cloudy → Sprinkler, Cloudy → Rain, {Sprinkler, Rain} → WetGrass

P(C) = 0.50

  C | P(S | C)        C | P(R | C)
  T | 0.10            T | 0.80
  F | 0.50            F | 0.20

  S  R | P(W | S, R)
  T  T | 0.99
  T  F | 0.90
  F  T | 0.90
  F  F | 0.01
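Prior-Sample on this network can be sketched as follows; the dict-based CPT representation is an illustrative choice, and the CPT numbers are exactly those of the tables above.

```python
import random

# Sprinkler network; CPTs store P(var = true | parents), as in the tables.
parents = {'C': [], 'S': ['C'], 'R': ['C'], 'W': ['S', 'R']}
cpt = {'C': {(): 0.50},
       'S': {(True,): 0.10, (False,): 0.50},
       'R': {(True,): 0.80, (False,): 0.20},
       'W': {(True, True): 0.99, (True, False): 0.90,
             (False, True): 0.90, (False, False): 0.01}}
order = ['C', 'S', 'R', 'W']  # topological order

def prior_sample():
    """Sample each variable in topological order given its sampled parents."""
    x = {}
    for var in order:
        p = cpt[var][tuple(x[pa] for pa in parents[var])]
        x[var] = random.random() < p
    return x

# Consistency check: P(Rain) = 0.5*0.8 + 0.5*0.2 = 0.5
random.seed(0)
samples = [prior_sample() for _ in range(10000)]
est = sum(s['R'] for s in samples) / len(samples)
```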
Sampling from an empty network contd.
The probability that Prior-Sample generates a particular event is the true prior probability:
  S_PS(x_1 … x_n) = ∏_{i=1}^n P(x_i | parents(X_i)) = P(x_1 … x_n)
E.g., S_PS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P(t, f, t, t)
Proof: let N_PS(x_1 … x_n) be the number of samples generated for event x_1, …, x_n. Then
  lim_{N→∞} P̂(x_1, …, x_n) = lim_{N→∞} N_PS(x_1, …, x_n)/N
                             = S_PS(x_1, …, x_n)
                             = ∏_{i=1}^n P(x_i | parents(X_i))
                             = P(x_1 … x_n)
That is, estimates derived from Prior-Sample are consistent.
Shorthand: P̂(x_1, …, x_n) ≈ P(x_1 … x_n)
Rejection sampling
P̂r(X | e) estimated from the samples agreeing with e

function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X | e)
  local variables: N, a vector of counts over X, initially zero
  for j = 1 to N do
    x ← Prior-Sample(bn)
    if x is consistent with e then
      N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N[X])

E.g., estimate Pr(Rain | Sprinkler = true) using 100 samples:
  27 samples have Sprinkler = true
  of these, 8 have Rain = true and 19 have Rain = false
  P̂r(Rain | Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩
Similar to a basic real-world empirical estimation procedure
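A sketch of rejection sampling for the query above, self-contained on the sprinkler network (CPTs as on the earlier slide); the exact true value here is P(Rain | Sprinkler = true) = 0.09/0.30 = 0.3.

```python
import random

# Sprinkler network (same CPTs as the example slide).
parents = {'C': [], 'S': ['C'], 'R': ['C'], 'W': ['S', 'R']}
cpt = {'C': {(): 0.50},
       'S': {(True,): 0.10, (False,): 0.50},
       'R': {(True,): 0.80, (False,): 0.20},
       'W': {(True, True): 0.99, (True, False): 0.90,
             (False, True): 0.90, (False, False): 0.01}}
order = ['C', 'S', 'R', 'W']

def prior_sample():
    x = {}
    for var in order:
        p = cpt[var][tuple(x[pa] for pa in parents[var])]
        x[var] = random.random() < p
    return x

def rejection_sampling(X, evidence, n):
    """Count query values only over the samples consistent with the evidence."""
    counts = {True: 0, False: 0}
    for _ in range(n):
        s = prior_sample()
        if all(s[v] == val for v, val in evidence.items()):
            counts[s[X]] += 1
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}

random.seed(1)
est = rejection_sampling('R', {'S': True}, 10000)
# est[True] should be near the exact answer 0.3; note that roughly 70% of
# the samples are rejected, which is the method's weakness.
```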
Analysis of rejection sampling
Rejection sampling returns consistent posterior estimates.
Proof:
  P̂r(X | e) = α N_PS(X, e)          (algorithm defn.)
            = N_PS(X, e)/N_PS(e)     (normalized by N_PS(e))
            ≈ Pr(X, e)/P(e)          (property of Prior-Sample)
            = Pr(X | e)              (defn. of conditional probability)
Problem: hopelessly expensive if P(e) is small
P(e) drops off exponentially with the number of evidence variables!
Likelihood weighting
Idea: fix the evidence variables, sample only the nonevidence variables, and weight each sample by the likelihood it accords the evidence.

function Likelihood-Weighting(X, e, bn, N) returns an estimate of P(X | e)
  local variables: W, a vector of weighted counts over X, initially zero
  for j = 1 to N do
    x, w ← Weighted-Sample(bn, e)
    W[x] ← W[x] + w where x is the value of X in x
  return Normalize(W[X])

function Weighted-Sample(bn, e) returns an event and a weight
  x ← an event with n elements; w ← 1
  for i = 1 to n do
    if X_i has a value x_i in e
      then w ← w × P(X_i = x_i | parents(X_i))
      else x_i ← a random sample from Pr(X_i | parents(X_i))
  return x, w
Likelihood weighting example
P(Rain | Sprinkler = true, WetGrass = true)
[Figure: the sprinkler network, with the same CPTs as on the earlier example slide]
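This example query can be sketched as follows (self-contained sprinkler network; the exact answer works out to 0.0891/0.2781 ≈ 0.320).

```python
import random

# Sprinkler network (same CPTs as the example slide).
parents = {'C': [], 'S': ['C'], 'R': ['C'], 'W': ['S', 'R']}
cpt = {'C': {(): 0.50},
       'S': {(True,): 0.10, (False,): 0.50},
       'R': {(True,): 0.80, (False,): 0.20},
       'W': {(True, True): 0.99, (True, False): 0.90,
             (False, True): 0.90, (False, False): 0.01}}
order = ['C', 'S', 'R', 'W']

def weighted_sample(evidence):
    """Fix evidence variables, sample the rest, accumulate the likelihood weight."""
    x, w = {}, 1.0
    for var in order:
        p = cpt[var][tuple(x[pa] for pa in parents[var])]
        if var in evidence:
            x[var] = evidence[var]
            w *= p if evidence[var] else 1.0 - p
        else:
            x[var] = random.random() < p
    return x, w

def likelihood_weighting(X, evidence, n):
    W = {True: 0.0, False: 0.0}
    for _ in range(n):
        x, w = weighted_sample(evidence)
        W[x[X]] += w
    total = sum(W.values())
    return {v: wt / total for v, wt in W.items()}

random.seed(2)
est = likelihood_weighting('R', {'S': True, 'W': True}, 10000)
# est[True] should be near the exact answer ≈ 0.320
```

Unlike rejection sampling, no sample is wasted here: every sample contributes, just with a smaller weight when it explains the evidence poorly.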
Likelihood weighting analysis
Likelihood weighting returns consistent estimates.
Sampling probability for Weighted-Sample is
  S_WS(z, e) = ∏_{i=1}^l P(z_i | parents(Z_i))
(pays attention to evidence in ancestors only, so it lies somewhere in between the prior and the posterior distribution)
Weight for a given sample z, e is
  w(z, e) = ∏_{i=1}^m P(e_i | parents(E_i))
Weighted sampling probability is
  S_WS(z, e) w(z, e) = ∏_{i=1}^l P(z_i | parents(Z_i)) ∏_{i=1}^m P(e_i | parents(E_i)) = P(z, e)
But performance still degrades with many evidence variables, because a few samples carry nearly all the total weight.
Summary
Approximate inference by LW:
  LW does poorly when there is lots of (late-in-the-order) evidence
  LW is generally insensitive to topology
  Convergence can be very slow with probabilities close to 1 or 0
  Can handle arbitrary combinations of discrete and continuous variables
Approximate inference using MCMC
State of network = current assignment to all variables.
Generate the next state by sampling one variable given its Markov blanket.
Sample each variable in turn, keeping the evidence fixed.

function MCMC-Ask(X, e, bn, N) returns an estimate of P(X | e)
  local variables: N[X], a vector of counts over X, initially zero
                   Z, the nonevidence variables in bn (hidden + query)
                   x, the current state of the network, initially copied from e
  initialize x with random values for the variables in Z
  for j = 1 to N do
    N[x] ← N[x] + 1 where x is the value of X in x
    for each Z_i in Z do
      sample the value of Z_i in x from Pr(Z_i | mb(Z_i)) given the values of MB(Z_i) in x
  return Normalize(N[X])

Can also choose a variable to sample at random each time.
The Markov chain
With Sprinkler = true, WetGrass = true, there are four states (the evidence is fixed; Cloudy and Rain vary).
[Figure: the four states of the sprinkler network and the transitions between them]
Wander about for a while, average what you see.
A probabilistic finite state machine.
MCMC example contd.
Estimate Pr(Rain | Sprinkler = true, WetGrass = true)
Sample Cloudy or Rain given its Markov blanket, repeat. Count the number of times Rain is true and false in the samples.
E.g., visit 100 states:
  31 have Rain = true, 69 have Rain = false
  P̂r(Rain | Sprinkler = true, WetGrass = true) = Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩

Theorem: the Markov chain approaches a stationary distribution: the long-run fraction of time spent in each state is exactly proportional to its posterior probability.
Markov blanket sampling
The Markov blanket of Cloudy is {Sprinkler, Rain}
The Markov blanket of Rain is {Cloudy, Sprinkler, WetGrass}
Main computational problems:
1) Difficult to tell if convergence has been achieved
2) Can be wasteful if the Markov blanket is large: P(X_i | mb(X_i)) won't change much (law of large numbers)
Local semantics and Markov blanket
Local semantics: each node is conditionally independent of its nondescendants given its parents.
Each node is conditionally independent of all others given its Markov blanket: parents + children + children's parents.
[Figure: a node X with parents U_1, …, U_m, children Y_1, …, Y_n, and children's parents Z_1j, …, Z_nj]
MCMC analysis: Outline
Transition probability q(x → x′)
Occupancy probability π_t(x) at time t
Equilibrium condition on π_t defines the stationary distribution π(x)
  Note: the stationary distribution depends on the choice of q(x → x′)
Pairwise detailed balance on states guarantees equilibrium
Gibbs sampling transition probability: sample each variable given the current values of all the others
  ⇒ detailed balance with the true posterior
For Bayesian networks, Gibbs sampling reduces to sampling conditioned on each variable's Markov blanket
Stationary distribution
π_t(x) = probability of being in state x at time t
π_{t+1}(x′) = probability of being in state x′ at time t + 1
π_{t+1} in terms of π_t and q(x → x′):
  π_{t+1}(x′) = Σ_x π_t(x) q(x → x′)
Stationary distribution: π_t = π_{t+1} = π
  π(x′) = Σ_x π(x) q(x → x′) for all x′
If π exists, it is unique (specific to q(x → x′))
In equilibrium, expected outflow = expected inflow
Detailed balance
Outflow = inflow for each pair of states:
  π(x) q(x → x′) = π(x′) q(x′ → x) for all x, x′
Detailed balance ⇒ stationarity:
  Σ_x π(x) q(x → x′) = Σ_x π(x′) q(x′ → x) = π(x′) Σ_x q(x′ → x) = π(x′)
MCMC algorithms are typically constructed by designing a transition probability q that is in detailed balance with the desired π
Gibbs sampling
Sample each variable in turn, given all the other variables.
When sampling X_i, let X̄_i be all the other nonevidence variables; the current values are x_i and x̄_i; e is fixed.
The transition probability is
  q(x → x′) = q(x_i, x̄_i → x′_i, x̄_i) = P(x′_i | x̄_i, e)
This gives detailed balance with the true posterior P(x | e):
  π(x) q(x → x′) = P(x | e) P(x′_i | x̄_i, e)
                 = P(x_i, x̄_i | e) P(x′_i | x̄_i, e)
                 = P(x_i | x̄_i, e) P(x̄_i | e) P(x′_i | x̄_i, e)   (chain rule)
                 = P(x_i | x̄_i, e) P(x′_i, x̄_i | e)               (chain rule backwards)
                 = q(x′ → x) π(x′) = π(x′) q(x′ → x)
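A sketch of Gibbs sampling for P(Rain | Sprinkler = true, WetGrass = true) on the sprinkler network. Each nonevidence variable is resampled from its Markov-blanket distribution, P(X_i | mb(X_i)) ∝ P(X_i | parents(X_i)) ∏_{Y ∈ children(X_i)} P(y | parents(Y)); no burn-in is used here, which a production implementation would add.

```python
import random

# Sprinkler network (same CPTs as the example slide), plus child lists.
parents = {'C': [], 'S': ['C'], 'R': ['C'], 'W': ['S', 'R']}
children = {'C': ['S', 'R'], 'S': ['W'], 'R': ['W'], 'W': []}
cpt = {'C': {(): 0.50},
       'S': {(True,): 0.10, (False,): 0.50},
       'R': {(True,): 0.80, (False,): 0.20},
       'W': {(True, True): 0.99, (True, False): 0.90,
             (False, True): 0.90, (False, False): 0.01}}

def prob(var, value, state):
    p = cpt[var][tuple(state[pa] for pa in parents[var])]
    return p if value else 1.0 - p

def mb_sample(var, state):
    """Sample var from P(var | Markov blanket): own CPT times children's CPTs."""
    weights = {}
    for v in (True, False):
        s = {**state, var: v}
        w = prob(var, v, s)
        for child in children[var]:
            w *= prob(child, s[child], s)
        weights[v] = w
    return random.random() < weights[True] / (weights[True] + weights[False])

def gibbs_ask(X, evidence, n):
    nonevidence = [v for v in parents if v not in evidence]
    state = {**evidence, **{v: random.random() < 0.5 for v in nonevidence}}
    counts = {True: 0, False: 0}
    for _ in range(n):            # one sweep = resample every hidden variable
        for var in nonevidence:
            state[var] = mb_sample(var, state)
        counts[state[X]] += 1
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

random.seed(3)
est = gibbs_ask('R', {'S': True, 'W': True}, 20000)
# est[True] should be near the exact answer ≈ 0.320
```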
Summary
Exact inference by variable elimination:
  polytime on polytrees, NP-hard on general graphs
  space = time, very sensitive to topology
Approximate inference by LW, MCMC:
  Prior-Sampling and Rejection-Sampling become unusable as the evidence grows
  LW does poorly when there is lots of (late-in-the-order) evidence
  LW and MCMC are generally insensitive to topology
  Convergence can be very slow with probabilities close to 1 or 0
  Can handle arbitrary combinations of discrete and continuous variables