Bayesian networks: approximate inference
Machine Intelligence
Thomas D. Nielsen
September 2008
Approximative inference · September 2008 · 1 / 25
Motivation
Because of the (worst-case) intractability of exact inference in Bayesian networks, we try to find more efficient approximate inference techniques: instead of computing the exact posterior P(A | E = e), compute an approximation P̂(A | E = e) with P̂(A | E = e) ≈ P(A | E = e).
Absolute/Relative Error
For p, p̂ ∈ [0, 1]:
- p̂ is an approximation for p with absolute error ε if |p − p̂| ≤ ε, i.e. p̂ ∈ [p − ε, p + ε].
- p̂ is an approximation for p with relative error ε if |1 − p̂/p| ≤ ε, i.e. p̂ ∈ [p(1 − ε), p(1 + ε)].
This definition is not always fully satisfactory, because it is not symmetric in p and p̂ and not invariant under the transition p ↦ (1 − p), p̂ ↦ (1 − p̂). Use with care!
When p̂1, p̂2 are approximations for p1, p2 with absolute error ε, then no error bounds follow for p̂1/p̂2 as an approximation for p1/p2.
When p̂1, p̂2 are approximations for p1, p2 with relative error ε, then p̂1/p̂2 approximates p1/p2 with relative error 2ε/(1 − ε), since the ratio (p̂1/p̂2)/(p1/p2) lies in [(1 − ε)/(1 + ε), (1 + ε)/(1 − ε)].
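The quotient bound can be checked numerically; a minimal sketch (the probabilities below are arbitrary illustrative values, not from the slides):

```python
# Worst case for the quotient: p̂1 overshoots while p̂2 undershoots, i.e.
# p̂1 = p1*(1 + eps), p̂2 = p2*(1 - eps), which gives relative error
# (1 + eps)/(1 - eps) - 1 = 2*eps/(1 - eps) for p̂1/p̂2.

def quotient_relative_error(eps):
    """Worst-case relative error of p̂1/p̂2 when both have relative error eps."""
    return (1 + eps) / (1 - eps) - 1

eps = 0.1
p1, p2 = 0.3, 0.6                                # arbitrary illustrative values
p1_hat, p2_hat = p1 * (1 + eps), p2 * (1 - eps)  # worst-case approximations
observed = abs(1 - (p1_hat / p2_hat) / (p1 / p2))
print(observed, quotient_relative_error(eps))    # both ≈ 0.2222
```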
Randomized Methods
Most methods for approximate inference are randomized algorithms that compute approximations P̂ from random samples of instantiations. We shall consider:
- Forward sampling
- Likelihood weighting
- Gibbs sampling
- Metropolis–Hastings algorithm
Forward Sampling
Observation: a Bayesian network can be used as a random generator that produces full instantiations V = v according to the distribution P(V).
Example: network A → B with
  P(A):          A=t: .2   A=f: .8
  P(B | A):      B=t  B=f
       A = t:     .7   .3
       A = f:     .4   .6
- Generate random numbers r1, r2 uniformly from [0, 1].
- Set A = t if r1 ≤ .2 and A = f otherwise.
- Depending on the value of A and r2, set B to t or f.
Generation of one random instantiation is linear in the size of the network.
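The generator described above can be sketched in a few lines of Python (a minimal sketch; the CPT encoding and names are ours):

```python
import random

# Forward sampling for the network A -> B from the slide:
# P(A=t) = .2, P(B=t | A=t) = .7, P(B=t | A=f) = .4.
P_A_t = 0.2
P_B_t_given = {"t": 0.7, "f": 0.4}

def sample_instantiation():
    """Sample a full instantiation (A, B) in topological order."""
    a = "t" if random.random() <= P_A_t else "f"           # uses r1
    b = "t" if random.random() <= P_B_t_given[a] else "f"  # uses r2
    return a, b

random.seed(0)
print(sample_instantiation())
```

Sampling in topological order guarantees that a parent's value is available when its child is sampled, so one instantiation costs time linear in the network size.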
Sampling Algorithm
Thus, we have a randomized algorithm S that produces possible outputs from sp(V) according to the distribution P(V). With samples s_1, …, s_N, define
  P̂(A = a | E = e) := #{i ∈ {1, …, N} : E = e and A = a in s_i} / #{i ∈ {1, …, N} : E = e in s_i}
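A minimal sketch of this estimator on the two-node network A → B, repeated here so the snippet is self-contained, with evidence B = t (helper names are ours):

```python
import random

def sample_instantiation():
    """Forward-sample (A, B): P(A=t)=.2, P(B=t|A=t)=.7, P(B=t|A=f)=.4."""
    a = "t" if random.random() <= 0.2 else "f"
    b = "t" if random.random() <= (0.7 if a == "t" else 0.4) else "f"
    return a, b

def estimate_p_a_given_b(n):
    """P-hat(A=t | B=t) = #{i : B=t and A=t in s_i} / #{i : B=t in s_i}."""
    hits = consistent = 0
    for _ in range(n):
        a, b = sample_instantiation()
        if b == "t":              # keep only samples that agree with B = t
            consistent += 1
            hits += (a == "t")
    return hits / consistent

random.seed(1)
print(estimate_p_a_given_b(50000))  # exact value: .2*.7/.46 ≈ 0.3043
```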
Forward Sampling: Illustration
[Figure: the N samples partitioned into those with E ≠ e (discarded), those with E = e but A ≠ a, and those with E = e and A = a.]
Approximation for P(A = a | E = e): #(samples with E = e, A = a) / #(samples with E = e).
Sampling from the conditional distribution
Problem of forward sampling: samples with E ≠ e are useless!
Idea: find a sampling algorithm S_c that produces outputs from sp(V) according to the distribution P(V | E = e).
A tempting approach: fix the variables in E to e and sample the non-evidence variables only!
Problem: only evidence on a variable's ancestors is taken into account!
Likelihood weighting
We would like to sample from (where pa_E(X) denotes the parents of X that lie in E)
  P(U, e) = ∏_{X ∈ U\E} P(X | pa(X), pa_E(X) = e) · ∏_{X ∈ E} P(X = e_X | pa(X), pa_E(X) = e),
but by applying forward sampling with fixed E we actually sample from:
  Sampling distribution = ∏_{X ∈ U\E} P(X | pa(X), pa_E(X) = e).
Solution: instead of letting each sample count as 1, weight it by
  w(x, e) = ∏_{X ∈ E} P(X = e_X | pa(X), pa_E(X) = e).
Likelihood weighting: example
Network A → B with P(A = t) = .2, P(B = t | A = t) = .7, P(B = t | A = f) = .4.
- Assume evidence B = t.
- Generate a random number r uniformly from [0, 1].
- Set A = t if r ≤ .2 and A = f otherwise.
- If A = t, let the sample count as w(t, t) = 0.7; otherwise as w(f, t) = 0.4.
With N samples a_1, …, a_N we get
  P̂(A = t | B = t) = Σ_{i: a_i = t} w(t, t) / Σ_{i=1}^N w(a_i, t).
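The weighted estimator for this example can be sketched as follows (same two-node network; function names are ours):

```python
import random

# Likelihood weighting for evidence B = t in the network A -> B:
# only A is sampled, and each sample is weighted by w(a, t) = P(B=t | A=a).

def lw_estimate(n):
    w_true = w_total = 0.0
    for _ in range(n):
        a = "t" if random.random() <= 0.2 else "f"  # sample A from P(A)
        w = 0.7 if a == "t" else 0.4                # w(a, t) = P(B=t | A=a)
        w_total += w
        if a == "t":
            w_true += w
    return w_true / w_total

random.seed(2)
print(lw_estimate(50000))  # exact value: .2*.7/(.2*.7 + .8*.4) ≈ 0.3043
```

Unlike plain forward sampling with rejection, no samples are discarded: every draw contributes its weight to the estimate.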
Gibbs Sampling
For notational convenience assume from now on that for some l: E = {V_{l+1}, V_{l+2}, …, V_n}. Write W for V_1, …, V_l.
Principle: obtain a new sample from the previous sample by randomly changing the value of only one selected variable.
Procedure Gibbs sampling:
  v_0 = (v_{0,1}, …, v_{0,l}) := arbitrary instantiation of W
  i := 1
  repeat forever:
    choose V_k ∈ W                        # deterministic or randomized
    generate v_{i,k} randomly according to the distribution
      P(V_k | V_1 = v_{i−1,1}, …, V_{k−1} = v_{i−1,k−1}, V_{k+1} = v_{i−1,k+1}, …, V_l = v_{i−1,l}, E = e)
    set v_i = (v_{i−1,1}, …, v_{i−1,k−1}, v_{i,k}, v_{i−1,k+1}, …, v_{i−1,l})
    i := i + 1
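A minimal sketch of the procedure on a hypothetical joint P(X, Y) over two binary variables, where the full conditionals are easy to tabulate (the joint table is invented for illustration; in a Bayesian network the conditional would be computed from the CPTs):

```python
import random

# Hypothetical joint P(X, Y) over binary variables, states "t"/"f".
P = {("t", "t"): 0.3, ("t", "f"): 0.1, ("f", "t"): 0.2, ("f", "f"): 0.4}

def conditional(var_index, other_value):
    """Full conditional of one variable given the other (normalized)."""
    probs = {}
    for v in ("t", "f"):
        key = (v, other_value) if var_index == 0 else (other_value, v)
        probs[v] = P[key]
    z = sum(probs.values())
    return {v: p / z for v, p in probs.items()}

def gibbs(n_steps, start=("t", "t")):
    state = list(start)
    samples = []
    for _ in range(n_steps):
        k = random.randrange(2)                  # randomized variable choice
        dist = conditional(k, state[1 - k])
        state[k] = "t" if random.random() <= dist["t"] else "f"
        samples.append(tuple(state))
    return samples

random.seed(3)
samples = gibbs(100000)
freq_x_t = sum(1 for x, _ in samples if x == "t") / len(samples)
print(round(freq_x_t, 2))  # true marginal is P(X=t) = 0.3 + 0.1 = 0.4
```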
Illustration
The process of Gibbs sampling can be understood as a random walk in the space of all instantiations with E = e.
Reachable in one step: instantiations that differ from the current one in the value of at most one variable (assuming a randomized choice of the variable V_k).
The sampling step
Implementing the sampling step
  generate v_{i,k} randomly according to the distribution P(V_k | V_1 = v_{i−1,1}, …, V_{k−1} = v_{i−1,k−1}, V_{k+1} = v_{i−1,k+1}, …, V_l = v_{i−1,l}, E = e)
requires sampling from a conditional distribution. In this special case (all but one variable are instantiated) this is easy: we just need to compute, for each v ∈ sp(V_k), the probability
  P(V_1 = v_{i−1,1}, …, V_{k−1} = v_{i−1,k−1}, V_k = v, V_{k+1} = v_{i−1,k+1}, …, V_l = v_{i−1,l}, E = e)
(linear in the network size), and choose v_{i,k} according to these probabilities (normalized).
This can be further simplified by computing the distribution over sp(V_k) only in the Markov blanket of V_k, i.e. the subnetwork consisting of V_k, its parents, its children, and the parents of its children.
Convergence of Gibbs Sampling
Under certain conditions, the distribution of the samples converges to the posterior distribution P(W | E = e):
  lim_{i→∞} P(v_i = v) = P(W = v | E = e)   (v ∈ sp(W)).
Sufficient conditions are:
- in the repeat loop of the Gibbs sampler, the variable V_k is randomly selected (with non-zero selection probability for every V_k ∈ W), and
- the Bayesian network has no zero entries in its CPTs.
Approximate Inference using Gibbs Sampling
1. Start Gibbs sampling with some starting configuration v_0.
2. Run the sampler for N steps ("burn-in").
3. Run the sampler for M additional steps; use the relative frequency of state v in these M samples as an estimate for P(W = v | E = e).
Problems:
- How large must N be chosen? It is difficult to say how long it takes for the Gibbs sampler to converge!
- Even when sampling is from the stationary distribution, the samples are not independent. Result: the error cannot be bounded as a function of M using Chebyshev's inequality (or related methods).
Effect of dependence
Suppose P(v_N = v) is close to P(W = v | E = e): then the probability that v_N lies in the red region is close to P(A = a | E = e). This does not guarantee that the fraction of the samples v_N, v_{N+1}, …, v_{N+M} lying in the red region yields a good approximation to P(A = a | E = e)!
[Figure: a single random walk v_0 → v_N → … → v_{N+M} that lingers in one part of the state space.]
Multiple starting points
In practice, one tries to counteract these difficulties by restarting the Gibbs sampler several times (often with different starting points):
[Figure: three independent runs v_0 → v_N → v_{N+M}, each from a different starting point.]
Metropolis–Hastings Algorithm
Another way of constructing a random walk on sp(W):
Let {q(v, v′) : v, v′ ∈ sp(W)} be a set of transition probabilities over sp(W), i.e. q(v, ·) is a probability distribution for each v ∈ sp(W). The q(v, v′) are called proposal probabilities. Define
  α(v, v′) := min{ 1, [P(W = v′ | E = e) · q(v′, v)] / [P(W = v | E = e) · q(v, v′)] }
            = min{ 1, [P(W = v′, E = e) · q(v′, v)] / [P(W = v, E = e) · q(v, v′)] }
α(v, v′) is called the acceptance probability for the transition from v to v′.
Procedure Metropolis–Hastings sampling:
  v_0 = (v_{0,1}, …, v_{0,l}) := arbitrary instantiation of W
  i := 1
  repeat forever:
    sample v′ according to the distribution q(v_{i−1}, ·)
    set accept to true with probability α(v_{i−1}, v′)
    if accept: v_i := v′ else: v_i := v_{i−1}
    i := i + 1
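A sketch of the procedure on a hypothetical joint table over two binary variables, using a uniform proposal q over all four states (symmetric, so q cancels in α):

```python
import random

# Hypothetical target distribution over sp(W) (invented for illustration).
P = {("t", "t"): 0.3, ("t", "f"): 0.1, ("f", "t"): 0.2, ("f", "f"): 0.4}
STATES = list(P)

def mh(n_steps, start=("t", "t")):
    state = start
    samples = []
    for _ in range(n_steps):
        proposal = random.choice(STATES)          # uniform, symmetric proposal
        alpha = min(1.0, P[proposal] / P[state])  # acceptance probability
        if random.random() <= alpha:
            state = proposal                      # accept: move to proposal
        samples.append(state)                     # reject: keep current state
    return samples

random.seed(4)
samples = mh(100000)
freq = sum(1 for s in samples if s == ("f", "f")) / len(samples)
print(round(freq, 2))  # stationary probability of ("f","f") is 0.4
```

Note that a rejected proposal still produces a sample (the old state is repeated); dropping rejections would bias the estimate.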
Convergence of Metropolis–Hastings Sampling
Under certain conditions, the distribution of the samples converges to the posterior distribution P(W | E = e). A sufficient condition is: q(v, v′) > 0 for all v, v′.
To obtain good performance, q should be chosen so as to obtain high acceptance probabilities, i.e. the quotients
  [P(W = v′ | E = e) · q(v′, v)] / [P(W = v | E = e) · q(v, v′)]
should be close to 1.
Optimal (but usually not feasible): q(v, v′) = P(W = v′ | E = e). Generally: try to approximate the target distribution P(W | E = e) with q.
Loopy belief propagation
- A message passing algorithm like junction tree propagation.
- Works directly on the Bayesian network structure (rather than on a junction tree).
Loopy belief propagation: message passing
A node sends a message to a neighbor by
- multiplying the incoming messages from all other neighbors onto the potential it holds, and
- marginalizing the result down to the separator.
[Figure: node C holds P(C | A, B), with parents A, B and children D, E; incoming messages φ_A, φ_B, φ_D, φ_E; outgoing messages π_E(C) towards E and λ_C(A) towards A.]
  π_E(C) = φ_D · Σ_{A,B} P(C | A, B) φ_A φ_B
  λ_C(A) = Σ_{B,C} P(C | A, B) φ_B φ_D φ_E
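The message π_E(C) above can be computed directly; a sketch with invented binary potentials (the CPT and message values are illustrative, not from the slides):

```python
# pi_E(C) = phi_D(C) * sum_{A,B} P(C | A, B) * phi_A(A) * phi_B(B):
# multiply the incoming messages from A, B and D onto P(C | A, B),
# then marginalize the A, B dimensions away, leaving a function of C.

# Hypothetical CPT: P_C[(a, b)][c] = P(C=c | A=a, B=b), states coded 0/1.
P_C = {
    (0, 0): [0.9, 0.1], (0, 1): [0.6, 0.4],
    (1, 0): [0.3, 0.7], (1, 1): [0.2, 0.8],
}
phi_A = [0.5, 0.5]   # incoming message from A (hypothetical values)
phi_B = [0.8, 0.2]   # incoming message from B
phi_D = [0.6, 0.4]   # incoming message from D, a function of C

def pi_E(c):
    """Message from C towards E, evaluated at C = c."""
    s = sum(P_C[(a, b)][c] * phi_A[a] * phi_B[b]
            for a in (0, 1) for b in (0, 1))
    return phi_D[c] * s

print([pi_E(0), pi_E(1)])  # [0.336, 0.176] up to float rounding
```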
Loopy belief propagation: error sources
A few observations:
- When calculating P(E) we treat C and D as being independent (in the junction tree, C and D would appear in the same separator).
- Evidence on a converging connection may cause the error to cycle.
Loopy belief propagation: in general
There is no guarantee of convergence, nor, in case of convergence, that the method will converge to the correct distribution. However, it converges to the correct distribution surprisingly often! If the network is singly connected, convergence is guaranteed.
Literature
R.M. Neal: Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993. http://omega.albany.edu:8008/neal.pdf
P. Dagum, M. Luby: Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence 60, 1993.