Sampling Algorithms for Probabilistic Graphical Models


1 Sampling Algorithms for Probabilistic Graphical Models
Vibhav Gogate, University of Washington
References: Chapter 12 of Probabilistic Graphical Models: Principles and Techniques by Daphne Koller and Nir Friedman; Chapters 1 and 2 of Vibhav Gogate's thesis.

2 Bayesian Networks: Common Inference Tasks
CPTs: $P(X_i \mid pa(X_i))$
Joint distribution: $P(X) = \prod_{i=1}^{N} P(X_i \mid pa(X_i))$
Example: $P(D, I, G, S, L) = P(D)\,P(I)\,P(G \mid D, I)\,P(S \mid I)\,P(L \mid G)$
Probability of evidence: $P(L = l^0, S = s^1) = ?$
Posterior marginal estimation: $P(D = d^1 \mid L = l^0, S = s^1) = ?$

3 Markov Networks: Common Inference Tasks
Edge potentials: $F_{i,j}$; node potentials: $F_i$
Joint distribution: $P(x) = \frac{1}{\Gamma} \prod_{(i,j) \in E} F_{i,j}(x) \prod_{i \in V} F_i(x)$
Compute the partition function: $\Gamma = ?$
Posterior marginal estimation: $P(D = d^1 \mid I = i^1) = ?$

4 Inference Tasks: Definitions
Probability of evidence (or the partition function):
$P(E = e) = \sum_{X \setminus E} \prod_{i=1}^{n} P(X_i \mid pa(X_i))\big|_{E=e}$
$Z = \sum_{x} \prod_{i} \phi_i(x)$
Posterior marginals (belief updating):
$P(X_i = x_i \mid E = e) = \frac{P(X_i = x_i, E = e)}{P(E = e)} = \frac{\sum_{X \setminus E} \prod_{i=1}^{n} P(X_i \mid pa(X_i))\big|_{E=e,\,X_i = x_i}}{\sum_{X \setminus E} \prod_{i=1}^{n} P(X_i \mid pa(X_i))\big|_{E=e}}$

5 Why Approximate Inference?
Both problems are #P-complete: computationally intractable in general, so there is no hope of exact inference at scale.
A tractable class: models whose treewidth is small (< 25). Most real-world problems have high treewidth.
In many applications, approximations are sufficient: even if the exact value of $P(X_i = x_i \mid E = e)$ is out of reach, an approximate answer such as $\hat P(X_i = x_i \mid E = e) = 0.3$ may be all that is needed for a decision rule like "buy the stock $X_i$ if $P(X_i = x_i \mid E = e)$ is below a threshold".

6 What we will cover today
- Sampling fundamentals
- Monte Carlo techniques: rejection sampling, likelihood weighting, importance sampling
- Markov chain Monte Carlo techniques: Metropolis-Hastings, Gibbs sampling
- Advanced schemes: advanced importance sampling schemes, Rao-Blackwellisation

7 What is a sample and how to generate one?
Given a set of variables $X = \{X_1, \ldots, X_n\}$, a sample is an instantiation (an assignment to all variables): $x^t = (x_1^t, \ldots, x_n^t)$.
Algorithm to draw a sample from a univariate distribution $P(X_i)$ with domain $\{x_i^0, \ldots, x_i^{k-1}\}$:
1. Divide the real line $[0, 1]$ into $k$ intervals such that the width of the $j$-th interval is proportional to $P(X_i = x_i^j)$.
2. Draw a random number $r \in [0, 1]$.
3. Determine the interval $j$ in which $r$ lies; output $x_i^j$.
Example: the domain of $X_i$ is $\{x_i^0, x_i^1, x_i^2, x_i^3\}$ and $P(X_i) = (0.3, 0.25, 0.27, 0.18)$. Given a random number $r$, which $x_i^j$ is output?
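The three-step interval procedure above can be sketched directly in code. The distribution is the slide's example; the specific $r$ shown below is only an illustration:

```python
import random

def sample_categorical(probs, r=None):
    """Draw one value from a categorical distribution by partitioning
    [0, 1] into intervals whose widths equal the probabilities."""
    if r is None:
        r = random.random()            # step 2: draw r in [0, 1]
    cum = 0.0
    for j, p in enumerate(probs):      # step 3: find the interval of r
        cum += p
        if r < cum:
            return j
    return len(probs) - 1              # guard against round-off at r = 1.0

# Slide's example: P(X_i) = (0.3, 0.25, 0.27, 0.18).
# Intervals: [0, 0.3), [0.3, 0.55), [0.55, 0.82), [0.82, 1].
print(sample_categorical([0.3, 0.25, 0.27, 0.18], r=0.62))  # -> 2
```

With $r = 0.62$ the cumulative widths pass 0.3 and 0.55 before reaching 0.82, so the sampler outputs the third value $x_i^2$.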

8 Sampling from a Bayesian Network (Logic Sampling)
Sample variables one by one in a topological order (parents of a node before the node).
Sample Difficulty from $P(D)$: given a random number $r$, $D = ?$
Sample Intelligence from $P(I)$: given a random number $r$, $I = ?$

9 Sampling from a Bayesian Network (Logic Sampling)
Sample variables one by one in a topological order (parents of a node before the node).
Sample Difficulty from $P(D)$: $D = d^1$.
Sample Intelligence from $P(I)$: $I = i^0$.
Sample Grade from $P(G \mid i^0, d^1)$: $r = 0.281$, $G = ?$
Sample SAT from $P(S \mid i^0)$: $r = 0.992$, $S = ?$

10 Sampling from a Bayesian Network (Logic Sampling)
Sample variables one by one in a topological order (parents of a node before the node).
Sample Difficulty from $P(D)$: $D = d^1$.
Sample Intelligence from $P(I)$: $I = i^0$.
Sample Grade from $P(G \mid i^0, d^1)$: $r = 0.281$, $G = g^1$.
Sample SAT from $P(S \mid i^0)$: $r = 0.992$, $S = s^1$.
Sample Letter from $P(L \mid g^1)$: $r = 0.034$, $L = l^0$.
Resulting sample: $(d^1, i^0, g^1, s^1, l^0)$.
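Logic (forward) sampling on the five-variable student network can be sketched as below. All variables are made binary here and the CPT entries are placeholders for illustration, not the values from the lecture's figure:

```python
import random

def draw(probs, rng):
    """Sample an index from a categorical distribution via the
    interval method."""
    r, cum = rng.random(), 0.0
    for j, p in enumerate(probs):
        cum += p
        if r < cum:
            return j
    return len(probs) - 1

def logic_sample(rng):
    """One forward sample in topological order D, I, G, S, L.
    CPT numbers below are illustrative placeholders."""
    d = draw([0.6, 0.4], rng)                              # P(D)
    i = draw([0.7, 0.3], rng)                              # P(I)
    g = draw([[0.3, 0.7], [0.9, 0.1],
              [0.05, 0.95], [0.5, 0.5]][2 * i + d], rng)   # P(G | I, D)
    s = draw([[0.95, 0.05], [0.2, 0.8]][i], rng)           # P(S | I)
    l = draw([[0.9, 0.1], [0.4, 0.6]][g], rng)             # P(L | G)
    return (d, i, g, s, l)
```

Because each variable is sampled after its parents, every CPT lookup is fully determined by already-sampled values.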

11 Main Idea in Monte Carlo Estimation
Express the given task as the expected value of a random variable:
$E_P[g(x)] = \sum_x g(x)\,P(x)$
Generate samples from the distribution $P$ with respect to which the expectation is taken, and estimate the expected value from the samples using:
$\hat g = \frac{1}{T} \sum_{t=1}^{T} g(x^t)$
where $x^1, \ldots, x^T$ are independent samples from $P$.

12 Properties of the Monte Carlo Estimate
Convergence: by the law of large numbers,
$\hat g = \frac{1}{T} \sum_{t=1}^{T} g(x^t) \to E_P[g(x)]$ as $T \to \infty$
Unbiased: $E_P[\hat g] = E_P[g(x)]$
Variance:
$V_P[\hat g] = V_P\!\left[\frac{1}{T} \sum_{t=1}^{T} g(x^t)\right] = \frac{V_P[g(x)]}{T}$
Thus, the variance of the estimator can be reduced by increasing the number of samples; we have no control over the numerator $V_P[g(x)]$ when $P$ is given.
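The recipe above can be sketched end to end: express a quantity as $E_P[g(X)]$, draw independent samples from $P$, and average $g$ over them. Here $P$ is an illustrative categorical distribution and $g$ is the identity:

```python
import random

probs = [0.2, 0.5, 0.3]                           # P(X = 0), P(X = 1), P(X = 2)
exact = sum(x * p for x, p in enumerate(probs))   # E_P[g(X)] = 1.1

def mc_estimate(T, rng):
    """Average g(x) = x over T independent samples from P."""
    total = 0.0
    for _ in range(T):
        r, cum, x = rng.random(), 0.0, len(probs) - 1
        for j, p in enumerate(probs):             # draw x ~ P
            cum += p
            if r < cum:
                x = j
                break
        total += x                                # accumulate g(x)
    return total / T
```

Since $V_P[\hat g] = V_P[g(X)]/T$, the error of `mc_estimate` shrinks at rate $1/\sqrt{T}$ as more samples are drawn.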

13 What we will cover today
- Sampling fundamentals
- Monte Carlo techniques: rejection sampling, likelihood weighting, importance sampling
- Markov chain Monte Carlo techniques: Metropolis-Hastings, Gibbs sampling
- Advanced schemes: advanced importance sampling schemes, Rao-Blackwellisation

14 Rejection Sampling
Express $P(E = e)$ as an expectation:
$P(E = e) = \sum_x I(x, E = e)\,P(x) = E_P[I(x, E = e)]$
where $I(x, E = e)$ is 1 if $x$ agrees with $E = e$ and 0 otherwise.
Generate samples from the Bayesian network. Monte Carlo estimate:
$\hat P(E = e) = \frac{\text{number of samples that have } E = e}{\text{total number of samples}}$
Issue: if $P(E = e)$ is very small, nearly all samples are rejected.
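A minimal sketch of this estimator on a toy two-variable network $P(A)\,P(B \mid A)$ with evidence $B = 1$; all probabilities are illustrative placeholders, not from the lecture's network:

```python
import random

def forward_sample(rng):
    """One forward sample from the toy network P(A) P(B | A)."""
    a = 0 if rng.random() < 0.6 else 1              # P(A = 0) = 0.6
    b = 1 if rng.random() < [0.1, 0.8][a] else 0    # P(B = 1 | A = a)
    return a, b

def rejection_estimate(T, rng):
    """Fraction of forward samples that agree with the evidence B = 1."""
    hits = sum(1 for _ in range(T) if forward_sample(rng)[1] == 1)
    return hits / T

# Exact value: P(B = 1) = 0.6 * 0.1 + 0.4 * 0.8 = 0.38
```

The slide's complaint is visible in this structure: every sample with $B = 0$ contributes nothing, so when $P(E = e)$ is tiny, almost all of the work is wasted.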

15 Rejection Sampling: Example
Let the evidence be $e = (i^0, g^1, s^1, l^0)$. On average, you will need approximately $1/P(E = e)$ samples before a single sample agrees with the evidence, i.e., before the estimate is non-zero. Imagine how many samples will be needed when $P(E = e)$ is even smaller!

16 Importance Sampling
Use a proposal distribution $Q$ over $Z = X \setminus E$ satisfying: $P(Z = z, E = e) > 0 \Rightarrow Q(Z = z) > 0$.
Express $P(E = e)$ as follows:
$P(E = e) = \sum_z P(Z = z, E = e) = \sum_z Q(Z = z)\,\frac{P(Z = z, E = e)}{Q(Z = z)} = E_Q\!\left[\frac{P(Z, E = e)}{Q(Z)}\right] = E_Q[w(Z)]$
Generate samples from $Q$ and estimate $P(E = e)$ using the Monte Carlo estimate:
$\hat P(E = e) = \frac{1}{T} \sum_{t=1}^{T} \frac{P(Z = z^t, E = e)}{Q(Z = z^t)} = \frac{1}{T} \sum_{t=1}^{T} w(z^t)$
where $z^1, \ldots, z^T$ are sampled from $Q$.

17 Importance Sampling: Example
Let $(l^1, s^0)$ be the evidence. Imagine a uniform $Q$ defined over $(D, I, G)$ and generate samples from it.
$\hat P(E = e)$ = average of $P(\text{sample}, \text{evidence}) / Q(\text{sample})$ over the samples:
sample $(d^0, i^0, g^0)$: compute $P(\text{sample}, \text{evidence})$ and $Q(\text{sample})$;
sample $(d^1, i^0, g^0)$: likewise; and so on.
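The weighted average above can be sketched on a toy network $P(A)\,P(B \mid A)$ with evidence $B = 1$ and a uniform proposal over the single non-evidence variable $A$; all probabilities are illustrative placeholders:

```python
import random

P_A = [0.6, 0.4]              # P(A), illustrative
P_B1_GIVEN_A = [0.1, 0.8]     # P(B = 1 | A), illustrative

def is_estimate(T, rng):
    """Importance-sampling estimate of P(B = 1) with uniform Q over A."""
    total = 0.0
    for _ in range(T):
        a = rng.randrange(2)                    # a ~ Q, with Q(a) = 0.5
        w = P_A[a] * P_B1_GIVEN_A[a] / 0.5      # w(z) = P(z, e) / Q(z)
        total += w
    return total / T                            # average of the weights

# Exact value: P(B = 1) = 0.6 * 0.1 + 0.4 * 0.8 = 0.38
```

Unlike rejection sampling, no sample is discarded: each one contributes its weight $P(z, e)/Q(z)$.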

18 Likelihood Weighting
A special kind of importance sampling in which $Q$ equals the network obtained by clamping the evidence. Evidence = $(g^0, s^0)$.

19 Likelihood Weighting
A special kind of importance sampling in which $Q$ equals the network obtained by clamping the evidence. Evidence = $(g^0, s^0)$.
$P(\text{sample}, \text{evidence})/Q(\text{sample})$ can be computed efficiently: the ratio equals the product of the corresponding CPT values at the evidence nodes; the remaining values cancel out.
For example, for the sample $(d^0, i^0, l^1)$:
$\frac{P(\text{sample}, \text{evidence})}{Q(\text{sample})} = P(g^0 \mid d^0, i^0)\,P(s^0 \mid i^0)$
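The cancellation can be sketched on a minimal chain $A \to B$ with evidence $B = 1$: sample only the non-evidence variable from its CPT, clamp $B$, and weight the sample by the evidence node's CPT entry. All probabilities are illustrative placeholders:

```python
import random

def lw_estimate(T, rng):
    """Likelihood-weighting estimate of P(B = 1) on the chain A -> B.
    CPT numbers are illustrative."""
    total = 0.0
    for _ in range(T):
        a = 0 if rng.random() < 0.6 else 1   # a ~ P(A): sampling dist = Q
        total += [0.1, 0.8][a]               # weight = P(B = 1 | A = a)
    return total / T

# Exact value: P(B = 1) = 0.6 * 0.1 + 0.4 * 0.8 = 0.38
```

The weight is just the one evidence CPT entry because the factor $P(A = a)$ appears in both $P(\text{sample}, \text{evidence})$ and $Q(\text{sample})$ and cancels.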

20 Normalized Importance Sampling
(Un-normalized) IS is not suitable for estimating $P(X_i = x_i \mid E = e)$. One option: estimate the numerator and the denominator separately by IS:
$\hat P(X_i = x_i \mid E = e) = \frac{\hat P(X_i = x_i, E = e)}{\hat P(E = e)}$
This ratio estimate is often very bad because the numerator and denominator errors come from different sources and may compound; for example, the numerator may be an under-estimate while the denominator is an over-estimate.
How to fix this? Use normalized importance sampling.

21 Normalized Importance Sampling: Theory
Given the Dirac delta function $\delta_{x_i}(z)$ (which is 1 if $z$ contains $X_i = x_i$ and 0 otherwise), we can write $P(X_i = x_i \mid E = e)$ as:
$P(X_i = x_i \mid E = e) = \frac{\sum_z \delta_{x_i}(z)\,P(Z = z, E = e)}{\sum_z P(Z = z, E = e)}$
Now we can use the same $Q$ and the same samples from it to estimate both the numerator and the denominator:
$\hat P(X_i = x_i \mid E = e) = \frac{\sum_{t=1}^{T} \delta_{x_i}(z^t)\,w(z^t)}{\sum_{t=1}^{T} w(z^t)}$
This reduces variance because of common random numbers (a classical variance-reduction idea, not covered in standard machine learning texts).
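A self-normalized sketch on a toy network $P(A)\,P(B \mid A)$, estimating the posterior $P(A = 1 \mid B = 1)$: the same samples and the same weights feed both the numerator and the denominator. Probabilities are illustrative placeholders:

```python
import random

P_A = [0.6, 0.4]              # P(A), illustrative
P_B1_GIVEN_A = [0.1, 0.8]     # P(B = 1 | A), illustrative

def nis_estimate(T, rng):
    """Normalized IS estimate of P(A = 1 | B = 1) with uniform Q over A."""
    num = den = 0.0
    for _ in range(T):
        a = rng.randrange(2)                    # a ~ uniform Q, Q(a) = 0.5
        w = P_A[a] * P_B1_GIVEN_A[a] / 0.5      # w(z) = P(z, e) / Q(z)
        num += w * (a == 1)                     # delta_{A=1}(z) * w(z)
        den += w
    return num / den

# Exact posterior: 0.4 * 0.8 / (0.6 * 0.1 + 0.4 * 0.8) = 0.32 / 0.38
```

Because numerator and denominator share every random draw, their errors are positively correlated and largely cancel in the ratio.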

22 Normalized Importance Sampling: Properties
Asymptotically unbiased:
$\lim_{T \to \infty} E_Q[\hat P(X_i = x_i \mid E = e)] = P(X_i = x_i \mid E = e)$
Variance: harder to analyze. Liu (2003) suggests a performance measure called the effective sample size:
$ESS(\text{Ideal}, Q) = \frac{T}{1 + V_Q[w(Z)]}$
That is, $T$ samples from $Q$ are worth only $T/(1 + V_Q[w(Z)])$ samples from the ideal proposal distribution.

23 Importance Sampling: Issues
For optimum performance, the proposal distribution $Q$ should be as close as possible to $P(X \mid E = e)$. When $Q = P(X \mid E = e)$, the weight of every sample is exactly $P(E = e)$! However, achieving this is NP-hard.
Likelihood weighting performs poorly when the evidence is at the leaves and is unlikely.
In general, designing a good proposal distribution is an art rather than a science!

24 What we will cover today
- Sampling fundamentals
- Monte Carlo techniques: rejection sampling, likelihood weighting, importance sampling
- Markov chain Monte Carlo techniques: Metropolis-Hastings, Gibbs sampling
- Advanced schemes: advanced importance sampling schemes, Rao-Blackwellisation

25 Markov Chains
A Markov chain is composed of:
- A set of states $Val(X) = \{x^1, x^2, \ldots, x^r\}$
- A process that moves from a state $x$ to another state $x'$ with probability $T(x \to x')$. $T$ defines a square matrix in which each row sums to 1.
Chain dynamics:
$P^{(t+1)}(X^{(t+1)} = x') = \sum_{x \in Val(X)} P^{(t)}(X^{(t)} = x)\,T(x \to x')$
Markov chain Monte Carlo sampling is a process that mirrors the dynamics of a Markov chain.

26 Markov Chains: Stationary Distribution
We are interested in the long-term behavior of a Markov chain, which is defined by the stationary distribution. A distribution $\pi(X)$ is a stationary distribution if it satisfies:
$\pi(X = x') = \sum_{x \in Val(X)} \pi(X = x)\,T(x \to x')$
A Markov chain may or may not converge to a stationary distribution.
Example (three-state chain): the stationarity constraints are
$\pi(x^1) = 0.25\,\pi(x^1) + 0.5\,\pi(x^3)$
$\pi(x^2) = 0.7\,\pi(x^2) + 0.5\,\pi(x^3)$
$\pi(x^3) = 0.75\,\pi(x^1) + 0.3\,\pi(x^2)$
$\pi(x^1) + \pi(x^2) + \pi(x^3) = 1$
Unique solution: $\pi(x^1) = 0.2$, $\pi(x^2) = 0.5$, $\pi(x^3) = 0.3$.
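The example's unique solution can be checked numerically by power iteration, repeatedly applying the chain dynamics to an arbitrary starting distribution. The matrix below is the transition model implied by the slide's stationarity constraints (row $i$ holds the outgoing probabilities $T(x^i \to x^j)$):

```python
# Transition matrix implied by the stationarity constraints above.
T = [[0.25, 0.0, 0.75],
     [0.0,  0.7, 0.3],
     [0.5,  0.5, 0.0]]

p = [1.0, 0.0, 0.0]                  # arbitrary starting distribution P^(0)
for _ in range(500):                 # apply P^(t+1)(x') = sum_x P^(t)(x) T(x -> x')
    p = [sum(p[i] * T[i][j] for i in range(3)) for j in range(3)]

# p converges to the unique stationary distribution (0.2, 0.5, 0.3)
```

Convergence from any starting point is what the regularity condition on the next slide guarantees.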

27 Sufficient Conditions for Ensuring Convergence
Regular Markov chain: a Markov chain is said to be regular if there exists some number $k$ such that, for every $x, x' \in Val(X)$, the probability of getting from $x$ to $x'$ in exactly $k$ steps is greater than zero.
Theorem: if a finite Markov chain is regular, then it has a unique stationary distribution.
Sufficient conditions for ensuring regularity: construct the state graph such that there is a positive probability of getting from any state to any state, and for each state $x$ there is a positive-probability self-loop.

28 MCMC for Computing $P(X_i = x_i \mid E = e)$
Main idea: construct a Markov chain such that its stationary distribution equals $P(X \mid E = e)$. Generate samples using the Markov chain and estimate $P(X_i = x_i \mid E = e)$ using the standard Monte Carlo estimate:
$\hat P(X_i = x_i \mid E = e) = \frac{1}{T} \sum_{t=1}^{T} \delta_{x_i}(z^t)$

29 Gibbs Sampling
Start at a random assignment to all non-evidence variables. Select a variable $X_i$ and compute the distribution $P(X_i \mid E = e, x_{-i})$, where $x_{-i}$ is the current sampled assignment to $X \setminus (E \cup \{X_i\})$. Sample a value for $X_i$ from $P(X_i \mid E = e, x_{-i})$. Repeat.
Question: can we compute $P(X_i \mid E = e, x_{-i})$ efficiently?

30 Gibbs Sampling
Start at a random assignment to all non-evidence variables. Select a variable $X_i$ and compute the distribution $P(X_i \mid E = e, x_{-i})$, where $x_{-i}$ is the current sampled assignment to $X \setminus (E \cup \{X_i\})$. Sample a value for $X_i$ from $P(X_i \mid E = e, x_{-i})$. Repeat.
Computing $P(X_i \mid E = e, x_{-i})$: exact inference is possible because only one variable is not assigned a value!
The stationary distribution of the Markov chain equals $P(X \mid E = e)$ (easy to prove).
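A minimal Gibbs sketch on an illustrative joint $P(X, Y)$ over two binary variables with no evidence: each full conditional is computed exactly from the joint table, which is exactly the "only one variable unassigned" observation above. The table values are placeholders:

```python
import random

JOINT = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}  # illustrative

def p_x1_given_y(y):
    """Exact conditional P(X = 1 | Y = y) from the joint table."""
    return JOINT[(1, y)] / (JOINT[(0, y)] + JOINT[(1, y)])

def p_y1_given_x(x):
    """Exact conditional P(Y = 1 | X = x) from the joint table."""
    return JOINT[(x, 1)] / (JOINT[(x, 0)] + JOINT[(x, 1)])

def gibbs_marginal_x1(T, rng, burn_in=1000):
    """Estimate P(X = 1) by alternately resampling X | y and Y | x."""
    x = y = 0
    hits = 0
    for t in range(T + burn_in):
        x = 1 if rng.random() < p_x1_given_y(y) else 0
        y = 1 if rng.random() < p_y1_given_x(x) else 0
        if t >= burn_in:
            hits += x
    return hits / T

# Exact marginal: P(X = 1) = 0.1 + 0.4 = 0.5
```

Because every joint entry is positive, this chain satisfies the "no zeros" condition and converges to the target distribution.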

31 Gibbs Sampling: Properties
When the Bayesian network has no zeros, Gibbs sampling is guaranteed to converge to $P(X \mid E = e)$.
When the Bayesian network has zeros or the evidence is complex (e.g., a SAT formula), Gibbs sampling may not converge. Open problem!
Mixing time: let $t_\epsilon$ be the minimum $t$ such that, for any starting distribution $P^{(0)}$, the distance between $P(X \mid E = e)$ and $P^{(t)}$ is less than $\epsilon$.
It is common to ignore some number of samples at the beginning (the so-called burn-in period) and then consider only every $n$-th sample thereafter.

32 Metropolis-Hastings: Theory
Detailed balance: given a transition function $T(x \to x')$ and an acceptance probability $A(x \to x')$, a Markov chain satisfies the detailed balance condition if there exists a distribution $\pi$ such that:
$\pi(x)\,T(x \to x')\,A(x \to x') = \pi(x')\,T(x' \to x)\,A(x' \to x)$
Theorem: if a Markov chain is regular and satisfies the detailed balance condition relative to $\pi$, then it has $\pi$ as its unique stationary distribution.

33 Metropolis-Hastings: Algorithm
Input: current state $x^t$. Output: next state $x^{t+1}$.
1. Draw $x'$ from $T(x^t \to x')$.
2. Draw a random number $r \in [0, 1]$ and update:
$x^{t+1} = x'$ if $r \leq A(x^t \to x')$; otherwise $x^{t+1} = x^t$.
In Metropolis-Hastings, $A$ is defined as follows:
$A(x \to x') = \min\left\{1,\ \frac{\pi(x')\,T(x' \to x)}{\pi(x)\,T(x \to x')}\right\}$
Theorem: the Metropolis-Hastings algorithm satisfies the detailed balance condition.
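A sketch of the algorithm targeting an unnormalized $\pi$ over three states. The proposal $T$ is uniform, hence symmetric, so the acceptance probability reduces to $\min\{1, \pi(x')/\pi(x)\}$; the target weights are illustrative:

```python
import random

PI = [1.0, 3.0, 6.0]            # unnormalized target pi, illustrative

def mh_marginals(T, rng, burn_in=1000):
    """Run the MH chain and return the empirical state frequencies."""
    x = 0
    counts = [0, 0, 0]
    for t in range(T + burn_in):
        x_prop = rng.randrange(3)                         # draw x' ~ T(x -> x'), uniform
        if rng.random() < min(1.0, PI[x_prop] / PI[x]):   # accept with probability A
            x = x_prop                                    # move; otherwise stay at x
        if t >= burn_in:
            counts[x] += 1
    return [c / T for c in counts]

# Frequencies converge to pi / sum(pi) = (0.1, 0.3, 0.6)
```

Note that only ratios of $\pi$ appear, which is why MH works with unnormalized targets such as $P(X, E = e)$.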

34 Metropolis-Hastings: What T to Use?
Option 1: use an importance distribution $Q$ to make transitions. This is called independent sampling because $T$ does not depend on the current state.
Option 2: use a random-walk approach.

35 What we will cover today
- Sampling fundamentals
- Monte Carlo techniques: rejection sampling, likelihood weighting, importance sampling
- Markov chain Monte Carlo techniques: Metropolis-Hastings, Gibbs sampling
- Advanced schemes: advanced importance sampling schemes, Rao-Blackwellisation

36 Selecting a Proposal Distribution
For good performance, $Q$ should be as close as possible to $P(X \mid E = e)$. Options:
- Use a method that yields a good approximation of $P(X \mid E = e)$ to construct $Q$: variational inference, generalized belief propagation.
- Update the proposal distribution from the samples (the machine learning approach).
- Combinations of the above.

37 Using Approximations of $P(X \mid E = e)$ to Construct Q
Algorithm JT-sampling (perfect sampling):
Let $o = (X_1, \ldots, X_n)$ be an ordering of the variables; set $q = 1$.
For $i = 1$ to $n$:
  Propagate evidence in the junction tree.
  Construct a distribution $Q_i(X_i)$ by referring to any cluster mentioning $X_i$ and marginalizing out all other variables.
  Sample $X_i = x_i$ from $Q_i$; set $q = q \cdot Q_i(X_i = x_i)$.
  Set $X_i = x_i$ as evidence in the junction tree.
Return $(x, q)$.

38 Using Approximations of $P(X \mid E = e)$ to Construct Q
Algorithm IJGP-sampling (Gogate & Dechter, UAI 2005):
Let $o = (X_1, \ldots, X_n)$ be an ordering of the variables; set $q = 1$.
For $i = 1$ to $n$:
  Propagate evidence in the join graph.
  Construct a distribution $Q_i(X_i)$ by referring to any cluster mentioning $X_i$ and marginalizing out all other variables.
  Sample $X_i = x_i$ from $Q_i$; set $q = q \cdot Q_i(X_i = x_i)$.
  Set $X_i = x_i$ as evidence in the join graph.
Return $(x, q)$.

39 Adaptive Importance Sampling
Machine learning view of sampling: learn from experience! Learn the proposal distribution $Q$ from the samples. At regular intervals, update the proposal distribution $Q^t$ at the current interval $t$ using:
$Q^{t+1} = Q^t + \alpha\,(\hat Q^t - Q^t)$
where $\alpha$ is the learning rate and $\hat Q^t$ is the distribution estimated from the samples drawn in interval $t$.
As the number of samples increases, the proposal will get closer and closer to $P(X \mid E = e)$.

40 Rao-Blackwellisation of Sampling Schemes
Combine exact inference with sampling: sample a few variables and analytically marginalize out the others.
Inference on trees (or low-treewidth graphical models) is always tractable, so one strategy is to sample variables until the remaining graphical model is a tree.
Rao-Blackwell theorem: let the non-evidence variables $Z$ be partitioned into two sets $Z_1$ and $Z_2$, where $Z_1$ is sampled and $Z_2$ is inferred exactly. Then:
$V_Q\!\left[\frac{P(Z_1, Z_2, e)}{Q(Z_1, Z_2)}\right] \geq V_Q\!\left[\frac{P(Z_1, e)}{Q(Z_1)}\right]$
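A minimal sketch of the idea: sample only $Z_1 = A$ from a uniform $Q$ and marginalize $Z_2 = B$ out exactly inside the weight. The factor table $F$ (here playing the role of the full unnormalized distribution) is illustrative:

```python
import random

F = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 4.0, (1, 1): 3.0}  # illustrative factors

def rb_estimate(T, rng):
    """Estimate Gamma = sum_{a,b} F(a, b), with B summed out exactly."""
    total = 0.0
    for _ in range(T):
        a = rng.randrange(2)                 # sample only A ~ Q, Q(a) = 0.5
        inner = F[(a, 0)] + F[(a, 1)]        # exact marginalization over B
        total += inner / 0.5                 # weight = sum_b F(a, b) / Q(a)
    return total / T

# Exact partition function: Gamma = 2 + 1 + 4 + 3 = 10
```

The weight now depends only on $A$, so by the Rao-Blackwell theorem its variance cannot exceed that of the estimator that samples both $A$ and $B$.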

41 Rao-Blackwellisation of Sampling Schemes: Example
Traditional importance sampling (proposal distribution and samples defined over all variables):
$\hat\Gamma = \frac{1}{T} \sum_{t=1}^{T} \frac{F(a^t, \ldots, i^t)}{Q(a^t, \ldots, i^t)}$
Rao-Blackwellised importance sampling: let $Z = \text{Vars} \setminus \{B, E\}$; then
$\hat\Gamma = \frac{1}{T} \sum_{t=1}^{T} \frac{\sum_z F(z, b^t, e^t)}{Q(b^t, e^t)}$
where $\sum_z F(z, b^t, e^t)$ is computed efficiently using belief propagation or variable elimination.

42 Summary
Importance sampling: generate samples from a proposal distribution; performance depends on how close the proposal is to the posterior distribution.
Markov chain Monte Carlo (MCMC) sampling: attempts to generate samples from the posterior distribution by creating a Markov chain whose stationary distribution equals the posterior distribution; Metropolis-Hastings and Gibbs sampling.
Advanced schemes: how to construct and learn a good proposal distribution; how to use graph decompositions to improve the quality of estimation.


Sampling Rejection Sampling Importance Sampling Markov Chain Monte Carlo. Sampling Methods. Machine Learning. Torsten Möller. Sampling Methods Machine Learning orsten Möller Möller/Mori 1 Recall Inference or General Graphs Junction tree algorithm is an exact inference method for arbitrary graphs A particular tree structure defined

More information

Probabilistic Graphical Models (I)

Probabilistic Graphical Models (I) Probabilistic Graphical Models (I) Hongxin Zhang zhx@cad.zju.edu.cn State Key Lab of CAD&CG, ZJU 2015-03-31 Probabilistic Graphical Models Modeling many real-world problems => a large number of random

More information

DAG models and Markov Chain Monte Carlo methods a short overview

DAG models and Markov Chain Monte Carlo methods a short overview DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex

More information

Markov Chains and MCMC

Markov Chains and MCMC Markov Chains and MCMC CompSci 590.02 Instructor: AshwinMachanavajjhala Lecture 4 : 590.02 Spring 13 1 Recap: Monte Carlo Method If U is a universe of items, and G is a subset satisfying some property,

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Announcements. CS 188: Artificial Intelligence Fall Causality? Example: Traffic. Topology Limits Distributions. Example: Reverse Traffic

Announcements. CS 188: Artificial Intelligence Fall Causality? Example: Traffic. Topology Limits Distributions. Example: Reverse Traffic CS 188: Artificial Intelligence Fall 2008 Lecture 16: Bayes Nets III 10/23/2008 Announcements Midterms graded, up on glookup, back Tuesday W4 also graded, back in sections / box Past homeworks in return

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 11 Oct, 3, 2016 CPSC 422, Lecture 11 Slide 1 422 big picture: Where are we? Query Planning Deterministic Logics First Order Logics Ontologies

More information

Lecture 6: Graphical Models

Lecture 6: Graphical Models Lecture 6: Graphical Models Kai-Wei Chang CS @ Uniersity of Virginia kw@kwchang.net Some slides are adapted from Viek Skirmar s course on Structured Prediction 1 So far We discussed sequence labeling tasks:

More information

Markov Networks. l Like Bayes Nets. l Graphical model that describes joint probability distribution using tables (AKA potentials)

Markov Networks. l Like Bayes Nets. l Graphical model that describes joint probability distribution using tables (AKA potentials) Markov Networks l Like Bayes Nets l Graphical model that describes joint probability distribution using tables (AKA potentials) l Nodes are random variables l Labels are outcomes over the variables Markov

More information

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods Prof. Daniel Cremers 14. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Stochastic inference in Bayesian networks, Markov chain Monte Carlo methods

Stochastic inference in Bayesian networks, Markov chain Monte Carlo methods Stochastic inference in Bayesian networks, Markov chain Monte Carlo methods AI: Stochastic inference in BNs AI: Stochastic inference in BNs 1 Outline ypes of inference in (causal) BNs Hardness of exact

More information

Markov Networks.

Markov Networks. Markov Networks www.biostat.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts Markov network syntax Markov network semantics Potential functions Partition function

More information

Probabilistic Graphical Networks: Definitions and Basic Results

Probabilistic Graphical Networks: Definitions and Basic Results This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources. I make no claim to original authorship of this material. Bayesian Graphical

More information

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods Prof. Daniel Cremers 11. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric

More information

Outline. CSE 573: Artificial Intelligence Autumn Bayes Nets: Big Picture. Bayes Net Semantics. Hidden Markov Models. Example Bayes Net: Car

Outline. CSE 573: Artificial Intelligence Autumn Bayes Nets: Big Picture. Bayes Net Semantics. Hidden Markov Models. Example Bayes Net: Car CSE 573: Artificial Intelligence Autumn 2012 Bayesian Networks Dan Weld Many slides adapted from Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer Outline Probabilistic models (and inference)

More information

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy

More information

Bayes Nets III: Inference

Bayes Nets III: Inference 1 Hal Daumé III (me@hal3.name) Bayes Nets III: Inference Hal Daumé III Computer Science University of Maryland me@hal3.name CS 421: Introduction to Artificial Intelligence 10 Apr 2012 Many slides courtesy

More information

Bayes Networks 6.872/HST.950

Bayes Networks 6.872/HST.950 Bayes Networks 6.872/HST.950 What Probabilistic Models Should We Use? Full joint distribution Completely expressive Hugely data-hungry Exponential computational complexity Naive Bayes (full conditional

More information

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo Group Prof. Daniel Cremers 11. Sampling Methods: Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative

More information

Sampling Rejection Sampling Importance Sampling Markov Chain Monte Carlo. Sampling Methods. Oliver Schulte - CMPT 419/726. Bishop PRML Ch.

Sampling Rejection Sampling Importance Sampling Markov Chain Monte Carlo. Sampling Methods. Oliver Schulte - CMPT 419/726. Bishop PRML Ch. Sampling Methods Oliver Schulte - CMP 419/726 Bishop PRML Ch. 11 Recall Inference or General Graphs Junction tree algorithm is an exact inference method for arbitrary graphs A particular tree structure

More information

Graphical Probability Models for Inference and Decision Making

Graphical Probability Models for Inference and Decision Making Graphical Probability Models for Inference and Decision Making Unit 6: Inference in Graphical Models: Other Algorithms Instructor: Kathryn Blackmond Laskey Unit 6 (v3) - 1 - Learning Objectives Describe

More information

ECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning

ECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning ECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning Topics Summary of Class Advanced Topics Dhruv Batra Virginia Tech HW1 Grades Mean: 28.5/38 ~= 74.9%

More information

Probabilistic Machine Learning

Probabilistic Machine Learning Probabilistic Machine Learning Bayesian Nets, MCMC, and more Marek Petrik 4/18/2017 Based on: P. Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. Chapter 10. Conditional Independence Independent

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Bayesian Networks. instructor: Matteo Pozzi. x 1. x 2. x 3 x 4. x 5. x 6. x 7. x 8. x 9. Lec : Urban Systems Modeling

Bayesian Networks. instructor: Matteo Pozzi. x 1. x 2. x 3 x 4. x 5. x 6. x 7. x 8. x 9. Lec : Urban Systems Modeling 12735: Urban Systems Modeling Lec. 09 Bayesian Networks instructor: Matteo Pozzi x 1 x 2 x 3 x 4 x 5 x 6 x 7 x 8 x 9 1 outline example of applications how to shape a problem as a BN complexity of the inference

More information

Monte Carlo Methods. Leon Gu CSD, CMU

Monte Carlo Methods. Leon Gu CSD, CMU Monte Carlo Methods Leon Gu CSD, CMU Approximate Inference EM: y-observed variables; x-hidden variables; θ-parameters; E-step: q(x) = p(x y, θ t 1 ) M-step: θ t = arg max E q(x) [log p(y, x θ)] θ Monte

More information

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

More information

Markov Networks. l Like Bayes Nets. l Graph model that describes joint probability distribution using tables (AKA potentials)

Markov Networks. l Like Bayes Nets. l Graph model that describes joint probability distribution using tables (AKA potentials) Markov Networks l Like Bayes Nets l Graph model that describes joint probability distribution using tables (AKA potentials) l Nodes are random variables l Labels are outcomes over the variables Markov

More information

An ABC interpretation of the multiple auxiliary variable method

An ABC interpretation of the multiple auxiliary variable method School of Mathematical and Physical Sciences Department of Mathematics and Statistics Preprint MPS-2016-07 27 April 2016 An ABC interpretation of the multiple auxiliary variable method by Dennis Prangle

More information

StarAI Full, 6+1 pages Short, 2 page position paper or abstract

StarAI Full, 6+1 pages Short, 2 page position paper or abstract StarAI 2015 Fifth International Workshop on Statistical Relational AI At the 31st Conference on Uncertainty in Artificial Intelligence (UAI) (right after ICML) In Amsterdam, The Netherlands, on July 16.

More information

Lecture 8: Bayesian Networks

Lecture 8: Bayesian Networks Lecture 8: Bayesian Networks Bayesian Networks Inference in Bayesian Networks COMP-652 and ECSE 608, Lecture 8 - January 31, 2017 1 Bayes nets P(E) E=1 E=0 0.005 0.995 E B P(B) B=1 B=0 0.01 0.99 E=0 E=1

More information

an introduction to bayesian inference

an introduction to bayesian inference with an application to network analysis http://jakehofman.com january 13, 2010 motivation would like models that: provide predictive and explanatory power are complex enough to describe observed phenomena

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Bayesian Networks: Representation, Variable Elimination

Bayesian Networks: Representation, Variable Elimination Bayesian Networks: Representation, Variable Elimination CS 6375: Machine Learning Class Notes Instructor: Vibhav Gogate The University of Texas at Dallas We can view a Bayesian network as a compact representation

More information

Bayes Net Representation. CS 188: Artificial Intelligence. Approximate Inference: Sampling. Variable Elimination. Sampling.

Bayes Net Representation. CS 188: Artificial Intelligence. Approximate Inference: Sampling. Variable Elimination. Sampling. 188: Artificial Intelligence Bayes Nets: ampling Bayes Net epresentation A directed, acyclic graph, one node per random variable A conditional probability table (PT) for each node A collection of distributions

More information

Sampling Methods. Bishop PRML Ch. 11. Alireza Ghane. Sampling Rejection Sampling Importance Sampling Markov Chain Monte Carlo

Sampling Methods. Bishop PRML Ch. 11. Alireza Ghane. Sampling Rejection Sampling Importance Sampling Markov Chain Monte Carlo Sampling Methods Bishop PRML h. 11 Alireza Ghane Sampling Methods A. Ghane /. Möller / G. Mori 1 Recall Inference or General Graphs Junction tree algorithm is an exact inference method for arbitrary graphs

More information

Artificial Intelligence Bayesian Networks

Artificial Intelligence Bayesian Networks Artificial Intelligence Bayesian Networks Stephan Dreiseitl FH Hagenberg Software Engineering & Interactive Media Stephan Dreiseitl (Hagenberg/SE/IM) Lecture 11: Bayesian Networks Artificial Intelligence

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

Graphical Models. Lecture 15: Approximate Inference by Sampling. Andrew McCallum

Graphical Models. Lecture 15: Approximate Inference by Sampling. Andrew McCallum Graphical Models Lecture 15: Approximate Inference by Sampling Andrew McCallum mccallum@cs.umass.edu Thanks to Noah Smith and Carlos Guestrin for some slide materials. 1 General Idea Set of random variables

More information

Intelligent Systems: Reasoning and Recognition. Reasoning with Bayesian Networks

Intelligent Systems: Reasoning and Recognition. Reasoning with Bayesian Networks Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2016/2017 Lesson 13 24 march 2017 Reasoning with Bayesian Networks Naïve Bayesian Systems...2 Example

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

More information