Markov Chain Monte Carlo

Yale Chang

1 Motivation

1.1 Bayesian Learning

In Bayesian learning, given data X, we make assumptions about the generative process of X by introducing hidden variables Z: p(Z) is the prior distribution of Z, and p(X | Z) is the likelihood function. The objective of inference is to compute the posterior distribution using Bayes' formula:

    p(Z | X) = \frac{p(X | Z) p(Z)}{p(X)}    (1)

After computing p(Z | X), we can make a prediction for any new sample x^*:

    p(x^* | X) = \int p(Z | X) p(x^* | Z) \, dZ    (2)

1.2 Inference

However, the challenge is that the marginal likelihood p(X) = \int p(X, Z) \, dZ might be intractable. This could be due to the exponential number of configurations when Z contains high-dimensional discrete variables, or to the difficulty of numerical integration when Z contains high-dimensional real-valued variables.

Approximate inference, including variational inference (VI) and expectation propagation (EP), uses a variational distribution q(Z; \lambda) to approximate the posterior p(Z | X). In this way, the inference problem is transformed into an optimization problem. The main drawback of approximate inference is that the inflexibility of q(Z; \lambda) can cause non-negligible approximation error.

An alternative strategy is to directly draw a set of samples from p(Z | X) and compute the expectation in the predictive distribution as follows:

    p(x^* | X) \approx \frac{1}{S} \sum_{s=1}^{S} p(x^* | z^{(s)})    (3)

where {z^{(1)}, z^{(2)}, ..., z^{(S)}} are S samples drawn from the Markov chain.

2 Standard Distributions

Our objective is to sample from p(z). We observe that all random variables we simulate on our computers are ultimately transformations of a uniform random variable. In this section, we assume we can generate pseudo-random numbers distributed uniformly over [0, 1]. We denote this random variable by U ~ U(0, 1). Assume the mapping from U to Z is Z = g(U), where g is a monotonically increasing function; then we have

    F_Z(z) = Prob[Z \le z] = Prob[g(U) \le z] = Prob[U \le g^{-1}(z)] = g^{-1}(z)

Therefore we have g^{-1} = F_Z, which leads to g = F_Z^{-1}. To sample from a standard distribution p(z), we first compute the inverse of its CDF, F_Z^{-1}, and then apply this transformation to samples from the uniform distribution.

2.1 Example: exponential distribution

    p(z) = \lambda \exp(-\lambda z) \quad (z \ge 0)

    F_Z(z) = \int_0^z p(t) \, dt = 1 - \exp(-\lambda z)

    Z = F_Z^{-1}(U) = -\frac{1}{\lambda} \log(1 - U)

Another example is the Cauchy distribution. This approach can also be generalized to multiple variables by observing that

    p(z_1, ..., z_M) = p(u_1, ..., u_M) \left| \frac{\partial(u_1, ..., u_M)}{\partial(z_1, ..., z_M)} \right|    (4)

We need to compute the CDF and its inverse, which is only feasible for a small set of distributions. For example, the Gamma distribution cannot be sampled in this way.

3 Rejection Sampling

Objective: sample from p(z) = \frac{1}{Z_p} \tilde{p}(z). Assume we can sample from a proposal distribution q(z). Introduce a constant k such that k q(z) \ge \tilde{p}(z) holds for all values of z.

1. Sample z_0 ~ q(z).
2. Sample u_0 ~ U(0, k q(z_0)).
3. Reject the sample z_0 if u_0 > \tilde{p}(z_0) and accept it otherwise.

The remaining pairs then have uniform distribution under the curve of \tilde{p}(z), and hence the corresponding z values are distributed according to p(z). The proof is as follows: consider the CDF of the accepted points:

    Prob(z \le z_0 | z accepted) = \frac{Prob(z \le z_0, z accepted)}{Prob(z accepted)} = \frac{\int_{-\infty}^{z_0} \tilde{p}(z) \, dz}{\int \tilde{p}(z) \, dz} = CDF(z_0)

The Gaussian distribution can be used to illustrate the problem of rejection sampling in high-dimensional space: the exponential decrease of the acceptance rate with dimensionality is a generic feature of rejection sampling. Although rejection sampling can be a useful technique in one or two dimensions, it is unsuited to problems of higher dimensionality.
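As a concrete illustration of these two basic methods (added here, not part of the original notes), the following Python sketch draws exponential samples via the inverse CDF of Section 2.1 and uses rejection sampling with a uniform proposal; the unnormalized target \tilde{p}(z) = z^2 (1 - z)^3 on [0, 1] and the constant k are assumptions chosen only for the example.

import numpy as np

rng = np.random.default_rng(0)

# Inverse-CDF transform for the exponential distribution (Section 2.1):
# Z = F^{-1}(U) = -(1/lam) * log(1 - U), with U ~ Uniform(0, 1).
lam = 2.0
u = rng.uniform(size=10_000)
z_exp = -np.log(1.0 - u) / lam          # samples from Exp(lam)

# Rejection sampling (Section 3) for a hypothetical unnormalized target on [0, 1].
def p_tilde(z):
    return z**2 * (1.0 - z)**3           # maximum is about 0.0346 at z = 0.4

k = 0.035                                # chosen so that k * q(z) >= p_tilde(z) everywhere

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        z0 = rng.uniform()               # step 1: z0 ~ q(z) = Uniform(0, 1)
        u0 = rng.uniform(0.0, k * 1.0)   # step 2: u0 ~ Uniform(0, k q(z0)); q(z0) = 1 here
        if u0 <= p_tilde(z0):            # step 3: accept iff u0 <= p_tilde(z0)
            samples.append(z0)
    return np.array(samples)

z_rej = rejection_sample(5_000)

The fraction of accepted proposals here equals the area under \tilde{p} divided by k, which is exactly the acceptance rate that collapses in high dimensions.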

4 Importance Sampling

Objective: estimate E_{p(z)}[f(z)]. Assume p(z) = \tilde{p}(z) / Z_p and proposal distribution q(z) = \tilde{q}(z) / Z_q. Then

    E_{p(z)}[f(z)] = \int p(z) f(z) \, dz = \int q(z) \frac{p(z)}{q(z)} f(z) \, dz = \frac{Z_q}{Z_p} \int q(z) \frac{\tilde{p}(z)}{\tilde{q}(z)} f(z) \, dz \approx \frac{Z_q}{Z_p} \frac{1}{S} \sum_{s=1}^{S} r_s f(z^{(s)})

where r_s = \tilde{p}(z^{(s)}) / \tilde{q}(z^{(s)}) and

    \frac{Z_p}{Z_q} = \frac{1}{Z_q} \int \tilde{p}(z) \, dz = \int q(z) \frac{\tilde{p}(z)}{\tilde{q}(z)} \, dz \approx \frac{1}{S} \sum_{s=1}^{S} r_s

Therefore, we have

    E_{p(z)}[f(z)] \approx \sum_{s=1}^{S} w_s f(z^{(s)})

where w_s = r_s / \sum_{t=1}^{S} r_t. As with rejection sampling, the success of the importance sampling approach depends crucially on how well the sampling distribution q(z) matches the desired distribution p(z).

5 Markov Chain Monte Carlo

Both rejection sampling and importance sampling work poorly for high-dimensional variables. Markov chain Monte Carlo (MCMC) constructs a Markov chain whose stationary distribution is the posterior p(Z | X). A set of samples from the Markov chain follows the distribution p(Z | X) and can therefore be used to compute the expectation in the predictive distribution.

5.1 Metropolis-Hastings Algorithm

The sampling algorithm achieving the above objective is called Metropolis-Hastings (MH). Below is the MH procedure for sampling from p(z). Initialize z^{(0)}; for s = 0, 1, 2, ...:

1. Set z = z^{(s)}.
2. Sample from the proposal distribution: z' ~ q(z' | z).
3. Compute the acceptance ratio \alpha(z' | z) = \frac{p(z') q(z | z')}{p(z) q(z' | z)} and set r(z' | z) = \min(1, \alpha(z' | z)).
4. Sample from the uniform distribution: u ~ U(0, 1).
5. Set the new sample z^{(s+1)} equal to z' if u < r and to z^{(s)} otherwise.

The key question is how to prove that the distribution of z^{(s)} converges to p(Z | X). To answer this question, we need a brief recap of Markov chains and the properties related to their stationary distributions.
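To make the MH procedure concrete, here is a minimal Python sketch (an added illustration, not from the original notes). It uses a Gaussian random-walk proposal, for which q(z' | z) = q(z | z'), so the proposal terms cancel in the acceptance ratio; the bimodal target density and the step size are assumptions chosen for the example.

import numpy as np

rng = np.random.default_rng(0)

def log_p_tilde(z):
    # Hypothetical unnormalized log-target: a mixture of two Gaussians at -2 and +2.
    return np.logaddexp(-0.5 * (z - 2.0)**2, -0.5 * (z + 2.0)**2)

def metropolis_hastings(n_samples, step=1.0, z0=0.0):
    z = z0
    samples = np.empty(n_samples)
    for s in range(n_samples):
        z_prop = z + step * rng.normal()                  # step 2: z' ~ q(z' | z), symmetric random walk
        log_alpha = log_p_tilde(z_prop) - log_p_tilde(z)  # step 3: q terms cancel for a symmetric proposal
        if np.log(rng.uniform()) < min(0.0, log_alpha):   # steps 4-5: accept with probability min(1, alpha)
            z = z_prop
        samples[s] = z                                    # on rejection, keep z^{(s+1)} = z^{(s)}
    return samples

samples = metropolis_hastings(20_000)
# Discard an initial burn-in before using the samples in Monte Carlo averages.

The step size trades off acceptance rate against how quickly the chain moves through the support of the target.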

5.2 Markov Chain

Given a sequence of random variables Z^{(1)}, ..., Z^{(S)} of arbitrary length S, their joint distribution can be written as

    p(Z^{(1)}, ..., Z^{(S)}) = p(Z^{(1)}) \prod_{s=2}^{S} p(Z^{(s)} | Z^{(1:s-1)})    (5)

In a first-order Markov chain, we assume p(Z^{(s)} | Z^{(1:s-1)}) = p(Z^{(s)} | Z^{(s-1)}); then the joint distribution can be written as

    p(Z^{(1)}, ..., Z^{(S)}) = p(Z^{(1)}) \prod_{s=2}^{S} p(Z^{(s)} | Z^{(s-1)})    (6)

That is, we assume Z^{(s)} is sufficient to predict the future. If we further assume that p(Z^{(s)} | Z^{(s-1)}) does not depend on s, then the chain is called homogeneous, stationary, or time-invariant.

5.3 Transition Matrix

When Z^{(s)} is discrete, i.e., Z^{(s)} \in {1, ..., K}, the conditional distribution p(Z^{(s)} | Z^{(s-1)}) can be written as a K x K matrix A, where

    A_{ij} = p(Z^{(s)} = j | Z^{(s-1)} = i)    (7)

represents the probability of going from state i to state j in one step. We call A the transition matrix. Each row of the matrix sums to one:

    \sum_{j=1}^{K} A_{ij} = 1    (8)

We can also define the n-step transition matrix A(n) as

    A_{ij}(n) = p(Z^{(s+n)} = j | Z^{(s)} = i)    (9)

which is the probability of going from state i to state j in exactly n steps. By definition, A(1) = A. The Chapman-Kolmogorov equation A_{ij}(m+n) = \sum_{k=1}^{K} A_{ik}(m) A_{kj}(n) indicates that A(m+n) = A(m) A(n). Therefore A(n) = A^n.

5.4 Stationary Distribution

We have been focusing on Markov models as a way of defining a joint probability distribution over a sequence of random variables. A Markov model can also be viewed as a stochastic dynamical system. In this case, we are interested in the long-term distribution over states, which is called the stationary distribution. Let \pi_s(j) = p(Z^{(s)} = j) be the probability of being in state j at step s; then we have

    \pi_s(j) = \sum_{i} \pi_{s-1}(i) A_{ij}    (10)

If we treat \pi_s = (\pi_s(1), ..., \pi_s(K)) \in R^{1 x K} as a row vector, this formula can be rewritten as

    \pi_s = \pi_{s-1} A    (11)

\pi is called the stationary distribution of the Markov chain if it satisfies

    \pi = \pi A    (12)

Once we enter the stationary distribution, we will never leave it.
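The following numpy sketch (an added illustration; the 3-state transition matrix is an arbitrary assumption) checks these facts numerically: A(n) = A^n, and iterating \pi_s = \pi_{s-1} A converges to a \pi satisfying \pi = \pi A.

import numpy as np

# Hypothetical 3-state transition matrix; each row sums to one (Eq. 8).
A = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

# n-step transition matrix: A(n) = A^n (Chapman-Kolmogorov, Eq. 9).
A5 = np.linalg.matrix_power(A, 5)

# Iterate pi_s = pi_{s-1} A (Eq. 11) from an arbitrary starting distribution.
pi = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    pi = pi @ A

print(pi)                       # long-run distribution over the three states
print(np.allclose(pi, pi @ A))  # pi = pi A (Eq. 12): True once converged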

5.5 When does a unique stationary distribution exist?

For a Markov chain with transition matrix A, a sufficient condition to guarantee that \pi is its stationary distribution is

    \pi(i) A_{ij} = \pi(j) A_{ji}, \quad \forall i, j \in {1, ..., K}

This property is called detailed balance. The following is the proof:

    (\pi A)(j) = \sum_{i} \pi(i) A_{ij} = \sum_{i} \pi(j) A_{ji} = \pi(j) \sum_{i} A_{ji} = \pi(j)

Since the equality holds for any j, we have \pi A = \pi, which is the definition of a stationary distribution. To guarantee that \pi is unique, we need an additional assumption on the Markov chain: we can get from any state to any other state in n steps (for some integer n).

5.6 Proof that MH Generates Samples from p(z)

We need to prove that p(z) is the stationary distribution of the Markov chain defined by the Metropolis-Hastings algorithm. Equivalently, we need to confirm that p(z) satisfies the detailed balance equation:

    p(z) p(z' | z) = p(z') p(z | z')    (13)

where p(z' | z) is the transition probability defined by MH:

    p(z' | z) = q(z' | z) r(z' | z)    if z' \ne z
    p(z | z) = q(z | z) + \sum_{z' \ne z} q(z' | z) (1 - r(z' | z))    (14)

This follows from a case analysis: if you move from z to a different z', you must first propose it (with probability q(z' | z)) and then accept it (with probability r(z' | z)). If you stay at z, you must either propose z itself (with probability q(z | z)) or propose some z' \ne z (with probability q(z' | z)) and have it rejected (with probability 1 - r(z' | z)).

Next we prove that the detailed balance equation holds for p(z) and p(z' | z).

Proof: Suppose p(z') q(z | z') < p(z) q(z' | z). Then \alpha(z' | z) < 1, so r(z' | z) = \alpha(z' | z), and \alpha(z | z') > 1, so r(z | z') = 1. To move from z to a different z' \ne z, we must first propose z' and accept it:

    p(z' | z) = q(z' | z) r(z' | z) = q(z' | z) \frac{p(z') q(z | z')}{p(z) q(z' | z)} = \frac{p(z')}{p(z)} q(z | z')

To move from z' to z, we must first propose z and accept it:

    p(z | z') = q(z | z') r(z | z') = q(z | z')

Combining these two equations, we have

    p(z) p(z' | z) = p(z') p(z | z')    (15)

The same conclusion holds if p(z') q(z | z') > p(z) q(z' | z).
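As a numerical check of this argument (an added illustration; the discrete target p and proposal matrix q are arbitrary assumptions), the sketch below builds the MH transition matrix of Eq. (14) on a small finite state space and verifies both the detailed balance equation (13) and the stationarity of p.

import numpy as np

K = 4
rng = np.random.default_rng(0)

# Hypothetical target p and proposal matrix q on K states (rows of q sum to one).
p = np.array([0.1, 0.4, 0.3, 0.2])
q = rng.uniform(size=(K, K))
q /= q.sum(axis=1, keepdims=True)

# Acceptance probabilities r(j | i) = min(1, p(j) q(i | j) / (p(i) q(j | i))).
r = np.minimum(1.0, (p[None, :] * q.T) / (p[:, None] * q))

# MH transition matrix (Eq. 14): off-diagonal entries q * r, diagonal collects all rejections.
T = q * r
np.fill_diagonal(T, 0.0)
np.fill_diagonal(T, 1.0 - T.sum(axis=1))

# Detailed balance (Eq. 13): p(i) T(j | i) = p(j) T(i | j) for all pairs of states.
print(np.allclose(p[:, None] * T, (p[:, None] * T).T))  # True
print(np.allclose(p @ T, p))                            # p is a stationary distribution of T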

6 Gibbs Sampling

Consider p(z) = p(z_1, ..., z_M). Gibbs sampling works as follows. Initialize {z_i : i = 1, ..., M}; for s = 1, ..., S:

1. Sample z_1^{(s+1)} ~ p(z_1 | z_2^{(s)}, ..., z_M^{(s)})
2. Sample z_2^{(s+1)} ~ p(z_2 | z_1^{(s+1)}, z_3^{(s)}, ..., z_M^{(s)})
3. ...
4. Sample z_j^{(s+1)} ~ p(z_j | z_1^{(s+1)}, ..., z_{j-1}^{(s+1)}, z_{j+1}^{(s)}, ..., z_M^{(s)})
5. ...
6. Sample z_M^{(s+1)} ~ p(z_M | z_1^{(s+1)}, z_2^{(s+1)}, ..., z_{M-1}^{(s+1)})

Gibbs sampling can be viewed as a special case of MH. Consider an MH sampling step involving the variable z_k while fixing z_{-k} (all the remaining variables). The proposal distribution is q(z' | z) = p(z'_k | z_{-k}), and we also observe z'_{-k} = z_{-k} (because all the remaining variables do not change). Therefore, the acceptance ratio can be written as

    \alpha(z' | z) = \frac{p(z') q(z | z')}{p(z) q(z' | z)} = \frac{p(z'_k | z'_{-k}) p(z'_{-k}) p(z_k | z'_{-k})}{p(z_k | z_{-k}) p(z_{-k}) p(z'_k | z_{-k})} = 1

6.1 Comparing MH and Gibbs

Gibbs sampling has a few limitations:

1. It might be difficult to derive the conditional distribution for each random variable in the model.
2. Even if we have the posterior conditional for each variable, it might not be of a known form, and therefore there may be no straightforward way to draw samples from it.
3. The mixing of the Gibbs sampling chain might be very slow.
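As a small concrete example of this scheme (added here for illustration; the zero-mean bivariate Gaussian target with correlation rho = 0.8 is an assumption, chosen because its full conditionals are Gaussian and easy to sample), here is a two-variable Gibbs sampler in Python.

import numpy as np

rng = np.random.default_rng(0)

# Target: zero-mean bivariate Gaussian with unit variances and correlation rho.
# Full conditionals: z1 | z2 ~ N(rho * z2, 1 - rho^2), and symmetrically for z2 | z1.
rho = 0.8

def gibbs(n_samples, z_init=(0.0, 0.0)):
    z1, z2 = z_init
    samples = np.empty((n_samples, 2))
    cond_std = np.sqrt(1.0 - rho**2)
    for s in range(n_samples):
        z1 = rng.normal(rho * z2, cond_std)   # sample z1 ~ p(z1 | z2)
        z2 = rng.normal(rho * z1, cond_std)   # sample z2 ~ p(z2 | z1), using the new z1
        samples[s] = (z1, z2)
    return samples

samples = gibbs(20_000)
print(samples.mean(axis=0))                   # close to (0, 0)
print(np.corrcoef(samples[1000:].T))          # off-diagonal close to rho, after burn-in

Because each update conditions on the most recent values of the other variable, every proposal is accepted, consistent with the acceptance ratio of 1 derived above; the stronger the correlation rho, the slower the chain mixes, which illustrates the third limitation listed above.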