Markov Chain Monte Carlo

Yale Chang

1 Motivation

1.1 Bayesian Learning

In Bayesian learning, given data X, we make assumptions about the generative process of X by introducing hidden variables Z: p(Z) is the prior distribution of Z, and p(X | Z) is the likelihood function. The objective of inference is to compute the posterior distribution using Bayes' formula:

    p(Z | X) = \frac{p(X | Z) p(Z)}{p(X)}    (1)

After computing p(Z | X), we can make a prediction for any new sample x^*:

    p(x^* | X) = \int p(Z | X) p(x^* | Z) \, dZ    (2)

1.2 Inference

However, the challenge is that the marginal likelihood p(X) = \int p(X, Z) \, dZ might be intractable. This could be due to the exponential number of configurations when Z contains high-dimensional discrete variables, or to the difficulty of numerical integration when Z contains high-dimensional real-valued variables.

Approximate inference, including variational inference (VI) and expectation propagation (EP), uses a variational distribution q(Z; \lambda) to approximate the posterior p(Z | X). In this way, the inference problem is transformed into an optimization problem. The main drawback of approximate inference is that the inflexibility of q(Z; \lambda) can cause non-negligible approximation error.

An alternative strategy is to directly draw a set of samples from p(Z | X) and compute the expectation in the predictive distribution as follows:

    p(x^* | X) \approx \frac{1}{S} \sum_{s=1}^{S} p(x^* | z^{(s)})    (3)

where {z^{(1)}, z^{(2)}, ..., z^{(S)}} are S samples drawn from the Markov chain.

2 Standard Distributions

Our objective is to sample from p(z). We observe that all random variables we simulate on our computers are ultimately transformations of a uniform random variable. In this section, we assume we can generate pseudo-random numbers distributed uniformly over [0, 1]. We denote this random variable by U ~ U(0, 1). Assume the mapping from U to Z is Z = g(U), where g is a monotonically increasing function; then we have

    F_Z(z) = Prob[Z \le z] = Prob[g(U) \le z] = Prob[U \le g^{-1}(z)] = g^{-1}(z)

Therefore we have g^{-1} = F_Z, which leads to g = F_Z^{-1}. To sample from a standard distribution p(z), we first compute the inverse of its CDF, F_Z^{-1}, and then apply this transformation to samples from the uniform distribution.

2.1 Example: exponential distribution

    p(z) = \lambda \exp(-\lambda z) \quad (z \ge 0)

    F_Z(z) = \int_0^z p(t) \, dt = 1 - \exp(-\lambda z)

    Z = F_Z^{-1}(U) = -\frac{1}{\lambda} \log(1 - U)

Another example is the Cauchy distribution. This approach can also be generalized to multiple variables by observing that

    p(z_1, ..., z_M) = p(u_1, ..., u_M) \left| \frac{\partial(u_1, ..., u_M)}{\partial(z_1, ..., z_M)} \right|    (4)

We need to compute the CDF and its inverse, which is only feasible for a small set of distributions. For example, the Gamma distribution cannot be sampled in this way.

3 Rejection Sampling

Objective: sample from p(z) = \frac{1}{Z_p} \tilde{p}(z). Assume we can sample from a proposal distribution q(z). Introduce a constant k such that k q(z) \ge \tilde{p}(z) holds for all values of z.

1. Sample z_0 ~ q(z).
2. Sample u_0 ~ U(0, k q(z_0)).
3. Reject the sample z_0 if u_0 > \tilde{p}(z_0) and accept it otherwise.

The remaining pairs then have uniform distribution under the curve of \tilde{p}(z), and hence the corresponding z values are distributed according to p(z). The proof is as follows: consider the CDF of the accepted points:

    Prob(z \le z_0 | z accepted) = \frac{Prob(z \le z_0, z accepted)}{Prob(z accepted)} = \frac{\int_{-\infty}^{z_0} \tilde{p}(z) \, dz}{\int \tilde{p}(z) \, dz} = CDF(z_0)

The Gaussian distribution can be used to illustrate the problem of rejection sampling in high-dimensional space: the exponential decrease of the acceptance rate with dimensionality is a generic feature of rejection sampling. Although rejection sampling can be a useful technique in one or two dimensions, it is unsuited to problems of higher dimensionality.
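As a concrete illustration of these two basic methods (added here, not part of the original notes), the following Python sketch draws exponential samples via the inverse CDF of Section 2.1 and uses rejection sampling with a uniform proposal; the unnormalized target \tilde{p}(z) = z^2 (1 - z)^3 on [0, 1] and the constant k are assumptions chosen only for the example.

import numpy as np

rng = np.random.default_rng(0)

# Inverse-CDF transform for the exponential distribution (Section 2.1):
# Z = F^{-1}(U) = -(1/lam) * log(1 - U), with U ~ Uniform(0, 1).
lam = 2.0
u = rng.uniform(size=10_000)
z_exp = -np.log(1.0 - u) / lam          # samples from Exp(lam)

# Rejection sampling (Section 3) for a hypothetical unnormalized target on [0, 1].
def p_tilde(z):
    return z**2 * (1.0 - z)**3           # maximum is about 0.0346 at z = 0.4

k = 0.035                                # chosen so that k * q(z) >= p_tilde(z) everywhere

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        z0 = rng.uniform()               # step 1: z0 ~ q(z) = Uniform(0, 1)
        u0 = rng.uniform(0.0, k * 1.0)   # step 2: u0 ~ Uniform(0, k q(z0)); q(z0) = 1 here
        if u0 <= p_tilde(z0):            # step 3: accept iff u0 <= p_tilde(z0)
            samples.append(z0)
    return np.array(samples)

z_rej = rejection_sample(5_000)

The fraction of accepted proposals here equals the area under \tilde{p} divided by k, which is exactly the acceptance rate that collapses in high dimensions.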

4 Importance Sampling

Objective: estimate E_{p(z)}[f(z)]. Assume p(z) = \tilde{p}(z) / Z_p and proposal distribution q(z) = \tilde{q}(z) / Z_q. Then

    E_{p(z)}[f(z)] = \int p(z) f(z) \, dz = \int q(z) \frac{p(z)}{q(z)} f(z) \, dz = \frac{Z_q}{Z_p} \int q(z) \frac{\tilde{p}(z)}{\tilde{q}(z)} f(z) \, dz \approx \frac{Z_q}{Z_p} \frac{1}{S} \sum_{s=1}^{S} r_s f(z^{(s)})

where r_s = \tilde{p}(z^{(s)}) / \tilde{q}(z^{(s)}) and

    \frac{Z_p}{Z_q} = \frac{1}{Z_q} \int \tilde{p}(z) \, dz = \int q(z) \frac{\tilde{p}(z)}{\tilde{q}(z)} \, dz \approx \frac{1}{S} \sum_{s=1}^{S} r_s

Therefore, we have

    E_{p(z)}[f(z)] \approx \sum_{s=1}^{S} w_s f(z^{(s)})

where w_s = r_s / \sum_{t=1}^{S} r_t. As with rejection sampling, the success of the importance sampling approach depends crucially on how well the sampling distribution q(z) matches the desired distribution p(z).

5 Markov Chain Monte Carlo

Both rejection sampling and importance sampling work poorly for high-dimensional variables. Markov chain Monte Carlo (MCMC) constructs a Markov chain whose stationary distribution is the posterior p(Z | X). A set of samples from the Markov chain follows the distribution p(Z | X) and can therefore be used to compute the expectation in the predictive distribution.

5.1 Metropolis-Hastings Algorithm

The sampling algorithm achieving the above objective is called Metropolis-Hastings (MH). Below is the MH procedure for sampling from p(z). Initialize z^{(0)}; for s = 0, 1, 2, ...:

1. Set z = z^{(s)}.
2. Sample from the proposal distribution: z' ~ q(z' | z).
3. Compute the acceptance ratio \alpha(z' | z) = \frac{p(z') q(z | z')}{p(z) q(z' | z)} and set r(z' | z) = \min(1, \alpha(z' | z)).
4. Sample from the uniform distribution: u ~ U(0, 1).
5. Set the new sample z^{(s+1)} equal to z' if u < r and to z^{(s)} otherwise.

The key question is how to prove that the distribution of z^{(s)} converges to p(Z | X). To answer this question, we need a brief recap of Markov chains and the properties related to their stationary distributions.
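To make the MH procedure concrete, here is a minimal Python sketch (an added illustration, not from the original notes). It uses a Gaussian random-walk proposal, for which q(z' | z) = q(z | z'), so the proposal terms cancel in the acceptance ratio; the bimodal target density and the step size are assumptions chosen for the example.

import numpy as np

rng = np.random.default_rng(0)

def log_p_tilde(z):
    # Hypothetical unnormalized log-target: a mixture of two Gaussians at -2 and +2.
    return np.logaddexp(-0.5 * (z - 2.0)**2, -0.5 * (z + 2.0)**2)

def metropolis_hastings(n_samples, step=1.0, z0=0.0):
    z = z0
    samples = np.empty(n_samples)
    for s in range(n_samples):
        z_prop = z + step * rng.normal()                  # step 2: z' ~ q(z' | z), symmetric random walk
        log_alpha = log_p_tilde(z_prop) - log_p_tilde(z)  # step 3: q terms cancel for a symmetric proposal
        if np.log(rng.uniform()) < min(0.0, log_alpha):   # steps 4-5: accept with probability min(1, alpha)
            z = z_prop
        samples[s] = z                                    # on rejection, keep z^{(s+1)} = z^{(s)}
    return samples

samples = metropolis_hastings(20_000)
# Discard an initial burn-in before using the samples in Monte Carlo averages.

The step size trades off acceptance rate against how quickly the chain moves through the support of the target.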

5.2 Markov Chain

Given a sequence of random variables Z^{(1)}, ..., Z^{(S)} of arbitrary length S, their joint distribution can be written as

    p(Z^{(1)}, ..., Z^{(S)}) = p(Z^{(1)}) \prod_{s=2}^{S} p(Z^{(s)} | Z^{(1:s-1)})    (5)

In a first-order Markov chain, we assume p(Z^{(s)} | Z^{(1:s-1)}) = p(Z^{(s)} | Z^{(s-1)}); then the joint distribution can be written as

    p(Z^{(1)}, ..., Z^{(S)}) = p(Z^{(1)}) \prod_{s=2}^{S} p(Z^{(s)} | Z^{(s-1)})    (6)

That is, we assume Z^{(s)} is sufficient to predict the future. If we further assume that p(Z^{(s)} | Z^{(s-1)}) does not depend on s, then the chain is called homogeneous, stationary, or time-invariant.

5.3 Transition Matrix

When Z^{(s)} is discrete, i.e., Z^{(s)} \in {1, ..., K}, the conditional distribution p(Z^{(s)} | Z^{(s-1)}) can be written as a K x K matrix A, where

    A_{ij} = p(Z^{(s)} = j | Z^{(s-1)} = i)    (7)

represents the probability of going from state i to state j in one step. We call A the transition matrix. Each row of the matrix sums to one:

    \sum_{j=1}^{K} A_{ij} = 1    (8)

We can also define the n-step transition matrix A(n) as

    A_{ij}(n) = p(Z^{(s+n)} = j | Z^{(s)} = i)    (9)

which is the probability of going from state i to state j in exactly n steps. By definition, A(1) = A. The Chapman-Kolmogorov equation A_{ij}(m+n) = \sum_{k=1}^{K} A_{ik}(m) A_{kj}(n) indicates that A(m+n) = A(m) A(n). Therefore A(n) = A^n.

5.4 Stationary Distribution

We have been focusing on Markov models as a way of defining a joint probability distribution over a sequence of random variables. A Markov model can also be viewed as a stochastic dynamical system. In this case, we are interested in the long-term distribution over states, which is called the stationary distribution. Let \pi_s(j) = p(Z^{(s)} = j) be the probability of being in state j at step s; then we have

    \pi_s(j) = \sum_{i} \pi_{s-1}(i) A_{ij}    (10)

If we treat \pi_s = (\pi_s(1), ..., \pi_s(K)) \in R^{1 x K} as a row vector, this formula can be rewritten as

    \pi_s = \pi_{s-1} A    (11)

\pi is called the stationary distribution of the Markov chain if it satisfies

    \pi = \pi A    (12)

Once we enter the stationary distribution, we will never leave it.
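The following numpy sketch (an added illustration; the 3-state transition matrix is an arbitrary assumption) checks these facts numerically: A(n) = A^n, and iterating \pi_s = \pi_{s-1} A converges to a \pi satisfying \pi = \pi A.

import numpy as np

# Hypothetical 3-state transition matrix; each row sums to one (Eq. 8).
A = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

# n-step transition matrix: A(n) = A^n (Chapman-Kolmogorov, Eq. 9).
A5 = np.linalg.matrix_power(A, 5)

# Iterate pi_s = pi_{s-1} A (Eq. 11) from an arbitrary starting distribution.
pi = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    pi = pi @ A

print(pi)                       # long-run distribution over the three states
print(np.allclose(pi, pi @ A))  # pi = pi A (Eq. 12): True once converged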

5.5 When does a unique stationary distribution exist?

For a Markov chain with transition matrix A, a sufficient condition to guarantee that \pi is its stationary distribution is

    \pi(i) A_{ij} = \pi(j) A_{ji}, \quad \forall i, j \in {1, ..., K}

This property is called detailed balance. The following is the proof:

    (\pi A)(j) = \sum_{i} \pi(i) A_{ij} = \sum_{i} \pi(j) A_{ji} = \pi(j) \sum_{i} A_{ji} = \pi(j)

Since the equality holds for any j, we have \pi A = \pi, which is the definition of a stationary distribution. To guarantee that \pi is unique, we need an additional assumption on the Markov chain: we can get from any state to any other state in n steps (for some integer n).

5.6 Proof that MH Generates Samples from p(z)

We need to prove that p(z) is the stationary distribution of the Markov chain defined by the Metropolis-Hastings algorithm. Equivalently, we need to confirm that p(z) satisfies the detailed balance equation:

    p(z) p(z' | z) = p(z') p(z | z')    (13)

where p(z' | z) is the transition probability defined by MH:

    p(z' | z) = q(z' | z) r(z' | z)    if z' \ne z
    p(z | z) = q(z | z) + \sum_{z' \ne z} q(z' | z) (1 - r(z' | z))    (14)

This follows from a case analysis: if you move from z to a different z', you must first propose it (with probability q(z' | z)) and then accept it (with probability r(z' | z)). If you stay at z, you must either propose z itself (with probability q(z | z)) or propose some z' \ne z (with probability q(z' | z)) and have it rejected (with probability 1 - r(z' | z)).

Next we prove that the detailed balance equation holds for p(z) and p(z' | z).

Proof: Suppose p(z') q(z | z') < p(z) q(z' | z). Then \alpha(z' | z) < 1, so r(z' | z) = \alpha(z' | z), and \alpha(z | z') > 1, so r(z | z') = 1. To move from z to a different z' \ne z, we must first propose z' and accept it:

    p(z' | z) = q(z' | z) r(z' | z) = q(z' | z) \frac{p(z') q(z | z')}{p(z) q(z' | z)} = \frac{p(z')}{p(z)} q(z | z')

To move from z' to z, we must first propose z and accept it:

    p(z | z') = q(z | z') r(z | z') = q(z | z')

Combining these two equations, we have

    p(z) p(z' | z) = p(z') p(z | z')    (15)

The same conclusion holds if p(z') q(z | z') > p(z) q(z' | z).
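As a numerical check of this argument (an added illustration; the discrete target p and proposal matrix q are arbitrary assumptions), the sketch below builds the MH transition matrix of Eq. (14) on a small finite state space and verifies both the detailed balance equation (13) and the stationarity of p.

import numpy as np

K = 4
rng = np.random.default_rng(0)

# Hypothetical target p and proposal matrix q on K states (rows of q sum to one).
p = np.array([0.1, 0.4, 0.3, 0.2])
q = rng.uniform(size=(K, K))
q /= q.sum(axis=1, keepdims=True)

# Acceptance probabilities r(j | i) = min(1, p(j) q(i | j) / (p(i) q(j | i))).
r = np.minimum(1.0, (p[None, :] * q.T) / (p[:, None] * q))

# MH transition matrix (Eq. 14): off-diagonal entries q * r, diagonal collects all rejections.
T = q * r
np.fill_diagonal(T, 0.0)
np.fill_diagonal(T, 1.0 - T.sum(axis=1))

# Detailed balance (Eq. 13): p(i) T(j | i) = p(j) T(i | j) for all pairs of states.
print(np.allclose(p[:, None] * T, (p[:, None] * T).T))  # True
print(np.allclose(p @ T, p))                            # p is a stationary distribution of T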

6 Gibbs Sampling

Consider p(z) = p(z_1, ..., z_M). Gibbs sampling works as follows. Initialize {z_i : i = 1, ..., M}; for s = 1, ..., S:

1. Sample z_1^{(s+1)} ~ p(z_1 | z_2^{(s)}, ..., z_M^{(s)})
2. Sample z_2^{(s+1)} ~ p(z_2 | z_1^{(s+1)}, z_3^{(s)}, ..., z_M^{(s)})
3. ...
4. Sample z_j^{(s+1)} ~ p(z_j | z_1^{(s+1)}, ..., z_{j-1}^{(s+1)}, z_{j+1}^{(s)}, ..., z_M^{(s)})
5. ...
6. Sample z_M^{(s+1)} ~ p(z_M | z_1^{(s+1)}, z_2^{(s+1)}, ..., z_{M-1}^{(s+1)})

Gibbs sampling can be viewed as a special case of MH. Consider an MH sampling step involving the variable z_k while fixing z_{-k} (all the remaining variables). The proposal distribution is q(z' | z) = p(z'_k | z_{-k}), and we also observe z'_{-k} = z_{-k} (because all the remaining variables do not change). Therefore, the acceptance ratio can be written as

    \alpha(z' | z) = \frac{p(z') q(z | z')}{p(z) q(z' | z)} = \frac{p(z'_k | z'_{-k}) p(z'_{-k}) p(z_k | z'_{-k})}{p(z_k | z_{-k}) p(z_{-k}) p(z'_k | z_{-k})} = 1

6.1 Comparing MH and Gibbs

Gibbs sampling has a few limitations:

1. It might be difficult to derive the conditional distribution for each random variable in the model.
2. Even if we have the posterior conditional for each variable, it might not be of a known form, and therefore there may be no straightforward way to draw samples from it.
3. The mixing of the Gibbs sampling chain might be very slow.
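As a small concrete example of this scheme (added here for illustration; the zero-mean bivariate Gaussian target with correlation rho = 0.8 is an assumption, chosen because its full conditionals are Gaussian and easy to sample), here is a two-variable Gibbs sampler in Python.

import numpy as np

rng = np.random.default_rng(0)

# Target: zero-mean bivariate Gaussian with unit variances and correlation rho.
# Full conditionals: z1 | z2 ~ N(rho * z2, 1 - rho^2), and symmetrically for z2 | z1.
rho = 0.8

def gibbs(n_samples, z_init=(0.0, 0.0)):
    z1, z2 = z_init
    samples = np.empty((n_samples, 2))
    cond_std = np.sqrt(1.0 - rho**2)
    for s in range(n_samples):
        z1 = rng.normal(rho * z2, cond_std)   # sample z1 ~ p(z1 | z2)
        z2 = rng.normal(rho * z1, cond_std)   # sample z2 ~ p(z2 | z1), using the new z1
        samples[s] = (z1, z2)
    return samples

samples = gibbs(20_000)
print(samples.mean(axis=0))                   # close to (0, 0)
print(np.corrcoef(samples[1000:].T))          # off-diagonal close to rho, after burn-in

Because each update conditions on the most recent values of the other variable, every proposal is accepted, consistent with the acceptance ratio of 1 derived above; the stronger the correlation rho, the slower the chain mixes, which illustrates the third limitation listed above.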