Lecture 7 and 8: Markov Chain Monte Carlo

4F13: Machine Learning
Zoubin Ghahramani and Carl Edward Rasmussen
Department of Engineering, University of Cambridge
http://mlg.eng.cam.ac.uk/teaching/4f13/

Objective

Approximate expectations of a function $\phi(x)$ with respect to a probability $p(x)$:

$$\mathbb{E}_{p(x)}[\phi(x)] = \bar{\phi} = \int \phi(x)\, p(x)\, dx, \qquad x \in \mathbb{R}^D,$$

when these are not analytically tractable, and typically $D \gg 1$.

Assume that we can evaluate $\phi(x)$ and $p(x)$, or at least $p^*(x)$, an un-normalized version of $p(x)$, for any value of $x$.

Motivation

Such integrals, or marginalizations, are essential for probabilistic inference and learning:

Making predictions, e.g. in supervised learning:

$$p(y|x, \mathcal{D}) = \int p(y|x, \theta)\, p(\theta|\mathcal{D})\, d\theta,$$

where the posterior distribution plays the role of $p(x)$.

Approximating marginal likelihoods:

$$p(\mathcal{D}|\mathcal{M}_i) = \int p(\mathcal{D}|\theta, \mathcal{M}_i)\, p(\theta|\mathcal{M}_i)\, d\theta,$$

where the prior distribution plays the role of $p(x)$.

Numerical integration on a grid

Approximate the integral by a sum of products:

$$\int \phi(x)\, p(x)\, dx \;\simeq\; \frac{1}{T} \sum_{\tau=1}^{T} \phi(x^{(\tau)})\, p(x^{(\tau)}),$$

where the $x^{(\tau)}$ lie on an equidistant grid (or fancier versions of this).

Problem: the number of grid points required, $k^D$, grows exponentially with the dimension $D$.
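
To make the cost concrete, here is a minimal sketch (assuming NumPy; the function names are ours, not from the lecture) of grid integration for a 1-D Gaussian, together with the $k^D$ count that makes the same approach hopeless in high dimensions.

```python
import numpy as np

# Minimal sketch (not from the lecture): grid approximation of
# E[phi(x)] for a 1-D standard Gaussian p(x), with phi(x) = x**2.
def grid_expectation(phi, log_p, lo=-5.0, hi=5.0, k=100):
    x = np.linspace(lo, hi, k)          # equidistant grid
    p = np.exp(log_p(x))
    w = (hi - lo) / k                   # grid cell width
    return np.sum(phi(x) * p * w)       # Riemann-sum approximation

log_gauss = lambda x: -0.5 * x**2 - 0.5 * np.log(2 * np.pi)
print(grid_expectation(lambda x: x**2, log_gauss))   # close to 1.0

# The same grid in D dimensions needs k**D points:
for D in [1, 2, 5, 10]:
    print(D, 100**D)    # number of grid evaluations explodes with D
```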

Monte Carlo

The foundation identity for Monte Carlo approximations is

$$\mathbb{E}[\phi(x)] \;\simeq\; \hat{\phi} = \frac{1}{T} \sum_{\tau=1}^{T} \phi(x^{(\tau)}), \qquad \text{where } x^{(\tau)} \sim p(x).$$

Under mild conditions, $\hat{\phi} \to \mathbb{E}[\phi(x)]$ as $T \to \infty$. For moderate $T$, $\hat{\phi}$ may still be a good approximation. In fact it is an unbiased estimate with

$$\mathbb{V}[\hat{\phi}] = \frac{\mathbb{V}[\phi]}{T}, \qquad \text{where } \mathbb{V}[\phi] = \int \big(\phi(x) - \bar{\phi}\big)^2 p(x)\, dx.$$

Note that this variance is independent of the dimension $D$ of $x$.

This is great, but how do we generate random samples from $p(x)$?
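
As an illustration of the identity above, a minimal sketch (assuming NumPy; not part of the lecture) that estimates $\mathbb{E}[x^2]$ under a standard Gaussian and shows the estimate tightening as $T$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
phi = lambda x: x**2                   # function whose expectation we want

# Monte Carlo estimate of E[phi(x)] under p(x) = N(0, 1); true value is 1.
for T in [10, 100, 10_000]:
    x = rng.standard_normal(T)         # samples x ~ p(x)
    print(T, phi(x).mean())            # (1/T) * sum of phi(x^(tau))
```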

Random number generation

There are (pseudo-) random number generators for the uniform $[0, 1]$ distribution. Transformation methods exist to turn these into other simple univariate distributions, such as Gaussian, Gamma, ..., as well as some multivariate distributions, e.g. Gaussian and Dirichlet.

Example: $p(x)$ is uniform on $[0, 1]$. The transformation $y = -\log(x)$ yields

$$p(y) = p(x) \left| \frac{dx}{dy} \right| = \exp(-y),$$

i.e. an exponential distribution.

Typically $p(x)$ may be more complicated. How can we generate samples from a (multivariate) distribution $p(x)$?

Idea: we can discretize $p(x)$, and sample from the discrete multinomial distribution. Will this work?
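
A minimal sketch of this transformation method (assuming NumPy; purely illustrative): draw uniforms and map them through $y = -\log(x)$, which should reproduce the unit-rate exponential.

```python
import numpy as np

rng = np.random.default_rng(1)

# Transformation method: x ~ Uniform[0, 1], then y = -log(x) is Exponential(1).
x = rng.uniform(size=100_000)
y = -np.log(x)

print(y.mean(), y.var())   # both close to 1, as for a unit-rate exponential
```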

Rejection sampling

Find a tractable distribution $q(x)$ and $c \geqslant 1$, such that $\forall x,\; c\, q(x) \geqslant p(x)$.

[Figure: $p(x)$, $\phi(x)$, and the envelope $c\, q(x)$.]

Rejection sampling algorithm:

Generate samples independently from $q(x)$.
Accept samples with probability $p(x)/(c\, q(x))$, otherwise reject.
Form a Monte Carlo estimate from the accepted samples.

This estimate will be exactly unbiased. How effective will it be?
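
A minimal rejection-sampling sketch (assuming NumPy; the target and envelope are our own illustrative choices, not from the slides): the target is the Beta(2, 2) density $p(x) = 6x(1-x)$ on $[0, 1]$, bounded by $c\,q(x)$ with $q$ uniform and $c = 1.5$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Target p(x) = 6 x (1 - x) on [0, 1]; envelope q(x) = Uniform[0, 1], c = 1.5,
# so c * q(x) >= p(x) everywhere.
p = lambda x: 6.0 * x * (1.0 - x)
c = 1.5

x = rng.uniform(size=100_000)             # proposals from q
u = rng.uniform(size=100_000)
accepted = x[u < p(x) / (c * 1.0)]        # accept with probability p(x) / (c q(x))

print(len(accepted) / 100_000)            # acceptance rate ~ 1/c = 2/3
print(accepted.mean())                    # ~ 0.5, the Beta(2, 2) mean
```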

Importance sampling

Find a tractable $q(x) \simeq p(x)$, such that $q(x) > 0$ whenever $p(x) > 0$.

[Figure: $p(x)$, $\phi(x)$, and the proposal $q(x)$.]

Form the importance sampling estimate:

$$\int \phi(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx \;\simeq\; \hat{\phi} = \frac{1}{T} \sum_{\tau=1}^{T} \phi(x^{(\tau)})\, \underbrace{\frac{p(x^{(\tau)})}{q(x^{(\tau)})}}_{w(x^{(\tau)})}, \qquad \text{where } x^{(\tau)} \sim q(x),$$

where the $w(x^{(\tau)})$ are called the importance weights.

There is also a version of importance sampling which doesn't require knowing how $q(x)$ normalizes:

$$\hat{\phi} = \frac{\sum_\tau \phi(x^{(\tau)})\, w(x^{(\tau)})}{\sum_\tau w(x^{(\tau)})}, \qquad \text{where } w(x^{(\tau)}) = \frac{p(x^{(\tau)})}{q^*(x^{(\tau)})}, \text{ and } x^{(\tau)} \sim q(x).$$

How fast is importance sampling? This depends on the variance of the importance weights.

Example: $p(x) = \mathcal{N}(0, 1)$ and $q(x) = \mathcal{N}(0, \sigma^2)$. The variance of the weights is

$$\mathbb{V}[w] = \mathbb{E}[w^2] - \mathbb{E}^2[w] = \int \frac{p(x)^2}{q(x)^2}\, q(x)\, dx - \left(\int \frac{p(x)}{q(x)}\, q(x)\, dx\right)^2 = \frac{\sigma}{\sqrt{2\pi}} \int \exp\!\left(-x^2 + \frac{x^2}{2\sigma^2}\right) dx \;-\; 1,$$

which only exists when $\sigma^2 > 1/2$.

Moral: always choose $q(x)$ to be wider (have heavier tails) than $p(x)$ to avoid this problem.
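
A minimal self-normalized importance sampling sketch (assuming NumPy; our own illustration, not from the slides): the target is $\mathcal{N}(0,1)$, the proposal $\mathcal{N}(0,\sigma^2)$, and we estimate $\mathbb{E}[x^2] = 1$; normalization constants cancel in the weight ratio, and the weight variance degrades as $\sigma$ shrinks below $1/\sqrt{2}$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Self-normalized importance sampling: target N(0, 1), proposal N(0, sigma^2).
def importance_estimate(sigma, T=50_000):
    x = rng.normal(0.0, sigma, size=T)                 # x ~ q
    log_w = (-0.5 * x**2) - (-0.5 * (x / sigma)**2)    # log p*(x) - log q*(x)
    w = np.exp(log_w - log_w.max())                    # normalizers cancel here
    estimate = np.sum(w * x**2) / np.sum(w)
    rel_weight_var = w.var() / w.mean()**2             # relative weight variance
    return estimate, rel_weight_var

for sigma in [2.0, 1.0, 0.6]:    # 0.6 < 1/sqrt(2): weights start to misbehave
    est, rel_var = importance_estimate(sigma)
    print(sigma, round(est, 3), round(rel_var, 2))
```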

Rejection and importance sampling

In the multivariate case with a Gaussian $q(x)$, we must be able to guarantee that no direction exists in which we underestimate the width by more than a factor of $\sqrt{2}$.

In fact, both rejection and importance sampling have severe problems in high dimensions, since it is virtually impossible to get $q(x)$ close enough to $p(x)$; see MacKay, chapter 29, for details.

Independent sampling vs. Markov chains

So far, we've considered two methods, rejection and importance sampling, which were both based on independent samples from $q(x)$. However, for many problems of practical interest it is difficult or impossible to find a $q(x)$ with the necessary properties.

Instead, we can abandon the idea of independent sampling and rely on a Markov chain to generate dependent samples from the target distribution. Independence would be a nice thing, but it is not necessary for the Monte Carlo estimate to be valid.

Markov Chain Monte Carlo

We want to construct a Markov chain that explores $p(x)$.

Markov chain: $x^{(t)} \sim q(x^{(t)} | x^{(t-1)})$.

MCMC gives approximate, correlated samples from $p(x)$.

Challenge: how do we find transition probabilities $q(x^{(t)} | x^{(t-1)})$ which give rise to the correct stationary distribution $p(x)$?

Discrete Markov chains

Consider

$$p = \begin{pmatrix} 3/5 \\ 1/5 \\ 1/5 \end{pmatrix}, \qquad Q = \begin{pmatrix} 2/3 & 1/2 & 1/2 \\ 1/6 & 0 & 1/2 \\ 1/6 & 1/2 & 0 \end{pmatrix}, \qquad Q_{ij} = Q(x_i \leftarrow x_j),$$

where $Q$ is a stochastic (or transition) matrix.

To machine precision: $Q^{100} \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = p$.

$p$ is called a stationary distribution of $Q$, since $Qp = p$. Ergodicity is also a requirement.
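
To check this numerically, a minimal sketch (assuming NumPy; purely illustrative) that applies the transition matrix above 100 times to the point mass on state 1:

```python
import numpy as np

# Column-stochastic transition matrix Q and claimed stationary distribution p.
Q = np.array([[2/3, 1/2, 1/2],
              [1/6, 0.0, 1/2],
              [1/6, 1/2, 0.0]])
p = np.array([3/5, 1/5, 1/5])

x0 = np.array([1.0, 0.0, 0.0])                 # start in state 1
print(np.linalg.matrix_power(Q, 100) @ x0)     # -> [0.6, 0.2, 0.2]
print(Q @ p)                                   # stationarity: Q p = p
```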

In continuous spaces

In continuous spaces, transitions are governed by $q(x'|x)$. Now, $p(x)$ is a stationary distribution for $q(x'|x)$ if

$$\int q(x'|x)\, p(x)\, dx = p(x').$$

Detailed balance

Detailed balance means

$$q(x'|x)\, p(x) = q(x|x')\, p(x').$$

Now, integrating both sides with respect to $x$, we get

$$\int q(x'|x)\, p(x)\, dx = \int q(x|x')\, p(x')\, dx = p(x').$$

Thus, detailed balance implies the existence of a stationary distribution.

The Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm:

propose a new state $x'$ from $q(x'|x^{(\tau)})$
compute the acceptance probability $a$:

$$a = \frac{p(x')\, q(x^{(\tau)}|x')}{p(x^{(\tau)})\, q(x'|x^{(\tau)})}$$

if $a > 1$ the proposed state is accepted; otherwise the proposed state is accepted with probability $a$. If the proposed state is accepted, then $x^{(\tau+1)} = x'$, otherwise $x^{(\tau+1)} = x^{(\tau)}$.

This Markov chain has $p(x)$ as a stationary distribution. This holds trivially if $x^{(\tau+1)} = x^{(\tau)}$; otherwise

$$p(x)\, Q(x'|x) = p(x)\, q(x'|x) \min\!\left(1, \frac{p(x')\, q(x|x')}{p(x)\, q(x'|x)}\right) = \min\!\big(p(x)\, q(x'|x),\; p(x')\, q(x|x')\big) = p(x')\, q(x|x') \min\!\left(1, \frac{p(x)\, q(x'|x)}{p(x')\, q(x|x')}\right) = p(x')\, Q(x|x'),$$

so detailed balance holds.
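
A minimal Metropolis sketch (assuming NumPy; a symmetric Gaussian proposal, so the Hastings correction cancels; an illustration of the algorithm above, not the lecturers' code) targeting an unnormalized 1-D density:

```python
import numpy as np

rng = np.random.default_rng(4)

# Unnormalized target: a two-component Gaussian mixture (normalizer not needed).
log_p_star = lambda x: np.logaddexp(-0.5 * (x - 2.0)**2, -0.5 * (x + 2.0)**2)

def metropolis(log_p_star, x0=0.0, n_steps=50_000, step=1.0):
    x, samples = x0, []
    for _ in range(n_steps):
        x_prop = x + step * rng.standard_normal()        # symmetric proposal
        a = np.exp(log_p_star(x_prop) - log_p_star(x))   # a = p*(x') / p*(x)
        if rng.uniform() < a:                            # accept with prob min(1, a)
            x = x_prop
        samples.append(x)                                # keep current state either way
    return np.array(samples)

s = metropolis(log_p_star)
print(s.mean(), s.var())    # mean ~ 0, variance ~ 5 for this mixture
```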

Some properties of Metropolis-Hastings

The Metropolis algorithm has $p(x)$ as its stationary distribution.

If $q(x'|x^{(\tau)})$ is symmetric, the expression for $a$ simplifies to $a = p(x')/p(x^{(\tau)})$: the algorithm then always accepts if the proposed state has higher probability than the current state, and sometimes accepts a state with lower probability.

We only need the ratio of $p(x)$'s, so we don't need the normalization constant. This is important, e.g. when sampling from a posterior distribution.

The Metropolis algorithm can be widely applied; you just need to specify a proposal distribution. The proposal distribution must satisfy some (mild) constraints (related to ergodicity).

The proposal distribution

Often, Gaussian proposal distributions are used, centered on the current state. You need to specify the width of the proposal distribution.

What happens if the proposal distribution is too wide? Too narrow?

Metropolis-Hastings example

[Figure: 20 iterations of the Metropolis-Hastings algorithm for a bivariate Gaussian.]

The proposal distribution was Gaussian, centered on the current state. Rejected states are indicated by dotted lines.

Random walks

When exploring a distribution with the Metropolis algorithm, the proposal has to be narrow to avoid too many rejections. If a typical step moves a distance $\varepsilon$, then one needs an expected $T = (L/\varepsilon)^2$ steps to travel a distance of $L$. This is a problem with random walk behavior.

Notice that strong correlations can slow down the Metropolis algorithm, because the proposal width should be set according to the tightest direction.
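
To see the $(L/\varepsilon)^2$ scaling empirically, a small sketch (assuming NumPy; our own illustration, not from the slides) measuring how many unit-size random-walk steps it takes, on average, to first travel a distance $L$:

```python
import numpy as np

rng = np.random.default_rng(5)

# Average number of +/- eps random-walk steps needed to first reach distance L.
def steps_to_travel(L, eps=1.0, n_runs=200):
    counts = []
    for _ in range(n_runs):
        x, t = 0.0, 0
        while abs(x) < L:
            x += eps * rng.choice([-1.0, 1.0])
            t += 1
        counts.append(t)
    return np.mean(counts)

for L in [5, 10, 20]:
    print(L, steps_to_travel(L), (L / 1.0)**2)   # empirical count vs (L / eps)^2
```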

Variants of the Metropolis algorithm

Instead of proposing a new state by changing all components of the state simultaneously, you can concatenate different proposals, each changing one component at a time. For example, $q_j(x'|x)$, such that $x_i' = x_i$ for all $i \neq j$. Now, cycle through the $q_j(x'|x)$ in turn.

This is valid because:

each iteration obeys detailed balance
the sequence guarantees ergodicity

Gibbs sampling

Updating one coordinate at a time, and choosing the proposal distribution to be the conditional distribution of that variable given all other variables, $q(x_i'|x) = p(x_i'|x_{\setminus i})$, we get

$$a = \min\left(1, \frac{p(x_{\setminus i}')\, p(x_i'|x_{\setminus i}')\, p(x_i|x_{\setminus i}')}{p(x_{\setminus i})\, p(x_i|x_{\setminus i})\, p(x_i'|x_{\setminus i})}\right) = 1,$$

since $x_{\setminus i}' = x_{\setminus i}$; i.e., we get an algorithm which always accepts. This is called the Gibbs sampling algorithm.

If you can compute (and sample from) the conditionals, you can apply Gibbs sampling.

This algorithm is completely parameter free.

It can also be applied to subsets of variables.
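
A minimal Gibbs sketch for a correlated bivariate Gaussian (assuming NumPy; our own illustration, not from the slides), using the standard conditionals $x_1 | x_2 \sim \mathcal{N}(\rho x_2, 1 - \rho^2)$ and symmetrically for $x_2$:

```python
import numpy as np

rng = np.random.default_rng(6)
rho = 0.95                                  # correlation of the target Gaussian

# Gibbs sampling for a zero-mean bivariate Gaussian with unit variances:
# each full conditional is x_i | x_j ~ N(rho * x_j, 1 - rho^2), so every
# update is an exact draw from the conditional and is always "accepted".
def gibbs(n_steps=20_000):
    x1, x2, samples = 0.0, 0.0, []
    for _ in range(n_steps):
        x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # sample p(x1 | x2)
        x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # sample p(x2 | x1)
        samples.append((x1, x2))
    return np.array(samples)

s = gibbs()
print(np.corrcoef(s.T))     # sample correlation close to rho = 0.95
```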

Example: Gibbs sampling

[Figure: 20 iterations of Gibbs sampling on a bivariate Gaussian.]

Notice that strong correlations can slow down Gibbs sampling.

Hybrid Monte Carlo

Define a joint distribution over position and velocity:

$$p(x, v) = p(v)\, p(x) = e^{-E(x) - K(v)} = e^{-H(x, v)},$$

with $v$ independent of $x$ and $v \sim \mathcal{N}(0, 1)$.

Sample in the joint space, then ignore the velocities to get the positions.

Markov chain:

Gibbs sample velocities using $\mathcal{N}(0, 1)$
Simulate Hamiltonian dynamics through fictitious time, then negate the velocity

The proposal is deterministic and reversible; conservation of energy guarantees $p(x, v) = p(x', v')$, so the Metropolis acceptance probability is 1.

But we cannot usually simulate Hamiltonian dynamics exactly.

Leap-frog dynamics

A discrete approximation of Hamiltonian dynamics:

$$v_i(t + \tfrac{\varepsilon}{2}) = v_i(t) - \frac{\varepsilon}{2} \frac{\partial E(x(t))}{\partial x_i}$$

$$x_i(t + \varepsilon) = x_i(t) + \varepsilon\, v_i(t + \tfrac{\varepsilon}{2})$$

$$v_i(t + \varepsilon) = v_i(t + \tfrac{\varepsilon}{2}) - \frac{\varepsilon}{2} \frac{\partial E(x(t + \varepsilon))}{\partial x_i}$$

Using $L$ steps of leap-frog dynamics:

$H$ is no longer exactly conserved
the dynamics are still deterministic and reversible
acceptance probability $\min\big(1, \exp(H(v, x) - H(v', x'))\big)$
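
Putting the two previous slides together, a compact Hybrid Monte Carlo sketch (assuming NumPy; the step size $\varepsilon$ and number of steps $L$ are arbitrary illustrative choices, not values from the lecture) for a 1-D Gaussian target with $E(x) = x^2/2$:

```python
import numpy as np

rng = np.random.default_rng(7)

E = lambda x: 0.5 * x**2          # energy of a standard Gaussian target
dE = lambda x: x                  # its gradient
K = lambda v: 0.5 * v**2          # kinetic energy, v ~ N(0, 1)

def hmc(n_samples=10_000, eps=0.2, L=20):
    x, samples = 0.0, []
    for _ in range(n_samples):
        v = rng.standard_normal()                    # Gibbs-sample the velocity
        x_new, v_new = x, v - 0.5 * eps * dE(x)      # initial half-step for velocity
        for _ in range(L):
            x_new = x_new + eps * v_new              # full position step
            v_new = v_new - eps * dE(x_new)          # full velocity step
        v_new = v_new + 0.5 * eps * dE(x_new)        # correct last step back to a half-step
        # Metropolis accept/reject on the change in total energy H = E + K
        dH = (E(x_new) + K(v_new)) - (E(x) + K(v))
        if rng.uniform() < np.exp(-dH):
            x = x_new
        samples.append(x)
    return np.array(samples)

s = hmc()
print(s.mean(), s.var())    # ~ 0 and ~ 1 for the standard Gaussian target
```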

Hybrid Monte Carlo avoids random walks

The advantage of introducing velocity variables is that random walks are avoided.

Summary

Rejection and importance sampling are based on independent samples from $q(x)$. These methods are not suitable for high-dimensional problems.

The Metropolis method does not give independent samples, but can be used successfully in high dimensions.

Gibbs sampling is used extensively in practice:

it is parameter free
it requires simple conditional distributions

Hybrid Monte Carlo is used extensively in continuous systems:

it avoids random walks
it requires setting of $\varepsilon$ and $L$

Monte Carlo in practice

MCMC is very general and can be applied in systems with thousands of variables, but care must be taken when using it:

high dependency between variables may cause slow exploration
how long does burn-in take?
how correlated are the samples generated?
slow exploration may be hard to detect, but you might not be getting the right answer
diagnosing a convergence problem may be difficult
curing a convergence problem may be even more difficult