Markov Chain Monte Carlo (MCMC)


School of Computer Science, 10-708 Probabilistic Graphical Models
Markov Chain Monte Carlo (MCMC)
Matt Gormley, Lecture 16, March 14, 2016
Readings: MacKay Ch. 29; Jordan Ch. 21

Housekeeping: Homework 2 due March 16, 12:00 noon (extended). Midway Project Report due March 23, 12:00 noon.

The standard recipe (illustrated with part-of-speech tagging of sentences such as "time flies like an arrow" with tags n v p d n, shown as several sampled taggings):
1. Data: $\mathcal{D} = \{x^{(n)}\}_{n=1}^N$
2. Model: $p(x \mid \theta) = \frac{1}{Z(\theta)} \prod_{C \in \mathcal{C}} \psi_C(x_C)$
3. Objective: $\ell(\theta; \mathcal{D}) = \sum_{n=1}^N \log p(x^{(n)} \mid \theta)$
4. Learning: $\theta^* = \operatorname{argmax}_\theta \ell(\theta; \mathcal{D})$
5. Inference: (1) marginal inference, $p(x_C) = \sum_{x' : x'_C = x_C} p(x' \mid \theta)$; (2) the partition function, $Z(\theta) = \sum_x \prod_{C \in \mathcal{C}} \psi_C(x_C)$; (3) MAP inference, $\hat{x} = \operatorname{argmax}_x p(x \mid \theta)$

It's Pi Day 2016, so let's compute π.

Slide from Ian Murray. Properties of the Monte Carlo estimator:
$\int f(x) P(x)\,dx \approx \hat{f} \equiv \frac{1}{S} \sum_{s=1}^S f(x^{(s)}), \quad x^{(s)} \sim P(x)$
The estimator is unbiased:
$\mathbb{E}_{P(\{x^{(s)}\})}[\hat{f}] = \frac{1}{S} \sum_{s=1}^S \mathbb{E}_{P(x)}[f(x)] = \mathbb{E}_{P(x)}[f(x)]$
The variance shrinks like $1/S$:
$\operatorname{var}_{P(\{x^{(s)}\})}[\hat{f}] = \frac{1}{S^2} \sum_{s=1}^S \operatorname{var}_{P(x)}[f(x)] = \operatorname{var}_{P(x)}[f(x)] / S$
so error bars shrink like $1/\sqrt{S}$.

Slide from Ian Murray. A dumb approximation of π:
$P(x, y) = \begin{cases} 1 & 0 < x < 1 \text{ and } 0 < y < 1 \\ 0 & \text{otherwise} \end{cases}$
$\pi = 4 \int \mathbb{I}\big((x^2 + y^2) < 1\big)\, P(x, y)\,dx\,dy$
octave:1> S=12; a=rand(S,2); 4*mean(sum(a.*a,2)<1)
ans = 3.3333
octave:2> S=1e7; a=rand(S,2); 4*mean(sum(a.*a,2)<1)
ans = 3.1418

Slide from Ian Murray. Aside: don't always sample! "Monte Carlo is an extremely bad method; it should be used only when all alternative methods are worse." (Alan Sokal, 1996). Example: numerical solutions to (nice) 1D integrals are fast:
octave:1> 4 * quadl(@(x) sqrt(1-x.^2), 0, 1, tolerance)
This gives π to 6 decimal places in 108 evaluations, and machine precision in 2598. (NB: Matlab's quadl fails at zero tolerance.)

Slide from Ian Murray. Sampling from distributions: draw points $x^{(1)}, x^{(2)}, \ldots$ uniformly under the curve $P(x)$; the probability mass to the left of each point is Uniform[0,1].

Slide from Ian Murray. Sampling from distributions: how to convert samples from a Uniform[0,1] generator. Define the cumulative distribution
$h(y) = \int_{-\infty}^{y} p(y')\,dy'$
Draw the mass to the left of a point, $u \sim \text{Uniform}[0,1]$, then sample $y(u) = h^{-1}(u)$. (Figure from PRML, Bishop (2006).) Although we can't always compute and invert $h(y)$.
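To make the transformation concrete, here is a minimal Python/NumPy sketch (not from the slides) for a case where $h(y)$ does invert in closed form: an Exponential(λ) target, with the rate λ = 2 an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

lam = 2.0                      # rate of an Exponential(lam) target (illustrative choice)
u = rng.uniform(size=100_000)  # u ~ Uniform[0, 1]

# h(y) = 1 - exp(-lam * y), so y = h^{-1}(u) = -log(1 - u) / lam
y = -np.log(1.0 - u) / lam

print(y.mean())  # ~ 1/lam = 0.5, the mean of Exponential(lam)
```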

Slide from Ian Murray. Rejection sampling: sampling underneath a $\tilde{P}(x) \propto P(x)$ curve is also valid. Draw underneath a simple curve $k \tilde{Q}(x) \ge \tilde{P}(x)$: draw $x \sim Q(x)$ and a height $u \sim \text{Uniform}[0, k \tilde{Q}(x)]$; discard the point if it lies above $\tilde{P}$, i.e. if $u > \tilde{P}(x)$.
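A minimal sketch of this procedure in Python/NumPy (not from the slides), assuming an unnormalized standard-normal target $\tilde{P}(x) = e^{-x^2/2}$, a standard Cauchy proposal, and $k = 4$, which satisfies $k\,q(x) \ge \tilde{P}(x)$ everywhere since $\max_x \tilde{P}(x)/q(x) = 2\pi e^{-1/2} \approx 3.8$.

```python
import numpy as np

rng = np.random.default_rng(0)

P_tilde = lambda x: np.exp(-0.5 * x**2)     # unnormalized N(0, 1) target
q = lambda x: 1.0 / (np.pi * (1.0 + x**2))  # standard Cauchy proposal density
k = 4.0                                     # k * q(x) >= P_tilde(x) for all x

x = rng.standard_cauchy(100_000)            # draw x ~ Q
u = rng.uniform(0.0, k * q(x))              # draw height u ~ Uniform[0, k q(x)]
samples = x[u <= P_tilde(x)]                # keep only points under the target curve

print(len(samples) / len(x))                # acceptance rate ~ Z_P / k = sqrt(2 pi)/4 ~ 0.63
print(samples.mean(), samples.std())        # ~ 0 and ~ 1
```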

Slide from Ian Murray. Importance sampling: computing $P(x)$ and $Q(x)$ and then throwing $x$ away seems wasteful. Instead rewrite the integral as an expectation under $Q$:
$\int f(x) P(x)\,dx = \int f(x) \frac{P(x)}{Q(x)} Q(x)\,dx \qquad (Q(x) > 0 \text{ wherever } P(x) > 0)$
$\approx \frac{1}{S} \sum_{s=1}^S f(x^{(s)}) \frac{P(x^{(s)})}{Q(x^{(s)})}, \quad x^{(s)} \sim Q(x)$
This is just simple Monte Carlo again, so it is unbiased. Importance sampling applies even when the integral is not an expectation: divide and multiply any integrand by a convenient distribution.

Slide from Ian Murray. Importance sampling (2): the previous slide assumed we could evaluate $P(x) = \tilde{P}(x)/Z_P$. Otherwise,
$\int f(x) P(x)\,dx \approx \frac{Z_Q}{Z_P} \cdot \frac{1}{S} \sum_{s=1}^S f(x^{(s)}) \frac{\tilde{P}(x^{(s)})}{\tilde{Q}(x^{(s)})}, \quad x^{(s)} \sim Q(x)$
Writing $r^{(s)} = \tilde{P}(x^{(s)}) / \tilde{Q}(x^{(s)})$ and $w^{(s)} = r^{(s)} / \sum_{s'} r^{(s')}$, this becomes
$\approx \sum_{s=1}^S f(x^{(s)})\, w^{(s)}$
This estimator is consistent but biased. Exercise: prove that $Z_P / Z_Q \approx \frac{1}{S} \sum_s r^{(s)}$.
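A sketch of the self-normalized estimator in Python/NumPy (not from the slides), with an assumed unnormalized $\mathcal{N}(0,1)$ target and unnormalized $\mathcal{N}(0,4)$ proposal, so that $Z_P/Z_Q = 1/2$ exactly and the exercise above can be checked numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

P_tilde = lambda x: np.exp(-0.5 * x**2)    # unnormalized N(0, 1), Z_P = sqrt(2 pi)
Q_tilde = lambda x: np.exp(-0.125 * x**2)  # unnormalized N(0, 4), Z_Q = 2 sqrt(2 pi)
f = lambda x: x**2                         # estimate E_P[x^2] = 1

x = 2.0 * rng.standard_normal(100_000)     # x ~ Q = N(0, 4)
r = P_tilde(x) / Q_tilde(x)                # unnormalized weights r^(s)
w = r / r.sum()                            # self-normalized weights w^(s)

print(np.sum(f(x) * w))  # ~ 1 (consistent but biased estimator)
print(r.mean())          # ~ Z_P / Z_Q = 0.5 (the exercise above)
```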

Slide from Ian Murray. Summary so far: sums and integrals, often expectations, occur frequently in statistics; Monte Carlo approximates expectations with a sample average; rejection sampling draws samples from complex distributions; importance sampling applies Monte Carlo to any sum/integral.

Slide from Ian Murray. Application to large problems: pitfalls of Monte Carlo. Rejection and importance sampling scale badly with dimensionality. Example: $P(x) = \mathcal{N}(0, I)$, $Q(x) = \mathcal{N}(0, \sigma^2 I)$. Rejection sampling requires $\sigma \ge 1$, and the fraction of proposals accepted is $\sigma^{-D}$. Importance sampling has variance of importance weights $\left(\frac{\sigma^2}{2 - 1/\sigma^2}\right)^{D/2} - 1$, which is infinite or undefined if $\sigma \le 1/\sqrt{2}$.
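The $\sigma^{-D}$ acceptance rate can be checked empirically. Here is a sketch (not from the slides) that uses the optimal constant $k = \sigma^D$, for which $P(x)/(k\,Q(x)) = \exp\!\big(-\tfrac{\|x\|^2}{2}(1 - 1/\sigma^2)\big) \le 1$; the value σ = 1.5 is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5  # proposal std; must be >= 1 for k Q(x) to dominate P(x)

for D in (1, 5, 10, 20):
    x = sigma * rng.standard_normal((200_000, D))  # x ~ Q = N(0, sigma^2 I)
    sq = np.sum(x**2, axis=1)
    # with k = sigma^D:  P(x) / (k Q(x)) = exp(-sq/2 + sq/(2 sigma^2))
    p_accept = np.exp(-0.5 * sq * (1.0 - 1.0 / sigma**2))
    accepted = rng.uniform(size=len(x)) < p_accept
    print(f"D={D:2d}  empirical={accepted.mean():.4f}  sigma^-D={sigma**-D:.4f}")
```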

Outline
- Review: Monte Carlo
- MCMC (Basic Methods): Metropolis algorithm, Metropolis-Hastings (M-H) algorithm, Gibbs sampling
- Markov Chains: transition probabilities, invariant distribution, equilibrium distribution, Markov chain as a WFSM, constructing Markov chains, why does M-H work?
- MCMC (Auxiliary Variable Methods): slice sampling, Hamiltonian Monte Carlo

MCMC (BASIC METHODS): Metropolis, Metropolis-Hastings, Gibbs Sampling

MCMC Goal: draw approximate, correlated samples from a target distribution p(x). MCMC performs a biased random walk to explore the distribution.

Simulations of MCMC. Visualization of Metropolis-Hastings, Gibbs sampling, and Hamiltonian MCMC: http://twiecki.github.io/blog/2014/01/02/visualizing-mcmc/

Whiteboard: Metropolis algorithm; Metropolis-Hastings algorithm; Gibbs sampling.
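The algorithms themselves were developed on the whiteboard and are not transcribed; as a concrete reference point, here is a minimal random-walk Metropolis sketch in Python/NumPy (my illustration, not the lecture's code), assuming a 1D standard-normal target known only up to its normalizer.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p_tilde(x):
    """Unnormalized log target: here exp(-x^2/2), i.e. N(0, 1) up to Z."""
    return -0.5 * x**2

def metropolis(n_samples, eps=1.0, x0=0.0):
    x = x0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        x_prop = x + eps * rng.standard_normal()      # symmetric Gaussian proposal
        log_a = log_p_tilde(x_prop) - log_p_tilde(x)  # M-H ratio (q cancels by symmetry)
        if np.log(rng.uniform()) < log_a:             # accept with prob min(1, a)
            x = x_prop
        samples[t] = x                                # on rejection, repeat the old state
    return samples

s = metropolis(50_000)
print(s.mean(), s.std())  # ~ 0 and ~ 1
```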

Gibbs Sampling (figure, three frames): starting from $x^{(t)}$, sample $x_1 \sim p(x_1 \mid x_2^{(t)})$ to reach $x^{(t+1)}$, then $x_2 \sim p(x_2 \mid x_1^{(t+1)})$ to reach $x^{(t+2)}$, and so on through $x^{(t+3)}, x^{(t+4)}$; each transition moves along a single axis of $p(x)$.

Figure from Jordan Ch. 21. Gibbs Sampling: full conditionals only need to condition on the Markov blanket (illustrated for an MRF and a Bayes net). It must be easy to sample from the conditionals; many conditionals are log-concave and are amenable to adaptive rejection sampling.
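A minimal Gibbs sampler sketch (not from the slides) for a case where the full conditionals are available in closed form: a zero-mean bivariate Gaussian with correlation ρ, whose conditionals are $x_1 \mid x_2 \sim \mathcal{N}(\rho x_2,\, 1-\rho^2)$ and symmetrically for $x_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8  # correlation of the target bivariate Gaussian (illustrative choice)

n = 50_000
x1, x2 = 0.0, 0.0
samples = np.empty((n, 2))
for t in range(n):
    # full conditionals of N(0, [[1, rho], [rho, 1]]):
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal()  # x1 | x2
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal()  # x2 | x1
    samples[t] = (x1, x2)

print(np.corrcoef(samples.T)[0, 1])  # ~ rho
```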

Whiteboard: Gibbs sampling as M-H.

Figure from Bishop (2006). Random walk behavior of M-H: a generic proposal distribution for Metropolis-Hastings is $q(x' \mid x^{(t)}) = \mathcal{N}(x^{(t)}, \sigma^2)$. If σ is large, there are many rejections; if σ is small, mixing is slow. (The figure shows a target $p(x)$ whose smallest and largest length scales are $\sigma_{\min}$ and $\sigma_{\max}$.)

Figure from Bishop (2006). Random walk behavior of M-H: for rejection sampling, the accepted samples are independent, but for Metropolis-Hastings the samples are correlated. Question: how long must we wait to get effectively independent samples? A: independent states in the M-H random walk are separated by roughly $(\sigma_{\max}/\sigma_{\min})^2$ steps.

MARKOV CHAINS: Definitions and Theoretical Justification for MCMC

Whiteboard: Markov chains; transition probabilities; invariant distribution; equilibrium distribution; sufficient conditions for MCMC; Markov chain as a WFSM.

Detailed Balance:
$S(x' \leftarrow x)\, p(x) = S(x \leftarrow x')\, p(x')$
Detailed balance means that, for each pair of states x and x', arriving at x then x' and arriving at x' then x are equiprobable.
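Detailed balance is easy to verify numerically for a discrete chain. A sketch (not from the slides): build the Metropolis-Hastings transition matrix for an arbitrary 4-state target with a uniform proposal, then check that the probability flow $p(x)\,S(x' \leftarrow x)$ is a symmetric matrix.

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])  # target distribution on 4 states (illustrative)
K = len(p)
Q = np.full((K, K), 1.0 / K)        # uniform proposal q(x' | x)

# Metropolis-Hastings transition matrix S[x, x']
S = np.zeros((K, K))
for x in range(K):
    for xp in range(K):
        if xp != x:
            a = min(1.0, (p[xp] * Q[xp, x]) / (p[x] * Q[x, xp]))  # acceptance prob
            S[x, xp] = Q[x, xp] * a
    S[x, x] = 1.0 - S[x].sum()      # rejected moves stay put

flow = p[:, None] * S               # flow[x, x'] = p(x) S(x' <- x)
print(np.allclose(flow, flow.T))    # True: detailed balance holds
print(np.linalg.matrix_power(S, 100)[0])  # rows converge to p = [0.1 0.2 0.3 0.4]
```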

Whiteboard: simple Markov chain example; constructing Markov chains; transition probabilities for MCMC.

Practical Issues. Question: Is it better to move along one dimension or many? Answer: For Metropolis-Hastings, it is sometimes better to sample one dimension at a time. (Q: Given a sequence of 1D proposals, compare the rate of movement for one-at-a-time vs. concatenation.) For Gibbs sampling, it is sometimes better to sample a block of variables at a time. (Q: When is it tractable to sample a block of variables?)

Practical Issues. Question: How do we assess convergence of the Markov chain? Answer: It's not easy! Compare statistics of multiple independent chains, e.g. plot the log-likelihood of each chain against the number of MCMC steps and compare the traces (one concrete version of this idea is sketched below).

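One standard way to formalize "compare statistics of multiple independent chains" is the Gelman-Rubin potential scale reduction statistic $\hat{R}$; this is not on the slide, so the sketch below is an added illustration under that assumption.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction R-hat for an array of chains, shape (M, N)."""
    M, N = chains.shape
    means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()  # average within-chain variance
    B = N * means.var(ddof=1)              # between-chain variance
    var_hat = (N - 1) / N * W + B / N      # pooled variance estimate
    return np.sqrt(var_hat / W)            # ~ 1 when the chains agree

rng = np.random.default_rng(0)
mixed = rng.standard_normal((4, 1000))              # 4 chains sampling the same target
stuck = mixed + np.array([[0.], [0.], [0.], [3.]])  # one chain stuck in another mode
print(gelman_rubin(mixed))  # ~ 1.0
print(gelman_rubin(stuck))  # > 1, signals non-convergence
```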

Practical Issues. Question: Is one long Markov chain better than many short ones? Note: it is typical to discard the initial samples (a.k.a. "burn-in"), since the chain might not yet have mixed. Answer: Often a balance is best. Compared to one long chain, we get more independent samples; compared to many small chains, fewer samples are discarded for burn-in, we can still parallelize, and multiple chains let us assess mixing by comparing them.

Whiteboard: blocked Gibbs sampling.

MCMC (AUXILIARY VARIABLE METHODS): Slice Sampling, Hamiltonian Monte Carlo

Slide from Ian Murray. Auxiliary variables: the point of MCMC is to marginalize out variables, but one can also introduce more variables:
$\int f(x) P(x)\,dx = \int f(x) P(x, v)\,dx\,dv \approx \frac{1}{S} \sum_{s=1}^S f(x^{(s)}), \quad (x^{(s)}, v^{(s)}) \sim P(x, v)$
We might want to do this if $P(x \mid v)$ and $P(v \mid x)$ are simple, or if $P(x, v)$ is otherwise easier to navigate.

Slice Sampling. Motivation: we want samples from p(x) but don't know the normalizer Z, and choosing a proposal at the correct scale is difficult. Properties: similar to Gibbs sampling (one-dimensional transitions in the state space); similar to rejection sampling ((asymptotically) draws samples from the region under the curve $\tilde{p}(x)$); an MCMC method with an adaptive proposal.

Slide from Ian Murray. Slice sampling idea: sample a point (x, u) uniformly under the curve $\tilde{P}(x)$. This is just an auxiliary-variable Gibbs sampler:
$p(u \mid x) = \text{Uniform}[0, \tilde{P}(x)]$
$p(x \mid u) \propto \begin{cases} 1 & \tilde{P}(x) \ge u \\ 0 & \text{otherwise} \end{cases} = \text{Uniform on the slice}$
Problem: sampling from the conditional $p(x \mid u)$ might be infeasible.

Figure adapted from MacKay Ch. 29: slice sampling illustration (three animation frames).

Slice Sampling Algorithm. Goal: sample $(x, u)$ given $(x^{(t)}, u^{(t)})$.
$u \sim \text{Uniform}(0, \tilde{p}(x^{(t)}))$
Part 1: Stepping Out. Sample an interval $(x_l, x_r)$ enclosing $x^{(t)}$:
$r \sim \text{Uniform}(0, w)$, $\quad (x_l, x_r) = (x^{(t)} - r,\; x^{(t)} + w - r)$
Expand until the endpoints are outside the region under the curve:
while $\tilde{p}(x_l) > u$: $x_l = x_l - w$
while $\tilde{p}(x_r) > u$: $x_r = x_r + w$
Part 2: Sample x (Shrinking). Draw x from within the interval $(x_l, x_r)$, then accept or shrink:
while true:
  $x \sim \text{Uniform}(x_l, x_r)$
  if $\tilde{p}(x) > u$: break
  else if $x > x^{(t)}$: $x_r = x$
  else: $x_l = x$
$x^{(t+1)} = x$, $u^{(t+1)} = u$

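A direct Python/NumPy transcription of the algorithm above (a sketch, not the lecture's code), tested on an unnormalized standard-normal target.

```python
import numpy as np

rng = np.random.default_rng(0)

def slice_sample(p_tilde, x0, w=1.0, n_samples=10_000):
    """1D slice sampling with stepping out and shrinking (MacKay Ch. 29)."""
    x = x0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        u = rng.uniform(0.0, p_tilde(x))  # auxiliary height under the curve
        r = rng.uniform(0.0, w)           # randomly place the initial interval
        xl, xr = x - r, x + (w - r)
        while p_tilde(xl) > u:            # step out until outside the slice
            xl -= w
        while p_tilde(xr) > u:
            xr += w
        while True:                       # shrink until a point is accepted
            x_new = rng.uniform(xl, xr)
            if p_tilde(x_new) > u:
                break
            elif x_new > x:
                xr = x_new
            else:
                xl = x_new
        x = x_new
        samples[t] = x
    return samples

s = slice_sample(lambda x: np.exp(-0.5 * x**2), x0=0.0)
print(s.mean(), s.std())  # ~ 0 and ~ 1
```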

Slice Sampling for multivariate distributions: resample each variable $x_i$ one at a time (just like Gibbs sampling). This does not require sampling from $p(x_i \mid \{x_j\}_{j \ne i})$; we only need to evaluate a quantity proportional to the conditional, $\tilde{p}(x_i \mid \{x_j\}_{j \ne i}) \propto p(x_i \mid \{x_j\}_{j \ne i})$.

Hamiltonian Monte Carlo. Suppose we have a distribution of the form $p(x) = \exp\{-E(x)\}/Z$, where $x \in \mathbb{R}^N$. We could use random-walk M-H to draw samples, but it seems a shame to discard the gradient information $\nabla_x E(x)$. If we can evaluate it, the gradient tells us where to look for high-probability regions!

Background: Hamiltonian Dynamics. Applications: following the motion of atoms in a fluid through time (molecular dynamics); integrating the motion of a solar system over time and considering the evolution of a galaxy, i.e. the motion of its stars (N-body simulations). Properties: the total energy of the system H(x, p) stays constant, and the dynamics are reversible, which is important for detailed balance.

Background: Hamiltonian Dynamics. Let $x \in \mathbb{R}^N$ be a position and $p \in \mathbb{R}^N$ be a momentum. Potential energy: $E(x)$. Kinetic energy: $K(p) = p^T p / 2$. Total energy (the Hamiltonian function): $H(x, p) = E(x) + K(p)$. Given a starting position $x^{(1)}$ and a starting momentum $p^{(1)}$, we can simulate the Hamiltonian dynamics of the system via: 1. Euler's method, 2. the leapfrog method, 3. etc.

Background: Hamiltonian Dynamics. Parameters to tune: 1. the step size ϵ, 2. the number of iterations L. Leapfrog algorithm:
for ℓ in 1…L:
  $p = p - \frac{\epsilon}{2} \nabla_x E(x)$
  $x = x + \epsilon\, p$
  $p = p - \frac{\epsilon}{2} \nabla_x E(x)$
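A sketch of the leapfrog integrator in Python/NumPy (not from the slides), checking on a quadratic potential $E(x) = x^2/2$ that the total energy $H$ is approximately conserved.

```python
import numpy as np

def leapfrog(x, p, grad_E, eps, L):
    """Run L leapfrog steps for H(x, p) = E(x) + p.T p / 2."""
    x, p = x.copy(), p.copy()
    for _ in range(L):
        p -= 0.5 * eps * grad_E(x)  # half step in momentum
        x += eps * p                # full step in position
        p -= 0.5 * eps * grad_E(x)  # half step in momentum
    return x, p

# E(x) = x^2 / 2 (standard-normal potential), so grad_E(x) = x
E = lambda x: 0.5 * np.sum(x**2)
grad_E = lambda x: x
H = lambda x, p: E(x) + 0.5 * np.sum(p**2)

x0, p0 = np.array([1.0]), np.array([1.0])
x1, p1 = leapfrog(x0, p0, grad_E, eps=0.3, L=20)
print(H(x0, p0), H(x1, p1))  # nearly equal: leapfrog approximately conserves H
```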

Figure from Neal (2011). Background: Hamiltonian Dynamics — phase-space trajectories (momentum p vs. position q) for (a) Euler's method, stepsize 0.3; (b) modified Euler's method, stepsize 0.3; (c) leapfrog method, stepsize 0.3; (d) leapfrog method, stepsize 1.2.

Hamiltonian Monte Carlo: Preliminaries. Goal: sample $p(x) = \exp\{-E(x)\}/Z$, where $x \in \mathbb{R}^N$. Define $K(p) = p^T p / 2$, $H(x, p) = E(x) + K(p)$, and the joint
$p(x, p) = \exp\{-H(x, p)\}/Z_H = \exp\{-E(x)\} \exp\{-K(p)\}/Z_H$
Note: since $p(x, p)$ is separable,
$\sum_p p(x, p) = \exp\{-E(x)\}/Z$ (the target distribution), and
$\sum_x p(x, p) = \exp\{-K(p)\}/Z_K$ (a Gaussian).

Whiteboard: Hamiltonian Monte Carlo algorithm (a.k.a. Hybrid Monte Carlo).
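Combining the pieces, here is a minimal HMC sketch in Python/NumPy (my illustration of the whiteboard algorithm, not the lecture's code): resample the momentum, run a leapfrog trajectory, and apply a Metropolis accept/reject on the change in $H$. The correlated 2D Gaussian target and the settings eps = 0.2, L = 20 are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def hmc(E, grad_E, x0, eps=0.2, L=20, n_samples=5_000):
    """Hamiltonian Monte Carlo with a leapfrog integrator and M-H correction."""
    x = np.array(x0, dtype=float)
    samples = np.empty((n_samples, len(x)))
    for t in range(n_samples):
        p = rng.standard_normal(len(x))  # resample momentum ~ N(0, I)
        x_new, p_new = x.copy(), p.copy()
        for _ in range(L):               # leapfrog trajectory
            p_new -= 0.5 * eps * grad_E(x_new)
            x_new += eps * p_new
            p_new -= 0.5 * eps * grad_E(x_new)
        # accept with prob min(1, exp(-(H(x_new, p_new) - H(x, p))))
        dH = (E(x_new) + 0.5 * p_new @ p_new) - (E(x) + 0.5 * p @ p)
        if np.log(rng.uniform()) < -dH:
            x = x_new
        samples[t] = x
    return samples

# target: 2D Gaussian with correlation 0.95, p(x) = exp(-E(x)) / Z
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.95], [0.95, 1.0]]))
E = lambda x: 0.5 * x @ Sigma_inv @ x
grad_E = lambda x: Sigma_inv @ x

s = hmc(E, grad_E, x0=[0.0, 0.0])
print(np.cov(s.T))  # ~ [[1, 0.95], [0.95, 1]]
```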

Figure from Neal (2011). Hamiltonian Monte Carlo: an HMC trajectory shown in position coordinates and in momentum coordinates, together with the value of the Hamiltonian along the trajectory, which stays nearly constant.

Figure from Neal (2011). M-H vs. HMC: samples from random-walk Metropolis (left) and Hamiltonian Monte Carlo (right) on the same 2D distribution.

Simulations of MCMC. Visualization of Metropolis-Hastings, Gibbs sampling, and Hamiltonian MCMC: http://twiecki.github.io/blog/2014/01/02/visualizing-mcmc/

Slide adapted from Daphne Koller. MCMC Summary. Pros: very general purpose; often easy to implement; good theoretical guarantees as $t \to \infty$. Cons: lots of tunable parameters / design choices; can be quite slow to converge; difficult to tell whether it's working.