Monte Carlo Methods. Geoff Gordon (ggordon@cs.cmu.edu). February 9, 2006

Numerical integration problem. [Figure: surface plot of a function f(x, y) over the unit square.] Goal: compute $\int_{x \in X} f(x)\,dx$.

Used for: function approximation. $f(x) \approx \alpha_1 f_1(x) + \alpha_2 f_2(x) + \dots$ Orthonormal system: $\int f_i(x)^2\,dx = 1$ and $\int f_i(x) f_j(x)\,dx = 0$ for $i \neq j$. Examples: Fourier ($\sin x, \cos x, \dots$), Chebyshev ($1,\ x,\ 2x^2 - 1,\ 4x^3 - 3x, \dots$), etc. Coefficients are $\alpha_i = \int f(x) f_i(x)\,dx$.

Used for: optimization. Optimization problem: minimize $T(x)$ for $x \in X$. Assume a unique global optimum $x^*$. Define the Gibbs distribution with temperature $1/\beta$, for $\beta > 0$: $P_\beta(x) = \frac{1}{Z(\beta)} \exp(-\beta T(x))$. As $\beta \to \infty$, we have $E_{x \sim P_\beta}(x) \to x^*$. Simulated annealing: track $E_\beta(x) = \int x\, P_\beta(x)\,dx$ as $\beta \to \infty$.
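As a concrete, purely illustrative sketch of this idea (not from the slides), the following Python snippet runs Metropolis-style moves on $P_\beta$ while increasing $\beta$; the objective function, step size, and annealing schedule are all assumptions.

```python
import numpy as np

def simulated_annealing(T_func, x0, sigma, betas, rng=None):
    """Metropolis moves under P_beta(x) proportional to exp(-beta * T(x));
    beta increases over time so the samples concentrate near the global minimum."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    for beta in betas:
        for _ in range(100):                       # a few moves per temperature
            x_prop = x + rng.normal(0.0, sigma)    # local tweak of x
            delta = T_func(x_prop) - T_func(x)
            # accept downhill moves always, uphill moves with prob exp(-beta*delta)
            if delta < 0 or rng.uniform() < np.exp(-beta * delta):
                x = x_prop
    return x

# Assumed toy objective: a quadratic pulled toward x = 2 plus a sinusoid
# that adds local minima.
T = lambda x: 0.1 * (x - 2.0) ** 2 + np.sin(3.0 * x)
print(simulated_annealing(T, x0=-3.0, sigma=0.5, betas=np.linspace(0.1, 20.0, 40)))
```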

Used for: Bayes net inference. Undirected graphical model on $x = (x_1, x_2, \dots)$: $P(x) = \frac{1}{Z} \prod_j \phi_j(x)$. Typical inference problem: compute $E(x_i)$. Belief propagation is fast if the argument lists of the $\phi_j$ are small and form a junction tree; if not, use MCMC.

Used for: SLAM

Used for: image segmentation; tracking radar/sonar returns.

Outline: uniform sampling and importance sampling; MCMC and the Metropolis-Hastings algorithm; what to do if $f(x)$ has internal structure (SIS and SIR, i.e., the particle filter); the Gibbs sampler; combining SIR with MCMC.

A one-dimensional problem. [Figure: a sharply peaked one-dimensional function plotted on the interval $[-1, 1]$.]

Uniform sampling. [Figure: the same function with 30 uniformly drawn samples.] True integral 24.0; uniform sampling gives 14.7 with 30 samples.

Uniform sampling. Pick an $x$ uniformly at random from $X$. Then $E(f(x)) = \int P(x) f(x)\,dx = \frac{1}{V} \int f(x)\,dx$, where $V$ is the volume of $X$. So $E(V f(x))$ = desired integral. But the variance can be big (especially if $V$ is large).

Uniform sampling. Do it a bunch of times: pick $x_1, \dots, x_n$, compute $\frac{V}{n} \sum_{i=1}^n f(x_i)$. Same expectation, lower variance. The variance decreases as $1/n$ (standard deviation as $1/\sqrt{n}$). Not all that fast; a limitation of most MC methods.
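For concreteness, here is a minimal uniform-sampling sketch in Python/NumPy (not from the slides); the integrand and interval are assumptions chosen to mimic the sharply peaked 1-D example.

```python
import numpy as np

def uniform_mc_integral(f, lo, hi, n, rng=None):
    """Estimate the integral of f over [lo, hi] by uniform sampling."""
    rng = np.random.default_rng() if rng is None else rng
    xs = rng.uniform(lo, hi, size=n)   # x_i drawn uniformly from X
    volume = hi - lo                   # V = volume of X
    return volume * np.mean(f(xs))     # E[V f(x)] equals the desired integral

# Assumed integrand: a narrow bump, so uniform sampling is noisy.
f = lambda x: 60.0 * np.exp(-50.0 * x**2)
print(uniform_mc_integral(f, -1.0, 1.0, n=30))     # high variance with 30 samples
print(uniform_mc_integral(f, -1.0, 1.0, n=30000))  # variance shrinks like 1/n
```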

Nonuniform sampling. [Figure: the same function with samples drawn from a Gaussian proposal concentrated near the peak.] True integral 24.0; importance sampling with $Q = N(0, 0.25^2)$ gives 25.8.

Importance sampling. Suppose we pick $x$ nonuniformly, $x \sim Q(x)$; $Q(x)$ is the importance distribution. Use $Q$ to (approximately) pick out areas where $f$ is large. But $E_Q(f(x)) = \int Q(x) f(x)\,dx$, which is not what we want.

Importance sampling. Define $g(x) = f(x)/Q(x)$. Now $E_Q(g(x)) = \int Q(x) g(x)\,dx = \int Q(x) f(x)/Q(x)\,dx = \int f(x)\,dx$.

Importance sampling. So, sample $x_i$ from $Q$ and take the average of the $g(x_i)$: $\frac{1}{n} \sum_i f(x_i)/Q(x_i)$. Here $w_i = 1/Q(x_i)$ is the importance weight. Uniform sampling is just importance sampling with $Q$ uniform, i.e., $Q(x) = 1/V$.
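A hedged sketch of this estimator, again in NumPy; the Gaussian proposal $N(0, 0.25^2)$ matches the slide's 1-D example, but the integrand itself is an assumption.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def importance_sampling_integral(f, n, mu=0.0, sigma=0.25, rng=None):
    """Estimate the integral of f using a Gaussian proposal Q = N(mu, sigma^2)."""
    rng = np.random.default_rng() if rng is None else rng
    xs = rng.normal(mu, sigma, size=n)        # x_i ~ Q
    g = f(xs) / gaussian_pdf(xs, mu, sigma)   # g(x_i) = f(x_i) / Q(x_i)
    return np.mean(g)                         # the IS estimate of the integral

f = lambda x: 60.0 * np.exp(-50.0 * x**2)     # assumed 1-D integrand
print(importance_sampling_integral(f, n=30))  # lower variance than uniform sampling
```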

Parallel importance sampling. Suppose $f(x) = P(x)\,g(x)$. The desired integral is $\int f(x)\,dx = E_P(g(x))$. But suppose we only know $g(x)$ and $\lambda P(x)$, i.e., $P$ only up to an unknown constant $\lambda$.

Parallel importance sampling. Pick $n$ samples $x_i$ from a proposal $Q(x)$. If we could compute the importance weights $w_i = P(x_i)/Q(x_i)$, then $E_Q[w_i\, g(x_i)] = \int Q(x) \frac{P(x)}{Q(x)} g(x)\,dx = \int f(x)\,dx$, so $\frac{1}{n} \sum_i w_i\, g(x_i)$ would be our IS estimate.

Parallel importance sampling. Assign raw importance weights $\hat{w}_i = \lambda P(x_i)/Q(x_i)$. Then $E(\hat{w}_i) = \int Q(x)\,(\lambda P(x)/Q(x))\,dx = \lambda \int P(x)\,dx = \lambda$. So $\hat{w}_i$ is an unbiased estimate of $\lambda$. Define $\bar{w} = \frac{1}{n} \sum_i \hat{w}_i$: also unbiased, but lower variance.

Parallel importance sampling. $\hat{w}_i / \bar{w}$ is approximately $w_i$, but computed without knowing $\lambda$. So, make the estimate $\int f(x)\,dx \approx \frac{1}{n} \sum_i \frac{\hat{w}_i}{\bar{w}}\, g(x_i)$.
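A sketch of this self-normalized (parallel) IS estimate, assuming we can evaluate only the unnormalized density $\lambda P(x)$ and $g(x)$; the target, proposal, and function names below are invented for illustration.

```python
import numpy as np

def parallel_is_estimate(g, unnorm_p, q_sample, q_pdf, n, rng=None):
    """Estimate E_P[g(x)] when P is known only up to a constant (unnorm_p = lambda*P)."""
    rng = np.random.default_rng() if rng is None else rng
    xs = q_sample(rng, n)                      # x_i ~ Q
    w_hat = unnorm_p(xs) / q_pdf(xs)           # raw weights lambda * P(x_i) / Q(x_i)
    w_bar = np.mean(w_hat)                     # unbiased estimate of lambda
    return np.mean((w_hat / w_bar) * g(xs))    # biased, but bias -> 0 as n grows

# Assumed example: unnormalized standard Gaussian target, uniform proposal on [-5, 5]
# (the proposal's support covers essentially all of the target's mass).
unnorm_p = lambda x: np.exp(-0.5 * x**2)       # lambda * N(0, 1) density
g = lambda x: x**2                             # E_P[x^2] should come out near 1
q_sample = lambda rng, n: rng.uniform(-5.0, 5.0, size=n)
q_pdf = lambda x: np.full_like(x, 1.0 / 10.0)
print(parallel_is_estimate(g, unnorm_p, q_sample, q_pdf, n=100000))
```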

Parallel IS bias. [Figure: plot illustrating that $1/\bar{w}$ is a biased estimate of $1/\lambda$.] Parallel IS is biased: $E(\bar{w}) = \lambda$, but $E(1/\bar{w}) \neq 1/\lambda$ in general. The bias $\to 0$ as $n \to \infty$, since the variance of $\bar{w} \to 0$.

Parallel IS example. [Figure: samples from the proposal in the $(x, y)$ plane.] Proposal $Q$: $(x, y) \sim N(1, 1)$, $\theta \sim U(-\pi, \pi)$. Target: $f(x, y, \theta) = Q(x, y, \theta)\, P(o = 0.8 \mid x, y, \theta)/Z$.

Parallel IS example. [Figure: weighted samples approximating the posterior.] Posterior mean $E(x, y, \theta) = (0.496, 0.350, 0.084)$.

Back to n dimensions. Picking a good sampling distribution becomes hard in high dimensions. The major contribution to the integral can be hidden in small areas, and there is a danger of missing those areas, so we need to search for regions of large $f(x)$. Naively, searching could bias our choice of $x$ in strange ways, making it hard to design an unbiased estimator.

Markov chain Monte Carlo. Design a Markov chain $M$ whose moves tend to increase $f(x)$ when it is small. This chain encodes a search strategy: start at an arbitrary $x$ and run the chain for a while to find an $x$ with reasonably high $f(x)$. For an $x$ found by an arbitrary search algorithm, we don't know what importance weight we should use to correct for the search bias. For an $x$ found by $M$ after sufficiently many moves, we can use the stationary distribution of $M$, $P_M(x)$, to compute the importance weight.

Picking $P_M$. MCMC works well if $f(x)/P_M(x)$ has low variance. $f(x) \gg P_M(x)$ means there is a region of comparatively large $f(x)$ that we don't sample enough; $f(x) \ll P_M(x)$ means we waste samples in regions where $f(x) \approx 0$. So, for example, if $f(x) = g(x) P(x)$, we could ask for $P_M = P$.

Metropolis-Hastings. A way of getting a chain $M$ with the desired stationary distribution $P_M$. Basic strategy: start from an arbitrary $x$ and repeatedly tweak $x$ a little to get a proposal $x'$. If $P_M(x')$ is sufficiently large relative to $P_M(x)$, move to $x'$; if it is sufficiently small, stay at $x$; in intermediate cases, randomize.

Proposal distributions. MH has one design choice: how do we tweak $x$ to get $x'$? This is encoded in the one-step proposal distribution $Q(x' \mid x)$. Good proposals explore quickly but remain in regions of high $P_M(x)$. Optimal proposal: $Q(x' \mid x) = P_M(x')$ for all $x$.

Metropolis-Hastings algorithm. The MH transition probability $T_M(x' \mid x)$ is defined as follows: sample $x' \sim Q(x' \mid x)$; compute $p = \frac{P_M(x')\,Q(x \mid x')}{P_M(x)\,Q(x' \mid x)}$; with probability $\min(1, p)$, set $x \leftarrow x'$; repeat. Stop after, say, $t$ steps (possibly fewer than $t$ distinct samples).
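A minimal sketch of this loop in Python, assuming a symmetric Gaussian random-walk proposal (so the $Q$ ratio cancels) and a target known only up to a constant; the target, step size $\sigma$, and chain length are assumptions, not the slides' exact example.

```python
import numpy as np

def metropolis_hastings(log_unnorm_p, x0, sigma, t, rng=None):
    """Random-walk MH: target known up to a constant; proposal Q(x'|x) = N(x, sigma^2)."""
    rng = np.random.default_rng() if rng is None else rng
    x, logp = x0, log_unnorm_p(x0)
    samples = []
    for _ in range(t):
        x_prop = x + rng.normal(0.0, sigma)           # x' ~ Q(x'|x), symmetric
        logp_prop = log_unnorm_p(x_prop)
        # p = P_M(x')/P_M(x): the Q(x|x')/Q(x'|x) ratio is 1 for a symmetric proposal
        if np.log(rng.uniform()) < logp_prop - logp:  # accept with probability min(1, p)
            x, logp = x_prop, logp_prop
        samples.append(x)                             # the chain repeats x when we reject
    return np.array(samples)

# Assumed target: a standard normal (unnormalized), so E[x^2] should come out near 1.
chain = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, sigma=0.25, t=1100)
print(np.mean(chain[100:] ** 2))   # drop a burn-in of 100 steps, as in the slides
```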

Metropolis-Hastings notes. We only need $P_M$ up to a constant factor, which is nice for problems where the normalizing constant is hard to compute. Efficiency is determined by how fast $Q(x' \mid x)$ moves us around and by how high the acceptance probability $p$ is. There is a tension between a fast-moving $Q$ and a high $p$.

Metropolis-Hastings proof. Given $P_M(x)$ and $T_M(x' \mid x)$, we want to show that $P_M$ is a stationary distribution for $T_M$. The proof is based on the detailed balance condition: $P_M(x)\, T_M(x' \mid x) = P_M(x')\, T_M(x \mid x')$ for all $x, x'$. Detailed balance implies $\int P_M(x)\, T_M(x' \mid x)\,dx = \int P_M(x')\, T_M(x \mid x')\,dx = P_M(x') \int T_M(x \mid x')\,dx = P_M(x')$. So, if we can show detailed balance, we are done.

Proving detailed balance. We want to show $P_M(x)\, T_M(x' \mid x) = P_M(x')\, T_M(x \mid x')$ for $x \neq x'$. We have $P_M(x)\, T_M(x' \mid x) = P_M(x)\, Q(x' \mid x) \min\!\left(1, \frac{P_M(x')\, Q(x \mid x')}{P_M(x)\, Q(x' \mid x)}\right)$ and $P_M(x')\, T_M(x \mid x') = P_M(x')\, Q(x \mid x') \min\!\left(1, \frac{P_M(x)\, Q(x' \mid x)}{P_M(x')\, Q(x \mid x')}\right)$. Exactly one of the two min statements chooses 1 (except in the boundary case where both ratios equal 1); w.l.o.g. suppose it's the first.

Detailed balance. With that assumption, $P_M(x)\, T_M(x' \mid x) = P_M(x)\, Q(x' \mid x)$ and $P_M(x')\, T_M(x \mid x') = P_M(x')\, Q(x \mid x') \cdot \frac{P_M(x)\, Q(x' \mid x)}{P_M(x')\, Q(x \mid x')} = P_M(x)\, Q(x' \mid x)$. The two sides are equal, so $P_M$ is the stationary distribution of the Metropolis-Hastings sampler.

Metropolis-Hastings example. [Figure: samples from an MH random walk exploring a two-dimensional target distribution.]

MH example accuracy. True $E(x^2) \approx 0.28$. Using $\sigma = 0.25$ in the proposal leads to an acceptance rate of 55-60%. After 1000 samples minus a burn-in of 100, the final estimates over five runs were 0.282361, 0.271167, 0.322270, 0.306541, and 0.308716.

Structure in f(x). Suppose $f(x) = g(x) P(x)$ as above, and suppose $P(x)$ can be factored, e.g., $P(x) = \frac{1}{Z}\, \phi_{12}(x_1, x_2)\, \phi_{13}(x_1, x_3)\, \phi_{245}(x_2, x_4, x_5) \cdots$ Then we can take advantage of the structure to sample from $P$ efficiently and compute $E_P(g(x))$.

Linear or tree structure. $P(x) = \frac{1}{Z}\, \phi_{12}(x_1, x_2)\, \phi_{23}(x_2, x_3)\, \phi_{34}(x_3, x_4)$. Pick a node arbitrarily as the root, sample a value for the root, sample each child conditional on its parent, and repeat until we have sampled all of $x$.

Sequential importance sampling. Assume a chain graph $x_1 \to \dots \to x_T$ (a tree would be fine too). We want to estimate $E(x_T)$. We can evaluate, but not sample from, $P(x_1)$ and $P(x_{t+1} \mid x_t)$.

Sequential importance sampling. Suppose we have proposals $Q(x_1)$ and $Q(x_{t+1} \mid x_t)$. Sample $x_1 \sim Q(x_1)$ and compute the weight $w_1 = P(x_1)/Q(x_1)$. Sample $x_2 \sim Q(x_2 \mid x_1)$, with weight $w_2 = w_1\, P(x_2 \mid x_1)/Q(x_2 \mid x_1)$. Continue until the last variable $x_T$. The weight $w_T$ at the final step is $P(x)/Q(x)$.
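A toy SIS sketch under assumed Gaussian dynamics: the target chain, the proposal, and the chain length are all invented for illustration, and the quantity estimated is $E(x_T)$ (which is 0 for this zero-mean chain).

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def sis_estimate_xT(T, n, rng=None):
    """SIS for an assumed chain: P(x_1) = N(0, 1), P(x_{t+1} | x_t) = N(0.9 x_t, 1),
    with proposals Q(x_1) = N(0, 2^2) and Q(x_{t+1} | x_t) = N(x_t, 2^2)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(0.0, 2.0, size=n)                      # x_1 ~ Q(x_1), n samples in parallel
    w = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 0.0, 2.0)   # w_1 = P(x_1) / Q(x_1)
    for _ in range(T - 1):
        x_new = rng.normal(x, 2.0)                        # x_{t+1} ~ Q(. | x_t)
        w *= gauss_pdf(x_new, 0.9 * x, 1.0) / gauss_pdf(x_new, x, 2.0)
        x = x_new
    return np.sum(w * x) / n                              # IS estimate of E(x_T)

print(sis_estimate_xT(T=10, n=5000))   # should be near 0; the weight variance is the catch
```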

Problems with SIS. $w_T$ often has really high variance, and we often only know $P(x_{t+1} \mid x_t)$ up to a constant factor. For example, in an HMM after conditioning on the observation $y_{t+1}$: $P(x_{t+1} \mid x_t, y_{t+1}) = \frac{1}{Z}\, P(x_{t+1} \mid x_t)\, P(y_{t+1} \mid x_{t+1})$.

Parallel SIS. Apply the parallel IS trick to SIS: generate $n$ SIS samples $x_i$ with weights $w_i$, then normalize the $w_i$ so that $\sum_i w_i = n$. This gets rid of the problem of having to know the normalized $P$'s, and introduces a bias which $\to 0$ as $n \to \infty$. Still not practical, though: the variance of the $w_i$ is still too high.

Sequential importance resampling. SIR = particle filter; sample = particle. Run SIS, keeping the weights normalized to sum to $n$, and monitor the variance of the weights. If too few particles carry most of the weight, resample to fix it. Resampling reduces the variance of the final estimate, but increases the bias due to normalization.

Resampling. After normalization, suppose a particle has weight $0 < w < 1$. Set its weight to 1 with probability $w$, or to 0 with probability $1 - w$. The expected weight is still $w$, but we can throw the particle away if its weight is 0.

Resampling. A particle with weight $w \geq 1$ gets $\lfloor w \rfloor$ copies for sure, plus one more with probability $w - \lfloor w \rfloor$. The expected total number of particles is $n$, and we can arrange for it to be exactly $n$. High-weight particles are replicated at the expense of low-weight ones.
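A sketch of this resampling rule in Python, assuming the weights have already been normalized to sum to $n$; the function name and the example weights are made up.

```python
import numpy as np

def residual_resample(particles, weights, rng=None):
    """Residual resampling: each particle gets floor(w) copies for sure, plus one
    more with probability w - floor(w). Assumes the weights sum to n."""
    rng = np.random.default_rng() if rng is None else rng
    frac = weights - np.floor(weights)
    extra = (rng.random(len(weights)) < frac).astype(int)
    counts = np.floor(weights).astype(int) + extra
    resampled = np.repeat(particles, counts)      # high-weight particles replicated
    return resampled, np.ones(len(resampled))     # survivors restart with weight 1

rng = np.random.default_rng(0)
particles = rng.normal(size=5)
weights = np.array([3.2, 0.9, 0.5, 0.3, 0.1])     # normalized: sums to n = 5
print(residual_resample(particles, weights, rng)) # expected particle count is 5
```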

SIR example [DC factored filter movie]

Gibbs sampler. Recall $P(x) = \frac{1}{Z}\, \phi_{12}(x_1, x_2)\, \phi_{13}(x_1, x_3)\, \phi_{245}(x_2, x_4, x_5) \cdots$ What if we don't have a nice tree structure?

Gibbs sampler. An MH algorithm for sampling from $P(x)$. Proposal distribution: pick an $i$ at random and resample $x_i$ from its conditional distribution, holding $x_{-i}$ (all the other components) fixed. That is, $Q(x' \mid x) = 0$ if $x$ and $x'$ differ in more than one component, and if $x$ and $x'$ differ only in component $i$, $Q(x' \mid x) = \frac{1}{n}\, P(x'_i \mid x_{-i})$.

Gibbs acceptance probability. The MH acceptance probability is $p = \frac{P(x')\, Q(x \mid x')}{P(x)\, Q(x' \mid x)}$. For Gibbs, suppose we are resampling $x_1$, which participates in $\phi_7(x_1, x_4)$ and $\phi_9(x_1, x_3, x_6)$. The first factor is easy, since every potential not involving $x_1$ cancels: $\frac{P(x')}{P(x)} = \frac{\phi_7(x'_1, x_4)\, \phi_9(x'_1, x_3, x_6)}{\phi_7(x_1, x_4)\, \phi_9(x_1, x_3, x_6)}$.

Gibbs acceptance probability. Second factor: $\frac{Q(x \mid x')}{Q(x' \mid x)} = \frac{P(x_1 \mid x_{-1})}{P(x'_1 \mid x_{-1})}$. $P(x'_1 \mid x_{-1})$ is simple too: $P(x'_1 \mid x_{-1}) = \frac{1}{Z}\, \phi_7(x'_1, x_4)\, \phi_9(x'_1, x_3, x_6)$, and similarly for $P(x_1 \mid x_{-1})$. So $\frac{Q(x \mid x')}{Q(x' \mid x)} = \frac{\phi_7(x_1, x_4)\, \phi_9(x_1, x_3, x_6)}{\phi_7(x'_1, x_4)\, \phi_9(x'_1, x_3, x_6)}$.

Better yet. $\frac{P(x')}{P(x)} = \frac{\phi_7(x'_1, x_4)\, \phi_9(x'_1, x_3, x_6)}{\phi_7(x_1, x_4)\, \phi_9(x_1, x_3, x_6)}$ and $\frac{Q(x \mid x')}{Q(x' \mid x)} = \frac{\phi_7(x_1, x_4)\, \phi_9(x_1, x_3, x_6)}{\phi_7(x'_1, x_4)\, \phi_9(x'_1, x_3, x_6)}$. The two factors cancel! So $p = 1$: always accept.
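A minimal Gibbs-sampler sketch for a small chain-structured model with pairwise potentials, showing that each update just renormalizes the local factors touching $x_i$ (so every proposal is accepted). The potential tables, variable count, and sweep count are assumptions for illustration.

```python
import numpy as np

def gibbs_sample_chain(phis, num_states, n_vars, n_sweeps, rng=None):
    """Gibbs sampling for P(x) proportional to prod_t phi_t(x_t, x_{t+1}), x discrete."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.integers(num_states, size=n_vars)
    samples = []
    for _ in range(n_sweeps):
        for i in range(n_vars):
            # Unnormalized conditional of x_i: product of the factors that touch it.
            cond = np.ones(num_states)
            if i > 0:
                cond *= phis[i - 1][x[i - 1], :]      # phi(x_{i-1}, x_i)
            if i < n_vars - 1:
                cond *= phis[i][:, x[i + 1]]          # phi(x_i, x_{i+1})
            cond /= cond.sum()                        # local normalizer
            x[i] = rng.choice(num_states, p=cond)     # resample x_i | x_{-i}; p = 1
        samples.append(x.copy())
    return np.array(samples)

# Assumed example: 4 binary variables with potentials that favor agreement.
phi = np.array([[2.0, 1.0], [1.0, 2.0]])
samples = gibbs_sample_chain([phi, phi, phi], num_states=2, n_vars=4, n_sweeps=2000)
print(samples.mean(axis=0))   # estimated P(x_i = 1); near 0.5 by symmetry here
```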

Gibbs in practice. Simple to implement, and it often works well. Common failure mode: knowing $x_{-i}$ locks down $x_i$. This results in slow mixing, since it takes a lot of low-probability moves to get from $x$ to a very different $x'$.

Locking down. E.g., handwriting recognition: antidisestablishmen?arianism. Even if we do propose and accept antidisestablishmenqarianism, we are likely to go right back. E.g., image segmentation: if all my neighboring pixels in an 11x11 region are background, I'm highly likely to be background as well. E.g., HMMs: knowing $x_{t-1}$ and $x_{t+1}$ often gives a good idea of $x_t$. Sometimes the lock-down depends on the values of other variables: ai? could be any of {aid, ail, aim, air}, but th? is almost certainly the (and maybe thy or tho).

Worked example [switch to Matlab]

Related topics. Reversible-jump MCMC, for when we don't know the dimension of $x$. Rao-Blackwellization, a hybrid between Monte Carlo and exact inference: treat some variables exactly, sample over the rest. Swendsen-Wang, a modification of Gibbs that mixes faster in locked-down distributions. Data-driven proposals: EKPF, UPF.