Markov chain Monte Carlo


Markov chain Monte Carlo
Probabilistic Models of Cognition, 2011
http://www.ipam.ucla.edu/programs/gss2011/
Roadmap: motivation; Monte Carlo basics; what is MCMC?; Metropolis Hastings and Gibbs... more tomorrow.
Iain Murray, http://homepages.inf.ed.ac.uk/imurray2/

Eye-balling samples
Sometimes samples are pleasing to look at (if you're into geometrical combinatorics). [Figure by Propp and Wilson. Source: MacKay textbook.]
Sanity-check probabilistic modeling assumptions by comparing: data samples, MoB samples, RBM samples.

The need for integrals
P(y | x, D) = \int d\theta \, P(y, \theta | x, D) = \int d\theta \, P(y | \theta, x, D) \, P(\theta | x, D)
[Figure: predictive density P(y | x, D) plotted as a function of x and y.]

A statistical problem
What is the average height of the GSS2011 lecturers? Method: measure their heights, add them up and divide by N \approx 25.
What is the average height \bar{f} of people p in California C?
E_{p \in C}[f(p)] = \frac{1}{|C|} \sum_{p \in C} f(p)   (intractable?)
\approx \frac{1}{S} \sum_{s=1}^{S} f(p^{(s)}),   for a random survey of S people \{p^{(s)}\} \subset C
Surveying works for large and notionally infinite populations.

Simple Monte Carlo
Statistical sampling can be applied to any expectation. In general:
\int f(x) P(x) \, dx \approx \frac{1}{S} \sum_{s=1}^{S} f(x^{(s)}),   x^{(s)} \sim P(x)
Example, making predictions:
P(x|D) = \int P(x|\theta, D) \, P(\theta|D) \, d\theta \approx \frac{1}{S} \sum_{s=1}^{S} P(x|\theta^{(s)}, D),   \theta^{(s)} \sim P(\theta|D)
More examples: E-step statistics in EM, Boltzmann machine learning.
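
To make this concrete, a minimal Octave/Matlab sketch with an assumed example (not from the slides): f(x) = cos(x) under P(x) = N(0, 1), whose exact expectation is exp(-1/2).

S = 1e6;
x = randn(S, 1);                 % x^(s) ~ P(x) = N(0, 1)
est = mean(cos(x));              % (1/S) * sum_s f(x^(s))
fprintf('estimate %.4f, exact %.4f\n', est, exp(-0.5));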

Properties of Monte Carlo
Estimator:
\int f(x) P(x) \, dx \approx \hat{f} \equiv \frac{1}{S} \sum_{s=1}^{S} f(x^{(s)}),   x^{(s)} \sim P(x)
The estimator is unbiased:
E_{P(\{x^{(s)}\})}[\hat{f}] = \frac{1}{S} \sum_{s=1}^{S} E_{P(x)}[f(x)] = E_{P(x)}[f(x)]
The variance shrinks like 1/S:
\mathrm{var}_{P(\{x^{(s)}\})}[\hat{f}] = \frac{1}{S^2} \sum_{s=1}^{S} \mathrm{var}_{P(x)}[f(x)] = \mathrm{var}_{P(x)}[f(x)] / S
Error bars shrink like 1/\sqrt{S}.
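
A quick empirical check of the 1/\sqrt{S} scaling, again a sketch with an assumed test function (f(x) = x^2, x ~ N(0, 1), true expectation 1): quadrupling S should roughly halve the RMS error.

for S = [100 400 1600]
    est = mean(randn(S, 1000).^2, 1);    % 1000 independent estimators f_hat
    fprintf('S=%4d  RMS error %.4f\n', S, sqrt(mean((est - 1).^2)));
end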

Aside: don't always sample!
"Monte Carlo is an extremely bad method; it should be used only when all alternative methods are worse." (Alan Sokal, 1996)

A dumb approximation of π
P(x, y) = 1 if 0 < x < 1 and 0 < y < 1, 0 otherwise
\pi = 4 \int I\big((x^2 + y^2) < 1\big) \, P(x, y) \, dx \, dy
octave:1> S=12; a=rand(S,2); 4*mean(sum(a.*a,2)<1)
ans = 3.3333
octave:2> S=1e7; a=rand(S,2); 4*mean(sum(a.*a,2)<1)
ans = 3.1418

Alternatives to Monte Carlo
There are other methods of numerical integration! Example: (nice) 1D integrals are easy:
octave:1> 4 * quadl(@(x) sqrt(1-x.^2), 0, 1, tolerance)
Gives π to 6 decimal places in 108 evaluations, to machine precision in 2598. (NB Matlab's quadl fails at tolerance=0, but Octave works.)
In higher dimensions deterministic approximations sometimes work: Variational Bayes, EP, INLA, ...

Reminder
Want to sample to approximate expectations:
\int f(x) P(x) \, dx \approx \frac{1}{S} \sum_{s=1}^{S} f(x^{(s)}),   x^{(s)} \sim P(x)
How do we get the samples?

Sampling simple distributions
Use library routines for univariate distributions (and some other special cases).
This book (free online) explains how some of them work: http://cg.scs.carleton.ca/~luc/rnbookindex.html

Sampling discrete values
Draw u \sim Uniform[0, 1] and return the value whose slice of the unit interval, under the cumulative probabilities, contains u; in the slide's figure u = 0.4 gives x = b.
There are more efficient ways for large numbers of values and samples. See the Devroye book.
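
A sketch of the cumulative-sum lookup depicted in the figure (the probability vector is an assumed example):

p = [0.2 0.5 0.3];               % P(x=a), P(x=b), P(x=c)
u = rand();                      % u ~ Uniform[0, 1]
x = find(u <= cumsum(p), 1);     % first value whose cumulative mass reaches u
% e.g. u = 0.4 falls in (0.2, 0.7], giving x = 2, i.e. "b" as in the figure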

Sampling from densities
How to convert samples from a Uniform[0,1] generator: compute the cumulative distribution
h(y) = \int_{-\infty}^{y} p(y') \, dy'
then draw u \sim Uniform[0,1] and set y(u) = h^{-1}(u).
[Figure from PRML, Bishop (2006)]
Although we can't always compute and invert h(y).
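
When h(y) is invertible this is two lines; a sketch for the Exponential(1) density p(y) = exp(-y), where h(y) = 1 - exp(-y) and h^{-1}(u) = -log(1 - u):

u = rand(1e6, 1);                % u ~ Uniform[0, 1]
y = -log(1 - u);                 % y = h^{-1}(u), so y ~ Exponential(1)
fprintf('sample mean %.3f (exact 1)\n', mean(y));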

Sampling from densities
Draw points uniformly under the curve \tilde{P}(x); the probability mass to the left of each point's x-coordinate is then Uniform[0,1].
[Figure: the curve \tilde{P}(x) with sample points x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)} along the axis.]

Rejection sampling
Sampling from \pi(x) using a tractable proposal q(x): choose k so that k\,q(x) \geq \tilde{\pi}(x) everywhere, draw x \sim q(x) and u \sim Uniform[0, k\,q(x)], and keep x only if u \leq \tilde{\pi}(x).
[Figure credit: Ryan P. Adams]
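
A minimal sketch of the procedure; the target and proposal here are assumptions for illustration, not the ones in the figure. The unnormalized target pitilde(x) = exp(-x^2/2)(2 + cos 5x) sits under the envelope k*q(x), with q = N(0, 2^2) and k = 3/q(0).

pitilde = @(x) exp(-0.5*x.^2) .* (2 + cos(5*x));   % unnormalized target
q = @(x) exp(-0.125*x.^2) / (2*sqrt(2*pi));        % proposal density N(0, 2^2)
k = 3 / q(0);                                      % k*q(x) >= pitilde(x) everywhere
x = 2*randn(1e6, 1);                               % draws from q
keep = rand(1e6, 1) < pitilde(x) ./ (k*q(x));      % accept w.p. pitilde/(k*q)
samples = x(keep);                                 % exact draws from pi(x)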

Importance sampling
Throwing away samples seems wasteful. Instead, rewrite the integral as an expectation under Q:
\int f(x) P(x) \, dx = \int f(x) \frac{P(x)}{Q(x)} Q(x) \, dx,   (Q(x) > 0 wherever P(x) > 0)
\approx \frac{1}{S} \sum_{s=1}^{S} f(x^{(s)}) \frac{P(x^{(s)})}{Q(x^{(s)})},   x^{(s)} \sim Q(x)
This is just simple Monte Carlo again, so it is unbiased.
Importance sampling applies even when the integral is not an expectation: divide and multiply any integrand by a convenient distribution.
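
The same assumed example as earlier (E[cos(x)] under P = N(0, 1)), now estimated through a proposal Q = N(0, 1.5^2); both densities are normalized here, so the ratio P/Q is exact:

S = 1e6; sq = 1.5;
x = sq*randn(S, 1);                        % x^(s) ~ Q
w = sq * exp(-0.5*x.^2 + 0.5*(x/sq).^2);   % exact ratio P(x)/Q(x)
est = mean(cos(x) .* w);                   % unbiased estimate of E_P[cos(x)]
fprintf('estimate %.4f, exact %.4f\n', est, exp(-0.5));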

Importance sampling (2)
The previous slide assumed we could evaluate P(x) = \tilde{P}(x)/Z_P. Often only \tilde{P} is available:
\int f(x) P(x) \, dx \approx \frac{Z_Q}{Z_P} \frac{1}{S} \sum_{s=1}^{S} f(x^{(s)}) \, \frac{\tilde{P}(x^{(s)})}{\tilde{Q}(x^{(s)})},   x^{(s)} \sim Q(x)
\approx \sum_{s=1}^{S} f(x^{(s)}) \, w^{(s)},   where r^{(s)} = \frac{\tilde{P}(x^{(s)})}{\tilde{Q}(x^{(s)})} and w^{(s)} = \frac{r^{(s)}}{\sum_{s'} r^{(s')}}
This estimator is consistent but biased.
Exercise: prove that Z_P / Z_Q \approx \frac{1}{S} \sum_s r^{(s)}.
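
A sketch of this self-normalized version for the same assumed example, now pretending only Ptilde(x) = exp(-x^2/2) is available; constants shared by all r^(s) cancel in the weights w^(s).

S = 1e6; sq = 1.5;
x = sq*randn(S, 1);                                % x^(s) ~ Q = N(0, sq^2)
r = exp(-0.5*x.^2) ./ (exp(-0.5*(x/sq).^2)/sq);    % r^(s) = Ptilde/Qtilde
w = r / sum(r);                                    % normalized weights w^(s)
est = sum(cos(x) .* w);                            % consistent but biased
fprintf('estimate %.4f, exact %.4f\n', est, exp(-0.5));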

Summary so far
- Sums and integrals, often expectations, occur frequently in statistics
- Monte Carlo approximates expectations with a sample average
- Rejection sampling draws samples from complex distributions
- Importance sampling applies Monte Carlo to any sum/integral
Next: why are we not done? MCMC, Metropolis Hastings and Gibbs.

Reminder
Need to sample from large, non-standard distributions:
P(x|D) \approx \frac{1}{S} \sum_{s=1}^{S} P(x|\theta^{(s)}),   \theta^{(s)} \sim P(\theta|D)
When there are nuisance parameters:
P(\theta|D) = \int d\alpha \, P(\theta, \alpha | D)
Sample (\theta, \alpha) \sim P(\theta, \alpha | D) \propto P(\alpha) \, P(\theta|\alpha) \, P(D|\theta) and discard the \alpha's.

Application to large problems
Rejection and importance sampling scale badly with dimensionality.
Example: P(x) = N(0, I), Q(x) = N(0, \sigma^2 I)
Rejection sampling: requires \sigma \geq 1; fraction of proposals accepted = \sigma^{-D}
Importance sampling:
\mathrm{var}[P(x)/Q(x)] = \left( \frac{\sigma^2}{2 - 1/\sigma^2} \right)^{D/2} - 1
Infinite / undefined variance if \sigma \leq 1/\sqrt{2}.
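
An empirical sketch of the rejection-sampling claim, using the tightest envelope k = \sigma^D (the ratio P(x)/Q(x) is maximized at x = 0): the acceptance rate collapses geometrically in D.

sigma = 1.5; S = 1e5;
for D = [1 5 10 20]
    x = sigma*randn(S, D);                         % proposals from Q
    logr = -0.5*sum(x.^2, 2)*(1 - 1/sigma^2);      % log[ P(x) / (k*Q(x)) ]
    acc = mean(rand(S, 1) < exp(logr));            % empirical acceptance rate
    fprintf('D=%2d  acceptance %.4f  (theory %.4f)\n', D, acc, sigma^(-D));
end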

Importance sampling weights
[Figure: ten draws with their importance weights, which span roughly fifty orders of magnitude:]
w = 0.00548, w = 1.59e-08, w = 9.65e-06, w = 0.371, w = 0.103, w = 1.01e-08, w = 0.111, w = 1.92e-09, w = 0.0126, w = 1.1e-51

Metropolis algorithm
- Perturb parameters: Q(\theta'; \theta), e.g. N(\theta, \sigma^2)
- Accept with probability \min\left(1, \frac{P(\theta'|D)}{P(\theta|D)}\right)
- Otherwise keep the old parameters
Detail: Metropolis, as stated, requires Q(\theta'; \theta) = Q(\theta; \theta')
[Subfigure from PRML, Bishop (2006)]

Markov chain Monte Carlo
Construct a biased random walk that explores the target distribution P(x):
Markov steps: x^{t} \sim T(x^{t} \leftarrow x^{t-1})
MCMC gives approximate, correlated samples from P(x).

Transition operators
Discrete example:
P = \begin{pmatrix} 3/5 \\ 1/5 \\ 1/5 \end{pmatrix},   T = \begin{pmatrix} 2/3 & 1/2 & 1/2 \\ 1/6 & 0 & 1/2 \\ 1/6 & 1/2 & 0 \end{pmatrix},   T_{ij} = T(x_i \leftarrow x_j)
P is an invariant distribution of T because TP = P, i.e.
\sum_x T(x' \leftarrow x) P(x) = P(x')
P is also the equilibrium distribution of T; to machine precision,
T^{100} \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 3/5 \\ 1/5 \\ 1/5 \end{pmatrix} = P
Ergodicity requires: T^K(x' \leftarrow x) > 0 for all x' with P(x') > 0, for some K.
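
These claims take a few lines to check numerically; a sketch reproducing the slide's numbers:

T = [2/3 1/2 1/2; 1/6 0 1/2; 1/6 1/2 0];
P = [3/5; 1/5; 1/5];
disp((T*P)')               % [0.6 0.2 0.2]: invariance, T*P = P
disp((T^100 * [1;0;0])')   % the same vector to machine precision: equilibrium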

Reverse operators
If T leaves P(x) stationary, we can define a reverse operator:
R(x \leftarrow x') = \frac{T(x' \leftarrow x) P(x)}{\sum_x T(x' \leftarrow x) P(x)} = \frac{T(x' \leftarrow x) P(x)}{P(x')}
A necessary (and sufficient) condition for stationarity: there exists an R such that
T(x' \leftarrow x) P(x) = R(x \leftarrow x') P(x'),   for all x, x'
If R = T, the operator satisfies detailed balance (not necessary).

Balance condition
The moves x \to x' and x' \to x are equally probable:
T(x' \leftarrow x) P(x) = R(x \leftarrow x') P(x')
This implies that P(x) is left invariant:
\sum_x T(x' \leftarrow x) P(x) = P(x') \underbrace{\sum_x R(x \leftarrow x')}_{=1} = P(x')
Enforcing the condition is easy: it only involves isolated pairs of states (checked numerically below).
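
For the discrete example above, the condition can be checked elementwise; a sketch (here R = T, so this is detailed balance):

T = [2/3 1/2 1/2; 1/6 0 1/2; 1/6 1/2 0];
P = [3/5; 1/5; 1/5];
F = T .* repmat(P', 3, 1);    % F(i,j) = T(x_i <- x_j) P(x_j): probability flow
disp(max(max(abs(F - F'))))   % 0: flow j->i equals flow i->j for every pair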

Metropolis algorithm (recap)
- Perturb parameters: Q(\theta'; \theta), e.g. N(\theta, \sigma^2)
- Accept with probability \min\left(1, \frac{P(\theta'|D)}{P(\theta|D)}\right)
- Otherwise keep the old parameters
Detail: Metropolis, as stated, requires Q(\theta'; \theta) = Q(\theta; \theta')
[Subfigure from PRML, Bishop (2006)]

Metropolis Hastings
Transition operator:
- Propose a move from the current state: Q(x'; x), e.g. N(x, \sigma^2)
- Accept with probability \min\left(1, \frac{P(x') Q(x; x')}{P(x) Q(x'; x)}\right)
- Otherwise the next state in the chain is a copy of the current state
Notes:
- Can use \tilde{P} \propto P(x); the normalizer cancels in the acceptance ratio
- Satisfies detailed balance (shown below)
- Q must be chosen so that the chain is ergodic

P(x) \, T(x' \leftarrow x) = P(x) \, Q(x'; x) \min\left(1, \frac{P(x') Q(x; x')}{P(x) Q(x'; x)}\right) = \min\big(P(x) Q(x'; x), \, P(x') Q(x; x')\big)
= P(x') \, Q(x; x') \min\left(1, \frac{P(x) Q(x'; x)}{P(x') Q(x; x')}\right) = P(x') \, T(x \leftarrow x')

Matlab/Octave code for demo

function samples = dumb_metropolis(init, log_ptilde, iters, sigma)
% Random-walk Metropolis: returns a D x iters matrix of (correlated) samples.
D = numel(init);
samples = zeros(D, iters);
state = init;
Lp_state = log_ptilde(state);
for ss = 1:iters
    % Propose
    prop = state + sigma*randn(size(state));
    Lp_prop = log_ptilde(prop);
    if log(rand) < (Lp_prop - Lp_state)
        % Accept
        state = prop;
        Lp_state = Lp_prop;
    end
    samples(:, ss) = state(:);
end

Step-size demo
Explore N(0, 1) with different step sizes \sigma:
sigma = @(s) plot(dumb_metropolis(0, @(x) -0.5*x*x, 1e3, s));
sigma(0.1): 99.8% accepts
sigma(1): 68.4% accepts
sigma(100): 0.5% accepts
[Figure: three trace plots of 1000 iterations each, y-axis from -4 to 4.]

Gibbs sampling
A method with no rejections:
- Initialize x to some value
- Pick each variable in turn (or randomly) and resample x_i \sim P(x_i | x_{j \neq i})
[Figure from PRML, Bishop (2006): Gibbs updates on a correlated Gaussian over (z_1, z_2), with length scales L and l.]
Proof of validity: (a) check detailed balance for a component update; (b) Metropolis Hastings proposals P(x_i | x_{j \neq i}) are accepted with probability 1. Apply a series of these operators; there is no need to check acceptance. (See the sketch below.)
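
A sketch for an assumed bivariate Gaussian with correlation rho (in the spirit of Bishop's figure), where each conditional is the 1D Gaussian x_i | x_j ~ N(rho*x_j, 1 - rho^2):

rho = 0.8; iters = 1e4;
x = [0 0]; samples = zeros(iters, 2);
for ss = 1:iters
    x(1) = rho*x(2) + sqrt(1 - rho^2)*randn();   % resample x1 ~ P(x1 | x2)
    x(2) = rho*x(1) + sqrt(1 - rho^2)*randn();   % resample x2 ~ P(x2 | x1)
    samples(ss, :) = x;
end
C = cov(samples);
fprintf('sample correlation %.3f (target %.2f)\n', C(1,2)/sqrt(C(1,1)*C(2,2)), rho);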

Gibbs sampling
Alternative explanation: the chain is currently at x. At equilibrium we can assume x \sim P(x), which is consistent with x_{j \neq i} \sim P(x_{j \neq i}), x_i \sim P(x_i | x_{j \neq i}). Pretend x_i was never sampled and do it again. This view may be useful later for non-parametric applications.

Summary so far
- We need approximate methods to solve sums/integrals
- Monte Carlo does not explicitly depend on dimension, although simple methods work only in low dimensions
- Markov chain Monte Carlo (MCMC) can make local moves; by assuming less, it is more applicable to higher dimensions, needs only simple computations, and is easy to implement (though harder to diagnose)