Monte Carlo Inference Methods


Iain Murray, University of Edinburgh. http://iainmurray.net

Monte Carlo and Insomnia. Enrico Fermi (1901-1954) "took great delight in astonishing his colleagues with his remarkably accurate predictions of experimental results... his guesses were really derived from the statistical sampling techniques that he used to calculate with whenever insomnia struck!" ("The beginning of the Monte Carlo method", N. Metropolis)

Overview. Gaining insight from random samples. Inference / computation: what does my data imply? What is still uncertain? Sampling methods: importance, rejection, Metropolis-Hastings, Gibbs, slice. Practical issues / debugging.

Linear regression: $y = \theta_1 x + \theta_2$, prior $p(\theta) = \mathcal{N}(\theta;\, 0,\, 0.4^2 I)$. [Figure: lines sampled from the prior $p(\theta)$.]

Linear regression: $y^{(n)} = \theta_1 x^{(n)} + \theta_2 + \epsilon^{(n)}$, $\epsilon^{(n)} \sim \mathcal{N}(0, 0.1^2)$. [Figure: data with posterior line samples; $p(\theta \mid \text{Data}) \propto p(\text{Data} \mid \theta)\,p(\theta)$.]

Linear regression (zoomed in). [Figure: close-up of the data with posterior line samples, $p(\theta \mid \text{Data}) \propto p(\text{Data} \mid \theta)\,p(\theta)$.]

Model mismatch. [Figure: clearly curved data.] What will Bayesian linear regression do?

Quiz. Given a (wrong) linear assumption, which explanations are typical of the posterior distribution? [Figure: three panels, A, B, and C, each showing a different set of candidate line fits to the curved data.] D: all of the above. E: none of the above. Z: not sure.

Underfitting. [Figure: nearly identical posterior line samples cutting through the curved data.] Posterior very certain despite blatant misfit. Peaked around the least bad option.

Roadmap: looking at samples; Monte Carlo computations; how to actually get the samples.

Simple Monte Carlo integration: $\int f(\theta)\,\pi(\theta)\,d\theta = \text{average of } f \text{ over } \pi \approx \frac{1}{S}\sum_{s=1}^{S} f(\theta^{(s)})$, where $\theta^{(s)} \sim \pi$. Unbiased; variance $\propto 1/S$.
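A minimal Octave/Matlab sketch of this estimator (not from the slides; the target $\pi = \mathcal{N}(0,1)$ and test function $f(\theta) = \theta^2$, with true expectation 1, are illustrative):

S = 1e5;                 % number of samples
theta = randn(S, 1);     % theta^(s) ~ pi = N(0,1)
f_hat = mean(theta.^2);  % simple Monte Carlo estimate of E[f(theta)]
fprintf('estimate %.3f (true 1)\n', f_hat);  % std error shrinks like 1/sqrt(S)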

Prediction: $p(y \mid x, \mathcal{D}) = \int p(y \mid x, \theta)\,p(\theta \mid \mathcal{D})\,d\theta \approx \frac{1}{S}\sum_s p(y \mid x, \theta^{(s)})$, with $\theta^{(s)} \sim p(\theta \mid \mathcal{D})$.

[Figure sequence: the data and a query location $x$; predictive densities $p(y \mid x, \theta^{(s)})$ for individual posterior samples; their average $\frac{1}{12}\sum_{s=1}^{12} p(y \mid x, \theta^{(s)})$; and the smoother estimate $p(y \mid x, \mathcal{D}) \approx \frac{1}{100}\sum_{s=1}^{100} p(y \mid x, \theta^{(s)})$.]

More interesting models. Noisy function values: $y^{(n)} \sim p(y \mid f(x^{(n)}; w), \Sigma)$. What weights and noise levels are plausible? $p(\Sigma)$, $p(w \mid \alpha)$. Some observations corrupted? If $z^{(n)} = 1$, then $y^{(n)} \sim p(y \mid 0, \Sigma_N)$, ..., with $p(z{=}1) = \epsilon$ ... A lot of choices. Joint probability: $\left[\prod_n p(y^{(n)} \mid z^{(n)}, x^{(n)}, w, \Sigma, \Sigma_N)\, p(z^{(n)} \mid \epsilon)\, p(x^{(n)})\right] p(\epsilon)\, p(\Sigma_N)\, p(\Sigma)\, p(w \mid \alpha)\, p(\alpha)$

Inference. Observe data $\mathcal{D} = \{x^{(n)}, y^{(n)}\}$. Unknowns: $\theta = \{w, \alpha, \epsilon, \Sigma, \{z^{(n)}\}, \ldots\}$. $p(\theta \mid \mathcal{D}) = \dfrac{p(\mathcal{D} \mid \theta)\,p(\theta)}{p(\mathcal{D})} \propto p(\mathcal{D}, \theta)$

Marginalization. Interested in a particular parameter $\theta_i$: $p(\theta_i \mid \mathcal{D}) = \int p(\theta \mid \mathcal{D})\,d\theta_{\setminus i}$. Sampling solution: sample everything, $\theta^{(s)} \sim p(\theta \mid \mathcal{D})$; then $\theta_i^{(s)}$ comes from the marginal $p(\theta_i \mid \mathcal{D})$. (But see also Rao-Blackwellization.)

Roadmap: looking at samples; Monte Carlo computations; how to actually get the samples (standard generators, Markov chains); practical issues.

Sampling simple distributions. Use library routines for univariate distributions (and some other special cases). This book (free online) explains how some of them work: http://luc.devroye.org/rnbookindex.html

Target distribution: $\pi(\theta) = \pi^*(\theta)/Z$, e.g., $\pi^*(\theta) = p(\mathcal{D} \mid \theta)\,p(\theta)$.

Sampling discrete values: draw $u \sim \text{Uniform}[0, 1]$ and pick the value whose cumulative-probability bin contains $u$ (e.g., $u = 0.4 \Rightarrow \theta = b$). For large numbers of samples: 1) the alias method (Devroye's book); 2) Ex. 6.3 of MacKay's book.
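A small Octave/Matlab sketch of the cumulative-bin method (the probabilities below are made up for illustration):

p = [0.3 0.2 0.4 0.1];             % p(theta = 1, 2, 3, 4)
u = rand();                        % u ~ Uniform[0,1]
theta = find(u <= cumsum(p), 1);   % index of the first CDF bin containing u
% For very many draws, precompute an alias table for O(1) cost per sample.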

Sampling from a density. Math: draw $A^{(s)} \sim \text{Uniform}[0, 1]$ and set $\theta^{(s)} = \Phi^{-1}(A^{(s)})$, where the CDF is $\Phi(\theta) = \int_{-\infty}^{\theta} \pi(\theta')\,d\theta'$. Geometry: sample uniformly under the curve $\pi^*(\theta)$. [Figure: samples $\theta^{(1)}, \ldots, \theta^{(4)}$ under the density.]
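For example, a sketch of inverse-CDF sampling for an Exponential($\lambda$) target, where $\Phi(\theta) = 1 - e^{-\lambda\theta}$ gives $\Phi^{-1}(u) = -\log(1-u)/\lambda$ (the rate below is an illustrative choice, not from the slides):

lambda = 2;                    % illustrative rate
u = rand(1e5, 1);              % u^(s) ~ Uniform[0,1]
theta = -log(1 - u) / lambda;  % theta^(s) ~ Exponential(lambda)
fprintf('sample mean %.3f (true %.3f)\n', mean(theta), 1/lambda);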

Rejection sampling. Find $k$ with $k\,q(\theta) \geq \pi^*(\theta)$ everywhere, then:

while true:
    θ' ~ q
    h ~ Uniform[0, k q(θ')]
    if h < π*(θ'): return θ'

[Figures: proposals drawn under the envelope $k\,q(\theta)$ are accepted if they fall under $\pi^*(\theta)$ and rejected otherwise; a tighter envelope $k_{\text{opt}}\,q(\theta)$ wastes fewer proposals.]
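A sketch of that loop in Octave/Matlab, with illustrative choices: unnormalized target $\pi^*(\theta) = e^{-\theta^2/2}$, proposal $q = \mathcal{N}(0, 2^2)$, and $k = 2\sqrt{2\pi}$, so that $k\,q(\theta) = e^{-\theta^2/8} \geq \pi^*(\theta)$ everywhere:

pistar = @(th) exp(-0.5*th.^2);                      % unnormalized N(0,1)
q_pdf  = @(th) exp(-0.5*(th/2).^2) / (2*sqrt(2*pi)); % N(0, 2^2) density
k = 2*sqrt(2*pi);                                    % envelope constant
while true
    th = 2*randn();                 % theta' ~ q
    h = rand() * k * q_pdf(th);     % h ~ Uniform[0, k q(theta')]
    if h < pistar(th), break; end   % accept: th is an exact sample from pi
end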

Importance sampling. Rewrite the integral as an expectation under a simple distribution $q$: $\int f(\theta)\,\pi(\theta)\,d\theta = \int f(\theta)\,\frac{\pi(\theta)}{q(\theta)}\,q(\theta)\,d\theta \approx \frac{1}{S}\sum_{s=1}^{S} f(\theta^{(s)})\,\frac{\pi(\theta^{(s)})}{q(\theta^{(s)})}$, with $\theta^{(s)} \sim q$. Unbiased if $q(\theta) > 0$ wherever $\pi(\theta) > 0$. Can have infinite variance.

Importance sampling 2. Can't evaluate $\pi(\theta) = \pi^*(\theta)/Z$ when $Z = \int \pi^*(\theta)\,d\theta$ is unknown. Alternative version: $\theta^{(s)} \sim q$, $r^{(s)} = \frac{\pi^*(\theta^{(s)})}{q(\theta^{(s)})}$, $\tilde r^{(s)} = \frac{r^{(s)}}{\sum_{s'} r^{(s')}}$. Biased but consistent estimator: $\int f(\theta)\,\pi(\theta)\,d\theta \approx \sum_{s=1}^{S} f(\theta^{(s)})\,\tilde r^{(s)}$
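A sketch of the self-normalized estimator, with illustrative choices: $\pi^*(\theta) = e^{-\theta^2/2}$, proposal $q = \mathcal{N}(0, 2^2)$, estimating $E[\theta^2] = 1$ under $\pi$:

S = 1e5;
theta = 2*randn(S, 1);                           % theta^(s) ~ q
log_pistar = -0.5*theta.^2;                      % log pi*(theta), no Z needed
log_q = -0.5*(theta/2).^2 - log(2*sqrt(2*pi));   % log q(theta)
r = exp(log_pistar - log_q);                     % unnormalized weights r^(s)
w = r / sum(r);                                  % normalized weights
fprintf('estimate %.3f (true 1)\n', sum(w .* theta.^2));

Working with log weights, as here, avoids numerical underflow when likelihood terms are tiny.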

Linear regression by importance sampling. [Figure sequence: 60 lines drawn from the prior, $\theta^{(s)} \sim p(\theta)$; the same 60 samples reweighted with $\tilde r(\theta^{(s)}) \propto p(\text{data} \mid \theta^{(s)})$; 10,000 reweighted prior samples, which concentrate on plausible fits.]

High-dimensional $\theta$. [Figure: 12 curves from the prior and 12 curves from the posterior, $p(\theta \mid \text{Data}) \propto p(\text{Data} \mid \theta)\,p(\theta)$.] [Figure sequence: 10,000, 100,000, and 1,000,000 prior samples, importance reweighted with $\tilde r(\theta^{(s)}) \propto p(\text{data} \mid \theta^{(s)})$.] Even the largest reweighted sample fails to match the posterior: almost no prior draws land where the likelihood is large.

Roadmap: looking at samples; Monte Carlo computations; how to actually get the samples (standard generators, Markov chains); practical issues.

[Slide: the original Metropolis et al. (1953) paper.] >30,000 citations. Marshall Rosenbluth's account: http://dx.doi.org/10.1063/1.1887186

Markov chains: $p(\theta^{(s+1)} \mid \theta^{(1)}, \ldots, \theta^{(s)}) = T(\theta^{(s+1)} \leftarrow \theta^{(s)})$. E.g., random-walk diffusion, $T(\theta^{(s+1)} \leftarrow \theta^{(s)}) = \mathcal{N}(\theta^{(s+1)};\, \theta^{(s)}, \sigma^2)$, which is divergent, spreading as $\sqrt{S}\,\sigma$. Chains can also have absorbing states and cycles.

Markov chain equilibrium: $\lim_{s\to\infty} p(\theta^{(s)}) = \pi(\theta^{(s)})$. Ergodic if this holds for all starting states $\theta^{(s=0)}$ (other definitions of ergodic exist). Possible to get anywhere in $K$ steps ($T^K(\theta' \leftarrow \theta) > 0$ for all pairs): no cycles or islands.

Invariant/stationary condition. If $\theta^{(s)}$ is a sample from $\pi$, then $\theta^{(s+1)}$ is also a sample from $\pi$: $p(\theta') = \int T(\theta' \leftarrow \theta)\,\pi(\theta)\,d\theta = \pi(\theta')$
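On a finite state space the invariance condition is just a matrix-vector identity, $T\pi = \pi$, which is easy to check numerically; a sketch with a made-up 3-state chain (each column of T is a transition distribution):

T = [0.90 0.20 0.10;
     0.05 0.70 0.30;
     0.05 0.10 0.60];            % T(j,i) = P(move to j | at i); columns sum to 1
[V, D] = eig(T);
[~, idx] = max(real(diag(D)));   % eigenvector with eigenvalue 1
ppi = real(V(:, idx));
ppi = ppi / sum(ppi);            % stationary distribution
disp(norm(T*ppi - ppi));         % ~0: the chain leaves ppi invariant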

Metropolis-Hastings. Propose $\theta' \sim q(\theta'; \theta^{(s)})$. If accepted: $\theta^{(s+1)} \leftarrow \theta'$; else: $\theta^{(s+1)} \leftarrow \theta^{(s)}$. $P(\text{accept}) = \min\left(1,\; \frac{\pi^*(\theta')\, q(\theta^{(s)}; \theta')}{\pi^*(\theta^{(s)})\, q(\theta'; \theta^{(s)})}\right)$

Example / warning. State space $(0, 1) \cup (1, 10)$. Proposal: $\theta' = 9\theta^{(s)} + 1$ if $0 < \theta^{(s)} < 1$; $\theta' = (\theta^{(s)} - 1)/9$ if $1 < \theta^{(s)} < 10$. Accept the move with probability $\min\left(1,\; \frac{\pi^*(\theta')\, q(\theta; \theta')}{\pi^*(\theta)\, q(\theta'; \theta)}\right) = \min\left(1,\; \frac{\pi^*(\theta')}{\pi^*(\theta)}\right)$ (Wrong!)

Example / warning, continued. Accept $\theta'$ with probability $\min\left(1,\; \frac{q(\theta; \theta')\,\pi^*(\theta')}{q(\theta'; \theta)\,\pi^*(\theta)}\right) = \min\left(1,\; \frac{1}{1/9}\cdot\frac{\pi^*(\theta')}{\pi^*(\theta)}\right)$. Really, Green (1995): $\min\left(1,\; \left|\frac{\partial\theta'}{\partial\theta}\right|\frac{\pi^*(\theta')}{\pi^*(\theta)}\right) = \min\left(1,\; 9\,\frac{\pi^*(\theta')}{\pi^*(\theta)}\right)$ for the expanding move (the Jacobian factor is $1/9$ for the reverse).

Matlab/Octave code for demo:

function samples = metropolis(init, log_pstar_fn, iters, sigma)
D = numel(init);
samples = zeros(D, iters);
state = init;
Lp_state = log_pstar_fn(state);
for ss = 1:iters
    % Propose
    prop = state + sigma * randn(size(state));
    Lp_prop = log_pstar_fn(prop);
    if rand() < exp(Lp_prop - Lp_state)
        % Accept
        state = prop;
        Lp_state = Lp_prop;
    end
    samples(:, ss) = state(:);
end

Step-size demo. Explore $\mathcal{N}(0, 1)$ with different step sizes $\sigma$:

sigma = @(s) plot(metropolis(0, @(x) -0.5*x*x, 1e3, s));

sigma(0.1): 99.8% accepts. [Figure: trace takes tiny steps, a slow random walk.]
sigma(1): 68.4% accepts. [Figure: trace mixes well across the distribution.]
sigma(100): 0.5% accepts. [Figure: trace stays stuck at single values for long stretches.]

Markov chain Monte Carlo (MCMC). User chooses $\pi(\theta)$; the chain explores plausible $\theta$. For large $s$: $p(\theta^{(s)}) = \pi(\theta^{(s)})$, $p(\theta^{(s+1)}) = \pi(\theta^{(s+1)})$, so $\mathbb{E}\big[f(\theta^{(s)})\big] = \mathbb{E}\big[f(\theta^{(s+1)})\big] = \int f(\theta)\,\pi(\theta)\,d\theta = \lim_{S\to\infty} \frac{1}{S}\sum_s f(\theta^{(s)})$
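Using the metropolis function shown above, a quick illustrative check of this long-run average for $\pi = \mathcal{N}(0, 1)$ and $f(\theta) = \theta^2$ (true value 1):

samples = metropolis(0, @(x) -0.5*x*x, 1e4, 1);  % log pi*(x) up to a constant
fprintf('E[theta^2] estimate: %.3f (true 1)\n', mean(samples.^2));
% Early samples are biased by the start state; in practice discard some burn-in.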

Markov chain mixing. [Figure: initialization plus 2000 steps of a chain; the marginals $p(\theta^{(s=100)})$ and $p(\theta^{(s=2000)})$ approach the target $\pi(\theta)$.]

Creating an MCMC scheme. M-H gives a $T(\theta' \leftarrow \theta)$ with invariant distribution $\pi$. Lots of options for $q(\theta'; \theta)$: local diffusions; approximations of $\pi$; update one variable or all? ... Multiple valid operators: $T_A, T_B, T_C, \ldots$

Composing operators. If $p(\theta^{(1)}) = \pi(\theta^{(1)})$ and $\theta^{(2)} \sim T_A(\cdot \leftarrow \theta^{(1)})$, then $p(\theta^{(2)}) = \pi(\theta^{(2)})$; with $\theta^{(3)} \sim T_B(\cdot \leftarrow \theta^{(2)})$, $p(\theta^{(3)}) = \pi(\theta^{(3)})$; with $\theta^{(4)} \sim T_C(\cdot \leftarrow \theta^{(3)})$, $p(\theta^{(4)}) = \pi(\theta^{(4)})$. The composition $T_C T_B T_A$ leaves $\pi$ invariant: a valid MCMC method if ergodic overall.

Gibbs sampling. Pick variables in turn or randomly, and resample from the conditional $p(\theta_i \mid \theta_{j \neq i})$. [Figure: alternating draws from $p(\theta_1 \mid \theta_2)$ and $p(\theta_2 \mid \theta_1)$ on a correlated 2-D target; LHS adapted from Fig. 11.11, Bishop PRML.] $T_i(\theta' \leftarrow \theta) = p(\theta'_i \mid \theta_{j \neq i})\,\delta(\theta'_{j \neq i} - \theta_{j \neq i})$

Gibbs sampling correctness. $p(\theta) = p(\theta_i \mid \theta_{\setminus i})\,p(\theta_{\setminus i})$. Simulate by drawing $\theta_{\setminus i}$, then $\theta_i \mid \theta_{\setminus i}$. To draw $\theta_{\setminus i}$: take the current sample $\theta$ and throw the initial $\theta_i$ away.
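A minimal Gibbs-sampling sketch (not from the slides): a bivariate Gaussian with unit variances and correlation $\rho$, whose exact conditionals are $p(\theta_i \mid \theta_j) = \mathcal{N}(\rho\theta_j,\, 1-\rho^2)$:

rho = 0.95; S = 5000;
samples = zeros(2, S);
th = [0; 0];                                     % initial state
for ss = 1:S
    th(1) = rho*th(2) + sqrt(1 - rho^2)*randn(); % resample theta_1 | theta_2
    th(2) = rho*th(1) + sqrt(1 - rho^2)*randn(); % resample theta_2 | theta_1
    samples(:, ss) = th;
end
% With strong correlation these axis-aligned updates crawl slowly along the ridge.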

Blocking / collapsing. Infer $\theta = (w, z)$ given $\mathcal{D} = \{x^{(n)}, y^{(n)}\}$. Model: $w \sim \mathcal{N}(0, I)$; $z_n \sim \text{Bernoulli}(0.1)$; $y^{(n)} \sim \mathcal{N}(w^\top x^{(n)}, 0.1^2)$ if $z_n = 0$, and $y^{(n)} \sim \mathcal{N}(0, 1)$ if $z_n = 1$. Block Gibbs: resample $p(w \mid z, \mathcal{D})$ and $p(z \mid w, \mathcal{D})$. Collapsing: run MCMC on $p(z \mid \mathcal{D})$ or $p(w \mid \mathcal{D})$.

Auxiliary variables. Collapsing: analytically integrate variables out. Auxiliary methods: introduce extra variables and integrate them out by MCMC. Explore $\pi(\theta, h)$, where $\int \pi(\theta, h)\,dh = \pi(\theta)$. Examples: Swendsen-Wang, Hamiltonian Monte Carlo (HMC), slice sampling, pseudo-marginal methods, ...

Slice sampling idea. Sample uniformly under the curve $\pi^*(\theta) \propto \pi(\theta)$, i.e., sample the pair $(\theta, h)$ with $p(h \mid \theta) = \text{Uniform}[0, \pi^*(\theta)]$ and $p(\theta \mid h) \propto \mathbb{1}[\pi^*(\theta) \geq h]$: uniform on the "slice".

Slice sampling, unimodal conditionals. [Figure sequence: given the pair $(\theta, h)$, sample $p(\theta \mid h)$ by rejection from a broader uniform bracket; adaptive rejection shrinks the bracket after each miss; a new $\theta$ is found quickly, and no rejections are recorded.]

Slice sampling, multimodal conditionals. [Figure: a multimodal $\pi^*(\theta)$ with a slice at height $h$.] Use updates that leave $p(\theta \mid h)$ invariant: place a bracket randomly around the current point; linearly step out until both ends are off the slice; sample on the bracket, shrinking it as before.

Advantages of slice sampling: easy, only requires $\pi^*(\theta) \propto \pi(\theta)$; no rejections; step sizes are adaptive. Other versions: Neal (2003): http://www.cs.toronto.edu/~radford/slice-aos.abstract.html; elliptical slice sampling: http://iainmurray.net/pub/10ess/; pseudo-marginal slice sampling: http://arxiv.org/abs/1510.02958
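A sketch of one univariate update following the recipe above (stepping out, then shrinking); the function name and initial width w are illustrative choices, not from the slides:

function x_new = slice_sample_1d(x, log_pstar, w)
% One slice-sampling update for a target known up to a constant, log pi*(x).
log_h = log_pstar(x) + log(rand());            % slice height: h ~ Uniform[0, pi*(x)]
lo = x - w*rand();                             % random bracket around x
hi = lo + w;
while log_pstar(lo) > log_h, lo = lo - w; end  % step out left
while log_pstar(hi) > log_h, hi = hi + w; end  % step out right
while true
    x_new = lo + (hi - lo)*rand();             % uniform proposal on bracket
    if log_pstar(x_new) > log_h
        return;                                % on the slice: accept
    elseif x_new < x
        lo = x_new;                            % off the slice: shrink towards x
    else
        hi = x_new;
    end
end

Repeatedly calling x = slice_sample_1d(x, @(t) -0.5*t*t, 1) explores $\mathcal{N}(0, 1)$ with no step-size tuning beyond the initial width.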

Roadmap: looking at samples; Monte Carlo computations; how to actually get the samples (standard generators, Markov chains); practical issues.

Empirical diagnostics. Recommendations: Rasmussen (2000). For diagnostics: standard software packages like R-CODA. For opinions on thinning, multiple runs, burn-in, etc.: "Practical Markov chain Monte Carlo", Charles J. Geyer, Statistical Science, 7(4):473-483, 1992. http://www.jstor.org/stable/2246094

Getting it right. We write MCMC code to update $\theta \mid \mathcal{D}$. Idea: also write code to sample $\mathcal{D} \mid \theta$. Both codes leave the joint $p(\theta, \mathcal{D})$ invariant. Run the codes alternately and check that the $\theta$'s match the prior. Geweke, JASA, 99(467):799-804, 2004.

Other consistency checks. Do I get the right answer on tiny versions of my problem? Can I make good inferences about synthetic data drawn from my model? Posterior model checking: Gelman et al., Bayesian Data Analysis textbook and papers.

Summary. Write down the probability of everything, $p(\mathcal{D}, \theta)$. Condition on the data $\mathcal{D}$ and explore the unknowns $\theta$ by MCMC. Samples give plausible explanations: look at them, and average their predictions.

Which method? Simulating/sampling from a known distribution: exact samples, rejection sampling. Posterior distribution, small, noisy problem: importance sampling. Posterior distribution, interesting problem: start with MCMC; slice sampling, M-H if careful, Gibbs if clever; Hamiltonian methods (HMC) use gradients.

Acknowledgements / References. My first reading: MacKay's book and Neal's review: www.inference.phy.cam.ac.uk/mackay/itila/ and www.cs.toronto.edu/~radford/review.abstract.html. Handbook of Markov Chain Monte Carlo (Brooks, Gelman, Jones, Meng, eds., 2011): http://www.mcmchandbook.net/handbooksamplechapters.html. Some relevant workshops (apologies for omissions): ABC in Montreal; Scalable Monte Carlo Methods for Bayesian Analysis of Big Data; Black box learning and inference; Bayesian Nonparametrics: The Next Generation. Alternatives: Probabilistic Integration; Advances in Approximate Bayesian Inference.

Practical examples. My exercise sheet: http://iainmurray.net/teaching/09mlss/. BUGS examples and more in Stan: https://github.com/stan-dev/example-models/. Kaggle entry: http://iainmurray.net/pub/12kaggle_dark/

Appendix slides

Reverse operator. $R(\theta \leftarrow \theta') = \frac{T(\theta' \leftarrow \theta)\,\pi(\theta)}{\int T(\theta' \leftarrow \theta)\,\pi(\theta)\,d\theta} = \frac{T(\theta' \leftarrow \theta)\,\pi(\theta)}{\pi(\theta')}$. Necessary condition: $T(\theta' \leftarrow \theta)\,\pi(\theta) = R(\theta \leftarrow \theta')\,\pi(\theta')$ for all $\theta, \theta'$. If $R = T$, this is known as detailed balance (not necessary).

Balance condition. $T(\theta' \leftarrow \theta)\,\pi(\theta) = R(\theta \leftarrow \theta')\,\pi(\theta')$ implies that $\pi(\theta)$ is left invariant: $\int T(\theta' \leftarrow \theta)\,\pi(\theta)\,d\theta = \pi(\theta') \underbrace{\int R(\theta \leftarrow \theta')\,d\theta}_{=\,1} = \pi(\theta')$

Metropolis-Hastings. Propose $\theta' \sim q(\theta'; \theta^{(s)})$; if accepted, $\theta^{(s+1)} \leftarrow \theta'$, else $\theta^{(s+1)} \leftarrow \theta^{(s)}$. Transitions satisfy detailed balance: for $\theta' \neq \theta$, $T(\theta' \leftarrow \theta)\,\pi(\theta) = q(\theta'; \theta)\,\min\left(1,\; \frac{\pi(\theta')\,q(\theta; \theta')}{\pi(\theta)\,q(\theta'; \theta)}\right)\pi(\theta) = \min\big(\pi(\theta)\,q(\theta'; \theta),\; \pi(\theta')\,q(\theta; \theta')\big)$, which is symmetric in $\theta, \theta'$; the $\theta' = \theta$ case holds trivially.
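These identities can be checked numerically on a small discrete space; a sketch with an illustrative 3-state target and uniform (symmetric) proposal:

ppi = [0.2; 0.3; 0.5];                 % target distribution
q = ones(3)/3;                         % q(j,i): uniform proposal
T = zeros(3);
for i = 1:3
    for j = [1:i-1, i+1:3]
        T(j,i) = q(j,i) * min(1, (ppi(j)*q(i,j)) / (ppi(i)*q(j,i)));
    end
    T(i,i) = 1 - sum(T(:,i));          % rejected proposals stay put
end
DB = T .* repmat(ppi', 3, 1);          % DB(j,i) = T(j <- i) pi(i)
disp(max(max(abs(DB - DB'))));         % ~0: detailed balance holds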

http://www.kaggle.com/c/darkworlds
