Statistical Machine Learning Lecture 8: Markov Chain Monte Carlo Sampling


Melih Kandemir, Özyeğin University, İstanbul, Turkey

Monte Carlo Integration

The big question: evaluate

$$\mathbb{E}_{p(z)}[f(z)] = \int f(z)\, p(z)\, dz$$

Examples:

- Bayesian prediction: $p(z_{\text{new}} \mid \mathcal{D}) = \int p(z_{\text{new}} \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta = \mathbb{E}_{p(\theta \mid \mathcal{D})}[p(z_{\text{new}} \mid \theta)]$
- Difficult variational updates: $\log q(z_1) = \mathbb{E}_{p(z_2)}[\log p(z_1, z_2)] + \text{const}$
- Difficult E-step in EM: $Q(\theta, \theta^{\text{old}}) = \mathbb{E}_{p(z \mid \mathcal{D}, \theta^{\text{old}})}[\log p(z, \mathcal{D} \mid \theta)]$

Approximating the Integral by Samples

$$\mathbb{E}_{p(z)}[f(z)] = \int f(z)\, p(z)\, dz \approx \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)})$$

where the $z^{(l)}$ are samples drawn from $p(z)$. As long as the samples are drawn i.i.d. from the true $p(z)$, as few as 20 samples can be sufficient for a good approximation.
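As a minimal illustration (not part of the original slides), the estimator is a few lines of NumPy. The choices $f(z) = z^2$ and $p(z) = \mathcal{N}(0, 1)$ below are hypothetical, with exact expectation 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[f(z)] with f(z) = z**2 under p(z) = N(0, 1).
# The exact answer is 1, so the quality of the estimate is easy to check.
L = 20                          # number of i.i.d. samples
z = rng.standard_normal(L)      # z^(l) ~ p(z)
estimate = np.mean(z ** 2)      # (1/L) * sum_l f(z^(l))

print(f"Monte Carlo estimate: {estimate:.3f} (exact: 1.0)")
```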

Sampling from the Inverse CDF [Bishop, PRML, 2006]

- Draw $u \sim \text{Uniform}(0, 1)$.
- Calculate $y = h^{-1}(u)$, where $h$ is the CDF of the target distribution.

This works because $\Pr(h^{-1}(u) \le y) = \Pr(u \le h(y)) = h(y)$.

Problem: how do we compute $h^{-1}(u)$ for an arbitrary distribution?
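When the CDF does invert in closed form, the recipe is two lines. A sketch for a hypothetical $\text{Exponential}(\lambda)$ target, whose CDF $h(y) = 1 - e^{-\lambda y}$ gives $h^{-1}(u) = -\log(1 - u)/\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential(lam) has CDF h(y) = 1 - exp(-lam * y), which inverts in
# closed form: h^{-1}(u) = -log(1 - u) / lam.
lam = 2.0
u = rng.uniform(0.0, 1.0, size=10_000)   # u ~ Uniform(0, 1)
y = -np.log(1.0 - u) / lam               # y = h^{-1}(u) ~ Exponential(lam)

print(f"sample mean: {y.mean():.3f} (exact: {1 / lam:.3f})")
```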

Rejection Sampling [Bishop, PRML, 2006]

Target distribution $p(z)$ and envelope distribution $q(z)$, scaled so that $kq(z) \ge p(z)$ everywhere.

Procedure:

- $z^{(t)} \sim q(z)$
- $u^{(t)} \sim \text{Uniform}(0, kq(z^{(t)}))$
- Accept the sample if $u^{(t)} \le p(z^{(t)})$.

$$p(\text{accept}) = \int \frac{p(z)}{kq(z)}\, q(z)\, dz = \frac{1}{k} \int p(z)\, dz$$
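A sketch of the procedure for a hypothetical Beta(2, 2) target with a uniform envelope and $k = 1.5$ (the maximum of the target density), so that $kq(z) \ge p(z)$ on all of $[0, 1]$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target: Beta(2, 2), p(z) = 6 z (1 - z) on [0, 1].
# Envelope: q(z) = Uniform(0, 1) with k = 1.5, so k*q(z) >= p(z) everywhere.
def p(z):
    return 6.0 * z * (1.0 - z)

k = 1.5
samples = []
while len(samples) < 5_000:
    z = rng.uniform(0.0, 1.0)              # z ~ q(z)
    u = rng.uniform(0.0, k * 1.0)          # u ~ Uniform(0, k*q(z)), q(z) = 1
    if u <= p(z):                          # accept if u falls under p(z)
        samples.append(z)

samples = np.array(samples)
print(f"mean: {samples.mean():.3f} (exact: 0.5)")
```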

Adaptive Rejection Sampling [Bishop, PRML, 2006]

The envelope function is a set of piecewise exponential functions:

$$q(z) = k_i \lambda_i \exp\{-\lambda_i (z - z_{i-1})\}, \qquad z_{i-1} < z \le z_i$$

Each rejected sample is added as a new grid point, tightening the envelope.

The acceptance rate decays exponentially with dimensionality!

Importance Sampling (1)

$$\mathbb{E}_{p(z)}[f(z)] = \int f(z)\, p(z)\, dz = \int f(z)\, \frac{p(z)}{q(z)}\, q(z)\, dz$$

Draw $L$ samples from $q(z)$. Then,

$$\mathbb{E}_{p(z)}[f(z)] \approx \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)}) \underbrace{\frac{p(z^{(l)})}{q(z^{(l)})}}_{\text{importance weight}}$$
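A sketch with hypothetical choices: target $p = \mathcal{N}(0, 1)$, a broader proposal $q = \mathcal{N}(0, 2^2)$, and $f(z) = z^2$, for which the exact expectation is 1:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Estimate E_p[z^2] for p = N(0, 1) using a broader proposal q = N(0, 2^2),
# from which sampling is assumed to be easy.
L = 10_000
z = rng.normal(0.0, 2.0, size=L)                   # z^(l) ~ q(z)
w = norm.pdf(z, 0.0, 1.0) / norm.pdf(z, 0.0, 2.0)  # importance weights p/q
estimate = np.mean(w * z ** 2)                     # weighted Monte Carlo sum

print(f"importance-sampling estimate: {estimate:.3f} (exact: 1.0)")
```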

Importance Sampling (2)

- (+) All samples are retained.
- (-) Strongly dependent on how similar $q(z)$ is to $p(z)$.
- (-) No diagnostic measures available!

Markov Chain Monte Carlo

- Robust to high dimensionality.
- Samples form a Markov chain with a transition function $T(z' \mid z)$.
- Samples are drawn from the target distribution $p(z)$ if
  - $p(z)$ is invariant w.r.t. $T(z' \mid z)$: $p(z') = \int p(z)\, T(z' \mid z)\, dz$, and
  - the Markov chain governed by $T(z' \mid z)$ is ergodic.

Invariance: ensured by detailed balance, $p(z)\, T(z' \mid z) = p(z')\, T(z \mid z')$.

Ergodicity: more tricky; it is imposed by the sampling algorithms.

Metropolis-Hastings

Procedure:

- Propose the next state from $Q(z' \mid z)$, e.g. $\mathcal{N}(z' \mid z, \sigma^2)$.
- Accept with probability $\min\left(1, \frac{p(z')\, Q(z \mid z')}{p(z)\, Q(z' \mid z)}\right)$.
- Otherwise stay at the current state (add another copy of it to the list of samples).

The proposal variance $\sigma^2$ is very influential:

- It determines the step size.
- If too large: low acceptance rate.
- If too small: slow convergence.
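A minimal random-walk sketch for a hypothetical unnormalized 1-D target $p(z) \propto \exp(-z^4/4)$; the Gaussian proposal is symmetric, so the $Q$ terms cancel and the algorithm reduces to plain Metropolis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target known only up to a constant: p(z) ∝ exp(-z^4 / 4).
def log_p(z):
    return -z ** 4 / 4.0

sigma = 1.0                      # proposal std; tune for a good acceptance rate
z = 0.0                          # initial state
chain = []
for _ in range(20_000):
    z_new = rng.normal(z, sigma)                 # propose from N(z, sigma^2)
    if np.log(rng.uniform()) < log_p(z_new) - log_p(z):
        z = z_new                                # accept the proposal
    chain.append(z)                              # otherwise keep a copy of z

chain = np.array(chain)
print(f"acceptance rate ≈ {np.mean(np.diff(chain) != 0):.2f}")
```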

Metropolis-Hastings (2)

Detailed balance holds:

$$\begin{aligned}
p(z)\, T(z' \mid z) &= p(z)\, Q(z' \mid z) \min\left(1, \frac{p(z')\, Q(z \mid z')}{p(z)\, Q(z' \mid z)}\right) \\
&= \min\left(p(z)\, Q(z' \mid z),\; p(z')\, Q(z \mid z')\right) \\
&= p(z')\, Q(z \mid z') \min\left(\frac{p(z)\, Q(z' \mid z)}{p(z')\, Q(z \mid z')},\; 1\right) \\
&= p(z')\, T(z \mid z')
\end{aligned}$$

Metropolis-Hastings (3)

1-D demo [Murray, MLSS, 2009] (figure).

Gibbs Sampling

Procedure:

- Initialize $z_1^{(1)}, z_2^{(1)}, z_3^{(1)}$.
- For $l = 1$ to $L - 1$:
  - $z_1^{(l+1)} \sim p(z_1 \mid z_2^{(l)}, z_3^{(l)})$
  - $z_2^{(l+1)} \sim p(z_2 \mid z_1^{(l+1)}, z_3^{(l)})$
  - $z_3^{(l+1)} \sim p(z_3 \mid z_1^{(l+1)}, z_2^{(l+1)})$
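A sketch for a hypothetical bivariate Gaussian with zero means, unit variances, and correlation $\rho$, for which both full conditionals are available in closed form as 1-D Gaussians:

```python
import numpy as np

rng = np.random.default_rng(0)

# For a standard bivariate Gaussian with correlation rho, the full
# conditionals are p(z1 | z2) = N(rho * z2, 1 - rho^2), and symmetrically.
rho = 0.8
z1, z2 = 0.0, 0.0
chain = np.empty((10_000, 2))
for l in range(chain.shape[0]):
    z1 = rng.normal(rho * z2, np.sqrt(1 - rho ** 2))  # z1 ~ p(z1 | z2)
    z2 = rng.normal(rho * z1, np.sqrt(1 - rho ** 2))  # z2 ~ p(z2 | z1)
    chain[l] = z1, z2

print(f"empirical correlation: {np.corrcoef(chain.T)[0, 1]:.3f} (target: {rho})")
```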

Gibbs Sampling (2)

Invariance: all conditioned variables stay fixed by definition, and the remaining variable is sampled from its true conditional distribution.

Ergodicity: guaranteed if all conditional probabilities are non-zero over their entire domain.

Gibbs sampling is a special case of Metropolis-Hastings with proposal $q_k(z' \mid z) = p(z'_k \mid z_{\setminus k})$; since $z'_{\setminus k} = z_{\setminus k}$,

$$A(z' \mid z) = \frac{p(z'_k \mid z'_{\setminus k})\, p(z'_{\setminus k})\, p(z_k \mid z'_{\setminus k})}{p(z_k \mid z_{\setminus k})\, p(z_{\setminus k})\, p(z'_k \mid z_{\setminus k})} = 1$$

Hence, all samples are accepted.

Gibbs Sampling (3) [Bishop, PRML, 2006]

- The step size is governed by the covariances of the conditional distributions.
- Iterated conditional modes (ICM): instead of sampling, update each variable with a point estimate (e.g. mean or mode) of its conditional.

Collapsed Gibbs Sampling

Integrating out some of the variables may render others conditionally independent, which leads to faster convergence.

Rao-Blackwell theorem: let $z$ and $\theta$ be dependent variables, and $f(z, \theta)$ some scalar function. Then,

$$\mathrm{var}_{z,\theta}[f(z, \theta)] \ge \mathrm{var}_z\left[\mathbb{E}_\theta[f(z, \theta) \mid z]\right]$$

Example: Gaussian Mixture Model [Murphy, Mach. Learn., 2012]

Employ conjugate priors on:

- cluster means,
- cluster covariances,
- mixture probabilities.

Then integrate them out!

Implementation Tricks

- Thinning: keep only every K-th sample to decorrelate the chain (a sketch follows below).
- Burn-in: discard the first part of the samples (e.g. the first half), which were drawn before the chain had mixed.
- Multiple runs: neutralize the effect of initialization.
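A sketch of both post-processing steps on a stand-in chain (the array contents are illustrative only):

```python
import numpy as np

# Burn-in and thinning applied to a raw chain. `chain` is assumed to be a
# 1-D array of correlated MCMC samples; here a random walk stands in for it.
chain = np.cumsum(np.random.default_rng(0).normal(size=2_000)) * 0.01

burn_in = len(chain) // 2        # discard the first half
K = 10                           # keep only every K-th remaining sample
kept = chain[burn_in::K]

print(f"{len(chain)} raw samples -> {len(kept)} after burn-in and thinning")
```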

Diagnosing Convergence 1: Traceplots

Diagnosing Convergence 2: Running Mean Plots
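Both diagnostics (the traceplot above and the running mean here) take a few lines of matplotlib; the chain below is a stand-in, not output from a real sampler. After mixing, the traceplot should look like stationary noise and the running mean should flatten out:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
chain = np.cumsum(rng.normal(size=5_000)) * 0.02 + rng.normal(size=5_000)

# Running mean of the chain up to each iteration
running_mean = np.cumsum(chain) / np.arange(1, len(chain) + 1)

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(chain, lw=0.5)
ax1.set_ylabel("traceplot")
ax2.plot(running_mean)
ax2.set_ylabel("running mean")
ax2.set_xlabel("iteration")
plt.show()
```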

Diagnosing Convergence 3: Gelman-Rubin Metric

- Calculate the within-chain variance $W$ and the between-chain variance $B$.
- Calculate the estimated variance $\widehat{\mathrm{Var}}(\theta) = (1 - 1/n)\, W + (1/n)\, B$.
- Calculate and monitor the potential scale reduction factor (PSRF):

$$\hat{R} = \sqrt{\frac{\widehat{\mathrm{Var}}(\theta)}{W}}$$

$\hat{R}$ should shrink toward 1 as the chains converge (see the sketch below).
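A sketch of the computation, assuming $m$ chains of length $n$ stored row-wise in an array (the function name `psrf` and the test data are illustrative):

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor for an (m, n) array of m chains."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    var_hat = (1 - 1 / n) * W + (1 / n) * B      # pooled variance estimate
    return np.sqrt(var_hat / W)

# Sanity check: independent draws from the same distribution should give
# a value close to 1.
rng = np.random.default_rng(0)
chains = rng.standard_normal((4, 1_000))
print(f"R_hat = {psrf(chains):.3f} (close to 1 => converged)")
```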

Diagnosing Convergence 4: Other Metrics

- Geweke diagnostic: take the first $x$ and the last $y$ samples of the chain and test whether they come from the same distribution.
- Raftery-Lewis diagnostic: calculate the number of iterations needed to reach a desired level of accuracy for a posterior quantile.
- Heidelberger-Welch diagnostic: repeated significance testing of the stationarity hypothesis.

Example: Bayesian Logistic Regression

$$\begin{aligned}
p(f_i \mid w, x_i) &= \mathcal{N}(f_i \mid w^T x_i, \sigma^2), & i &= 1, \ldots, N \\
p(y_i \mid f_i) &= \frac{1}{1 + e^{-y_i f_i}}, & i &= 1, \ldots, N \\
p(w_d \mid \alpha_d) &= \mathcal{N}(w_d \mid 0, \alpha_d^{-1}), & d &= 1, \ldots, D \\
p(\alpha_d) &= \mathcal{G}(\alpha_d \mid a, b), & d &= 1, \ldots, D
\end{aligned}$$

Let's aim for a Gibbs sampler

We require the following conditional distributions:

$$p(w \mid f, \alpha, X, y), \quad (1)$$
$$p(\alpha \mid w, f, X, y), \quad (2)$$
$$p(f \mid w, \alpha, X, y). \quad (3)$$

The log joint

$$\begin{aligned}
\log p(w, f, \alpha, X, y) &= \sum_{i=1}^{N} \log p(f_i \mid w, x_i) + \sum_{i=1}^{N} \log p(y_i \mid f_i) + \sum_{d=1}^{D} \log p(w_d \mid \alpha_d) + \sum_{d=1}^{D} \log p(\alpha_d) \\
&= -\frac{1}{2} \log |\sigma^2 I| - \frac{1}{2\sigma^2} (f - Xw)^T (f - Xw) - \sum_{i=1}^{N} \log(1 + e^{-y_i f_i}) \\
&\quad + \frac{1}{2} \sum_{d=1}^{D} \log \alpha_d - \frac{1}{2} w^T A w + \sum_{d=1}^{D} (a - 1) \log \alpha_d - \sum_{d=1}^{D} b \alpha_d + \text{const}
\end{aligned}$$

where $A_{dd} = \alpha_d$ and $A_{ij} = 0$ for $i \ne j$.

The conditionals

$$p(\alpha_d \mid \alpha_{\setminus d}, w, f, X, y) = \mathcal{G}\left(\alpha_d \,\Big|\, a + \frac{1}{2},\; b + \frac{1}{2} w_d^2\right)$$

$$p(w \mid f, \alpha, X, y) = \mathcal{N}\left(w \,\Big|\, \left(X^T X + A\right)^{-1} X^T f,\; \left(X^T X + A\right)^{-1}\right)$$

$p(f \mid w, \alpha, X, y)$ has no closed form: use a Metropolis step with proposal $q(f_i) = \mathcal{N}(f_i \mid w^T x_i, \sigma^2)$.
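To make the last step concrete, here is a minimal sketch of the Metropolis-within-Gibbs update for $f$ under the proposal above; $X$, $y$, $w$, $\sigma$, and all sizes are illustrative stand-ins, not values from the lecture. Because the proposal equals the conditional prior $p(f_i \mid w, x_i)$, the acceptance ratio reduces to the likelihood ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in problem setup (not from the lecture)
N, D = 50, 3
X = rng.standard_normal((N, D))
w = rng.standard_normal(D)
y = np.sign(X @ w + rng.standard_normal(N))   # labels in {-1, +1}
sigma = 1.0
f = X @ w                                     # initialize f at its prior mean

def log_lik(f_vals, y_vals):
    # log p(y_i | f_i) = -log(1 + exp(-y_i * f_i))
    return -np.log1p(np.exp(-y_vals * f_vals))

for _ in range(100):
    # Proposal q(f_i) = N(f_i | w^T x_i, sigma^2) equals the conditional prior
    # p(f_i | w, x_i), so the prior and proposal terms cancel in the
    # acceptance ratio and only the likelihood ratio remains.
    f_prop = rng.normal(X @ w, sigma)
    accept = np.log(rng.uniform(size=N)) < log_lik(f_prop, y) - log_lik(f, y)
    f = np.where(accept, f_prop, f)           # accept or keep, per component
```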

Useful References

- Robert and Casella, Monte Carlo Statistical Methods, 2004
- Bishop, Pattern Recognition and Machine Learning, 2006, Ch. 11
- Murphy, Machine Learning: A Probabilistic Perspective, 2012
- Gelman et al., Bayesian Data Analysis, 2013
- Murray, Markov Chain Monte Carlo, MLSS, 2009