Lecture 8: The Metropolis-Hastings Algorithm


30.10.2008

What we have seen last time: the Gibbs sampler. Key idea: generate a Markov chain by updating the components of $(X_1, \dots, X_p)$ in turn, drawing from the full conditionals:
$$X_j^{(t)} \sim f_{X_j \mid X_{-j}}\left(\cdot \mid X_1^{(t)}, \dots, X_{j-1}^{(t)}, X_{j+1}^{(t-1)}, \dots, X_p^{(t-1)}\right)$$
Two drawbacks:
- It requires that it is possible / easy to sample from the full conditionals.
- It can yield a slowly mixing chain if (some of) the components of $(X_1, \dots, X_p)$ are highly correlated.

What we will see today: the Metropolis-Hastings algorithm. Key idea: use a rejection mechanism with a local proposal: we let the newly proposed $X$ depend on the previous state of the chain, $X^{(t-1)}$. The samples $(X^{(0)}, X^{(1)}, \dots)$ form a Markov chain (like the Gibbs sampler).

The Metropolis-Hastings algorithm

Algorithm 5.1: Metropolis-Hastings. Starting with $X^{(0)} := (X_1^{(0)}, \dots, X_p^{(0)})$ iterate for $t = 1, 2, \dots$
1. Draw $X \sim q(\cdot \mid X^{(t-1)})$.
2. Compute
$$\alpha(X \mid X^{(t-1)}) = \min\left\{1, \frac{f(X)\, q(X^{(t-1)} \mid X)}{f(X^{(t-1)})\, q(X \mid X^{(t-1)})}\right\}.$$
3. With probability $\alpha(X \mid X^{(t-1)})$ set $X^{(t)} = X$, otherwise set $X^{(t)} = X^{(t-1)}$.
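To make the three steps concrete, here is a minimal Python sketch of Algorithm 5.1 (not part of the original slides; the function and argument names are my own). It works on the log scale for numerical stability and only needs $f$ up to a normalising constant, anticipating the remark two slides below:

```python
import numpy as np

def metropolis_hastings(log_f, propose, log_q, x0, n_iter, rng=None):
    """Generic Metropolis-Hastings sampler (Algorithm 5.1).

    log_f   : log of the (possibly unnormalised) target density f
    propose : propose(x, rng) draws X ~ q(. | x) given the current state x
    log_q   : log_q(a, b) = log q(a | b), the proposal density
    x0      : starting value X^(0)
    """
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    chain = np.empty((n_iter + 1,) + x.shape)
    chain[0] = x
    for t in range(1, n_iter + 1):
        x_new = propose(x, rng)
        # log of the Hastings ratio: f(X) q(x | X) / (f(x) q(X | x))
        log_alpha = (log_f(x_new) + log_q(x, x_new)
                     - log_f(x) - log_q(x_new, x))
        if np.log(rng.uniform()) < log_alpha:  # accept w.p. min{1, ratio}
            x = x_new
        chain[t] = x                           # on rejection, X^(t) = X^(t-1)
    return chain
```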

Illustration of the Metropolis-Hastings method

[Figure: trace of the chain in the $(X_1^{(t)}, X_2^{(t)})$-plane. Rejected proposals leave the chain in place, e.g. $x^{(3)} = x^{(4)} = \dots = x^{(7)}$ and $x^{(11)} = x^{(12)} = x^{(13)} = x^{(14)}$, while accepted moves jump to new points.]

Basic properties of the Metropolis-Hastings algorithm

The probability that a newly proposed value is accepted given $X^{(t-1)} = x^{(t-1)}$ is
$$a(x^{(t-1)}) = \int \alpha(x \mid x^{(t-1)})\, q(x \mid x^{(t-1)})\, dx.$$
The probability of remaining in state $X^{(t-1)}$ is
$$P(X^{(t)} = X^{(t-1)} \mid X^{(t-1)} = x^{(t-1)}) = 1 - a(x^{(t-1)}).$$
The probability of acceptance does not depend on the normalisation constant: if $f(x) = C \pi(x)$, then
$$\alpha(x \mid x^{(t-1)}) = \min\left\{1, \frac{\pi(x)\, q(x^{(t-1)} \mid x)}{\pi(x^{(t-1)})\, q(x \mid x^{(t-1)})}\right\}.$$

The Metropolis-Hastings transition kernel

Lemma 5.1. The transition kernel of the Metropolis-Hastings algorithm is
$$K(x^{(t-1)}, x^{(t)}) = \alpha(x^{(t)} \mid x^{(t-1)})\, q(x^{(t)} \mid x^{(t-1)}) + \left(1 - a(x^{(t-1)})\right) \delta_{x^{(t-1)}}(x^{(t)}),$$
where $\delta_{x^{(t-1)}}(\cdot)$ denotes the Dirac mass on $\{x^{(t-1)}\}$.

Theoretical properties

Proposition 5.1. The Metropolis-Hastings kernel satisfies the detailed balance condition
$$K(x^{(t-1)}, x^{(t)})\, f(x^{(t-1)}) = K(x^{(t)}, x^{(t-1)})\, f(x^{(t)}).$$
Thus $f(x)$ is the invariant distribution of the Markov chain $(X^{(0)}, X^{(1)}, \dots)$. Furthermore, the Markov chain is reversible.
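To see why the proposition holds (a one-line check, not spelled out on the slide): for $x \neq x'$ the Dirac part of the kernel contributes nothing, and
$$\alpha(x' \mid x)\, q(x' \mid x)\, f(x) = \min\left\{1, \frac{f(x')\, q(x \mid x')}{f(x)\, q(x' \mid x)}\right\} q(x' \mid x)\, f(x) = \min\left\{ f(x)\, q(x' \mid x),\; f(x')\, q(x \mid x') \right\},$$
which is symmetric in $x$ and $x'$; the case $x = x'$ is trivially symmetric.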

Example 5.1: Reducible Metropolis-Hastings

Consider the target distribution
$$f(x) = \left(\mathbb{I}_{[0,1]}(x) + \mathbb{I}_{[2,3]}(x)\right)/2$$
and the proposal distribution $q(\cdot \mid x^{(t-1)})$:
$$X \mid X^{(t-1)} = x^{(t-1)} \sim U\left[x^{(t-1)} - \delta,\, x^{(t-1)} + \delta\right].$$
[Figure: the target $f(\cdot)$, of height $1/2$ on $[0,1]$ and $[2,3]$, together with the uniform proposal $q(\cdot \mid x^{(t-1)})$ of height $1/(2\delta)$ centred at $x^{(t-1)}$.]
The chain is reducible if $\delta \le 1$: the chain stays either in $[0,1]$ or in $[2,3]$.

Further theoretical properties

The Markov chain $(X^{(0)}, X^{(1)}, \dots)$ is irreducible if $q(x \mid x^{(t-1)}) > 0$ for all $x, x^{(t-1)} \in \operatorname{supp}(f)$: every state can be reached in a single step (strong irreducibility). Less strict conditions can be obtained; see e.g. Roberts & Tweedie (1996).
The chain is Harris recurrent if it is irreducible (see e.g. Tierney, 1994).
The chain is aperiodic if there is positive probability that the chain remains in the current state, i.e. $P(X^{(t)} = X^{(t-1)}) > 0$.

An ergodic theorem

Theorem 5.1. If the Markov chain generated by the Metropolis-Hastings algorithm is irreducible, then for any integrable function $h : E \to \mathbb{R}$,
$$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} h(X^{(t)}) = \mathbb{E}_f(h(X))$$
for every starting value $X^{(0)}$.
Interpretation: we can approximate expectations by their empirical counterparts using a single Markov chain.
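As a self-contained illustration of the theorem (my own sketch, not from the lecture): the ergodic average of $h(x) = x$ along a single random-walk chain targeting $N(0,1)$ settles near $\mathbb{E}_f(X) = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
x, chain = 0.0, []
for t in range(50_000):
    x_new = x + rng.normal()                  # symmetric N(0,1) random-walk proposal
    # log f-ratio for the N(0,1) target: log f(x') - log f(x) = -(x'^2 - x^2)/2
    if np.log(rng.uniform()) < -0.5 * (x_new ** 2 - x ** 2):
        x = x_new
    chain.append(x)
print(np.mean(chain))                          # ergodic average of h(x) = x; close to 0
```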

5.3 Random-walk Metropolis
5.4 Choosing the proposal distribution

Random-walk Metropolis: Idea

In the Metropolis-Hastings algorithm the proposal is drawn from $X \sim q(\cdot \mid X^{(t-1)})$. A popular choice is $q(x \mid x^{(t-1)}) = g(x - x^{(t-1)})$ with $g$ a symmetric density, i.e.
$$X = X^{(t-1)} + \epsilon, \qquad \epsilon \sim g.$$
The probability of acceptance then becomes
$$\min\left\{1, \frac{f(X)\, g(X^{(t-1)} - X)}{f(X^{(t-1)})\, g(X - X^{(t-1)})}\right\} = \min\left\{1, \frac{f(X)}{f(X^{(t-1)})}\right\},$$
since $g(X - X^{(t-1)}) = g(X^{(t-1)} - X)$ by symmetry. We accept
- every move to a more probable state with probability 1;
- moves to less probable states with probability $f(X)/f(X^{(t-1)}) < 1$.

Random-walk Metropolis: Algorithm

Starting with $X^{(0)} := (X_1^{(0)}, \dots, X_p^{(0)})$ and using a symmetric random-walk proposal $g$, iterate for $t = 1, 2, \dots$
1. Draw $\epsilon \sim g$ and set $X = X^{(t-1)} + \epsilon$.
2. Compute
$$\alpha(X \mid X^{(t-1)}) = \min\left\{1, \frac{f(X)}{f(X^{(t-1)})}\right\}.$$
3. With probability $\alpha(X \mid X^{(t-1)})$ set $X^{(t)} = X$, otherwise set $X^{(t)} = X^{(t-1)}$.
Popular choices for $g$ are (multivariate) Gaussians or t-distributions (the latter having heavier tails).
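A minimal sketch of this specialization for a multivariate Gaussian $g$ (again my own code, not the lecture's); note how the Hastings correction drops out, so only the unnormalised log target is needed:

```python
import numpy as np

def random_walk_metropolis(log_f, Sigma, x0, n_iter, rng=None):
    """Random-walk Metropolis with Gaussian increments eps ~ N(0, Sigma).

    Because the proposal is symmetric, the Hastings ratio reduces to
    f(X) / f(x^(t-1)), evaluated here on the log scale.
    """
    rng = rng or np.random.default_rng()
    L = np.linalg.cholesky(np.atleast_2d(Sigma))   # draw eps = L @ z, z ~ N(0, I)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_iter + 1, x.size))
    chain[0] = x
    for t in range(1, n_iter + 1):
        x_new = x + L @ rng.normal(size=x.size)
        if np.log(rng.uniform()) < log_f(x_new) - log_f(x):
            x = x_new
        chain[t] = x
    return chain
```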

Example 5.2: Bayesian probit model (1)

Medical study on infections resulting from birth by Cesarean section. Three influence factors:
- indicator of whether the Cesarean was planned or not ($z_{i1}$),
- indicator of whether additional risk factors were present at the time of birth ($z_{i2}$),
- indicator of whether antibiotics were given as a prophylaxis ($z_{i3}$).
Response variable: the number of infections $Y_i$ observed amongst $n_i$ patients having the same covariates.

infection y_i   total n_i   planned z_i1   risk factors z_i2   antibiotics z_i3
11              98          1              1                   1
1               18          0              1                   1
0               2           0              0                   1
23              26          1              1                   0
28              58          0              1                   0
0               9           1              0                   0
8               40          0              0                   0

Example 5.2: Bayesian probit model (2)

Model for $Y_i$:
$$Y_i \sim \mathrm{Bin}(n_i, \pi_i), \qquad \pi_i = \Phi(z_i^T \beta),$$
where $z_i = [1, z_{i1}, z_{i2}, z_{i3}]^T$ and $\Phi(\cdot)$ is the CDF of a $N(0,1)$ distribution.
Prior on the parameter of interest $\beta$: $\beta \sim N(0, I/\lambda)$.
The posterior density of $\beta$ is
$$f(\beta \mid y_1, \dots, y_n) \propto \left(\prod_{i=1}^{N} \Phi(z_i^T \beta)^{y_i} \left(1 - \Phi(z_i^T \beta)\right)^{n_i - y_i}\right) \exp\left(-\frac{\lambda}{2} \sum_{j=0}^{3} \beta_j^2\right).$$

Example 5.2: Bayesian probit model (3)

We use the following random-walk Metropolis algorithm (50,000 samples). Starting with any $\beta^{(0)}$ iterate for $t = 1, 2, \dots$:
1. Draw $\epsilon \sim N(0, \Sigma)$ and set $\beta = \beta^{(t-1)} + \epsilon$.
2. Compute
$$\alpha(\beta \mid \beta^{(t-1)}) = \min\left\{1, \frac{f(\beta \mid Y_1, \dots, Y_n)}{f(\beta^{(t-1)} \mid Y_1, \dots, Y_n)}\right\}.$$
3. With probability $\alpha(\beta \mid \beta^{(t-1)})$ set $\beta^{(t)} = \beta$, otherwise set $\beta^{(t)} = \beta^{(t-1)}$.
(For the moment we use $\Sigma = 0.08 \, I$ and $\lambda = 10$.)
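A self-contained Python sketch of this sampler (my code, not the lecture's), with the data taken from the table on slide (1):

```python
import numpy as np
from scipy.stats import norm

# Data from Example 5.2: infections y_i out of n_i births, covariates (1, z_i1, z_i2, z_i3).
y = np.array([11, 1, 0, 23, 28, 0, 8], dtype=float)
n = np.array([98, 18, 2, 26, 58, 9, 40], dtype=float)
Z = np.array([[1, 1, 1, 1], [1, 0, 1, 1], [1, 0, 0, 1], [1, 1, 1, 0],
              [1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]], dtype=float)
lam = 10.0

def log_posterior(beta):
    """Log of f(beta | y_1, ..., y_n) up to an additive constant."""
    p = norm.cdf(Z @ beta)
    p = np.clip(p, 1e-12, 1 - 1e-12)           # guard against log(0)
    return (y * np.log(p) + (n - y) * np.log1p(-p)).sum() - 0.5 * lam * (beta ** 2).sum()

rng = np.random.default_rng(1)
beta = np.zeros(4)                              # any starting value beta^(0)
L = np.linalg.cholesky(0.08 * np.eye(4))        # Sigma = 0.08 I
chain, accepted = [beta], 0
for t in range(50_000):
    prop = beta + L @ rng.normal(size=4)
    if np.log(rng.uniform()) < log_posterior(prop) - log_posterior(beta):
        beta, accepted = prop, accepted + 1
    chain.append(beta)
chain = np.array(chain)
print("acceptance rate:", accepted / 50_000)    # roughly 14% (the lecture reports 13.9%)
print("posterior means:", chain[10_000:].mean(axis=0))  # burn-in choice is mine; cf. slide (7)
```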

Example 5.2: Bayesian probit model (4)

Convergence of the $\beta_j^{(t)}$ is to a distribution, not a value!

Example 5.2: Bayesian probit model (5)

Convergence of the cumulative averages $\frac{1}{t} \sum_{\tau=1}^{t} \beta_j^{(\tau)}$ is to a value.

Example 5.2: Bayesian probit model (6)

Example 5.2: Bayesian probit model (7)

                         Posterior mean   95% credible interval
intercept      β_0       -1.0952          (-1.4646, -0.7333)
planned        β_1        0.6201          ( 0.2029,  1.0413)
risk factors   β_2        1.2000          ( 0.7783,  1.6296)
antibiotics    β_3       -1.8993          (-2.3636, -1.471)

Choosing a good proposal distribution

Ideally: a Markov chain with small correlation $\rho(X^{(t-1)}, X^{(t)})$ between subsequent values, i.e. fast exploration of the support of the target $f$.
Two sources of this correlation:
- the correlation between the current state $X^{(t-1)}$ and the newly proposed value $X \sim q(\cdot \mid X^{(t-1)})$ (can be reduced using a proposal with high variance);
- the correlation introduced by retaining a value, $X^{(t)} = X^{(t-1)}$, because the newly generated value $X$ has been rejected (can be reduced using a proposal with small variance).
Trade-off: the ideal compromise balances
- fast exploration of the space (good mixing behaviour), and
- obtaining a large probability of acceptance.
For multivariate distributions: the covariance of the proposal should reflect the covariance structure of the target.

Example 5.3: Choice of proposal (1)

Target distribution we want to sample from: $N(0,1)$ (i.e. $f(\cdot) = \phi_{(0,1)}(\cdot)$).
We use a random-walk Metropolis algorithm with $\epsilon \sim N(0, \sigma^2)$.
What is the optimal choice of $\sigma^2$? We consider four choices: $\sigma^2 = 0.1^2,\ 1,\ 2.38^2,\ 10^2$.
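A sketch reproducing this experiment (my code; the numbers will differ slightly from the table on slide (4), which also reports Monte Carlo confidence intervals):

```python
import numpy as np

def rwm_normal_target(sigma, n_iter=100_000, seed=0):
    """Random-walk Metropolis targeting N(0,1) with proposal eps ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    x, xs, acc = 0.0, np.empty(n_iter), 0
    for t in range(n_iter):
        x_new = x + rng.normal(0.0, sigma)
        if np.log(rng.uniform()) < -0.5 * (x_new ** 2 - x ** 2):
            x, acc = x_new, acc + 1
        xs[t] = x
    return xs, acc / n_iter

for s2 in (0.1 ** 2, 1.0, 2.38 ** 2, 10.0 ** 2):
    xs, rate = rwm_normal_target(np.sqrt(s2))
    lag1 = np.corrcoef(xs[:-1], xs[1:])[0, 1]   # autocorrelation rho(X^(t-1), X^(t))
    print(f"sigma^2 = {s2:7.2f}: acceptance {rate:.3f}, lag-1 autocorrelation {lag1:.3f}")
```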

Example 5.3: Choice of proposal (2)

[Figure: sample paths of the chain for $\sigma^2 = 0.1^2$ and $\sigma^2 = 1$.]

Example 5.3: Choice of proposal (3)

[Figure: sample paths of the chain for $\sigma^2 = 2.38^2$ and $\sigma^2 = 10^2$.]

Example 5.3: Choice of proposal (4)

             Autocorrelation ρ(X^(t-1), X^(t))   Probability of acceptance α(X, X^(t-1))
             Mean     95% CI                     Mean     95% CI
σ² = 0.1²    0.9901   (0.9891, 0.9910)           0.9694   (0.9677, 0.9710)
σ² = 1       0.7733   (0.7676, 0.7791)           0.7038   (0.7014, 0.7061)
σ² = 2.38²   0.6225   (0.6162, 0.6289)           0.4426   (0.4401, 0.4452)
σ² = 10²     0.8360   (0.8303, 0.8418)           0.1255   (0.1237, 0.1274)

This suggests that the optimal choice is $\sigma^2 = 2.38^2$, which is larger than 1.

Example 5.4: Bayesian probit model (revisited)

So far we used $\operatorname{Var}(\epsilon) = 0.08 \cdot I$. A better choice: let $\operatorname{Var}(\epsilon)$ reflect the covariance structure of the target. Frequentist asymptotic theory gives $\operatorname{Var}(\hat\beta_{\mathrm{m.l.e.}}) = (Z^T D Z)^{-1}$, where $D$ is a suitable diagonal matrix. Better choice: $\operatorname{Var}(\epsilon) = 2 \cdot (Z^T D Z)^{-1}$. This increases the rate of acceptance from 13.9% to 20.0% and reduces the autocorrelation:

Σ = 0.08 · I                             β_0      β_1      β_2      β_3
Autocorrelation ρ(β_j^(t-1), β_j^(t))    0.9496   0.9503   0.9562   0.9532

Σ = 2 · (Z^T D Z)^{-1}                   β_0      β_1      β_2      β_3
Autocorrelation ρ(β_j^(t-1), β_j^(t))    0.8726   0.8765   0.8741   0.8792

(In this example $\det(0.08 \cdot I) = \det(2 \cdot (Z^T D Z)^{-1})$.)
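The slide leaves $D$ unspecified. A common choice, which the following sketch assumes (an assumption on my part, not stated in the lecture), is the probit Fisher-information weight $d_i = n_i\, \phi(z_i^T \beta)^2 / [\Phi(z_i^T \beta)(1 - \Phi(z_i^T \beta))]$, evaluated at a plug-in estimate of $\beta$; here the posterior means from slide (7) are used:

```python
import numpy as np
from scipy.stats import norm

# Design matrix Z (intercept + three indicators) and group sizes from Example 5.2.
Z = np.array([[1, 1, 1, 1], [1, 0, 1, 1], [1, 0, 0, 1], [1, 1, 1, 0],
              [1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]], dtype=float)
n = np.array([98, 18, 2, 26, 58, 9, 40], dtype=float)

# Plug-in estimate of beta: the posterior means reported on slide (7).
beta_hat = np.array([-1.0952, 0.6201, 1.2000, -1.8993])

# Assumed probit weights: d_i = n_i * phi(eta_i)^2 / (Phi(eta_i) * (1 - Phi(eta_i))).
eta = Z @ beta_hat
p = norm.cdf(eta)
d = n * norm.pdf(eta) ** 2 / (p * (1 - p))

Sigma = 2.0 * np.linalg.inv(Z.T @ (d[:, None] * Z))  # proposal covariance Var(eps)
print(Sigma)
```

This `Sigma` would then replace `0.08 * np.eye(4)` in the sampler sketched under slide (3).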