Metropolis-Hastings. Rebecca C. Steorts. Bayesian Methods and Modern Statistics: STA 360/601. Module 9


The Metropolis-Hastings algorithm is a general term for a family of Markov chain simulation methods that are useful for drawing samples from Bayesian posterior distributions. The Gibbs sampler can be viewed as a special case of Metropolis-Hastings (as we will soon see). Here, we review the basic Metropolis algorithm and its generalization to the Metropolis-Hastings algorithm, which is often useful in applications (and has many extensions).

The setup. Suppose we can sample from $p(\theta \mid y)$. Then we could generate $\theta^{(1)}, \dots, \theta^{(S)} \stackrel{iid}{\sim} p(\theta \mid y)$ and obtain Monte Carlo approximations of posterior quantities:
$$E[g(\theta) \mid y] \approx \frac{1}{S} \sum_{s=1}^{S} g(\theta^{(s)}).$$
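To make this concrete, here is a minimal R sketch. As a hypothetical, pretend we could sample directly from the posterior; the Normal(10.03, 0.20) used here is borrowed from the Normal-Normal example later in this module, and $g(\theta) = \theta^2$ is an arbitrary illustrative choice.

```r
## Monte Carlo approximation from iid posterior draws (illustrative).
set.seed(1)
theta.mc <- rnorm(10000, mean = 10.03, sd = sqrt(0.20))  # theta^(1),...,theta^(S)
mean(theta.mc)    # approximates E[theta | y]
mean(theta.mc^2)  # approximates E[g(theta) | y] with g(theta) = theta^2
```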

Review of Metropolis. But what if we cannot sample directly from $p(\theta \mid y)$? The key idea is that we can still construct a large collection of $\theta$ values whose empirical distribution approximates the posterior (the values will not be iid, since independence will almost certainly not hold in realistic situations). Thus, for any two different values $\theta_a$ and $\theta_b$, we need
$$\frac{\#\{\theta\text{'s in the collection equal to } \theta_a\}}{\#\{\theta\text{'s in the collection equal to } \theta_b\}} \approx \frac{p(\theta_a \mid y)}{p(\theta_b \mid y)}.$$
How might we intuitively construct such a collection?

Review of Metropolis. Suppose we have a working collection $\{\theta^{(1)}, \dots, \theta^{(s)}\}$ and want to add a new value $\theta^{(s+1)}$. Consider adding a value $\theta^*$ that is nearby $\theta^{(s)}$. Should we include $\theta^*$ or not? If $p(\theta^* \mid y) > p(\theta^{(s)} \mid y)$, then we want more $\theta^*$'s in the set than $\theta^{(s)}$'s. But if $p(\theta^* \mid y) < p(\theta^{(s)} \mid y)$, we shouldn't necessarily include $\theta^*$. Perhaps our decision to include $\theta^*$ or not should be based on a comparison of $p(\theta^* \mid y)$ and $p(\theta^{(s)} \mid y)$. Consider the ratio
$$r = \frac{p(\theta^* \mid y)}{p(\theta^{(s)} \mid y)} = \frac{p(y \mid \theta^*)\,p(\theta^*)}{p(y \mid \theta^{(s)})\,p(\theta^{(s)})}.$$

Review of Metropolis. Having computed $r$, what should we do next?

Review of Metropolis. If $r > 1$: (intuition) Since $\theta^{(s)}$ is already in our set, we should include $\theta^*$ as it has higher posterior probability than $\theta^{(s)}$. (procedure) Accept $\theta^*$ into our set and let $\theta^{(s+1)} = \theta^*$. If $r < 1$: (intuition) The relative frequency of $\theta$-values in our set equal to $\theta^*$ compared to those equal to $\theta^{(s)}$ should be $p(\theta^* \mid y)/p(\theta^{(s)} \mid y) = r$. This means that for every instance of $\theta^{(s)}$, we should have only a fraction $r$ of an instance of $\theta^*$. (procedure) Set $\theta^{(s+1)}$ equal to $\theta^*$ or $\theta^{(s)}$ with probability $r$ and $1 - r$, respectively.

Review of Metropolis. This is the basic intuition behind the Metropolis (1953) algorithm. More formally, it proceeds by sampling a proposal value $\theta^*$ nearby the current value $\theta^{(s)}$ using a symmetric proposal distribution $J(\theta^* \mid \theta^{(s)})$. What does symmetry mean here? It means that $J(\theta_a \mid \theta_b) = J(\theta_b \mid \theta_a)$. That is, the probability of proposing $\theta^* = \theta_a$ given that $\theta^{(s)} = \theta_b$ is equal to the probability of proposing $\theta^* = \theta_b$ given that $\theta^{(s)} = \theta_a$. Symmetric proposals include $J(\theta^* \mid \theta^{(s)}) = \text{Uniform}(\theta^{(s)} - \delta,\ \theta^{(s)} + \delta)$ and $J(\theta^* \mid \theta^{(s)}) = \text{Normal}(\theta^{(s)}, \delta^2)$.

Review of Metropolis. The Metropolis algorithm proceeds as follows:
1. Sample $\theta^* \sim J(\theta \mid \theta^{(s)})$.
2. Compute the acceptance ratio
$$r = \frac{p(\theta^* \mid y)}{p(\theta^{(s)} \mid y)} = \frac{p(y \mid \theta^*)\,p(\theta^*)}{p(y \mid \theta^{(s)})\,p(\theta^{(s)})}.$$
3. Let
$$\theta^{(s+1)} = \begin{cases} \theta^* & \text{with probability } \min(r, 1) \\ \theta^{(s)} & \text{otherwise.} \end{cases}$$
Remark: Step 3 can be accomplished by sampling $u \sim \text{Uniform}(0, 1)$ and setting $\theta^{(s+1)} = \theta^*$ if $u < r$, and $\theta^{(s+1)} = \theta^{(s)}$ otherwise.
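As a minimal sketch of Steps 1-3 in code: the function below performs one Metropolis update with a normal random-walk proposal. The names `log.post` (a user-supplied function returning $\log p(\theta \mid y)$ up to a constant) and `delta` (the proposal standard deviation) are illustrative, not from the slides; working on the log scale anticipates the numerical-stability point made shortly.

```r
## One Metropolis update; log.post and delta are assumed/hypothetical inputs.
metropolis.step <- function(theta, log.post, delta) {
  theta.star <- rnorm(1, theta, delta)               # 1. propose from symmetric J
  log.r <- log.post(theta.star) - log.post(theta)    # 2. log acceptance ratio
  if (log(runif(1)) < log.r) theta.star else theta   # 3. accept with prob min(r, 1)
}
```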

Metropolis for Normal-Normal (review). Let's test the Metropolis algorithm on the conjugate Normal-Normal model with known variance. That is, let
$$X_1, \dots, X_n \mid \theta \stackrel{iid}{\sim} \text{Normal}(\theta, \sigma^2), \qquad \theta \sim \text{Normal}(\mu, \tau^2).$$
Recall that the posterior of $\theta$ is $\text{Normal}(\mu_n, \tau_n^2)$, where
$$\mu_n = \bar{x}\,\frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2} + \mu\,\frac{1/\tau^2}{n/\sigma^2 + 1/\tau^2} \qquad \text{and} \qquad \tau_n^2 = \frac{1}{n/\sigma^2 + 1/\tau^2}.$$

Metropolis for Normal-Normal (review). Suppose (taken from Hoff, 2009) $\sigma^2 = 1$, $\tau^2 = 10$, $\mu = 5$, $n = 5$, and $y = (9.37, 10.18, 9.16, 11.60, 10.33)$. For these data, $\mu_n = 10.03$ and $\tau_n^2 = 0.20$. Let's use Metropolis to estimate the posterior (just as an illustration). Based on this model and prior, we need to compute the acceptance ratio
$$r = \frac{p(\theta^* \mid x)}{p(\theta^{(s)} \mid x)} = \frac{p(x \mid \theta^*)\,p(\theta^*)}{p(x \mid \theta^{(s)})\,p(\theta^{(s)})} = \left(\frac{\prod_i \text{dnorm}(x_i, \theta^*, \sigma)}{\prod_i \text{dnorm}(x_i, \theta^{(s)}, \sigma)}\right) \left(\frac{\text{dnorm}(\theta^*, \mu, \tau)}{\text{dnorm}(\theta^{(s)}, \mu, \tau)}\right).$$

Metropolis for Normal-Normal (review). In many cases, computing the ratio $r$ directly is numerically unstable; this can be avoided by working with $\log r$ instead:
$$\log r = \sum_i \left[\log \text{dnorm}(x_i, \theta^*, \sigma) - \log \text{dnorm}(x_i, \theta^{(s)}, \sigma)\right] + \log \text{dnorm}(\theta^*, \mu, \tau) - \log \text{dnorm}(\theta^{(s)}, \mu, \tau).$$
Then a proposal is accepted if $\log u < \log r$, where $u$ is sampled from Uniform(0, 1).

Metropolis for Normal-Normal (review). We run 10,000 iterations of the Metropolis algorithm starting at $\theta^{(0)} = 0$ and using a normal proposal distribution, $\theta^* \sim \text{Normal}(\theta^{(s)}, 2)$.
Figure 1: Results from the Metropolis sampler for the normal model (trace plot of $\theta$ by iteration, and the estimated posterior density of $\theta$).
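A minimal R sketch of this run, in the spirit of the code in Hoff (2009); variable names are illustrative.

```r
## Metropolis for the Normal-Normal model (sketch following Hoff, 2009).
s2 <- 1; t2 <- 10; mu <- 5                       # sigma^2, tau^2, prior mean
y  <- c(9.37, 10.18, 9.16, 11.60, 10.33)         # observed data
theta <- 0; delta2 <- 2                          # theta^(0) and proposal variance
S <- 10000; THETA <- numeric(S)
set.seed(1)
for (s in 1:S) {
  theta.star <- rnorm(1, theta, sqrt(delta2))    # propose theta* ~ Normal(theta, 2)
  log.r <- sum(dnorm(y, theta.star, sqrt(s2), log = TRUE)) -
           sum(dnorm(y, theta,      sqrt(s2), log = TRUE)) +
           dnorm(theta.star, mu, sqrt(t2), log = TRUE) -
           dnorm(theta,      mu, sqrt(t2), log = TRUE)
  if (log(runif(1)) < log.r) theta <- theta.star # accept if log u < log r
  THETA[s] <- theta
}
mean(THETA); var(THETA)  # compare to mu_n = 10.03 and tau_n^2 = 0.20
```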

Metropolis-Hastings. Let's recall what a Markov chain is. The Gibbs sampler and the Metropolis algorithm are both ways of generating Markov chains that approximate a target probability distribution.

Consider a simple example where our target probability distribution is $p_0(u, v)$, a bivariate distribution for two random variables $U$ and $V$. In the one-sample normal problem, $U = \theta$, $V = \sigma^2$, and $p_0(u, v) = p(\theta, \sigma^2 \mid y)$.
Gibbs: iteratively sample values of $U$ and $V$ from their conditional distributions. That is,
1. update $U$: sample $u^{(s+1)} \sim p_0(u \mid v^{(s)})$
2. update $V$: sample $v^{(s+1)} \sim p_0(v \mid u^{(s+1)})$.
Metropolis: propose changes to $X = (U, V)$ and then accept or reject those changes based on $p_0$.

An alternative way to implement the Metropolis algorithm is to propose and then accept or reject a change to one element at a time:
1. update $U$:
1.1 sample $u^* \sim J_u(u \mid u^{(s)})$
1.2 compute $r = \dfrac{p_0(u^*, v^{(s)})}{p_0(u^{(s)}, v^{(s)})}$
1.3 set $u^{(s+1)}$ equal to $u^*$ or $u^{(s)}$ with probability $\min(1, r)$ and $\max(0, 1 - r)$.
2. update $V$:
2.1 sample $v^* \sim J_v(v \mid v^{(s)})$
2.2 compute $r = \dfrac{p_0(u^{(s+1)}, v^*)}{p_0(u^{(s+1)}, v^{(s)})}$
2.3 set $v^{(s+1)}$ equal to $v^*$ or $v^{(s)}$ with probability $\min(1, r)$ and $\max(0, 1 - r)$.
Here, $J_u$ and $J_v$ are separate symmetric proposal distributions for $U$ and $V$.
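A short sketch of this componentwise scheme follows. The bivariate normal target with correlation 0.8 is a hypothetical stand-in for $p_0(u, v)$, chosen only so the example runs end to end.

```r
## Componentwise Metropolis; the target is an illustrative bivariate
## normal with correlation 0.8 (not from the slides).
log.p0 <- function(u, v, rho = 0.8) {
  -(u^2 - 2 * rho * u * v + v^2) / (2 * (1 - rho^2))  # log density up to a constant
}
S <- 5000; delta <- 1
u <- 0; v <- 0
UV <- matrix(NA, S, 2)
set.seed(1)
for (s in 1:S) {
  u.star <- rnorm(1, u, delta)                                       # 1.1 propose from J_u
  if (log(runif(1)) < log.p0(u.star, v) - log.p0(u, v)) u <- u.star  # 1.2-1.3 accept/reject
  v.star <- rnorm(1, v, delta)                                       # 2.1 propose from J_v
  if (log(runif(1)) < log.p0(u, v.star) - log.p0(u, v)) v <- v.star  # 2.2-2.3 accept/reject
  UV[s, ] <- c(u, v)
}
cor(UV)  # off-diagonal should be roughly 0.8 after many iterations
```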

The Metropolis algorithm generates proposals from $J_u$ and $J_v$ and accepts them with probability $\min(1, r)$. Similarly, each step of Gibbs sampling can be seen as generating a proposal from a full conditional distribution and then accepting it with probability 1. The Metropolis-Hastings (MH) algorithm generalizes both of these approaches by allowing arbitrary proposal distributions: symmetric around the current values, full conditionals, or something else entirely.

A MH algorithm for approximating $p_0(u, v)$ runs as follows:
1. update $U$:
1.1 sample $u^* \sim J_u(u \mid u^{(s)}, v^{(s)})$
1.2 compute $r = \dfrac{p_0(u^*, v^{(s)})}{p_0(u^{(s)}, v^{(s)})} \times \dfrac{J_u(u^{(s)} \mid u^*, v^{(s)})}{J_u(u^* \mid u^{(s)}, v^{(s)})}$
1.3 set $u^{(s+1)}$ equal to $u^*$ or $u^{(s)}$ with probability $\min(1, r)$ and $\max(0, 1 - r)$.
2. update $V$:
2.1 sample $v^* \sim J_v(v \mid u^{(s+1)}, v^{(s)})$
2.2 compute $r = \dfrac{p_0(u^{(s+1)}, v^*)}{p_0(u^{(s+1)}, v^{(s)})} \times \dfrac{J_v(v^{(s)} \mid u^{(s+1)}, v^*)}{J_v(v^* \mid u^{(s+1)}, v^{(s)})}$
2.3 set $v^{(s+1)}$ equal to $v^*$ or $v^{(s)}$ with probability $\min(1, r)$ and $\max(0, 1 - r)$.
Above, $J_u$ and $J_v$ are not required to be symmetric. The only requirement is that they not depend on $U$ or $V$ values in our sequence previous to the most current values. This requirement ensures that the sequence is a Markov chain.
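Below is a one-dimensional sketch of an MH update with a deliberately asymmetric proposal, showing where the extra $J$ ratio enters the acceptance ratio. The Exponential(1) target and the exponential proposal (with mean equal to the current value) are illustrative choices of mine, not from the slides.

```r
## MH with an asymmetric proposal: the acceptance ratio includes the
## correction factor J(current | proposed) / J(proposed | current).
log.target <- function(th) dexp(th, rate = 1, log = TRUE)  # illustrative target
S <- 10000; theta <- 1; TH <- numeric(S)
set.seed(1)
for (s in 1:S) {
  theta.star <- rexp(1, rate = 1 / theta)   # proposal with mean theta: not symmetric
  log.r <- log.target(theta.star) - log.target(theta) +
           dexp(theta,      rate = 1 / theta.star, log = TRUE) -  # J(theta | theta*)
           dexp(theta.star, rate = 1 / theta,      log = TRUE)    # J(theta* | theta)
  if (log(runif(1)) < log.r) theta <- theta.star
  TH[s] <- theta
}
mean(TH)  # should be near E[theta] = 1 for the Exponential(1) target
```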

Doesn't the algorithm above look familiar? Yes, it looks a lot like Metropolis, except that the acceptance ratio $r$ contains an extra factor: the ratio of the probability of generating the current value from the proposed one to the probability of generating the proposed value from the current one. This can be viewed as a correction factor: if a value $u^*$ is much more likely to be proposed than the current value $u^{(s)}$, then we must down-weight the probability of accepting $u^*$; otherwise, such values would be overrepresented in the chain.
Exercise 1: Show that Metropolis is a special case of MH. Hint: think about the proposals $J$.
Exercise 2: Show that Gibbs is a special case of MH. Hint: show that $r = 1$.

We implement the Metropolis algorithm for a Poisson regression model. We have a sample from a population of 52 song sparrows that was studied over the course of a summer, during which their reproductive activities were recorded. In particular, the age and number of new offspring were recorded for each sparrow (Arcese et al., 1992). A simple probability model for these data is a Poisson regression with $Y$ = number of offspring, conditional on $x$ = age. Thus, we assume $Y \mid \theta_x \sim \text{Poisson}(\theta_x)$. For stability of the model, we assume that the mean number of offspring $\theta_x$ is a smooth function of age, and express $\theta_x = \beta_1 + \beta_2 x + \beta_3 x^2$.

Remark: This parameterization allows some values of $\theta_x$ to be negative, so as an alternative we reparameterize and model the log-mean of $Y$:
$$\log E(Y \mid x) = \log \theta_x = \beta_1 + \beta_2 x + \beta_3 x^2,$$
which implies that
$$\theta_x = \exp(\beta_1 + \beta_2 x + \beta_3 x^2) = \exp(\beta^T x).$$
Now back to the problem of implementing Metropolis. For this problem, we will write $\log E(Y_i \mid x_i) = \beta^T x_i$, where $x_i$ is the age of sparrow $i$. We will abuse notation slightly and write $x_i = (1, x_i, x_i^2)$.

We will assume the prior on the regression coefficients is iid Normal(0, 100). Given a current value $\beta^{(s)}$ and a value $\beta^*$ generated from $J(\beta^* \mid \beta^{(s)})$, the acceptance ratio for the Metropolis algorithm is
$$r = \frac{p(\beta^* \mid X, y)}{p(\beta^{(s)} \mid X, y)} = \frac{\prod_{i=1}^n \text{dpois}\left(y_i, e^{x_i^T \beta^*}\right)}{\prod_{i=1}^n \text{dpois}\left(y_i, e^{x_i^T \beta^{(s)}}\right)} \times \frac{\prod_{j=1}^3 \text{dnorm}(\beta_j^*, 0, 10)}{\prod_{j=1}^3 \text{dnorm}(\beta_j^{(s)}, 0, 10)}.$$

We just need to specify the proposal distribution for $\beta^*$. A convenient choice is a multivariate normal distribution with mean $\beta^{(s)}$. In many problems, the posterior variance is an efficient choice of proposal variance, but we don't know it here. However, it is often sufficient to use a rough approximation. In a normal regression problem, the posterior variance will be close to $\sigma^2 (X^T X)^{-1}$, where $\sigma^2$ is the variance of $Y$. In our problem, $\log E(Y \mid x) = \beta^T x$, so we can try a proposal variance of $\hat{\sigma}^2 (X^T X)^{-1}$, where $\hat{\sigma}^2$ is the sample variance of $\log(y + 1/2)$. Remark: we add $1/2$ because otherwise $\log 0$ is undefined. The code implementing the algorithm will be covered in the corresponding lab; a rough sketch appears below.
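A rough R sketch of this sampler, in the spirit of the code in Hoff (2009). It assumes the sparrow data are already loaded as vectors `y` (offspring counts) and `x` (ages), and uses `rmvnorm` from the mvtnorm package for the multivariate normal proposal; the lab code may differ in its details.

```r
## Metropolis for the sparrow Poisson regression (sketch; assumes y, x exist).
library(mvtnorm)                                    # for rmvnorm
X <- cbind(1, x, x^2)                               # design matrix (1, age, age^2)
p <- ncol(X)
var.prop <- var(log(y + 1/2)) * solve(t(X) %*% X)   # proposal variance: sigma.hat^2 (X'X)^{-1}
beta <- rep(0, p)
S <- 10000; BETA <- matrix(NA, S, p)
set.seed(1)
for (s in 1:S) {
  beta.star <- c(rmvnorm(1, beta, var.prop))        # symmetric multivariate normal proposal
  log.r <- sum(dpois(y, exp(X %*% beta.star), log = TRUE)) -
           sum(dpois(y, exp(X %*% beta),      log = TRUE)) +
           sum(dnorm(beta.star, 0, 10, log = TRUE)) -
           sum(dnorm(beta,      0, 10, log = TRUE)) # Normal(0, 100) priors, sd = 10
  if (log(runif(1)) < log.r) beta <- beta.star
  BETA[s, ] <- beta
}
```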

Figure 2: Plot of the Markov chain in $\beta_3$ along with autocorrelation functions.

More details of this example will be covered in lab.