Metropolis-Hastings sampling

Similar documents
Bayesian linear regression

Principles of Bayesian Inference

Monte Carlo in Bayesian Statistics

Hierarchical models. Dr. Jarad Niemi. August 31, Iowa State University. Jarad Niemi (Iowa State) Hierarchical models August 31, / 31

MCMC algorithms for fitting Bayesian models

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis

Bayesian Linear Regression

Bayesian Linear Models

Metropolis-Hastings Algorithm

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017

ST 740: Markov Chain Monte Carlo

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo

Bayesian Linear Models

STA 4273H: Statistical Machine Learning

eqr094: Hierarchical MCMC for Bayesian System Reliability

Computer Practical: Metropolis-Hastings-based MCMC

Bayesian Linear Models

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Gaussian Process Regression

DAG models and Markov Chain Monte Carlo methods a short overview

Monte Carlo Inference Methods

MCMC for Cut Models or Chasing a Moving Target with MCMC

Bayesian GLMs and Metropolis-Hastings Algorithm

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013

MCMC Methods: Gibbs and Metropolis

17 : Markov Chain Monte Carlo

Metropolis Hastings. Rebecca C. Steorts Bayesian Methods and Modern Statistics: STA 360/601. Module 9

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA

Computer intensive statistical methods

Bayesian Methods for Machine Learning

Part 8: GLMs and Hierarchical LMs and GLMs

Lecture 8: Bayesian Estimation of Parameters in State Space Models

Simulating Random Variables

Introduction to Probabilistic Machine Learning

Markov Chain Monte Carlo, Numerical Integration

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

6.1 Approximating the posterior by calculating on a grid

INTRODUCTION TO BAYESIAN STATISTICS

19 : Slice Sampling and HMC

A Review of Pseudo-Marginal Markov Chain Monte Carlo

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling

Statistical Machine Learning Lecture 8: Markov Chain Monte Carlo Sampling

Tutorial on Probabilistic Programming with PyMC3

Coupled Hidden Markov Models: Computational Challenges

CSC 2541: Bayesian Methods for Machine Learning

Bayesian model selection for computer model validation via mixture model estimation

Principles of Bayesian Inference

CPSC 540: Machine Learning

Computational statistics

Learning the hyper-parameters. Luca Martino

Lecture 4: Dynamic models

MCMC and Gibbs Sampling. Kayhan Batmanghelich

MCMC and Gibbs Sampling. Sargur Srihari

Bayesian Phylogenetics:

Advanced Statistical Modelling

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Dynamic Generalized Linear Models

Markov Chain Monte Carlo Using the Ratio-of-Uniforms Transformation. Luke Tierney Department of Statistics & Actuarial Science University of Iowa

Reminder of some Markov Chain properties:

Statistics 352: Spatial statistics. Jonathan Taylor. Department of Statistics. Models for discrete data. Stanford University.

Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods

Reducing The Computational Cost of Bayesian Indoor Positioning Systems

Principles of Bayesian Inference

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California

Markov Chain Monte Carlo methods

Introduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo

The Recycling Gibbs Sampler for Efficient Learning

Hierarchical Modeling for Spatial Data

Monte Carlo Methods in Bayesian Inference: Theory, Methods and Applications

MONTE CARLO METHODS. Hedibert Freitas Lopes

CSC 2541: Bayesian Methods for Machine Learning

Bayesian Inference and MCMC

CS281A/Stat241A Lecture 22

Lecture 2: From Linear Regression to Kalman Filter and Beyond

STAT 425: Introduction to Bayesian Analysis

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent

Control Variates for Markov Chain Monte Carlo

Bayesian Inference. Chapter 9. Linear models and regression

Bayesian data analysis in practice: Three simple examples

Online Appendix for: A Bounded Model of Time Variation in Trend Inflation, NAIRU and the Phillips Curve

Spatial Statistics Chapter 4 Basics of Bayesian Inference and Computation

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Bayesian non-parametric model to longitudinally predict churn

CTDL-Positive Stable Frailty Model

Manifold Monte Carlo Methods

Likelihood-free MCMC

Darwin Uy Math 538 Quiz 4 Dr. Behseta

MALA versus Random Walk Metropolis Dootika Vats June 4, 2017

PSEUDO-MARGINAL METROPOLIS-HASTINGS APPROACH AND ITS APPLICATION TO BAYESIAN COPULA MODEL

Markov Chains: Basic Theory Definition 1. A (discrete time/discrete state space) Markov chain (MC) is a sequence of random quantities {X k }, each tak

NPFL108 Bayesian inference. Introduction. Filip Jurčíček. Institute of Formal and Applied Linguistics Charles University in Prague Czech Republic

Lecture 7 and 8: Markov Chain Monte Carlo

Riemann Manifold Methods in Bayesian Statistics

θ 1 θ 2 θ n y i1 y i2 y in Hierarchical models (chapter 5) Hierarchical model Introduction to hierarchical models - sometimes called multilevel model

Transcription:

Metropolis-Hastings sampling Gibbs sampling requires that a sample from each full conditional distribution. In all the cases we have looked at so far the conditional distributions were conjugate so sampling was straight-forward. What if they are not conjugate? We could make draws from the conditional distributions using rejection sampling. This works well if there are only a few non-conjugate parameters but can be difficult to tune. Other methods have been proposed: 1. Auxiliary variables 2. Slice sampling 3. Metropolis-Hastings sampling Metropolis-Hastings sampling is the most widely used. ST740 (3) Computing - Part 3 Page 1

Auxiliary Variables Sometimes the model can be reexpressed so that the full conditionals are conjugate by adding auxiliary variables. For example, consider the zero-inflated Poisson model with Y 1,..., Y n f(y θ, λ) with f(y θ, λ) = θi(y = 0) + (1 θ)poisson(y λ) and priors θ Beta(a, b) and λ Gamma(c, d). The full conditionals for θ and λ are not conjugate. Consider the following model δ(0) Z i = 1 Y i Z i Poisson(λ) Z i = 0 where δ(0) is the point mass distribution at zero and Z i Bernoulli(θ). The Z i are auxiliary variables. After their inclusion, the model is equivalent to the zero-inflated Poisson and conditionally conjugate. It is equivalent because: ST740 (3) Computing - Part 3 Page 2

Auxiliary Variables The full conditionals are: p(z i θ, λ, y) = p(θ λ, Z 1,..., Z n, y) = p(λ θ, Z 1,..., Z n, y) = ST740 (3) Computing - Part 3 Page 3

Code #bookkeeping n <- length(y) sumy <- sum(y) keepers <- matrix(0,iters,2) #initial values z <- rep(1,n) theta <- 0.5 lambda <- 1 for(i in 1:iters){ #update z P1 <- ifelse(y==0,1,0)*theta P0 <- dpois(y,lambda)*(1-theta) z <- rbinom(n,1,p1/(p1+p0)) #update theta and lambda n1 <- sum(z==1) n0 <- n-n1 theta <- rbeta(1,a+n1,b+n0) lambda <- rgamma(1,c+sumy,d+n0) keepers[i,]<-c(theta,lambda) Code is online at http://www4.stat.ncsu.edu/ reich/st740/code/zip.r. ST740 (3) Computing - Part 3 Page 4

Slice sampling It is not obvious how to find auxiliary variables to give conjugacy. Slice sampling is a formulaic way to define auxiliary variables. Say the univariate posterior is p(θ y) h(θ y). We add auxiliary variable U θ, y U(0, h(θ y)). The joint distribution is The marginal distribution of θ y is The full conditional distributions are How to sample from p(θ U, y)? It is called slice sampling because: ST740 (3) Computing - Part 3 Page 5

Metropolis-Hastings sampling Metropolis-Hastings sampling is like Gibbs sampling in that you begin with initial values for all parameters, and then update them one at a time conditioned on the current value of all other parameters. In this algorithm, we do not need to sample from the full conditionals. All we need is to be able to evaluate the posterior up to a normalizing constant. Basic plan for each update: 1. Draw a candidate for θ j 2. Compare the posterior of the candidate and the current value 3. If the candidate looks good, keep it; if not, keep the current value. Formally, the algorithm to update θ j given the current value θ j = θ (t 1) j 1. Draw a candidate for θ j g(θ j θ, y). 2. Compute the acceptance ratio 3. Generate U Uniform(0,1). R = p(y θ )p(θ ) g(θj θ ) p(y θ )p(θ ) g(θ j θ ). is: 4. Set θ (t) j = θ j θ j U < R U > R Notation: θ and θ are current parameter vectors with j th element θ j and θ j, respectively. ST740 (3) Computing - Part 3 Page 6

Metropolis-Hastings sampling Selecting the candidate distribution is crucial. The most common is the Gaussian random walk θ j N(θ j, c 2 j). In this case g is symmetric, i.e., g(θ j θ ) = g(θ j θ ) and R = p(y θ )p(θ ) p(y θ )p(θ ). When the candidate distribution is symmetric, the algorithm is the Metropolis algorithm. The algorithm has tuning parameter c j. For a random walk candidate, c j is tuned so that the acceptance probability is 30-50%. If c is too large, the acceptance ratio becomes: If c is too small, the acceptance ratio becomes: Both are bad. Why? ST740 (3) Computing - Part 3 Page 7

Metropolis-Hastings Sampling Consider the logistic regression for binary data Y i {0, 1 and covariate vector X i = (X i1,..., X jp ) T Prob(Y i = 1 β) = exp(x T i β) 1 + exp(x T i β) (1) β j N(µ j, σj 2 ). (2) The full posterior is: ST740 (3) Computing - Part 3 Page 8

Code #full joint log posterior of beta y for the logistic regression model: log_post<-function(y,x,beta,prior.mn,prior.sd){ xbeta <- X%*%beta xbeta <- ifelse(xbeta>10,10,xbeta) like <- sum(dbinom(y,1,expit(xbeta),log=true)) prior <- sum(dnorm(beta,prior.mn,prior.sd,log=true)) like+prior logit<-function(x){log(x/(1-x)) expit<-function(x){exp(x)/(1+exp(x)) # Start sampling p<-ncol(x) #Initial values: beta <- rnorm(p,0,1) keep.beta <- matrix(0,n.samples,p) acc <- rep(0,p) cur_log_post <- log_post(y,x,beta,prior.mn,prior.sd) for(i in 1:n.samples){ #Update beta using MH sampling: for(j in 1:p){ # Draw candidate: canbeta <- beta canbeta[j] <- rnorm(1,beta[j],can.sd) can_log_post <- log_post(y,x,canbeta,prior.mn,prior.sd) # Compute acceptance ratio: R <- exp(can_log_post-cur_log_post) U <- runif(1) if(u<r){ beta <- canbeta cur_log_post <- can_log_post acc[j] <- acc[j]+1 keep.beta[i,]<-beta Code is online at http://www4.stat.ncsu.edu/ reich/st740/code/logistic.r. ST740 (3) Computing - Part 3 Page 9

Metropolis-Hastings sampling What is the ideal candidate distribution? If the candidate distribution is the full conditional, g(θ θ j ) = p(θ j y, θ l for all l j), what is the acceptance ratio? Is this a good candidate? What does this say about the relationship between Gibbs and Metropolis-Hastings sampling? ST740 (3) Computing - Part 3 Page 10

Metropolis-Hastings sampling Even when the full conditional is not available, with some effort you can find a better candidate distribution than a random walk. For example, the Langevin-Hasting algorithm tries to match the candidate to the full conditional using a first-order Taylor approximation, ( θ j N θj + c 2 f(y θ )p(θ ) ), c 2. θj This candidate distribution is not symmetric. Hamiltonian MCMC involves an even more elaborate candidate distribution. Gelman et al favor this method and discuss it extensively in their book. ST740 (3) Computing - Part 3 Page 11

Metropolis-Hastings sampling Comments: How to handle bounded parameters, say σ 2 (0, )? Blocked Metropolis is possible. which is hard to tune. This requires a high-dimensional candidate distribution Adaptive Metropolis-Hastings tries to match the candidate distribution to the full conditional using past samples. This is performed in the ARMS package in R. You can also keep sampling each parameter until you get a success. This requires a complicated acceptance ratio. Combining Metropolis-Hastings and Gibbs sampling: Since Gibbs is a special case of Metropolis-Hastings we can easily combine the two. If some parameters have conjugate full conditionals then they can be updated from their full conditional as in Gibbs; for other parameters you can use an Metropolis step. This is really one large Metropolis sampler, but with the full conditional as the candidate when possible. In the logistic regression example we had prior β j N(0, σ 2 ) where σ 2 was fixed. What if instead σ 2 InvGamma(a, b)? The full conditions for β j remain unknown, but σ 2 has a conjugate prior. Example code given on the next page. ST740 (3) Computing - Part 3 Page 12

Code for(i in 1:n.samples){ #Update beta using MH sampling: for(j in 1:p){ # Draw candidate: canbeta <- beta canbeta[j] <- rnorm(1,beta[j],can.sd) can_log_post <- log_post(y,x,canbeta,0,sqrt(sigma2)) # Compute acceptance ratio: R <- exp(can_log_post-cur_log_post) U <- runif(1) if(u<r){ beta <- canbeta cur_log_post <- can_log_post # Update sigmaˆ2 using Gibbs sigma2<-1/rgamma(1,p/2+a,sum(betaˆ2)/2+b) ST740 (3) Computing - Part 3 Page 13