MALA versus Random Walk Metropolis
Dootika Vats
June 4, 2017


Introduction

My research thus far has predominantly been on output analysis for Markov chain Monte Carlo. The examples on which I have implemented our methods have been Gibbs samplers, vanilla Metropolis-Hastings samplers, or Metropolis-within-Gibbs samplers. I have been somewhat distant from the wide variety of samplers that users can choose from. One of the more popular ones is the Metropolis-adjusted Langevin algorithm (MALA), introduced in Roberts and Tweedie (1996) and further studied in Roberts and Rosenthal (1998). MALA is a Metropolis-Hastings sampler with a special proposal distribution: the proposal evaluates the gradient of the log density at the current state and shifts the center of the proposal distribution by a scaled factor of this gradient. MALA is based on Langevin diffusions, a connection I am going to ignore in this article due to lack of knowledge at this point.

MALA

Let π(x) be the target distribution for the MCMC sampler, defined on a p-dimensional space. A generic Metropolis-Hastings sampler at the current value x proposes a value y from a proposal distribution with density q(x, ·). The proposed value y is accepted with probability

  min{ 1, [π(y) q(y, x)] / [π(x) q(x, y)] }.

Different choices of q lead to different samplers. For MALA, q(x, ·) is the density of the distribution

  N( x + (σ₁²/2) ∇ log π(x), σ₁² D ),

where σ₁² > 0 is the step size and D is a p × p positive definite matrix. The role of D is similar to that of the covariance matrix in the proposal for the random walk Metropolis sampler; we will simply take D to be the identity matrix in our examples. If σ₁² is tuned properly, the MALA proposal forces the center of the proposal distribution to climb the gradient, which lets the sampler move out of the tails faster than more naive samplers.

We will compare the performance of the MALA sampler with that of the random walk Metropolis (RWM) sampler. The RWM sampler uses the proposal distribution

  N( x, σ₂² D ),

where σ₂² and D play the same roles as before.

Roberts and Rosenthal (1998) concluded that an optimal acceptance rate for MALA is 0.574, and Roberts, Gelman, and Gilks (1997) concluded that the optimal acceptance rate for the RWM is 0.234. We will tune both σ₁² and σ₂² to achieve these acceptance rates.
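To fix ideas before the examples, here is a minimal sketch of a single MALA transition with D = I. The names mala.step, log_pi, and grad_log_pi are placeholders introduced only for this sketch; they are not used in the examples that follow.

# A minimal sketch of one MALA transition with D = I (illustrative only).
# log_pi and grad_log_pi are hypothetical placeholders for log pi and its gradient.
mala.step <- function(x, sigma, log_pi, grad_log_pi) {
  mu.x <- x + sigma^2 * grad_log_pi(x) / 2         # proposal mean at current state
  y <- rnorm(length(x), mean = mu.x, sd = sigma)   # draw the proposal
  mu.y <- y + sigma^2 * grad_log_pi(y) / 2         # proposal mean at proposed state
  log.q.xy <- -sum((y - mu.x)^2) / (2 * sigma^2)   # log q(x, y), up to a constant
  log.q.yx <- -sum((x - mu.y)^2) / (2 * sigma^2)   # log q(y, x), up to a constant
  log.ratio <- log_pi(y) - log_pi(x) + log.q.yx - log.q.xy
  if (log(runif(1)) < log.ratio) y else x          # accept or stay
}

For the bivariate normal target of the next section, for instance, log_pi would play the role of loglike and grad_log_pi could be function(x) as.numeric(-Sigma.inv %*% (x - mu)).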

Example 1: Gaussian

Let the target distribution be a bivariate normal distribution (nothing too complicated),

  N( (3, 6)ᵀ, [ 2  0.5 ; 0.5  1 ] ).

We denote the mean vector by μ and the 2 × 2 covariance matrix by Σ. The density of this distribution satisfies

  π(x) ∝ exp( -(x - μ)ᵀ Σ⁻¹ (x - μ) / 2 )
  log π(x) = const - (x - μ)ᵀ Σ⁻¹ (x - μ) / 2
  ∇ log π(x) = -Σ⁻¹ (x - μ).

mu <- c(3, 6)
Sigma <- matrix(c(2, .5, .5, 1), nrow = 2, ncol = 2)
Sigma.inv <- solve(Sigma)

# Calculates the log density of the bivariate normal (up to an additive constant)
loglike <- function(x) {
  return(as.numeric(- t(x - mu) %*% Sigma.inv %*% (x - mu)/2))
}

First, I write down the function for the RWM sampler. Note that this proposal is symmetric.

set.seed(100)
rwm <- function(N = 1e5, sigma) {
  chain.rwm <- matrix(0, nrow = N, ncol = 2)
  accept <- 0

  # Starting value is the origin
  chain.rwm[1, ] <- c(0, 0)
  for(i in 2:N) {
    prop <- rnorm(2, mean = chain.rwm[i-1, ], sd = sigma)
    log.ratio <- loglike(prop) - loglike(chain.rwm[i-1, ])
    if(log(runif(1)) < log.ratio) {
      chain.rwm[i, ] <- prop
      accept <- accept + 1
    } else {
      chain.rwm[i, ] <- chain.rwm[i-1, ]
    }
  }
  return(list("chain" = chain.rwm, "accept" = accept/N))
}

out.rwm <- rwm(sigma = 1)
out.rwm$accept

## [1] 0.58784
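Before calibrating the step size, a quick optional sanity check of the hand-coded log density: if the mvtnorm package is available, loglike should differ from dmvnorm's log density only by an additive constant that does not depend on x. The test points below are arbitrary.

# Optional check (assumes the mvtnorm package is installed): each difference
# below should be the same constant, log(2*pi) + log(det(Sigma))/2.
library(mvtnorm)
loglike(c(0, 0))  - dmvnorm(c(0, 0),  mean = mu, sigma = Sigma, log = TRUE)
loglike(c(3, 6))  - dmvnorm(c(3, 6),  mean = mu, sigma = Sigma, log = TRUE)
loglike(c(-2, 4)) - dmvnorm(c(-2, 4), mean = mu, sigma = Sigma, log = TRUE)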

# Calibrated sigma to get close to the optimal rate
out.rwm <- rwm(sigma = 2.5)
out.rwm$accept

## [1] 0.25726

Coding up the MALA sampler is slightly more complicated, since the proposal distribution is no longer symmetric and the proposal densities do not cancel in the acceptance ratio.

# Calculates the log density of the MALA proposal (up to an additive constant):
# x is where the density is evaluated, y is the state the proposal is centered around
proplike <- function(x, y, sigma) {
  grad <- -Sigma.inv %*% (y - mu)
  mu.m <- y + sigma^2 * grad/2
  return(as.numeric(- t(x - mu.m) %*% (x - mu.m)/(2*sigma^2)))
}

# MALA sampler
mala <- function(N = 1e5, sigma) {
  chain.mala <- matrix(0, nrow = N, ncol = 2)
  accept <- 0

  # Starting value is the origin
  chain.mala[1, ] <- c(0, 0)
  for(i in 2:N) {
    grad <- -Sigma.inv %*% (chain.mala[i-1, ] - mu)
    mu.m <- chain.mala[i-1, ] + sigma^2 * grad/2
    prop <- rnorm(2, mean = mu.m, sd = sigma)
    log.ratio <- loglike(prop) - loglike(chain.mala[i-1, ]) +
      proplike(chain.mala[i-1, ], prop, sigma) - proplike(prop, chain.mala[i-1, ], sigma)
    if(log(runif(1)) < log.ratio) {
      chain.mala[i, ] <- prop
      accept <- accept + 1
    } else {
      chain.mala[i, ] <- chain.mala[i-1, ]
    }
  }
  return(list("chain" = chain.mala, "accept" = accept/N))
}

out.mala <- mala(sigma = 1)
out.mala$accept

## [1] 0.29073

# Tuning MALA
out.mala <- mala(sigma = .5)
out.mala$accept

## [1] 0.5763
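The tuning above was done by hand. A simple alternative is a short pilot search over a grid of candidate step sizes, keeping the one whose acceptance rate is closest to the target. The helper below and the grid values are illustrative choices, not recommendations.

# Illustrative pilot tuning: run shorter chains over a grid of step sizes
# and report the resulting acceptance rates.
tune.sigma <- function(sampler, sigmas, N = 1e4) {
  data.frame(sigma = sigmas,
             accept = sapply(sigmas, function(s) sampler(N = N, sigma = s)$accept))
}
tune.sigma(rwm,  sigmas = c(1, 1.5, 2, 2.5, 3))   # target acceptance about 0.234
tune.sigma(mala, sigmas = c(0.25, 0.5, 0.75, 1))  # target acceptance about 0.574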

To compare the performance of the two samplers, we plot some graphs. The first is the trace plot for the two components.

par(mfrow = c(1, 2))
plot(tail(1:1e5, 1e4), tail(out.rwm$chain[, 1], 1e4), ylab = "First Component",
     main = "", type = 'l', xlab = "index")
lines(tail(1:1e5, 1e4), tail(out.mala$chain[, 1], 1e4), col = "red")
plot(tail(1:1e5, 1e4), tail(out.rwm$chain[, 2], 1e4), ylab = "Second Component",
     main = "", type = 'l', xlab = "index")
lines(tail(1:1e5, 1e4), tail(out.mala$chain[, 2], 1e4), col = "red")

[Figure: trace plots of the last 10,000 iterations for the first and second components; RWM in black, MALA in red.]

The performance looks similar in the trace plots. MALA looks like it may produce thinner tails and concentrate more on areas of high probability.

par(mfrow = c(1, 2))
plot(density(out.rwm$chain[, 1]), ylab = "First Component", main = "")
lines(density(out.mala$chain[, 1]), col = "red")
plot(density(out.rwm$chain[, 2]), ylab = "Second Component", main = "")
lines(density(out.mala$chain[, 2]), col = "red")

[Figure: estimated marginal densities of the two components; RWM in black, MALA in red.]

In terms of autocorrelation, we see the following results.

par(mfrow = c(2, 3))
acf(out.rwm$chain[, 1], main = "RWM: First")
acf(out.rwm$chain[, 2], main = "RWM: Second")
ccf(out.rwm$chain[, 1], out.rwm$chain[, 2], main = "RWM: CCF")
acf(out.mala$chain[, 1], main = "MALA: First")
acf(out.mala$chain[, 2], main = "MALA: Second")
ccf(out.mala$chain[, 1], out.mala$chain[, 2], main = "MALA: CCF")

[Figure: autocorrelation and cross-correlation plots for the RWM chain (top row) and the MALA chain (bottom row).]

Interestingly, MALA produces much higher autocorrelation and significantly higher cross-correlation. This implies that the multivariate effective sample size for MALA, for estimating the mean of the normal distribution, will be smaller. Here is the implementation.

library(mcmcse)

## mcmcse: Monte Carlo Standard Errors for MCMC
## Version 1.2-1 created on 2016-03-24.
## copyright (c) 2012, James M. Flegal, University of California, Riverside
##                     John Hughes, University of Minnesota
##                     Dootika Vats, University of Minnesota
## For citation information, type citation("mcmcse").
## Type help("mcmcse-package") to get started.

c(multiESS(out.rwm$chain), multiESS(out.mala$chain))

## [1] 12862.660  1910.657

Clearly, the effective sample size for a Monte Carlo sample of size 1e5 is much smaller for MALA than for the RWM. So both RWM and MALA yield decent density estimates, but for estimating the mean (3, 6), the RWM is clearly favored over MALA because of this loss of efficiency.
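As one way to put these numbers in context, the mcmcse package also provides a minESS function, which returns the minimum effective sample size needed for a given dimension, confidence level, and relative precision; the settings below (95% level, 5% tolerance) are illustrative choices.

# Minimum effective sample size required for a p = 2 problem at the 95% confidence
# level with 5% relative precision (illustrative settings).
minESS(p = 2, alpha = 0.05, eps = 0.05)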

Intuitively, I think MALA will work better for fatter-tailed distributions. Let's see.

Example 2: Fat tails

Let the target be a t distribution with 5 degrees of freedom:

  π(x) ∝ (1 + x²/5)⁻³
  log π(x) = const - 3 log(1 + x²/5)
  ∇ log π(x) = -6x / (5 + x²).

# Calculates the log density of the t distribution (up to an additive constant)
loglike <- function(x) {
  return(-3*log(1 + x^2/5))
}

Below is the RWM implementation.

rwm <- function(N = 1e5, sigma) {
  chain.rwm <- numeric(length = N)
  accept <- 0

  # Starting value is far from 0
  chain.rwm[1] <- 10
  for(i in 2:N) {
    prop <- rnorm(1, mean = chain.rwm[i-1], sd = sigma)
    log.ratio <- loglike(prop) - loglike(chain.rwm[i-1])
    if(log(runif(1)) < log.ratio) {
      chain.rwm[i] <- prop
      accept <- accept + 1
    } else {
      chain.rwm[i] <- chain.rwm[i-1]
    }
  }
  return(list("chain" = chain.rwm, "accept" = accept/N))
}

out.rwm <- rwm(sigma = 1)
out.rwm$accept

## [1] 0.72299

# Calibrated sigma to get close to the optimal rate
out.rwm <- rwm(sigma = 5)
out.rwm$accept

## [1] 0.2745
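The MALA sampler below needs the gradient derived above. Here is a quick illustrative finite-difference check of the formula ∇ log π(x) = -6x/(5 + x²); the evaluation points and step size are arbitrary choices.

# Illustrative check: the analytic gradient should match a central finite
# difference of loglike (up to numerical error) at arbitrary points.
x.check  <- c(-3, -1, 0.5, 2, 10)
h        <- 1e-6
analytic <- -6 * x.check / (5 + x.check^2)
numer    <- (loglike(x.check + h) - loglike(x.check - h)) / (2 * h)
round(cbind(analytic, numer), 6)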

As before, the MALA proposal is not symmetric, so the proposal densities enter the acceptance ratio.

# Calculates the log density of the MALA proposal (up to an additive constant)
proplike <- function(x, y, sigma) {
  grad <- -6*y/(5 + y^2)
  mu.m <- y + sigma^2 * grad/2
  return(-(x - mu.m)^2/(2*sigma^2))
}

# MALA sampler
mala <- function(N = 1e5, sigma) {
  chain.mala <- numeric(length = N)
  accept <- 0

  # Starting value is far from 0
  chain.mala[1] <- 10
  for(i in 2:N) {
    grad <- -6*chain.mala[i-1]/(5 + chain.mala[i-1]^2)
    mu.m <- chain.mala[i-1] + sigma^2 * grad/2
    prop <- rnorm(1, mean = mu.m, sd = sigma)
    log.ratio <- loglike(prop) - loglike(chain.mala[i-1]) +
      proplike(chain.mala[i-1], prop, sigma) - proplike(prop, chain.mala[i-1], sigma)
    if(log(runif(1)) < log.ratio) {
      chain.mala[i] <- prop
      accept <- accept + 1
    } else {
      chain.mala[i] <- chain.mala[i-1]
    }
  }
  return(list("chain" = chain.mala, "accept" = accept/N))
}

out.mala <- mala(sigma = 1)
out.mala$accept

## [1] 0.93122

# Tuning MALA
out.mala <- mala(sigma = 2.2)
out.mala$accept

## [1] 0.5353
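Since both chains start at 10, well out in the right tail, one simple illustrative summary of how quickly each sampler reaches the bulk of the distribution is the first iteration at which the chain drops below a cut-off; the cut-off of 2 below is arbitrary.

# First iteration at which each chain falls below 2 (an arbitrary cut-off),
# as a rough indication of how quickly each sampler leaves the starting tail.
c(rwm = which(out.rwm$chain < 2)[1], mala = which(out.mala$chain < 2)[1])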

par(mfrow = c(1, 2))
plot(tail(1:1e5, 1e4), tail(out.rwm$chain, 1e4), main = "", type = 'l',
     xlab = "index", ylab = "")
lines(tail(1:1e5, 1e4), tail(out.mala$chain, 1e4), col = "red")
plot(density(out.rwm$chain), main = "")
lines(density(out.mala$chain), col = "red")

[Figure: trace plot of the last 10,000 iterations and estimated density; RWM in black, MALA in red.]

Ah, we see MALA exploring the space more than RWM. Let us zoom into that.

par(mfrow = c(1, 1))
plot(density(out.rwm$chain), main = "truth is BLUE", xlim = range(c(-5, 5)))
lines(density(out.mala$chain), col = "red")
lines(seq(-5, 5, length = 1e4), dt(seq(-5, 5, length = 1e4), df = 5), col = "blue")

[Figure: density estimates over (-5, 5); RWM in black, MALA in red, true t density in blue.]

Hmm, density estimation seems similar after 1e5 iterations.

par(mfrow = c(1, 2))
acf(out.rwm$chain, main = "RWM")
acf(out.mala$chain, main = "MALA")

[Figure: autocorrelation plots for the RWM and MALA chains.]

Since the autocorrelation seems lower for MALA, the effective sample size for estimating the mean will be larger.

c(ess(out.rwm$chain), ess(out.mala$chain))

##       se       se
## 18930.44 38630.07

Thus, for a distribution with fatter tails, MALA performs (marginally) better. For more complicated, high-dimensional target distributions, I can see how MALA would be a handy tool.

References

Roberts, Gareth O., and Jeffrey S. Rosenthal. 1998. "Optimal Scaling of Discrete Approximations to Langevin Diffusions." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60: 255-68.

Roberts, Gareth O., and Richard L. Tweedie. 1996. "Exponential Convergence of Langevin Distributions and Their Discrete Approximations." Bernoulli: 341-63.

Roberts, Gareth O., Andrew Gelman, and Walter R. Gilks. 1997. "Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms." The Annals of Applied Probability 7: 110-20.