MALA versus Random Walk Metropolis

Dootika Vats

June 4, 2017

Introduction

My research thus far has predominantly been on output analysis for Markov chain Monte Carlo. The examples on which I have implemented our methods have been Gibbs samplers, vanilla Metropolis-Hastings samplers, or Metropolis-within-Gibbs samplers. I have been somewhat distant from the wide variety of samplers that users can choose from. One of the more popular ones is the Metropolis-adjusted Langevin algorithm (MALA), introduced in Roberts and Tweedie (1996) and further studied in Roberts and Rosenthal (1998). MALA is a Metropolis-Hastings sampler with a special proposal distribution: the proposal evaluates the gradient of the log density at the current state and shifts the center of the proposal distribution by a scaled factor of this gradient. MALA is based on Langevin diffusions, a connection I am going to ignore in this article due to lack of knowledge at this point.

MALA

Let π(x) be the target distribution for the MCMC sampler, defined on a p-dimensional space. A generic Metropolis-Hastings sampler at the current value x proposes a value y from a proposal distribution with density q(x, ·). The proposed value y is accepted with probability

$$\min\left\{1, \; \frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)}\right\}.$$

Different choices of q lead to different samplers. For MALA, q(x, ·) is the density of

$$N\left(x + \frac{\sigma_1^2}{2} \nabla \log \pi(x), \; \sigma_1^2 D\right),$$

where $\sigma_1^2 > 0$ is the step size and D is a p × p positive definite matrix. The role of D is similar to that of the covariance matrix in the proposal for the random walk Metropolis sampler. We will just take D to be the identity matrix in our examples. If $\sigma_1^2$ is tuned properly, the MALA proposal forces the center of the proposal distribution to climb the gradient. This enables the sampler to move away from the tails faster than more naive samplers.

We will compare the performance of the MALA sampler with that of the random walk Metropolis (RWM) sampler. The RWM sampler uses the proposal distribution

$$N(x, \; \sigma_2^2 D),$$

where the purpose of $\sigma_2^2$ and D is the same as before.

Roberts and Rosenthal (1998) concluded that an optimal acceptance rate for MALA is 0.574, and Roberts, Gelman, and Gilks (1997) concluded that the optimal acceptance rate for the RWM is 0.234. We will tune both $\sigma_1^2$ and $\sigma_2^2$ to achieve these acceptance rates.
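To spell out the asymmetry that the MALA implementations below must handle, here is the log acceptance ratio written out (a worked expansion of the formulas above, with D = I as used in our examples):

$$\log \alpha(x, y) = \log \pi(y) - \log \pi(x) + \log q(y, x) - \log q(x, y),$$

where

$$\log q(x, y) = \text{const} - \frac{1}{2\sigma_1^2} \left\| y - x - \frac{\sigma_1^2}{2} \nabla \log \pi(x) \right\|^2.$$

For RWM, q(x, y) = q(y, x) and the q terms cancel; for MALA they do not, which is why its code carries them along.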
Example 1: Gaussian

Let the target distribution be a bivariate Normal distribution (nothing too complicated):

$$N\left( \begin{pmatrix} 3 \\ 6 \end{pmatrix}, \; \begin{pmatrix} 2 & 0.5 \\ 0.5 & 1 \end{pmatrix} \right).$$

We denote the mean vector by µ and the 2 × 2 covariance matrix by Σ. The density for this distribution satisfies

$$\pi(x) \propto \exp\left\{ -\frac{(x - \mu)^T \Sigma^{-1} (x - \mu)}{2} \right\}$$

$$\log \pi(x) = \text{const} - \frac{(x - \mu)^T \Sigma^{-1} (x - \mu)}{2}$$

$$\nabla \log \pi(x) = -\Sigma^{-1} (x - \mu).$$

```r
mu <- c(3, 6)
Sigma <- matrix(c(2, .5, .5, 1), nrow = 2, ncol = 2)
Sigma.inv <- solve(Sigma)

# Log density (up to an additive constant) of the bivariate normal
loglike <- function(x)
{
  return(as.numeric(- t(x - mu) %*% Sigma.inv %*% (x - mu)/2))
}
```

First I write down the function for the RWM sampler. Note that this proposal is symmetric, so the proposal densities cancel in the acceptance ratio.

```r
set.seed(100)
rwm <- function(N = 1e5, sigma)
{
  chain.rwm <- matrix(0, nrow = N, ncol = 2)
  accept <- 0

  # Starting value is the origin
  chain.rwm[1, ] <- c(0, 0)
  for(i in 2:N)
  {
    prop <- rnorm(2, mean = chain.rwm[i-1, ], sd = sigma)
    log.ratio <- loglike(prop) - loglike(chain.rwm[i-1, ])
    if(log(runif(1)) < log.ratio)
    {
      chain.rwm[i, ] <- prop
      accept <- accept + 1
    } else
    {
      chain.rwm[i, ] <- chain.rwm[i-1, ]
    }
  }
  return(list("chain" = chain.rwm, "accept" = accept/N))
}

out.rwm <- rwm(sigma = 1)
out.rwm$accept
```

```
## [1] 0.58784
```
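Before tuning, a quick sanity check of loglike (my addition; it assumes the mvtnorm package is installed, and x1, x2 are arbitrary test points): differences of log densities should match mvtnorm::dmvnorm exactly, since the additive constant cancels.

```r
# Hypothetical check, not part of the original analysis
library(mvtnorm)
x1 <- c(1, 2); x2 <- c(4, 7)  # arbitrary test points
loglike(x1) - loglike(x2)
dmvnorm(x1, mean = mu, sigma = Sigma, log = TRUE) -
  dmvnorm(x2, mean = mu, sigma = Sigma, log = TRUE)
# the two printed differences should agree
```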
```r
# Calibrated sigma to get close to the optimal rate
out.rwm <- rwm(sigma = 2.5)
out.rwm$accept
```

```
## [1] 0.25726
```

Coding up the MALA sampler is slightly more complicated, since the proposal distribution is no longer symmetric and the proposal densities do not cancel in the acceptance ratio.

```r
# Log density (up to constant) of the MALA proposal: log q(y, x),
# i.e., the density of proposing x when the chain is at y
proplike <- function(x, y, sigma)
{
  grad <- - Sigma.inv %*% (y - mu)
  mu.m <- y + sigma^2 * grad/2
  return(as.numeric(- t(x - mu.m) %*% (x - mu.m)/(2*sigma^2)))
}

# MALA sampler
mala <- function(N = 1e5, sigma)
{
  chain.mala <- matrix(0, nrow = N, ncol = 2)
  accept <- 0

  # Starting value is the origin
  chain.mala[1, ] <- c(0, 0)
  for(i in 2:N)
  {
    grad <- - Sigma.inv %*% (chain.mala[i-1, ] - mu)
    mu.m <- chain.mala[i-1, ] + sigma^2 * grad/2
    prop <- rnorm(2, mean = mu.m, sd = sigma)
    log.ratio <- loglike(prop) - loglike(chain.mala[i-1, ]) +
      proplike(chain.mala[i-1, ], prop, sigma) - proplike(prop, chain.mala[i-1, ], sigma)
    if(log(runif(1)) < log.ratio)
    {
      chain.mala[i, ] <- prop
      accept <- accept + 1
    } else
    {
      chain.mala[i, ] <- chain.mala[i-1, ]
    }
  }
  return(list("chain" = chain.mala, "accept" = accept/N))
}

out.mala <- mala(sigma = 1)
out.mala$accept
```

```
## [1] 0.29073
```

```r
# Tuning MALA
out.mala <- mala(sigma = .5)
out.mala$accept
```

```
## [1] 0.5763
```
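The σ values above were found by hand. Here is a minimal sketch of how one could automate that search (my own addition, not part of the original analysis; tune_sigma is a hypothetical helper, and it assumes the acceptance rate decreases monotonically in σ, with short pilot runs that make the result noisy):

```r
# Hypothetical helper: bisect on sigma until the empirical acceptance
# rate from a short pilot run is near the target rate.
tune_sigma <- function(sampler, target, lo = 0.01, hi = 10, steps = 15)
{
  for(j in 1:steps)
  {
    mid <- (lo + hi)/2
    acc <- sampler(N = 1e4, sigma = mid)$accept
    if(acc > target) lo <- mid else hi <- mid  # larger sigma => lower acceptance
  }
  return(mid)
}
# e.g., tune_sigma(mala, 0.574) or tune_sigma(rwm, 0.234)
```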
To compare the performance of the two samplers, we plot some graphs. The first is the trace plot for the two components.

```r
par(mfrow = c(1, 2))
plot(tail(1:1e5, 1e4), tail(out.rwm$chain[,1], 1e4), ylab = "First Component",
     main = "", type = 'l', xlab = "index")
lines(tail(1:1e5, 1e4), tail(out.mala$chain[,1], 1e4), col = "red")
plot(tail(1:1e5, 1e4), tail(out.rwm$chain[,2], 1e4), ylab = "Second Component",
     main = "", type = 'l', xlab = "index")
lines(tail(1:1e5, 1e4), tail(out.mala$chain[,2], 1e4), col = "red")
```

[Figure: trace plots of the last 10,000 iterations for the first and second components; RWM in black, MALA in red.]

The trace plots of the two samplers look similar. MALA looks like it may produce thinner tails and concentrate on areas of high probability.

```r
par(mfrow = c(1, 2))
plot(density(out.rwm$chain[,1]), ylab = "First Component", main = "")
lines(density(out.mala$chain[,1]), col = "red")
plot(density(out.rwm$chain[,2]), ylab = "Second Component", main = "")
lines(density(out.mala$chain[,2]), col = "red")
```

[Figure: estimated marginal densities of the two components; RWM in black, MALA in red.]

In terms of autocorrelation, we see the following results.

```r
par(mfrow = c(2, 3))
acf(out.rwm$chain[,1], main = "RWM: First")
acf(out.rwm$chain[,2], main = "RWM: Second")
ccf(out.rwm$chain[,1], out.rwm$chain[,2], main = "RWM: CCF")
acf(out.mala$chain[,1], main = "MALA: First")
acf(out.mala$chain[,2], main = "MALA: Second")
ccf(out.mala$chain[,1], out.mala$chain[,2], main = "MALA: CCF")
```

[Figure: ACFs of each component and the cross-correlation function, for RWM (top row) and MALA (bottom row).]

Interestingly, MALA produces much higher autocorrelation and significantly higher cross-correlation. This implies that the multivariate effective sample size of MALA for estimating the mean of the Normal distribution will be smaller. Here is the computation.

```r
library(mcmcse)
```

```
## mcmcse: Monte Carlo Standard Errors for MCMC
## Version 1.2-1 created on 2016-03-24.
## copyright (c) 2012, James M. Flegal, University of California, Riverside
##                     John Hughes, University of Minnesota
##                     Dootika Vats, University of Minnesota
## For citation information, type citation("mcmcse").
## Type help("mcmcse-package") to get started.
```

```r
c(multiESS(out.rwm$chain), multiESS(out.mala$chain))
```

```
## [1] 12862.660  1910.657
```

And clearly, for a Monte Carlo sample of size 1e5, the effective sample size is much smaller for MALA than for RWM. So both RWM and MALA yield decent density estimates, but for estimating the mean (3, 6), RWM is clearly favored over MALA due to the loss of efficiency.
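A natural follow-up (my addition, not in the original analysis): the mcmcse package can also report marginal Monte Carlo standard errors for the mean estimates via mcse.mat, which tells the same story on the scale of the estimates themselves.

```r
# Batch-means estimate and standard error for each component's mean;
# MALA's standard errors should come out noticeably larger here.
mcse.mat(out.rwm$chain)
mcse.mat(out.mala$chain)
```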
Intuitively, I think MALA will work better for fatter-tailed distributions. Let's see.

Example 2: Heavy tails

Let the target be a t distribution with 5 degrees of freedom:

$$\pi(x) \propto \left(1 + \frac{x^2}{5}\right)^{-3}$$

$$\log \pi(x) = \text{const} - 3 \log\left(1 + \frac{x^2}{5}\right)$$

$$\frac{d}{dx} \log \pi(x) = -\frac{6x}{5 + x^2}.$$

```r
# Log density (up to an additive constant) of the t distribution with 5 df
loglike <- function(x)
{
  return(-3*log(1 + x^2/5))
}
```
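As a quick check of the hand-derived gradient above (my addition; it assumes the numDeriv package is installed and uses an arbitrary test point x0), the closed form should match a numerical derivative:

```r
# Hypothetical check, not part of the original analysis
library(numDeriv)
x0 <- 1.7  # arbitrary test point
c(analytic = -6*x0/(5 + x0^2), numeric = grad(loglike, x0))
# the two values should agree to numerical precision
```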
Below is the RWM implementation.

```r
rwm <- function(N = 1e5, sigma)
{
  chain.rwm <- numeric(length = N)
  accept <- 0

  # Starting value is far from 0
  chain.rwm[1] <- 10
  for(i in 2:N)
  {
    prop <- rnorm(1, mean = chain.rwm[i-1], sd = sigma)
    log.ratio <- loglike(prop) - loglike(chain.rwm[i-1])
    if(log(runif(1)) < log.ratio)
    {
      chain.rwm[i] <- prop
      accept <- accept + 1
    } else
    {
      chain.rwm[i] <- chain.rwm[i-1]
    }
  }
  return(list("chain" = chain.rwm, "accept" = accept/N))
}

out.rwm <- rwm(sigma = 1)
out.rwm$accept
```

```
## [1] 0.72299
```

```r
# Calibrated sigma to get close to the optimal rate
out.rwm <- rwm(sigma = 5)
out.rwm$accept
```

```
## [1] 0.2745
```

Coding up the MALA sampler is again slightly more complicated, since the proposal distribution is not symmetric and the proposal densities do not cancel in the acceptance ratio.

```r
# Log density (up to constant) of the MALA proposal: log q(y, x),
# i.e., the density of proposing x when the chain is at y
proplike <- function(x, y, sigma)
{
  grad <- -6*y/(5 + y^2)
  mu.m <- y + sigma^2 * grad/2
  return(- (x - mu.m)^2/(2*sigma^2))
}

# MALA sampler
mala <- function(N = 1e5, sigma)
{
  chain.mala <- numeric(length = N)
  accept <- 0

  # Starting value is far from 0
  chain.mala[1] <- 10
  for(i in 2:N)
  {
    grad <- -6*chain.mala[i-1]/(5 + chain.mala[i-1]^2)
    mu.m <- chain.mala[i-1] + sigma^2 * grad/2
    prop <- rnorm(1, mean = mu.m, sd = sigma)
    log.ratio <- loglike(prop) - loglike(chain.mala[i-1]) +
      proplike(chain.mala[i-1], prop, sigma) - proplike(prop, chain.mala[i-1], sigma)
    if(log(runif(1)) < log.ratio)
    {
      chain.mala[i] <- prop
      accept <- accept + 1
    } else
    {
      chain.mala[i] <- chain.mala[i-1]
    }
  }
  return(list("chain" = chain.mala, "accept" = accept/N))
}

out.mala <- mala(sigma = 1)
out.mala$accept
```

```
## [1] 0.93122
```

```r
# Tuning MALA
out.mala <- mala(sigma = 2.2)
out.mala$accept
```

```
## [1] 0.5353
```

```r
par(mfrow = c(1, 2))
plot(tail(1:1e5, 1e4), tail(out.rwm$chain, 1e4), main = "", type = 'l', xlab = "index")
lines(tail(1:1e5, 1e4), tail(out.mala$chain, 1e4), col = "red")
plot(density(out.rwm$chain), main = "")
lines(density(out.mala$chain), col = "red")
```

[Figure: left, trace plot of the last 10,000 iterations (RWM in black, MALA in red); right, density estimates of the two chains.]

Ah, we see MALA exploring the space more than RWM. Let us zoom into that.

```r
par(mfrow = c(1, 1))
plot(density(out.rwm$chain), main = "truth is BLUE", xlim = range(c(-5, 5)))
lines(density(out.mala$chain), col = "red")
lines(seq(-5, 5, length = 1e4), dt(seq(-5, 5, length = 1e4), df = 5), col = "blue")
```

[Figure: density estimates from RWM (black) and MALA (red) against the true t density (blue).]

Hmm, the density estimates look similar after 1e5 iterations.

```r
par(mfrow = c(1, 2))
acf(out.rwm$chain, main = "RWM")
acf(out.mala$chain, main = "MALA")
```
[Figure: ACF of the RWM chain (left) and the MALA chain (right).]

Since the autocorrelation seems lower for MALA, the effective sample size for estimating the mean will be larger.

```r
c(ess(out.rwm$chain), ess(out.mala$chain))
```

```
##       se       se
## 18930.44 38630.07
```

Thus, for a distribution with fatter tails, MALA performs (marginally) better. For more complicated high-dimensional target distributions, I can understand how MALA would be a handy tool.

References

Roberts, Gareth O., and Jeffrey S. Rosenthal. 1998. "Optimal Scaling of Discrete Approximations to Langevin Diffusions." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60: 255–68.

Roberts, Gareth O., and Richard L. Tweedie. 1996. "Exponential Convergence of Langevin Distributions and Their Discrete Approximations." Bernoulli 2: 341–63.

Roberts, Gareth O., Andrew Gelman, and Walter R. Gilks. 1997. "Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms." The Annals of Applied Probability 7: 110–20.