Hamiltonian Monte Carlo



Chapter 7 Hamiltonian Monte Carlo

As with the Metropolis-Hastings algorithm, Hamiltonian (or hybrid) Monte Carlo (HMC) is an idea that has been knocking around in the physics literature since the 1980s (see the excellent review article cum tutorial by Neal, 2011, from which I adapted the material in this chapter), and has been used, albeit infrequently, for statistical inference since the 1990s. In essence, HMC is a clever way to propose values within an MCMC routine. Over one single iteration, the parameter vector follows a deterministic trajectory that determines its precise next value. This is made random by drawing a value for an augmented momentum vector, which controls how the parameter evolves over that trajectory for that iteration. The beauty of this approach is that, for suitably tuned conditions governing the implementation of the trajectory, the proposed value can be far from the original value. In particular, HMC has been shown (see Neal, 2011) to get good mixing for high dimensional problems in which it is hard to devise an effective proposal distribution in a vanilla MCMC routine.

7.1 Fundamentals

Two terms are introduced in HMC. One is the potential energy of a particular value of the parameters. This is just a constant minus the log posterior. The other is the momentum, an auxiliary vector of the same length as the parameter vector. The momentum is redrawn from iteration to iteration, and within an iteration it changes as the parameter configuration explores the parameter space. In vanilla MCMC, the proposed values are created by adding a random displacement to the current location (usually centred on the current location), whereas in HMC, a random initial momentum is generated, given which the proposed value is deterministic [1]. The momentum vector changes over the course of an iteration, according to the gradient of the potential energy (or, equivalently, the log posterior).
This means that to implement an HMC sampler, you need to be able to evaluate the vector of partial derivatives of the log posterior. Below we will use $P(\theta)$ for the potential energy at $\theta$, and

$$G_d(\theta) = \frac{\partial P(\theta)}{\partial \theta_d} \qquad (7.1)$$

for the gradient for the $d$th parameter, and $G(\theta)$ for the entire vector of gradients. Since in each step within each iteration you will have to evaluate the gradient, you really need to have a closed form for the gradient for this to be computationally effective.

[1] There is a small modification that introduces an additional random term: see later.

An entire iteration proceeds as in the following pseudo-code:

1. Set $\theta_0 = \theta$, the current value, and let $i = 0$ be the current step.
2. Generate a momentum $m$ from a (multivariate) standard normal $N(0, I)$, where $I$ is the identity matrix (1s on the leading diagonal and 0s elsewhere) with $D = \dim(\theta)$ rows. Set $m_0 = m$.
3. Repeat the following steps $N$ times:
   a) Set $m_{i+1/2} = m_i - \epsilon G(\theta_i)/2$.
   b) Set $\theta_{i+1} = \theta_i + \epsilon m_{i+1/2}$.
   c) Set $m_{i+1} = m_{i+1/2} - \epsilon G(\theta_{i+1})/2$.
   d) Increment $i$ by 1.
4. Set $\theta^* = \theta_N$ and $m^* = m_N$ to be the proposed parameters and momenta.
5. Evaluate the potential energy at both the old and the proposed values of the parameters, $P_E(\theta)$ and $P_E(\theta^*)$ (this is the $P(\cdot)$ defined above; recall that the potential energy is the negative of the log posterior).
6. Evaluate the kinetic energy at the old and proposed momenta, $K_E(m) = \sum_{d=1}^{D} m_d^2/2$ and $K_E(m^*) = \sum_{d=1}^{D} (m^*_d)^2/2$.
7. Accept the proposal with probability

$$\min\left\{1, \exp\left[P_E(\theta) - P_E(\theta^*) + K_E(m) - K_E(m^*)\right]\right\}.$$

7.2 Refinements

Choosing the number of steps and the step size

As with vanilla MCMC, there are various run-time parameters of the algorithm that need to be selected. In MCMC, these are (i) the distribution family for the proposal, (ii) the mean or location of the proposal distribution and (iii) the variance of the distribution. Typically, a normal distribution centred on the current value is used, and so only (iii) need be selected. In HMC, two decisions must be made: (i) the number of steps ($N$ in the above algorithm) and (ii) the step-size ($\epsilon$). In my (limited) experience, $N = 100$ (or 1000) steps seems to have reasonable performance.
The more steps, the slower the routine will be, but with too few steps the routine will lead either to nearby proposals (not desirable) or to a poor approximation of the trajectory resulting from the initial momentum.
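To get a feel for how the step size affects the accuracy of the trajectory, the following is a minimal sketch (not taken from the chapter's example) that applies the half-step, full-step, half-step updates of step 3 to a toy standard normal target, for which the potential energy is $\theta^2/2$ and the gradient is $\theta$, and records how far the total energy drifts from its starting value. The function name `energy_error`, the starting values and the toy target are all assumptions made for illustration.

```r
# Toy target (an assumption for illustration): standard normal,
# so the potential energy is theta^2/2 and its gradient is theta.
toy_potential = function(theta) theta^2 / 2
toy_gradient = function(theta) theta

# Hypothetical helper: run `steps` leapfrog steps of size `epsilon` and
# return the largest absolute change in total energy seen along the way.
energy_error = function(epsilon, steps, theta = 1, momentum = 0.5) {
  h0 = toy_potential(theta) + momentum^2 / 2
  worst = 0
  for (i in 1:steps) {
    momentum = momentum - epsilon * 0.5 * toy_gradient(theta)  # half step
    theta = theta + epsilon * momentum                         # full step
    momentum = momentum - epsilon * 0.5 * toy_gradient(theta)  # half step
    h = toy_potential(theta) + momentum^2 / 2
    worst = max(worst, abs(h - h0))
  }
  worst
}

small_step_error = energy_error(epsilon = 0.01, steps = 100)
large_step_error = energy_error(epsilon = 0.5, steps = 100)
print(c(small = small_step_error, large = large_step_error))
```

Because the acceptance probability depends only on the change in total energy between the start and the end of the trajectory, a small energy error translates into a high chance of acceptance. In this toy case the error for the smaller step size should come out far below that for the larger one.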

The best choice of step-size may depend on the application. I would suggest starting with $\epsilon = 0.01$ and then increasing or decreasing the size following trial runs. It has been suggested that using a constant step-size may, in some unlucky situations, lead to poor mixing, such as in light-tailed distributions (see references in Neal, 2011). Neal's suggestion is to randomise the choice of $\epsilon$ at each iteration (not each step) to overcome this.

Dealing with boundaries to the parameter space

If the parameter space is bounded and if partway through a trajectory the parameter configuration leaves the legal parameter space, something needs to be done to prevent errors. The solution is to bounce the trajectory backwards on collision with the boundary, by setting $\theta$ to $-\theta$ and $m$ to $-m$ (for a boundary at zero) immediately after updating $\theta$ at step 3b. The physical analogue is a ball bouncing off a wall: the momentum in the dimension perpendicular to the wall (only) is inverted.

7.3 Example: exponential model

To allow visualisation of HMC, we will consider a one-dimensional (but real) example, with a boundary on the single parameter. The example is the leukæmia trial once more, and we shall fit an exponential model to the placebo arm data (which were uncensored event times; also recall that there was no real difference between the exponential and various two parameter models as measured by the DIC). The single parameter $\theta$, the rate parameter, must be positive. From the data, the average relapse time is around 10 d, so the rate should be somewhere around 0.1/d. We will use an improper flat prior on the support of $\theta$. The potential energy (i.e. minus the log posterior) is evaluated using the following function:

    potential_energy = function(theta, data) {
      # illegal (negative) rate: return a large energy
      if (theta < 0) return(999999)
      return(-length(data) * log(theta) + sum(data * theta))
    }

If the parameter value is illegal, a large value is returned.
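For reference, the form of this potential energy, and of its derivative, can be worked out as follows. This is a short derivation added for illustration rather than part of the original text; the symbols $x_1, \dots, x_n$ for the observed relapse times and $n$ for their number are notation introduced here. With a flat prior, the posterior is proportional to the exponential likelihood,

$$p(\theta \mid x) \propto \prod_{i=1}^{n} \theta e^{-\theta x_i} = \theta^{n} \exp\left(-\theta \sum_{i=1}^{n} x_i\right), \qquad \theta > 0,$$

so that, dropping the constant,

$$P(\theta) = -n \log\theta + \theta \sum_{i=1}^{n} x_i, \qquad G(\theta) = \frac{dP(\theta)}{d\theta} = -\frac{n}{\theta} + \sum_{i=1}^{n} x_i.$$

Setting $G(\theta) = 0$ shows that the potential energy is lowest at $\theta = n / \sum_i x_i$, the reciprocal of the mean relapse time.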
The gradient of the potential energy is evaluated thus:

    gradient = function(theta, data) {
      # illegal (negative) rate: no gradient information
      if (theta < 0) return(0)
      return(-length(data) / theta + sum(data))
    }

Each step within an iteration is performed using the following function (note that the gradient function is passed to it as an argument):

    onestep = function(gradient, data, theta, momentum, epsilon) {
      # half step for the momentum
      momentum_p = momentum - epsilon * 0.5 * gradient(theta, data)
      # full step for the parameter
      theta_p = theta + epsilon * momentum_p
      # bounce off the boundary at zero
      if (theta_p < 0) {
        theta_p = -theta_p
        momentum_p = -momentum_p
      }
      # second half step for the momentum, at the new position
      momentum_p = momentum_p - epsilon * 0.5 * gradient(theta_p, data)
      return(list(theta = theta_p, momentum = momentum_p))
    }

Note that the line that checks whether the configuration is illegal would differ for multi-parameter models: each dimension would be checked and, for each that is illegal, that specific parameter and momentum would flip like a ball bouncing off a wall. The following function takes an existing parameter value and performs a series of steps using the function above.

    hmc = function(potential_energy, gradient, data, epsilon0, steps, old_theta) {
      old_momentum = rnorm(length(old_theta), 0, 1)
      new_theta = old_theta
      new_momentum = old_momentum
      # jitter the step size once per iteration
      epsilon = abs(rnorm(1, epsilon0, 0.2 * epsilon0))
      for (step in 1:steps) {
        os = onestep(gradient, data, new_theta, new_momentum, epsilon)
        new_theta = os$theta
        new_momentum = os$momentum
      }
      old_potential_energy = potential_energy(old_theta, data)
      new_potential_energy = potential_energy(new_theta, data)
      old_kinetic_energy = sum(old_momentum^2) / 2
      new_kinetic_energy = sum(new_momentum^2) / 2
      # log acceptance ratio
      la = old_potential_energy - new_potential_energy +
        old_kinetic_energy - new_kinetic_energy
      # log of a uniform(0, 1) draw
      lu = -rexp(1)
      reject = FALSE
      if (lu > la) reject = TRUE
      if (reject) return(old_theta)
      if (!reject) return(new_theta)
    }

Note the jittering of $\epsilon$. Note also that the potential energy at the previous location is not stored in memory; instead it is recalculated from scratch, which is inefficient (though conceptually cleaner). The following code runs the routine:

    data = c(1,1,2,2,3,4,4,5,5,8,8,8,8,11,11,12,12,15,17,22,23)
    theta = 0.1
    HMCits = 1000
    storetheta = array(0, c(HMCits, length(theta)))
    for (iteration in 1:HMCits) {
      if (iteration %% 100 == 0) print(iteration)
      theta = hmc(potential_energy, gradient, data, 0.01, 100, theta)
      storetheta[iteration, ] = theta
    }

Figure 7.1: Proposal densities at four starting points. Starting points are indicated by points, and a kernel density estimate from 1000 simulations is presented by the coloured line. In grey is the actual posterior density. (Four panels; axis label: posterior.)

Figure 7.1 shows the proposal distributions that result from this routine for four different starting values in the middle and tails of the posterior distribution. Note how even very extreme starting values have a high probability of allowing proposals back to the main body of the distribution, or directly to the other tail of the distribution, and that for a starting value near the middle of the distribution, the proposal distribution almost exactly matches the posterior.

Figure 7.2: Trace plots for several tuning parameter configurations. Top left: $\epsilon = 0.01$, $N = 100$. Top right: $\epsilon = 0.001$, $N = 100$. Bottom left: $\epsilon = 0.05$, $N = 100$. Bottom right: $\epsilon = 0.01$, $N =$ [illegible]. (Four panels; axis label: iteration.)

Figure 7.2 shows trace plots for four different configurations of the number of steps ($N$) and the average step-size $\epsilon$ (for each iteration, the value of $\epsilon$ is drawn from a normal distribution with standard deviation 20% of this average). The mixing appears good for all except the routine with an overly large $\epsilon$. The density plots are presented in figure 7.3. As can be seen, all except the poorly mixing chain perform extremely well, and the runs with a small $\epsilon$ are virtually indistinguishable from the actual posterior.
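Because the prior is flat and the likelihood is exponential, the actual posterior in this example is available in closed form: it is a gamma distribution with shape equal to the number of observations plus one and rate equal to the sum of the event times. The following sketch, which is not part of the original code, computes its mean and central 95% interval so that they can be set against the corresponding summaries of the sampled values. The variable names are chosen here for illustration.

```r
# Placebo-arm relapse times, as used in the example above
data = c(1,1,2,2,3,4,4,5,5,8,8,8,8,11,11,12,12,15,17,22,23)

# Flat prior times exponential likelihood is proportional to
# theta^n * exp(-theta * sum(data)), a gamma density with these parameters
post_shape = length(data) + 1
post_rate = sum(data)

exact_mean = post_shape / post_rate
exact_interval = qgamma(c(0.025, 0.975), shape = post_shape, rate = post_rate)
print(exact_mean)
print(exact_interval)

# If a matrix of HMC draws such as `storetheta` is available, the same
# summaries can be computed from it for comparison, for example:
# mean(storetheta); quantile(storetheta, c(0.025, 0.975))
```

A well-mixing chain should give a sample mean and sample quantiles close to these exact values, which offers a numerical complement to the visual comparison of the density plots.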

Figure 7.3: Density estimates of posterior for several tuning parameter configurations. Orange: $\epsilon = 0.01$, $N = 100$. Blue: $\epsilon = 0.001$, $N = 100$. Gold: $\epsilon = 0.05$, $N = 100$. Green: $\epsilon = 0.01$, $N =$ [illegible]. (Axis label: density.)

7.4 In multiple dimensions

The attraction of HMC is that it promises to perform efficiently in multiple dimensions, especially when the posteriors for several parameters are closely correlated with each other. Neal (2011) shows an application to a one-hundred dimensional posterior in which HMC gives extremely good mixing (in contrast to vanilla MCMC) and very accurate estimates of the true parameters (ditto): see his figures 5.6 and 5.7. In figure 7.4, I present the proposal distributions for a two dimensional problem with correlated parameters, illustrating how the HMC routine adapts to the correlated shape as each proposal is formed, thereby allowing efficient movement around the posterior distribution.

Activities

The best way to understand HMC (as with all the methods in the course) is to apply it. You might start with a simple problem like estimating a probability (see applications in chapter 1) to ensure you can visualise and understand what is going on in HMC. You should also attempt a two (or more) dimensional problem, so you are comfortable with extending the code presented above to vectors of parameters: the gradient function must change to output a vector (one gradient per parameter), and onestep needs to check each entry in the parameter vector for illegality.
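As a possible starting point for the two-dimensional activity, here is a minimal sketch of how the one-parameter code might be extended to a parameter vector. It is an illustration under assumptions rather than the code used for the figures: the target (a zero-mean bivariate normal with correlation 0.9), the function names ending in `_mv`, the `lower` argument giving a lower bound for each parameter, and the tuning values are all choices made here.

```r
set.seed(1)

# Assumed target: zero-mean bivariate normal with correlation 0.9
Sigma = matrix(c(1, 0.9, 0.9, 1), 2, 2)
Sigma_inv = solve(Sigma)

# Potential energy: minus the log density, up to a constant
potential_energy_mv = function(theta) 0.5 * sum(theta * (Sigma_inv %*% theta))

# Gradient: a vector with one entry per parameter
gradient_mv = function(theta) as.vector(Sigma_inv %*% theta)

# One leapfrog step; entries that cross their lower bound are reflected,
# and only those entries have their momentum flipped
onestep_mv = function(theta, momentum, epsilon, lower) {
  momentum = momentum - epsilon * 0.5 * gradient_mv(theta)
  theta = theta + epsilon * momentum
  illegal = theta < lower
  theta[illegal] = 2 * lower[illegal] - theta[illegal]
  momentum[illegal] = -momentum[illegal]
  momentum = momentum - epsilon * 0.5 * gradient_mv(theta)
  list(theta = theta, momentum = momentum)
}

# One HMC iteration for a parameter vector
hmc_mv = function(old_theta, epsilon0, steps, lower) {
  old_momentum = rnorm(length(old_theta))
  theta = old_theta
  momentum = old_momentum
  epsilon = abs(rnorm(1, epsilon0, 0.2 * epsilon0))  # jittered step size
  for (step in 1:steps) {
    os = onestep_mv(theta, momentum, epsilon, lower)
    theta = os$theta
    momentum = os$momentum
  }
  # log acceptance ratio: old total energy minus new total energy
  la = potential_energy_mv(old_theta) - potential_energy_mv(theta) +
    sum(old_momentum^2) / 2 - sum(momentum^2) / 2
  if (log(runif(1)) < la) theta else old_theta
}

its = 2000
draws = matrix(0, its, 2)
theta = c(0, 0)
lower = c(-Inf, -Inf)  # this target has no boundaries
for (iteration in 1:its) {
  theta = hmc_mv(theta, epsilon0 = 0.15, steps = 20, lower = lower)
  draws[iteration, ] = theta
}
print(colMeans(draws))
print(cor(draws[, 1], draws[, 2]))
```

For a bounded problem, `lower` would hold the actual bounds (for instance `c(0, 0)` for two positive parameters), and the potential energy and gradient functions would be replaced by those of the model at hand.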

Figure 7.4: Proposal distribution for four starting values in a two-dimensional problem. The target distribution is indicated in grey in the background (it is a bivariate normal). For four starting values, a set of random momentum variables is drawn and the resulting trajectories plotted as lines.

References

Neal RM (2011) MCMC using Hamiltonian dynamics. In Brooks S, Gelman A, Jones G and Meng X-L, eds, Handbook of Markov Chain Monte Carlo. Chapman and Hall/CRC.
