Introduction to Hamiltonian Monte Carlo Method

Mingwei Tang
Department of Statistics, University of Washington
mingwt@uw.edu

November 14, 2017
Hamiltonian System

Notation:
- q ∈ R^d: position vector; p ∈ R^d: momentum vector
- Hamiltonian H(p, q): R^{2d} → R

Evolution equations of the Hamiltonian system:

  dq/dt = ∂H/∂p
  dp/dt = −∂H/∂q        (1)
Potential and Kinetic Energy

Decompose the Hamiltonian as H(p, q) = U(q) + K(p):
- U(q): potential energy, depends on position
- K(p): kinetic energy, depends on momentum

Motivating example: free fall
- U(q) = mgq, K(p) = (1/2)mv^2 = p^2/(2m)
- H(p, q) = mgq + p^2/(2m) is the total energy
- Velocity: v = dq/dt = ∂H/∂p = p/m
- Force: F = dp/dt = −∂H/∂q = −mg
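The free-fall example can be checked numerically. A minimal sketch (the constants m, g, q0, p0 are arbitrary illustrative values) verifying that the Hamiltonian stays constant along the exact trajectory implied by Hamilton's equations:

```python
import numpy as np

# Free fall: H(p, q) = m*g*q + p**2 / (2*m).
# Hamilton's equations give dq/dt = p/m and dp/dt = -m*g, so the exact
# trajectory is p(t) = p0 - m*g*t, q(t) = q0 + p0*t/m - g*t**2/2.
m, g = 2.0, 9.8
q0, p0 = 10.0, 3.0

def H(p, q):
    return m * g * q + p**2 / (2 * m)

ts = np.linspace(0.0, 1.0, 50)
ps = p0 - m * g * ts
qs = q0 + p0 * ts / m - 0.5 * g * ts**2
energies = H(ps, qs)

# Total energy is conserved along the exact trajectory.
print(np.allclose(energies, H(p0, q0)))  # True
```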
Properties of the Hamiltonian System

1. Reversibility: the mapping T_s: (q(t), p(t)) → (q(t+s), p(t+s)) is one-to-one. Its inverse is obtained by negating p, applying T_s, then negating p again.
2. Conservation (the Hamiltonian is invariant):

  dH/dt = (∂H/∂q)(dq/dt) + (∂H/∂p)(dp/dt) = (∂H/∂q)(∂H/∂p) − (∂H/∂p)(∂H/∂q) = 0

so H(p, q) is constant over time t.
3. Volume preservation: the map T_s preserves volume in (p, q) space; for small δ, the Jacobian of T_δ satisfies |det ∂T_δ/∂(p, q)| = 1.
Idea of HMC

- D: observed data; q: parameters (latent variables); π(q): prior distribution
- Likelihood function L(D | q)
- Posterior distribution Pr(q | D) ∝ L(D | q)π(q)
- Position = parameters; potential = negative log-posterior:

  U(q) = −log[L(D | q)π(q)]

- Introduce an auxiliary variable p for the kinetic energy:

  K(p) = −log N(p; 0, M) = Σ_{i=1}^d p_i^2/(2m_i) + const   (for diagonal M = diag(m_1, ..., m_d))

- p and q are independent
- Hamiltonian: H(p, q) = U(q) + K(p)
Idea of HMC: Cont.

Now that U(q) and K(p) are defined, relate them to a distribution.

Canonical distribution:

  Pr(p, q) = (1/Z) exp(−H(p, q)/T) = (1/Z) exp(−U(q)/T) exp(−K(p)/T)   (2)

where T is the temperature and Z is the normalizing constant.

- Usually set T = 1, so Pr(p, q) factors into the posterior distribution (for q) times a multivariate Gaussian (for p)
- Goal: sample (p, q) jointly from the canonical distribution
Ideal HMC

Specify a variance matrix M and a time s > 0. For i = 1, ..., N:
1. Sample p^(i) from N(0, M)
2. Starting from the current (p^(i), q^(i−1)), integrate the Hamiltonian system for time s: (p*, q*) ← T_s(p^(i), q^(i−1)) (leaves H(·, ·) invariant)
3. Set q^(i) ← q*, p^(i) ← p*

Output q^(1), ..., q^(N) as posterior samples.

Problem: the Hamiltonian system may not have a closed-form solution, so a numerical method for the ODE system is needed.
Numerical ODE Integrator

Target problem:

  dq/dt = ∂H/∂p = M^{-1} p
  dp/dt = −∂H/∂q = ∇_q log[L(D | q)π(q)]

Leapfrog method, for a small step size ɛ > 0:

  p(t + ɛ/2) = p(t) − (ɛ/2) ∂U/∂q (q(t))
  q(t + ɛ) = q(t) + ɛ M^{-1} p(t + ɛ/2)
  p(t + ɛ) = p(t + ɛ/2) − (ɛ/2) ∂U/∂q (q(t + ɛ))
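The leapfrog updates above can be sketched in code. This is a minimal illustration, not Stan's implementation; the function and argument names (leapfrog, grad_U, M_inv) are my own, and the example target is an assumed standard Gaussian with U(q) = q^2/2:

```python
import numpy as np

def leapfrog(grad_U, p, q, eps, L, M_inv):
    """Run L leapfrog steps of size eps.
    grad_U(q) returns the gradient of the potential U = -log posterior;
    M_inv is the inverse mass matrix (scalar here for a diagonal M)."""
    p = p - 0.5 * eps * grad_U(q)          # initial half step for momentum
    for i in range(L):
        q = q + eps * M_inv * p            # full step for position
        if i < L - 1:
            p = p - eps * grad_U(q)        # two half momentum steps merged
    p = p - 0.5 * eps * grad_U(q)          # final half step for momentum
    return p, q

# Example: standard Gaussian target, U(q) = q**2 / 2, so grad_U(q) = q.
grad_U = lambda q: q
p_new, q_new = leapfrog(grad_U, p=1.0, q=0.0, eps=0.1, L=10, M_inv=1.0)

# H = U + K should be nearly unchanged (error of order eps**2).
print(0.5 * q_new**2 + 0.5 * p_new**2)   # close to the initial value 0.5
```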
Numerical stability for the Hamiltonian system

[Figure slide; graphical content not recoverable from the extraction.]
Properties of Leapfrog

- Time reversibility: integrate n steps forward and then n steps backward, and you arrive at the same starting point.
- Symplectic property: conserves a (slightly modified) energy exactly.
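Time reversibility is easy to demonstrate numerically: integrating forward, flipping the momentum, and integrating again returns the starting point up to floating-point roundoff. A small sketch (my own helper names, with an assumed standard Gaussian target so grad_U(q) = q):

```python
import numpy as np

def leapfrog(p, q, eps, L):
    # Leapfrog for the standard Gaussian target, grad_U(q) = q.
    grad_U = lambda q: q
    p = p - 0.5 * eps * grad_U(q)
    for i in range(L):
        q = q + eps * p
        if i < L - 1:
            p = p - eps * grad_U(q)
    p = p - 0.5 * eps * grad_U(q)
    return p, q

p0, q0 = 1.3, -0.7
p1, q1 = leapfrog(p0, q0, eps=0.1, L=25)    # integrate forward
p2, q2 = leapfrog(-p1, q1, eps=0.1, L=25)   # flip momentum, integrate again
# Flipping the momentum once more recovers the starting state.
print(np.allclose([-p2, q2], [p0, q0]))      # True
```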
Ideal HMC: Review

Specify a variance M and a time s > 0. For i = 1, ..., N:
1. Sample p^(i) from N(0, M)
2. Starting from the current (p^(i), q^(i−1)), integrate the Hamiltonian system for time s: (p*, q*) ← T_s(p^(i), q^(i−1))
3. Set q^(i) ← q*, p^(i) ← p*

Output q^(1), ..., q^(N) as posterior samples.

A numerical integrator does not leave H(p, q) unchanged during integration: H(p*, q*) ≠ H(p^(i), q^(i−1)). This needs to be corrected.
HMC in Practice

Specify a variance matrix M, a step size ɛ > 0, and L, the number of leapfrog steps. For i = 1, ..., N:
1. Sample p^(i) from N(0, M)
2. Starting from the current (p^(i), q^(i−1)), run (p*, q*) ← Leapfrog(p^(i), q^(i−1), ɛ, L)
3. Metropolis-Hastings step: with probability

  α = min{1, Pr(p*, q*)/Pr(p^(i), q^(i−1))}

set q^(i) ← q*, p^(i) ← p* (leaves the canonical distribution invariant); otherwise keep the previous state.

Output q^(1), ..., q^(N) as posterior samples.
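The three steps above fit in a short sampler. A minimal sketch with M = I and an assumed one-dimensional standard Gaussian target (so U(q) = q^2/2); the function names and tuning values are illustrative, not prescriptive:

```python
import numpy as np

rng = np.random.default_rng(0)

def U(q):      return 0.5 * q**2   # potential = -log posterior (toy target)
def grad_U(q): return q
def K(p):      return 0.5 * p**2   # kinetic energy with M = 1

def hmc(q, eps=0.2, L=20, n_samples=2000):
    samples = []
    for _ in range(n_samples):
        p = rng.normal()                              # 1. resample momentum
        p_new, q_new = p, q
        p_new = p_new - 0.5 * eps * grad_U(q_new)     # 2. leapfrog trajectory
        for i in range(L):
            q_new = q_new + eps * p_new
            if i < L - 1:
                p_new = p_new - eps * grad_U(q_new)
        p_new = p_new - 0.5 * eps * grad_U(q_new)
        # 3. Metropolis-Hastings correction on the Hamiltonian
        log_alpha = (U(q) + K(p)) - (U(q_new) + K(p_new))
        if np.log(rng.uniform()) < log_alpha:
            q = q_new
        samples.append(q)
    return np.array(samples)

draws = hmc(q=0.0)
print(draws.mean(), draws.std())   # should be roughly 0 and 1
```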
Comparison with Random-Walk Metropolis-Hastings

- HMC: proposal based on Hamiltonian dynamics, not a random walk
- Random-walk Metropolis-Hastings (RWMH) needs more steps to obtain an independent sample
- Optimal acceptance rate: HMC (65%), RWMH (23%)
- Computation in dimension d: number of iterations per independent sample is O(d^{1/4}) for HMC vs O(d) for RWMH; total computation is O(d^{5/4}) vs O(d^2)
- See (Roberts et al. 2001) and (Neal 2011) for more details
Tuning Parameters

Step size ɛ:
- Large ɛ: low acceptance rate
- Small ɛ: wasted computation; random-walk behavior if the trajectory length ɛL is too small
- Different regions may need different ɛ, e.g. choose ɛ at random

Number of leapfrog steps L:
- Trajectory length is crucial for exploring the state space systematically
- The target may be much more constrained in some directions than in others
- Long trajectories can make U-turns
NUTS

Solution: the No-U-Turn Sampler (NUTS) (Hoffman and Gelman 2014)
- Adaptive selection of the number of leapfrog steps L
- Adaptive selection of the step size ɛ
- The exact algorithm behind Stan!
NUTS: Selecting L

Criterion for U-turns:

  (d/dt) ||q_t − q_0||^2 / 2 = (q_t − q_0)^T p_t < 0   (3)

Start from (p^(i), q^(i−1)):
1. Run leapfrog steps until (3) happens; this gives a candidate set B of (p, q) pairs
2. Select a subset C ⊆ B satisfying a detailed balance condition
3. Randomly select q^(i) from C
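Criterion (3) itself is a one-line dot product. A sketch of the check (the function name and the example points are mine; this is only the stopping test, not the full NUTS tree construction):

```python
import numpy as np

def u_turn(q0, q_t, p_t):
    """U-turn criterion (3): the squared distance from q0 is shrinking
    when the momentum points back toward the starting position."""
    return np.dot(q_t - q0, p_t) < 0.0

q0 = np.array([0.0, 0.0])
# Momentum still points away from q0: no U-turn yet.
print(u_turn(q0, np.array([1.0, 1.0]), np.array([0.5, 0.5])))    # False
# Momentum now points back toward q0: U-turn detected.
print(u_turn(q0, np.array([1.0, 1.0]), np.array([-0.5, -0.5])))  # True
```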
Selecting the Step Size ɛ

- Warm-up phase of M_adapt iterations
- Let H_t be the acceptance probability at the t-th iteration, e.g.

  H_t = min{1, Pr(p*, q*)/Pr(p^(t), q^(t−1))}

  for one step, and h(ɛ) = E[H_t | ɛ]
- Dual averaging in each iteration to solve h(ɛ) = δ, where δ is the optimal acceptance rate (δ = 0.65 for HMC)
- Fix ɛ at the adapted value after M_adapt iterations
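The dual-averaging recursion can be sketched in a few lines. Below, instead of running HMC, a hypothetical toy acceptance model alpha(eps) = exp(-eps) stands in for h(ɛ), so the equation h(ɛ) = 0.65 has the known solution ɛ* = −log(0.65) ≈ 0.43; the update rules and default constants (γ, t0, κ) follow Hoffman and Gelman (2014):

```python
import numpy as np

def alpha(eps):
    # Toy stand-in for the average acceptance probability h(eps).
    return min(1.0, np.exp(-eps))

delta = 0.65                    # target acceptance rate
gamma, t0, kappa = 0.05, 10.0, 0.75
eps = 1.0
mu = np.log(10 * eps)           # bias the search toward larger step sizes
h_bar, log_eps_bar = 0.0, 0.0

for t in range(1, 2001):
    # Running average of the "error" delta - alpha, then step-size update.
    h_bar = (1 - 1 / (t + t0)) * h_bar + (delta - alpha(eps)) / (t + t0)
    log_eps = mu - np.sqrt(t) / gamma * h_bar
    log_eps_bar = t**(-kappa) * log_eps + (1 - t**(-kappa)) * log_eps_bar
    eps = np.exp(log_eps)

eps_bar = np.exp(log_eps_bar)
print(eps_bar)   # close to -log(0.65), about 0.43
```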
Summary

HMC: an MCMC algorithm that makes use of Hamiltonian dynamics
- Parameters as position, negative log-posterior as potential energy
- Proposes new states based on Hamiltonian dynamics
- Leapfrog for numerical simulation; sensitive to tuning

NUTS: HMC with adaptive tuning of (L, ɛ) for more efficient proposals
- L: avoid U-turns
- ɛ: dual-averaging optimization to bring the acceptance rate close to the optimum
Implement Your Own HMC

Review the Hamiltonian dynamics:

  dq/dt = ∂H/∂p = M^{-1} p
  dp/dt = −∂H/∂q = ∇_q log[L(D | q)π(q)]

Need the gradient:

  ∇_q log[L(D | q)π(q)] = ∇_q log L(D | q) + ∇_q log π(q)

- Stan: automatic gradient calculation
- With gradients available, Stan can also do gradient-based optimization (quasi-Newton methods such as L-BFGS)
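If you derive the gradient by hand instead of using automatic differentiation, a standard sanity check is to compare it against central finite differences before plugging it into HMC. A sketch with a hypothetical Gaussian log posterior (all names here are illustrative):

```python
import numpy as np

def log_post(q):
    # Toy log posterior: standard multivariate Gaussian.
    return -0.5 * q @ q

def grad_log_post(q):
    # Hand-derived analytic gradient.
    return -q

def num_grad(f, q, h=1e-5):
    # Central finite-difference approximation of the gradient.
    g = np.zeros_like(q)
    for i in range(len(q)):
        e = np.zeros_like(q)
        e[i] = h
        g[i] = (f(q + e) - f(q - e)) / (2 * h)
    return g

q = np.array([0.3, -1.2, 2.0])
print(np.allclose(grad_log_post(q), num_grad(log_post, q), atol=1e-6))  # True
```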