Hamiltonian Monte Carlo for Scalable Deep Learning


Isaac Robson
Department of Statistics and Operations Research, University of North Carolina at Chapel Hill
isrobson@email.unc.edu
BIOS 740, May 4, 2018

Preface

Markov chain Monte Carlo (MCMC) techniques are powerful algorithms for fitting probabilistic models. Variations such as Gibbs samplers work well in some high-dimensional situations, but have trouble scaling to today's challenges and model architectures. Hamiltonian Monte Carlo (HMC) is a more proposal-efficient variant of MCMC that is a promising catalyst for innovation in deep learning and probabilistic graphical models.

Outline

Review of Metropolis-Hastings
Introduction to Hamiltonian Monte Carlo (HMC)
Brief review of neural networks and fitting methods
Discussion of Stochastic Gradient HMC (T. Chen et al., 2014)

Introduction to Hamiltonian Monte Carlo

Review: Metropolis-Hastings 1/3
Metropolis et al., 1953, and Hastings, 1970

The original Metropolis et al. algorithm can be used to compute integrals over a distribution, e.g. the normalization of a Bayesian posterior:

$J = \int f(x) P(x) \, dx = E_P[f]$

It was developed for statistical mechanics, specifically calculating the potential of 2D spheres (particles) in a square with "fast electronic computing machines" (size $N = 224$ particles, time = 16 hours on prevailing machines; Metropolis et al., 1953).

The advantage is that it depends only on the ratio $P(x')/P(x)$ of the probability distribution evaluated at two points $x$ and $x'$.

Review: Metropolis-Hastings 2/3

We can use this ratio to accept or reject a move from a randomly generated point $x$ to $x'$ with acceptance ratio $P(x')/P(x)$. This lets us sample by accumulating a running random-walk (Markov chain) list of correlated samples from the target distribution under a symmetric proposal scheme, which we can then use for estimation.

Hastings extended this to permit (but not require) an asymmetric proposal scheme $g$, which speeds the process and improves mixing:

$A(x' \mid x) = \min\left[1, \frac{P(x')}{P(x)} \cdot \frac{g(x \mid x')}{g(x' \mid x)}\right]$

Regardless, we also accumulate a burn-in period of bad initial samples that we have to discard (this slows convergence, as do correlated samples).

Review: Metropolis-Hastings 3/3

We have to remember that Metropolis-Hastings has a few restrictions. A Markov chain won't converge to a target distribution $P(x)$ unless it converges to a stationary distribution $\pi(x) = P(x)$; if $\pi(x)$ is not unique, we can also get multiple answers! (This is bad.) So we require the detailed-balance equality $P(x' \mid x) P(x) = P(x \mid x') P(x')$, i.e. reversibility.

Additionally, the proposal is symmetric when $g(x' \mid x) / g(x \mid x') = 1$, e.g. a Gaussian; these are called random-walk algorithms. When $P(x') \geq P(x)$ they move to the higher-density point with certainty, and otherwise with acceptance ratio $P(x')/P(x)$. Note that a proposal with higher variance typically yields a lower acceptance ratio.

Finally, remember that Gibbs sampling, useful for certain high-dimensional situations, is a special case of Metropolis-Hastings using proposals conditioned on the values of the other dimensions.
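To make the review concrete, here is a minimal random-walk Metropolis-Hastings sketch in Python; the Gaussian proposal width, burn-in length, and the standard-Gaussian example target are illustrative assumptions rather than anything from the slides.

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples=10_000, step=0.5, burn_in=1_000):
    """Random-walk MH with a symmetric Gaussian proposal, so the
    acceptance ratio reduces to P(x')/P(x)."""
    rng = np.random.default_rng(0)
    x, samples = np.asarray(x0, dtype=float), []
    for t in range(n_samples + burn_in):
        x_new = x + step * rng.standard_normal(x.shape)  # symmetric proposal g
        if np.log(rng.uniform()) < log_p(x_new) - log_p(x):
            x = x_new                                    # accept; otherwise keep x
        if t >= burn_in:                                 # discard the burn-in period
            samples.append(x)
    return np.array(samples)

# Example: sample a standard 2D Gaussian, whose log density is -||x||^2/2 + C.
samples = metropolis_hastings(lambda x: -0.5 * np.sum(x**2), x0=np.zeros(2))
```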

Hamiltonian Monte Carlo 1/5
Duane et al., 1987, Neal, 2012, and Betancourt, 2017

Duane et al. proposed Hybrid Monte Carlo to more efficiently compute integrals in lattice field theory. "Hybrid" refers to the fact that it infused Hamiltonian equations of motion to generate a candidate point $x'$, instead of relying on random number generation alone.

As Neal describes, this allows us to push candidate points further out with momentum, because the dynamics:
are reversible (necessary for convergence to a unique target distribution),
preserve the Hamiltonian (so we can still use momentum),
and preserve volume (which makes acceptance probabilities solvable).

Hamiltonian Monte Carlo 2/5

A Hamiltonian is an energy function of the form

$H(q, p) = U(q) + K(p)$

where $q$ is position, $p$ is momentum, $U$ is potential energy, and $K$ is kinetic energy. Hamilton's equations govern the change of this system over time:

$\frac{dq_i}{dt} = \frac{\partial H}{\partial p_i} = [M^{-1} p]_i, \qquad \frac{dp_i}{dt} = -\frac{\partial H}{\partial q_i}$

or, writing $z = (q, p)$, $\frac{dz}{dt} = J \nabla H$ with

$J = \begin{bmatrix} 0_{d \times d} & I_{d \times d} \\ -I_{d \times d} & 0_{d \times d} \end{bmatrix}$

We can set $U(q) = -\log \pi(q) + C$ and $K(p) = p^T M^{-1} p / 2$, where $\pi$ is the target density, $C$ is a constant, and $M$ is a PSD mass matrix that determines our momentum (kinetic energy).
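As a concrete sketch of this decomposition, the functions below encode $U$, $K$, and $H$ for a standard Gaussian target with identity mass matrix $M = I$; the target choice is an illustrative assumption, and the later sketches reuse grad_U.

```python
import numpy as np

def U(q):                       # potential energy: -log pi(q) up to a constant,
    return 0.5 * np.sum(q**2)   # here for a standard Gaussian target

def grad_U(q):                  # gradient of the potential, driving the dynamics
    return q

def K(p):                       # kinetic energy p^T M^{-1} p / 2 with M = I
    return 0.5 * np.sum(p**2)

def H(q, p):                    # total energy H(q, p) = U(q) + K(p)
    return U(q) + K(p)
```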

Hamiltonian Monte Carlo 3/5

In a Bayesian setting, we build the potential energy from the prior $\pi(q)$ times the likelihood $L(q \mid D)$ given data $D$:

$U(q) = -\log[\pi(q) \, L(q \mid D)]$

If we choose a Gaussian proposal (as in Metropolis), we set the kinetic energy

$K(p) = \sum_{i=1}^{d} \frac{p_i^2}{2 m_i}$

We then generate $(q', p')$ via Hamiltonian dynamics and use the difference in energy levels as our acceptance ratio in the MH algorithm:

$A(q', p' \mid q, p) = \min\left[1, \exp\left(-H(q', p') + H(q, p)\right)\right]$

Hamiltonian Monte Carlo 4/5

Converting the proposal and acceptance steps to this energy form is convoluted; however, we can now use Hamiltonian dynamics to walk farther without sacrificing acceptance ratio.

The classic method of solving Hamilton's differential equations is Euler's method, which traverses a small distance $\varepsilon > 0$ for $L$ steps:

$p_i(t + \varepsilon) = p_i(t) - \varepsilon \frac{\partial U}{\partial q_i}[q(t)], \qquad q_i(t + \varepsilon) = q_i(t) + \varepsilon \frac{p_i(t)}{m_i}$

We can instead employ the more accurate leapfrog technique to propose a candidate:

$p_i(t + \varepsilon/2) = p_i(t) - (\varepsilon/2) \frac{\partial U}{\partial q_i}[q(t)]$

$q_i(t + \varepsilon) = q_i(t) + \varepsilon \frac{p_i(t + \varepsilon/2)}{m_i}$

$p_i(t + \varepsilon) = p_i(t + \varepsilon/2) - (\varepsilon/2) \frac{\partial U}{\partial q_i}[q(t + \varepsilon)]$
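A minimal sketch of the leapfrog update above, assuming unit masses $m_i = 1$; the two momentum half-steps bracket a full position step, and consecutive half-steps are fused inside the loop.

```python
def leapfrog(q, p, grad_U, eps, L):
    """Run L leapfrog steps of size eps, assuming unit masses (m_i = 1)."""
    q, p = q.copy(), p.copy()
    p = p - 0.5 * eps * grad_U(q)               # initial momentum half-step
    for i in range(L):
        q = q + eps * p                         # full position step
        step = eps if i < L - 1 else 0.5 * eps  # fuse half-steps mid-trajectory
        p = p - step * grad_U(q)                # momentum step
    return q, p
```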

Hamiltonian Monte Carlo 5/5

The HMC algorithm adds two steps to MH:
sample the momentum parameter $p$ (typically from a symmetric distribution, e.g. Gaussian),
then compute $L$ steps of size $\varepsilon$ to find a new $(q', p')$.

Betancourt explains that we can sample a momentum to easily change energy levels, then use Hamiltonian dynamics to traverse our $q$-space (state space). We no longer have to wait for a random walk to slowly explore, as we can easily find samples well-distributed across our posterior with high acceptance ratios (same energy levels). [Graphic from Betancourt, 2017]

However, as T. Chen et al., 2014 describe, we still have to compute the gradient of our potential at every step, which can be costly, especially in high dimensions.
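Putting the pieces together, here is a sketch of one HMC transition built on the leapfrog and energy functions above; the step size and trajectory length are illustrative, and momentum is resampled from a standard Gaussian (i.e. $M = I$).

```python
import numpy as np

def hmc_step(q, U, grad_U, eps=0.1, L=20, rng=np.random.default_rng()):
    p = rng.standard_normal(q.shape)               # step 1: sample a fresh momentum
    q_new, p_new = leapfrog(q, p, grad_U, eps, L)  # step 2: L leapfrog steps
    # MH correction: accept with probability min(1, exp(-H(q',p') + H(q,p)))
    H_old = U(q) + 0.5 * np.sum(p**2)
    H_new = U(q_new) + 0.5 * np.sum(p_new**2)
    return q_new if np.log(rng.uniform()) < H_old - H_new else q
```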

Neural Networks

Neural Networks 1/4

[Artificial] neural networks (or nets) are popular connectionist models for learning a function approximation. They descend from Hebbian learning, after Hebb's neuropsychology work in the 1940s, are popularized today thanks to parallelization and convex optimization, and are universal function approximators (in theory).

They typically use function composition and the chain rule alongside vectorization to efficiently optimize a loss function by altering the weights (function) of each node. This requires immense computational power, especially when the functions being composed are probabilistic, such as in a Bayesian neural net (BNN). [Figure: feedforward neural net, Wikimedia Commons]

Fitting neural nets is an active area of research, with contributions from the perspectives of both optimization and sampling.

Neural Networks 2/4
LeCun et al., 1998, and Robbins and Monro, 1951

As LeCun et al. detail, backward propagation of errors (backprop) is a powerful method for neural net optimization (it does not use sampling). For a layer of weights $W$ at time $t$, given error $E$:

$W_{t+1} = W_t - \eta \frac{\partial E}{\partial W}$

Note this is merely gradient descent, which in recent years has been upgraded with many bells and whistles. One such whistle is stochastic gradient descent (SGD), an algorithm that evolved from the stochastic approximation methods introduced by Robbins and Monro, 1951 (GO TAR HEELS!).

Neural Networks 3/4

Calculating a full gradient is costly, but as LeCun et al. detail, stochastic gradient descent is much faster, and comes in both online and the smoother minibatch variations. The primary idea is to update using only the error at one point, $E^*$:

$W_{t+1} = W_t - \eta \frac{\partial E^*}{\partial W}$

The error at one point is an estimate of the error for the entire vector of current weights $W_t$, hence the name stochastic gradient descent. The speedup is feasible due to shared information across observations and the fact that, by decreasing the learning rate $\eta$, SGD still converges; this includes minibatch variants of SGD, which compute gradients for a handful of points instead of a single one.
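Here is a sketch of the minibatch variant for a least-squares loss; the batch size, learning rate, and epoch count are illustrative hyperparameters, and the minibatch squared error plays the role of $E^*$.

```python
import numpy as np

def sgd(X, y, eta=0.01, batch_size=32, epochs=10, rng=np.random.default_rng(0)):
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))                     # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            b = order[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ W - y[b]) / len(b)  # minibatch gradient of E*
            W = W - eta * grad                              # W_{t+1} = W_t - eta dE*/dW
    return W
```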

Neural Networks 4/4
Rumelhart et al., 1986

A popular bell to complement SGD's whistle is the addition of a momentum term to the update step. We more or less smooth our update steps with an exponential decay factor $\alpha$:

$W_{t+1} = W_t - \eta \frac{\partial E}{\partial W} + \alpha \Delta W_{t-1}, \qquad \Delta W_t = W_{t+1} - W_t$

This may seem familiar if you recall the momentum term in Hamiltonian Monte Carlo (cue imaginary dramatic sound effect).
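A sketch of this update, where a velocity variable carries the decayed history $\alpha \Delta W_{t-1}$; dropping it into the SGD loop above in place of the plain update is the usual pattern.

```python
def momentum_step(W, grad, velocity, eta=0.01, alpha=0.9):
    velocity = alpha * velocity - eta * grad  # Delta W_t = alpha Delta W_{t-1} - eta dE/dW
    return W + velocity, velocity             # W_{t+1} = W_t + Delta W_t
```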

Stochastic Gradient HMC (T. Chen et al., 2014)

Stochastic Gradient HMC 1/4

As mentioned before, backprop uses the powerful stochastic gradient descent method and its extensions to fit gradient-based neural networks. Unfortunately, many of these neural nets lack inferentiability. One solution (other than proving P = NP or solving AGI) is to use Bayesian neural networks, which exist as a class of probabilistic graphical models and can be fitted with sampling or similar methods. BNNs still perform many of the surreal feats that other neural nets accomplish.

However, even with Gibbs samplers and HMC, sampling in high dimensions is quite slow to converge... for now.

Stochastic Gradient HMC 2/4
Welling et al., 2011, T. Chen et al., 2014, and C. Chen et al., 2015

In HMC, instead of calculating the gradient of our potential energy $U(q)$ over the whole dataset $D$, what if we selected some minibatch $\tilde{D} \subset D$ to use for our estimate in the leapfrog method?

$\nabla \tilde{U} \approx \nabla U + N(0, \Sigma) \quad \text{for noise } \Sigma$

$p_i(t + \varepsilon/2) = p_i(t) - (\varepsilon/2) \frac{\partial \tilde{U}}{\partial q_i}[q(t)]$

Unfortunately, this naive stochastic gradient HMC (SGHMC) injects noise into Hamilton's equations, which requires materially decreasing the acceptance ratio in the MH algorithm to inefficient levels.

Stochastic Gradient HMC 3/4

T. Chen et al. suggest fixing naive SGHMC by adding a friction term, building on the stochastic gradient Langevin dynamics proposed by Welling et al. and borrowing once again from physics (vectorized form, omitting leapfrog notation):

$\nabla \tilde{U} = \nabla U + N(0, 2B), \qquad B = \frac{\varepsilon}{2} \Sigma$

$q_{t+1} = q_t + M^{-1} p_t$

$p_{t+1} = p_t - \varepsilon \nabla \tilde{U}(q_{t+1}) - B M^{-1} p_t$

Note that $B$ is a PSD function of $q_{t+1}$, but Chen also shows that certain constant choices of $B$ converge (and are far more practical). Welling et al. also lament that Bayesian methods have been "left behind" in recent machine learning advances due to [MCMC] requiring computations over the whole dataset.
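A minimal sketch of this update with $M = I$ and a constant scalar friction $B$; grad_U_batch stands in for the noisy minibatch gradient $\nabla \tilde{U}$, and the hyperparameter values are illustrative.

```python
def sghmc_step(q, p, grad_U_batch, eps=1e-3, B=0.05):
    """One SGHMC update: the friction term B * p damps the extra noise
    that the minibatch gradient injects into Hamilton's equations."""
    q = q + p                              # q_{t+1} = q_t + M^{-1} p_t
    p = p - eps * grad_U_batch(q) - B * p  # p_{t+1} = p_t - eps grad U~(q_{t+1}) - B M^{-1} p_t
    return q, p
```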

Stochastic Gradient HMC 4/4

The end result of SGHMC is an efficient sampling algorithm that also permits computing gradients on a minibatch in a Bayesian setting. T. Chen et al. then show that in deterministic settings SGHMC performs analogously to SGD with momentum, as the momentum components are related.

C. Chen et al. (BEAT DOOK!) further elaborate that many Bayesian MCMC sampling algorithms are analogs of stochastic optimization algorithms, which suggests that symbiotic discoveries and extensions across the two are possible, as presented in Stochastic AnNealing Thermostats with Adaptive momentum (Santa), which incorporates recent advances from both domains.

Conclusions

HMC is a promising variant of MCMC sampling algorithms with applications in Bayesian models. SGHMC offers more scalability in deep learning and several other settings, with the added benefit of inferentiability in Bayesian neural nets. Future work and collaborations between the sampling and optimization communities are promising.

Bibliography (by date)

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400-407.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21, 1087.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97-109.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.
Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195, 216-222.
LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. (1998). Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9-48. Springer.
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 681-688.
Neal, R. M. (2012). MCMC using Hamiltonian dynamics. arXiv:1206.1901.
Chen, T., Fox, E. B., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. arXiv:1402.4102v2.
Chen, C., Carlson, D., Gan, Z., Li, C., and Carin, L. (2015). Bridging the gap between stochastic gradient MCMC and stochastic optimization. arXiv:1512.07962.
Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv:1701.02434.
