Hamiltonian Monte Carlo for Scalable Deep Learning


Isaac Robson
Department of Statistics and Operations Research, University of North Carolina at Chapel Hill
isrobson@email.unc.edu
BIOS 740, May 4, 2018

Preface

Markov chain Monte Carlo (MCMC) techniques are powerful algorithms for fitting probabilistic models. Variations such as Gibbs samplers work well in some high-dimensional situations, but have trouble scaling to today's challenges and model architectures. Hamiltonian Monte Carlo (HMC) is a more proposal-efficient variant of MCMC that is a promising catalyst for innovation in deep learning and probabilistic graphical models.

Outline

Review of Metropolis-Hastings
Introduction to Hamiltonian Monte Carlo (HMC)
Brief review of neural networks and fitting methods
Discussion of Stochastic Gradient HMC (T. Chen et al., 2014)

Introduction to Hamiltonian Monte Carlo

Review: Metropolis-Hastings 1/3
Metropolis et al., 1953, and Hastings, 1970

The original Metropolis et al. algorithm can be used to compute integrals over a distribution, e.g. the normalization of a Bayesian posterior:

$J = \int f(x) P(x) \, dx = E_P[f]$

It was developed for statistical mechanics, specifically calculating the potential of 2D spheres (particles) in a square with "fast electronic computing machines" (size $N = 224$ particles, time = 16 hours on prevailing machines; Metropolis et al., 1953).

The advantage is that it depends only on the ratio $P(x')/P(x)$ of the probability distribution evaluated at two points $x$ and $x'$.

Review: Metropolis-Hastings 2/3

We can use this ratio to accept or reject a move from a randomly generated point $x$ to $x'$ with acceptance ratio $P(x')/P(x)$. This lets us sample by accumulating a running random-walk (Markov chain) list of correlated samples from the target distribution under a symmetric proposal scheme, which we can then use for estimation.

Hastings extended this to permit (but not require) an asymmetric proposal scheme $g$, which speeds the process and improves mixing:

$A(x' \mid x) = \min\left[1, \frac{P(x')}{P(x)} \cdot \frac{g(x \mid x')}{g(x' \mid x)}\right]$

Regardless, we also accumulate a burn-in period of bad initial samples that we have to discard (this slows convergence, as do correlated samples).

Review: Metropolis-Hastings 3/3

We have to remember that Metropolis-Hastings has a few restrictions. A Markov chain won't converge to a target distribution $P(x)$ unless it converges to a stationary distribution $\pi(x) = P(x)$; if $\pi(x)$ is not unique, we can also get multiple answers! (This is bad.) So we require the detailed-balance equality $P(x' \mid x) P(x) = P(x \mid x') P(x')$, i.e. reversibility.

Additionally, the proposal is symmetric when $g(x' \mid x) / g(x \mid x') = 1$, e.g. a Gaussian; these are called random-walk algorithms. When $P(x') \geq P(x)$ they move to the higher-density point with certainty, and otherwise with acceptance ratio $P(x')/P(x)$. Note that a proposal with higher variance typically yields a lower acceptance ratio.

Finally, remember that Gibbs sampling, useful for certain high-dimensional situations, is a special case of Metropolis-Hastings using proposals conditioned on the values of the other dimensions.
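To make the review concrete, here is a minimal random-walk Metropolis-Hastings sketch in Python; the Gaussian proposal width, burn-in length, and the standard-Gaussian example target are illustrative assumptions rather than anything from the slides.

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples=10_000, step=0.5, burn_in=1_000):
    """Random-walk MH with a symmetric Gaussian proposal, so the
    acceptance ratio reduces to P(x')/P(x)."""
    rng = np.random.default_rng(0)
    x, samples = np.asarray(x0, dtype=float), []
    for t in range(n_samples + burn_in):
        x_new = x + step * rng.standard_normal(x.shape)  # symmetric proposal g
        if np.log(rng.uniform()) < log_p(x_new) - log_p(x):
            x = x_new                                    # accept; otherwise keep x
        if t >= burn_in:                                 # discard the burn-in period
            samples.append(x)
    return np.array(samples)

# Example: sample a standard 2D Gaussian, whose log density is -||x||^2/2 + C.
samples = metropolis_hastings(lambda x: -0.5 * np.sum(x**2), x0=np.zeros(2))
```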

Hamiltonian Monte Carlo 1/5
Duane et al., 1987, Neal, 2012, and Betancourt, 2017

Duane et al. proposed Hybrid Monte Carlo to more efficiently compute integrals in lattice field theory. "Hybrid" refers to the fact that it infused Hamiltonian equations of motion to generate a candidate point $x'$, instead of relying on random number generation alone.

As Neal describes, this allows us to push candidate points further out with momentum, because the dynamics:
are reversible (necessary for convergence to a unique target distribution),
preserve the Hamiltonian (so we can still use momentum),
and preserve volume (which makes acceptance probabilities solvable).

Hamiltonian Monte Carlo 2/5

A Hamiltonian is an energy function of the form

$H(q, p) = U(q) + K(p)$

where $q$ is position, $p$ is momentum, $U$ is potential energy, and $K$ is kinetic energy. Hamilton's equations govern the change of this system over time:

$\frac{dq_i}{dt} = \frac{\partial H}{\partial p_i} = [M^{-1} p]_i, \qquad \frac{dp_i}{dt} = -\frac{\partial H}{\partial q_i}$

or, writing $z = (q, p)$, $\frac{dz}{dt} = J \nabla H$ with

$J = \begin{bmatrix} 0_{d \times d} & I_{d \times d} \\ -I_{d \times d} & 0_{d \times d} \end{bmatrix}$

We can set $U(q) = -\log \pi(q) + C$ and $K(p) = p^T M^{-1} p / 2$, where $\pi$ is the target density, $C$ is a constant, and $M$ is a PSD mass matrix that determines our momentum (kinetic energy).
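As a concrete sketch of this decomposition, the functions below encode $U$, $K$, and $H$ for a standard Gaussian target with identity mass matrix $M = I$; the target choice is an illustrative assumption, and the later sketches reuse grad_U.

```python
import numpy as np

def U(q):                       # potential energy: -log pi(q) up to a constant,
    return 0.5 * np.sum(q**2)   # here for a standard Gaussian target

def grad_U(q):                  # gradient of the potential, driving the dynamics
    return q

def K(p):                       # kinetic energy p^T M^{-1} p / 2 with M = I
    return 0.5 * np.sum(p**2)

def H(q, p):                    # total energy H(q, p) = U(q) + K(p)
    return U(q) + K(p)
```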

Hamiltonian Monte Carlo 3/5

In a Bayesian setting, we build the potential energy from the prior $\pi(q)$ times the likelihood $L(q \mid D)$ given data $D$:

$U(q) = -\log[\pi(q) \, L(q \mid D)]$

If we choose a Gaussian proposal (as in Metropolis), we set the kinetic energy

$K(p) = \sum_{i=1}^{d} \frac{p_i^2}{2 m_i}$

We then generate $(q', p')$ via Hamiltonian dynamics and use the difference in energy levels as our acceptance ratio in the MH algorithm:

$A(q', p' \mid q, p) = \min\left[1, \exp\left(-H(q', p') + H(q, p)\right)\right]$

Hamiltonian Monte Carlo 4/5

Converting the proposal and acceptance steps to this energy form is convoluted; however, we can now use Hamiltonian dynamics to walk farther without sacrificing acceptance ratio.

The classic method of solving Hamilton's differential equations is Euler's method, which traverses a small distance $\varepsilon > 0$ for $L$ steps:

$p_i(t + \varepsilon) = p_i(t) - \varepsilon \frac{\partial U}{\partial q_i}[q(t)], \qquad q_i(t + \varepsilon) = q_i(t) + \varepsilon \frac{p_i(t)}{m_i}$

We can instead employ the more accurate leapfrog technique to propose a candidate:

$p_i(t + \varepsilon/2) = p_i(t) - (\varepsilon/2) \frac{\partial U}{\partial q_i}[q(t)]$

$q_i(t + \varepsilon) = q_i(t) + \varepsilon \frac{p_i(t + \varepsilon/2)}{m_i}$

$p_i(t + \varepsilon) = p_i(t + \varepsilon/2) - (\varepsilon/2) \frac{\partial U}{\partial q_i}[q(t + \varepsilon)]$
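A minimal sketch of the leapfrog update above, assuming unit masses $m_i = 1$; the two momentum half-steps bracket a full position step, and consecutive half-steps are fused inside the loop.

```python
def leapfrog(q, p, grad_U, eps, L):
    """Run L leapfrog steps of size eps, assuming unit masses (m_i = 1)."""
    q, p = q.copy(), p.copy()
    p = p - 0.5 * eps * grad_U(q)               # initial momentum half-step
    for i in range(L):
        q = q + eps * p                         # full position step
        step = eps if i < L - 1 else 0.5 * eps  # fuse half-steps mid-trajectory
        p = p - step * grad_U(q)                # momentum step
    return q, p
```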

Hamiltonian Monte Carlo 5/5

The HMC algorithm adds two steps to MH:
sample the momentum parameter $p$ (typically from a symmetric distribution, e.g. Gaussian),
then compute $L$ steps of size $\varepsilon$ to find a new $(q', p')$.

Betancourt explains that we can sample a momentum to easily change energy levels, then use Hamiltonian dynamics to traverse our $q$-space (state space). We no longer have to wait for a random walk to slowly explore, as we can easily find samples well-distributed across our posterior with high acceptance ratios (same energy levels). [Graphic from Betancourt, 2017]

However, as T. Chen et al., 2014 describe, we still have to compute the gradient of our potential at every step, which can be costly, especially in high dimensions.
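Putting the pieces together, here is a sketch of one HMC transition built on the leapfrog and energy functions above; the step size and trajectory length are illustrative, and momentum is resampled from a standard Gaussian (i.e. $M = I$).

```python
import numpy as np

def hmc_step(q, U, grad_U, eps=0.1, L=20, rng=np.random.default_rng()):
    p = rng.standard_normal(q.shape)               # step 1: sample a fresh momentum
    q_new, p_new = leapfrog(q, p, grad_U, eps, L)  # step 2: L leapfrog steps
    # MH correction: accept with probability min(1, exp(-H(q',p') + H(q,p)))
    H_old = U(q) + 0.5 * np.sum(p**2)
    H_new = U(q_new) + 0.5 * np.sum(p_new**2)
    return q_new if np.log(rng.uniform()) < H_old - H_new else q
```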

Neural Networks

Neural Networks 1/4

[Artificial] neural networks (or nets) are popular connectionist models for learning a function approximation. They descend from Hebbian learning, after Hebb's neuropsychology work in the 1940s, are popularized today thanks to parallelization and convex optimization, and are universal function approximators (in theory).

They typically use function composition and the chain rule alongside vectorization to efficiently optimize a loss function by altering the weights (function) of each node. This requires immense computational power, especially when the functions being composed are probabilistic, such as in a Bayesian neural net (BNN). [Figure: feedforward neural net, Wikimedia Commons]

Fitting neural nets is an active area of research, with contributions from the perspectives of both optimization and sampling.

Neural Networks 2/4
LeCun et al., 1998, and Robbins and Monro, 1951

As LeCun et al. detail, backward propagation of errors (backprop) is a powerful method for neural net optimization (it does not use sampling). For a layer of weights $W$ at time $t$, given error $E$:

$W_{t+1} = W_t - \eta \frac{\partial E}{\partial W}$

Note this is merely gradient descent, which in recent years has been upgraded with many bells and whistles. One such whistle is stochastic gradient descent (SGD), an algorithm that evolved from the stochastic approximation methods introduced by Robbins and Monro, 1951 (GO TAR HEELS!).

Neural Networks 3/4

Calculating a full gradient is costly, but as LeCun et al. detail, stochastic gradient descent is much faster, and comes in both online and the smoother minibatch variations. The primary idea is to update using only the error at one point, $E^*$:

$W_{t+1} = W_t - \eta \frac{\partial E^*}{\partial W}$

The error at one point is an estimate of the error for the entire vector of current weights $W_t$, hence the name stochastic gradient descent. The speedup is feasible due to shared information across observations and the fact that, by decreasing the learning rate $\eta$, SGD still converges; this includes minibatch variants of SGD, which compute gradients for a handful of points instead of a single one.
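Here is a sketch of the minibatch variant for a least-squares loss; the batch size, learning rate, and epoch count are illustrative hyperparameters, and the minibatch squared error plays the role of $E^*$.

```python
import numpy as np

def sgd(X, y, eta=0.01, batch_size=32, epochs=10, rng=np.random.default_rng(0)):
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))                     # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            b = order[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ W - y[b]) / len(b)  # minibatch gradient of E*
            W = W - eta * grad                              # W_{t+1} = W_t - eta dE*/dW
    return W
```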

Neural Networks 4/4
Rumelhart et al., 1986

A popular bell to complement SGD's whistle is the addition of a momentum term to the update step. We more or less smooth our update steps with an exponential decay factor $\alpha$:

$W_{t+1} = W_t - \eta \frac{\partial E}{\partial W} + \alpha \Delta W_{t-1}, \qquad \Delta W_t = W_{t+1} - W_t$

This may seem familiar if you recall the momentum term in Hamiltonian Monte Carlo (cue imaginary dramatic sound effect).
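A sketch of this update, where a velocity variable carries the decayed history $\alpha \Delta W_{t-1}$; dropping it into the SGD loop above in place of the plain update is the usual pattern.

```python
def momentum_step(W, grad, velocity, eta=0.01, alpha=0.9):
    velocity = alpha * velocity - eta * grad  # Delta W_t = alpha Delta W_{t-1} - eta dE/dW
    return W + velocity, velocity             # W_{t+1} = W_t + Delta W_t
```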

Stochastic Gradient HMC (T. Chen et al., 2014)

Stochastic Gradient HMC 1/4

As mentioned before, backprop uses the powerful stochastic gradient descent method and its extensions to fit gradient-based neural networks. Unfortunately, many of these neural nets lack inferentiability. One solution (other than proving P = NP or solving AGI) is to use Bayesian neural networks, which exist as a class of probabilistic graphical models and can be fitted with sampling or similar methods. BNNs still perform many of the surreal feats that other neural nets accomplish.

However, even with Gibbs samplers and HMC, sampling in high dimensions is quite slow to converge... for now.

Stochastic Gradient HMC 2/4
Welling et al., 2011, T. Chen et al., 2014, and C. Chen et al., 2015

In HMC, instead of calculating the gradient of our potential energy $U(q)$ over the whole dataset $D$, what if we selected some minibatch $\tilde{D} \subset D$ to use for our estimate in the leapfrog method?

$\nabla \tilde{U} \approx \nabla U + N(0, \Sigma) \quad \text{for noise } \Sigma$

$p_i(t + \varepsilon/2) = p_i(t) - (\varepsilon/2) \frac{\partial \tilde{U}}{\partial q_i}[q(t)]$

Unfortunately, this naive stochastic gradient HMC (SGHMC) injects noise into Hamilton's equations, which requires materially decreasing the acceptance ratio in the MH algorithm to inefficient levels.

Stochastic Gradient HMC 3/4

T. Chen et al. suggest fixing naive SGHMC by adding a friction term, building on the stochastic gradient Langevin dynamics proposed by Welling et al. and borrowing once again from physics (vectorized form, omitting leapfrog notation):

$\nabla \tilde{U} = \nabla U + N(0, 2B), \qquad B = \frac{\varepsilon}{2} \Sigma$

$q_{t+1} = q_t + M^{-1} p_t$

$p_{t+1} = p_t - \varepsilon \nabla \tilde{U}(q_{t+1}) - B M^{-1} p_t$

Note that $B$ is a PSD function of $q_{t+1}$, but Chen also shows that certain constant choices of $B$ converge (and are far more practical). Welling et al. also lament that Bayesian methods have been "left behind" in recent machine learning advances due to [MCMC] requiring computations over the whole dataset.
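A minimal sketch of this update with $M = I$ and a constant scalar friction $B$; grad_U_batch stands in for the noisy minibatch gradient $\nabla \tilde{U}$, and the hyperparameter values are illustrative.

```python
def sghmc_step(q, p, grad_U_batch, eps=1e-3, B=0.05):
    """One SGHMC update: the friction term B * p damps the extra noise
    that the minibatch gradient injects into Hamilton's equations."""
    q = q + p                              # q_{t+1} = q_t + M^{-1} p_t
    p = p - eps * grad_U_batch(q) - B * p  # p_{t+1} = p_t - eps grad U~(q_{t+1}) - B M^{-1} p_t
    return q, p
```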

Stochastic Gradient HMC 4/4

The end result of SGHMC is an efficient sampling algorithm that also permits computing gradients on a minibatch in a Bayesian setting. T. Chen et al. then show that in deterministic settings SGHMC performs analogously to SGD with momentum, as the momentum components are related.

C. Chen et al. (BEAT DOOK!) further elaborate that many Bayesian MCMC sampling algorithms are analogs of stochastic optimization algorithms, which suggests that symbiotic discoveries and extensions across the two are possible, as presented in Stochastic AnNealing Thermostats with Adaptive momentum (Santa), which incorporates recent advances from both domains.

Conclusions

HMC is a promising variant of MCMC sampling algorithms with applications in Bayesian models. SGHMC offers more scalability in deep learning and several other settings, with the added benefit of inferentiability in Bayesian neural nets. Future work and collaborations between the sampling and optimization communities are promising.

Bibliography (by date)

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400-407.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21, 1087.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97-109.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.
Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195, 216-222.
LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. (1998). Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9-48. Springer.
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 681-688.
Neal, R. M. (2012). MCMC using Hamiltonian dynamics. arXiv:1206.1901.
Chen, T., Fox, E. B., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. arXiv:1402.4102v2.
Chen, C., Carlson, D., Gan, Z., Li, C., and Carin, L. (2015). Bridging the gap between stochastic gradient MCMC and stochastic optimization. arXiv:1512.07962.
Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv:1701.02434.
