Hamiltonian Monte Carlo for Scalable Deep Learning


1 Hamiltonian Monte Carlo for Scalable Deep Learning
Isaac Robson
Department of Statistics and Operations Research, University of North Carolina at Chapel Hill
BIOS 740, May 4, 2018

2 Preface
- Markov chain Monte Carlo (MCMC) techniques are powerful algorithms for fitting probabilistic models.
- Variations such as Gibbs samplers work well for some high-dimensional situations, but have trouble scaling to today's challenges and model architectures.
- Hamiltonian Monte Carlo (HMC) is a more proposal-efficient variant of MCMC and a promising catalyst for innovation in deep learning and probabilistic graphical models.

3 Outline
- Review of Metropolis-Hastings
- Introduction to Hamiltonian Monte Carlo (HMC)
- Brief review of neural networks and fitting methods
- Discussion of Stochastic Gradient HMC (T. Chen et al., 2014)

4 Introduction to Hamiltonian Monte Carlo

5 Review: Metropolis-Hastings 1/3
Metropolis et al., 1953, and Hastings, 1970
- The original Metropolis et al. algorithm can be used to compute integrals under a distribution, e.g. the normalization of a Bayesian posterior:
  J = ∫ f(x) P(x) dx = E_P[f]
- It was developed for statistical mechanics, specifically calculating the potential of 2D spheres (particles) in a square with "fast electronic computing machines": N = 224 particles, runtime ≈ 16 hours on the machines of the day (Metropolis et al., 1953).
- The key advantage is that it depends only on the ratio P(x')/P(x) of the target distribution evaluated at two points x and x'.

6 Review: Metropolis-Hastings 2/3
- We use this ratio to accept or reject a move from the current point x to a randomly generated point x', with acceptance probability P(x')/P(x).
- This lets us sample by accumulating a running random-walk (Markov chain) list of correlated samples from the target distribution under a symmetric proposal scheme, which we can then use for estimation.
- Hastings extended this to permit (but not require) an asymmetric proposal scheme g, which speeds the process and improves mixing:
  A(x' | x) = min[1, (P(x') g(x | x')) / (P(x) g(x' | x))]
- Regardless, we also accumulate a burn-in period of bad initial samples that we have to discard (this slows convergence, as do correlated samples).

7 Review: Metropolis-Hastings 3/3
We have to remember that Metropolis-Hastings has a few restrictions:
- A Markov chain won't converge to the target distribution P(x) unless it converges to a stationary distribution π(x) = P(x).
- If π(x) is not unique, we can also get multiple answers! (This is bad.)
- So we require the reversibility (detailed balance) condition P(x' | x) P(x) = P(x | x') P(x').
- Additionally, the proposal is symmetric when g(x' | x) / g(x | x') = 1, e.g. a Gaussian proposal; these are called random-walk algorithms.
- When P(x') ≥ P(x), they move to the higher-density point with certainty, otherwise with probability P(x')/P(x). Note that a proposal with higher variance typically yields a lower acceptance ratio.
- Finally, remember that Gibbs sampling, useful for certain high-dimensional situations, is a special case of Metropolis-Hastings using proposals conditioned on the values of the other dimensions.
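To ground the review, here is a minimal random-walk Metropolis sketch in Python/NumPy; the function name, arguments, and the Gaussian toy target are illustrative choices, not part of the original slides.

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples=5000, proposal_sd=1.0, burn_in=500, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal g.

    log_p: log of the (possibly unnormalized) target density P(x).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = []
    for t in range(n_samples + burn_in):
        x_prop = x + proposal_sd * rng.standard_normal(x.shape)  # symmetric proposal, so g terms cancel
        # Accept with probability min(1, P(x')/P(x)); work in log space for numerical stability.
        if np.log(rng.uniform()) < log_p(x_prop) - log_p(x):
            x = x_prop
        if t >= burn_in:                                          # discard the burn-in period
            samples.append(x.copy())
    return np.array(samples)

# Toy usage: sample a standard 2-D Gaussian, log P(x) = -0.5 * ||x||^2 + const.
draws = metropolis_hastings(lambda x: -0.5 * np.sum(x**2), x0=np.zeros(2))
print(draws.mean(axis=0), draws.std(axis=0))
```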

8 Hamiltonian Monte Carlo 1/5
Duane et al., 1987, Neal, 2012, and Betancourt, 2017
- Duane et al. proposed Hybrid Monte Carlo to more efficiently compute integrals in lattice field theory.
- "Hybrid" referred to the fact that it used Hamiltonian equations of motion to generate a candidate point x', rather than pure random-number generation.
- As Neal describes, this allows us to push candidate points further out with momentum, because the dynamics:
  - are reversible (necessary for convergence to a unique target distribution),
  - preserve the Hamiltonian (so we can still use momentum), and
  - preserve volume (which makes acceptance probabilities solvable).

9 Hamiltonian Monte Carlo 2/5
- A Hamiltonian is an energy function of the form H(q, p) = U(q) + K(p), where q is position (potential energy U) and p is momentum (kinetic energy K).
- Hamilton's equations govern the change of this system over time:
  dq_i/dt = ∂H/∂p_i = [M^{-1} p]_i,   dp_i/dt = -∂H/∂q_i = -∂U/∂q_i
  or, in matrix form, d(q, p)/dt = J_H ∇H with J_H = [[0_{d×d}, I_{d×d}], [-I_{d×d}, 0_{d×d}]].
- We can set U(q) = -log p(q) + C and K(p) = p^T M^{-1} p / 2, where p(q) is the target density, C is a constant, and M is a PSD mass matrix that determines our momentum (kinetic energy).

10 Hamiltonian Monte Carlo 3/5
- In a Bayesian setting, we take the potential energy to be the negative log of the prior π(q) times the likelihood of the data D:
  U(q) = -log[π(q) L(q | D)]
- If we choose a Gaussian momentum distribution (as in Metropolis), we set the kinetic energy
  K(p) = Σ_{i=1}^{d} p_i^2 / (2 m_i)
- We then generate (q', p') via Hamiltonian dynamics and use the difference in energy levels as our acceptance ratio in the MH algorithm:
  A(q', p' | q, p) = min[1, exp(-H(q', p') + H(q, p))]
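As a concrete illustration of these energy functions, the sketch below writes U, K, and H for a toy Bayesian model (a single Gaussian mean with a Gaussian prior); the data, prior scale, and mass matrix are invented for the example.

```python
import numpy as np

# Toy Bayesian setting (illustrative): unknown mean q with prior q ~ N(0, 10^2),
# observations D ~ N(q, 1^2), and a diagonal mass matrix M.
D = np.array([1.2, 0.7, 1.9, 1.1])
prior_sd, noise_sd = 10.0, 1.0
m = np.ones(1)

def U(q):
    """Potential energy U(q) = -log[ pi(q) * L(q | D) ], up to an additive constant."""
    log_prior = -0.5 * np.sum(q**2) / prior_sd**2
    log_lik = -0.5 * np.sum((D - q)**2) / noise_sd**2
    return -(log_prior + log_lik)

def K(p):
    """Kinetic energy K(p) = sum_i p_i^2 / (2 m_i) for Gaussian momentum."""
    return np.sum(p**2 / (2.0 * m))

def H(q, p):
    """Total energy; exp(-H(q', p') + H(q, p)) gives the MH acceptance ratio."""
    return U(q) + K(p)

print(H(np.array([1.0]), np.array([0.5])))
```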

11 Hamiltonian Monte Carlo 4/5
- Converting the proposal and acceptance steps to this energy form is convoluted; however, we can now use Hamiltonian dynamics to walk farther without sacrificing acceptance ratio.
- The classic method of solving Hamilton's differential equations is Euler's method, which traverses a small distance ε > 0 for L steps:
  p_i(t + ε) = p_i(t) - ε ∂U/∂q_i[q(t)],   q_i(t + ε) = q_i(t) + ε p_i(t) / m_i
- We can also employ the more efficient leapfrog technique to quickly propose a candidate (see the code sketch below):
  p_i(t + ε/2) = p_i(t) - (ε/2) ∂U/∂q_i[q(t)]
  q_i(t + ε) = q_i(t) + ε p_i(t + ε/2) / m_i
  p_i(t + ε) = p_i(t + ε/2) - (ε/2) ∂U/∂q_i[q(t + ε)]
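Here is a minimal leapfrog integrator matching the three updates above (half step in momentum, full step in position, half step in momentum); merging consecutive inner half steps, as done here, is a standard equivalent formulation. The signature and toy target are illustrative.

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, L, m):
    """Run L leapfrog steps of size eps, starting from (q, p).

    grad_U(q) returns dU/dq; m is the diagonal of the mass matrix M.
    """
    q, p = q.copy(), p.copy()
    p -= 0.5 * eps * grad_U(q)             # initial half step for momentum
    for step in range(L):
        q += eps * p / m                   # full step for position
        if step != L - 1:
            p -= eps * grad_U(q)           # full momentum step (two merged half steps)
    p -= 0.5 * eps * grad_U(q)             # final half step for momentum
    return q, p

# Toy usage: standard Gaussian, U(q) = 0.5 * ||q||^2, so grad_U(q) = q.
q_new, p_new = leapfrog(np.zeros(3), np.ones(3), lambda q: q, eps=0.1, L=20, m=np.ones(3))
print(q_new, p_new)
```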

12 Hamiltonian Monte Carlo 5/5
- The HMC algorithm adds two steps to MH:
  1. Sample the momentum parameter p (typically from a symmetric, Gaussian distribution).
  2. Take L leapfrog steps of size ε to find a new (q', p').
- Betancourt explains that we sample a momentum to easily change energy levels, and then use Hamiltonian dynamics to traverse our q-space (state space).
- We no longer have to wait for a random walk to slowly explore: we can easily find samples well distributed across our posterior with high acceptance ratios (same energy levels). [Graphic from Betancourt, 2017]
- However, as Chen et al., 2014 describe, we still have to compute the gradient of our potential at every step, which can be costly, especially in high dimensions.
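Putting the pieces together, here is a compact HMC sampler sketch (momentum resampling, leapfrog trajectory, energy-difference accept/reject). It assumes an identity mass matrix; names and defaults are illustrative, not from the slides.

```python
import numpy as np

def hmc(U, grad_U, q0, n_samples=1000, eps=0.1, L=20, seed=0):
    """A Metropolis-Hastings chain whose proposals come from Hamiltonian dynamics."""
    rng = np.random.default_rng(seed)
    q = np.asarray(q0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.standard_normal(q.shape)               # 1) sample momentum (M = I)
        q_new, p_new = q.copy(), p.copy()
        p_new -= 0.5 * eps * grad_U(q_new)             # 2) L leapfrog steps of size eps
        for step in range(L):
            q_new += eps * p_new
            if step != L - 1:
                p_new -= eps * grad_U(q_new)
        p_new -= 0.5 * eps * grad_U(q_new)
        # 3) accept with probability min(1, exp(-H(q', p') + H(q, p)))
        H_old = U(q) + 0.5 * np.sum(p**2)
        H_new = U(q_new) + 0.5 * np.sum(p_new**2)
        if np.log(rng.uniform()) < H_old - H_new:
            q = q_new
        samples.append(q.copy())
    return np.array(samples)

# Toy usage: standard 5-D Gaussian target.
draws = hmc(lambda q: 0.5 * np.sum(q**2), lambda q: q, q0=np.zeros(5))
print(draws.mean(axis=0))
```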

13 Neural Networks

14 Neural Networks 1/4
- [Artificial] neural networks (or nets) are popular connectionist models for learning a function approximation.
- They are a descendant of Hebbian learning, after Hebb's neuropsychology work in the 1940s, and are popularized today thanks to parallelization and advances in optimization.
- In theory, they are universal function approximators.
- They typically use function composition and the chain rule, alongside vectorization, to efficiently optimize a loss function by altering the weights (function) of each node.
- This requires immense computational power, especially when the functions being composed are probabilistic, as in a Bayesian neural network (BNN). [Figure: feedforward neural net, Wikimedia Commons]
- Fitting neural nets is an active area of research, with contributions from the perspectives of both optimization and sampling.

15 Neural Networks 2/4
LeCun et al., 1998; Robbins and Monro, 1951
- As LeCun et al. detail, backward propagation of errors (backprop) is a powerful method for neural net optimization (these methods do not use sampling). For a layer of weights W at time t, given an error function E:
  W_{t+1} = W_t - η ∇E(W_t)
- Note this is merely gradient descent, which in recent years has been upgraded with many bells and whistles.
- One such whistle is stochastic gradient descent (SGD), an algorithm that evolved from the stochastic approximation methods introduced by Robbins and Monro, 1951 (GO TAR HEELS!).

16 Neural Networks 3/4
- Calculating the full gradient is costly, but as LeCun et al. detail, stochastic gradient descent is much faster, and comes in both online and the smoother minibatch variations.
- The primary idea is to update using only the error at one point, E*:
  W_{t+1} = W_t - η ∇E*(W_t)
- The gradient of the error at one point is an estimate of the gradient of the full error at the current weights W_t, hence the name stochastic gradient descent.
- The speedup is feasible due to shared information across observations and the fact that, by decreasing the learning rate η, SGD still converges. This also holds for minibatch variants of SGD, which compute gradients for a handful of points instead of a single one (see the sketch below).
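A sketch of the minibatch variant just described; the gradient signature grad_E(W, batch) and the least-squares toy problem are assumptions made for the example.

```python
import numpy as np

def sgd(grad_E, W0, data, eta=0.01, batch_size=32, n_epochs=20, seed=0):
    """Minibatch SGD: W_{t+1} = W_t - eta * grad E*(W_t), where E* uses only a small batch."""
    rng = np.random.default_rng(seed)
    W = np.asarray(W0, dtype=float)
    n = len(data)
    for _ in range(n_epochs):
        idx = rng.permutation(n)                       # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = data[idx[start:start + batch_size]]
            W = W - eta * grad_E(W, batch)             # noisy but cheap gradient step
    return W

# Toy usage: fit a mean by least squares, E(W) = 0.5 * mean((data - W)^2), so grad E = mean(W - data).
rng = np.random.default_rng(1)
data = rng.normal(3.0, 1.0, size=1000)
print(sgd(lambda W, b: np.mean(W - b), W0=0.0, data=data))   # approaches 3.0
```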

17 Neural Networks 4/4
Rumelhart et al., 1986
- A popular bell to complement SGD's whistle is the addition of a momentum term to the update step. We more or less smooth our update steps with an exponential decay factor α:
  W_{t+1} = W_t - η ∇E(W_t) + α ΔW_t,   where ΔW_t = W_t - W_{t-1}
- This may seem familiar if you recall the momentum term that exists in Hamiltonian Monte Carlo (cue imaginary dramatic sound effect).
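The same SGD loop with the momentum term added; the velocity variable v accumulates past gradients and plays the role of α(W_t - W_{t-1}), echoing the momentum p in HMC. Names and defaults are illustrative.

```python
import numpy as np

def sgd_momentum(grad_E, W0, data, eta=0.01, alpha=0.9, batch_size=32, n_epochs=20, seed=0):
    """Minibatch SGD with momentum: v <- alpha*v - eta*grad E*,  W <- W + v."""
    rng = np.random.default_rng(seed)
    W = np.asarray(W0, dtype=float)
    v = np.zeros_like(W)                               # accumulated "velocity" (cf. HMC momentum p)
    n = len(data)
    for _ in range(n_epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = data[idx[start:start + batch_size]]
            v = alpha * v - eta * grad_E(W, batch)     # exponentially decayed sum of past gradients
            W = W + v                                  # same as W_t - eta*gradE + alpha*(W_t - W_{t-1})
    return W
```

The update is algebraically identical to the slide's form; keeping an explicit velocity simply makes the HMC analogy visible in code.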

18 Stochastic Gradient HMC (T. Chen et al., 2014)

19 Stochastic Gradient HMC 1/4
- As mentioned before, backprop uses the powerful stochastic gradient descent method and its extensions to fit gradient-based neural networks.
- Unfortunately, many of these neural nets lack inferentiability: they give point estimates rather than posterior uncertainty.
- One solution (other than proving P = NP or solving AGI) is to use Bayesian neural networks, which exist as a class of probabilistic graphical models and can be fitted with sampling or similar methods.
- BNNs still perform many of the surreal feats that other neural nets accomplish.
- However, even with Gibbs samplers and HMC, sampling in high dimensions is quite slow to converge... for now.

20 Stochastic Gradient HMC 2/4
Welling et al., 2011, T. Chen et al., 2014, and C. Chen et al., 2016
- In HMC, instead of calculating the gradient of the potential energy U(q) over the whole dataset D, what if we selected some minibatch D' ⊂ D to use for our estimate in the leapfrog method?
  ∇Ũ(q) ≈ ∇U(q) + N(0, Σ)   for some noise covariance Σ
  p_i(t + ε/2) = p_i(t) - (ε/2) ∂Ũ/∂q_i[q(t)]
- Unfortunately, this naive stochastic gradient HMC (SGHMC) injects noise into Hamilton's equations, which drives the MH acceptance ratio down to inefficient levels.

21 Stochastic Gradient HMC 3/4
- T. Chen et al. suggest fixing naive SGHMC by adding a friction term, in the spirit of Welling et al., borrowing once again from physics in the form of Langevin dynamics (vectorized form, omitting leapfrog notation):
  ∇Ũ = ∇U + N(0, 2B),   B = (ε/2) Σ
  q_{t+1} = q_t + M^{-1} p_t
  p_{t+1} = p_t - ε ∇Ũ(q_{t+1}) - B M^{-1} p_t
- Note that B is a PSD function of q_{t+1}, but Chen et al. also show that certain constant choices of the friction converge (and are far more practical); a code sketch follows below.
- Welling et al. also lament that Bayesian methods have been left behind in recent machine learning advances due to [MCMC] requiring computations over the whole dataset.
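A sketch of an SGHMC update with a constant friction term; following the practical simplification discussed in T. Chen et al., the noise estimate is taken as zero so the injected noise scales with the friction C alone. The data model, step size, and the signature grad_U_batch(q, batch) are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def sghmc(grad_U_batch, q0, data, eps=1e-2, C=1.0, batch_size=32, n_steps=5000, seed=0):
    """SGHMC sketch: minibatch gradients plus friction C*p and matching injected noise."""
    rng = np.random.default_rng(seed)
    q = np.asarray(q0, dtype=float)
    p = np.zeros_like(q)
    n = len(data)
    samples = []
    for _ in range(n_steps):
        batch = data[rng.choice(n, size=batch_size, replace=False)]
        noise = np.sqrt(2.0 * C * eps) * rng.standard_normal(q.shape)   # N(0, 2(C - B_hat) eps), B_hat = 0
        p = p - eps * grad_U_batch(q, batch) - eps * C * p + noise      # friction damps the gradient noise
        q = q + eps * p                                                 # no MH correction step needed
        samples.append(q.copy())
    return np.array(samples)

# Toy usage: posterior over a Gaussian mean with a flat prior, so the full-data gradient
# of U is n*(q - mean(data)); a minibatch estimate rescales the batch accordingly.
rng = np.random.default_rng(1)
data = rng.normal(3.0, 1.0, size=1000)
draws = sghmc(lambda q, b: len(data) * np.mean(q - b), q0=np.zeros(1), data=data)
print(draws[1000:].mean())   # roughly 3.0
```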

22 Stochastic Gradient HMC 4/4
- The end result of SGHMC is an efficient sampling algorithm that also permits computing gradients on a minibatch in a Bayesian setting.
- T. Chen et al. then show that, under deterministic settings, SGHMC behaves analogously to SGD with momentum, as the momentum components are related.
- C. Chen et al. (BEAT DOOK!) further show that many Bayesian MCMC sampling algorithms are analogs of stochastic optimization algorithms, which suggests that symbiotic discoveries and extensions of the two are possible, as demonstrated in their Stochastic AnNealing Thermostats with Adaptive momentum (Santa) algorithm, which incorporates recent advances from both domains.

23 Conclusions
- HMC is a promising variant of MCMC sampling algorithms with applications in Bayesian models.
- SGHMC offers more scalability in deep learning and several other settings, with the added benefit of inferentiability in Bayesian neural nets.
- Future work and collaborations between the sampling and optimization communities are promising.

24 Bibliography (by date)
Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323.
Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B.
LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. (1998). Efficient backprop. In Neural Networks: Tricks of the Trade. Springer.
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML).
Neal, R. M. (2012). MCMC using Hamiltonian dynamics. arXiv e-prints.
Chen, T., Fox, E. B., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. arXiv preprint, v2.
Chen, C., Carlson, D., Gan, Z., Li, C., and Carin, L. (2015). Bridging the gap between stochastic gradient MCMC and stochastic optimization. arXiv preprint.
Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint.

