Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods


Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods
Changyou Chen, Department of Electrical and Computer Engineering, Duke University (cc448@duke.edu)
Duke-Tsinghua Machine Learning Summer School, August 10, 2016

Preface

Stochastic gradient Markov chain Monte Carlo (SG-MCMC) is a recent technique for approximate Bayesian sampling, aimed at scalable Bayesian learning for big data. It draws samples $\{\theta_l\}$ from $p(\theta; \mathcal{D})$ in settings where $p(\theta; \mathcal{D})$ is too expensive to be evaluated in full at each iteration.

This lecture will cover the basic ideas behind SG-MCMC. It will not cover the many different kinds of SG-MCMC algorithms, their applications, or the corresponding convergence theory.

Outline

1 Markov Chain Monte Carlo Methods
   Monte Carlo methods
   Markov chain Monte Carlo
2 Stochastic Gradient Markov Chain Monte Carlo Methods
   Introduction
   Stochastic gradient Langevin dynamics
   Stochastic gradient Hamiltonian Monte Carlo
   Application in latent Dirichlet allocation

Monte Carlo methods

The Monte Carlo method draws a set of samples from $p(\theta)$:
$$\theta_l \sim p(\theta), \quad l = 1, 2, \ldots, L$$
It approximates the target distribution $p(\theta)$ by the empirical frequency of the samples:
$$p(\theta) \approx \frac{1}{L} \sum_{l=1}^{L} \delta(\theta, \theta_l)$$
An intractable integral is then approximated as:
$$\int f(\theta)\, p(\theta)\, d\theta \approx \frac{1}{L} \sum_{l=1}^{L} f(\theta_l)$$
In Bayesian modeling, $p(\theta)$ is usually a posterior distribution, and the integral is a predictive quantity.

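Below is a minimal runnable sketch of plain Monte Carlo integration as described above; the target distribution, test function, and sample size are illustrative choices, not from the slides.

```python
import numpy as np

# Estimate E[f(theta)] for theta ~ N(0, 1) with f(theta) = theta**2;
# the exact answer is Var(theta) = 1.
rng = np.random.default_rng(0)

L = 100_000
theta = rng.normal(loc=0.0, scale=1.0, size=L)  # theta_l ~ p(theta)
estimate = np.mean(theta**2)                    # (1/L) * sum_l f(theta_l)

print(f"Monte Carlo estimate: {estimate:.4f} (exact: 1.0)")
```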

How does the approximation work?

1 An intractable integral is approximated as:
$$\int f(\theta)\, p(\theta)\, d\theta \approx \frac{1}{L} \sum_{l=1}^{L} f(\theta_l) \triangleq \bar f$$
2 If the $\{\theta_l\}$ are independent:
$$\mathbb{E}\bar f = \mathbb{E}\left[\frac{1}{L} \sum_{l=1}^{L} f(\theta_l)\right] = \mathbb{E}f, \qquad \operatorname{Var}(\bar f) = \operatorname{Var}\left(\frac{1}{L} \sum_{l=1}^{L} f(\theta_l)\right) = \frac{1}{L} \operatorname{Var}(f)$$
The variance of the estimator decreases as $1/L$ in the number of samples, independent of the dimension of $\theta$.
3 However, obtaining independent samples is hard; one usually resorts to drawing dependent samples with Markov chain Monte Carlo (MCMC).

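A quick numerical check of the $1/L$ variance scaling, under illustrative choices of my own (target $N(0, I_d)$ and $f(\theta) = \sum_i \theta_i^2$, whose expectation is $d$):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50  # dimension; the 1/L scaling does not depend on it

for L in [100, 1_000, 10_000]:
    # 200 independent replications of the L-sample Monte Carlo estimator
    estimates = [np.mean(np.sum(rng.normal(size=(L, d)) ** 2, axis=1))
                 for _ in range(200)]
    print(L, np.var(estimates))  # drops roughly 10x per row: Var = 2d / L
```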

Outline recap. Next: Markov chain Monte Carlo.

MCMC example: a Gaussian model

1 Assume the following generative process (with $\alpha = 5$, $\beta = 1$):
$$x_i \mid \mu, \tau \sim N(\mu, 1/\tau), \quad i = 1, \ldots, n = 1000$$
$$\mu \mid \tau \sim N(\mu_0, 1/\tau), \qquad \tau \sim \text{Gamma}(\alpha, \beta)$$
2 Posterior distribution:
$$p(\mu, \tau \mid \{x_i\}) \propto \left[\prod_{i=1}^{n} N(x_i; \mu, 1/\tau)\right] N(\mu; \mu_0, 1/\tau)\, \text{Gamma}(\tau; \alpha, \beta)$$
3 Marginal posterior distributions for $\mu$ and $\tau$ are available in closed form:
$$p(\mu \mid \{x_i\}) \propto \left(2\beta + (\mu - \mu_0)^2 + \sum_i (x_i - \mu)^2\right)^{-\alpha - (n+1)/2}$$
$$p(\tau \mid \{x_i\}) = \text{Gamma}\left(\alpha + \frac{n}{2},\ \beta + \frac{1}{2} \sum_i (x_i - \bar x)^2 + \frac{n}{2(n+1)} (\bar x - \mu_0)^2\right)$$
Here $p(\mu \mid \{x_i\})$ is a non-standardized Student's t-distribution with mean $(\sum_i x_i + \mu_0)/(n+1)$.


Gibbs sampling $\mu$ and $\tau$

1 Conditional distributions:
$$\mu \mid \tau, \{x_i\} \sim N\left(\frac{n}{n+1} \bar x + \frac{1}{n+1} \mu_0,\ \frac{1}{(n+1)\tau}\right)$$
$$\tau \mid \mu, \{x_i\} \sim \text{Gamma}\left(\alpha + \frac{n+1}{2},\ \beta + \frac{\sum_i (x_i - \mu)^2 + (\mu - \mu_0)^2}{2}\right)$$
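A minimal runnable sketch of this Gibbs sampler. The hyperparameters follow the slides ($\alpha = 5$, $\beta = 1$); the choice $\mu_0 = 0$ and the synthetic data generation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, beta, mu0 = 5.0, 1.0, 0.0
n = 1000
x = rng.normal(loc=1.0, scale=0.5, size=n)   # synthetic observations
xbar = x.mean()

L = 1000
mu, tau = 0.0, 1.0                           # arbitrary initialization
samples = np.empty((L, 2))
for l in range(L):
    # mu | tau, x ~ N( n/(n+1) xbar + 1/(n+1) mu0 , 1/((n+1) tau) )
    mean = (n * xbar + mu0) / (n + 1)
    mu = rng.normal(mean, np.sqrt(1.0 / ((n + 1) * tau)))
    # tau | mu, x ~ Gamma( alpha + (n+1)/2 , beta + (sum(x-mu)^2 + (mu-mu0)^2)/2 )
    shape = alpha + (n + 1) / 2
    rate = beta + 0.5 * (np.sum((x - mu) ** 2) + (mu - mu0) ** 2)
    tau = rng.gamma(shape, 1.0 / rate)       # NumPy parameterizes by scale
    samples[l] = mu, tau

print(samples[200:].mean(axis=0))            # posterior means after burn-in
```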

Trace plot for $\mu$

(Figure: sample trace of $\mu$ over 1000 iterations, with the true mean and the sample mean overlaid.)

Sample approximation for $\mu$

The true posterior is a non-standardized Student's t-distribution.
(Figure: true density $p(\mu \mid x)$ versus the sample-based approximation.)

Trace plot for $\tau$

(Figure: sample trace of $\tau$ over 1000 iterations, with the true mean and the sample mean overlaid.)

Sample approximation for $\tau$

The true posterior is a Gamma distribution.
(Figure: true density $p(\tau \mid x)$ versus the sample-based approximation.)

Markov chain Monte Carlo methods

1 We are interested in drawing samples from some desired distribution $p^*(\theta) = \frac{1}{Z} \tilde p(\theta)$.
2 Define a Markov chain $\theta_0 \to \theta_1 \to \theta_2 \to \theta_3 \to \theta_4 \to \theta_5 \to \cdots$, where $\theta_0 \sim p_0(\theta)$, $\theta_1 \sim p_1(\theta)$, ..., satisfying
$$p_t(\theta') = \int p_{t-1}(\theta)\, T(\theta \to \theta')\, d\theta,$$
where $T(\theta \to \theta')$ is the Markov chain transition probability from $\theta$ to $\theta'$.
3 We say $p^*(\theta)$ is an invariant (stationary) distribution of the Markov chain iff
$$p^*(\theta') = \int p^*(\theta)\, T(\theta \to \theta')\, d\theta$$

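A tiny discrete illustration of the invariance condition (a toy example of my own, not from the slides): for a two-state chain with transition matrix $T$, the stationary distribution satisfies $p^* T = p^*$, the discrete analogue of the integral condition above.

```python
import numpy as np

# Two-state Markov chain; rows of T are the transition probabilities
# out of each state.
T = np.array([[0.9, 0.1],
              [0.4, 0.6]])

p = np.array([1.0, 0.0])        # arbitrary initial distribution p_0
for _ in range(100):
    p = p @ T                   # p_t = p_{t-1} T
print(p)                        # converges to p* = [0.8, 0.2]

p_star = np.array([0.8, 0.2])
print(p_star @ T)               # p* T = p*, confirming invariance
```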

Metropolis-Hastings algorithm

1 Design $T(\theta \to \theta')$ as the composition of a proposal distribution $q_t(\theta' \mid \theta)$ and an accept-reject mechanism.
2 At step $t$, draw a sample¹ $\theta' \sim q_t(\theta' \mid \theta_{t-1})$ and accept it with probability:
$$A_t(\theta', \theta_{t-1}) = \min\left(1,\ \frac{p(\theta')\, q_t(\theta_{t-1} \mid \theta')}{p(\theta_{t-1})\, q_t(\theta' \mid \theta_{t-1})}\right)$$
3 The acceptance can be implemented as:
   draw a random variable $u \sim \text{Uniform}(0, 1)$
   accept the sample if $A_t(\theta', \theta_{t-1}) > u$
4 The corresponding transition kernel satisfies the detailed balance condition, and thus has invariant distribution $p^*(\theta)$.

Footnote 1: A standard setting of $q_t(\theta' \mid \theta_{t-1})$ is a normal distribution with mean $\theta_{t-1}$ and tunable variance.

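A minimal runnable sketch of random-walk Metropolis-Hastings; the bimodal 1D target is an illustrative choice, not from the slides. With a symmetric Gaussian proposal, the $q$ ratio cancels in the acceptance probability:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(theta):
    """Unnormalized target: a 1D mixture of two Gaussians."""
    return np.exp(-0.5 * (theta - 2) ** 2) + np.exp(-0.5 * (theta + 2) ** 2)

def metropolis_hastings(n_samples=10_000, sigma=1.0):
    theta = 0.0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        prop = theta + sigma * rng.normal()       # symmetric proposal
        accept_prob = min(1.0, p_tilde(prop) / p_tilde(theta))
        if rng.uniform() < accept_prob:           # accept-reject step
            theta = prop
        samples[t] = theta
    return samples

samples = metropolis_hastings()
print(samples.mean(), samples.std())              # mean ~ 0 for this target
```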

Discussion on the proposal distribution

1 The standard proposal distribution is an isotropic Gaussian centered at the current state with scale $\sigma$:
   a small $\sigma$ leads to a high acceptance rate, but the chain moves too slowly
   a large $\sigma$ moves fast, but leads to a high rejection rate
2 How can we choose better proposals?
(Figure: contours of a correlated 2D target with the proposal scale $\sigma$ indicated.)

Gibbs sampler

1 Assume $\theta$ is multi-dimensional², $\theta = (\theta_1, \ldots, \theta_k, \ldots, \theta_K)$; denote $\theta_{-k} \triangleq \{\theta_j : j \neq k\}$.
2 Sample each $\theta_k$ sequentially, with the proposal distribution being the true conditional distribution:
$$q_k(\theta' \mid \theta) = p(\theta_k' \mid \theta_{-k})$$
3 Note $\theta_{-k}' = \theta_{-k}$ and $p(\theta) = p(\theta_k \mid \theta_{-k})\, p(\theta_{-k})$.
4 The MH acceptance probability is:
$$A(\theta', \theta) = \frac{p(\theta')\, q_k(\theta \mid \theta')}{p(\theta)\, q_k(\theta' \mid \theta)} = \frac{p(\theta_k' \mid \theta_{-k}')\, p(\theta_{-k}')\, p(\theta_k \mid \theta_{-k}')}{p(\theta_k \mid \theta_{-k})\, p(\theta_{-k})\, p(\theta_k' \mid \theta_{-k})} = 1$$

Footnote 2: A one-dimensional random variable is relatively easy to sample.


Discussion of the Gibbs sampler

1 No accept-reject step; very efficient.
2 Conditional distributions are not always easy to sample.
3 May not mix well in high-dimensional spaces with highly correlated variables.
(Figure: the zigzag sample path does not follow gradients. From PRML, Bishop (2006).)

The Metropolis-adjusted Langevin: a better proposal

1 Gibbs sampling travels the parameter space along a zigzag curve, which can be slow in high-dimensional spaces.
2 The Metropolis-adjusted Langevin algorithm uses a proposal that points toward the center of the probability contours, i.e., one that follows the gradient.

The Metropolis-adjusted Langevin: a better proposal

1 Let $E(\theta) \triangleq -\log p(\theta)$; the direction toward the contour center is just the negative gradient $-\nabla_\theta E(\theta)$.
2 In iteration $l$, define the proposal as a Gaussian centered at $\theta^* = \theta_{l-1} - \nabla_\theta E(\theta_{l-1})\, h_l$, where $h_l$ is a small stepsize:
$$q(\theta_l \mid \theta_{l-1}) = N\left(\theta_l;\ \theta^*,\ \sigma^2\right)$$
3 An accept-reject step is still needed:
   calculate the acceptance probability
$$A(\theta^*, \theta_{l-1}) = \min\left(1,\ \frac{p(\theta^*)\, q(\theta_{l-1} \mid \theta^*)}{p(\theta_{l-1})\, q(\theta^* \mid \theta_{l-1})}\right)$$
   accept $\theta^*$ with probability $A(\theta^*, \theta_{l-1})$; otherwise set $\theta_l = \theta_{l-1}$

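A minimal runnable sketch of MALA on a 1D standard normal target (an illustrative choice). The proposal variance is set to $2h$, the common Langevin discretization choice; this is an assumption on my part, since the slide leaves $\sigma^2$ generic:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_E(theta):
    """Gradient of E(theta) = -log p(theta) for p = N(0, 1)."""
    return theta

def log_p(theta):
    return -0.5 * theta ** 2

def mala(n_samples=5_000, h=0.1):
    theta = 0.0
    samples = np.empty(n_samples)
    for l in range(n_samples):
        # Gaussian proposal centered one gradient step downhill, variance 2h
        mean_fwd = theta - h * grad_E(theta)
        prop = mean_fwd + np.sqrt(2 * h) * rng.normal()
        mean_rev = prop - h * grad_E(prop)
        # log acceptance ratio: log p(prop) + log q(theta|prop)
        #                       - log p(theta) - log q(prop|theta)
        log_q_fwd = -((prop - mean_fwd) ** 2) / (4 * h)
        log_q_rev = -((theta - mean_rev) ** 2) / (4 * h)
        log_A = log_p(prop) + log_q_rev - log_p(theta) - log_q_fwd
        if np.log(rng.uniform()) < log_A:
            theta = prop
        samples[l] = theta
    return samples

print(mala().std())   # should be close to 1
```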

Hamiltonian Monte Carlo

Intuition: a frictionless ball rolling over a landscape.

1 A dynamical system with total energy, or Hamiltonian, $H = E(\theta) + K(v)$, where $E(\theta) \triangleq -\log p(\theta)$ and $K(v) \triangleq v^\top v / 2$.
2 Hamilton's equations describe the equations of motion of the ball:
$$\frac{d\theta}{dt} = \frac{\partial H}{\partial v} = v$$
$$\frac{dv}{dt} = -\frac{\partial H}{\partial \theta} = \frac{\partial \log p(\theta)}{\partial \theta}$$
3 Joint distribution: $p(\theta, v) \propto e^{-H(\theta, v)}$.
(Figure: rolling ball; movie from Matthias Liepe.)


Solving Hamiltonian dynamics

1 Solve the continuous-time differential equation with a discretized-time approximation:
$$\begin{cases} d\theta = v\, dt \\ dv = \nabla_\theta \log p(\theta)\, dt \end{cases} \quad\Longrightarrow\quad \begin{cases} \theta_l = \theta_{l-1} + v_{l-1} h_l \\ v_l = v_{l-1} + \nabla_\theta \log p(\theta_l)\, h_l \end{cases}$$
Proposals follow historical gradients of the distribution contour.
2 An accept-reject test is needed to decide whether to accept the proposal, because of the discretization error:
   the proposal is deterministic
   acceptance probability: $\min\left(1,\ \exp\{H(\theta_l, v_l) - H(\theta_{l+1}, v_{l+1})\}\right)$
3 Almost identical to SGD with momentum:
$$\begin{cases} \theta_l = \theta_{l-1} + p_{l-1} \\ p_l = (1 - m)\, p_{l-1} + \nabla_\theta \log p(\theta_l)\, \epsilon_l \end{cases}$$
They will be made equivalent in the context of stochastic gradient MCMC.

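A minimal runnable sketch of HMC on a 1D standard normal target. The slides show a simple discretization; this sketch uses the standard leapfrog integrator (an assumption on my part), which keeps the discretization error, and hence the rejection rate, small:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_p(theta):     # gradient of log p for p = N(0, 1)
    return -theta

def log_p(theta):
    return -0.5 * theta ** 2

def hmc_step(theta, h=0.1, n_leapfrog=20):
    v = rng.normal()                        # resample momentum
    theta_new, v_new = theta, v
    # leapfrog integration of Hamilton's equations
    v_new += 0.5 * h * grad_log_p(theta_new)
    for _ in range(n_leapfrog - 1):
        theta_new += h * v_new
        v_new += h * grad_log_p(theta_new)
    theta_new += h * v_new
    v_new += 0.5 * h * grad_log_p(theta_new)
    # accept-reject using the change in total energy H = E + K
    H_old = -log_p(theta) + 0.5 * v ** 2
    H_new = -log_p(theta_new) + 0.5 * v_new ** 2
    if np.log(rng.uniform()) < H_old - H_new:
        return theta_new
    return theta

theta, samples = 0.0, []
for _ in range(2000):
    theta = hmc_step(theta)
    samples.append(theta)
print(np.std(samples))     # should be close to 1
```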

Demo: MH vs. HMC

1 A mixture of nine Gaussians³.
2 Sequential samples are connected by yellow lines.

Footnote 3: Demo by T. Broderick and D. Duvenaud.

Recap

1 Bayesian sampling with traditional MCMC methods; in each iteration:
   generate a candidate sample from a proposal distribution
   calculate the acceptance probability
   accept or reject the proposed sample

Discussion

1 None of the above traditional MCMC methods is scalable in a big-data setting⁴; in each iteration:
   the whole dataset is needed to generate a proposal
   the whole dataset is needed to calculate the acceptance probability
   the cost scales as O(N), where N is the number of data samples
2 Scalable MCMC uses a subset of the data in each iteration:
   to calculate the acceptance probability⁵
   or to generate proposals while ignoring the acceptance step: stochastic gradient MCMC methods (SG-MCMC)

Footnote 4: when the number of data samples is large.
Footnote 5: A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget. In: ICML. 2014; R. Bardenet, A. Doucet, and C. Holmes. Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In: ICML. 2014.

Outline recap. Next: Stochastic Gradient MCMC, Introduction.

Two key steps in SG-MCMC

1 Proposals typically follow stochastic gradients of the log-posterior:
   this makes the samples concentrate on the modes
2 Random Gaussian noise is added to the proposals:
   this encourages the algorithm to jump out of local modes and explore the parameter space
   the noise in the stochastic gradients alone is not sufficient to make the algorithm move around the parameter space
(Figure: proposals of Gibbs sampling and SG-MCMC.)

Basic setup

1 Given data $X = \{x_1, \ldots, x_N\}$, a generative model (likelihood) $p(X \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$, and a prior $p(\theta)$, we want to sample from the posterior:
$$p(\theta \mid X) \propto p(\theta)\, p(X \mid \theta) = p(\theta) \prod_{i=1}^{N} p(x_i \mid \theta)$$
2 We are interested in the case where N is extremely large, so that computing $p(X \mid \theta)$ is prohibitively expensive.
3 Define the following two quantities (the unnormalized negative log-posterior and its stochastic estimate):
$$U(\theta) \triangleq -\sum_{i=1}^{N} \log p(x_i \mid \theta) - \log p(\theta)$$
$$\tilde U(\theta) \triangleq -\frac{N}{n} \sum_{i=1}^{n} \log p(x_{\pi_i} \mid \theta) - \log p(\theta)$$
where $(\pi_1, \ldots, \pi_N)$ is a random permutation of $(1, \ldots, N)$.

Basic setup

1 SG-MCMC relies on the following quantity (the stochastic gradient):
$$\nabla_\theta \tilde U(\theta) = -\frac{N}{n} \sum_{i=1}^{n} \nabla_\theta \log p(x_{\pi_i} \mid \theta) - \nabla_\theta \log p(\theta)$$
2 $\nabla_\theta \tilde U(\theta)$ is an unbiased estimate of $\nabla_\theta U(\theta)$:
   SG-MCMC samples parameters based on $\nabla_\theta \tilde U(\theta)$
   it is very cheap to compute
   it gives the method its name: stochastic gradient MCMC

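A minimal sketch of the unbiased stochastic gradient of $U(\theta)$ for a toy model of my own choosing: $x_i \sim N(\theta, 1)$ with prior $\theta \sim N(0, 1)$; the model is illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

N, n = 100_000, 100                     # dataset size, minibatch size
theta_true = 2.0
X = rng.normal(theta_true, 1.0, size=N)

def grad_U_tilde(theta, X, n):
    """N/n * minibatch sum of grad(-log p(x_i|theta)) + grad(-log p(theta))."""
    batch = X[rng.choice(len(X), size=n, replace=False)]
    grad_neg_loglik = (len(X) / n) * np.sum(theta - batch)  # -d/dtheta log N(x; theta, 1)
    grad_neg_logprior = theta                               # -d/dtheta log N(theta; 0, 1)
    return grad_neg_loglik + grad_neg_logprior

print(grad_U_tilde(1.0, X, n))          # noisy but unbiased estimate
```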

Comparing with traditional MCMC

1 The acceptance step is ignored:
   the detailed balance condition typically does not hold, and the algorithm is not reversible⁶
   this typically leads to biased, but controllable, estimates
2 A subset of the data is used in each iteration:
   this yields stochastic gradients
   it does not affect the convergence properties (e.g., convergence rates), compared to using the whole dataset in each iteration

Footnote 6: These are sufficient conditions for a valid MCMC method, but not necessary conditions.

Demo: the two key steps

1 When proposals only follow stochastic gradients of the log-posterior, the sampler gets stuck in a local mode.

Demo: the two key steps

1 After adding random Gaussian noise: it works!!

Outline recap. Next: stochastic gradient Langevin dynamics.

First attempt

1 A 1st-order method: stochastic gradients are applied directly to the model parameter $\theta$.
2 Use a proposal that follows the stochastic gradient of the log-posterior:
$$\theta_{l+1} = \theta_l - h_{l+1} \nabla_\theta \tilde U(\theta_l)$$
The $h_l$ are stepsizes, which can be fixed ($\forall l,\ h_l = h$) or decreasing ($\forall l,\ h_l > h_{l+1}$).
3 Ignore the acceptance step.
4 The result is stochastic gradient descent (SGD).

Random noise to the rescue

1 We need to make the algorithm explore the parameter space: add random Gaussian noise to the update⁷
$$\theta_{l+1} = \theta_l - h_{l+1} \nabla_\theta \tilde U(\theta_l) + \sqrt{2 h_{l+1}}\, \zeta_{l+1}, \qquad \zeta_{l+1} \sim N(0, I)$$
2 The magnitude of the Gaussian noise needs to be $\sqrt{2 h_{l+1}}$ in order to guarantee a correct sampler:
   this is guaranteed by the Fokker-Planck equation
3 This is called stochastic gradient Langevin dynamics (SGLD).

Footnote 7: In the following, we directly use $N(0, I)$ to represent a normal random variable with zero mean and covariance matrix $I$.

SGLD in algorithm

Algorithm 1: Stochastic Gradient Langevin Dynamics
Input: parameters $\{h_l\}$
Output: approximate samples $\{\theta_l\}$
Initialize $\theta_0 \in \mathbb{R}^n$
for $l = 1, 2, \ldots$ do
   Evaluate $\nabla_\theta \tilde U(\theta_{l-1})$ from the $l$-th minibatch
   $\theta_l = \theta_{l-1} - \nabla_\theta \tilde U(\theta_{l-1})\, h_l + \sqrt{2 h_l}\, N(0, I)$
end
Return $\{\theta_l\}$
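A minimal runnable sketch of Algorithm 1 on a toy conjugate model ($x_i \sim N(\theta, 1)$ with prior $\theta \sim N(0, 1)$); the model and the polynomially decreasing stepsize schedule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

N, n = 100_000, 100                       # dataset size, minibatch size
X = rng.normal(2.0, 1.0, size=N)          # synthetic data, true theta = 2

def grad_U_tilde(theta, X, n):
    """Stochastic gradient of U: N/n * minibatch sum + prior term."""
    batch = X[rng.choice(len(X), size=n, replace=False)]
    return (len(X) / n) * np.sum(theta - batch) + theta

theta, samples = 0.0, []
for l in range(1, 5001):
    h_l = 1e-6 / l ** 0.55                # decreasing stepsizes (assumed schedule)
    theta = (theta - h_l * grad_U_tilde(theta, X, n)
             + np.sqrt(2 * h_l) * rng.normal())
    samples.append(theta)

print(np.mean(samples[1000:]))            # close to the posterior mean (~2)
```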

Example⁸

1 A simple Gaussian mixture:
$$\theta_1 \sim N(0, 10), \qquad \theta_2 \sim N(0, 1)$$
$$x_i \sim \frac{1}{2} N(\theta_1, 2) + \frac{1}{2} N(\theta_1 + \theta_2, 2), \quad i = 1, \ldots, 100$$
(Figure 1: true and estimated posterior distribution. Left: true posterior; right: sample-based estimation. A further figure plots the average log joint probability per datum and accuracy against the number of sweeps through the data, averaged over 50 runs with one standard deviation shown.)

Footnote 8: M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In: ICML. 2011.

Outline recap. Next: stochastic gradient Hamiltonian Monte Carlo.

SGHMC

1 A 2nd-order method: stochastic gradients are applied to auxiliary parameters (the momentum).
2 SGLD is slow when the parameter space exhibits uneven curvature.
3 Use the momentum idea to improve SGLD:
   a generalization of HMC, in which the ball rolls on a surface with friction
   the ball follows the momentum instead of the instantaneous gradient; the momentum summarizes historical gradients, so the sampler can jump out of local modes more easily and move faster
   this requires a balance among the extra forces: momentum, friction, random force, and gravity

Adding a friction term

1 Without a friction term, the random Gaussian noise would drive the ball too far away from its stationary distribution.
2 After adding a friction term:
$$\theta_l = \theta_{l-1} + v_{l-1} h_l$$
$$v_l = v_{l-1} - \nabla_\theta \tilde U(\theta_l)\, h_l - A\, v_{l-1} h_l + \sqrt{2 A h_l}\, N(0, I),$$
where $A > 0$ is a constant⁹ controlling the magnitude of the friction.
3 The friction term penalizes the momentum: the larger the momentum, the more friction it experiences, thus slowing down the ball.

Footnote 9: In the original SGHMC paper, $A$ is decomposed into a known variance of the injected noise and an unknown variance of the stochastic gradients.


SGHMC in algorithm

Algorithm 2: Stochastic Gradient Hamiltonian Monte Carlo
Input: parameters $A$, $\{h_l\}$
Output: approximate samples $\{\theta_l\}$
Initialize $\theta_0 \in \mathbb{R}^n$ (and $v_0$, e.g. $v_0 = 0$)
for $l = 1, 2, \ldots$ do
   $\theta_l = \theta_{l-1} + v_{l-1} h_l$
   Evaluate $\nabla_\theta \tilde U(\theta_l)$ from the $l$-th minibatch
   $v_l = v_{l-1} - \nabla_\theta \tilde U(\theta_l)\, h_l - A\, v_{l-1} h_l + \sqrt{2 A h_l}\, N(0, I)$
end
Return $\{\theta_l\}$
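A minimal runnable sketch of Algorithm 2 on the same toy model as the SGLD sketch; the friction constant $A$ and stepsize $h$ are illustrative choices, and the naive treatment of the gradient noise inflates the sample variance somewhat, as footnote 9 hints:

```python
import numpy as np

rng = np.random.default_rng(0)

N, n = 100_000, 100
X = rng.normal(2.0, 1.0, size=N)          # synthetic data, true theta = 2

def grad_U_tilde(theta, X, n):
    batch = X[rng.choice(len(X), size=n, replace=False)]
    return (len(X) / n) * np.sum(theta - batch) + theta

A, h = 50.0, 1e-4                         # friction and stepsize (assumed values)
theta, v = 0.0, 0.0
samples = []
for l in range(5000):
    theta = theta + v * h
    grad = grad_U_tilde(theta, X, n)
    v = v - grad * h - A * v * h + np.sqrt(2 * A * h) * rng.normal()
    samples.append(theta)

print(np.mean(samples[1000:]))            # close to the posterior mean (~2)
```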

Reparametrize SGHMC

The SGHMC updates:
for $l = 1, 2, \ldots$ do
   $\theta_l = \theta_{l-1} + v_{l-1} h_l$
   Evaluate $\nabla_\theta \tilde U(\theta_l)$ from the $l$-th minibatch
   $v_l = v_{l-1} - \nabla_\theta \tilde U(\theta_l)\, h_l - A\, v_{l-1} h_l + \sqrt{2 A h_l}\, N(0, I)$
end

With the reparametrization $\epsilon = h^2$, $m = A h$, $p = v h$, they become:
for $l = 1, 2, \ldots$ do
   $\theta_l = \theta_{l-1} + p_{l-1}$
   Evaluate $\nabla_\theta \tilde U(\theta_l)$ from the $l$-th minibatch
   $p_l = (1 - m)\, p_{l-1} - \nabla_\theta \tilde U(\theta_l)\, \epsilon_l + \sqrt{2 m \epsilon_l}\, N(0, I)$
end

Here $\epsilon_l$ plays the role of the learning rate and $m$ that of the momentum weight.

SGD vs. SGLD

$$\nabla_\theta \tilde U(\theta_{l-1}) = -\frac{N}{n} \sum_{i=1}^{n} \nabla_\theta \log p(x_{\pi_i} \mid \theta_{l-1}) - \nabla_\theta \log p(\theta_{l-1})$$

SGD:
for $l = 1, 2, \ldots$ do
   Evaluate $\nabla_\theta \tilde U(\theta_{l-1})$ from the $l$-th minibatch
   $\theta_l = \theta_{l-1} - \nabla_\theta \tilde U(\theta_{l-1})\, \epsilon_l$
end

SGLD:
for $l = 1, 2, \ldots$ do
   Evaluate $\nabla_\theta \tilde U(\theta_{l-1})$ from the $l$-th minibatch
   $\theta_l = \theta_{l-1} - \nabla_\theta \tilde U(\theta_{l-1})\, \epsilon_l + \delta_l$, with $\delta_l \sim N(0, 2\epsilon_l I)$
end

The only difference is the injected Gaussian noise $\delta_l$.

SGD with momentum (SGD-M) vs. SGHMC

$$\nabla_\theta \tilde U(\theta_{l-1}) = -\frac{N}{n} \sum_{i=1}^{n} \nabla_\theta \log p(x_{\pi_i} \mid \theta_{l-1}) - \nabla_\theta \log p(\theta_{l-1})$$

SGD-M:
for $l = 1, 2, \ldots$ do
   Evaluate $\nabla_\theta \tilde U(\theta_{l-1})$ from the $l$-th minibatch
   $\theta_l = \theta_{l-1} + p_{l-1}$
   $p_l = (1 - m)\, p_{l-1} - \nabla_\theta \tilde U(\theta_l)\, \epsilon_l$
end

SGHMC:
for $l = 1, 2, \ldots$ do
   Evaluate $\nabla_\theta \tilde U(\theta_{l-1})$ from the $l$-th minibatch
   $\theta_l = \theta_{l-1} + p_{l-1}$
   $p_l = (1 - m)\, p_{l-1} - \nabla_\theta \tilde U(\theta_l)\, \epsilon_l + \delta_l$, with $\delta_l \sim N(0, 2 m \epsilon_l I)$
end

Again, the only difference is the injected Gaussian noise $\delta_l$.

Example¹⁰

1 Sample from a 2D Gaussian distribution:
$$U(\theta) = \frac{1}{2} \theta^\top \Sigma^{-1} \theta$$
(Figure: contrasting sampling of a bivariate Gaussian with correlation using SGHMC versus SGLD; panels show the average absolute error of the sample covariance and the autocorrelation time. The residual caption notes that noisy Hamiltonian dynamics lead to diverging trajectories when friction is not introduced, and that resampling the momentum helps control divergence but yields an incorrect stationary distribution.)

Footnote 10: T. Chen, E. B. Fox, and C. Guestrin. Stochastic Gradient Hamiltonian Monte Carlo. In: ICML. 2014.

Recap

1 For SG-MCMC methods, in each iteration:
   calculate the stochastic gradient based on the current parameter sample
   generate the next sample by moving the current sample (possibly in an extended space) along the direction of the stochastic gradient, plus a suitable random Gaussian noise
   no accept-reject step is needed
   the methods are guaranteed to converge close to the true posterior (in an appropriate sense)

Outline recap. Next: application in latent Dirichlet allocation.

Latent Dirichlet allocation

1 For each topic $k$, draw the topic-word distribution: $\beta_k \sim \text{Dir}(\gamma)$.
2 For each document $d$, draw its topic distribution: $\theta_d \sim \text{Dir}(\alpha)$.
   For each word $l$, draw its topic indicator: $c_{dl} \sim \text{Discrete}(\theta_d)$.
   Draw the observed word: $x_{dl} \sim \text{Discrete}(\beta_{c_{dl}})$.
(Figure: the LDA plate diagram, with plates over the $K$ topics, the $D$ documents, and the $N$ words per document.)

Latent Dirichlet allocation

1 Let $\beta \triangleq (\beta_k)_{k=1}^{K}$, $\theta \triangleq (\theta_d)_{d=1}^{D}$, $C \triangleq (c_{dl})_{d,l=1}^{D,n_d}$, $X \triangleq (x_{dl})_{d,l=1}^{D,n_d}$. The posterior distribution is:
$$p(\beta, \theta, C \mid X) \propto \left[\prod_{k=1}^{K} p(\beta_k \mid \gamma)\right] \left[\prod_{d=1}^{D} p(\theta_d \mid \alpha)\right] \prod_{d=1}^{D} \prod_{l=1}^{n_d} p(c_{dl} \mid \theta_d)\, p(x_{dl} \mid \beta, c_{dl})$$
2 From previous lectures:
$$p(c_{dl} \mid \theta_d) = \prod_{k=1}^{K} (\theta_{dk})^{1(c_{dl} = k)}, \qquad p(x_{dl} \mid \beta, c_{dl}) = \prod_{k=1}^{K} \prod_{v=1}^{V} \beta_{kv}^{1(x_{dl} = v)\, 1(c_{dl} = k)}$$
3 Together with the fact that, for $\theta$ on the $(K-1)$-simplex:
$$\int \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}\, d\theta = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}$$

Latent Dirichlet allocation

1 Integrating out the local parameters (the topic distributions $\theta$ for each document) results in the following semi-collapsed distribution:
$$p(X, C, \beta \mid \alpha, \gamma) = \prod_{d=1}^{D} \frac{\Gamma(K\alpha)}{\Gamma(K\alpha + n_{d \cdot \cdot})} \prod_{k=1}^{K} \frac{\Gamma(\alpha + n_{d k \cdot})}{\Gamma(\alpha)} \prod_{k=1}^{K} \frac{\Gamma(V\gamma)}{\Gamma(\gamma)^V} \prod_{v=1}^{V} \beta_{kv}^{\gamma + n_{\cdot k v} - 1},$$
where $n_{dkw} \triangleq \sum_{l=1}^{n_d} 1(c_{dl} = k)\, 1(x_{dl} = w)$ is the number of times word $w$ occurs in document $d$ under topic $k$, and "$\cdot$" denotes a marginal sum, e.g. $n_{\cdot k w} \triangleq \sum_{d=1}^{D} n_{dkw}$.
2 SG-MCMC requires an unconstrained parameter space; reparameterize $\beta_{kv} = \lambda_{kv} / \sum_{v'} \lambda_{kv'}$ with the prior $\lambda_{kv} \sim \text{Ga}(\gamma, 1)$, so that the $\beta$ factors become:
$$\prod_{k=1}^{K} \frac{\Gamma(V\gamma)}{\Gamma(\gamma)^V} \prod_{v=1}^{V} \beta_{kv}^{\gamma + n_{\cdot k v} - 1} \;\longrightarrow\; \prod_{k=1}^{K} \left[\prod_{v=1}^{V} \text{Ga}(\lambda_{kv}; \gamma, 1)\right] \prod_{v=1}^{V} \left(\frac{\lambda_{kv}}{\sum_{v'} \lambda_{kv'}}\right)^{n_{\cdot k v}}$$

Latent Dirichlet allocation

1 We still need to integrate out the local parameters $C$:
$$p(X, \lambda \mid \alpha, \gamma) = \mathbb{E}_C\left[p(X, C, \beta \mid \alpha, \gamma)\right] = \mathbb{E}_C\left[\prod_{d=1}^{D} \frac{\Gamma(K\alpha)}{\Gamma(K\alpha + n_{d \cdot \cdot})} \prod_{k=1}^{K} \frac{\Gamma(\alpha + n_{d k \cdot})}{\Gamma(\alpha)} \prod_{k=1}^{K} \prod_{v=1}^{V} \text{Ga}(\lambda_{kv}; \gamma, 1) \left(\frac{\lambda_{kv}}{\sum_{v'} \lambda_{kv'}}\right)^{n_{\cdot k v}}\right]$$
2 The stochastic gradient with a minibatch of documents $\mathcal{D}$, of size $|\mathcal{D}| \ll D$, is:
$$\frac{\partial \log p(\lambda \mid \alpha, \gamma, X)}{\partial \lambda_{kw}} = \frac{\gamma - 1}{\lambda_{kw}} - 1 + \frac{D}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \mathbb{E}_{c_d \mid x_d, \lambda, \alpha}\left[\frac{n_{dkw}}{\lambda_{kw}} - \frac{n_{dk\cdot}}{\lambda_{k\cdot}}\right]$$
3 SGLD update:
$$\lambda_{kw}^{t+1} = \lambda_{kw}^{t} + h_{t+1}\, \frac{\partial \log p(\lambda \mid \alpha, \gamma, X)}{\partial \lambda_{kw}} + \sqrt{2 h_{t+1}}\, N(0, I)$$

Latent Dirichlet allocation

1 LDA with the above SGLD update would not work well in practice, because of the high dimensionality of the model parameters.
2 To make it work, Riemannian geometry information (2nd-order information) needs to be brought into SGLD:
   this leads to stochastic gradient Riemannian Langevin dynamics (SGRLD) for LDA¹¹
   it takes the parameter geometry into account, so that the step size for each dimension of the parameter is adaptive

Footnote 11: S. Patterson and Y. W. Teh. Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex. In: NIPS. 2013.

Experiments: SGRLD for LDA¹²

1 NIPS dataset: the collection of NIPS papers from 1988-2003, with 2483 documents; 50 topics.

Footnote 12: S. Patterson and Y. W. Teh. Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex. In: NIPS. 2013.

Experiments: SGRLD for LDA¹³

1 Wikipedia dataset: a set of articles downloaded at random from Wikipedia, with 150,000 documents.

Footnote 13: S. Patterson and Y. W. Teh. Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex. In: NIPS. 2013.

Conclusion

1 I have introduced:
   basic concepts in MCMC
   basic ideas in SG-MCMC, two SG-MCMC algorithms, and an application in LDA
2 Topics not covered:
   a general review of SG-MCMC algorithms
   theory related to stochastic differential equations and Itô diffusions
   convergence theory
   various applications in deep learning, including SG-MCMC for learning weight uncertainty and SG-MCMC for deep generative models
Interested readers should refer to the related references.

Thank You