Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods


Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods
Changyou Chen
Department of Electrical and Computer Engineering, Duke University
Duke-Tsinghua Machine Learning Summer School, August 10, 2016
Changyou Chen (Duke University) SG-MCMC 1 / 56

Preface
Stochastic gradient Markov chain Monte Carlo (SG-MCMC): a recent technique for approximate Bayesian sampling, aimed at scalable Bayesian learning for big data. It draws samples {θ_l} from a posterior p(θ | D) that is too expensive to evaluate in full at each iteration.
This lecture:
Will cover: the basic ideas behind SG-MCMC.
Will not cover: the different kinds of SG-MCMC algorithms, their applications, and the corresponding convergence theory.

Outline
1 Markov Chain Monte Carlo Methods
   Monte Carlo methods
   Markov chain Monte Carlo
2 Stochastic Gradient Markov Chain Monte Carlo Methods
   Introduction
   Stochastic gradient Langevin dynamics
   Stochastic gradient Hamiltonian Monte Carlo
   Application in Latent Dirichlet allocation


Monte Carlo methods
The Monte Carlo method draws a set of samples from p(θ):
    θ_l ~ p(θ), l = 1, 2, ..., L
and approximates the target distribution by the empirical (count-frequency) measure:
    p(θ) ≈ (1/L) Σ_{l=1}^{L} δ(θ, θ_l)
An intractable integral is then approximated as:
    ∫ f(θ) p(θ) dθ ≈ (1/L) Σ_{l=1}^{L} f(θ_l)
In Bayesian modeling, p(θ) is usually a posterior distribution and the integral is a predictive quantity.

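The Monte Carlo recipe above can be sketched in a few lines. As a minimal illustration (not from the slides), we estimate E[θ²] under a standard normal, where the true value is 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(f, sampler, L):
    """Approximate E[f(theta)] by (1/L) * sum_l f(theta_l), theta_l ~ p."""
    samples = sampler(L)
    return np.mean(f(samples))

# Example: theta ~ N(0, 1), f(theta) = theta^2, so E[f] = Var(theta) = 1.
est = mc_expectation(lambda t: t ** 2, lambda L: rng.standard_normal(L), L=100_000)
```

With L = 100,000 samples the estimate lands very close to 1, consistent with the O(1/√L) error of the estimator.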

How does the approximation work?
1 An intractable integral is approximated as:
    ∫ f(θ) p(θ) dθ ≈ (1/L) Σ_{l=1}^{L} f(θ_l) ≜ f̂
2 If the {θ_l} are independent:
    E[f̂] = E[(1/L) Σ_{l=1}^{L} f(θ_l)] = E[f],  Var(f̂) = Var((1/L) Σ_{l=1}^{L} f(θ_l)) = (1/L) Var(f)
  The variance decreases linearly in the number of samples, independent of the dimension of θ.
3 However, obtaining independent samples is hard; one usually resorts to drawing dependent samples with Markov chain Monte Carlo (MCMC).


Outline
1 Markov Chain Monte Carlo Methods
   Monte Carlo methods
   Markov chain Monte Carlo
2 Stochastic Gradient Markov Chain Monte Carlo Methods
   Introduction
   Stochastic gradient Langevin dynamics
   Stochastic gradient Hamiltonian Monte Carlo
   Application in Latent Dirichlet allocation

MCMC example: a Gaussian model
1 Assume the following generative process (with α = 5, β = 1):
    x_i | µ, τ ~ N(µ, 1/τ), i = 1, ..., n = 1000
    µ | τ ~ N(µ_0, 1/τ),  τ ~ Gamma(α, β)
2 Posterior distribution:
    p(µ, τ | {x_i}) ∝ [ Π_{i=1}^{n} N(x_i; µ, 1/τ) ] N(µ; µ_0, 1/τ) Gamma(τ; α, β)
3 Marginal posterior distributions for µ and τ are available in closed form:
    p(µ | {x_i}) ∝ ( 2β + (µ − µ_0)² + Σ_i (x_i − µ)² )^{−α − (n+1)/2}
    p(τ | {x_i}) = Gamma( α + n/2, β + (1/2) Σ_i (x_i − x̄)² + n (x̄ − µ_0)² / (2(n + 1)) )
  p(µ | {x_i}) is a non-standardized Student's t-distribution with mean (Σ_i x_i + µ_0)/(n + 1).


Gibbs sampling µ and τ
1 Conditional distributions:
    µ | τ, {x_i} ~ N( n x̄/(n + 1) + µ_0/(n + 1), 1/((n + 1) τ) )
    τ | µ, {x_i} ~ Gamma( α + (n + 1)/2, β + ( Σ_i (x_i − µ)² + (µ − µ_0)² ) / 2 )
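The two conditional updates above translate directly into a Gibbs loop. A minimal sketch, where the synthetic data and variable names are my illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from the generative process (alpha = 5, beta = 1, mu_0 = 0).
alpha, beta, mu0, n = 5.0, 1.0, 0.0, 1000
tau_true = rng.gamma(alpha, 1.0 / beta)                      # numpy uses (shape, scale)
mu_true = rng.normal(mu0, 1.0 / np.sqrt(tau_true))
x = rng.normal(mu_true, 1.0 / np.sqrt(tau_true), size=n)
xbar = x.mean()

def gibbs(num_iter=5000):
    mu, tau = 0.0, 1.0
    samples = []
    for _ in range(num_iter):
        # mu | tau, x ~ N( n*xbar/(n+1) + mu0/(n+1), 1/((n+1)*tau) )
        mean = (n * xbar + mu0) / (n + 1)
        mu = rng.normal(mean, 1.0 / np.sqrt((n + 1) * tau))
        # tau | mu, x ~ Gamma( alpha + (n+1)/2, beta + (sum (x-mu)^2 + (mu-mu0)^2)/2 )
        shape = alpha + (n + 1) / 2
        rate = beta + 0.5 * (np.sum((x - mu) ** 2) + (mu - mu0) ** 2)
        tau = rng.gamma(shape, 1.0 / rate)                   # scale = 1 / rate
        samples.append((mu, tau))
    return np.array(samples)

samples = gibbs()
```

The sample mean of the µ draws should match the analytic posterior mean (Σ_i x_i + µ_0)/(n + 1) from the previous slide.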

Trace plot for µ
[Figure: trace of µ samples over iterations, with the true mean and the sample mean overlaid.]

Sample approximation for µ
True posterior is a non-standardized Student's t-distribution.
[Figure: true posterior p(µ | x) vs. the sample-based approximation.]

Trace plot for τ
[Figure: trace of τ samples over iterations, with the true mean and the sample mean overlaid.]

Sample approximation for τ
True posterior is a Gamma distribution.
[Figure: true posterior p(τ | x) vs. the sample-based approximation.]

Markov chain Monte Carlo methods
1 We are interested in drawing samples from some desired distribution p*(θ) = (1/Z) p̃(θ).
2 Define a Markov chain θ_0 → θ_1 → θ_2 → θ_3 → θ_4 → θ_5 → ..., where θ_0 ~ p_0(θ), θ_1 ~ p_1(θ), ..., satisfying
    p_t(θ') = ∫ p_{t−1}(θ) T(θ → θ') dθ,
  where T(θ → θ') is the Markov chain transition probability from θ to θ'.
3 We say p*(θ) is an invariant (stationary) distribution of the Markov chain iff:
    p*(θ') = ∫ p*(θ) T(θ → θ') dθ


Metropolis-Hastings algorithm
1 Design T(θ → θ') as the composition of a proposal distribution q_t(θ' | θ) and an accept-reject mechanism.
2 At step t, draw a sample¹ θ' ~ q_t(θ' | θ_{t−1}) and accept it with probability:
    A_t(θ', θ_{t−1}) = min( 1, [p(θ') q_t(θ_{t−1} | θ')] / [p(θ_{t−1}) q_t(θ' | θ_{t−1})] )
3 The acceptance can be implemented as: draw a random variable u ~ Uniform(0, 1) and accept the sample if A_t(θ', θ_{t−1}) > u.
4 The corresponding transition kernel satisfies the detailed balance condition, and thus has invariant distribution p*(θ).
¹ A standard choice of q_t(θ' | θ_{t−1}) is a normal distribution with mean θ_{t−1} and tunable variance.

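The MH loop above can be sketched in a few lines of Python. The target density and proposal scale below are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def mh_sampler(log_p, theta0, sigma=1.0, num_iter=10_000):
    """Random-walk Metropolis-Hastings with a Gaussian proposal centered at
    the current state; the symmetric proposal cancels in the acceptance ratio."""
    theta = theta0
    samples = []
    for _ in range(num_iter):
        prop = theta + sigma * rng.standard_normal()
        # Accept with probability min(1, p(prop)/p(theta)); compare in
        # log-space for numerical stability.
        if np.log(rng.uniform()) < log_p(prop) - log_p(theta):
            theta = prop
        samples.append(theta)
    return np.array(samples)

# Target: standard normal, known only up to the normalizer Z (which cancels).
samples = mh_sampler(lambda t: -0.5 * t ** 2, theta0=0.0)
```

After a short burn-in, the empirical mean and standard deviation of the chain should match the N(0, 1) target.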

Discussion on the proposal distribution
1 The standard proposal distribution is an isotropic Gaussian centered at the current state with variance σ²:
  - small σ leads to a high acceptance rate, but the chain moves too slowly
  - large σ moves fast, but leads to a high rejection rate
2 How can we choose better proposals?

Gibbs sampler
1 Assume θ is multi-dimensional², θ = (θ_1, ..., θ_k, ..., θ_K); denote θ_{−k} ≜ {θ_j : j ≠ k}.
2 Sample each θ_k sequentially, with the proposal distribution being the true conditional distribution:
    q_k(θ' | θ) = p(θ'_k | θ_{−k})
3 Note θ'_{−k} = θ_{−k} and p(θ) = p(θ_k | θ_{−k}) p(θ_{−k}).
4 The MH acceptance probability is:
    A(θ', θ) = [p(θ') q_k(θ | θ')] / [p(θ) q_k(θ' | θ)]
             = [p(θ'_k | θ'_{−k}) p(θ'_{−k}) p(θ_k | θ'_{−k})] / [p(θ_k | θ_{−k}) p(θ_{−k}) p(θ'_k | θ_{−k})] = 1,
  so every proposal is accepted.
² A one-dimensional random variable is relatively easy to sample.


Discussion of the Gibbs sampler
1 No accept-reject step; very efficient.
2 Conditional distributions are not always easy to sample.
3 May not mix well in high-dimensional spaces with highly correlated variables.
[Figure: the sample path does not follow gradients. From PRML, Bishop (2006).]

The Metropolis-adjusted Langevin: a better proposal
1 Gibbs sampling travels the parameter space along a zigzag curve, which can be slow in high-dimensional spaces.
2 The Metropolis-adjusted Langevin algorithm uses a proposal that points directly toward the center of the probability contours.

The Metropolis-adjusted Langevin: a better proposal
1 Let E(θ) ≜ −log p(θ); the direction toward the center of the contours is the negative gradient −∇_θ E(θ).
2 In iteration l, define the proposal as a Gaussian centered at θ* = θ_{l−1} − ∇_θ E(θ_{l−1}) h_l, where h_l is a small stepsize:
    q(θ_l | θ_{l−1}) = N(θ_l; θ*, σ²)
3 An accept-reject step is still needed: calculate the acceptance probability
    A(θ*, θ_{l−1}) = [p(θ*) q(θ_{l−1} | θ*)] / [p(θ_{l−1}) q(θ* | θ_{l−1})],
  accept θ* with probability A(θ*, θ_{l−1}), otherwise set θ_l = θ_{l−1}.

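One MALA iteration under these definitions can be sketched as follows; the target, stepsize, and the standard choice σ² = 2h are my illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def mala(grad_E, log_p, theta0, h=0.1, num_iter=20_000):
    """Metropolis-adjusted Langevin: propose a gradient step on
    E(theta) = -log p(theta) plus Gaussian noise, then accept/reject."""
    sigma2 = 2 * h                     # the standard Langevin noise scale
    theta = theta0
    samples = []
    for _ in range(num_iter):
        mean_fwd = theta - h * grad_E(theta)
        prop = mean_fwd + np.sqrt(sigma2) * rng.standard_normal()
        mean_bwd = prop - h * grad_E(prop)
        # log q(theta | prop) - log q(prop | theta) for Gaussian proposals.
        log_q_ratio = (-(theta - mean_bwd) ** 2 + (prop - mean_fwd) ** 2) / (2 * sigma2)
        if np.log(rng.uniform()) < log_p(prop) - log_p(theta) + log_q_ratio:
            theta = prop
        samples.append(theta)
    return np.array(samples)

# Target: standard normal, so E(theta) = theta^2 / 2 and grad E = theta.
samples = mala(grad_E=lambda t: t, log_p=lambda t: -0.5 * t ** 2, theta0=0.0)
```

Because the proposal is asymmetric, the correction term `log_q_ratio` must be included in the acceptance ratio, unlike in random-walk MH.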

Hamiltonian Monte Carlo
Frictionless ball rolling:
1 A dynamical system with total energy, or Hamiltonian, H = E(θ) + K(v), where E(θ) ≜ −log p(θ) is the potential energy and K(v) ≜ vᵀv/2 is the kinetic energy.
2 Hamilton's equations describe the equations of motion of the ball:
    dθ/dt = ∂H/∂v = v
    dv/dt = −∂H/∂θ = ∇_θ log p(θ)
3 Joint distribution: p(θ, v) ∝ e^{−H(θ, v)}.
[Figure: rolling ball. Movie from Matthias Liepe.]


Solving Hamiltonian dynamics
1 Solve the continuous-time differential equations with a discrete-time approximation:
    dθ = v dt            →  θ_l = θ_{l−1} + v_{l−1} h_l
    dv = ∇_θ log p(θ) dt →  v_l = v_{l−1} + ∇_θ log p(θ_l) h_l
  Proposals follow historical gradients of the distribution contours.
2 An accept-reject test is needed to decide whether to accept the proposal, because of the discretization error:
  - the proposal is deterministic
  - acceptance probability: min(1, exp{H(θ_l, v_l) − H(θ_{l+1}, v_{l+1})})
3 Almost identical to SGD with momentum:
    θ_l = θ_{l−1} + p_{l−1}
    p_l = (1 − m) p_{l−1} + ∇_θ log p(θ_l) ε_l
  The two will be made equivalent in the context of stochastic gradient MCMC.

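The simulate-then-correct scheme above can be sketched as follows. As an assumption on my part, the sketch uses a leapfrog integrator (the standard, reversible choice for HMC) rather than the plain one-step scheme shown on the slide, and a 2D Gaussian target:

```python
import numpy as np

rng = np.random.default_rng(0)

def hmc(grad_log_p, log_p, theta0, h=0.1, num_steps=20, num_iter=5000):
    """HMC sketch: resample momentum, simulate Hamiltonian dynamics with a
    leapfrog integrator, then accept/reject on the change in total energy
    H = -log p(theta) + v^T v / 2."""
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for _ in range(num_iter):
        v = rng.standard_normal(theta.shape)          # resample momentum
        H_old = -log_p(theta) + 0.5 * v @ v
        th, vv = theta.copy(), v.copy()
        for _ in range(num_steps):                    # leapfrog steps
            vv = vv + 0.5 * h * grad_log_p(th)
            th = th + h * vv
            vv = vv + 0.5 * h * grad_log_p(th)
        H_new = -log_p(th) + 0.5 * vv @ vv
        # Accept with probability min(1, exp(H_old - H_new)).
        if np.log(rng.uniform()) < H_old - H_new:
            theta = th
        samples.append(theta.copy())
    return np.array(samples)

# Target: 2D standard normal.
samples = hmc(grad_log_p=lambda t: -t, log_p=lambda t: -0.5 * t @ t,
              theta0=np.zeros(2))
```

Because leapfrog nearly conserves H, the energy difference stays small and most proposals are accepted despite the long deterministic trajectory.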

Demo: MH vs. HMC
1 A mixture of nine Gaussians³.
2 Sequence of samples connected by yellow lines.
³ Demo by T. Broderick and D. Duvenaud.

Recap
Bayesian sampling with traditional MCMC methods, in each iteration:
- generate a candidate sample from a proposal distribution
- calculate the acceptance probability
- accept or reject the proposed sample

Discussion
1 All the above traditional MCMC methods are not scalable in a big-data setting⁴; in each iteration:
  - the whole dataset is needed to generate a proposal
  - the whole dataset is needed to calculate the acceptance probability
  - the cost scales as O(N), where N is the number of data samples
2 Scalable MCMC uses a subset of the data in each iteration:
  - to calculate the acceptance probability⁵
  - to generate proposals, ignoring the acceptance step → stochastic gradient MCMC (SG-MCMC) methods
⁴ When the number of data samples is large.
⁵ A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget. In: ICML 2014; R. Bardenet, A. Doucet, and C. Holmes. Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In: ICML 2014.


Outline
1 Markov Chain Monte Carlo Methods
   Monte Carlo methods
   Markov chain Monte Carlo
2 Stochastic Gradient Markov Chain Monte Carlo Methods
   Introduction
   Stochastic gradient Langevin dynamics
   Stochastic gradient Hamiltonian Monte Carlo
   Application in Latent Dirichlet allocation

Two key steps in SG-MCMC
1 Proposals typically follow stochastic gradients of the log-posterior: this makes samples concentrate on the modes.
2 Random Gaussian noise is added to the proposals: this encourages the algorithm to jump out of local modes and explore the parameter space; the noise in the stochastic gradients alone is not sufficient to move the algorithm around the parameter space.
[Figure: proposals of Gibbs sampling vs. SG-MCMC.]


Basic setup
1 Given data X = {x_1, ..., x_N}, a generative model (likelihood) p(X | θ) = Π_{i=1}^{N} p(x_i | θ) and a prior p(θ), we want to sample from the posterior:
    p(θ | X) ∝ p(θ) p(X | θ) = p(θ) Π_{i=1}^{N} p(x_i | θ)
2 We are interested in the case where N is extremely large, so that computing p(X | θ) is prohibitively expensive.
3 Define the following two quantities (the unnormalized negative log-posterior and its stochastic estimate):
    U(θ) ≜ −Σ_{i=1}^{N} log p(x_i | θ) − log p(θ)
    Ũ(θ) ≜ −(N/n) Σ_{i=1}^{n} log p(x_{π_i} | θ) − log p(θ)
  where (π_1, ..., π_N) is a random permutation of (1, ..., N).

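The two quantities above can be sketched concretely. The Gaussian model below is my illustrative choice, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: x_i ~ N(theta, 1) with prior theta ~ N(0, 1).
N = 10_000
X = rng.normal(2.0, 1.0, size=N)

def grad_U(theta):
    """Full-data gradient of U(theta) = -sum_i log p(x_i|theta) - log p(theta)."""
    return -np.sum(X - theta) + theta

def grad_U_tilde(theta, n=100):
    """Minibatch estimate: scale the subsampled likelihood term by N/n."""
    batch = X[rng.choice(N, size=n, replace=False)]
    return -(N / n) * np.sum(batch - theta) + theta

# Unbiasedness: averaging many minibatch gradients recovers the full gradient.
est = np.mean([grad_U_tilde(0.0) for _ in range(2000)])
```

Each call to `grad_U_tilde` touches only n = 100 of the N = 10,000 data points, which is exactly the cost saving SG-MCMC exploits.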

Basic setup
1 SG-MCMC relies on the following quantity (the stochastic gradient):
    ∇_θ Ũ(θ) = −(N/n) Σ_{i=1}^{n} ∇_θ log p(x_{π_i} | θ) − ∇_θ log p(θ)
2 ∇_θ Ũ(θ) is an unbiased estimate of ∇_θ U(θ):
  - SG-MCMC samples parameters based on ∇_θ Ũ(θ)
  - it is very cheap to compute
  - hence the name stochastic gradient MCMC


Comparing with traditional MCMC
1 Ignore the acceptance step:
  - the detailed balance condition typically does not hold, and the algorithm is not reversible⁶
  - this typically leads to biased, but controllable, estimates
2 Use a subset of the data in each iteration:
  - this yields stochastic gradients
  - it does not affect the convergence properties (e.g., convergence rates) compared to using the whole dataset in each iteration
⁶ These are sufficient conditions for a valid MCMC method, but not necessary conditions.


Demo: the two key steps
1 Proposals follow stochastic gradients of the log-posterior: the sampler gets stuck in a local mode.

Demo: the two key steps
2 After adding random Gaussian noise: it works!

Outline
1 Markov Chain Monte Carlo Methods
   Monte Carlo methods
   Markov chain Monte Carlo
2 Stochastic Gradient Markov Chain Monte Carlo Methods
   Introduction
   Stochastic gradient Langevin dynamics
   Stochastic gradient Hamiltonian Monte Carlo
   Application in Latent Dirichlet allocation

First attempt
1 A 1st-order method: stochastic gradients are applied directly to the model parameter θ.
2 Use a proposal that follows the stochastic gradient of the log-posterior:
    θ_{l+1} = θ_l − h_{l+1} ∇_θ Ũ(θ_l)
  where the h_l are stepsizes, either fixed (h_l = h for all l) or decreasing (h_l > h_{l+1} for all l).
3 Ignore the acceptance step.
4 The result is stochastic gradient descent (SGD).

Random noise to the rescue
1 The algorithm needs to explore the parameter space: add random Gaussian noise to the update⁷:
    θ_{l+1} = θ_l − h_{l+1} ∇_θ Ũ(θ_l) + √(2 h_{l+1}) ζ_{l+1},  ζ_{l+1} ~ N(0, I)
2 The variance of the Gaussian needs to be 2 h_{l+1} in order to guarantee a correct sampler: this is guaranteed by the Fokker-Planck equation.
3 The result is called stochastic gradient Langevin dynamics (SGLD).
⁷ In the following, we will directly use N(0, I) to denote a normal random variable with zero mean and identity covariance matrix.


SGLD in algorithm
Input: stepsizes {h_l}
Output: approximate samples {θ_l}
Initialize θ_0 ∈ R^n
for l = 1, 2, ... do
    Evaluate ∇_θ Ũ(θ_{l−1}) from the l-th minibatch
    θ_l = θ_{l−1} − ∇_θ Ũ(θ_{l−1}) h_l + √(2 h_l) N(0, I)
end
Return {θ_l}
Algorithm 1: Stochastic Gradient Langevin Dynamics
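Algorithm 1 in code, on a toy Gaussian posterior. The model, minibatch size, and stepsize are my illustrative choices; with a conjugate Gaussian model the posterior is known in closed form, so the sampler can be checked:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: x_i ~ N(theta, 1) with prior theta ~ N(0, 1); the posterior is
# N( sum(X)/(N+1), 1/(N+1) ).
N, n = 1000, 100
X = rng.normal(1.0, 1.0, size=N)

def grad_U_tilde(theta, batch):
    """Stochastic gradient of the negative log-posterior."""
    return -(N / n) * np.sum(batch - theta) + theta

def sgld(num_iter=20_000, h=1e-5):
    theta, samples = 0.0, []
    for _ in range(num_iter):
        batch = X[rng.choice(N, size=n, replace=False)]   # the l-th minibatch
        theta = (theta - h * grad_U_tilde(theta, batch)
                 + np.sqrt(2 * h) * rng.standard_normal())
        samples.append(theta)
    return np.array(samples)

samples = sgld()
```

After burn-in, the chain's mean should sit near the analytic posterior mean Σ_i x_i/(N + 1); the stepsize must be small enough that the minibatch-gradient noise stays dominated by the injected √(2h) noise.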

Example⁸
A simple Gaussian mixture:
    θ_1 ~ N(0, 10),  θ_2 ~ N(0, 1)
    x_i ~ (1/2) N(θ_1, 2) + (1/2) N(θ_1 + θ_2, 2),  i = 1, ..., 100
[Figure: left, true posterior; right, sample-based estimate.]
⁸ M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In: ICML 2011.

Outline
1 Markov Chain Monte Carlo Methods
   Monte Carlo methods
   Markov chain Monte Carlo
2 Stochastic Gradient Markov Chain Monte Carlo Methods
   Introduction
   Stochastic gradient Langevin dynamics
   Stochastic gradient Hamiltonian Monte Carlo
   Application in Latent Dirichlet allocation

SGHMC
1 A 2nd-order method: stochastic gradients are applied to an auxiliary parameter (the momentum).
2 SGLD is slow when the parameter space exhibits uneven curvature.
3 Use the momentum idea to improve SGLD, a generalization of HMC in which the ball rolls on a surface with friction:
  - the ball follows the momentum instead of the gradient; the momentum is a summary of historical gradients, so the ball can jump out of local modes more easily and move faster
  - this requires a balance between the extra forces: momentum, friction, random force, gravity

Adding a friction term
1 Without a friction term, the random Gaussian noise would drive the ball too far away from its stationary distribution.
2 After adding a friction term:
    θ_l = θ_{l−1} + v_{l−1} h_l
    v_l = v_{l−1} − ∇_θ Ũ(θ_l) h_l − A v_{l−1} h_l + √(2 A h_l) N(0, I),
  where A > 0 is a constant⁹ controlling the magnitude of the friction.
3 The friction term penalizes the momentum: the more momentum, the more friction, thus slowing down the ball.
⁹ In the original SGHMC paper, A is decomposed into a known variance of the injected noise and an unknown variance of the stochastic gradients.


SGHMC in algorithm
Input: friction constant A, stepsizes {h_l}
Output: approximate samples {θ_l}
Initialize θ_0 ∈ R^n
for l = 1, 2, ... do
    θ_l = θ_{l−1} + v_{l−1} h_l
    Evaluate ∇_θ Ũ(θ_l) from the l-th minibatch
    v_l = v_{l−1} − ∇_θ Ũ(θ_l) h_l − A v_{l−1} h_l + √(2 A h_l) N(0, I)
end
Return {θ_l}
Algorithm 2: Stochastic Gradient Hamiltonian Monte Carlo
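Algorithm 2 in code, on the same kind of toy Gaussian posterior as before. The model, the friction constant A, and the stepsize are my illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: x_i ~ N(theta, 1) with prior theta ~ N(0, 1); the posterior is
# N( sum(X)/(N+1), 1/(N+1) ).
N, n = 1000, 100
X = rng.normal(1.0, 1.0, size=N)

def grad_U_tilde(theta, batch):
    """Stochastic gradient of the negative log-posterior."""
    return -(N / n) * np.sum(batch - theta) + theta

def sghmc(num_iter=20_000, h=1e-3, A=10.0):
    theta, v, samples = 0.0, 0.0, []
    for _ in range(num_iter):
        batch = X[rng.choice(N, size=n, replace=False)]   # the l-th minibatch
        theta = theta + v * h
        v = (v - grad_U_tilde(theta, batch) * h           # gradient force
             - A * v * h                                  # friction
             + np.sqrt(2 * A * h) * rng.standard_normal())  # random force
        samples.append(theta)
    return np.array(samples)

samples = sghmc()
```

The friction term −A v h dissipates the energy injected by both the Gaussian noise and the minibatch-gradient noise, keeping the chain near its stationary distribution.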


71 Reparametrize SGHMC
for l = 1, 2, . . . do
  Evaluate ∇_θ Ũ(θ_{l−1}) from the l-th minibatch
  θ_l = θ_{l−1} + v_{l−1} h_l
  v_l = v_{l−1} − ∇_θ Ũ(θ_l) h_l − A v_{l−1} h_l + √(2A h_l) N(0, I)
end
for l = 1, 2, . . . do
  Evaluate ∇_θ Ũ(θ_{l−1}) from the l-th minibatch
  θ_l = θ_{l−1} + p_{l−1}
  p_l = (1 − m) p_{l−1} − ∇_θ Ũ(θ_l) ε_l + √(2m ε_l) N(0, I)
end
Reparametrization: ε = h², m = A h, p = v h
ε_l : learning rate; m: momentum weight Changyou Chen (Duke University) SG-MCMC 42 / 56
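As a quick sanity check on the substitution ε = h², m = A h, p = v h, the two loops can be run side by side with the same noise sequence; the 1-D quadratic U below is an illustrative stand-in for the target.

```python
import numpy as np

# Check numerically that the (h, A, v) and (eps, m, p) parameterizations
# produce identical theta trajectories when eps = h^2, m = A*h, p = v*h.
h, A = 0.1, 0.3
eps, m = h**2, A * h
grad_u = lambda th: th  # U(theta) = theta^2 / 2 as a stand-in target

rng = np.random.default_rng(1)
noise = rng.standard_normal(100)

theta1, v = 1.0, 0.0
theta2, p = 1.0, 0.0
for z in noise:
    # original parameterization
    theta1 = theta1 + v * h
    v = v - grad_u(theta1) * h - A * v * h + np.sqrt(2 * A * h) * z
    # reparameterized form (same noise, scaled by h since p = v*h)
    theta2 = theta2 + p
    p = (1 - m) * p - grad_u(theta2) * eps + np.sqrt(2 * m * eps) * z

print(abs(theta1 - theta2))  # agrees up to floating-point rounding
```

The key identity is √(2m ε) = h √(2A h): the injected noise in the p-update is exactly the v-update noise multiplied by h.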

72 SGD vs. SGLD
∇_θ Ũ(θ_{l−1}) ≜ − (N/n) Σ_{i=1}^n ∇_θ log p(x_{π_i} | θ_{l−1}) − ∇_θ log p(θ_{l−1})
SGD: for l = 1, 2, . . . do
  Evaluate ∇_θ Ũ(θ_{l−1}) from the l-th minibatch
  θ_l = θ_{l−1} − ∇_θ Ũ(θ_{l−1}) ε_l
end
SGLD: for l = 1, 2, . . . do
  Evaluate ∇_θ Ũ(θ_{l−1}) from the l-th minibatch
  θ_l = θ_{l−1} − ∇_θ Ũ(θ_{l−1}) ε_l + δ_l, δ_l ∼ N(0, 2ε_l I)
end Changyou Chen (Duke University) SG-MCMC 43 / 56
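The only difference between the two loops is the injected noise δ_l, but the behavior differs qualitatively: SGD collapses to the mode, while SGLD keeps exploring the posterior. A minimal sketch, again using a 1-D Gaussian target with artificial gradient noise standing in for minibatching:

```python
import numpy as np

# Target posterior N(0, 1): U(theta) = theta^2 / 2, grad U = theta.
rng = np.random.default_rng(2)
eps, n_iter = 0.01, 50000
th_sgd, th_sgld = 2.0, 2.0
sgd_trace, sgld_trace = [], []
for _ in range(n_iter):
    g_sgd = th_sgd + 0.5 * rng.standard_normal()    # noisy gradient
    g_sgld = th_sgld + 0.5 * rng.standard_normal()
    th_sgd -= g_sgd * eps                           # SGD: pure descent
    th_sgld += -g_sgld * eps + rng.normal(0.0, np.sqrt(2 * eps))  # SGLD
    sgd_trace.append(th_sgd)
    sgld_trace.append(th_sgld)

# SGD iterates concentrate at the mode; SGLD iterates trace the posterior.
print(np.var(sgd_trace[10000:]), np.var(sgld_trace[10000:]))
```

After burn-in, the SGD trace has near-zero variance while the SGLD trace has variance close to the posterior variance of 1.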

73 SGD with Momentum (SGD-M) vs. SGHMC
∇_θ Ũ(θ_{l−1}) ≜ − (N/n) Σ_{i=1}^n ∇_θ log p(x_{π_i} | θ_{l−1}) − ∇_θ log p(θ_{l−1})
SGD-M: for l = 1, 2, . . . do
  Evaluate ∇_θ Ũ(θ_{l−1}) from the l-th minibatch
  θ_l = θ_{l−1} + p_{l−1}
  p_l = (1 − m) p_{l−1} − ∇_θ Ũ(θ_l) ε_l
end
SGHMC: for l = 1, 2, . . . do
  Evaluate ∇_θ Ũ(θ_{l−1}) from the l-th minibatch
  θ_l = θ_{l−1} + p_{l−1}
  p_l = (1 − m) p_{l−1} − ∇_θ Ũ(θ_l) ε_l + δ_l, δ_l ∼ N(0, 2m ε_l I)
end Changyou Chen (Duke University) SG-MCMC 44 / 56
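The same contrast holds in the momentum formulation: SGD-M and SGHMC share the identical update except for the injected noise. A sketch under the same illustrative 1-D Gaussian target and artificial gradient noise as before:

```python
import numpy as np

# SGD-M vs SGHMC on U(theta) = theta^2 / 2: identical momentum updates
# except for the injected noise term in SGHMC.
rng = np.random.default_rng(5)
eps, m, n_iter = 0.01, 0.1, 50000
th_m, p_m = 2.0, 0.0          # SGD with momentum
th_h, p_h = 2.0, 0.0          # SGHMC
trace_m, trace_h = [], []
for _ in range(n_iter):
    th_m += p_m
    p_m = (1 - m) * p_m - (th_m + 0.5 * rng.standard_normal()) * eps
    th_h += p_h
    p_h = ((1 - m) * p_h - (th_h + 0.5 * rng.standard_normal()) * eps
           + rng.normal(0.0, np.sqrt(2 * m * eps)))   # SGHMC noise
    trace_m.append(th_m)
    trace_h.append(th_h)

print(np.var(trace_m[10000:]), np.var(trace_h[10000:]))
```

SGD-M settles into a small neighborhood of the mode (its variance is driven only by gradient noise), while SGHMC's trace variance stays near the target variance of 1.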

74 Example: SGHMC vs. SGLD on a bivariate Gaussian 1 Sample from a 2D Gaussian distribution: U(θ) = ½ θᵀ Σ⁻¹ θ. For the noisy scenarios, the gradient is replaced by ∇Ũ(θ) = ∇U(θ) + N(0, 4). 2 Noisy Hamiltonian dynamics lead to diverging trajectories when friction is not introduced; resampling the momentum helps control divergence, but the associated stationary distribution is not correct. [Figure: contrasting sampling of a bivariate Gaussian with correlation using SGHMC versus SGLD; panels compare the average absolute error of the sample covariance and the autocorrelation time.] 10 T. Chen, E. B. Fox, and C. Guestrin. Stochastic Gradient Hamiltonian Monte Carlo. In: ICML. Changyou Chen (Duke University) SG-MCMC 45 / 56

75 Recap 1 For SG-MCMC methods, in each iteration: calculate the stochastic gradient based on the current parameter sample; generate the next sample by moving the current sample (possibly in an extended space) along the direction of the stochastic gradient, plus suitable random Gaussian noise; no accept-reject step is needed; the samples are guaranteed to converge close to the true posterior in some sense. Changyou Chen (Duke University) SG-MCMC 46 / 56

76 Outline 1 Markov Chain Monte Carlo Methods Monte Carlo methods Markov chain Monte Carlo 2 Stochastic Gradient Markov Chain Monte Carlo Methods Introduction Stochastic gradient Langevin dynamics Stochastic gradient Hamiltonian Monte Carlo Application in Latent Dirichlet allocation Changyou Chen (Duke University) SG-MCMC 47 / 56

77 Latent Dirichlet allocation
1 For each topic k, draw the topic-word distribution: β_k ∼ Dir(γ)
2 For each document d, draw its topic distribution: θ_d ∼ Dir(α)
  For each word l, draw its topic indicator: c_dl ∼ Discrete(θ_d)
  Draw the observed word: x_dl ∼ Discrete(β_{c_dl})
[Plate diagram: γ → β_k in a plate of size K; α → θ_d → c_dl → x_dl, with c_dl and x_dl in the N-word plate nested inside the D-document plate, and β_k feeding into x_dl.] Changyou Chen (Duke University) SG-MCMC 48 / 56
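The generative process above can be run directly in numpy; the corpus sizes (K, V, D, N) and hyperparameters below are arbitrary illustrative choices, and every document is given the same length for simplicity.

```python
import numpy as np

# Sample a small synthetic corpus from the LDA generative process.
rng = np.random.default_rng(3)
K, V, D, N = 4, 30, 5, 20        # topics, vocabulary, documents, words/doc
alpha, gamma = 0.5, 0.5

beta = rng.dirichlet(np.full(V, gamma), size=K)   # topic-word distributions
docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))    # per-document topic dist.
    c = rng.choice(K, size=N, p=theta_d)          # topic indicator per word
    x = np.array([rng.choice(V, p=beta[k]) for k in c])  # observed words
    docs.append(x)

print(len(docs), docs[0].shape)
```

Each row of `beta` and each `theta_d` lies on the probability simplex, which is exactly the constraint that the reparameterization on the later slides removes.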

78 Latent Dirichlet allocation
1 Let β ≜ (β_k)_{k=1}^K, θ ≜ (θ_d)_{d=1}^D, C ≜ (c_dl)_{d,l=1}^{D,n_d}, X ≜ (x_dl)_{d,l=1}^{D,n_d}; the posterior distribution is
p(β, θ, C | X) ∝ [Π_{k=1}^K p(β_k | γ)] [Π_{d=1}^D p(θ_d | α) Π_{l=1}^{n_d} p(c_dl | θ_d) p(x_dl | β, c_dl)]
2 From previous lectures:
p(c_dl | θ_d) = Π_{k=1}^K (θ_dk)^{1(c_dl = k)}
p(x_dl | β, c_dl) = Π_{k=1}^K Π_{v=1}^V β_kv^{1(x_dl = v) 1(c_dl = k)}
3 Together with the fact:
∫_{θ ∈ Δ_{K−1}} Π_{k=1}^K θ_k^{α_k − 1} dθ = Π_{k=1}^K Γ(α_k) / Γ(Σ_{k=1}^K α_k)
Changyou Chen (Duke University) SG-MCMC 49 / 56

80 Latent Dirichlet allocation
1 Integrating out the local parameters (the topic distribution θ_d for each document) results in the following semi-collapsed distribution:
p(X, C, β | α, γ) = Π_{d=1}^D [Γ(Kα) / Γ(Kα + n_d··)] Π_{k=1}^K [Γ(α + n_dk·) / Γ(α)] × Π_{k=1}^K [Γ(Vγ) / Γ(γ)^V] Π_{v=1}^V β_kv^{γ + n_·kv − 1},
where n_dkw ≜ Σ_{l=1}^{n_d} 1(c_dl = k) 1(x_dl = w) is the number of occurrences of word w in document d assigned to topic k; "·" means marginal sum, e.g. n_·kw ≜ Σ_{d=1}^D n_dkw.
2 SG-MCMC requires an unconstrained parameter space; with the reparameterization β_kv = λ_kv / Σ_{v'} λ_kv' and the prior λ_kv ∼ Ga(γ, 1):
Π_{k=1}^K [Γ(Vγ) / Γ(γ)^V] Π_{v=1}^V β_kv^{γ + n_·kv − 1} becomes Π_{k=1}^K Π_{v=1}^V Ga(λ_kv; γ, 1) (λ_kv / Σ_{v'} λ_kv')^{n_·kv}
Changyou Chen (Duke University) SG-MCMC 50 / 56

82 Latent Dirichlet allocation
1 Still need to integrate out the local parameter C:
p(X, λ | α, γ) = E_C[p(X, C, β | α, γ)] = E_C[ Π_{d=1}^D Γ(Kα)/Γ(Kα + n_d··) Π_{k=1}^K Γ(α + n_dk·)/Γ(α) Π_{k=1}^K Π_{v=1}^V Ga(λ_kv; γ, 1) (λ_kv / Σ_{v'} λ_kv')^{n_·kv} ]
2 The stochastic gradient with a minibatch of documents D̃ of size |D̃| ≪ D is:
∂ log p(λ | α, γ, X) / ∂λ_kw = (γ − 1)/λ_kw − 1 + (D/|D̃|) Σ_{d ∈ D̃} E_{c_d | x_d, λ, α}[ n_dkw/λ_kw − n_dk·/λ_k· ]
3 SGLD update:
λ_kw^{t+1} = λ_kw^t + h_{t+1} ∂ log p(λ | α, γ, X)/∂λ_kw + √(2 h_{t+1}) N(0, 1)
Changyou Chen (Duke University) SG-MCMC 51 / 56
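One such SGLD step on λ can be sketched as below, assuming the expectations E[n_dkw] over the minibatch have already been estimated (in practice by a few Gibbs sweeps over each minibatch document); the random placeholder counts only illustrate the shapes involved, not a working LDA sampler.

```python
import numpy as np

# One SGLD step on the unconstrained parameters lambda (K x V).
rng = np.random.default_rng(4)
K, V, D, batch = 4, 30, 1000, 10
gamma, h = 0.5, 1e-4
lam = rng.gamma(1.0, 1.0, size=(K, V))        # lambda_kv > 0 initially

exp_counts = rng.random((batch, K, V))        # placeholder for E[n_dkw]
n_dk = exp_counts.sum(axis=2)                 # E[n_dk.], shape (batch, K)
lam_k = lam.sum(axis=1)                       # lambda_k., shape (K,)
scale = D / batch                             # minibatch rescaling factor

grad = ((gamma - 1) / lam - 1.0               # gradient of the Ga(gamma, 1) prior
        + scale * (exp_counts / lam).sum(axis=0)
        - scale * (n_dk / lam_k).sum(axis=0)[:, None])
lam = lam + h * grad + np.sqrt(2 * h) * rng.standard_normal((K, V))
print(lam.shape)
```

Note that nothing in this update keeps λ positive, which is part of why plain SGLD behaves poorly here and the Riemannian variant on the next slide is needed.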

85 Latent Dirichlet allocation 1 LDA with the above SGLD update does not work well in practice because of the high dimensionality of the model parameters. 2 To make it work, Riemannian geometry (2nd-order) information needs to be brought into SGLD: this leads to Stochastic Gradient Riemannian Langevin Dynamics (SGRLD) for LDA 11; it takes the parameter geometry into account, so the step size for each dimension of the parameter is adapted. 11 S. Patterson and Y. W. Teh. Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex. In: NIPS. Changyou Chen (Duke University) SG-MCMC 52 / 56

86 Experiments: SGRLD for LDA 12 1 NIPS dataset: the collection of NIPS papers from , with 2483 documents, 50 topics 12 S. Patterson and Y. W. Teh. Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex. In: NIPS Changyou Chen (Duke University) SG-MCMC 53 / 56

87 Experiments: SGRLD for LDA 13 1 Wikipedia dataset: a set of articles downloaded at random from Wikipedia, with 150,000 documents 13 S. Patterson and Y. W. Teh. Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex. In: NIPS Changyou Chen (Duke University) SG-MCMC 54 / 56

88 Conclusion 1 I have introduced: basic concepts in MCMC; basic ideas in SG-MCMC, two SG-MCMC algorithms, and an application in LDA. 2 Topics not covered: a general review of SG-MCMC algorithms; theory related to stochastic differential equations and Itô diffusions; convergence theory; various applications in deep learning, including SG-MCMC for learning weight uncertainty and SG-MCMC for deep generative models; interested readers should refer to the related references. Changyou Chen (Duke University) SG-MCMC 55 / 56

89 Thank You Changyou Chen (Duke University) SG-MCMC 56 / 56


Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

MCMC and Gibbs Sampling. Sargur Srihari

MCMC and Gibbs Sampling. Sargur Srihari MCMC and Gibbs Sampling Sargur srihari@cedar.buffalo.edu 1 Topics 1. Markov Chain Monte Carlo 2. Markov Chains 3. Gibbs Sampling 4. Basic Metropolis Algorithm 5. Metropolis-Hastings Algorithm 6. Slice

More information

Adaptive HMC via the Infinite Exponential Family

Adaptive HMC via the Infinite Exponential Family Adaptive HMC via the Infinite Exponential Family Arthur Gretton Gatsby Unit, CSML, University College London RegML, 2017 Arthur Gretton (Gatsby Unit, UCL) Adaptive HMC via the Infinite Exponential Family

More information

Computer Intensive Methods in Mathematical Statistics

Computer Intensive Methods in Mathematical Statistics Computer Intensive Methods in Mathematical Statistics Department of mathematics johawes@kth.se Lecture 16 Advanced topics in computational statistics 18 May 2017 Computer Intensive Methods (1) Plan of

More information

Learning Energy-Based Models of High-Dimensional Data

Learning Energy-Based Models of High-Dimensional Data Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm Discovering causal structure as a goal

More information

Approximate inference in Energy-Based Models

Approximate inference in Energy-Based Models CSC 2535: 2013 Lecture 3b Approximate inference in Energy-Based Models Geoffrey Hinton Two types of density model Stochastic generative model using directed acyclic graph (e.g. Bayes Net) Energy-based

More information

Introduction to Markov Chain Monte Carlo & Gibbs Sampling

Introduction to Markov Chain Monte Carlo & Gibbs Sampling Introduction to Markov Chain Monte Carlo & Gibbs Sampling Prof. Nicholas Zabaras Sibley School of Mechanical and Aerospace Engineering 101 Frank H. T. Rhodes Hall Ithaca, NY 14853-3801 Email: zabaras@cornell.edu

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Markov chain Monte Carlo methods in atmospheric remote sensing

Markov chain Monte Carlo methods in atmospheric remote sensing 1 / 45 Markov chain Monte Carlo methods in atmospheric remote sensing Johanna Tamminen johanna.tamminen@fmi.fi ESA Summer School on Earth System Monitoring and Modeling July 3 Aug 11, 212, Frascati July,

More information

Approximate Slice Sampling for Bayesian Posterior Inference

Approximate Slice Sampling for Bayesian Posterior Inference Approximate Slice Sampling for Bayesian Posterior Inference Christopher DuBois GraphLab, Inc. Anoop Korattikara Dept. of Computer Science UC Irvine Max Welling Informatics Institute University of Amsterdam

More information

Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm

Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm Qiang Liu and Dilin Wang NIPS 2016 Discussion by Yunchen Pu March 17, 2017 March 17, 2017 1 / 8 Introduction Let x R d

More information

Markov chain Monte Carlo

Markov chain Monte Carlo 1 / 26 Markov chain Monte Carlo Timothy Hanson 1 and Alejandro Jara 2 1 Division of Biostatistics, University of Minnesota, USA 2 Department of Statistics, Universidad de Concepción, Chile IAP-Workshop

More information

Gaussian Mixture Model

Gaussian Mixture Model Case Study : Document Retrieval MAP EM, Latent Dirichlet Allocation, Gibbs Sampling Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 5 th,

More information

CS281A/Stat241A Lecture 22

CS281A/Stat241A Lecture 22 CS281A/Stat241A Lecture 22 p. 1/4 CS281A/Stat241A Lecture 22 Monte Carlo Methods Peter Bartlett CS281A/Stat241A Lecture 22 p. 2/4 Key ideas of this lecture Sampling in Bayesian methods: Predictive distribution

More information

Scaling up Bayesian Inference

Scaling up Bayesian Inference Scaling up Bayesian Inference David Dunson Departments of Statistical Science, Mathematics & ECE, Duke University May 1, 2017 Outline Motivation & background EP-MCMC amcmc Discussion Motivation & background

More information

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy

More information

Sequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007

Sequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007 Sequential Monte Carlo and Particle Filtering Frank Wood Gatsby, November 2007 Importance Sampling Recall: Let s say that we want to compute some expectation (integral) E p [f] = p(x)f(x)dx and we remember

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

an introduction to bayesian inference

an introduction to bayesian inference with an application to network analysis http://jakehofman.com january 13, 2010 motivation would like models that: provide predictive and explanatory power are complex enough to describe observed phenomena

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

Metropolis Hastings. Rebecca C. Steorts Bayesian Methods and Modern Statistics: STA 360/601. Module 9

Metropolis Hastings. Rebecca C. Steorts Bayesian Methods and Modern Statistics: STA 360/601. Module 9 Metropolis Hastings Rebecca C. Steorts Bayesian Methods and Modern Statistics: STA 360/601 Module 9 1 The Metropolis-Hastings algorithm is a general term for a family of Markov chain simulation methods

More information