Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods


Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods
Changyou Chen
Department of Electrical and Computer Engineering, Duke University
Duke-Tsinghua Machine Learning Summer School, August 10, 2016
Changyou Chen (Duke University) SG-MCMC 1 / 56

Preface
Stochastic gradient Markov chain Monte Carlo (SG-MCMC): a recent technique for approximate Bayesian sampling, aimed at scalable Bayesian learning for big data. It draws samples {θ_l} from a posterior p(θ | D) that is too expensive to evaluate in full at each iteration.
This lecture:
Will cover: the basic ideas behind SG-MCMC.
Will not cover: the different kinds of SG-MCMC algorithms, their applications, and the corresponding convergence theory.

Outline
1 Markov Chain Monte Carlo Methods
   Monte Carlo methods
   Markov chain Monte Carlo
2 Stochastic Gradient Markov Chain Monte Carlo Methods
   Introduction
   Stochastic gradient Langevin dynamics
   Stochastic gradient Hamiltonian Monte Carlo
   Application in Latent Dirichlet allocation


Monte Carlo methods
The Monte Carlo method draws a set of samples from p(θ):
    θ_l ~ p(θ), l = 1, 2, ..., L
and approximates the target distribution by the empirical (count-frequency) measure:
    p(θ) ≈ (1/L) Σ_{l=1}^{L} δ(θ, θ_l)
An intractable integral is then approximated as:
    ∫ f(θ) p(θ) dθ ≈ (1/L) Σ_{l=1}^{L} f(θ_l)
In Bayesian modeling, p(θ) is usually a posterior distribution and the integral is a predictive quantity.

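The Monte Carlo recipe above can be sketched in a few lines. As a minimal illustration (not from the slides), we estimate E[θ²] under a standard normal, where the true value is 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(f, sampler, L):
    """Approximate E[f(theta)] by (1/L) * sum_l f(theta_l), theta_l ~ p."""
    samples = sampler(L)
    return np.mean(f(samples))

# Example: theta ~ N(0, 1), f(theta) = theta^2, so E[f] = Var(theta) = 1.
est = mc_expectation(lambda t: t ** 2, lambda L: rng.standard_normal(L), L=100_000)
```

With L = 100,000 samples the estimate lands very close to 1, consistent with the O(1/√L) error of the estimator.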

How does the approximation work?
1 An intractable integral is approximated as:
    ∫ f(θ) p(θ) dθ ≈ (1/L) Σ_{l=1}^{L} f(θ_l) ≜ f̂
2 If the {θ_l} are independent:
    E[f̂] = E[(1/L) Σ_{l=1}^{L} f(θ_l)] = E[f],  Var(f̂) = Var((1/L) Σ_{l=1}^{L} f(θ_l)) = (1/L) Var(f)
  The variance decreases linearly in the number of samples, independent of the dimension of θ.
3 However, obtaining independent samples is hard; one usually resorts to drawing dependent samples with Markov chain Monte Carlo (MCMC).


Outline
1 Markov Chain Monte Carlo Methods
   Monte Carlo methods
   Markov chain Monte Carlo
2 Stochastic Gradient Markov Chain Monte Carlo Methods
   Introduction
   Stochastic gradient Langevin dynamics
   Stochastic gradient Hamiltonian Monte Carlo
   Application in Latent Dirichlet allocation

MCMC example: a Gaussian model
1 Assume the following generative process (with α = 5, β = 1):
    x_i | µ, τ ~ N(µ, 1/τ), i = 1, ..., n = 1000
    µ | τ ~ N(µ_0, 1/τ),  τ ~ Gamma(α, β)
2 Posterior distribution:
    p(µ, τ | {x_i}) ∝ [ Π_{i=1}^{n} N(x_i; µ, 1/τ) ] N(µ; µ_0, 1/τ) Gamma(τ; α, β)
3 Marginal posterior distributions for µ and τ are available in closed form:
    p(µ | {x_i}) ∝ ( 2β + (µ − µ_0)² + Σ_i (x_i − µ)² )^{−α − (n+1)/2}
    p(τ | {x_i}) = Gamma( α + n/2, β + (1/2) Σ_i (x_i − x̄)² + n (x̄ − µ_0)² / (2(n + 1)) )
  p(µ | {x_i}) is a non-standardized Student's t-distribution with mean (Σ_i x_i + µ_0)/(n + 1).


Gibbs sampling µ and τ
1 Conditional distributions:
    µ | τ, {x_i} ~ N( n x̄/(n + 1) + µ_0/(n + 1), 1/((n + 1) τ) )
    τ | µ, {x_i} ~ Gamma( α + (n + 1)/2, β + ( Σ_i (x_i − µ)² + (µ − µ_0)² ) / 2 )
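The two conditional updates above translate directly into a Gibbs loop. A minimal sketch, where the synthetic data and variable names are my illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from the generative process (alpha = 5, beta = 1, mu_0 = 0).
alpha, beta, mu0, n = 5.0, 1.0, 0.0, 1000
tau_true = rng.gamma(alpha, 1.0 / beta)                      # numpy uses (shape, scale)
mu_true = rng.normal(mu0, 1.0 / np.sqrt(tau_true))
x = rng.normal(mu_true, 1.0 / np.sqrt(tau_true), size=n)
xbar = x.mean()

def gibbs(num_iter=5000):
    mu, tau = 0.0, 1.0
    samples = []
    for _ in range(num_iter):
        # mu | tau, x ~ N( n*xbar/(n+1) + mu0/(n+1), 1/((n+1)*tau) )
        mean = (n * xbar + mu0) / (n + 1)
        mu = rng.normal(mean, 1.0 / np.sqrt((n + 1) * tau))
        # tau | mu, x ~ Gamma( alpha + (n+1)/2, beta + (sum (x-mu)^2 + (mu-mu0)^2)/2 )
        shape = alpha + (n + 1) / 2
        rate = beta + 0.5 * (np.sum((x - mu) ** 2) + (mu - mu0) ** 2)
        tau = rng.gamma(shape, 1.0 / rate)                   # scale = 1 / rate
        samples.append((mu, tau))
    return np.array(samples)

samples = gibbs()
```

The sample mean of the µ draws should match the analytic posterior mean (Σ_i x_i + µ_0)/(n + 1) from the previous slide.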

Trace plot for µ
[Figure: trace of µ samples over iterations, with the true mean and the sample mean overlaid.]

Sample approximation for µ
True posterior is a non-standardized Student's t-distribution.
[Figure: true posterior p(µ | x) vs. the sample-based approximation.]

Trace plot for τ
[Figure: trace of τ samples over iterations, with the true mean and the sample mean overlaid.]

Sample approximation for τ
True posterior is a Gamma distribution.
[Figure: true posterior p(τ | x) vs. the sample-based approximation.]

Markov chain Monte Carlo methods
1 We are interested in drawing samples from some desired distribution p*(θ) = (1/Z) p̃(θ).
2 Define a Markov chain θ_0 → θ_1 → θ_2 → θ_3 → θ_4 → θ_5 → ..., where θ_0 ~ p_0(θ), θ_1 ~ p_1(θ), ..., satisfying
    p_t(θ') = ∫ p_{t−1}(θ) T(θ → θ') dθ,
  where T(θ → θ') is the Markov chain transition probability from θ to θ'.
3 We say p*(θ) is an invariant (stationary) distribution of the Markov chain iff:
    p*(θ') = ∫ p*(θ) T(θ → θ') dθ


Metropolis-Hastings algorithm
1 Design T(θ → θ') as the composition of a proposal distribution q_t(θ' | θ) and an accept-reject mechanism.
2 At step t, draw a sample¹ θ' ~ q_t(θ' | θ_{t−1}) and accept it with probability:
    A_t(θ', θ_{t−1}) = min( 1, [p(θ') q_t(θ_{t−1} | θ')] / [p(θ_{t−1}) q_t(θ' | θ_{t−1})] )
3 The acceptance can be implemented as: draw a random variable u ~ Uniform(0, 1) and accept the sample if A_t(θ', θ_{t−1}) > u.
4 The corresponding transition kernel satisfies the detailed balance condition, and thus has invariant distribution p*(θ).
¹ A standard choice of q_t(θ' | θ_{t−1}) is a normal distribution with mean θ_{t−1} and tunable variance.

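The MH loop above can be sketched in a few lines of Python. The target density and proposal scale below are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def mh_sampler(log_p, theta0, sigma=1.0, num_iter=10_000):
    """Random-walk Metropolis-Hastings with a Gaussian proposal centered at
    the current state; the symmetric proposal cancels in the acceptance ratio."""
    theta = theta0
    samples = []
    for _ in range(num_iter):
        prop = theta + sigma * rng.standard_normal()
        # Accept with probability min(1, p(prop)/p(theta)); compare in
        # log-space for numerical stability.
        if np.log(rng.uniform()) < log_p(prop) - log_p(theta):
            theta = prop
        samples.append(theta)
    return np.array(samples)

# Target: standard normal, known only up to the normalizer Z (which cancels).
samples = mh_sampler(lambda t: -0.5 * t ** 2, theta0=0.0)
```

After a short burn-in, the empirical mean and standard deviation of the chain should match the N(0, 1) target.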

Discussion on the proposal distribution
1 The standard proposal distribution is an isotropic Gaussian centered at the current state with variance σ²:
  - small σ leads to a high acceptance rate, but the chain moves too slowly
  - large σ moves fast, but leads to a high rejection rate
2 How can we choose better proposals?

Gibbs sampler
1 Assume θ is multi-dimensional², θ = (θ_1, ..., θ_k, ..., θ_K); denote θ_{−k} ≜ {θ_j : j ≠ k}.
2 Sample each θ_k sequentially, with the proposal distribution being the true conditional distribution:
    q_k(θ' | θ) = p(θ'_k | θ_{−k})
3 Note θ'_{−k} = θ_{−k} and p(θ) = p(θ_k | θ_{−k}) p(θ_{−k}).
4 The MH acceptance probability is:
    A(θ', θ) = [p(θ') q_k(θ | θ')] / [p(θ) q_k(θ' | θ)]
             = [p(θ'_k | θ'_{−k}) p(θ'_{−k}) p(θ_k | θ'_{−k})] / [p(θ_k | θ_{−k}) p(θ_{−k}) p(θ'_k | θ_{−k})] = 1,
  so every proposal is accepted.
² A one-dimensional random variable is relatively easy to sample.


Discussion of the Gibbs sampler
1 No accept-reject step; very efficient.
2 Conditional distributions are not always easy to sample.
3 May not mix well in high-dimensional spaces with highly correlated variables.
[Figure: the sample path does not follow gradients. From PRML, Bishop (2006).]

The Metropolis-adjusted Langevin: a better proposal
1 Gibbs sampling travels the parameter space along a zigzag curve, which can be slow in high-dimensional spaces.
2 The Metropolis-adjusted Langevin algorithm uses a proposal that points directly toward the center of the probability contours.

The Metropolis-adjusted Langevin: a better proposal
1 Let E(θ) ≜ −log p(θ); the direction toward the center of the contours is the negative gradient −∇_θ E(θ).
2 In iteration l, define the proposal as a Gaussian centered at θ* = θ_{l−1} − ∇_θ E(θ_{l−1}) h_l, where h_l is a small stepsize:
    q(θ_l | θ_{l−1}) = N(θ_l; θ*, σ²)
3 An accept-reject step is still needed: calculate the acceptance probability
    A(θ*, θ_{l−1}) = [p(θ*) q(θ_{l−1} | θ*)] / [p(θ_{l−1}) q(θ* | θ_{l−1})],
  accept θ* with probability A(θ*, θ_{l−1}), otherwise set θ_l = θ_{l−1}.

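One MALA iteration under these definitions can be sketched as follows; the target, stepsize, and the standard choice σ² = 2h are my illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def mala(grad_E, log_p, theta0, h=0.1, num_iter=20_000):
    """Metropolis-adjusted Langevin: propose a gradient step on
    E(theta) = -log p(theta) plus Gaussian noise, then accept/reject."""
    sigma2 = 2 * h                     # the standard Langevin noise scale
    theta = theta0
    samples = []
    for _ in range(num_iter):
        mean_fwd = theta - h * grad_E(theta)
        prop = mean_fwd + np.sqrt(sigma2) * rng.standard_normal()
        mean_bwd = prop - h * grad_E(prop)
        # log q(theta | prop) - log q(prop | theta) for Gaussian proposals.
        log_q_ratio = (-(theta - mean_bwd) ** 2 + (prop - mean_fwd) ** 2) / (2 * sigma2)
        if np.log(rng.uniform()) < log_p(prop) - log_p(theta) + log_q_ratio:
            theta = prop
        samples.append(theta)
    return np.array(samples)

# Target: standard normal, so E(theta) = theta^2 / 2 and grad E = theta.
samples = mala(grad_E=lambda t: t, log_p=lambda t: -0.5 * t ** 2, theta0=0.0)
```

Because the proposal is asymmetric, the correction term `log_q_ratio` must be included in the acceptance ratio, unlike in random-walk MH.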

Hamiltonian Monte Carlo
Frictionless ball rolling:
1 A dynamical system with total energy, or Hamiltonian, H = E(θ) + K(v), where E(θ) ≜ −log p(θ) is the potential energy and K(v) ≜ vᵀv/2 is the kinetic energy.
2 Hamilton's equations describe the equations of motion of the ball:
    dθ/dt = ∂H/∂v = v
    dv/dt = −∂H/∂θ = ∇_θ log p(θ)
3 Joint distribution: p(θ, v) ∝ e^{−H(θ, v)}.
[Figure: rolling ball. Movie from Matthias Liepe.]


Solving Hamiltonian dynamics
1 Solve the continuous-time differential equations with a discrete-time approximation:
    dθ = v dt            →  θ_l = θ_{l−1} + v_{l−1} h_l
    dv = ∇_θ log p(θ) dt →  v_l = v_{l−1} + ∇_θ log p(θ_l) h_l
  Proposals follow historical gradients of the distribution contours.
2 An accept-reject test is needed to decide whether to accept the proposal, because of the discretization error:
  - the proposal is deterministic
  - acceptance probability: min(1, exp{H(θ_l, v_l) − H(θ_{l+1}, v_{l+1})})
3 Almost identical to SGD with momentum:
    θ_l = θ_{l−1} + p_{l−1}
    p_l = (1 − m) p_{l−1} + ∇_θ log p(θ_l) ε_l
  The two will be made equivalent in the context of stochastic gradient MCMC.

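The simulate-then-correct scheme above can be sketched as follows. As an assumption on my part, the sketch uses a leapfrog integrator (the standard, reversible choice for HMC) rather than the plain one-step scheme shown on the slide, and a 2D Gaussian target:

```python
import numpy as np

rng = np.random.default_rng(0)

def hmc(grad_log_p, log_p, theta0, h=0.1, num_steps=20, num_iter=5000):
    """HMC sketch: resample momentum, simulate Hamiltonian dynamics with a
    leapfrog integrator, then accept/reject on the change in total energy
    H = -log p(theta) + v^T v / 2."""
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for _ in range(num_iter):
        v = rng.standard_normal(theta.shape)          # resample momentum
        H_old = -log_p(theta) + 0.5 * v @ v
        th, vv = theta.copy(), v.copy()
        for _ in range(num_steps):                    # leapfrog steps
            vv = vv + 0.5 * h * grad_log_p(th)
            th = th + h * vv
            vv = vv + 0.5 * h * grad_log_p(th)
        H_new = -log_p(th) + 0.5 * vv @ vv
        # Accept with probability min(1, exp(H_old - H_new)).
        if np.log(rng.uniform()) < H_old - H_new:
            theta = th
        samples.append(theta.copy())
    return np.array(samples)

# Target: 2D standard normal.
samples = hmc(grad_log_p=lambda t: -t, log_p=lambda t: -0.5 * t @ t,
              theta0=np.zeros(2))
```

Because leapfrog nearly conserves H, the energy difference stays small and most proposals are accepted despite the long deterministic trajectory.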

Demo: MH vs. HMC
1 A mixture of nine Gaussians³.
2 Sequence of samples connected by yellow lines.
³ Demo by T. Broderick and D. Duvenaud.

Recap
Bayesian sampling with traditional MCMC methods, in each iteration:
- generate a candidate sample from a proposal distribution
- calculate the acceptance probability
- accept or reject the proposed sample

Discussion
1 All the above traditional MCMC methods are not scalable in a big-data setting⁴; in each iteration:
  - the whole dataset is needed to generate a proposal
  - the whole dataset is needed to calculate the acceptance probability
  - the cost scales as O(N), where N is the number of data samples
2 Scalable MCMC uses a subset of the data in each iteration:
  - to calculate the acceptance probability⁵
  - to generate proposals, ignoring the acceptance step → stochastic gradient MCMC (SG-MCMC) methods
⁴ When the number of data samples is large.
⁵ A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget. In: ICML 2014; R. Bardenet, A. Doucet, and C. Holmes. Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In: ICML 2014.


Outline
1 Markov Chain Monte Carlo Methods
   Monte Carlo methods
   Markov chain Monte Carlo
2 Stochastic Gradient Markov Chain Monte Carlo Methods
   Introduction
   Stochastic gradient Langevin dynamics
   Stochastic gradient Hamiltonian Monte Carlo
   Application in Latent Dirichlet allocation

Two key steps in SG-MCMC
1 Proposals typically follow stochastic gradients of the log-posterior: this makes samples concentrate on the modes.
2 Random Gaussian noise is added to the proposals: this encourages the algorithm to jump out of local modes and explore the parameter space; the noise in the stochastic gradients alone is not sufficient to move the algorithm around the parameter space.
[Figure: proposals of Gibbs sampling vs. SG-MCMC.]


Basic setup
1 Given data X = {x_1, ..., x_N}, a generative model (likelihood) p(X | θ) = Π_{i=1}^{N} p(x_i | θ) and a prior p(θ), we want to sample from the posterior:
    p(θ | X) ∝ p(θ) p(X | θ) = p(θ) Π_{i=1}^{N} p(x_i | θ)
2 We are interested in the case where N is extremely large, so that computing p(X | θ) is prohibitively expensive.
3 Define the following two quantities (the unnormalized negative log-posterior and its stochastic estimate):
    U(θ) ≜ −Σ_{i=1}^{N} log p(x_i | θ) − log p(θ)
    Ũ(θ) ≜ −(N/n) Σ_{i=1}^{n} log p(x_{π_i} | θ) − log p(θ)
  where (π_1, ..., π_N) is a random permutation of (1, ..., N).

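The two quantities above can be sketched concretely. The Gaussian model below is my illustrative choice, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: x_i ~ N(theta, 1) with prior theta ~ N(0, 1).
N = 10_000
X = rng.normal(2.0, 1.0, size=N)

def grad_U(theta):
    """Full-data gradient of U(theta) = -sum_i log p(x_i|theta) - log p(theta)."""
    return -np.sum(X - theta) + theta

def grad_U_tilde(theta, n=100):
    """Minibatch estimate: scale the subsampled likelihood term by N/n."""
    batch = X[rng.choice(N, size=n, replace=False)]
    return -(N / n) * np.sum(batch - theta) + theta

# Unbiasedness: averaging many minibatch gradients recovers the full gradient.
est = np.mean([grad_U_tilde(0.0) for _ in range(2000)])
```

Each call to `grad_U_tilde` touches only n = 100 of the N = 10,000 data points, which is exactly the cost saving SG-MCMC exploits.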

Basic setup
1 SG-MCMC relies on the following quantity (the stochastic gradient):
    ∇_θ Ũ(θ) = −(N/n) Σ_{i=1}^{n} ∇_θ log p(x_{π_i} | θ) − ∇_θ log p(θ)
2 ∇_θ Ũ(θ) is an unbiased estimate of ∇_θ U(θ):
  - SG-MCMC samples parameters based on ∇_θ Ũ(θ)
  - it is very cheap to compute
  - hence the name stochastic gradient MCMC


Comparing with traditional MCMC
1 Ignore the acceptance step:
  - the detailed balance condition typically does not hold, and the algorithm is not reversible⁶
  - this typically leads to biased, but controllable, estimates
2 Use a subset of the data in each iteration:
  - this yields stochastic gradients
  - it does not affect the convergence properties (e.g., convergence rates) compared to using the whole dataset in each iteration
⁶ These are sufficient conditions for a valid MCMC method, but not necessary conditions.


Demo: the two key steps
1 Proposals follow stochastic gradients of the log-posterior: the sampler gets stuck in a local mode.

Demo: the two key steps
2 After adding random Gaussian noise: it works!

Outline
1 Markov Chain Monte Carlo Methods
   Monte Carlo methods
   Markov chain Monte Carlo
2 Stochastic Gradient Markov Chain Monte Carlo Methods
   Introduction
   Stochastic gradient Langevin dynamics
   Stochastic gradient Hamiltonian Monte Carlo
   Application in Latent Dirichlet allocation

First attempt
1 A 1st-order method: stochastic gradients are applied directly to the model parameter θ.
2 Use a proposal that follows the stochastic gradient of the log-posterior:
    θ_{l+1} = θ_l − h_{l+1} ∇_θ Ũ(θ_l)
  where the h_l are stepsizes, either fixed (h_l = h for all l) or decreasing (h_l > h_{l+1} for all l).
3 Ignore the acceptance step.
4 The result is stochastic gradient descent (SGD).

Random noise to the rescue
1 The algorithm needs to explore the parameter space: add random Gaussian noise to the update⁷:
    θ_{l+1} = θ_l − h_{l+1} ∇_θ Ũ(θ_l) + √(2 h_{l+1}) ζ_{l+1},  ζ_{l+1} ~ N(0, I)
2 The variance of the Gaussian needs to be 2 h_{l+1} in order to guarantee a correct sampler: this is guaranteed by the Fokker-Planck equation.
3 The result is called stochastic gradient Langevin dynamics (SGLD).
⁷ In the following, we will directly use N(0, I) to denote a normal random variable with zero mean and identity covariance matrix.


SGLD in algorithm
Input: stepsizes {h_l}
Output: approximate samples {θ_l}
Initialize θ_0 ∈ R^n
for l = 1, 2, ... do
    Evaluate ∇_θ Ũ(θ_{l−1}) from the l-th minibatch
    θ_l = θ_{l−1} − ∇_θ Ũ(θ_{l−1}) h_l + √(2 h_l) N(0, I)
end
Return {θ_l}
Algorithm 1: Stochastic Gradient Langevin Dynamics
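Algorithm 1 in code, on a toy Gaussian posterior. The model, minibatch size, and stepsize are my illustrative choices; with a conjugate Gaussian model the posterior is known in closed form, so the sampler can be checked:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: x_i ~ N(theta, 1) with prior theta ~ N(0, 1); the posterior is
# N( sum(X)/(N+1), 1/(N+1) ).
N, n = 1000, 100
X = rng.normal(1.0, 1.0, size=N)

def grad_U_tilde(theta, batch):
    """Stochastic gradient of the negative log-posterior."""
    return -(N / n) * np.sum(batch - theta) + theta

def sgld(num_iter=20_000, h=1e-5):
    theta, samples = 0.0, []
    for _ in range(num_iter):
        batch = X[rng.choice(N, size=n, replace=False)]   # the l-th minibatch
        theta = (theta - h * grad_U_tilde(theta, batch)
                 + np.sqrt(2 * h) * rng.standard_normal())
        samples.append(theta)
    return np.array(samples)

samples = sgld()
```

After burn-in, the chain's mean should sit near the analytic posterior mean Σ_i x_i/(N + 1); the stepsize must be small enough that the minibatch-gradient noise stays dominated by the injected √(2h) noise.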

Example⁸
A simple Gaussian mixture:
    θ_1 ~ N(0, 10),  θ_2 ~ N(0, 1)
    x_i ~ (1/2) N(θ_1, 2) + (1/2) N(θ_1 + θ_2, 2),  i = 1, ..., 100
[Figure: left, true posterior; right, sample-based estimate.]
⁸ M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In: ICML 2011.

Outline
1 Markov Chain Monte Carlo Methods
   Monte Carlo methods
   Markov chain Monte Carlo
2 Stochastic Gradient Markov Chain Monte Carlo Methods
   Introduction
   Stochastic gradient Langevin dynamics
   Stochastic gradient Hamiltonian Monte Carlo
   Application in Latent Dirichlet allocation

SGHMC
1 A 2nd-order method: stochastic gradients are applied to an auxiliary parameter (the momentum).
2 SGLD is slow when the parameter space exhibits uneven curvature.
3 Use the momentum idea to improve SGLD, a generalization of HMC in which the ball rolls on a surface with friction:
  - the ball follows the momentum instead of the gradient; the momentum is a summary of historical gradients, so the ball can jump out of local modes more easily and move faster
  - this requires a balance between the extra forces: momentum, friction, random force, gravity

Adding a friction term
1 Without a friction term, the random Gaussian noise would drive the ball too far away from its stationary distribution.
2 After adding a friction term:
    θ_l = θ_{l−1} + v_{l−1} h_l
    v_l = v_{l−1} − ∇_θ Ũ(θ_l) h_l − A v_{l−1} h_l + √(2 A h_l) N(0, I),
  where A > 0 is a constant⁹ controlling the magnitude of the friction.
3 The friction term penalizes the momentum: the more momentum, the more friction, thus slowing down the ball.
⁹ In the original SGHMC paper, A is decomposed into a known variance of the injected noise and an unknown variance of the stochastic gradients.


SGHMC in algorithm
Input: friction constant A, stepsizes {h_l}
Output: approximate samples {θ_l}
Initialize θ_0 ∈ R^n
for l = 1, 2, ... do
    θ_l = θ_{l−1} + v_{l−1} h_l
    Evaluate ∇_θ Ũ(θ_l) from the l-th minibatch
    v_l = v_{l−1} − ∇_θ Ũ(θ_l) h_l − A v_{l−1} h_l + √(2 A h_l) N(0, I)
end
Return {θ_l}
Algorithm 2: Stochastic Gradient Hamiltonian Monte Carlo
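Algorithm 2 in code, on the same kind of toy Gaussian posterior as before. The model, the friction constant A, and the stepsize are my illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: x_i ~ N(theta, 1) with prior theta ~ N(0, 1); the posterior is
# N( sum(X)/(N+1), 1/(N+1) ).
N, n = 1000, 100
X = rng.normal(1.0, 1.0, size=N)

def grad_U_tilde(theta, batch):
    """Stochastic gradient of the negative log-posterior."""
    return -(N / n) * np.sum(batch - theta) + theta

def sghmc(num_iter=20_000, h=1e-3, A=10.0):
    theta, v, samples = 0.0, 0.0, []
    for _ in range(num_iter):
        batch = X[rng.choice(N, size=n, replace=False)]   # the l-th minibatch
        theta = theta + v * h
        v = (v - grad_U_tilde(theta, batch) * h           # gradient force
             - A * v * h                                  # friction
             + np.sqrt(2 * A * h) * rng.standard_normal())  # random force
        samples.append(theta)
    return np.array(samples)

samples = sghmc()
```

The friction term −A v h dissipates the energy injected by both the Gaussian noise and the minibatch-gradient noise, keeping the chain near its stationary distribution.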


71 Reparametrize SGHMC
for l = 1, 2, . . . do
  Evaluate ∇_θ Ũ(θ_{l−1}) from the l-th minibatch
  θ_l = θ_{l−1} + v_{l−1} h_l
  v_l = v_{l−1} − ∇_θ Ũ(θ_l) h_l − A v_{l−1} h_l + √(2A h_l) N(0, I)
end
for l = 1, 2, . . . do
  Evaluate ∇_θ Ũ(θ_{l−1}) from the l-th minibatch
  θ_l = θ_{l−1} + p_{l−1}
  p_l = (1 − m) p_{l−1} − ∇_θ Ũ(θ_l) ε_l + √(2m ε_l) N(0, I)
end
Reparametrization: ε = h², m = A h, p = v h
ε_l : learning rate; m: momentum weight Changyou Chen (Duke University) SG-MCMC 42 / 56
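As a quick sanity check on the substitution ε = h², m = A h, p = v h, the two loops can be run side by side with the same noise sequence; the 1-D quadratic U below is an illustrative stand-in for the target.

```python
import numpy as np

# Check numerically that the (h, A, v) and (eps, m, p) parameterizations
# produce identical theta trajectories when eps = h^2, m = A*h, p = v*h.
h, A = 0.1, 0.3
eps, m = h**2, A * h
grad_u = lambda th: th  # U(theta) = theta^2 / 2 as a stand-in target

rng = np.random.default_rng(1)
noise = rng.standard_normal(100)

theta1, v = 1.0, 0.0
theta2, p = 1.0, 0.0
for z in noise:
    # original parameterization
    theta1 = theta1 + v * h
    v = v - grad_u(theta1) * h - A * v * h + np.sqrt(2 * A * h) * z
    # reparameterized form (same noise, scaled by h since p = v*h)
    theta2 = theta2 + p
    p = (1 - m) * p - grad_u(theta2) * eps + np.sqrt(2 * m * eps) * z

print(abs(theta1 - theta2))  # agrees up to floating-point rounding
```

The key identity is √(2m ε) = h √(2A h): the injected noise in the p-update is exactly the v-update noise multiplied by h.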

72 SGD vs. SGLD
∇_θ Ũ(θ_{l−1}) ≜ − (N/n) Σ_{i=1}^n ∇_θ log p(x_{π_i} | θ_{l−1}) − ∇_θ log p(θ_{l−1})
SGD: for l = 1, 2, . . . do
  Evaluate ∇_θ Ũ(θ_{l−1}) from the l-th minibatch
  θ_l = θ_{l−1} − ∇_θ Ũ(θ_{l−1}) ε_l
end
SGLD: for l = 1, 2, . . . do
  Evaluate ∇_θ Ũ(θ_{l−1}) from the l-th minibatch
  θ_l = θ_{l−1} − ∇_θ Ũ(θ_{l−1}) ε_l + δ_l, δ_l ∼ N(0, 2ε_l I)
end Changyou Chen (Duke University) SG-MCMC 43 / 56
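The only difference between the two loops is the injected noise δ_l, but the behavior differs qualitatively: SGD collapses to the mode, while SGLD keeps exploring the posterior. A minimal sketch, again using a 1-D Gaussian target with artificial gradient noise standing in for minibatching:

```python
import numpy as np

# Target posterior N(0, 1): U(theta) = theta^2 / 2, grad U = theta.
rng = np.random.default_rng(2)
eps, n_iter = 0.01, 50000
th_sgd, th_sgld = 2.0, 2.0
sgd_trace, sgld_trace = [], []
for _ in range(n_iter):
    g_sgd = th_sgd + 0.5 * rng.standard_normal()    # noisy gradient
    g_sgld = th_sgld + 0.5 * rng.standard_normal()
    th_sgd -= g_sgd * eps                           # SGD: pure descent
    th_sgld += -g_sgld * eps + rng.normal(0.0, np.sqrt(2 * eps))  # SGLD
    sgd_trace.append(th_sgd)
    sgld_trace.append(th_sgld)

# SGD iterates concentrate at the mode; SGLD iterates trace the posterior.
print(np.var(sgd_trace[10000:]), np.var(sgld_trace[10000:]))
```

After burn-in, the SGD trace has near-zero variance while the SGLD trace has variance close to the posterior variance of 1.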

73 SGD with Momentum (SGD-M) vs. SGHMC
∇_θ Ũ(θ_{l−1}) ≜ − (N/n) Σ_{i=1}^n ∇_θ log p(x_{π_i} | θ_{l−1}) − ∇_θ log p(θ_{l−1})
SGD-M: for l = 1, 2, . . . do
  Evaluate ∇_θ Ũ(θ_{l−1}) from the l-th minibatch
  θ_l = θ_{l−1} + p_{l−1}
  p_l = (1 − m) p_{l−1} − ∇_θ Ũ(θ_l) ε_l
end
SGHMC: for l = 1, 2, . . . do
  Evaluate ∇_θ Ũ(θ_{l−1}) from the l-th minibatch
  θ_l = θ_{l−1} + p_{l−1}
  p_l = (1 − m) p_{l−1} − ∇_θ Ũ(θ_l) ε_l + δ_l, δ_l ∼ N(0, 2m ε_l I)
end Changyou Chen (Duke University) SG-MCMC 44 / 56
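The same contrast holds in the momentum formulation: SGD-M and SGHMC share the identical update except for the injected noise. A sketch under the same illustrative 1-D Gaussian target and artificial gradient noise as before:

```python
import numpy as np

# SGD-M vs SGHMC on U(theta) = theta^2 / 2: identical momentum updates
# except for the injected noise term in SGHMC.
rng = np.random.default_rng(5)
eps, m, n_iter = 0.01, 0.1, 50000
th_m, p_m = 2.0, 0.0          # SGD with momentum
th_h, p_h = 2.0, 0.0          # SGHMC
trace_m, trace_h = [], []
for _ in range(n_iter):
    th_m += p_m
    p_m = (1 - m) * p_m - (th_m + 0.5 * rng.standard_normal()) * eps
    th_h += p_h
    p_h = ((1 - m) * p_h - (th_h + 0.5 * rng.standard_normal()) * eps
           + rng.normal(0.0, np.sqrt(2 * m * eps)))   # SGHMC noise
    trace_m.append(th_m)
    trace_h.append(th_h)

print(np.var(trace_m[10000:]), np.var(trace_h[10000:]))
```

SGD-M settles into a small neighborhood of the mode (its variance is driven only by gradient noise), while SGHMC's trace variance stays near the target variance of 1.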

74 Example: SGHMC vs. SGLD on a bivariate Gaussian 1 Sample from a 2D Gaussian distribution: U(θ) = ½ θᵀ Σ⁻¹ θ. For the noisy scenarios, the gradient is replaced by ∇Ũ(θ) = ∇U(θ) + N(0, 4). 2 Noisy Hamiltonian dynamics lead to diverging trajectories when friction is not introduced; resampling the momentum helps control divergence, but the associated stationary distribution is not correct. [Figure: contrasting sampling of a bivariate Gaussian with correlation using SGHMC versus SGLD; panels compare the average absolute error of the sample covariance and the autocorrelation time.] 10 T. Chen, E. B. Fox, and C. Guestrin. Stochastic Gradient Hamiltonian Monte Carlo. In: ICML. Changyou Chen (Duke University) SG-MCMC 45 / 56

75 Recap 1 For SG-MCMC methods, in each iteration: calculate the stochastic gradient based on the current parameter sample; generate the next sample by moving the current sample (possibly in an extended space) along the direction of the stochastic gradient, plus suitable random Gaussian noise; no accept-reject step is needed; the samples are guaranteed to converge close to the true posterior in some sense. Changyou Chen (Duke University) SG-MCMC 46 / 56

76 Outline 1 Markov Chain Monte Carlo Methods Monte Carlo methods Markov chain Monte Carlo 2 Stochastic Gradient Markov Chain Monte Carlo Methods Introduction Stochastic gradient Langevin dynamics Stochastic gradient Hamiltonian Monte Carlo Application in Latent Dirichlet allocation Changyou Chen (Duke University) SG-MCMC 47 / 56

77 Latent Dirichlet allocation
1 For each topic k, draw the topic-word distribution: β_k ∼ Dir(γ)
2 For each document d, draw its topic distribution: θ_d ∼ Dir(α)
  For each word l, draw its topic indicator: c_dl ∼ Discrete(θ_d)
  Draw the observed word: x_dl ∼ Discrete(β_{c_dl})
[Plate diagram: γ → β_k in a plate of size K; α → θ_d → c_dl → x_dl, with c_dl and x_dl in the N-word plate nested inside the D-document plate, and β_k feeding into x_dl.] Changyou Chen (Duke University) SG-MCMC 48 / 56
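The generative process above can be run directly in numpy; the corpus sizes (K, V, D, N) and hyperparameters below are arbitrary illustrative choices, and every document is given the same length for simplicity.

```python
import numpy as np

# Sample a small synthetic corpus from the LDA generative process.
rng = np.random.default_rng(3)
K, V, D, N = 4, 30, 5, 20        # topics, vocabulary, documents, words/doc
alpha, gamma = 0.5, 0.5

beta = rng.dirichlet(np.full(V, gamma), size=K)   # topic-word distributions
docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))    # per-document topic dist.
    c = rng.choice(K, size=N, p=theta_d)          # topic indicator per word
    x = np.array([rng.choice(V, p=beta[k]) for k in c])  # observed words
    docs.append(x)

print(len(docs), docs[0].shape)
```

Each row of `beta` and each `theta_d` lies on the probability simplex, which is exactly the constraint that the reparameterization on the later slides removes.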

78 Latent Dirichlet allocation
1 Let β ≜ (β_k)_{k=1}^K, θ ≜ (θ_d)_{d=1}^D, C ≜ (c_dl)_{d,l=1}^{D,n_d}, X ≜ (x_dl)_{d,l=1}^{D,n_d}; the posterior distribution is
p(β, θ, C | X) ∝ [Π_{k=1}^K p(β_k | γ)] [Π_{d=1}^D p(θ_d | α) Π_{l=1}^{n_d} p(c_dl | θ_d) p(x_dl | β, c_dl)]
2 From previous lectures:
p(c_dl | θ_d) = Π_{k=1}^K (θ_dk)^{1(c_dl = k)}
p(x_dl | β, c_dl) = Π_{k=1}^K Π_{v=1}^V β_kv^{1(x_dl = v) 1(c_dl = k)}
3 Together with the fact:
∫_{θ ∈ Δ_{K−1}} Π_{k=1}^K θ_k^{α_k − 1} dθ = Π_{k=1}^K Γ(α_k) / Γ(Σ_{k=1}^K α_k)
Changyou Chen (Duke University) SG-MCMC 49 / 56

80 Latent Dirichlet allocation
1 Integrating out the local parameters (the topic distribution θ_d for each document) results in the following semi-collapsed distribution:
p(X, C, β | α, γ) = Π_{d=1}^D [Γ(Kα) / Γ(Kα + n_d··)] Π_{k=1}^K [Γ(α + n_dk·) / Γ(α)] × Π_{k=1}^K [Γ(Vγ) / Γ(γ)^V] Π_{v=1}^V β_kv^{γ + n_·kv − 1},
where n_dkw ≜ Σ_{l=1}^{n_d} 1(c_dl = k) 1(x_dl = w) is the number of occurrences of word w in document d assigned to topic k; "·" means marginal sum, e.g. n_·kw ≜ Σ_{d=1}^D n_dkw.
2 SG-MCMC requires an unconstrained parameter space; with the reparameterization β_kv = λ_kv / Σ_{v'} λ_kv' and the prior λ_kv ∼ Ga(γ, 1):
Π_{k=1}^K [Γ(Vγ) / Γ(γ)^V] Π_{v=1}^V β_kv^{γ + n_·kv − 1} becomes Π_{k=1}^K Π_{v=1}^V Ga(λ_kv; γ, 1) (λ_kv / Σ_{v'} λ_kv')^{n_·kv}
Changyou Chen (Duke University) SG-MCMC 50 / 56

82 Latent Dirichlet allocation
1 Still need to integrate out the local parameter C:
p(X, λ | α, γ) = E_C[p(X, C, β | α, γ)] = E_C[ Π_{d=1}^D Γ(Kα)/Γ(Kα + n_d··) Π_{k=1}^K Γ(α + n_dk·)/Γ(α) Π_{k=1}^K Π_{v=1}^V Ga(λ_kv; γ, 1) (λ_kv / Σ_{v'} λ_kv')^{n_·kv} ]
2 The stochastic gradient with a minibatch of documents D̃ of size |D̃| ≪ D is:
∂ log p(λ | α, γ, X) / ∂λ_kw = (γ − 1)/λ_kw − 1 + (D/|D̃|) Σ_{d ∈ D̃} E_{c_d | x_d, λ, α}[ n_dkw/λ_kw − n_dk·/λ_k· ]
3 SGLD update:
λ_kw^{t+1} = λ_kw^t + h_{t+1} ∂ log p(λ | α, γ, X)/∂λ_kw + √(2 h_{t+1}) N(0, 1)
Changyou Chen (Duke University) SG-MCMC 51 / 56
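One such SGLD step on λ can be sketched as below, assuming the expectations E[n_dkw] over the minibatch have already been estimated (in practice by a few Gibbs sweeps over each minibatch document); the random placeholder counts only illustrate the shapes involved, not a working LDA sampler.

```python
import numpy as np

# One SGLD step on the unconstrained parameters lambda (K x V).
rng = np.random.default_rng(4)
K, V, D, batch = 4, 30, 1000, 10
gamma, h = 0.5, 1e-4
lam = rng.gamma(1.0, 1.0, size=(K, V))        # lambda_kv > 0 initially

exp_counts = rng.random((batch, K, V))        # placeholder for E[n_dkw]
n_dk = exp_counts.sum(axis=2)                 # E[n_dk.], shape (batch, K)
lam_k = lam.sum(axis=1)                       # lambda_k., shape (K,)
scale = D / batch                             # minibatch rescaling factor

grad = ((gamma - 1) / lam - 1.0               # gradient of the Ga(gamma, 1) prior
        + scale * (exp_counts / lam).sum(axis=0)
        - scale * (n_dk / lam_k).sum(axis=0)[:, None])
lam = lam + h * grad + np.sqrt(2 * h) * rng.standard_normal((K, V))
print(lam.shape)
```

Note that nothing in this update keeps λ positive, which is part of why plain SGLD behaves poorly here and the Riemannian variant on the next slide is needed.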

85 Latent Dirichlet allocation 1 LDA with the above SGLD update does not work well in practice because of the high dimensionality of the model parameters. 2 To make it work, Riemannian geometry (2nd-order) information needs to be brought into SGLD: this leads to Stochastic Gradient Riemannian Langevin Dynamics (SGRLD) for LDA 11; it takes the parameter geometry into account, so the step size for each dimension of the parameter is adapted. 11 S. Patterson and Y. W. Teh. Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex. In: NIPS. Changyou Chen (Duke University) SG-MCMC 52 / 56

86 Experiments: SGRLD for LDA 12 1 NIPS dataset: the collection of NIPS papers from , with 2483 documents, 50 topics 12 S. Patterson and Y. W. Teh. Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex. In: NIPS Changyou Chen (Duke University) SG-MCMC 53 / 56

87 Experiments: SGRLD for LDA 13 1 Wikipedia dataset: a set of articles downloaded at random from Wikipedia, with 150,000 documents 13 S. Patterson and Y. W. Teh. Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex. In: NIPS Changyou Chen (Duke University) SG-MCMC 54 / 56

88 Conclusion 1 I have introduced: basic concepts in MCMC; basic ideas in SG-MCMC, two SG-MCMC algorithms, and an application in LDA. 2 Topics not covered: a general review of SG-MCMC algorithms; theory related to stochastic differential equations and Itô diffusions; convergence theory; various applications in deep learning, including SG-MCMC for learning weight uncertainty and SG-MCMC for deep generative models; interested readers should refer to the related references. Changyou Chen (Duke University) SG-MCMC 55 / 56

89 Thank You Changyou Chen (Duke University) SG-MCMC 56 / 56


Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

MCMC and Gibbs Sampling. Sargur Srihari

MCMC and Gibbs Sampling. Sargur Srihari MCMC and Gibbs Sampling Sargur srihari@cedar.buffalo.edu 1 Topics 1. Markov Chain Monte Carlo 2. Markov Chains 3. Gibbs Sampling 4. Basic Metropolis Algorithm 5. Metropolis-Hastings Algorithm 6. Slice

More information

Adaptive HMC via the Infinite Exponential Family

Adaptive HMC via the Infinite Exponential Family Adaptive HMC via the Infinite Exponential Family Arthur Gretton Gatsby Unit, CSML, University College London RegML, 2017 Arthur Gretton (Gatsby Unit, UCL) Adaptive HMC via the Infinite Exponential Family

More information

Computer Intensive Methods in Mathematical Statistics

Computer Intensive Methods in Mathematical Statistics Computer Intensive Methods in Mathematical Statistics Department of mathematics johawes@kth.se Lecture 16 Advanced topics in computational statistics 18 May 2017 Computer Intensive Methods (1) Plan of

More information

Learning Energy-Based Models of High-Dimensional Data

Learning Energy-Based Models of High-Dimensional Data Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm Discovering causal structure as a goal

More information

Approximate inference in Energy-Based Models

Approximate inference in Energy-Based Models CSC 2535: 2013 Lecture 3b Approximate inference in Energy-Based Models Geoffrey Hinton Two types of density model Stochastic generative model using directed acyclic graph (e.g. Bayes Net) Energy-based

More information

Introduction to Markov Chain Monte Carlo & Gibbs Sampling

Introduction to Markov Chain Monte Carlo & Gibbs Sampling Introduction to Markov Chain Monte Carlo & Gibbs Sampling Prof. Nicholas Zabaras Sibley School of Mechanical and Aerospace Engineering 101 Frank H. T. Rhodes Hall Ithaca, NY 14853-3801 Email: zabaras@cornell.edu

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Markov chain Monte Carlo methods in atmospheric remote sensing

Markov chain Monte Carlo methods in atmospheric remote sensing 1 / 45 Markov chain Monte Carlo methods in atmospheric remote sensing Johanna Tamminen johanna.tamminen@fmi.fi ESA Summer School on Earth System Monitoring and Modeling July 3 Aug 11, 212, Frascati July,

More information

Approximate Slice Sampling for Bayesian Posterior Inference

Approximate Slice Sampling for Bayesian Posterior Inference Approximate Slice Sampling for Bayesian Posterior Inference Christopher DuBois GraphLab, Inc. Anoop Korattikara Dept. of Computer Science UC Irvine Max Welling Informatics Institute University of Amsterdam

More information

Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm

Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm Qiang Liu and Dilin Wang NIPS 2016 Discussion by Yunchen Pu March 17, 2017 March 17, 2017 1 / 8 Introduction Let x R d

More information

Markov chain Monte Carlo

Markov chain Monte Carlo 1 / 26 Markov chain Monte Carlo Timothy Hanson 1 and Alejandro Jara 2 1 Division of Biostatistics, University of Minnesota, USA 2 Department of Statistics, Universidad de Concepción, Chile IAP-Workshop

More information

Gaussian Mixture Model

Gaussian Mixture Model Case Study : Document Retrieval MAP EM, Latent Dirichlet Allocation, Gibbs Sampling Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 5 th,

More information

CS281A/Stat241A Lecture 22

CS281A/Stat241A Lecture 22 CS281A/Stat241A Lecture 22 p. 1/4 CS281A/Stat241A Lecture 22 Monte Carlo Methods Peter Bartlett CS281A/Stat241A Lecture 22 p. 2/4 Key ideas of this lecture Sampling in Bayesian methods: Predictive distribution

More information

Scaling up Bayesian Inference

Scaling up Bayesian Inference Scaling up Bayesian Inference David Dunson Departments of Statistical Science, Mathematics & ECE, Duke University May 1, 2017 Outline Motivation & background EP-MCMC amcmc Discussion Motivation & background

More information

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy

More information

Sequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007

Sequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007 Sequential Monte Carlo and Particle Filtering Frank Wood Gatsby, November 2007 Importance Sampling Recall: Let s say that we want to compute some expectation (integral) E p [f] = p(x)f(x)dx and we remember

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

an introduction to bayesian inference

an introduction to bayesian inference with an application to network analysis http://jakehofman.com january 13, 2010 motivation would like models that: provide predictive and explanatory power are complex enough to describe observed phenomena

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

Metropolis Hastings. Rebecca C. Steorts Bayesian Methods and Modern Statistics: STA 360/601. Module 9

Metropolis Hastings. Rebecca C. Steorts Bayesian Methods and Modern Statistics: STA 360/601. Module 9 Metropolis Hastings Rebecca C. Steorts Bayesian Methods and Modern Statistics: STA 360/601 Module 9 1 The Metropolis-Hastings algorithm is a general term for a family of Markov chain simulation methods

More information