Statistical Machine Learning
Lecture 8: Markov Chain Monte Carlo Sampling
Melih Kandemir, Özyeğin University, İstanbul, Turkey
Monte Carlo Integration

The big question: evaluate

  E_{p(z)}[f(z)] = \int f(z) \, p(z) \, dz

Examples:
- Bayesian prediction: p(z_{new} | D) = \int p(z_{new} | \theta) p(\theta | D) d\theta = E_{p(\theta | D)}[p(z_{new} | \theta)]
- Difficult variational updates: \log q(z_1) = E_{p(z_2)}[\log p(z_1, z_2)]
- Difficult E-step in EM: Q(\theta, \theta^{old}) = E_{p(z | D, \theta^{old})}[\log p(z, D | \theta)]
Approximating the integral by samples

  E_{p(z)}[f(z)] = \int f(z) \, p(z) \, dz \approx \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)})

where the z^{(l)} are samples drawn from p(z). As long as i.i.d. samples are drawn from the true p(z), as few as 20 samples are sufficient for a good approximation.
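A minimal sketch of this sample average in plain NumPy (the target and f are illustrative): f(z) = z^2 under p(z) = N(0, 1), whose exact expectation is 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# f(z) = z**2 under p(z) = N(0, 1); the exact expectation is E[z^2] = 1.
L = 1000
z = rng.standard_normal(L)      # i.i.d. samples z^(l) ~ p(z)
estimate = np.mean(z ** 2)      # (1/L) * sum_l f(z^(l))
print(estimate)
```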
Sampling from the inverse CDF (Bishop, PRML, 2006)

- Draw u ~ Uniform(0, 1)
- Calculate y = h^{-1}(u), where h is the CDF of the target

Because: Pr(h^{-1}(u) \le y) = Pr(u \le h(y)) = h(y)

Problem: how do we compute h^{-1}(u) for an arbitrary distribution?
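When the CDF inverts in closed form the method is immediate. A sketch for the Exponential distribution, where h(y) = 1 - e^{-\lambda y} gives h^{-1}(u) = -\log(1 - u)/\lambda (the rate \lambda = 2 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                  # rate of the Exponential target

u = rng.uniform(0.0, 1.0, size=100_000)    # u ~ Uniform(0, 1)
y = -np.log(1.0 - u) / lam                 # y = h^{-1}(u)

print(y.mean())                            # should be near 1/lam = 0.5
```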
Rejection Sampling

Target distribution p(z), and envelope distribution q(z) with k q(z) \ge p(z).

Procedure:
- z^{(t)} ~ q(z)
- u^{(t)} ~ Uniform(0, k q(z^{(t)}))
- Accept the sample if u^{(t)} \le p(z^{(t)})

  p(accept) = \int \frac{p(z)}{k q(z)} q(z) \, dz = \frac{1}{k} \int p(z) \, dz
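A minimal sketch of the procedure, with an illustrative Beta(2, 2) target and a Uniform(0, 1) envelope q(z); k = 1.5 suffices since the target density peaks at 1.5:

```python
import numpy as np

rng = np.random.default_rng(0)

def p(z):                              # target: Beta(2, 2) density
    return 6.0 * z * (1.0 - z)

k = 1.5                                # k * q(z) = 1.5 >= p(z) for q = Uniform(0, 1)
accepted = []
for _ in range(20_000):
    z = rng.uniform(0.0, 1.0)          # z^(t) ~ q(z)
    u = rng.uniform(0.0, k)            # u^(t) ~ Uniform(0, k q(z^(t)))
    if u <= p(z):                      # accept if u falls under the target density
        accepted.append(z)

accepted = np.array(accepted)
print(accepted.mean())                 # mean of Beta(2, 2) is 0.5
print(len(accepted) / 20_000)          # acceptance rate approaches 1/k = 2/3
```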
Adaptive Rejection Sampling (Bishop, PRML, 2006)

The envelope function is a set of piecewise exponential functions:

  q(z) = k_i \lambda_i \exp\{-\lambda_i (z - z_{i-1})\},   z_{i-1} < z \le z_i

Each rejected sample is added as a grid point, tightening the envelope.

The acceptance rate decays exponentially with dimensionality!
Importance Sampling (1)

  E_{p(z)}[f(z)] = \int f(z) \, p(z) \, dz = \int f(z) \frac{p(z)}{q(z)} q(z) \, dz

Draw L samples from q(z). Then

  E_{p(z)}[f(z)] \approx \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)}) \underbrace{\frac{p(z^{(l)})}{q(z^{(l)})}}_{\text{importance weight}}
Importance Sampling (2)

(+) All samples are retained.
(-) Accuracy depends heavily on how similar q(z) is to p(z).
(-) No diagnostic measures are available!
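A sketch of the estimator on a case where it behaves well: target p = N(0, 1), heavier-tailed proposal q = N(0, 2^2), f(z) = z^2 (exact answer 1). The densities are coded by hand to keep the example self-contained; all choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def p(z):                              # target density N(0, 1)
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def q(z):                              # proposal density N(0, 2^2)
    return np.exp(-0.125 * z ** 2) / (2.0 * np.sqrt(2.0 * np.pi))

L = 100_000
z = 2.0 * rng.standard_normal(L)       # z^(l) ~ q
w = p(z) / q(z)                        # importance weights
estimate = np.mean(w * z ** 2)         # estimates E_p[z^2] = 1
print(estimate)
```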
Markov Chain Monte Carlo

Robust to high dimensionality. Samples form a Markov chain with a transition function T(z' | z).

Samples are drawn from the target distribution p(z) if
- p(z) is invariant w.r.t. T(z' | z):  p(z') = \int p(z) T(z' | z) \, dz,
- the Markov chain governed by T(z' | z) is ergodic.

Invariance: ensured by detailed balance, p(z) T(z' | z) = p(z') T(z | z').
Ergodicity: more tricky; imposed by the sampling algorithms.
Metropolis-Hastings

Procedure:
- Propose the next state from Q(z' | z), e.g. N(z' | z, \sigma^2)
- Accept with probability min\left(1, \frac{p(z') Q(z | z')}{p(z) Q(z' | z)}\right)
- Otherwise stay at the current state (add another copy of it to the sample list)

The proposal variance \sigma^2 is very influential:
- determines the step size,
- if large, low acceptance rate,
- if small, slow convergence.
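A random-walk Metropolis sketch for a standard normal target; the proposal is symmetric, so the Q-ratio cancels in the acceptance probability (the target and \sigma are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(z):                          # log target density N(0, 1), up to a constant
    return -0.5 * z ** 2

sigma = 1.0                            # proposal std: the influential step size
z = 0.0
chain = []
for _ in range(50_000):
    z_prop = z + sigma * rng.standard_normal()   # z' ~ Q(z' | z) = N(z, sigma^2)
    # Symmetric proposal: Q(z | z') = Q(z' | z), so the acceptance
    # probability reduces to min(1, p(z') / p(z)).
    if np.log(rng.uniform()) < log_p(z_prop) - log_p(z):
        z = z_prop                     # accept
    chain.append(z)                    # on rejection, repeat the current state

chain = np.array(chain[5_000:])        # discard burn-in
print(chain.mean(), chain.std())       # near 0 and 1 respectively
```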
Metropolis-Hastings (2)

Detailed balance holds:

  p(z) T(z' | z) = p(z) Q(z' | z) \min\left(1, \frac{p(z') Q(z | z')}{p(z) Q(z' | z)}\right)
                 = \min\left(p(z) Q(z' | z), \, p(z') Q(z | z')\right)
                 = p(z') Q(z | z') \min\left(\frac{p(z) Q(z' | z)}{p(z') Q(z | z')}, 1\right)
                 = p(z') T(z | z')
Metropolis-Hastings (3) (Murray, MLSS, 2009)

[Figure: 1-D demo]
Gibbs Sampling

Procedure:
- Initialize z_1^{(1)}, z_2^{(1)}, z_3^{(1)}
- For l = 1 to L:
    z_1^{(l+1)} ~ p(z_1 | z_2^{(l)}, z_3^{(l)})
    z_2^{(l+1)} ~ p(z_2 | z_1^{(l+1)}, z_3^{(l)})
    z_3^{(l+1)} ~ p(z_3 | z_1^{(l+1)}, z_2^{(l+1)})
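A minimal two-variable version of this sweep, for an illustrative bivariate Gaussian with correlation \rho = 0.8 whose conditionals are available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                              # correlation of the Gaussian target
cond_std = np.sqrt(1.0 - rho ** 2)     # std of each conditional

z1, z2 = 0.0, 0.0
samples = []
for _ in range(50_000):
    # z1 ~ p(z1 | z2) = N(rho * z2, 1 - rho^2)
    z1 = rho * z2 + cond_std * rng.standard_normal()
    # z2 ~ p(z2 | z1) = N(rho * z1, 1 - rho^2)
    z2 = rho * z1 + cond_std * rng.standard_normal()
    samples.append((z1, z2))

samples = np.array(samples[5_000:])    # discard burn-in
print(np.corrcoef(samples.T)[0, 1])    # recovers rho = 0.8
```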
Gibbs Sampling (2)

Invariance: all conditioned variables are held constant by definition, and the remaining variable is sampled from its true conditional distribution.

Ergodicity: guaranteed if all conditional probabilities are non-zero over their entire domain.

Gibbs sampling is a special case of Metropolis-Hastings with q_k(z' | z) = p(z'_k | z_{\setminus k}). Since z'_{\setminus k} = z_{\setminus k},

  A(z' | z) = \frac{p(z'_k | z'_{\setminus k}) \, p(z'_{\setminus k}) \, p(z_k | z'_{\setminus k})}{p(z_k | z_{\setminus k}) \, p(z_{\setminus k}) \, p(z'_k | z_{\setminus k})} = 1

Hence, all samples are accepted.
Gibbs Sampling (3) (Bishop, PRML, 2006)

The step size is governed by the covariances of the conditional distributions.

Iterated conditional modes (ICM): instead of sampling, update each variable with a point estimate (e.g. mean or mode) of its conditional.
Collapsed Gibbs Sampling

Integrating out some of the variables may render the remaining ones conditionally independent, which entails faster convergence.

Rao-Blackwell theorem: let z and \theta be dependent variables, and f(z, \theta) be some scalar function. Then

  var_{z,\theta}[f(z, \theta)] \ge var_z[E_\theta[f(z, \theta) | z]]
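The variance reduction can be seen in a toy pair (illustrative, conditioning here on \theta): \theta ~ N(0, 1), z | \theta ~ N(\theta, 1), estimating E[z] with f(z, \theta) = z. The plain estimator has per-sample variance var(z) = 2, while the Rao-Blackwellized one averages E[z | \theta] = \theta, with variance 1.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 100_000

theta = rng.standard_normal(L)             # theta ~ N(0, 1)
z = theta + rng.standard_normal(L)         # z | theta ~ N(theta, 1)

f_plain = z                                # plain Monte Carlo estimator of E[z] = 0
f_rb = theta                               # Rao-Blackwellized: E[z | theta] = theta

print(f_plain.var())                       # ~ var(z) = 2
print(f_rb.var())                          # ~ var(theta) = 1, never larger
```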
Example: Gaussian Mixture Model (Murphy, 2012)

Employ conjugate priors on:
- cluster means,
- cluster covariances,
- mixture probabilities.
Then integrate them out!
Implementation tricks

Thinning: keep only every K-th sample, to decorrelate the chain.
Burn-in: discard the first portion (e.g. the first half) of the samples, drawn before the chain has mixed.
Multiple runs: neutralize the effect of initialization.
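Burn-in and thinning are plain array slicing once the raw chain is stored; a sketch on a stand-in autocorrelated trace (an AR(1) process, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a raw MCMC trace: an autocorrelated AR(1) chain.
chain = np.zeros(20_000)
for t in range(1, len(chain)):
    chain[t] = 0.9 * chain[t - 1] + rng.standard_normal()

burn_in = len(chain) // 2          # discard the first half
K = 10                             # thinning interval
kept = chain[burn_in::K]           # every K-th sample after burn-in
print(len(kept))                   # 1000 samples retained
```

Thinning by K = 10 drops the lag-1 autocorrelation of the kept samples from about 0.9 to roughly 0.9^10.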
19 / 27 Diagnosing Convergence 1: Traceplots
20 / 27 Diagnosing Convergence 2: Running mean plots
Diagnosing Conv. 3: Gelman-Rubin Metric

- Calculate the within-chain variance W and the between-chain variance B.
- Calculate the estimated variance \hat{Var}(\theta) = (1 - 1/n) W + (1/n) B.
- Calculate and monitor the Potential Scale Reduction Factor (PSRF)

    \hat{R} = \sqrt{\hat{Var}(\theta) / W}

\hat{R} should decrease toward 1 as the chains converge.
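The metric takes a few lines given the stacked chains; a sketch (the function name and chain shapes are my own), using m chains of length n drawn i.i.d. from the same target, so \hat{R} should sit near 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def psrf(chains):
    """Potential Scale Reduction Factor for an (m, n) array of m chains."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    var_hat = (1.0 - 1.0 / n) * W + (1.0 / n) * B
    return np.sqrt(var_hat / W)

# Four well-mixed "chains" targeting the same N(0, 1): R_hat should be near 1.
chains = rng.standard_normal((4, 5_000))
print(psrf(chains))
```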
Diagnosing Convergence 4: Other metrics

Geweke diagnostic: take the first x and last y samples of the chain and test whether they come from the same distribution.
Raftery-Lewis diagnostic: calculate the number of iterations needed to reach a desired level of accuracy for a posterior quantile.
Heidelberger-Welch diagnostic: repeated significance testing of the null hypothesis that the chain is stationary.
Example: Bayesian logistic regression

  p(f_i | w, x_i) = N(f_i | w^T x_i, \sigma^2),   i = 1, ..., N
  p(y_i | f_i) = \frac{1}{1 + e^{-y_i f_i}},   i = 1, ..., N
  p(w_d | \alpha_d) = N(w_d | 0, \alpha_d^{-1}),   d = 1, ..., D
  p(\alpha_d) = Gamma(\alpha_d | a, b),   d = 1, ..., D
Let's aim for a Gibbs sampler

We require the following conditional distributions:

  p(w | f, \alpha, X, y),   (1)
  p(\alpha | w, f, X, y),   (2)
  p(f | w, \alpha, X, y)    (3)
The log joint

  \log p(w, f, \alpha, X, y) = \sum_{i=1}^{N} \log p(f_i | w, x_i) + \sum_{i=1}^{N} \log p(y_i | f_i)
                             + \sum_{d=1}^{D} \log p(w_d | \alpha_d) + \sum_{d=1}^{D} \log p(\alpha_d)

  = -\frac{1}{2} \log|\sigma^2 I| - \frac{1}{2\sigma^2} (f - Xw)^T (f - Xw) - \sum_{i=1}^{N} \log(1 + e^{-y_i f_i})
    + \frac{1}{2} \sum_{d=1}^{D} \log \alpha_d - \frac{1}{2} w^T A w + \sum_{d=1}^{D} (a - 1) \log \alpha_d - \sum_{d=1}^{D} b \alpha_d + const

where A_{dd} = \alpha_d and A_{ij} = 0 for i \ne j.
The conditionals

  p(\alpha_d | \alpha_{\setminus d}, w, f, X, y) = Gamma\left(\alpha_d \,\Big|\, a + \frac{1}{2}, \, b + \frac{1}{2} w_d^2\right)

  p(w | f, \alpha, X, y) = N\left(w \,\Big|\, (X^T X + A)^{-1} X^T f, \, (X^T X + A)^{-1}\right)   (taking \sigma^2 = 1)

  p(f | w, \alpha, X, y): no closed form; use a Metropolis step with proposal q(f_i') = N(f_i' | w^T x_i, \sigma^2)
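One full sweep of this sampler can be sketched as follows, taking \sigma^2 = 1 so the w-update matches the form above. The toy data and all names are illustrative; the f-update is a Metropolis-within-Gibbs step whose proposal equals the prior p(f_i | w, x_i), so the Hastings ratio reduces to the likelihood ratio p(y_i | f_i') / p(y_i | f_i).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (illustrative): labels y in {-1, +1} from a linear rule.
N, D = 100, 3
X = rng.standard_normal((N, D))
w_true = np.array([1.5, -2.0, 0.5])
y = np.sign(X @ w_true + 0.1 * rng.standard_normal(N))

a, b = 1.0, 1.0                            # Gamma hyperparameters; sigma^2 = 1
w, f, alpha = np.zeros(D), np.zeros(N), np.ones(D)

def log_lik(f_vec):                        # log p(y_i | f_i) = -log(1 + exp(-y_i f_i))
    return -np.log1p(np.exp(-y * f_vec))

w_samples = []
for it in range(2_000):
    # p(alpha_d | w) = Gamma(a + 1/2, b + w_d^2 / 2); numpy uses shape/scale
    alpha = rng.gamma(a + 0.5, 1.0 / (b + 0.5 * w ** 2))
    # p(w | f, alpha) = N((X^T X + A)^{-1} X^T f, (X^T X + A)^{-1})
    cov = np.linalg.inv(X.T @ X + np.diag(alpha))
    w = rng.multivariate_normal(cov @ X.T @ f, cov)
    # Metropolis step for f with independence proposal q(f_i') = N(w^T x_i, 1):
    # the proposal equals the prior, so accept with min(1, likelihood ratio).
    f_prop = X @ w + rng.standard_normal(N)
    accept = np.log(rng.uniform(size=N)) < log_lik(f_prop) - log_lik(f)
    f = np.where(accept, f_prop, f)
    if it >= 1_000:                        # burn-in
        w_samples.append(w)

w_post = np.mean(w_samples, axis=0)
print((np.sign(X @ w_post) == y).mean())   # training accuracy of the posterior mean
```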
Useful references

- Robert and Casella, Monte Carlo Statistical Methods, 2004
- Bishop, Pattern Recognition and Machine Learning, 2006, Ch. 11
- Murphy, Machine Learning: A Probabilistic Perspective, 2012
- Gelman et al., Bayesian Data Analysis, 2013
- Murray, Markov Chain Monte Carlo, MLSS, 2009