Simulation - Lectures - Part III Markov chain Monte Carlo


1 Simulation - Lectures - Part III: Markov chain Monte Carlo. Julien Berestycki. Part A Simulation and Statistical Programming, Hilary Term 2018.

2 Outline: Markov Chain Monte Carlo. Recap on Markov chains; Markov chain Monte Carlo; Metropolis-Hastings.

3 Markov chain Monte Carlo Methods. Our aim is to estimate $\theta = \mathbb{E}_p(\varphi(X))$, where $X$ is a random variable on $\Omega$ with pmf (or pdf) $p$. Up to this point we have based our estimates on independent and identically distributed draws, from either $p$ itself or some proposal distribution with pmf/pdf $q$. In MCMC we simulate a correlated sequence $X_0, X_1, X_2, \ldots$ such that $X_t$ is approximately distributed from $p$ for $t$ large, and rely on the usual estimate
$$\hat\theta_n = \frac{1}{n} \sum_{t=0}^{n-1} \varphi(X_t).$$
We will suppose the state space of $X$ is discrete (finite or countable), but it should be kept in mind that MCMC methods apply to continuous state spaces as well, and in fact form one of the most versatile and widespread classes of Monte Carlo algorithms in current use.

4 Outline: Markov Chain Monte Carlo. Recap on Markov chains; Markov chain Monte Carlo; Metropolis-Hastings.

5 Markov chains. From Part A Probability. Let $(X_t)_{t=0,1,\ldots}$ be a homogeneous Markov chain of random variables on $\Omega$, with starting distribution $X_0 \sim p^{(0)}$ and transition matrix $P = (P_{i,j})_{i,j \in \Omega}$, where $P_{i,j} = \mathbb{P}(X_{t+1} = j \mid X_t = i)$ for $i, j \in \Omega$. We write $(X_t)_{t=0,1,\ldots} \sim \mathrm{Markov}(p^{(0)}, P)$. Denote by $P^{(n)}_{i,j} = \mathbb{P}(X_{t+n} = j \mid X_t = i)$ the $n$-step transition probabilities, and by $p^{(n)}(i) = \mathbb{P}(X_n = i)$ the distribution at time $n$. Recall that a Markov chain, or equivalently its transition matrix $P$, is irreducible if and only if, for each pair of states $i, j \in \Omega$, there is $n$ such that $P^{(n)}_{i,j} > 0$.

6 Markov chains. $\pi$ is a stationary, or invariant, distribution of $P$ if $\pi_j = \sum_{i \in \Omega} \pi_i P_{i,j}$ for all $j \in \Omega$. If $p^{(0)} = \pi$ then $p^{(1)}(j) = \sum_{i \in \Omega} p^{(0)}(i) P_{i,j}$, so $p^{(1)}(j) = \pi(j)$ also. Iterating, $p^{(t)} = \pi$ for each $t = 1, 2, \ldots$, so the distribution of $X_t$ does not change with $t$: it is stationary. If $P$ is irreducible and has a stationary distribution $\pi$, then this stationary distribution is unique.
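As a quick numerical illustration (ours, not from the slides): for a small made-up transition matrix, the stationary distribution can be computed as the left eigenvector of $P$ with eigenvalue 1, and the identity $\pi = \pi P$ checked directly.

```python
import numpy as np

# A made-up 3-state transition matrix (rows sum to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# pi solves pi = pi P, i.e. it is a left eigenvector of P with
# eigenvalue 1; normalize it to sum to 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()

print(pi)                        # stationary distribution
print(np.allclose(pi @ P, pi))   # check pi P = pi -> True
```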

7 Ergodic Theorem for Markov chains. Theorem (Theorems 6.1, 6.2, 6.3 of Part A Probability). Let $(X_t)_{t=0,1,\ldots} \sim \mathrm{Markov}(\lambda, P)$ be an irreducible Markov chain on a discrete state space $\Omega$, and assume it admits a stationary distribution $\pi$. Then, for any initial distribution $\lambda$,
$$\frac{1}{n} \sum_{t=0}^{n-1} \mathbb{I}(X_t = i) \to \pi(i) \quad \text{almost surely, as } n \to \infty.$$
That is,
$$\mathbb{P}\left( \frac{1}{n} \sum_{t=0}^{n-1} \mathbb{I}(X_t = i) \to \pi(i) \right) = 1.$$
Additionally, if the chain is aperiodic, then for all $i \in \Omega$, $\mathbb{P}(X_n = i) \to \pi(i)$.

8 Ergodic Theorem for Markov chains. Corollary. Let $(X_t)_{t=0,1,\ldots} \sim \mathrm{Markov}(\lambda, P)$ be an irreducible Markov chain on a discrete state space $\Omega$, and assume it admits a stationary distribution $\pi$. Let $\varphi : \Omega \to \mathbb{R}$ be a bounded function, $X$ a discrete random variable on $\Omega$ with pmf $\pi$, and $\theta = \mathbb{E}_\pi[\varphi(X)] = \sum_{i \in \Omega} \varphi(i)\pi(i)$. Then, for any initial distribution $\lambda$,
$$\frac{1}{n} \sum_{t=0}^{n-1} \varphi(X_t) \to \theta \quad \text{almost surely, as } n \to \infty.$$
That is, $\mathbb{P}\left( \frac{1}{n} \sum_{t=0}^{n-1} \varphi(X_t) \to \theta \right) = 1$.

9 Ergodic Theorem for Markov chains. Proof (non-examinable). Assume wlog $|\varphi(i)| < 1$ for all $i \in \Omega$. Let $A \subseteq \Omega$ and $V_i(n) = \sum_{t=0}^{n-1} \mathbb{I}(X_t = i)$. Then
$$\begin{aligned}
\left| \frac{1}{n}\sum_{t=0}^{n-1} \varphi(X_t) - \theta \right|
&= \left| \frac{1}{n}\sum_{t=0}^{n-1} \sum_{i \in \Omega} \varphi(i)\,\mathbb{I}(X_t = i) - \sum_{i \in \Omega} \varphi(i)\pi(i) \right| \\
&= \left| \sum_{i \in \Omega} \varphi(i)\left( \frac{V_i(n)}{n} - \pi(i) \right) \right| \\
&\le \sum_{i \in A} \left| \frac{V_i(n)}{n} - \pi(i) \right| + \sum_{i \notin A} \left| \frac{V_i(n)}{n} - \pi(i) \right| \\
&\le \sum_{i \in A} \left| \frac{V_i(n)}{n} - \pi(i) \right| + \sum_{i \notin A} \left( \frac{V_i(n)}{n} + \pi(i) \right) \\
&= \sum_{i \in A} \left| \frac{V_i(n)}{n} - \pi(i) \right| + \sum_{i \in A} \left( \pi(i) - \frac{V_i(n)}{n} \right) + 2\sum_{i \notin A} \pi(i) \\
&\le 2 \sum_{i \in A} \left| \frac{V_i(n)}{n} - \pi(i) \right| + 2 \sum_{i \notin A} \pi(i),
\end{aligned}$$
where in the fifth line we have used $\sum_{i \notin A} \frac{V_i(n)}{n} = 1 - \sum_{i \in A} \frac{V_i(n)}{n} = \sum_{i} \pi(i) - \sum_{i \in A} \frac{V_i(n)}{n}$.

10 Ergodic Theorem for Markov chains. Proof (continued). Let $\epsilon > 0$, and take $A$ finite such that $\sum_{i \notin A} \pi(i) < \epsilon/4$. For $N \in \mathbb{N}$, define the event
$$E_N = \left\{ \sum_{i \in A} \left| \frac{V_i(n)}{n} - \pi(i) \right| < \epsilon/4 \ \text{ for all } n \ge N \right\}.$$
As $\mathbb{P}(V_i(n)/n \to \pi(i)) = 1$ for all $i \in \Omega$ and $A$ is finite, the event $E_N$ must occur for some $N$, hence $\mathbb{P}\left(\bigcup_N E_N\right) = 1$. It follows that, for any $\epsilon > 0$,
$$\mathbb{P}\left( \exists N \text{ such that for all } n \ge N,\ \left| \frac{1}{n}\sum_{t=0}^{n-1} \varphi(X_t) - \theta \right| < \epsilon \right) = 1.$$

11 Reversible Markov chains. In a reversible Markov chain we cannot distinguish the direction of simulation from inspection of a realization of the chain and its reversal, even with knowledge of the transition matrix. A Markov chain $(X_t)_{t \ge 0} \sim \mathrm{Markov}(\pi, P)$ is reversible iff
$$(X_0, X_1, \ldots, X_n) \stackrel{d}{=} (X_n, X_{n-1}, \ldots, X_0) \quad \text{for any } n \ge 0.$$
We also say that $P$ is reversible with respect to $\pi$, or $\pi$-reversible. Most MCMC algorithms are based on reversible Markov chains.

12 Reversible Markov chains. It seems clear that a Markov chain will be reversible if and only if $\mathbb{P}(X_{t-1} = j \mid X_t = i) = P_{i,j}$, so that any particular transition occurs with equal probability in the forward and reverse directions. Theorem. Let $P$ be a transition matrix. If there is a probability mass function $\pi$ on $\Omega$ such that $\pi$ and $P$ satisfy the detailed balance condition
$$\pi(i) P_{i,j} = \pi(j) P_{j,i} \quad \text{for all pairs } i, j \in \Omega,$$
then (I) $\pi = \pi P$, so $\pi$ is stationary for $P$, and (II) the chain $(X_t) \sim \mathrm{Markov}(\pi, P)$ is reversible.

13 Reversible Markov chains. Proof of (I): sum both sides of the detailed balance equation over $i \in \Omega$. Since $\sum_i P_{j,i} = 1$, this gives $\sum_i \pi(i) P_{i,j} = \pi(j)$. Proof of (II): $\pi$ is a stationary distribution of $P$, so $\mathbb{P}(X_t = i) = \pi(i)$ for all $t = 1, 2, \ldots$ along the chain. Then
$$\begin{aligned}
\mathbb{P}(X_{t-1} = j \mid X_t = i) &= \mathbb{P}(X_t = i \mid X_{t-1} = j)\, \frac{\mathbb{P}(X_{t-1} = j)}{\mathbb{P}(X_t = i)} && \text{(Bayes rule)} \\
&= P_{j,i}\, \pi(j)/\pi(i) && \text{(stationarity)} \\
&= P_{i,j} && \text{(detailed balance)}.
\end{aligned}$$

14 Outline: Markov Chain Monte Carlo. Recap on Markov chains; Markov chain Monte Carlo; Metropolis-Hastings.

15 Markov chain Monte Carlo method (discrete case). Let $X$ be a discrete random variable with pmf $p$ on $\Omega$, $\varphi$ a bounded function on $\Omega$, and $\theta = \mathbb{E}_p[\varphi(X)]$. Consider a homogeneous Markov chain $(X_t)_{t=0,1,\ldots} \sim \mathrm{Markov}(\lambda, P)$ with initial distribution $\lambda$ on $\Omega$ and transition matrix $P$, such that $P$ is irreducible and admits $p$ as invariant distribution. Then, for any initial distribution $\lambda$, the MCMC estimator
$$\hat\theta_n^{\mathrm{MCMC}} = \frac{1}{n} \sum_{t=0}^{n-1} \varphi(X_t)$$
is (weakly and strongly) consistent: $\hat\theta_n^{\mathrm{MCMC}} \to \theta$ almost surely as $n \to \infty$; and, if the chain is aperiodic, $X_t \to X \sim p$ in distribution as $t \to \infty$.

16 Markov chain Monte Carlo method. The proof follows directly from the ergodic theorem and its corollary. Note that the estimator is biased, as $\lambda \ne p$ (otherwise we would use a standard Monte Carlo estimator). For $t$ large, we have $X_t \approx_d X$. In order to implement the MCMC algorithm we need, for a given target distribution $p$, to find an irreducible (and aperiodic) transition matrix $P$ which admits $p$ as invariant distribution. Most MCMC algorithms use a transition matrix $P$ which is reversible with respect to $p$. The Metropolis-Hastings algorithm provides a generic way to obtain such a $P$ for any target distribution $p$.

17 Outline: Markov Chain Monte Carlo. Recap on Markov chains; Markov chain Monte Carlo; Metropolis-Hastings.

18 Metropolis-Hastings Algorithm. The Metropolis-Hastings (MH) algorithm allows us to simulate a Markov chain with any given stationary distribution. We start with simulation of a random variable $X$ on a discrete state space. Let $p(x) = \tilde p(x)/Z_p$ be a pmf on $\Omega$; we call $p$ (the pmf of) the target distribution. To simplify notation, we assume that $p(x) > 0$ for all $x \in \Omega$. Choose a proposal transition matrix $q(y \mid x)$. We use the notation $Y \sim q(\cdot \mid x)$ to mean $\mathbb{P}(Y = y \mid X = x) = q(y \mid x)$.

19 Metropolis-Hastings Algorithm. Metropolis-Hastings algorithm:
1. Set the initial state $X_0 = x_0$.
2. For $t = 1, 2, \ldots, n-1$:
  2.1 Assume $X_{t-1} = x_{t-1}$.
  2.2 Simulate $Y_t \sim q(\cdot \mid x_{t-1})$ and $U_t \sim U[0,1]$.
  2.3 If $U_t \le \alpha(Y_t \mid x_{t-1})$, where
$$\alpha(y \mid x) = \min\left\{ 1, \frac{p(y)\, q(x \mid y)}{p(x)\, q(y \mid x)} \right\},$$
set $X_t = Y_t$; otherwise set $X_t = x_{t-1}$.
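A minimal Python sketch of the algorithm above (ours, not from the slides); the names `p_tilde`, `q_sample` and `q_density` are our conventions, and the user is assumed to supply the unnormalized target and the proposal.

```python
import numpy as np

def metropolis_hastings(p_tilde, q_sample, q_density, x0, n, rng=None):
    """Generic Metropolis-Hastings sampler (sketch).

    p_tilde(x): unnormalized target; q_sample(rng, x): draw Y ~ q(.|x);
    q_density(y, x): value of q(y|x).
    """
    rng = rng or np.random.default_rng()
    x = x0
    chain = [x]
    for _ in range(n - 1):
        y = q_sample(rng, x)
        # alpha = min(1, p(y) q(x|y) / (p(x) q(y|x))); the normalizing
        # constant Z_p cancels, so p_tilde suffices.
        ratio = (p_tilde(y) * q_density(x, y)) / (p_tilde(x) * q_density(y, x))
        if rng.uniform() <= min(1.0, ratio):
            x = y              # accept the proposal
        chain.append(x)        # on rejection the current state is repeated
    return np.array(chain)
```

Note that only the ratio $\tilde p(y)/\tilde p(x)$ enters the acceptance probability, so the normalizing constant $Z_p$ is never needed.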

20 Metropolis-Hastings Algorithm. The Metropolis-Hastings algorithm defines a Markov chain with transition matrix $P$ such that, for $x, y \in \Omega$,
$$P_{x,y} = \mathbb{P}(X_t = y \mid X_{t-1} = x) = q(y \mid x)\,\alpha(y \mid x) + \rho(x)\, \mathbb{I}(y = x),$$
where $\rho(x)$ is the probability of rejection,
$$\rho(x) = 1 - \sum_{y \in \Omega} q(y \mid x)\, \alpha(y \mid x).$$

21 Metropolis-Hastings Algorithm. Theorem. The transition matrix $P$ of the Markov chain generated by the Metropolis-Hastings algorithm is reversible with respect to $p$, and therefore admits $p$ as stationary distribution. Proof: we check detailed balance. For $x \ne y$,
$$\begin{aligned}
p(x) P_{x,y} &= p(x)\, q(y \mid x)\, \alpha(y \mid x) \\
&= p(x)\, q(y \mid x) \min\left\{ 1, \frac{p(y)\, q(x \mid y)}{p(x)\, q(y \mid x)} \right\} \\
&= \min\left\{ p(x)\, q(y \mid x),\ p(y)\, q(x \mid y) \right\} \\
&= p(y)\, q(x \mid y) \min\left\{ \frac{p(x)\, q(y \mid x)}{p(y)\, q(x \mid y)},\ 1 \right\} \\
&= p(y)\, q(x \mid y)\, \alpha(x \mid y) = p(y) P_{y,x}.
\end{aligned}$$
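The theorem can also be verified numerically on a small state space (a sketch with made-up numbers, not from the slides): build the matrix $P_{x,y}$ of slide 20 for a uniform proposal and test $p(x)P_{x,y} = p(y)P_{y,x}$ for all pairs.

```python
import numpy as np

m = 5
p = np.array([1., 3., 2., 5., 4.])
p /= p.sum()                       # made-up target pmf
q = np.full((m, m), 1.0 / m)       # uniform proposal, q[x, y] = q(y|x)

# Off-diagonal P[x, y] = q(y|x) alpha(y|x); the rejected mass rho(x)
# is added back to the diagonal entry P[x, x].
P = np.zeros((m, m))
for x in range(m):
    for y in range(m):
        alpha = min(1.0, p[y] * q[y, x] / (p[x] * q[x, y]))
        P[x, y] = q[x, y] * alpha
    P[x, x] += 1.0 - P[x].sum()    # rejection probability rho(x)

# Detailed balance: the matrix p(x) P[x, y] should be symmetric.
M = p[:, None] * P
print(np.allclose(M, M.T))         # True
```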

22 Metropolis-Hastings Algorithm. To run the MH algorithm, we need to specify $X_0 = x_0$ (or $X_0 \sim \lambda$) and a proposal $q(y \mid x)$. We only need to know the target $p$ up to a normalizing constant, as $\alpha$ depends only on $p(y)/p(x) = \tilde p(y)/\tilde p(x)$. If the Markov chain simulated by the MH algorithm is irreducible and aperiodic, then the ergodic theorem applies. Verifying aperiodicity is usually straightforward: the MH algorithm may reject the candidate state $y$, so $P_{x,x} > 0$ for at least some states $x \in \Omega$. In order to check irreducibility we need to check that $q$ can take us anywhere in $\Omega$ (so $q$ itself is an irreducible transition matrix), and then that the acceptance step does not trap the chain (as might happen if $\alpha(y \mid x)$ is zero too often).

23 Example: Discrete Distribution on a finite state space. Consider a discrete random variable $X \sim p$ on $\Omega = \{1, 2, \ldots, m\}$ with $p(i) \propto i$, so $Z_p = \sum_{i=1}^m i = \frac{m(m+1)}{2}$. One simple proposal distribution is $Y \sim q$ on $\Omega$ with $q(i) = 1/m$, independent of the current state. Acceptance probability:
$$\alpha(y \mid x) = \min\left\{ 1, \frac{p(y)\, q(x \mid y)}{p(x)\, q(y \mid x)} \right\} = \min\left\{ 1, \frac{y}{x} \right\}.$$
This proposal scheme is clearly irreducible:
$$\mathbb{P}(X_{t+1} = y \mid X_t = x) \ge q(y \mid x)\, \alpha(y \mid x) = \frac{1}{m} \min(1, y/x) > 0.$$

24 Example: Discrete Distribution on a finite state space. Start from $X_0 = 1$. For $t = 1, \ldots, n-1$:
1. Simulate $Y_t \sim U\{1, 2, \ldots, m\}$ and $U_t \sim U[0,1]$.
2. If $U_t \le Y_t / X_{t-1}$, set $X_t = Y_t$; otherwise set $X_t = X_{t-1}$.
For $t$ large, $X_t \approx_d X$.
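A Python sketch of this sampler (ours), following the steps above; the empirical visit frequencies can then be compared with $p(i) = 2i/(m(m+1))$, as in the figures that follow.

```python
import numpy as np

def mh_linear_pmf(m, n, rng=None):
    """MH chain targeting p(i) proportional to i on {1, ..., m}."""
    rng = rng or np.random.default_rng()
    x = 1                                 # X_0 = 1
    chain = np.empty(n, dtype=int)
    chain[0] = x
    for t in range(1, n):
        y = rng.integers(1, m + 1)        # Y_t ~ U{1, ..., m}
        if rng.uniform() <= y / x:        # accept w.p. min(1, y/x)
            x = y
        chain[t] = x
    return chain

chain = mh_linear_pmf(m=20, n=10_000)
# Empirical visit frequencies V_i(n)/n, to compare with p(i).
freqs = np.bincount(chain, minlength=21)[1:] / len(chain)
```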

25 Example: Discrete Distribution on a finite state space. Figure: realization of the MH Markov chain for $n = 100$ with $m = 20$.

26 Example: Discrete Distribution on a finite state space. Figure: average number of visits $V_i(n)/n$ ($n = 1000$) and target pmf $p(i)$.

27 Example: Discrete Distribution on a finite state space. Figure: average number of visits $V_i(n)/n$ ($n = 10{,}000$) and target pmf $p(i)$.

28 Example: Poisson Distribution. We want to simulate $X \sim p$ with $p(x) = e^{-\lambda} \lambda^x / x! \propto \lambda^x / x!$ on $\Omega = \{0, 1, 2, \ldots\}$. For the proposal we use
$$q(y \mid x) = \begin{cases} 1/2 & \text{for } y = x \pm 1,\ x \ge 1, \\ 1 & \text{for } x = 0,\ y = 1, \\ 0 & \text{otherwise}, \end{cases}$$
i.e. toss a coin and add or subtract 1 to $x$ to obtain $y$. Acceptance probability:
$$\alpha(y \mid x) = \begin{cases} \min\left(1, \frac{\lambda}{x+1}\right) & \text{if } y = x + 1,\ x \ge 1, \\ \min\left(1, \frac{x}{\lambda}\right) & \text{if } y = x - 1,\ x \ge 2, \end{cases}$$
and $\alpha(1 \mid 0) = \min(1, \lambda/2)$, $\alpha(0 \mid 1) = \min(1, 2/\lambda)$. The Markov chain is irreducible (check!).

29 Example: Poisson Distribution. Set $X_0 = 1$. For $t = 1, \ldots, n-1$:
1. If $X_{t-1} = 0$, set $Y_t = 1$.
2. Otherwise, simulate $V_t \sim U[0,1]$:
  2.1 if $V_t \le 1/2$, set $Y_t = X_{t-1} + 1$;
  2.2 otherwise set $Y_t = X_{t-1} - 1$.
3. Simulate $U_t \sim U[0,1]$.
4. If $U_t \le \alpha(Y_t \mid X_{t-1})$, set $X_t = Y_t$; otherwise set $X_t = X_{t-1}$.
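A direct transcription of these steps into Python (a sketch; the factors $1/2$ and $2$ in the acceptance ratio come from the asymmetric proposal at the boundary state 0):

```python
import numpy as np

def mh_poisson(lam, n, rng=None):
    """MH chain targeting Poisson(lam) via a +/- 1 random-walk proposal."""
    rng = rng or np.random.default_rng()
    x = 1                                        # X_0 = 1
    chain = np.empty(n, dtype=int)
    chain[0] = x
    for t in range(1, n):
        if x == 0:
            y = 1                                # forced move from 0
        else:
            y = x + 1 if rng.uniform() <= 0.5 else x - 1
        # Ratio p(y) q(x|y) / (p(x) q(y|x)); the boundary cases x = 0
        # and y = 0 carry the extra proposal factor 1/2 or 2.
        if y == x + 1:
            ratio = lam / (x + 1) * (0.5 if x == 0 else 1.0)
        else:
            ratio = x / lam * (2.0 if y == 0 else 1.0)
        if rng.uniform() <= min(1.0, ratio):
            x = y
        chain[t] = x
    return chain
```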

30 Example: Poisson distribution. Figure: realization of the MH Markov chain for $n = 500$ with $\lambda = 20$.

31 Example: Poisson distribution. Figure: average number of visits $V_i(n)/n$ ($n = 1000$) and target pmf $p(i)$.

32 Example: Poisson distribution. Figure: average number of visits $V_i(n)/n$ ($n = 10{,}000$) and target pmf $p(i)$.

33 Example: Image. Consider an $m_1 \times m_2$ image, where $I(i,j) \in \{0, 1, \ldots, 255\}$ is the gray level of pixel $(i,j) \in \Omega = \{0, \ldots, m_1 - 1\} \times \{0, \ldots, m_2 - 1\}$. Consider a discrete random variable taking values in $\Omega$, with unnormalized pmf $\tilde p((i,j)) = I(i,j)$. Proposal transition probabilities: $q((y_1, y_2) \mid (x_1, x_2)) = q_1(y_1 \mid x_1)\, q_2(y_2 \mid x_2)$ with
$$q_1(y_1 \mid x_1) = \begin{cases} 1/3 & \text{if } y_1 = x_1 \pm 1 \pmod{m_1} \text{ or } y_1 = x_1, \\ 0 & \text{otherwise}, \end{cases}$$
and similarly for $q_2(y_2 \mid x_2)$.

34 Example: Simulation of an image. Figure: average number of visits to each pixel $(i,j)$, $V_{(i,j)}(n) = \frac{1}{n} \sum_{t=0}^{n-1} \mathbb{I}(X_t = (i,j))$.

35 Example: Simulation of an image. Figure: target pmf.

36 Metropolis-Hastings algorithm on $\mathbb{R}^d$. The Metropolis-Hastings algorithm generalizes to a continuous state space $\Omega \subseteq \mathbb{R}^d$, where 1. $p$ is a pdf on $\Omega$, and 2. $q(\cdot \mid x)$ is a pdf on $\Omega$ for any $x \in \Omega$. The Metropolis-Hastings algorithm thus defines a Markov chain on $\Omega \subseteq \mathbb{R}^d$. A precise treatment of Markov chains on $\mathbb{R}^d$ is beyond the scope of this course; we just state the most important results without proof. Assume for simplicity that $p(x) > 0$ for all $x \in \Omega$.

37 Metropolis-Hastings algorithm on $\mathbb{R}^d$. The Markov chain $X_0, X_1, \ldots$ on $\Omega \subseteq \mathbb{R}^d$ is irreducible if for any $x \in \Omega$ and any $A \subseteq \Omega$ with $\int_A p(x)\,\mathrm{d}x > 0$, there is $n$ such that $\mathbb{P}(X_n \in A \mid X_0 = x) > 0$. Theorem. If the Metropolis-Hastings chain is irreducible, then for any function $\varphi$ such that $\mathbb{E}_p[|\varphi(X)|] < \infty$, the MH estimator is strongly consistent:
$$\hat\theta_n^{\mathrm{MH}} = \frac{1}{n} \sum_{t=0}^{n-1} \varphi(X_t) \to \theta \quad \text{almost surely as } n \to \infty.$$

38 Example: Gaussian distribution. Let $X \sim N(\mu, \sigma^2)$ with $p(x) = (2\pi\sigma^2)^{-1/2} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. MH algorithm with target pdf $p$ and proposal transition pdf
$$q(y \mid x) = \begin{cases} 1 & \text{for } y \in [x - 1/2,\ x + 1/2], \\ 0 & \text{otherwise}. \end{cases}$$
Acceptance probability:
$$\alpha(y \mid x) = \min\left( 1, \frac{p(y)\, q(x \mid y)}{p(x)\, q(y \mid x)} \right) = \min\left( 1,\ e^{-\frac{(y-\mu)^2}{2\sigma^2} + \frac{(x-\mu)^2}{2\sigma^2}} \right).$$
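A Python sketch of this sampler (ours); since the uniform random-walk proposal is symmetric, $q$ cancels and only $p(y)/p(x)$ is needed, which we evaluate on the log scale for numerical stability.

```python
import numpy as np

def mh_gaussian(mu, sigma, n, x0=0.0, rng=None):
    """MH chain targeting N(mu, sigma^2) with a width-1 uniform random walk."""
    rng = rng or np.random.default_rng()
    x = x0
    chain = np.empty(n)
    chain[0] = x
    for t in range(1, n):
        y = x + rng.uniform(-0.5, 0.5)     # Y ~ U[x - 1/2, x + 1/2]
        # Symmetric proposal: alpha = min(1, p(y)/p(x)).
        log_ratio = ((x - mu) ** 2 - (y - mu) ** 2) / (2 * sigma ** 2)
        if rng.uniform() <= np.exp(min(0.0, log_ratio)):
            x = y
        chain[t] = x
    return chain
```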

39 Example: Gaussian distribution. Figure: realizations from the MH Markov chain $(X_0, \ldots, X_{1000})$ with $X_0 = 0$.

40 Example: Gaussian distribution. Figure: MH estimates $\hat\theta_t = \frac{1}{t} \sum_{i=0}^{t-1} X_i$ of $\theta = \mathbb{E}_p[X]$ for different realizations of the Markov chain.

41 Example: Mixture of Gaussians distribution. Let
$$p(x) = \pi_1 (2\pi\sigma_1^2)^{-1/2} e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} + \pi_2 (2\pi\sigma_2^2)^{-1/2} e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}}.$$
MH algorithm with target pdf $p$ and proposal transition pdf
$$q(y \mid x) = \begin{cases} 1 & \text{for } y \in [x - 1/2,\ x + 1/2], \\ 0 & \text{otherwise}. \end{cases}$$
Acceptance probability:
$$\alpha(y \mid x) = \min\left( 1, \frac{p(y)\, q(x \mid y)}{p(x)\, q(y \mid x)} \right) = \min\left( 1,\ p(y)/p(x) \right).$$

42 Example: Mixture of Gaussians distribution. Figure: realization of the Markov chain with $X_0 = 0$.

43 Example. Consider a Metropolis algorithm for simulating samples from a bivariate normal distribution with mean $\mu = (0, 0)$ and covariance matrix $\Sigma$, using a proposal distribution for each component: $x_i' \sim U(x_i - w, x_i + w)$, $i \in \{1, 2\}$. The performance of the algorithm can be controlled by setting $w$. If $w$ is small, then we propose only small moves and the chain moves slowly around the parameter space. If $w$ is large, then we may accept only a few moves. There is an art to implementing MCMC that involves the choice of good proposal distributions; a sketch of this sampler is given below.

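A sketch of this random-walk Metropolis sampler in Python (ours); the covariance entries were lost in transcription, so the $\Sigma$ below is a made-up stand-in, and $w$ is the tuning parameter discussed above.

```python
import numpy as np

def metropolis_bvn(Sigma, w, n, x0=(0.0, 0.0), rng=None):
    """Random-walk Metropolis for a bivariate N((0,0), Sigma) target."""
    rng = rng or np.random.default_rng()
    Sigma_inv = np.linalg.inv(Sigma)
    log_p = lambda x: -0.5 * x @ Sigma_inv @ x   # log target up to a constant
    x = np.asarray(x0, dtype=float)
    chain = np.empty((n, 2))
    chain[0] = x
    for t in range(1, n):
        y = x + rng.uniform(-w, w, size=2)       # perturb each component
        # Symmetric proposal, so alpha = min(1, p(y)/p(x)).
        if rng.uniform() <= np.exp(min(0.0, log_p(y) - log_p(x))):
            x = y
        chain[t] = x
    return chain

# Tuning illustration: small w mixes slowly, large w rejects often.
Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])       # made-up covariance
chain = metropolis_bvn(Sigma, w=0.1, n=10_000)
```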

48 Example (figure): $N = 4$, $w = 0.1$.

49 Example (figure): $N = 4$, $w = 0.1$.

50 Example (figure): $N = 4$, $w = 0.1$.

51 Example (figure): $N = 4$, $w = 0.01$.

52 Example (figure): $N = 4$, $w = 0.01$.

53 Example (figure): $N = 4$, $w = 10$.

54 MCMC: Practical aspects. The MCMC chain does not start from the invariant distribution, so $\mathbb{E}[\varphi(X_t)] \ne \mathbb{E}[\varphi(X)]$, and the difference can be significant for small $t$; $X_t$ converges in distribution to $p$ as $t \to \infty$. Common practice is to discard the first $b$ values of the Markov chain, $X_0, \ldots, X_{b-1}$, where we assume that $X_b$ is approximately distributed from $p$, and to use the estimator
$$\frac{1}{n-b} \sum_{t=b}^{n-1} \varphi(X_t).$$
The initial segment $X_0, \ldots, X_{b-1}$ is called the burn-in period of the MCMC chain.
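In code, the burn-in estimator is simply a slice of the stored chain (a sketch, reusing for instance the Gaussian sampler above):

```python
import numpy as np

def mcmc_estimate(chain, phi, burn_in):
    """Estimate E_p[phi(X)] discarding the first burn_in states."""
    kept = chain[burn_in:]                  # X_b, ..., X_{n-1}
    return np.mean([phi(x) for x in kept])  # (1/(n-b)) sum_{t=b}^{n-1} phi(X_t)

# e.g. theta_hat = mcmc_estimate(mh_gaussian(0.0, 1.0, 20_000), lambda x: x, 1_000)
```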
