Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC
1 Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin, UIC. Based on Chapters 8-9 in Givens & Hoeting and chapters in Lange. Updated: April 4.
2 Outline: 1 Introduction; 2 Crash course on Markov chains; 3 Motivation, revisited; 4 Metropolis-Hastings algorithm; 5 Gibbs sampler; 6 Some MCMC diagnostics; 7 Conclusion
3 Motivation We know how to sample independent random variables from the target distribution f(x), at least approximately. Monte Carlo uses these simulated random variables to approximate integrals. But the random variables don't need to be independent in order to accurately approximate integrals! Markov chain Monte Carlo (MCMC) constructs a dependent sequence of random variables that can be used to approximate integrals just like ordinary Monte Carlo. The advantage of introducing this dependence is that very general black-box algorithms (and corresponding theory) are available to perform the required simulations.
4 Initial remarks In some sense, MCMC is basically a black-box approach that works for almost all problems. The previous remark is dangerous: it's always a bad idea to use tools without knowing that they will work. Here we will discuss some basics of Markov chains and MCMC, but know that there are very important unanswered questions about how and when MCMC works. (MCMC is an active area of research; despite the many developments in MCMC, according to Diaconis (2008), still "almost nothing is known!")
6 Markov chains A Markov chain is just a sequence of random variables {X_1, X_2, ...} with a specific type of dependence structure. In particular, a Markov chain satisfies P(X_{n+1} ∈ B | X_1, ..., X_n) = P(X_{n+1} ∈ B | X_n), (*) i.e., the future, given the past and present, depends only on the present. Independence is a trivial Markov chain. From (*) we can argue that the probabilistic properties of the chain are completely determined by: the initial distribution, for X_0, and the transition distribution, the distribution of X_{n+1} given X_n. (Assume the Markov chain is homogeneous, so that the transition distribution does not depend on n.)
7 Example: simple random walk Let U_1, U_2, ... be iid Unif({-1, 1}). Set X_0 = 0 and let X_n = Σ_{i=1}^n U_i = X_{n-1} + U_n. The initial distribution is P{X_0 = 0} = 1. The transition distribution is determined by X_n = X_{n-1} + 1 with prob 1/2 and X_n = X_{n-1} - 1 with prob 1/2. While very simple, the random walk is an important example in probability, having connections to advanced things like Brownian motion.
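The random walk above takes two lines to simulate; a minimal Python sketch (function name and seed are my own choices):

```python
import random

def simple_random_walk(n_steps, seed=1):
    """Simulate X_n = X_{n-1} + U_n with U_n uniform on {-1, +1} and X_0 = 0."""
    random.seed(seed)
    x, path = 0, [0]
    for _ in range(n_steps):
        x += random.choice([-1, 1])  # the Unif({-1, 1}) innovation U_n
        path.append(x)
    return path

path = simple_random_walk(1000)
```

Plotting `path` against the step index gives the familiar jagged random-walk trajectory.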
8 Keywords (not mathematically precise!) A state A is recurrent if a chain starting in A will eventually return to A with probability 1. A recurrent state is nonnull if the expected time to return is finite. A chain is called recurrent if each state is recurrent. A Markov chain is irreducible if there is positive probability that a chain starting in a state A can reach any other state B. A Markov chain is aperiodic if, for a starting state A, there is no constraint on the times at which the chain can return to A. An irreducible, aperiodic Markov chain with all states being nonnull recurrent is called ergodic.
9 Limit theory (again, not mathematically precise!) f is a stationary distribution if X_0 ~ f implies X_n ~ f for all n. An ergodic Markov chain has at most one stationary dist. Furthermore, if the chain is ergodic, then lim_{n→∞} P(X_{m+n} ∈ B | X_m ∈ A) = ∫_B f(x) dx, for all A, B, m. Even further, if ϕ(x) is integrable, then (1/n) Σ_{t=1}^n ϕ(X_t) → ∫ ϕ(x) f(x) dx, with prob 1. This is a version of the famous ergodic theorem. There are also central limit theorems for Markov chains, but I won't say anything about this.
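The ergodic theorem can be seen numerically on a chain whose stationary distribution is known. A sketch using an AR(1) chain (my own example; the chain X_t = ρX_{t-1} + sqrt(1-ρ²)Z_t has stationary distribution N(0,1)), where long-run averages of ϕ(X_t) match the N(0,1) moments even though the X_t are dependent:

```python
import math, random

def ar1_chain(n, rho=0.5, x0=10.0, seed=42):
    """AR(1) chain X_t = rho*X_{t-1} + sqrt(1-rho^2)*Z_t; stationary dist N(0,1)."""
    random.seed(seed)
    x, out = x0, []
    s = math.sqrt(1 - rho * rho)
    for _ in range(n):
        x = rho * x + s * random.gauss(0, 1)
        out.append(x)
    return out

chain = ar1_chain(200_000)
# Ergodic averages approximate expectations under the stationary N(0,1):
mean_x = sum(chain) / len(chain)                 # ~ E(X) = 0
mean_x2 = sum(x * x for x in chain) / len(chain)  # ~ E(X^2) = 1
```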
11 Why MCMC? In Monte Carlo applications, we want to generate random variables with distribution f. This could be difficult or impossible to do exactly. MCMC is designed to construct an ergodic Markov chain with f as its stationary distribution. Asymptotically, the chain will resemble samples from f. In particular, by the ergodic theorem, expectations with respect to f can be approximated by averages. Somewhat surprising is that it is actually quite easy to construct and simulate a suitable Markov chain, which explains why MCMC methods have become so popular. But, of course, there are practical and theoretical challenges...
13 Details Let f(x) denote the target distribution pdf. Let q(x | y) denote a conditional pdf for X, given Y = y; this pdf should be easy to sample from. Given X_0, the Metropolis-Hastings (MH) algorithm produces a sequence of random variables as follows: 1. Sample X*_t ~ q(x | X_{t-1}). 2. Compute R = min{1, [f(X*_t) q(X_{t-1} | X*_t)] / [f(X_{t-1}) q(X*_t | X_{t-1})]}. 3. Set X_t = X*_t with probability R; otherwise, X_t = X_{t-1}. General R code to implement MH is on the course website.
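The three steps above can be sketched generically; this is a Python stand-in for the R code on the course website (function names and the log-scale bookkeeping are my own), working with log densities for numerical stability:

```python
import math, random

def metropolis_hastings(log_f, proposal_sample, log_q, x0, n_iter, seed=0):
    """Generic Metropolis-Hastings sketch.
    log_f: log target density (up to an additive constant)
    proposal_sample(y): draw X* ~ q(. | y)
    log_q(x, y): log q(x | y)
    """
    random.seed(seed)
    x, chain = x0, []
    for _ in range(n_iter):
        x_star = proposal_sample(x)
        # log of the MH ratio f(x*) q(x | x*) / [f(x) q(x* | x)]
        log_r = (log_f(x_star) + log_q(x, x_star)) - (log_f(x) + log_q(x_star, x))
        if math.log(random.random()) < min(0.0, log_r):
            x = x_star  # accept; otherwise keep the current state
        chain.append(x)
    return chain

# Example: random-walk proposal targeting N(0,1)
chain = metropolis_hastings(
    log_f=lambda x: -0.5 * x * x,
    proposal_sample=lambda y: y + random.uniform(-1, 1),
    log_q=lambda x, y: 0.0,  # symmetric proposal, so q cancels in the ratio
    x0=0.0, n_iter=100_000)
```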
14 Details (cont.) The proposal distribution is not easy to choose, and the performance of the algorithm depends on this choice. Two general strategies are: Take the proposal q(x | y) = q(x); i.e., at each stage of the MH algorithm, the proposal X*_t does not depend on X_{t-1} (an independence proposal). Take q(x | y) = q_0(x - y), for a symmetric distribution with pdf q_0, which amounts to a random walk proposal. This is one aspect of the MCMC implementation that requires a lot of care from the user; deeper understanding is needed to really see how the proposal affects the performance. In my examples, I will just pick a proposal that seems to work reasonably well...
15 Details (cont.) (again, not mathematically precise!) Assuming the proposal is not too bad, a number of things can be shown about the sequence {X_t : t ≥ 1}: the chain is ergodic, and the target f is the stationary distribution. Consequently, the sequence converges to the stationary distribution and, for any integrable function ϕ(x), we can approximate integrals with sample averages. So, provided that we run the simulation long enough, we should be able to get arbitrarily good approximations. This is an interesting scenario where we, as statisticians, are able to control the sample size!
16 Example: cosine model Consider the problem from an old homework, where the likelihood function is L(θ) ∝ ∏_{i=1}^n {1 - cos(X_i - θ)}, -π ≤ θ ≤ π. Observed data (X_1, ..., X_n) given in the code. Assume that θ is given a Unif(-π, π) prior distribution. Use MH to sample from the posterior: Proposal: q(θ* | θ) = Unif(θ ± 0.5). Burn-in B and Monte Carlo sample size M are set in the code.
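A sketch of this sampler, using the Unif(θ ± 0.5) proposal from the slide. The observed data and the values of B and M live in the course code, so simulated stand-in data and my own B, M, theta_true, and seed are assumptions here:

```python
import math, random

random.seed(7)
theta_true, n = 0.8, 100
data = []
while len(data) < n:
    # target density (1 - cos(x - theta))/(2*pi) on (-pi, pi):
    # accept-reject from the uniform, accepting with prob (1 - cos)/2 <= 1
    x = random.uniform(-math.pi, math.pi)
    if random.random() < (1 - math.cos(x - theta_true)) / 2:
        data.append(x)

def log_post(theta):
    """Log posterior: flat Unif(-pi, pi) prior times the cosine likelihood."""
    if not -math.pi < theta < math.pi:
        return -math.inf
    return sum(math.log(1 - math.cos(x - theta)) for x in data)

B, M = 2_000, 18_000
theta, lp, chain = 0.0, None, []
lp = log_post(theta)
for t in range(B + M):
    prop = theta + random.uniform(-0.5, 0.5)  # Unif(theta +/- 0.5) proposal
    lp_prop = log_post(prop)
    if math.log(random.random()) < lp_prop - lp:  # symmetric proposal: ratio of posteriors
        theta, lp = prop, lp_prop
    if t >= B:
        chain.append(theta)
```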
17 Example: cosine model (cont.) Left figure shows a histogram of the MCMC sample with the posterior density overlaid. Right figure shows a trace plot of the chain.
18 Example: Weibull model Data X_1, ..., X_n have Weibull likelihood L(α, η) ∝ α^n η^n exp{ α Σ_{i=1}^n log X_i - η Σ_{i=1}^n X_i^α }. Prior: π(α, η) ∝ e^{-α} η^{b-1} e^{-cη}, for some (b, c). The posterior density is proportional to α^n η^{n+b-1} exp{ α(Σ_{i=1}^n log X_i - 1) - η(c + Σ_{i=1}^n X_i^α) }. Exponential is a special case of Weibull, when α = 1. Goal: informal Bayesian test of H_0: α = 1.
19 Example: Weibull model (cont.) Data from Problem 7.11 in Ghosh et al (2006). Use MH to sample from the posterior of (α, η). Proposal: (α*, η*) | (α, η) ~ Exp(α) × Exp(η). b = 2 and c = 1; B = 1000; M is set in the code. Histogram shows the marginal posterior of α. Is an exponential model (α = 1) reasonable?
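A sketch of MH for this posterior. The actual data (Ghosh et al 2006) and the slides' exponential proposal are replaced by assumptions of mine: simulated Weibull data with shape alpha0 and rate eta0, and a Gaussian random-walk proposal (symmetric, so the proposal ratio cancels):

```python
import math, random

random.seed(3)
alpha0, eta0, n = 1.5, 1.0, 100
# rate-eta Weibull: pass scale = eta^(-1/alpha) to random.weibullvariate(scale, shape)
data = [random.weibullvariate(eta0 ** (-1 / alpha0), alpha0) for _ in range(n)]
sum_log = sum(math.log(x) for x in data)
b, c = 2.0, 1.0  # prior hyperparameters from the slide

def log_post(a, e):
    """log of alpha^n eta^(n+b-1) exp{ a(sum log X - 1) - e(c + sum X^a) }."""
    if a <= 0 or e <= 0:
        return -math.inf
    return (n * math.log(a) + (n + b - 1) * math.log(e)
            + a * (sum_log - 1) - e * (c + sum(x ** a for x in data)))

a, e = 1.0, 1.0
lp = log_post(a, e)
chain = []
for t in range(20_000):
    a_prop, e_prop = a + random.gauss(0, 0.15), e + random.gauss(0, 0.15)
    lp_prop = log_post(a_prop, e_prop)
    if math.log(random.random()) < lp_prop - lp:
        a, e, lp = a_prop, e_prop, lp_prop
    if t >= 4_000:  # burn-in
        chain.append((a, e))
```

The informal test of H_0: α = 1 then amounts to checking whether the histogram of the sampled α values puts appreciable mass near 1.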
20 Example: logistic regression Based on Examples 1.13 and 7.11 in Robert & Casella's book. In 1986, the Challenger space shuttle exploded during take-off, the result of an o-ring failure. The failure may have been due to the cold temperature (31°F). Goal: analyze the relationship between temperature and o-ring failure. In particular, fit a logistic regression model.
21 Example: logistic regression (cont.) Model: Y | x ~ Ber(p(x)), x = temperature. The failure probability p(x) is of the form p(x) = exp(α + βx) / (1 + exp(α + βx)). Using the available data, fit the logistic regression using glm: in the coefficient table, both the intercept and the slope on x come out significant (flagged * in the R output). Note that p(31) ≈ 0.999!!!
22 Example: logistic regression (cont.) Can also do a Bayesian analysis of this logistic model. Use MH to obtain samples from the posterior of (α, β). Samples can be used to approximate the posterior distribution of p(x_0) for any fixed x_0, e.g., x_0 = 65, 31; see below. Details about the prior and proposal construction are given in the R code and a short write-up posted on the course website. [Figures: posterior histograms of p(65) and p(31).]
24 Setup Suppose we have a multivariate target distribution f. MH can be applied to such a problem, but there are challenges in constructing a good proposal over multiple dimensions. Idea: sample one dimension at a time. Question: how to carry out the sampling so that it will approximate the target, at least in the limit? The Gibbs sampler is the right tool for the job.
25 Details Suppose we have a trivariate target f(x) = f(x_1, x_2, x_3). Suppose we can write down the set of full conditionals f(x_1 | x_2, x_3), f(x_2 | x_1, x_3), f(x_3 | x_1, x_2), and that these can be sampled from. The Gibbs sampler generates a sequence {X^(t) : t ≥ 0} by iteratively sampling from the conditionals: X_1^(t) ~ f(x_1 | X_2^(t-1), X_3^(t-1)); X_2^(t) ~ f(x_2 | X_1^(t), X_3^(t-1)); X_3^(t) ~ f(x_3 | X_1^(t), X_2^(t)).
26 Details (cont.) The Gibbs sequence forms a Markov chain. In fact, the Gibbs sampler is a special case of MH! The connection to MH is made by considering Gibbs as a sequence of updates of one component of X at a time. The acceptance probability for this form of MH update is exactly 1, which explains why the Gibbs sampler has no accept/reject step. Since Gibbs is a special kind of MH, the convergence theory for MH applies to Gibbs as well.
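The acceptance-probability claim can be checked directly. For the MH move that updates component j by proposing x'_j from the full conditional, holding x_{-j} fixed:

```latex
% Proposal for updating component j: q(x' \mid x) = f(x'_j \mid x_{-j}),
% with x'_{-j} = x_{-j}. Factoring f(x) = f(x_j \mid x_{-j}) f(x_{-j}),
% and similarly for f(x'), the MH ratio is
\frac{f(x')\, q(x \mid x')}{f(x)\, q(x' \mid x)}
  = \frac{f(x'_j \mid x_{-j})\, f(x_{-j})\, f(x_j \mid x_{-j})}
         {f(x_j \mid x_{-j})\, f(x_{-j})\, f(x'_j \mid x_{-j})}
  = 1,
\qquad \text{so } R = \min\{1, 1\} = 1 .
```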
27 Example: bivariate normal A super-simple Gibbs example: bivariate normal. Suppose X = (X_1, X_2) is bivariate standard normal with correlation ρ. The full conditionals are easy to write down here. Gibbs steps: X_1^(t) ~ N(ρ X_2^(t-1), 1 - ρ²); X_2^(t) ~ N(ρ X_1^(t), 1 - ρ²). Not as efficient as direct sampling, but works fine.
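The two Gibbs steps translate directly into code; a minimal sketch (ρ = 0.6, the seed, and the chain length are my choices):

```python
import math, random

def gibbs_bivariate_normal(n_iter, rho=0.6, seed=11):
    """Gibbs sampler for bivariate standard normal (X1, X2) with correlation rho."""
    random.seed(seed)
    s = math.sqrt(1 - rho * rho)  # conditional sd sqrt(1 - rho^2)
    x1, x2 = 0.0, 0.0
    out = []
    for _ in range(n_iter):
        x1 = random.gauss(rho * x2, s)  # X1 | X2 ~ N(rho*X2, 1 - rho^2)
        x2 = random.gauss(rho * x1, s)  # X2 | X1 ~ N(rho*X1, 1 - rho^2)
        out.append((x1, x2))
    return out

draws = gibbs_bivariate_normal(100_000)
```

The empirical means, variances, and correlation of `draws` should match the target's (0, 1, and ρ).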
28 Example: many-normal-means Model: X_i ~ N(θ_i, 1), independently, i = 1, ..., n. Hierarchical prior distribution: (θ_1, ..., θ_n) | ψ iid N(0, ψ^{-1}), ψ ~ Gamma(a, b). It takes some work, but it can be shown (the easiest argument is based on standard conjugate priors) that the full conditionals are θ_i | (X_i, ψ) ~ N(X_i / (1 + ψ), 1 / (1 + ψ)), independently for i = 1, ..., n, and ψ | (θ, X) ~ Gamma(a + n/2, b + Σ_{i=1}^n θ_i² / 2). So the Gibbs sampler is pretty easy to implement...
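A sketch of that Gibbs sampler using the two full conditionals above; the simulated data, seed, and chain lengths are my choices (the slide's n = 10 and a = b = 1 are kept):

```python
import random

random.seed(5)
n, a, b = 10, 1.0, 1.0
X = [1.0 + random.gauss(0, 1) for _ in range(n)]  # data with every theta_i = 1

psi, draws = 1.0, []
for t in range(20_000):
    # theta_i | (X_i, psi) ~ N(X_i/(1 + psi), 1/(1 + psi)), independently
    theta = [random.gauss(x / (1 + psi), (1 + psi) ** -0.5) for x in X]
    # psi | theta ~ Gamma(a + n/2, rate = b + sum theta_i^2 / 2);
    # random.gammavariate takes a SCALE parameter, hence 1/rate
    rate = b + sum(th * th for th in theta) / 2
    psi = random.gammavariate(a + n / 2, 1 / rate)
    if t >= 2_000:  # burn-in
        draws.append(sum(th * th for th in theta))  # draws of ||theta||^2

bayes_est = sum(draws) / len(draws)  # approximates E(||theta||^2 | X)
```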
29 Example: many-normal-means (cont.) Suppose the goal is to estimate ‖θ‖² = Σ_{i=1}^n θ_i². In general, the MLE ‖X‖² is lousy. However, the Bayes estimator, E(‖θ‖² | X), is better and can be evaluated by running the Gibbs sampler. Can use the Rao-Blackwellized estimator of E(θ_i² | X) to reduce the variance. Simulation study to compare the Bayes estimator with the MLE: n = 10, θ = (1, 1, ..., 1); 1000 reps, 5000 Monte Carlo samples, a = b = 1. The resulting MSE table (mle.mse vs. bayes.mse) favors the Bayes estimator.
30 Example: capture-recapture Example in Lange. Consider a lake that contains N fish, where N is unknown. Capture-recapture study: on n occasions, fish are caught, marked, and returned. At occasion i = 1, ..., n, record C_i = number of fish caught at time i and R_i = number of recaptures (already-marked fish) at time i. C_i - R_i is the number of new fish caught at time i. Set U_i = Σ_{j=1}^i (C_j - R_j), the number of marked fish after occasion i. The model assumes independent binomial sampling.
31 Example: capture-recapture (cont.) Introduce binomial success probabilities (ω_1, ..., ω_n). Writing C(n, k) for the binomial coefficient, the likelihood for (N, ω) is
L(N, ω) = ∏_{i=1}^n C(U_{i-1}, R_i) ω_i^{R_i} (1 - ω_i)^{U_{i-1} - R_i} × C(N - U_{i-1}, C_i - R_i) ω_i^{C_i - R_i} (1 - ω_i)^{N - U_{i-1} - C_i + R_i}
= ∏_{i=1}^n C(U_{i-1}, R_i) C(N - U_{i-1}, C_i - R_i) ω_i^{C_i} (1 - ω_i)^{N - C_i}
∝ [N! / (N - U_n)!] ∏_{i=1}^n C(U_{i-1}, R_i) ω_i^{C_i} (1 - ω_i)^{N - C_i},
since the coefficients C(N - U_{i-1}, C_i - R_i) telescope to N! / [(N - U_n)! ∏_i (C_i - R_i)!] and the factors (C_i - R_i)! are free of (N, ω). Priors: N ~ Pois(m) and ω_i ~ Beta(a, b), independently.
32 Example: capture-recapture (cont.) The posterior distribution for (N, ω) is proportional to
[m^N / (N - U_n)!] ∏_{i=1}^n C(U_{i-1}, R_i) ω_i^{C_i + a - 1} (1 - ω_i)^{N - C_i + b - 1}.
To run a Gibbs sampler, we need the full conditionals. The distribution of (ω_1, ..., ω_n), given N and the data, is clearly ω_i | (N, data) ~ Beta(a + C_i, b + N - C_i), independently for i = 1, ..., n. The distribution of N, given ω and the data, is N | (ω, data) ~ U_n + Pois(m ∏_{i=1}^n (1 - ω_i)). Now the Gibbs sampler is easy to run...
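A sketch of this Gibbs sampler on simulated data. Everything numerical here is an assumption of mine: the true population N_true, the number of occasions, the per-occasion catch probability, the priors (m, a, b), and the seed; the Poisson sampler is Knuth's product-of-uniforms method since the Python stdlib has none:

```python
import math, random

random.seed(9)

def rpois(lam):
    """Poisson draw via Knuth's product-of-uniforms method (fine for moderate lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

# Simulate a capture-recapture data set
N_true, n_occ, p_catch = 200, 5, 0.2
marked, C, R, U = set(), [], [], []
for i in range(n_occ):
    caught = {j for j in range(N_true) if random.random() < p_catch}
    C.append(len(caught))          # fish caught at occasion i
    R.append(len(caught & marked))  # recaptures at occasion i
    marked |= caught
    U.append(len(marked))           # marked fish after occasion i

m, a, b = 200.0, 1.0, 1.0  # priors: N ~ Pois(m), omega_i ~ Beta(a, b)
N, draws = U[-1] + 50, []
for t in range(10_000):
    # omega_i | (N, data) ~ Beta(a + C_i, b + N - C_i)
    omega = [random.betavariate(a + C[i], b + N - C[i]) for i in range(n_occ)]
    # N | (omega, data) ~ U_n + Pois(m * prod(1 - omega_i))
    lam = m
    for w in omega:
        lam *= 1 - w
    N = U[-1] + rpois(lam)
    if t >= 1_000:  # burn-in
        draws.append(N)
```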
33 Example: probit regression Model: Y_i ~ Ber(Φ(x_i'β)), independently, i = 1, ..., n. Suppose β has a normal prior. It is not directly obvious how to implement Gibbs to get a sample from the posterior distribution of β. Recall, from the EM notes, that this model can be simplified by introducing some missing data. The conditional distribution of the missing data, given the observed data and β, makes up one part of the full conditionals. The other part of the full conditionals is simple since the model for the complete data is, by construction, nice.
34 Example: probit regression (cont.) Missing data: Z_1, ..., Z_n, where Z_i ~ N(x_i'β, 1) and Y_i = I(Z_i > 0), i = 1, ..., n. Full conditionals: The distribution of β, given (Y, Z), only depends on Z and is easy because the normal prior for β is conjugate. The distribution of Z, given (Y, β), is a truncated normal... Though I've not given the precise details, the steps for constructing a Gibbs sampler are not too difficult; see Section ... in Ghosh et al (2006) for details. (The only potential difficulty is simulating from a truncated normal when the truncation point is extreme, but remember we have talked about extreme normal tail probabilities before.)
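A sketch of this data-augmentation Gibbs sampler, with one covariate and no intercept so the conjugate β update stays short; the simulated data, prior variance tau2, and chain lengths are my assumptions. Truncated normals are drawn by inverse-CDF using the stdlib's `statistics.NormalDist` (adequate away from extreme truncation points, per the slide's caveat):

```python
import random
from statistics import NormalDist

nd = NormalDist()  # standard normal cdf / inverse cdf

def rtruncnorm(mu, positive, eps=1e-12):
    """Z ~ N(mu, 1) truncated to (0, inf) if positive, else to (-inf, 0]."""
    p0 = nd.cdf(-mu)  # P(Z <= 0)
    u = random.uniform(p0, 1.0) if positive else random.uniform(0.0, p0)
    return mu + nd.inv_cdf(min(max(u, eps), 1 - eps))

random.seed(13)
beta_true, n, tau2 = 1.0, 150, 10.0  # prior: beta ~ N(0, tau2)
x = [random.uniform(-2, 2) for _ in range(n)]
y = [1 if random.gauss(xi * beta_true, 1) > 0 else 0 for xi in x]

sxx = sum(xi * xi for xi in x)
beta, draws = 0.0, []
for t in range(3_000):
    # Z_i | (Y_i, beta): N(x_i * beta, 1) truncated to the side Y_i dictates
    z = [rtruncnorm(xi * beta, yi == 1) for xi, yi in zip(x, y)]
    # beta | Z ~ N(sum(x_i z_i)/(sxx + 1/tau2), 1/(sxx + 1/tau2)) by conjugacy
    prec = sxx + 1 / tau2
    mean = sum(xi * zi for xi, zi in zip(x, z)) / prec
    beta = random.gauss(mean, prec ** -0.5)
    if t >= 500:  # burn-in
        draws.append(beta)
```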
35 Example: Dirichlet process mixture In Bayesian nonparametrics, the Dirichlet process mixture (DPM) model is probably the most widely used. It is a flexible model for density estimation: a normal mixture density with unspecified component means (and variances) that also doesn't specify the number of components. The main challenge with using mixture models is choosing how many components to use; the DPM selects the number of components automatically. It is interesting that, despite being a nonparametric model, the computations are not too hard: just a Gibbs sampler. The simplest algorithm is in Escobar & West (JASA 1995); a nice slice sampler is proposed in Kalli et al (Stat Comp 2011).
36 Example: Dirichlet process mixture (cont.) Using the slice sampler from Kalli et al to fit the same normal mixture model to the galaxy data from the homework. R code for this is on my research page. [Figures: fitted density for Y_new (posterior mean vs. kernel estimate) and posterior probabilities for the number of components.]
38 Diagnostic plots Sample path plot, or trace plot: can reveal any residual dependence after burn-in. The idea is that a sample path of iid samples should show no trend, so if there is minimal trend in our sample path plot, then we can be comfortable treating the samples as roughly independent. Autocorrelation plot: plot the sample correlation of {(X_t, X_{t+r}) : t = 1, 2, ...} as a function of the lag r. We want to see the autocorrelation plot decay rapidly, suggesting that the dependence along the chain is not too strong. If these plots indicate that the chain has not yet converged to stationarity, then you can run the chain longer or make some other modifications, e.g., transformations or thinning.
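The quantity behind the autocorrelation plot is easy to compute by hand; a sketch (function name and the AR(1) test chain are my own), checking it against an AR(1) chain where the lag-r autocorrelation should decay like ρ^r:

```python
import random

def autocorrelation(chain, max_lag=20):
    """Sample autocorrelations of a scalar chain at lags 0..max_lag."""
    n = len(chain)
    m = sum(chain) / n
    c0 = sum((x - m) ** 2 for x in chain) / n  # lag-0 autocovariance
    acf = []
    for r in range(max_lag + 1):
        cr = sum((chain[t] - m) * (chain[t + r] - m) for t in range(n - r)) / n
        acf.append(cr / c0)
    return acf

# AR(1) chain with rho = 0.8: acf[r] should be roughly 0.8^r
random.seed(2)
rho, x, chain = 0.8, 0.0, []
for _ in range(50_000):
    x = rho * x + (1 - rho * rho) ** 0.5 * random.gauss(0, 1)
    chain.append(x)
acf = autocorrelation(chain)
```

Plotting `acf` against the lag gives the usual autocorrelation plot; rapid decay toward 0 is what we want to see.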
39 Other considerations The practical/theoretical rate of convergence can depend on the parametrization; see homework. There is no agreement in the stat community about how many chains to run, how long the burn-in should be, etc. Charles Geyer (Univ Minnesota) strongly supports running only one long chain; check out his rants. Gelman & Rubin suggest running several shorter chains with different starting points; the textbook gives their diagnostic test.
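The multi-chain diagnostic mentioned above compares between-chain and within-chain variability through the potential scale reduction factor; values near 1 suggest the chains agree. A sketch of the basic (non-split) version, demonstrated on several AR(1) chains with the same N(0,1) stationary distribution started from overdispersed points (the chain construction and all settings are my choices):

```python
import random

def gelman_rubin(chains):
    """Basic Gelman-Rubin potential scale reduction factor (R-hat)."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n * sum((mu - grand) ** 2 for mu in means) / (m - 1)  # between-chain
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)           # within-chain
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * W + B / n  # pooled variance estimate
    return (var_hat / W) ** 0.5

random.seed(4)
def run_chain(x0, n=5_000, rho=0.5):
    """AR(1) updates; stationary distribution N(0,1) regardless of x0."""
    x, out, s = x0, [], (1 - rho * rho) ** 0.5
    for _ in range(n):
        x = rho * x + s * random.gauss(0, 1)
        out.append(x)
    return out

chains = [run_chain(x0)[1_000:] for x0 in (-10, -3, 3, 10)]  # drop burn-in
rhat = gelman_rubin(chains)  # near 1 once the chains have mixed
```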
41 Remarks MCMC methods are powerful because they give fairly general procedures able to solve a variety of important problems. There are black-box software implementations: the mcmc package in R will do random walk MH; in SAS, PROC MCMC does similar things; BUGS ("Bayesian inference Using Gibbs Sampling"). However, it's a bad idea to blindly use these things without fully understanding what they're doing and whether or not they will actually work in your problem. It is also important to look at convergence diagnostics before the simulation results are used for inference.
42 Remarks (cont.) Our focus here was on relatively simple MCMC methods. Don't think of MH, Gibbs, etc. as separate methods; they can be combined. For example, if one of the full conditionals is difficult to sample from, one might consider an MH step within Gibbs (or accept-reject) to sample from this conditional. The book by Robert & Casella has some details about more advanced MCMC methods, including various combinations of these standard techniques.
Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).
More informationMarkov Chain Monte Carlo methods
Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As
More informationIntroduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov
More informationMarkov Chain Monte Carlo (MCMC)
Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can
More informationMarkov Chain Monte Carlo methods
Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning
More informationLecture 7 and 8: Markov Chain Monte Carlo
Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani
More informationLecture 8: The Metropolis-Hastings Algorithm
30.10.2008 What we have seen last time: Gibbs sampler Key idea: Generate a Markov chain by updating the component of (X 1,..., X p ) in turn by drawing from the full conditionals: X (t) j Two drawbacks:
More information27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling
10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel
More informationIntroduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016
Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An
More informationLecture 6: Markov Chain Monte Carlo
Lecture 6: Markov Chain Monte Carlo D. Jason Koskinen koskinen@nbi.ku.dk Photo by Howard Jackman University of Copenhagen Advanced Methods in Applied Statistics Feb - Apr 2016 Niels Bohr Institute 2 Outline
More information16 : Approximate Inference: Markov Chain Monte Carlo
10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution
More informationProbabilistic Graphical Models Lecture 17: Markov chain Monte Carlo
Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,
More informationBayesian GLMs and Metropolis-Hastings Algorithm
Bayesian GLMs and Metropolis-Hastings Algorithm We have seen that with conjugate or semi-conjugate prior distributions the Gibbs sampler can be used to sample from the posterior distribution. In situations,
More informationSC7/SM6 Bayes Methods HT18 Lecturer: Geoff Nicholls Lecture 2: Monte Carlo Methods Notes and Problem sheets are available at http://www.stats.ox.ac.uk/~nicholls/bayesmethods/ and via the MSc weblearn pages.
More informationSAMPLING ALGORITHMS. In general. Inference in Bayesian models
SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be
More informationThe Bayesian Choice. Christian P. Robert. From Decision-Theoretic Foundations to Computational Implementation. Second Edition.
Christian P. Robert The Bayesian Choice From Decision-Theoretic Foundations to Computational Implementation Second Edition With 23 Illustrations ^Springer" Contents Preface to the Second Edition Preface
More information17 : Markov Chain Monte Carlo
10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo
More information13 Notes on Markov Chain Monte Carlo
13 Notes on Markov Chain Monte Carlo Markov Chain Monte Carlo is a big, and currently very rapidly developing, subject in statistical computation. Many complex and multivariate types of random data, useful
More informationMarkov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018
Graphical Models Markov Chain Monte Carlo Inference Siamak Ravanbakhsh Winter 2018 Learning objectives Markov chains the idea behind Markov Chain Monte Carlo (MCMC) two important examples: Gibbs sampling
More informationMCMC: Markov Chain Monte Carlo
I529: Machine Learning in Bioinformatics (Spring 2013) MCMC: Markov Chain Monte Carlo Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2013 Contents Review of Markov
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters
More informationSTAT 425: Introduction to Bayesian Analysis
STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte
More informationBayesian Regression Linear and Logistic Regression
When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we
More informationBayesian Networks in Educational Assessment
Bayesian Networks in Educational Assessment Estimating Parameters with MCMC Bayesian Inference: Expanding Our Context Roy Levy Arizona State University Roy.Levy@asu.edu 2017 Roy Levy MCMC 1 MCMC 2 Posterior
More informationMCMC Methods: Gibbs and Metropolis
MCMC Methods: Gibbs and Metropolis Patrick Breheny February 28 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/30 Introduction As we have seen, the ability to sample from the posterior distribution
More informationAdaptive Monte Carlo methods
Adaptive Monte Carlo methods Jean-Michel Marin Projet Select, INRIA Futurs, Université Paris-Sud joint with Randal Douc (École Polytechnique), Arnaud Guillin (Université de Marseille) and Christian Robert
More informationMarkov chain Monte Carlo
Markov chain Monte Carlo Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revised on April 24, 2017 Today we are going to learn... 1 Markov Chains
More informationMarkov chain Monte Carlo methods in atmospheric remote sensing
1 / 45 Markov chain Monte Carlo methods in atmospheric remote sensing Johanna Tamminen johanna.tamminen@fmi.fi ESA Summer School on Earth System Monitoring and Modeling July 3 Aug 11, 212, Frascati July,
More informationContents. Part I: Fundamentals of Bayesian Inference 1
Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian
More informationIntroduction to Markov Chain Monte Carlo & Gibbs Sampling
Introduction to Markov Chain Monte Carlo & Gibbs Sampling Prof. Nicholas Zabaras Sibley School of Mechanical and Aerospace Engineering 101 Frank H. T. Rhodes Hall Ithaca, NY 14853-3801 Email: zabaras@cornell.edu
More informationComputer intensive statistical methods
Lecture 13 MCMC, Hybrid chains October 13, 2015 Jonas Wallin jonwal@chalmers.se Chalmers, Gothenburg university MH algorithm, Chap:6.3 The metropolis hastings requires three objects, the distribution of
More informationHastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model
UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced
More informationBrief introduction to Markov Chain Monte Carlo
Brief introduction to Department of Probability and Mathematical Statistics seminar Stochastic modeling in economics and finance November 7, 2011 Brief introduction to Content 1 and motivation Classical
More informationSampling from complex probability distributions
Sampling from complex probability distributions Louis J. M. Aslett (louis.aslett@durham.ac.uk) Department of Mathematical Sciences Durham University UTOPIAE Training School II 4 July 2017 1/37 Motivation
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate
More informationMarkov chain Monte Carlo
Markov chain Monte Carlo Markov chain Monte Carlo (MCMC) Gibbs and Metropolis Hastings Slice sampling Practical details Iain Murray http://iainmurray.net/ Reminder Need to sample large, non-standard distributions:
More informationST 740: Markov Chain Monte Carlo
ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:
More informationHere θ = (α, β) is the unknown parameter. The likelihood function is