MCMC Methods: Gibbs and Metropolis

Patrick Breheny
BST 701: Bayesian Modeling in Biostatistics
February 28

Introduction

- As we have seen, the ability to sample from the posterior distribution is essential to the practice of Bayesian statistics, as it allows Monte Carlo estimation of all posterior quantities of interest
- Typically, however, direct sampling from the posterior is not possible
- Today, we discuss two mechanisms that allow us to carry out this sampling when a direct approach is not possible (Gibbs sampling and the Metropolis-Hastings algorithm), as well as why these approaches work

Markov chains

- A sequence of random variables $X^{(0)}, X^{(1)}, X^{(2)}, \ldots$ is said to form a Markov chain if, for all $t$, $p(X^{(t+1)} = x) = t(x \mid X^{(t)})$; in other words, the distribution of $X^{(t+1)}$ depends only on the previous draw and is independent of $X^{(0)}, X^{(1)}, \ldots, X^{(t-1)}$
- The function $t(x \mid X^{(t)})$ defines the transition probabilities, or transition distribution, of the chain

Stationary distributions

- A distribution $\pi(x)$ is stationary with respect to a Markov chain if, given that $X^{(t)} \sim \pi$, we also have $X^{(t+1)} \sim \pi$
- Provided that a Markov chain is positive recurrent, aperiodic, and irreducible (next slide), it will converge to a unique stationary distribution, also known as an equilibrium distribution, as $t \to \infty$
- This stationary distribution is determined entirely by the transition probabilities of the chain; the initial value of the chain is irrelevant in the long run
- In Bayesian statistics, we will be interested in constructing Markov chains whose equilibrium distribution is the posterior distribution
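In symbols, for a continuous state space with transition density $t(y \mid x)$, stationarity is the integral equation below (a standard restatement of the definition above, not from the original slides):

$$\pi(y) = \int t(y \mid x)\, \pi(x)\, dx \quad \text{for all } y$$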

Conditions

The following conditions are required for a Markov chain to converge to a unique stationary distribution (below, I use "set" to refer to a set $A$ with nonzero probability $\pi(A)$):

- Irreducible: any set $A$ can be reached from any other set $B$ with nonzero probability
- Positive recurrent: for any set $A$, the expected number of steps required for the chain to return to $A$ is finite
- Aperiodic: for any set $A$, the number of steps required to return to $A$ must not always be a multiple of some value $k > 1$

Thankfully, these conditions are typically met in Bayesian statistics

Metropolis-Hastings

Suppose that a Markov chain is in position $x$; the Metropolis-Hastings algorithm proceeds as follows:

(1) Propose a move to $y$ with probability $q(y \mid x)$
(2) Calculate the ratio
$$r = \frac{p(y)\, q(x \mid y)}{p(x)\, q(y \mid x)}$$
(3) Accept the proposed move with probability $\alpha = \min\{1, r\}$; otherwise, remain at $x$ (i.e., $X^{(t+1)} = X^{(t)}$)

Stationary distribution

- The description on the previous slide allows asymmetric proposals; if the proposal is symmetric, i.e., $q(y \mid x) = q(x \mid y)$, the ratio is simply $r = p(y)/p(x)$
- Theorem (detailed balance): for any sets $A$ and $B$, $P(A)\,T(B \mid A) = P(B)\,T(A \mid B)$, where $T(A \mid B)$ is the transition probability from $B$ to $A$ imposed by the Metropolis-Hastings algorithm
- Theorem: the Markov chain with transition probabilities arising from the Metropolis-Hastings algorithm has the posterior distribution $p$ as a stationary distribution
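The standard one-line verification of detailed balance (added here for completeness) works at the level of individual points $x \neq y$, where the Metropolis-Hastings transition density is $q(y \mid x)\,\alpha(x, y)$:

$$p(x)\, q(y \mid x)\, \alpha(x, y) = p(x)\, q(y \mid x) \min\left\{1, \frac{p(y)\, q(x \mid y)}{p(x)\, q(y \mid x)}\right\} = \min\{\, p(x)\, q(y \mid x),\ p(y)\, q(x \mid y) \,\}$$

The final expression is symmetric in $x$ and $y$, so it also equals $p(y)\, q(x \mid y)\, \alpha(y, x)$; integrating over sets $A$ and $B$ gives the version stated above.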

Example

As an example of how the Metropolis-Hastings algorithm works, let's sample from the following posterior:

$$Y \sim t_5(\mu, 1), \qquad \mu \sim t_5(0, 1)$$

The following code can be used to calculate the posterior density:

p <- function(mu) {
  # t5 prior for mu times the location-shifted t5 likelihood
  dt(mu, 5) * prod(dt(y - mu, 5))
}

In practice, it is better to work with probabilities on the log scale to avoid numerical overflow/underflow, but the above will be sufficient for our purposes today
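For reference, a log-scale version might look like this (a minimal sketch; log_p is a name introduced here, not part of the original code):

log_p <- function(mu) {
  # log prior + log likelihood; sums of logs replace products of densities
  dt(mu, 5, log = TRUE) + sum(dt(y - mu, 5, log = TRUE))
}

The acceptance step then compares log(runif(1)) with log_p(proposal) - log_p(mu[i]) instead of forming the ratio directly.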

Example

The Metropolis algorithm can be coded as follows:

N <- 10000
mu <- numeric(N)  # mu[1] = 0 serves as the initial value
for (i in 1:(N-1)) {
  proposal <- mu[i] + rnorm(1)   # symmetric random-walk proposal
  r <- p(proposal)/p(mu[i])
  accept <- rbinom(1, 1, min(1, r))
  mu[i+1] <- if (accept) proposal else mu[i]
}
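A quick way to estimate the realized acceptance rate from the output (an illustrative check, not from the slides): with a continuous proposal, a repeated value almost surely indicates a rejection, so

mean(mu[-1] != mu[-N])   # proportion of iterations in which the move was accepted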

Results

Suppose we observe y = c(-1, 1, 5):

[Trace plot: µ versus iteration t over 10,000 draws]

Results

Now suppose we observe y = c(39, 41, 45):

[Trace plot: µ versus iteration t over 10,000 draws; the chain starts near 0 and must travel to the posterior region around 40]

Remarks

- The preceding plots are known as trace plots; we will discuss trace plots further next week when we discuss convergence diagnostics
- The preceding examples illustrate the notion of Markov chains converging to the posterior distribution regardless of where they start
- Note, however, that this may take a while (≈ 100 draws in the second example; this can be much larger in multidimensional problems)
- Thus, it might be desirable to discard the beginning values of the Markov chain for the second example, only considering draws from 101 onward to be draws from the posterior

Burn-in

- This idea is known as burn-in
- Both BUGS and JAGS (and R2OpenBUGS and R2jags) allow you to set the number of burn-in iterations, i.e., to run the Markov chain for a while before you begin to record draws
- The advantage of this is to eliminate dependency of the results on the initial values (which are arbitrary)
- Strictly speaking, burn-in is not necessary: if you simply run the chain long enough, the impact of the initial values will gradually diminish and you will achieve the same result
- Discarding the initial values of the chain, however, is often faster, especially if you are interested in estimating quantities pertaining to the tails of the distribution
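For the sampler above, discarding a burn-in period by hand is straightforward (a sketch using the cutoff of 100 suggested on the previous slide; mu_post is a name introduced here):

burn <- 100
mu_post <- mu[-(1:burn)]             # keep draws 101 onward
mean(mu_post)                        # Monte Carlo estimate of the posterior mean
quantile(mu_post, c(0.025, 0.975))   # 95% credible interval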

Autocorrelation

- Note that although the marginal distribution of $X^{(t)}$ converges to the posterior, that doesn't mean that the chain converges to one producing IID draws from the posterior
- Indeed, in the second example, consecutive draws were quite highly correlated (this is known as autocorrelation, which we will discuss in greater depth next week)
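In R, the autocorrelation of the chain can be inspected directly with the built-in acf function (shown here applied to the draws from the sampler above):

acf(mu)   # sample autocorrelation of the chain at increasing lags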

Proposal distribution: Tradeoffs

- The high degree of autocorrelation is a consequence of the proposal distribution
- Newcomers to the Metropolis-Hastings algorithm often feel that rejecting a proposal is a bad outcome and that we should minimize the probability that it occurs
- However, while an excessive amount of rejection is indeed bad, too little rejection is also bad, as it indicates that the proposals are too cautious and represent only very small movements around the posterior distribution (giving rise to high autocorrelation)

Proposal distribution: Tradeoffs

- Our original proposal had σ = 1; for the first example, this led to an acceptance rate of 53.5%; for the second example, it led to an acceptance rate of 95.5%
- Informally, it certainly appeared that the Markov chain worked better in the first example than in the second
- Formally, there are theoretical arguments indicating that the optimal acceptance rate is 44% in one dimension, with a limit of 23.4% as the dimension goes to infinity
- We can achieve these targets by modifying the proposal; specifically, by increasing or decreasing σ², the variance of the normal proposal
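One way to experiment with the proposal scale is to wrap the earlier sampler in a function (a hedged sketch; metropolis and sigma are names introduced here, while p() and y are as defined earlier):

metropolis <- function(sigma, N = 10000) {
  mu <- numeric(N)
  for (i in 1:(N-1)) {
    proposal <- mu[i] + rnorm(1, sd = sigma)   # wider or narrower random walk
    r <- p(proposal)/p(mu[i])
    mu[i+1] <- if (runif(1) < min(1, r)) proposal else mu[i]
  }
  mu
}
mu15 <- metropolis(sigma = 15)
mean(mu15[-1] != mu15[-length(mu15)])   # realized acceptance rate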

Results: y = c(39, 41, 45), σ = 15

Acceptance rate: 49.6%

[Trace plot: µ versus iteration t over 10,000 draws]

Results: y = c(39, 41, 45), σ = 200

Acceptance rate: 4.9%

[Trace plot: µ versus iteration t over 10,000 draws]

Remarks

- Trace plots should look like fat, hairy caterpillars, as they do for the first example and for the σ = 15 results above; not as they do for the second example with σ = 1 or for the σ = 200 results
- Both BUGS and JAGS allow for adaptation phases in which they try out different values of σ (or other such tuning parameters) to see which ones work the best before they actually start the official Markov chain; we will discuss these as they come up

Idea

- Another extremely useful technique for sampling from multidimensional distributions is Gibbs sampling, which we have already encountered
- The basic idea is to split the multidimensional θ into blocks (often scalars) and sample each block separately, conditional on the most recent values of the other blocks
- The beauty of Gibbs sampling is that it simplifies a complex high-dimensional problem by breaking it down into simple, low-dimensional problems

Formal description

Formally, the algorithm proceeds as follows, where θ consists of $k$ blocks $\theta_1, \theta_2, \ldots, \theta_k$: at iteration $t$,

- Draw $\theta_1^{(t+1)}$ from $p(\theta_1 \mid \theta_2^{(t)}, \theta_3^{(t)}, \ldots, \theta_k^{(t)})$
- Draw $\theta_2^{(t+1)}$ from $p(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_k^{(t)})$
- ...

This completes one iteration of the Gibbs sampler, thereby producing one draw $\theta^{(t+1)}$; the above process is then repeated many times. The distribution $p(\theta_1 \mid \theta_2^{(t)}, \theta_3^{(t)}, \ldots, \theta_k^{(t)})$ is known as the full conditional distribution of $\theta_1$.
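As a concrete illustration (not from the slides), here is a minimal Gibbs sampler for a bivariate normal target with correlation ρ, where both full conditional distributions are univariate normals; rho, N, and theta are names introduced here:

rho <- 0.6
N <- 5000
theta <- matrix(0, N, 2)
for (t in 1:(N-1)) {
  # Full conditional of theta1 given theta2: N(rho * theta2, 1 - rho^2)
  theta[t+1, 1] <- rnorm(1, rho * theta[t, 2], sqrt(1 - rho^2))
  # Full conditional of theta2 given the freshly updated theta1
  theta[t+1, 2] <- rnorm(1, rho * theta[t+1, 1], sqrt(1 - rho^2))
}

Note that each update conditions on the most recent value of the other block, exactly as in the formal description above.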

Gibbs: Illustration

[Figure: sample path of a Gibbs sampler in the (β1, β2) plane, moving one coordinate at a time; both axes span roughly -3 to 3]

Justification for Gibbs sampling

- Although they appear quite different, Gibbs sampling is a special case of the Metropolis-Hastings algorithm
- Specifically, Gibbs sampling involves a proposal from the full conditional distribution, which always has a Metropolis-Hastings ratio of 1; i.e., the proposal is always accepted
- Thus, Gibbs sampling produces a Markov chain whose stationary distribution is the posterior distribution, for all the same reasons that the Metropolis-Hastings algorithm works
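The claim that the ratio is always 1 is a one-line calculation (spelled out here for completeness): when block $\theta_1$ is updated from $x_1$ to $y_1$ with the remaining blocks $\theta_{-1}$ held fixed, the proposal density is $q(y_1 \mid x_1) = p(y_1 \mid \theta_{-1})$, so

$$r = \frac{p(y_1, \theta_{-1})\, p(x_1 \mid \theta_{-1})}{p(x_1, \theta_{-1})\, p(y_1 \mid \theta_{-1})} = \frac{p(y_1 \mid \theta_{-1})\, p(\theta_{-1})\, p(x_1 \mid \theta_{-1})}{p(x_1 \mid \theta_{-1})\, p(\theta_{-1})\, p(y_1 \mid \theta_{-1})} = 1$$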

Gibbs vs. Metropolis

- Thus, there is no real conflict as far as using Gibbs sampling or the Metropolis-Hastings algorithm to draw from the posterior; in fact, they are frequently used in combination with each other
- As we have seen, semi-conjugacy leads to Gibbs updates with simple, established methods for drawing from the full conditional distribution
- In other cases, we may have to resort to Metropolis-Hastings to draw from some of the full conditionals

Variable-at-a-time Metropolis

- This sort of approach is known as variable-at-a-time Metropolis-Hastings, or Metropolis-within-Gibbs
- The advantage of the variable-at-a-time approach, as mentioned earlier, is that it is much easier to propose updates for a single variable than for many variables at once (a sketch of one such sweep follows below)
- The disadvantage, however, is that when variables are highly correlated, it may be very difficult to change one without simultaneously changing the other
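A minimal sketch of one variable-at-a-time sweep, assuming only that we can evaluate the joint density; joint, theta, and the proposal scale 0.5 are illustrative names, not from the slides:

# One sweep: random-walk Metropolis update for each coordinate in turn
for (j in seq_along(theta)) {
  prop <- theta
  prop[j] <- theta[j] + rnorm(1, sd = 0.5)   # perturb coordinate j only
  r <- joint(prop)/joint(theta)              # MH ratio (symmetric proposal)
  if (runif(1) < min(1, r)) theta <- prop    # accept, or keep the current value
}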

Gibbs with highly correlated parameters

[Figure: sample path of a Gibbs sampler in the (β1, β2) plane for highly correlated parameters; both axes span roughly -3 to 3]

Regression: BUGS

Correlated posteriors can often arise in regression (see code for details):

[Two trace plots, iterations 1000-2000]

Regression: JAGS

JAGS avoids this problem (for regression) by updating the block (β1, β2) all at once:

[Two trace plots, iterations 0-1000]

Regression: BUGS

We can also avoid this problem in BUGS by centering the variables or by specifying beta as a multivariate parameter:

[Two trace plots, iterations 1000-2000]
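Centering is a one-line preprocessing step in R before the data are passed to BUGS (x here is an illustrative predictor, not from the slides); it removes much of the posterior correlation between the intercept and the slope:

x_c <- x - mean(x)   # centered predictor; the intercept now refers to the mean of x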

Conclusion

- Note that JAGS is not, in general, impervious to correlated posteriors: it merely recognizes this situation for linear models and GLMs
- Certainly, there is a large body of work on other computational approaches to sampling (slice sampling, adaptive rejection sampling, Hamiltonian Monte Carlo, etc.); covering such methods is beyond the scope of this course
- Nevertheless, I hope that by exploring the two most widely known methods (Metropolis and Gibbs), I have conveyed a sense of how such MCMC sampling methods work, why they work, and in what situations they may fail to work