A Geometric Interpretation of the Metropolis Hastings Algorithm

Statistical Science 2001, Vol. 16, No. 4, 335-339

A Geometric Interpretation of the Metropolis-Hastings Algorithm

Louis J. Billera and Persi Diaconis

Abstract. The Metropolis-Hastings algorithm transforms a given stochastic matrix into a reversible stochastic matrix with a prescribed stationary distribution. We show that this transformation gives the minimum distance solution in an L1 metric.

1. INTRODUCTION

Let X be a finite set and suppose that π(x) > 0, Σ_x π(x) = 1, is a probability distribution on X from which we wish to draw samples. The Metropolis (or Metropolis-Hastings) algorithm is a widely used procedure for drawing approximate samples from π, which works by finding a Markov chain on X with π as stationary distribution and using the well-known fact that, after running the chain for a long time, it is approximately distributed as π. An important feature of the Metropolis-Hastings algorithm is that it can be applied when π is known only through ratios π(x)/π(y). This allows the algorithm to be used without knowing the normalizing constant: for example, if π is specified as π(x) ∝ e^{−βH(x)}, whose normalizing constant is in practice impossible to compute if X is very large, as happens in many problems in statistical physics; or as π(θ) ∝ P(θ)L(x|θ), as happens in Bayesian posterior distributions when P is a prior and L is a likelihood. The algorithm is widely used for simulations in physics, chemistry, biology and statistics. It appears as the first entry of a recent list of great algorithms of 20th-century scientific computing [Dongarra and Sullivan (2000)]. Yet for many people (including the present authors) the Metropolis-Hastings algorithm seems like a magic trick. It is hard to see where it comes from or why it works. In this paper we give a development which explains at least some of the properties of the algorithm, and we also show how there is a

Louis J. Billera is Professor of Mathematics and Operations Research, Department of Mathematics, Malott Hall, Cornell University, Ithaca, New York 14853. Persi Diaconis is Mary V.
Sunseri Professor of Statistics and Professor of Mathematics, Stanford University, Stanford, California 94305.

natural class of related algorithms, amongst which it is optimal.

The mechanics of the algorithm are simple to explain. The Metropolis-Hastings algorithm takes a base chain given by an absolutely arbitrary stochastic matrix K(x, y) and transforms this into a stochastic matrix M(x, y) that is reversible with respect to π, that is, one such that

(1.1)   π(x) M(x, y) = π(y) M(y, x).

This implies that π is a stationary distribution for M, and so we have found the required Markov chain. The transformation is easy to implement. If

(1.2)   R(x, y) = π(y) K(y, x) / (π(x) K(x, y)),

then

(1.3)   M(x, y) = K(x, y) min(1, R(x, y))   if x ≠ y,
        M(x, x) = K(x, x) + Σ_{z ≠ x} K(x, z) [1 − min(1, R(x, z))].

The transformation (1.3) has a simple probabilistic interpretation: from x, choose y with probability K(x, y). With probability min(1, R(x, y)) accept this choice and move to y. Failing to accept, stay at x. It is easy to check directly that M(x, y) satisfies the reversibility condition (1.1). If the base chain K is irreducible, then M is irreducible.

The Metropolis algorithm was introduced in 1953 by Metropolis, Rosenbluth, Rosenbluth, Teller and Teller (1953). In the original version K was taken as symmetric, so K(x, y) = K(y, x) cancels out of (1.2), which makes the rejection step easier to understand: the chain moves to higher π-density regions automatically, but only with appropriate probabilities to lower regions. The extension to more general nonsymmetric chains

appears in Hastings (1970). Textbook descriptions are in Hammersley and Handscomb (1964), Fishman (1996) and Liu (2001). Peskun (1973) shows that the Metropolis-Hastings algorithm leads to estimates with smallest asymptotic variance in a class of variants. Some approaches to convergence rates are summarized in Diaconis and Saloff-Coste (1998). Further recent references are Mengersen and Tweedie (1996) and Roberts et al. (1997).

In this paper we show that the Metropolis-Hastings algorithm can be characterized geometrically as a projection, in a particular L1 metric, onto R_π, the set of π-reversible Markov chains as in (1.1). Because projections minimize distances, this shows that M is in this sense the closest reversible kernel to the original kernel, and hence it is a natural way in which to construct a good chain with π as its limiting distribution.

As an example, consider the 2 × 2 case. Then a stochastic matrix

    ( 1 − a     a   )
    (   b     1 − b ),    0 ≤ a, b ≤ 1,

may be represented as a point (a, b) in the unit square. The space R_π of π-reversible matrices is given by the intersection of the line π(1) a = π(2) b with the unit square. If π(1) > π(2), the picture is as illustrated in Figure 1.

[Fig. 1. Metropolis map in the 2 × 2 case.]

The Metropolis-Hastings algorithm moves a point (a, b) to the line by projecting vertically for the top region and horizontally for the bottom region, as in the diagram. In the general case, the space of stochastic matrices is partitioned into chambers (i.e., connected components of the complement of a finite union of hyperplanes in Euclidean space), and the Metropolis-Hastings algorithm is seen to be piecewise linear over this partition, that is, a linear projection on each piece. Details are given in Section 2, which also illustrates the difficulty of replacing L1 projections by L2 projections. Section 3 gives a motivated development of the recipe (1.3). Section 4 gives an extension of our main result to general spaces.

2. THE METROPOLIS MAP

Let X be a finite set.
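Before the formal development, the accept/reject transformation (1.3) can be put in code. The following is a minimal sketch; the base chain K and target distribution pi are an invented example, not taken from the paper.

```python
# A minimal sketch of the transformation (1.3); K, pi and the state
# space are an invented example, not taken from the paper.

def metropolize(K, pi):
    """Apply (1.3): propose from K, accept with probability min(1, R(x, y)),
    and fold every rejection into the holding probability M(x, x)."""
    n = len(pi)
    M = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x != y and K[x][y] > 0:
                R = (pi[y] * K[y][x]) / (pi[x] * K[x][y])
                M[x][y] = K[x][y] * min(1.0, R)
        # The diagonal entry absorbs all rejected moves so the row sums to 1.
        M[x][x] = 1.0 - sum(M[x][y] for y in range(n) if y != x)
    return M

# Base chain: nearest-neighbor random walk on three states.
K = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.5]]
pi = [0.2, 0.3, 0.5]        # target stationary distribution
M = metropolize(K, pi)

# Reversibility (1.1): pi(x) M(x, y) = pi(y) M(y, x) for all x, y.
for x in range(3):
    for y in range(3):
        assert abs(pi[x] * M[x][y] - pi[y] * M[y][x]) < 1e-12
```

Only the ratios pi(y) K(y, x) / (pi(x) K(x, y)) enter, so pi may be specified up to an unknown normalizing constant.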
Let S be the set of stochastic matrices indexed by X. Thus K ∈ S has K(x, y) ≥ 0 and Σ_y K(x, y) = 1 for each x ∈ X. Note that S is a convex set of dimension |X|(|X| − 1). Fixing a stationary distribution π(x) > 0, Σ_x π(x) = 1, let R_π be the subset of S consisting of π-reversible Markov chains as in (1.1). Note that R_π is a convex set of dimension |X|(|X| − 1)/2. The Metropolis-Hastings algorithm gives a map M from S onto R_π:

(2.1)   M(K)(x, y) = min( K(x, y), π(y) K(y, x) / π(x) )    for x ≠ y.

Observe that, for x ≠ y, M is coordinatewise decreasing. In particular, K(x, y) = 0 implies M(K)(x, y) = 0. Thus M (weakly) increases the set of off-diagonal zeros. Introduce a metric on S via

(2.2)   d(K, K′) = Σ_{x ≠ y} π(x) |K(x, y) − K′(x, y)|.

This is a metric on S: if d(K, K′) = 0, then K(x, y) = K′(x, y) for all x ≠ y and, by row stochasticity, K(x, x) = K′(x, x). The main result of this paper follows.

Theorem 1. The Metropolis map M of (2.1) minimizes the distance (2.2) from K to R_π. In fact, M(K) is the unique closest element of R_π that is coordinatewise smaller than K on its off-diagonal entries.

Remark 2.1. In the space of real matrices indexed by X × X, consider the hyperplanes

    H_xy = { K : π(x) K(x, y) = π(y) K(y, x) }

and their corresponding half spaces

    H−_xy = { K : π(x) K(x, y) < π(y) K(y, x) },    H+_xy = { K : π(x) K(x, y) > π(y) K(y, x) }.

These divide the matrix space into chambers. Furthermore, ∩_{x ≠ y} H_xy ∩ S = R_π. From (2.1), the map M is a diagonal linear map on each chamber of this hyperplane arrangement.

Remark 2.2. A related natural metric is d̄(K, K′) = Σ_{x, y} π(x) |K(x, y) − K′(x, y)|. This is the total variation distance between the rows of K and K′, weighted by the stationary distribution. The following example shows that the Metropolis chain does

not minimize d̄. The example has three states and π(x) = 1/3 for all x. One can exhibit a stochastic matrix K and a π-reversible matrix N with d̄(K, M(K)) > d̄(K, N), while, for the metric d of (2.2), d(K, M(K)) = d(K, N). This does not violate the claimed uniqueness in Theorem 1; the matrix N is not coordinatewise smaller than K in the (1, 2) entry. The following proof shows that M(K) minimizes d̄ among elements of R_π coordinatewise less than K on off-diagonal entries.

Remark 2.3. It is natural to seek an L2 minimizer. A glance at Figure 1 shows the problem. Near the corner (1, 1), the unconstrained L2 minimizer falls outside the square: it increases one coordinate past 1 and causes the diagonal entry in that row to become negative. Thus, to implement an L2-minimizing projection, one would have to take into account all the entries in a row in order to limit the amount an entry might be increased. This would require complete knowledge of all previous moves in the algorithm.

Remark 2.4. The distance between K ∈ S and R_π has a simple interpretation. Let

    πK(x, y) = π(x) K(x, y),    πK*(x, y) = π(y) K(y, x).

Both are probabilities on X × X, and K is π-reversible if and only if πK = πK*. With this notation, an easy computation shows

    d(K, R_π) = ‖πK − πK*‖_TV.

Proof of Theorem 1. We first argue that d(K, R_π) = d(K, M(K)) by showing that d(K, N) ≥ d(K, M(K)) for every N in R_π. For this, note that

    d(K, N) ≥ Σ′ [ π(x) |K(x, y) − N(x, y)| + π(y) |K(y, x) − N(y, x)| ],

where Σ′ denotes the sum over pairs {x, y} with π(x) K(x, y) > π(y) K(y, x). Since N is in R_π, if N(x, y) = K(x, y) + ε_xy, then N(y, x) = (π(x)/π(y)) (K(x, y) + ε_xy). Thus

    d(K, N) ≥ Σ′ [ π(x) |ε_xy| + | π(y) K(y, x) − π(x) K(x, y) − π(x) ε_xy | ].

Using |a + b| ≥ |b| − |a| gives

    d(K, N) ≥ Σ′ [ π(x) K(x, y) − π(y) K(y, x) ] = d(K, M(K)).

When N is coordinatewise less than or equal to K on the off-diagonal entries, we have all ε_xy ≤ 0. If any one is negative, we can conclude that d(K, N) > d(K, M(K)), proving uniqueness.

3. DEVELOPING THE ALGORITHM

The following development makes the Metropolis-Hastings algorithm seem rather natural and, in a sense, gives all possible variations. Let K(x, y) be a Markov chain on a finite state space X.
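The minimization in Theorem 1 can also be probed numerically. The sketch below (the example matrices are ours, not the paper's) applies the Metropolis map (2.1), computes the metric (2.2), and compares with Barker's chain, another π-reversible modification that is coordinatewise smaller than K off the diagonal and therefore, by the theorem, cannot be closer.

```python
# A numerical probe of Theorem 1 (our own example matrices, not the
# paper's): the Metropolis map (2.1) versus Barker's chain, compared
# in the metric (2.2).

def metropolis_map(K, pi):
    """(2.1): M(K)(x, y) = min(K(x, y), pi(y) K(y, x) / pi(x)) off the
    diagonal; the diagonal is adjusted so each row sums to 1."""
    n = len(pi)
    M = [row[:] for row in K]
    for x in range(n):
        for y in range(n):
            if x != y:
                M[x][y] = min(K[x][y], pi[y] * K[y][x] / pi[x])
        M[x][x] = 1.0 - sum(M[x][y] for y in range(n) if y != x)
    return M

def barker_map(K, pi):
    """Barker's choice, accept with probability R/(1 + R); it also lands
    in R_pi and is coordinatewise smaller than K off the diagonal."""
    n = len(pi)
    B = [row[:] for row in K]
    for x in range(n):
        for y in range(n):
            if x != y:
                denom = pi[x] * K[x][y] + pi[y] * K[y][x]
                B[x][y] = K[x][y] * pi[y] * K[y][x] / denom if denom > 0 else 0.0
        B[x][x] = 1.0 - sum(B[x][y] for y in range(n) if y != x)
    return B

def d(K1, K2, pi):
    """(2.2): d(K, K') = sum over x != y of pi(x) |K(x, y) - K'(x, y)|."""
    n = len(pi)
    return sum(pi[x] * abs(K1[x][y] - K2[x][y])
               for x in range(n) for y in range(n) if x != y)

K = [[0.1, 0.6, 0.3],
     [0.3, 0.4, 0.3],
     [0.2, 0.5, 0.3]]
pi = [0.5, 0.3, 0.2]
MK = metropolis_map(K, pi)
BK = barker_map(K, pi)

# Both modifications are pi-reversible, but the Metropolis map is closer.
assert d(K, MK, pi) < d(K, BK, pi)
```

Both maps only decrease off-diagonal entries, so the comparison illustrates the uniqueness clause of Theorem 1 as well as the minimization.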
The goal is to change K to a chain with stationary distribution π(x). The change must occur as follows: from x, choose y from K(x, y) and decide either to move to y or to stay at x; this last choice may be stochastic, with acceptance probability F(x, y), 0 ≤ F(x, y) ≤ 1. This gives a new chain with transition probabilities

(3.1)   K(x, y) F(x, y),    x ≠ y.

The diagonal entries are changed so that each row sums to 1. The easiest way to get π-stationarity is to insist on π-reversibility:

    π(x) K(x, y) F(x, y) = π(y) K(y, x) F(y, x).

Set R(x, y) = π(y) K(y, x) / (π(x) K(x, y)). The reversibility constraint becomes

(3.2)   F(x, y) = R(x, y) F(y, x).

Since F(y, x) = (1/R(x, y)) F(x, y) must be at most 1, we must also have F(x, y) ≤ R(x, y), and so

(3.3)   F(x, y) ≤ min(1, R(x, y)).

The point now is that F(x, y) may be chosen arbitrarily to satisfy (3.3) for some fixed orientation of pairs, and then F(y, x) is forced by (3.2). This gives all modifications. We summarize.
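The construction just summarized can be sketched in code. The example chain and the scale factor c below are our own; c = 1 recovers the Metropolis choice, and any 0 ≤ c ≤ 1 satisfies (3.3).

```python
# Sketch of the construction (3.1)-(3.3); the example chain and the
# scale factor c are ours, chosen only for illustration.

def modified_chain(K, pi, c):
    """For each ordered pair x < y set F(x, y) = c * min(1, R(x, y)),
    force F(y, x) = F(x, y) / R(x, y), and build M as in (3.1)."""
    n = len(pi)
    F = [[1.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(x + 1, n):
            R = pi[y] * K[y][x] / (pi[x] * K[x][y])
            F[x][y] = c * min(1.0, R)
            F[y][x] = F[x][y] / R      # forced by (3.2); at most 1 by (3.3)
    M = [[K[x][y] * F[x][y] if x != y else 0.0 for y in range(n)]
         for x in range(n)]
    for x in range(n):
        # Diagonal: K(x, x) plus all the rejected mass.
        M[x][x] = K[x][x] + sum(K[x][y] * (1.0 - F[x][y])
                                for y in range(n) if y != x)
    return M

K = [[0.2, 0.5, 0.3],
     [0.4, 0.2, 0.4],
     [0.3, 0.5, 0.2]]      # all off-diagonal entries positive
pi = [0.5, 0.25, 0.25]

for c in (1.0, 0.5):       # c = 1.0 is the classic Metropolis choice
    M = modified_chain(K, pi, c)
    for x in range(3):
        for y in range(3):
            assert abs(pi[x] * M[x][y] - pi[y] * M[y][x]) < 1e-12
```

Every such choice of F yields a π-reversible chain; this is the content of the proposition that follows.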

Proposition 3.1. Let K(x, y) be a Markov chain on a finite state space X. For each pair x ≠ y, choose an ordering, say x < y. Choose

    F(x, y) ≤ min( R(x, y), 1 ),    R(x, y) = π(y) K(y, x) / (π(x) K(x, y)),

and set F(y, x) = (1/R(x, y)) F(x, y). Then the Markov chain

    M(x, y) = K(x, y) F(x, y),    y ≠ x,
    M(x, x) = K(x, x) + Σ_{y ≠ x} K(x, y) (1 − F(x, y)),

satisfies π(x) M(x, y) = π(y) M(y, x) for all x, y. Furthermore, this construction gives all possible functions F such that (3.1) is π-reversible.

Remark 3.1. The classic Metropolis choice is F(x, y) = min(1, R(x, y)). This maximizes the chance of moving from x. Suppose M and M′ are π-reversible Markov chains on X with eigenvalues 1 = λ_0 ≥ λ_1 ≥ ⋯ ≥ λ_{|X|−1} and 1 = λ′_0 ≥ λ′_1 ≥ ⋯ ≥ λ′_{|X|−1}, respectively, and also M(x, y) ≥ M′(x, y) for all x ≠ y. Then the minimax characterization of eigenvalues shows that λ_i ≤ λ′_i for all i. Thus, among the reversible chains of the form (3.1), the Metropolis algorithm has the largest spectral gap 1 − λ_1. Moreover, Peskun's theorem shows that this gives the chain of the form (3.1) with the minimum asymptotic variance for additive functionals (the usual form of statistical estimators, as sums of a function evaluated along the chain). For contrast, chemists sometimes use Barker dynamics, which amounts to the choice

    F(x, y) = π(y) K(y, x) / ( π(x) K(x, y) + π(y) K(y, x) ) = R(x, y) / (1 + R(x, y)).

Hastings (1970) introduced the equivalent form

    F(x, y) = S(x, y) / ( 1 + 1/R(x, y) )

for S(x, y) a symmetric function satisfying

    S(x, y) ≤ min( 1 + R(x, y), 1 + 1/R(x, y) ).

Remark 3.2. It is natural to consider the possible choices of F which are functions of R alone: F(x, y) = g(R(x, y)). Then g must satisfy

    g(x) = x g(1/x)    for 0 < x < ∞.

Such a g may be chosen arbitrarily on (0, 1], subject to 0 ≤ g(x) ≤ x; the functional equation then specifies g on (1, ∞). This gives Hastings' class of algorithms. The original Metropolis algorithm corresponds to g(x) = min(x, 1).

Remark 3.3. Of course, the ergodicity of M(x, y) must be checked; the identity chain, K(x, x) = 1 for all x, is a π-reversible chain!

Remark 3.4.
There are still mysterious features about the Metropolis-Hastings algorithm; when applied to natural generating sets of finite reflection groups, with the stationary distribution proportional to length, the Metropolis-Hastings algorithm deforms the multiplication in the group algebra to multiplication in the Hecke algebra. See Diaconis and Hanlon (1992) and Diaconis and Ram (2000) for this.

4. MORE GENERAL SPACES

It is possible to extend Theorem 1 to general spaces. Let X be a complete separable metric space. Let µ be a σ-finite measure on X and π(x) a strictly positive probability density with respect to µ. Let S be the set of Markov kernels K(x, dy) on X × X having measurable densities k(x, y) with respect to µ, with a possible atom at x allowed. The Metropolis-Hastings algorithm maps K(x, dy) to M(x, dy) with

    M(x, dy) = k(x, y) min(1, R(x, y)) µ(dy) + a(x) δ_x(dy),

where R(x, y) = π(y) k(y, x) / (π(x) k(x, y)) and

    a(x) = M(x, {x}) = K(x, {x}) + ∫ [1 − min(1, R(x, y))] k(x, y) µ(dy).

This maps S to the π-reversible Markov kernels on X. Define a metric on S by

    d(K, K′) = ∫∫ π(x) |k(x, y) − k′(x, y)| µ(dx) µ(dy),

the integral taken over the complement of the diagonal Δ in X × X. Then Theorem 1 holds as stated for this situation.

ACKNOWLEDGMENTS

We thank K. P. Choi, Thomas Yan, Richard Tweedie and a helpful referee for a careful reading and thoughtful comments on a previous version of this paper. The authors were supported in part by NSF Grants DMS-9-9 and DMS-95-9.

REFERENCES

Diaconis, P. and Hanlon, P. (1992). Eigen analysis for some examples of the Metropolis algorithm. Contemp. Math. 138 99-117.
Diaconis, P. and Ram, A. (2000). Analysis of systematic scan Metropolis algorithms using Iwahori-Hecke algebra techniques. Michigan Math. J. 48 157-190.
Diaconis, P. and Saloff-Coste, L. (1998). What do we know about the Metropolis algorithm? J. Comput. System Sci. 57 20-36.
Dongarra, J. and Sullivan, F., eds. (2000). The top 10 algorithms. Comput. Sci. Engrg. 2 22-23.
Fishman, G. (1996). Monte Carlo: Concepts, Algorithms and Applications. Springer, New York.
Hammersley, J. and Handscomb, D. (1964). Monte Carlo Methods. Chapman and Hall, New York.
Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 97-109.
Liu, J. (2001). Monte Carlo Strategies in Scientific Computing. Springer, New York.
Mengersen, K. and Tweedie, R. (1996). Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist. 24 101-121.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A. and Teller, E. (1953). Equations of state calculations by fast computing machines. J. Chem. Phys. 21 1087-1092.
Peskun, P. (1973). Optimum Monte-Carlo sampling using Markov chains. Biometrika 60 607-612.
Roberts, G., Gelman, A. and Gilks, W. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab. 7 110-120.