An introduction to adaptive MCMC


Gareth Roberts, MIRAW Day on Monte Carlo methods, March 2011. Mainly joint work with Jeff Rosenthal.

http://www2.warwick.ac.uk/fac/sci/statistics/crism/ Conferences and workshops; academic visitor programme. Workshops coming up: March 28 - April 1, INfER, statistics for infectious diseases; April 5-7, WOGAS III, geometry and statistics; April 13, prior elicitation.

Choosing parameters in MCMC. Target density π on state space X (= R^d). Most MCMC algorithms offer the user choices upon which convergence properties depend crucially. Random walk Metropolis (RWM) might propose moves from x to Y ~ N(x, σ^2 I_d), accepting with probability min{1, π(Y)/π(x)} and otherwise remaining at x. This is both a problem and an opportunity!
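A minimal sketch of this RWM step in Python (the Gaussian target and all tuning constants are illustrative stand-ins, not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def rwm(log_pi, x0, sigma, n_iter):
    """Random walk Metropolis: propose Y ~ N(x, sigma^2 I_d),
    accept with probability min{1, pi(Y)/pi(x)}."""
    x = np.array(x0, dtype=float)
    chain = np.empty((n_iter, x.size))
    for n in range(n_iter):
        y = x + sigma * rng.standard_normal(x.size)        # proposal
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):  # accept/reject
            x = y
        chain[n] = x
    return chain

# Illustrative run: standard normal target in d = 5; sigma is the user's choice.
chain = rwm(lambda x: -0.5 * x @ x, np.zeros(5), sigma=1.0, n_iter=10_000)
```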

Adaptive MCMC. The Goal: given a model (i.e., X and π(·)), the computer: efficiently and cleverly tries out different MCMC algorithms; automatically learns the good ones; runs the algorithm for "long enough"; obtains excellent estimates together with error bounds; and reports the results clearly and concisely, while the user remains unaware of the complicated MCMC and adaptation that was used. Easier said than done...

A surprising example. Let Y = {−1, +1}, π(x) ∝ (1 + x^2)^{-1} for x > 0. Then if X ~ π, also X^{-1} ~ π. If Γ = +1 the algorithm carries out a RWM step with proposal increment N(0, 1). If Γ = −1 the algorithm carries out a RWM step for X^{-1}, i.e. given a standard normal Z, we propose to move to (x^{-1} + Z)^{-1}. Now consider the adaptive strategy: if Γ = +1, change to Γ = −1 if and only if X_n < n^{-1}; if Γ = −1, change to Γ = +1 if and only if X_n > n. Stationarity and diminishing adaptation certainly hold for this adaptive scheme.

How does it do? For any set A bounded away from both 0 and ∞, P(X_n ∈ A) → 0 as n → ∞! For any set A containing an open neighbourhood of 0, P(X_n ∈ A) → 1/2 as n → ∞. The algorithm has a null recurrent character, always returning to any set of positive Lebesgue measure, but doing so with a probability that recedes to 0. There are only TWO strategies. They BOTH have identical convergence properties.

When does adaptation preserve stationarity? Regeneration times [Gilks, Roberts, and Sahu, 1998; Brockwell and Kadane, 2002]: some success, but difficult to implement. Adaptive Metropolis (AM) algorithm [Haario, Saksman, and Tamminen, 2001]: proposal distribution N(x, (2.38)^2 Σ_n / d), where Σ_n is a (bounded) empirical estimate of the covariance of π(·). Generalisations of AM [Atchadé and R., Andrieu and Moulines, Andrieu and Robert, Andrieu and Atchadé, Kohn] require technical conditions. We need simple conditions guaranteeing ‖L(X_n) − π(·)‖ → 0.

Simple Adaptive MCMC Convergence Theorem. Theorem [R and Rosenthal]: an adaptive scheme on {P_γ}, γ ∈ Y, will converge, i.e. lim_{n→∞} ‖L(X_n) − π(·)‖ = 0, if:

Stationarity: π(·) is stationary for each P_γ. [Of course.]

Diminishing Adaptation: lim_{n→∞} sup_{x ∈ X} ‖P_{Γ_{n+1}}(x, ·) − P_{Γ_n}(x, ·)‖ = 0 (at least, in probability). [Either the adaptations need to be small with high probability, or adaptation takes place with probability p(n) → 0.]

Containment: the times to stationarity from X_n, with transitions P_{Γ_n}, remain bounded in probability as n → ∞.

Proof: couple the adaptive chain to another chain that eventually stops adapting...
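To make Diminishing Adaptation concrete in code: a common device is to update the tuning parameter with a step of size O(1/n), so consecutive kernels coincide in the limit. A hedged one-dimensional sketch (the Robbins-Monro-style update rule here is illustrative, not part of the theorem):

```python
import numpy as np

rng = np.random.default_rng(1)

def diminishing_rwm(log_pi, x0, n_iter, target_acc=0.234):
    """One-dimensional RWM whose log proposal scale ls is adapted with an
    O(1/n) step, so successive kernels differ less and less over time."""
    x, ls = float(x0), 0.0
    for n in range(1, n_iter + 1):
        y = x + np.exp(ls) * rng.standard_normal()
        acc = min(1.0, np.exp(log_pi(y) - log_pi(x)))
        if rng.uniform() < acc:
            x = y
        ls += (acc - target_acc) / n  # adaptation step shrinks to zero
    return x, ls
```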

Ergodicity. Under stationarity, diminishing adaptation and containment we also get

lim_{t→∞} (1/t) Σ_{n=1}^t f(X_n) = π(f) in probability

for any bounded function f. However, convergence for all L^1 functions does not follow (counterexample by Chao, Toronto).
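In code, the ergodic estimate is just the running mean of f along the chain (a sketch reusing the chain array from the RWM snippet above; the test function is illustrative):

```python
import numpy as np

# Assuming `chain` is the (n_iter, d) array produced by the RWM sketch above:
f = lambda x: np.log1p(np.abs(x[:, 0]))  # an example test function f
estimate = f(chain).mean()               # (1/t) * sum_{n=1}^t f(X_n), estimating pi(f)
```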

Satisfying containment. Diminishing adaptation and stationarity are straightforward to ensure; containment is more complicated. Here are some conditions which imply containment.

Simultaneous uniform ergodicity: for all ε > 0, there is N = N(ε) ∈ ℕ such that ‖P_γ^N(x, ·) − π(·)‖ ≤ ε for all x ∈ X and γ ∈ Y.

Simultaneous geometric ergodicity (Roberts, Rosenthal, and Schwartz, 1998, JAP): there are C ⊆ X, V: X → [1, ∞), δ > 0, λ < 1, and b < ∞, with sup_C V = v < ∞, such that (1) for each γ ∈ Y, there exists a probability measure ν_γ(·) on C with P_γ(x, ·) ≥ δ ν_γ(·) for all x ∈ C; and (2) P_γ V ≤ λ V + b 1_C.

A polynomial version of the above also works.

Common drift function conditions. Simultaneous geometric ergodicity and simultaneous polynomial ergodicity are examples of common drift function conditions. Such conditions say that all the Markov kernels can be viewed as pushing towards the same small set, as measured by the common drift function. When different chains have different ways of moving towards π, disaster can occur!

A surprising example (cont). Let Y = {−1, +1}, π(x) ∝ (1 + x^2)^{-1} for x > 0. Then if X ~ π, also X^{-1} ~ π. If Γ = +1 the algorithm carries out a RWM step with proposal increment N(0, 1). If Γ = −1 the algorithm carries out a RWM step for X^{-1}, i.e. given a standard normal Z, we propose to move to (x^{-1} + Z)^{-1}. Now consider the adaptive strategy: if Γ = +1, change to Γ = −1 if and only if X_n < n^{-1}; if Γ = −1, change to Γ = +1 if and only if X_n > n. Thus stationarity and diminishing adaptation certainly hold for this adaptive scheme.

What goes wrong? Containment must fail. But how? For Γ = +1, any set bounded away from ∞ is small. [Small set: P(x, ·) ≥ ε ν(·) for all x in the set.] For Γ = −1, any set bounded away from 0 is small. A natural drift function for Γ = +1 diverges as x → ∞, and is bounded on bounded regions. A natural drift function for Γ = −1 is unbounded as x → 0, and is bounded on regions bounded away from zero. No single drift function can do both!

Common drift functions for RWM. If {P_γ} is a class of full-dimensional RWMs, a natural drift function is often V(x) = π(x)^{-1/2}. This works, for instance, if the contours of π are sufficiently regular in the tails, and P_γ is the RWM with proposal variance matrix γ satisfying ε I_d ≤ γ ≤ ε^{-1} I_d for some constant ε > 0 (with the inequalities understood in the positive-definite sense). This is useful for the adaptive Metropolis (AM) example later. [Based on recent work by R + Rosenthal extending R + Tweedie, 1996, Biometrika.]

Common drift functions for single-component Metropolis updates. Consider single-component Metropolis updates. The result is easiest to prove for random scan Metropolis, which chooses a component to update at random and then updates it according to a one-dimensional Metropolis step. Again we can use V(x) = π(x)^{-1/2}. [Based on recent work by Latuszynski, R + Rosenthal extending R + Rosenthal, 1998, AnnAP.]

Scaling proposals. π is a d-dimensional probability density function with respect to d-dimensional Lebesgue measure. Consider (see the sketch below):

RWM: propose new value Y ~ N(x, γ). [γ is a matrix here.]

MwG: choose a component at random, i say, and update X_i according to a one-dimensional Metropolis update with proposal N(x_i, γ_i). [γ is a vector here!]

Lang: Langevin update, proposing a new value from N(x + γ ∇log π(x)/2, γ). [γ is a matrix here!]
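A side-by-side sketch of the three proposal mechanisms (Python; grad_log_pi is an assumed user-supplied gradient function):

```python
import numpy as np

rng = np.random.default_rng(2)

def propose_rwm(x, gamma):
    """RWM proposal: Y ~ N(x, gamma), gamma a d x d covariance matrix."""
    return rng.multivariate_normal(x, gamma)

def propose_mwg(x, gamma):
    """MwG proposal: pick coordinate i at random, perturb it; gamma a length-d vector."""
    i = rng.integers(x.size)
    y = x.copy()
    y[i] += np.sqrt(gamma[i]) * rng.standard_normal()
    return y, i

def propose_langevin(x, gamma, grad_log_pi):
    """Langevin proposal: Y ~ N(x + gamma grad_log_pi(x) / 2, gamma), gamma a matrix."""
    return rng.multivariate_normal(x + 0.5 * gamma @ grad_log_pi(x), gamma)
```

Note that the Langevin proposal is not symmetric, so its Metropolis-Hastings acceptance probability must include the ratio of proposal densities, unlike the two symmetric Gaussian-increment schemes.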

What scaling theory tells us, RWM case... [R, Rosenthal, Sherlock, Pete Neal, Bedard, Gelman, Gilks...] For RWM, we set γ = δ^2 V for a scalar δ. For continuous densities and fixed V, the scalar optimisation is often achieved by finding the algorithm which achieves an acceptance rate somewhere between 0.15 and 0.4. In many limiting cases (as d → ∞) the optimal value is 0.234. The 0.234 rule breaks down only when V is an extremely poor approximation to the target covariance Σ. For Gaussian targets, when V = Σ, the optimal value of δ is δ = 2.38/d^{1/2}.

In the Gaussian case we can exactly quantify the cost of having V different to Σ:

R = (Σ_i λ_i^2 / d)^{1/2} / (Σ_i λ_i / d) = L^2 / L^1,

where {λ_i} denote the eigenvalues of V^{1/2} Σ^{-1/2}. For densities with discontinuities, the problem is far more complex. For certain problems, the optimal limiting acceptance probability is 0.13. Optimal scaling is robust to transience.
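The factor R is easy to compute numerically; a minimal sketch, assuming V and Σ are supplied as numpy arrays and using scipy for the matrix square root:

```python
import numpy as np
from scipy.linalg import sqrtm

def suboptimality_factor(V, Sigma, p=2):
    """R = (mean(lam^p))^(1/p) / mean(lam), with lam the eigenvalues of
    V^{1/2} Sigma^{-1/2}; p = 2 for RWM (p = 6 for the Langevin case later).
    R >= 1, with equality iff V is proportional to Sigma."""
    lam = np.linalg.eigvals(sqrtm(V) @ np.linalg.inv(sqrtm(Sigma))).real
    return np.mean(lam ** p) ** (1.0 / p) / np.mean(lam)
```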

What scaling theory tells us, MwG case... [R, Pete Neal] Sensibly scaled MwG is at least as good as RWM in virtually all cases. Moreover, it is often more natural (conditional independence structure) and/or easier to implement. In the Gaussian case, it is right to tune each component to have acceptance rate 0.44. The optimal scaling is γ_i = 2.4^2 ξ_i, where ξ_i is the target conditional variance.

What scaling theory tells us, Lang case... [R + Rosenthal] Again set γ = δ^2 V. Generally Lang is much more efficient than RWM or MwG (convergence time of the optimal algorithm is O(d^{1/3}) rather than O(d) for the other two). It is somewhat less robust to light tails, zeros of the target, discontinuous densities, etc., and not robust to transience. Single-component-at-a-time Langevin methods are very sub-optimal in general. The optimal acceptance rate is around 0.574. In the Gaussian case, we can quantify the penalty for V being different from Σ:

R = (Σ_i λ_i^6 / d)^{1/6} / (Σ_i λ_i / d) = L^6 / L^1.

Adaptive Metropolis (AM) example. Dimension d = 100, π(·) = randomly-generated, erratic multivariate normal. So the covariance is a 100 × 100 matrix (5,050 distinct entries). Do Metropolis, with proposal distribution given by

Q_n(x, ·) = (1 − β) N(x, (2.38)^2 Σ_n / d) + β N(x, (0.1)^2 I_d / d)

(for n > 2d, say). Here Σ_n is the current empirical estimate of the covariance of π(·). Also β = 0.05 > 0, to satisfy the containment condition.
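A sketch of this AM proposal in code, with the empirical covariance updated recursively; the burn-in rule, the recursion, and the small ridge on the covariance are implementation choices, not prescribed by the slide:

```python
import numpy as np

rng = np.random.default_rng(3)

def adaptive_metropolis(log_pi, x0, n_iter, beta=0.05):
    """AM sketch: mixture proposal (1-beta) N(x, 2.38^2 Sigma_n / d)
    + beta N(x, 0.1^2 I_d / d), Sigma_n = running empirical covariance."""
    x = np.array(x0, dtype=float)
    d = x.size
    mean, cov = x.copy(), np.eye(d)
    chain = np.empty((n_iter, d))
    for n in range(1, n_iter + 1):
        if n <= 2 * d or rng.uniform() < beta:
            y = x + (0.1 / np.sqrt(d)) * rng.standard_normal(d)  # fixed "safe" part
        else:
            prop_cov = (2.38 ** 2 / d) * (cov + 1e-10 * np.eye(d))  # adapted part
            y = rng.multivariate_normal(x, prop_cov)
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):  # symmetric proposal
            x = y
        delta = x - mean                 # recursive mean/covariance update
        mean += delta / n
        cov += (np.outer(delta, x - mean) - cov) / n
        chain[n - 1] = x
    return chain
```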

Adaptive Metropolis example (cont'd). Figure: plot of first coordinate. Takes about 300,000 iterations, then finds a good proposal covariance and starts mixing well.

Adaptive Metropolis example (cont'd). Figure: plot of the sub-optimality factor R_n = d (Σ_{i=1}^d λ_{in}^{-2}) / (Σ_{i=1}^d λ_{in}^{-1})^2, where {λ_{in}} are the eigenvalues of Σ_n^{1/2} Σ^{-1/2}. Starts large, converges to 1.

Adaptive Metropolis-within-Gibbs strategies. Possible strategies for each component separately: (1) adapt the proposal variance in a particular component to match the empirical variance (SCAM); (2) adapt the component acceptance probability to be around 0.44 (AMwG); (3) try to estimate some kind of average conditional variance empirically and fit it to the proposal variance. Our empirical evidence suggests that AMwG usually beats SCAM, but all strategies can be tested by problems for which heterogeneous variances are required. In the Gaussian case, we can show that (2) and (3) converge to the optimal MwG.

Adaptive Metropolis-within-Gibbs example. Propose a move for each coordinate separately: propose increment N(0, e^{2 ls_i}) for the i-th coordinate. Start with ls_i ≡ 0 (say). Adapt each ls_i, in batches, to seek a 0.44 acceptance rate (approximately optimal for one-dimensional proposals). Test on a variance components model, with K = 500 location parameters (so dim = 503), and data Y_ij ~ N(µ_i, 10^2) for 1 ≤ i ≤ K and 1 ≤ j ≤ J_i, where the J_i are chosen between 5 and 500.
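A sketch of the batch adaptation of the ls_i (the batch size, the step schedule, and the systematic scan over coordinates are illustrative implementation choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def adaptive_mwg(log_pi, x0, n_batches, batch_size=50):
    """Adaptive MwG sketch: coordinate i proposes increment N(0, exp(2 ls_i));
    after each batch, nudge ls_i up or down towards a 0.44 acceptance rate."""
    x = np.array(x0, dtype=float)
    d = x.size
    ls = np.zeros(d)                             # start with ls_i = 0
    for b in range(1, n_batches + 1):
        acc = np.zeros(d)
        for _ in range(batch_size):
            for i in range(d):                   # systematic scan over coordinates
                y = x.copy()
                y[i] += np.exp(ls[i]) * rng.standard_normal()
                if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
                    x = y
                    acc[i] += 1
        step = min(0.1, b ** -0.5)               # diminishing adaptation step
        ls += np.where(acc / batch_size > 0.44, step, -step)
    return x, ls
```

Shrinking the step size over batches is what delivers diminishing adaptation here.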

Metropolis-within-Gibbs (cont'd). Figure: adaptation finds good values for the ls_i.

Metropolis-within-Gibbs: comparisons.

Variable  J_i  Algorithm  ls_i  ACT    Avg Sq Dist
θ_1       5    Adaptive   2.4   2.59   14.932
θ_1       5    Fixed      0     31.69  0.863
θ_2       50   Adaptive   1.2   2.72   1.508
θ_2       50   Fixed      0     7.33   0.581
θ_3       500  Adaptive   0.1   2.72   0.150
θ_3       500  Fixed      0     2.67   0.147

Adaptive is much better than Fixed, even in dimension 503.

Adaptive Langevin. Figure: Adaptive Langevin traceplot, d = 20.

Heterogeneous scaling. Instead of random-walk Metropolis with a fixed proposal increment distribution (e.g. N(0, σ^2)), allow σ^2 = σ_x^2 to depend on x, e.g. σ_x^2 = e^a (1 + |x|)^b (with suitably modified M-H acceptance probabilities). Adjust a and b by ±1/i, once every 100 (say) iterations: decrease a if too few acceptances, increase it if too many; decrease b if there are fewer acceptances when |x| is large, increase it if there are fewer when |x| is small. A sketch of the corrected acceptance ratio follows.
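Since σ_x^2 depends on the current point, the proposal is no longer symmetric and the M-H ratio must include the proposal densities. A minimal sketch of one such step (the adaptation of a and b every 100 iterations would sit outside this function, as described above):

```python
import numpy as np

rng = np.random.default_rng(5)

def log_q(x_from, x_to, a, b):
    """Log density of proposing x_to from x_from, with sigma_x^2 = e^a (1 + |x|)^b."""
    s2 = np.exp(a) * (1.0 + abs(x_from)) ** b
    return -0.5 * np.log(2.0 * np.pi * s2) - 0.5 * (x_to - x_from) ** 2 / s2

def heterogeneous_step(log_pi, x, a, b):
    """One M-H step with state-dependent proposal variance; returns (new x, accepted?)."""
    s2 = np.exp(a) * (1.0 + abs(x)) ** b
    y = x + np.sqrt(s2) * rng.standard_normal()
    # Asymmetric proposal, so include the correction q(y -> x) / q(x -> y):
    log_alpha = log_pi(y) - log_pi(x) + log_q(y, x, a, b) - log_q(x, y, a, b)
    if np.log(rng.uniform()) < log_alpha:
        return y, True
    return x, False
```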

Heterogeneous scaling: how does it perform? Example: π(·) = N(0, 1). We are interested in estimating π(log(1 + |x|)). Figure: the estimate converges quickly to the true value (0.534822).

One-dimensional adaptation: parameter a. Figure: a moves to near 0.7, but keeps oscillating.

One-dimensional adaptation: parameter b. Figure: b moves to near 1.5, but keeps oscillating.

One-dimensional adaptation: comparisons.

Algorithm                        Acceptance Rate  ACT    Avg Sq Dist
σ^2 = exp(−5)                    0.973            55.69  0.006
σ^2 = exp(−1)                    0.813            8.95   0.234
σ^2 = 1                          0.704            4.67   0.450
σ^2 = exp(5)                     0.237            7.22   0.305
σ^2 = (2.38)^2                   0.445            2.68   0.748
σ_x^2 = e^{0.7} (1 + |x|)^{1.5}  0.458            2.55   0.779
Adaptive (as above)              0.457            2.61   0.774

Conclusion: heterogeneous adaptation is much better than arbitrarily-chosen RWM, appears slightly better than wisely-chosen RWM, and is nearly as good as an ideally-chosen variable-σ_x^2 scheme. The story is even more clear-cut for heavy-tailed targets.

Some conclusions. Good progress has been made towards finding simple usable conditions for adaptation, at least for standard algorithms. Further progress is the subject of joint work with Krys Latuszynski and Jeff Rosenthal. Algorithm families with different small sets and drift conditions are dangerous! In practice, adaptive MCMC is very easy, and it is currently implemented in some forms of BUGS. More complex adaptive schemes often require individual convergence results (e.g. for adaptive Gibbs samplers and their variants in work with Krys and Jeff, and work by Atchade and Liu on the Wang-Landau algorithm).