Recent Advances in Regional Adaptation for MCMC


Radu Craiu, Department of Statistics, University of Toronto
Collaborators: Yan Bai (Statistics, Toronto), Antonio Fabio Di Narzo (Statistics, Bologna), Jeffrey Rosenthal (Statistics, Toronto)
AdapSki, January 2011

Outline
1. Regional Adaptive MCMC: General Implementation
2. Multimodal targets and regional AMCMC: Posing the problem; The Case of Normal Mixtures
3. RAPTOR: Online EM; Definition of Regions; Implementation of RAPTOR; Theoretical Results
4. Non-Compactness and Regime-Switching: Non-Compactness; Regime-Switching

Regional AMCMC
We consider problems in which the random walk Metropolis (RWM) or the independent Metropolis (IM) algorithm is used to sample from a target distribution π with support S. Given the current state x of the Markov chain, a proposal y is drawn from a proposal distribution P(y|x) that satisfies symmetry, i.e. P(y|x) = P(x|y). The proposal y is accepted with probability min{1, π(y)/π(x)}. If y is accepted, the next state is y; otherwise it remains x. The random walk Metropolis is obtained when y = x + ε with ε ∼ f, f symmetric, usually N(0, V). If P(y|x) = P(y) then we have the independent Metropolis sampler (with a correspondingly modified acceptance ratio).
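
For reference, here is a minimal random-walk Metropolis sketch in Python; the target log-density log_pi, the proposal covariance V and the function name are illustrative choices, not part of the slides.

import numpy as np

def rwm(log_pi, x0, V, n_iter, rng=None):
    """Random-walk Metropolis: propose y = x + eps with eps ~ N(0, V) and
    accept with probability min(1, pi(y)/pi(x))."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(x0)
    L = np.linalg.cholesky(V)                     # factor of the proposal covariance
    x = np.asarray(x0, dtype=float)
    lp_x = log_pi(x)
    chain = np.empty((n_iter, d))
    for t in range(n_iter):
        y = x + L @ rng.standard_normal(d)        # symmetric proposal
        lp_y = log_pi(y)
        if np.log(rng.uniform()) < lp_y - lp_x:   # Metropolis acceptance
            x, lp_x = y, lp_y
        chain[t] = x
    return chain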

Adaptive MCMC
Uses an initialization period to gather information about the target π. The initial samples are used to produce estimates of the adaptation parameters, which are subsequently adapted on the fly until the simulation is stopped (the adaptation may continue indefinitely). Adaptation strategies are adopted based on:
(i) theoretical results on the optimality of MCMC, e.g. the optimal acceptance rate for MH algorithms;
(ii) other strategies learned by studying classical MCMC algorithms, e.g. annealing, other Metropolis samplers;
(iii) our ability to prove theoretically that the adaptive chain samples correctly from π.
It is usually easier to do (iii) if we assume S is compact.

Multimodal targets and regional AMCMC
Multimodality is a never-ending source of headaches in MCMC. The optimal proposal may depend on the region in which the current state lies.

Regional Adaptation with Dynamic Boundary
[Figure: a two-region partition of the sample space into S_1 and S_2 with regional proposals Q_1 and Q_2, showing the exact boundary and an approximate boundary.]
Today: what to do if π is approximated by a mixture of Gaussians.

Mixture representation of the target
Suppose
q_η(x) = ∑_{k=1}^K β^{(k)} N_d(x; μ^{(k)}, Σ^{(k)}),
where β^{(k)} > 0 for all 1 ≤ k ≤ K and ∑_{k=1}^K β^{(k)} = 1, is a good approximation of the target π. At each time n during the simulation one has available n dependent Monte Carlo samples which are used to fit the mixture q_η. Can we fit the mixture parameters recursively? Given the mixture parameters, define a regional RWM algorithm in which S is partitioned so that when the chain is in the k-th region we propose from the k-th component of q_η.

Online EM Updates
At time n−1 the current parameter estimates are η_{n−1} = {β_{n−1}^{(k)}, μ_{n−1}^{(k)}, Σ_{n−1}^{(k)}}_{1≤k≤K} and the available samples are {x_0, x_1, ..., x_{n−1}}; when observing x_n we update (see Andrieu and Moulines, Ann. Appl. Probab., 2006)
β_n^{(k)} = (1/(n+1)) ∑_{i=0}^{n} ν_i^{(k)},
μ_n^{(k)} = μ_{n−1}^{(k)} + ρ_n γ_n^{(k)} (x_n − μ_{n−1}^{(k)}),
Σ_n^{(k)} = Σ_{n−1}^{(k)} + ρ_n γ_n^{(k)} [(1 − γ_n^{(k)})(x_n − μ_{n−1}^{(k)})(x_n − μ_{n−1}^{(k)})^T − Σ_{n−1}^{(k)}],
s_n^{(k)} = s_{n−1}^{(k)} + (1/(n+1)) (ν_n^{(k)} − s_{n−1}^{(k)}),
where, for all 1 ≤ m ≤ n and 1 ≤ k ≤ K,
ν_m^{(k)} = β_{m−1}^{(k)} N_d(x_m; μ_{m−1}^{(k)}, Σ_{m−1}^{(k)}) / ∑_{k'} β_{m−1}^{(k')} N_d(x_m; μ_{m−1}^{(k')}, Σ_{m−1}^{(k')}),
s_m^{(k)} = (1/(m+1)) ∑_{i=0}^{m} ν_i^{(k)},
γ_m^{(k)} = ν_m^{(k)} / [(m+1) s_m^{(k)}],
ρ_m = m^{−1.1}.
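
A minimal sketch of one such recursive update in Python, written directly from the formulas above as reconstructed; the array layout and function name are my own.

import numpy as np
from scipy.stats import multivariate_normal

def online_em_step(x_n, n, beta, mu, Sigma, s):
    """One online-EM update after observing x_n at time n.
    Shapes: beta and s are (K,), mu is (K, d), Sigma is (K, d, d)."""
    K = len(beta)
    dens = np.array([beta[k] * multivariate_normal.pdf(x_n, mu[k], Sigma[k])
                     for k in range(K)])
    nu = dens / dens.sum()                    # responsibilities nu_n^{(k)}
    s_new = s + (nu - s) / (n + 1)            # running mean of the responsibilities
    gamma = nu / ((n + 1) * s_new)            # gamma_n^{(k)}
    rho = n ** (-1.1)                         # step size rho_n
    for k in range(K):
        diff = x_n - mu[k]                    # x_n - mu_{n-1}^{(k)}
        mu[k] = mu[k] + rho * gamma[k] * diff
        Sigma[k] = Sigma[k] + rho * gamma[k] * (
            (1 - gamma[k]) * np.outer(diff, diff) - Sigma[k])
    beta_new = s_new                          # beta_n^{(k)} equals the running mean of nu
    return beta_new, mu, Sigma, s_new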

Definitions of Regions
We would like to define the partition S = ∪_{k=1}^K S^{(k)} so that, on each set S^{(k)}, π is more similar to N_d(x; μ^{(k)}, Σ^{(k)}) than to any other mixture component. We maximize the sum of differences between Kullback-Leibler (KL) divergences; when K = 2 we want to maximize
KL(π, N_d(·; μ^{(2)}, Σ^{(2)}) | S^{(1)}) − KL(π, N_d(·; μ^{(1)}, Σ^{(1)}) | S^{(1)}) + KL(π, N_d(·; μ^{(1)}, Σ^{(1)}) | S^{(2)}) − KL(π, N_d(·; μ^{(2)}, Σ^{(2)}) | S^{(2)}),
where KL(f, g | A) = ∫_A log(f(x)/g(x)) f(x) dx. Define
S_n^{(k)} = {x : argmax_{k'} N_d(x; μ_n^{(k')}, Σ_n^{(k')}) = k}.
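
In code, the region label of a point is simply the index of the mixture component with the largest density there; a short sketch (the function name is mine).

import numpy as np
from scipy.stats import multivariate_normal

def region(x, mu, Sigma):
    """Return k = argmax_{k'} N_d(x; mu[k'], Sigma[k']), the index of the
    region S^(k) that contains x."""
    logdens = [multivariate_normal.logpdf(x, mu[k], Sigma[k]) for k in range(len(mu))]
    return int(np.argmax(logdens))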

Proposal Distribution
The proposal distribution depends on: (i) the mixture parameters estimated using the online EM, and (ii) the regions defined previously. In addition to local optimality (within each modal region) we seek good global traffic (between regions), so we add a global component to the proposal distribution (see also Guan and Krone, Ann. Appl. Probab., 2007). Let α = 0.3 and let Σ_n^{<w>} be the sample covariance of the whole chain. Put
Σ̃_n^{<w>} = δ Σ_n^{<w>} + ε I_d,   Σ̃_n^{(k)} = δ Σ_n^{(k)} + ε I_d,   1 ≤ k ≤ K.
The RAPTOR proposal is then
Q_n(x, dy) = (1 − α) ∑_{k=1}^K 1_{S_n^{(k)}}(x) N_d(y; x, s_d Σ̃_n^{(k)}) dy + α N_d(y; x, s_d Σ̃_n^{<w>}) dy.
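
A sketch of drawing one proposal from Q_n: with probability 1 − α a regional step using the covariance attached to the current region, otherwise a global step using the whole-chain covariance. It reuses region() from the previous sketch; the default values of s_d, δ and ε are placeholders, not values taken from the slides.

import numpy as np

def raptor_proposal(x, mu, Sigma, Sigma_w, alpha=0.3, s_d=1.0,
                    delta=1.0, eps=1e-3, rng=None):
    """Sample y ~ Q_n(x, .), a two-component mixture of Gaussian random-walk
    steps centred at the current state x."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(x)
    if rng.uniform() < alpha:                      # global component
        cov = s_d * (delta * Sigma_w + eps * np.eye(d))
    else:                                          # local component for the region of x
        k = region(x, mu, Sigma)
        cov = s_d * (delta * Sigma[k] + eps * np.eye(d))
    return rng.multivariate_normal(x, cov)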

Implementation of RAPTOR
Run in parallel a number (5-10) of RWM algorithms with fixed kernels, started in different regions of the sample space (if the local modes are known, start there), for an initialization period of M steps. At step M + 1, compute the mixture parameters using the EM algorithm, as well as the sample mean and covariance matrix, to obtain
Γ_0 = {μ_0^{(1)}, ..., μ_0^{(K)}, μ_0^{<w>}, Σ_0^{(1)}, ..., Σ_0^{(K)}, Σ_0^{<w>}}.
At each step M + n, n ≥ 1, we:
i) update the mixture parameters Γ_n,
ii) construct the partition based on Γ_n, and
iii) sample the proposal Y_{M+n} ∼ Q_n(X_{M+n}, dy).
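
The accept/reject part of step iii) is not spelled out on the slide; since Q_n is not symmetric when the current and proposed states fall in different regions, a natural choice is the full Metropolis-Hastings ratio. A sketch under that assumption, reusing region() and raptor_proposal() from above.

import numpy as np
from scipy.stats import multivariate_normal

def q_logpdf(y, x, mu, Sigma, Sigma_w, alpha, s_d, delta, eps):
    """log Q_n(x, y): density of proposing y from the current state x."""
    d = len(x)
    k = region(x, mu, Sigma)
    loc = multivariate_normal.logpdf(y, x, s_d * (delta * Sigma[k] + eps * np.eye(d)))
    glo = multivariate_normal.logpdf(y, x, s_d * (delta * Sigma_w + eps * np.eye(d)))
    return np.logaddexp(np.log(1 - alpha) + loc, np.log(alpha) + glo)

def raptor_accept_step(x, log_pi, mu, Sigma, Sigma_w, alpha=0.3, s_d=1.0,
                       delta=1.0, eps=1e-3, rng=None):
    """Propose from Q_n and apply the Metropolis-Hastings correction."""
    rng = np.random.default_rng() if rng is None else rng
    y = raptor_proposal(x, mu, Sigma, Sigma_w, alpha, s_d, delta, eps, rng)
    log_ratio = (log_pi(y) - log_pi(x)
                 + q_logpdf(x, y, mu, Sigma, Sigma_w, alpha, s_d, delta, eps)
                 - q_logpdf(y, x, mu, Sigma, Sigma_w, alpha, s_d, delta, eps))
    return y if np.log(rng.uniform()) < log_ratio else x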

Theoretical Results
Assumptions:
(A1) There is a compact subset S ⊂ R^d such that the target density π is continuous on S, positive on the interior of S, and zero outside of S.
(A2) The sequence {ρ_j : j ≥ 1} is positive and non-increasing.
(A3) For all k = 1, ..., K, Pr(lim_{i→∞} sup_{l≥i} ∑_{j=i}^{l} ρ_j γ_j^{(k)} = 0) = 1.
We work with ρ_j = j^{−1.1}, so that (A2) and (A3) are satisfied.
Convergence:
a) Assuming (A1) and (A2), RAPTOR is ergodic to π.
b) Assuming (A2) and (A3), the adaptive parameter {Γ_n}_{n≥0} converges in probability.

Non-Compactness and Regime-Switching
RAPTOR assumes S is compact. In practice the impact is small, but the theoretical gap is vexing. What to do if S is not compact? We know that a well-tuned IM sampler has better convergence properties than a RWM. However, it is usually impossible to produce a well-tuned IM sampler using only a few samples. The AMCMC literature on adapting the IM sampler suggests we first sample using a RWM and then switch to an IM. Do things have to happen so suddenly? How do we decide when it is time to switch?

Non-Compactness
Strategy: adapt only within a compact set K ⊂ S. Use the adapting kernel when the chain is in K and a fixed kernel outside K. If using the Metropolis algorithm, the proposals are assumed to have compact support. The proof of ergodicity is direct and requires (pretty much) only diminishing adaptation. Let
D_n = sup_{x∈K} ||T_{γ_{n+1}}(x, ·) − T_{γ_n}(x, ·)||_{TV}.
Diminishing Adaptation: lim_{n→∞} D_n = 0 in probability.

Regime-Switching
We propose a more gradual transition between the data-accumulation regime and the full-adaptation regime. Usually the former is done with a RWM and the latter with an IM sampler. Combining this with the non-compactness idea, we use a mixture of adapting RWM and IM kernels inside the compact set K and a fixed RWM outside K:
P̃_Γ(x, A) = 1_K(x) [λ_Γ P_Γ(x, A) + (1 − λ_Γ) Q_Γ(A)] + 1_{K^c}(x) R(x, A),
where P_Γ and R are RWM kernels using compactly supported proposal distributions (of bounded diameter), and Q_Γ is an IM kernel using a proposal with support K.
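
A sketch of drawing the next state from this composite kernel; the membership test in_K and the three one-step kernel functions are placeholders standing in for whatever adaptive RWM, adaptive IM and fixed RWM updates are actually used.

import numpy as np

def regime_switching_step(x, lam, in_K, rwm_adaptive_step, im_step, rwm_fixed_step, rng=None):
    """Inside K: with probability lam take an adaptive RWM step, otherwise an
    adaptive IM step.  Outside K: take a step from the fixed RWM kernel R."""
    rng = np.random.default_rng() if rng is None else rng
    if in_K(x):
        return rwm_adaptive_step(x) if rng.uniform() < lam else im_step(x)
    return rwm_fixed_step(x)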

Regime-Switching
We want λ_Γ to approach zero as Q_Γ gets closer to π on K. The samples used to adapt Γ should not also be used to determine the distance between Q_Γ and π. Many adaptive strategies that satisfy Diminishing Adaptation can be used for P_Γ and Q_Γ.

Regime-Switching
[Figure: schematic of the composite kernel: inside K the chain moves according to the mixture λ P(x, dy) + (1 − λ) Q(dy); outside K it moves according to the fixed kernel R(x, dy).]

Regime-Switching
Given samples y_1, ..., y_n from π, and assuming that π(x) = f(x)/M,
KL(π, q_γ) = ∫ log(π(x)/q_γ(x)) π(x) dx = A(π, q_γ) − log M,
where A(π, q_γ) = ∫ log(f(x)/q_γ(x)) π(x) dx is estimated by (1/n) ∑_{i=1}^n log(f(y_i)/q_γ(y_i)). Assume A is estimated m = 2h times and set
λ_m = min{ 0.05 + (1/m^θ) (Â_(m/2) − Â_(m)) / (Â_(1) − Â_(m)), 0.95 },
where Â_(1), ..., Â_(m) are the order statistics of the sequence of estimates. We used θ ∈ {1/10, 1/5} in all examples.
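
A small sketch of the λ_m computation from m = 2h estimates of A, following the formula as written above; the function name and the guard against a zero denominator are mine.

import numpy as np

def lambda_update(A_hat, theta=0.2):
    """Given estimates A_hat of A(pi, q_gamma), compute lambda_m from their
    order statistics, bounded between 0.05 and 0.95."""
    A = np.sort(np.asarray(A_hat, dtype=float))   # order statistics A_(1) <= ... <= A_(m)
    m = len(A)
    num = A[m // 2 - 1] - A[-1]                   # A_(m/2) - A_(m)
    den = A[0] - A[-1]                            # A_(1) - A_(m)
    lam = 0.05 if den == 0 else 0.05 + (num / den) / m ** theta
    return float(min(lam, 0.95))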

Banana Example
Let π(x) ∝ exp[ −x_1²/200 − (1/2)(x_2 + B x_1² − 100 B)² − x_3²/2 ], B = 0.1.
[Figure: 2-dim contour plots of the target, with sample paths for the chain of interest and three auxiliary chains.]
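
For concreteness, the unnormalised log-density can be coded as a direct transcription of the formula above; the function name is mine.

import numpy as np

def log_banana(x, B=0.1):
    """Unnormalised log-density of the banana-shaped target."""
    x1, x2, x3 = x
    return -x1**2 / 200.0 - 0.5 * (x2 + B * x1**2 - 100.0 * B)**2 - 0.5 * x3**2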

Banana Example
[Figure: IM proposal (left) and target (right), 2-dim projections.]

Banana Example
[Figure: evolution of the λ coefficient as the simulation proceeds.]

Banana Example
[Figure: ACF plots of x_1 and x_2 for the burn-in samples (top) and for all samples (bottom).]

Normal Mixture Example
Let π(x) = 0.5 N(x; μ_1, Σ_1) + 0.5 N(x; μ_2, Σ_2) with x ∈ R³, μ_1 = (−4, −4, 0)^T, μ_2 = (8, 8, 5)^T, Σ_1 = 2I and Σ_2 = 4I.

Normal Mixture Example
[Figure: evolution of the λ coefficient as the simulation proceeds.]

Normal Mixture Example
[Figure: ACF plots of x_1 and x_2 for the burn-in samples (top) and for all samples (bottom).]

Normal Mixture Example
[Figure: trace plots of x_1 and x_2 at different stages of the simulation: samples 1-2,000, samples 15,001-17,000, and samples 100,001-102,000.]

RAPTOR: Simulation Setup
The target distribution is
π(x; m, s) ∝ 1_{C_d}(x) [0.5 N_d(x; −m 1_d, I_d) + 0.5 N_d(x; m 1_d, s I_d)],
where C_d = [−10^10, 10^10]^d and 1_d is the d-vector of ones. We consider the scenarios given by the following ten combinations of parameter values:
(d, m, s) ∈ {(2, 1, 1), (5, 0.5, 1), (2, 1, 4), (5, 0.5, 4), (2, 0, 1), (5, 0, 1), (2, 0, 4), (5, 0, 4), (2, 2, 1), (5, 1, 1)}.

RAPTOR: Simulation Results

        (m, s)     RAPTOR   RRWM   RAPT
d = 2   (1, 1)         21     21     22
        (1, 4)         43     39     46
        (0, 1)         10      8     11
        (0, 4)         25     20     28
d = 5   (0.5, 1)       30     22     41
        (0.5, 4)       72     62    108
        (0, 1)         23     18     29
        (0, 4)         51     48     62

RAPTOR: Genetic Instability of Esophageal Cancers
Cancer cells suffer a number of genetic changes during disease progression, one of which is loss of heterozygosity (LOH). Chromosome regions with high rates of LOH are hypothesized to contain genes which regulate cell behavior, and may be of interest in cancer studies. We consider 40 measured frequencies of the event of interest (LOH) together with their associated sample sizes. The model adopted for these frequencies is the mixture model
X_i ∼ η Binomial(N_i, π_1) + (1 − η) Beta-Binomial(N_i, π_2, γ).
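
Purely as an illustration, the mixture likelihood could be coded as below. The mapping from (π_2, γ) to the beta-binomial shape parameters shown here is one common mean/dispersion convention and is my assumption; the slide does not specify the parameterization.

import numpy as np
from scipy.stats import binom, betabinom

def log_lik(eta, pi1, pi2, gamma, X, N):
    """Mixture log-likelihood for the LOH counts X with sample sizes N.
    Assumed mapping: a = pi2 * (1 - gamma) / gamma, b = (1 - pi2) * (1 - gamma) / gamma."""
    a = pi2 * (1.0 - gamma) / gamma
    b = (1.0 - pi2) * (1.0 - gamma) / gamma
    comp1 = binom.pmf(X, N, pi1)                 # Binomial component
    comp2 = betabinom.pmf(X, N, a, b)            # Beta-Binomial component
    return float(np.sum(np.log(eta * comp1 + (1.0 - eta) * comp2)))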

RAPTOR: Graphical Results
[Figure: two scatter plots of the posterior samples, π_2 plotted against π_1.]

References

Regional Adaptation
Andrieu, C. and Thoms, J. (2008). Statistics and Computing, 18, 343-373.
Bai, Y., Craiu, R. V. and Di Narzo, A. F. (201?). Journal of Computational and Graphical Statistics (to appear).
Craiu, R. V., Rosenthal, J. and Yang, C. (2009). JASA, 104, 1454-1466.
Craiu, R. V. and Rosenthal, J. (201?). Velvet Revolution in Adaptive MCMC: The Benefits of Smooth Regime Change.

Online EM
Andrieu, C. and Moulines, E. (2006). Annals of Applied Probability, 16, 1462-1505.
Cappé, O. and Moulines, E. (2009). JRSS-B, 71, 593-613.