A Review of Multiple Try MCMC algorithms for Signal Processing

Luca Martino
Image Processing Lab., Universitat de València (Spain)
Universidad Carlos III de Madrid, Leganés (Spain)

Abstract

Many applications in signal processing require the estimation of some parameters of interest given a set of observed data. More specifically, Bayesian inference needs the computation of a-posteriori estimators which are often expressed as complicated multidimensional integrals. Unfortunately, analytical expressions for these estimators cannot be found in most real-world applications, and Monte Carlo methods are the only feasible approach. A very powerful class of Monte Carlo techniques is formed by the Markov Chain Monte Carlo (MCMC) algorithms. They generate a Markov chain such that its stationary distribution coincides with the target posterior density. In this work, we perform a thorough review of MCMC methods using multiple candidates in order to select the next state of the chain, at each iteration. With respect to the classical Metropolis-Hastings method, the use of multiple try techniques fosters the exploration of the sample space. We present different Multiple Try Metropolis schemes, Ensemble MCMC methods, Particle Metropolis-Hastings algorithms and the Delayed Rejection Metropolis technique. We highlight limitations, benefits, connections and differences among the different methods, and compare them by numerical simulations.

Keywords: Markov Chain Monte Carlo, Multiple Try Metropolis, Particle Metropolis-Hastings, Particle Filtering, Monte Carlo methods, Bayesian inference

1 Introduction

Bayesian methods have become very popular in signal processing over the last years [1, 2, 3, 4]. They require the application of sophisticated Monte Carlo techniques, such as Markov chain Monte Carlo (MCMC) and particle filters, for the efficient computation of a-posteriori estimators [5, 6, 2].
More specifically, the MCMC algorithms generate a Markov chain such that its stationary distribution coincides with the posterior probability density function (pdf) [7, 8, 9]. Typically, the only requirement is to be able to evaluate the target function, where the knowledge of the normalizing constant is usually not needed. The most popular MCMC method is undoubtedly the Metropolis-Hastings (MH) algorithm [10, 11]. The MH technique is a very simple method, easy to apply: this is the reason for

its success. In MH, at each iteration, one new candidate is generated from a proposal pdf and then properly compared with the previous state of the chain, in order to decide the next state. However, the performance of MH is often not satisfactory. For instance, when the posterior is multimodal, or when the dimension of the space increases, the correlation among the generated samples is usually high and, as a consequence, the variance of the resulting estimators grows. To speed up the convergence and reduce the burn-in period of the MH chain, several extensions have been proposed in the literature. In this work, we provide an exhaustive review of more sophisticated MCMC methods that, at each iteration, consider different candidates as possible new states of the chain. More specifically, at each iteration different samples are compared by certain weights and then one of them is selected as a possible future state. The main advantage of these algorithms is that they foster the exploration of a larger portion of the sample space, decreasing the correlation among the states of the generated chain. In this work, we describe different algorithms of this family, independently introduced in the literature. The main contribution is to present them under the same framework and notation, remarking differences, relationships, limitations and strengths. All the discussed techniques yield an ergodic chain converging to the posterior density of interest (in the following, also referred to as the target pdf). The first scheme of this MCMC class, called Orientational Bias Monte Carlo (OBMC) [12, Chapter 13], was proposed in the context of molecular simulation. Later on, a more general algorithm, called Multiple Try Metropolis (MTM), was introduced [13].¹ The MTM algorithm has been extensively studied and generalized in different ways [14, 15, 16, 17, 18]. Other techniques, alternative to the MTM schemes, are the so-called Ensemble MCMC (EnMCMC) methods [19, 20, 21, 22].
They follow an approach similar to MTM but employ a different acceptance function for selecting the next state of the chain. With respect to (w.r.t.) a generic MTM scheme, EnMCMC does not require the generation of auxiliary samples (as in an MTM scheme employing a generic proposal pdf) and hence, in this sense, EnMCMC is less costly. In all the previous techniques, the candidates are drawn in a batch way and compared jointly. In the Delayed Rejection Metropolis (DRM) algorithm [23, 24, 25], in case of rejection of the novel possible state, the authors suggest performing an additional acceptance test considering a new candidate. If this candidate is again rejected, the procedure can be iterated until reaching a desired number of attempts. The main benefit of DRM is that the proposal pdf can be improved at each intermediate stage. However, the acceptance function progressively becomes more complex, so that the implementation of DRM for a great number of attempts is not straightforward (compared to the implementation of an MTM scheme with a generic number of tries). In the last years, other Monte Carlo methods which combine particle filtering and MCMC have become very popular in the signal processing community. For instance, this is the case of the Particle Metropolis-Hastings (PMH) and the Particle Marginal Metropolis-Hastings (PMMH) algorithms, which have been widely used in signal processing in order to perform inference and smoothing of dynamical and static parameters in state space models [26, 27]. PMH can be interpreted as an MTM scheme where the different candidates are generated and weighted by the use of a particle filter [28, 29]. In this work, we present PMH and PMMH and discuss

¹ MTM includes OBMC as a special case (see Section 4.1.1).

their connections and differences with the classical MTM approach. Furthermore, we describe a suitable procedure for recycling some candidates in the final Monte Carlo estimators, called Group Metropolis Sampling (GMS) [29, 30]. The GMS scheme can also be seen as a way of generating a chain of sets of weighted samples. Finally, note that other similar and related techniques can be found within the so-called data augmentation approach [31, 32]. The remainder of the paper is organized as follows. Section 2 recalls the problem statement and some background material, also introducing the required notation. The basis of MCMC and the Metropolis-Hastings (MH) algorithm are presented in Section 3. Section 4 is the core of the work, which describes the different MCMC schemes using multiple candidates. Section 6 provides some numerical results, applying different techniques in a hyperparameter tuning problem for a Gaussian Process regression model, and in a localization problem considering a wireless sensor network. Some conclusions are given in Section 7.

2 Problem statement and preliminaries

In many signal processing applications, the goal consists in inferring a variable of interest, θ = [θ_1, ..., θ_D] ∈ D ⊆ R^D, given a set of observations or measurements, y ∈ R^P. In the Bayesian framework, the total knowledge about the parameters, after the data have been observed, is represented by the posterior probability density function (pdf) [8, 9], i.e.,

    π̄(θ) = p(θ|y) = ℓ(y|θ) g(θ) / Z(y) = (1/Z) π(θ),   (1)

where ℓ(y|θ) denotes the likelihood function (i.e., the observation model), g(θ) is the prior probability density function (pdf), Z = Z(y) is the marginal likelihood (a.k.a. Bayesian evidence) [33, 34], and π(θ) = ℓ(y|θ)g(θ). In general, Z is unknown, and it is possible to evaluate only the unnormalized target function, π(θ) ∝ π̄(θ). The analytical study of the posterior density π̄(θ) is often unfeasible and integrals involving π̄(θ) are typically intractable [33, 35, 34].
For instance, one might be interested in the estimation of

    I = E_π̄[f(θ)] = ∫_D f(θ) π̄(θ) dθ   (2)
      = (1/Z) ∫_D f(θ) π(θ) dθ,   (3)

where f(θ) is a generic integrable function w.r.t. π̄.

Dynamic and static parameters. In some specific applications, the variable of interest θ can be split into two disjoint parts, θ = [x, λ], where one, x, is involved in a dynamical system (for instance, x is the hidden state in a state-space model) and the other, λ, is a static parameter (for instance, an unknown parameter of the model). The strategies for making inference about x and λ should take into account the different nature of the two parameters (e.g., see Section 4.2.2). The main notation and acronyms are summarized in Tables 1-2.
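As a concrete illustration of Eqs. (1)-(3), the following sketch (our own toy example, not taken from the paper) evaluates an unnormalized 1D Gaussian target pointwise and approximates I = E_π̄[f(θ)] by averaging posterior draws; the model, variable names and values are illustrative assumptions only.

```python
import numpy as np

# Toy 1D Gaussian model (our own illustrative assumption, not from the paper):
# likelihood y_i ~ N(theta, s2), prior theta ~ N(mu0, t2).
rng = np.random.default_rng(0)
s2, mu0, t2 = 1.0, 0.0, 4.0
y = np.array([0.8, 1.2, 0.5])

def log_target(theta):
    """log pi(theta) = log l(y|theta) + log g(theta); Z is unknown, as in Eq. (1)."""
    log_lik = -0.5 * np.sum((y - theta) ** 2) / s2
    log_prior = -0.5 * (theta - mu0) ** 2 / t2
    return log_lik + log_prior

# By conjugacy, the normalized posterior here is N(mu_post, v_post) in closed form,
# which lets us check a Monte Carlo approximation of the integral in Eq. (2).
n = len(y)
v_post = 1.0 / (n / s2 + 1.0 / t2)
mu_post = v_post * (y.sum() / s2 + mu0 / t2)

# Average of exact posterior draws approximates I with f(theta) = theta.
samples = rng.normal(mu_post, np.sqrt(v_post), size=200_000)
I_hat = samples.mean()   # should be close to the exact posterior mean mu_post
```

In realistic problems no closed-form posterior exists and only `log_target` can be evaluated, which is precisely the setting where the MCMC schemes reviewed below apply.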

Table 1: Summary of the main notation.

θ = [θ_1, ..., θ_D]   Variable of interest, θ ∈ D ⊆ R^D.
y                     Observed measurements/data.
θ = [x, λ]            x: dynamic parameters; λ: static parameters.
π̄(θ)                  Normalized posterior pdf, π̄(θ) = p(θ|y).
π(θ)                  Unnormalized posterior function, π(θ) ∝ π̄(θ).
π̂(θ)                  Particle approximation of π̄(θ).
I                     Integral of interest, in Eq. (2).
Î, Ĩ                  Estimators of I.
Z                     Marginal likelihood; normalizing constant of π(θ).
Ẑ, Z̄                  Estimators of the marginal likelihood Z.

Table 2: Summary of the main acronyms.

MCMC      Markov Chain Monte Carlo
MH        Metropolis-Hastings
I-MH      Independent Metropolis-Hastings
MTM       Multiple Try Metropolis
I-MTM     Independent Multiple Try Metropolis
I-MTM2    Independent Multiple Try Metropolis (version 2)
PMH       Particle Metropolis-Hastings
PMMH      Particle Marginal Metropolis-Hastings
GMS       Group Metropolis Sampling
EnMCMC    Ensemble MCMC
I-EnMCMC  Independent Ensemble MCMC
DRM       Delayed Rejection Metropolis
IS        Importance Sampling
SIS       Sequential Importance Sampling
SIR       Sequential Importance Resampling

2.1 Monte Carlo integration

In many practical scenarios, the integral I cannot be computed in closed form, and numerical approximations are typically required. Many deterministic quadrature methods are available in the literature [36, 37]. However, as the dimension D of the inference problem grows (θ ∈ R^D), the deterministic quadrature schemes become less efficient. In this case, a common approach consists of approximating the integral I in Eq. (2) by using Monte Carlo (MC) quadrature [8, 9]. Namely, considering T independent and identically distributed (i.i.d.) samples drawn from the posterior

target pdf, i.e., θ_1, ..., θ_T ~ π̄(θ),² we can build the consistent estimator

    Î_T = (1/T) Σ_{t=1}^{T} f(θ_t).   (4)

Î_T converges in probability to I due to the weak law of large numbers. The approximation above, Î_T, is known as a direct (or ideal) Monte Carlo estimator if the samples θ_t are independent and identically distributed (i.i.d.) from π̄. Unfortunately, in many practical applications, direct methods for drawing independent samples from π̄(θ) are not available. Therefore, different approaches are required, such as the Markov chain Monte Carlo (MCMC) techniques.

3 Markov chain Monte Carlo (MCMC) methods

An MCMC algorithm generates an ergodic Markov chain with invariant (a.k.a. stationary) density given by the posterior pdf π̄(θ) [7, 9]. Specifically, given a starting state θ_0, a sequence of correlated samples is generated, θ_0 → θ_1 → θ_2 → ... → θ_T. Even if the samples are now correlated, the estimator

    Ĩ_T = (1/T) Σ_{t=1}^{T} f(θ_t)   (5)

is consistent, regardless of the starting vector θ_0 [9].³ With respect to the direct Monte Carlo approach using i.i.d. samples, the application of an MCMC algorithm entails a loss of efficiency of the estimator Ĩ_T, since the samples are positively correlated, in general. In other words, to achieve a given variance obtained with the direct Monte Carlo estimator, it is necessary to generate more samples. Thus, in order to improve the performance of an MCMC technique we have to decrease the correlation among the states of the chain.⁴

3.1 The Metropolis-Hastings (MH) algorithm

One of the most popular and widely applied MCMC algorithms is the Metropolis-Hastings (MH) method [11, 7, 9]. Recall that we are able to evaluate point-wise a function proportional to the target, i.e., π(θ) ∝ π̄(θ). A proposal density (a pdf which is easy to draw from) is denoted as q(θ|θ_{t-1}) > 0, with θ, θ_{t-1} ∈ R^D. In Table 3, we describe the standard MH algorithm in detail.
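Alongside the formal description in Table 3, the MH loop can be sketched in a few lines. The snippet below is our own minimal random-walk MH sampler on a toy bimodal target (the target, proposal scale and all names are illustrative assumptions, not code from the paper); with a symmetric proposal, the ratio of proposal densities in the acceptance probability cancels.

```python
import numpy as np

# Minimal random-walk Metropolis-Hastings sketch (our own toy example).
rng = np.random.default_rng(1)

def log_target(theta):
    # Unnormalized bimodal target: mixture of N(-2, 1) and N(2, 1),
    # evaluated pointwise only (the normalizing constant is never needed).
    return np.logaddexp(-0.5 * (theta + 2.0) ** 2, -0.5 * (theta - 2.0) ** 2)

def mh(T=50_000, sigma=2.5, theta0=0.0):
    chain = np.empty(T)
    theta, lt = theta0, log_target(theta0)
    for t in range(T):
        prop = theta + sigma * rng.normal()   # symmetric proposal q
        lp = log_target(prop)
        # Accept with probability min(1, pi(prop)/pi(theta)); q cancels.
        if np.log(rng.uniform()) < lp - lt:
            theta, lt = prop, lp
        chain[t] = theta                      # on rejection, repeat the old state
    return chain

chain = mh()
# The target is symmetric around 0, so the chain mean should be near 0
# and both modes should be visited.
```

Note how, on rejection, the previous state is stored again: this repetition is part of the correct MH output and is what the later multiple-try schemes try to mitigate by offering several candidates per iteration.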
The algorithm returns the sequence of states {θ_1, θ_2, ..., θ_t, ..., θ_T} (or a subset of them, removing the burn-in period if an estimate of its length is available). We can see that the next state θ_t can be the proposed sample θ' (with probability α) or the previous state θ_{t-1} (with probability 1 − α). Under some mild regularity conditions, when t grows, the pdf of the current state θ_t converges to the target density π̄(θ) [9]. The MH algorithm satisfies the so-called detailed

² In this work, for simplicity, we use the same notation for denoting a random variable or one realization of a random variable.
³ Recall we are assuming that the Markov chain is ergodic and hence the starting value is forgotten.
⁴ For the sake of simplicity, we use all the generated states in the final estimators, without removing any burn-in period [9].

Table 3: The MH algorithm.

1. Initialization: Choose an initial state θ_0.
2. FOR t = 1, ..., T:
   (a) Draw a sample θ' ~ q(θ|θ_{t-1}).
   (b) Accept the new state, θ_t = θ', with probability
       α(θ_{t-1}, θ') = min[1, (π(θ') q(θ_{t-1}|θ')) / (π(θ_{t-1}) q(θ'|θ_{t-1}))].   (6)
       Otherwise, set θ_t = θ_{t-1}.
3. Return: {θ_t}_{t=1}^{T}.

balance condition, which is sufficient to guarantee that the output chain is ergodic and has π̄ as stationary distribution [7, 9]. Note that the acceptance probability α can be rewritten as

    α(θ_{t-1}, θ') = min[1, (π(θ') q(θ_{t-1}|θ')) / (π(θ_{t-1}) q(θ'|θ_{t-1}))] = min[1, w(θ'|θ_{t-1}) / w(θ_{t-1}|θ')],   (7)

where we have denoted w(θ'|θ_{t-1}) = π(θ')/q(θ'|θ_{t-1}) and w(θ_{t-1}|θ') = π(θ_{t-1})/q(θ_{t-1}|θ'), in a similar fashion to the importance sampling weights of θ' and θ_{t-1} [9]. If the proposal pdf is independent of the previous state, i.e., q(θ|θ_{t-1}) = q(θ), the acceptance function depends on the ratio of the importance weights w(θ') = π(θ')/q(θ') and w(θ_{t-1}) = π(θ_{t-1})/q(θ_{t-1}), as shown in Table 4. We refer to this special MH case as the Independent MH (I-MH) algorithm. It is strictly related to other techniques described in the following (e.g., see Section 4.2.1).

4 MCMC using multiple candidates

In the standard MH technique described above, at each iteration one new sample θ' is generated and tested against the previous state θ_{t-1} by the acceptance probability α(θ_{t-1}, θ'). Other generalized MH schemes generate several candidates at each iteration to be tested as the new possible state. In all these schemes, an extended acceptance probability α is properly designed in order to guarantee the ergodicity of the chain. Figure 1 provides a graphical representation of the difference between MH and the techniques using several candidates. Below, we describe the most important examples of this class of MCMC algorithms. In most of them, a single MH-type test is performed at each iteration, whereas in other methods a sequence of tests is employed.
Furthermore, most of these techniques use an Importance Sampling (IS) approximation of the target density [8, 9] in order to improve the proposal procedure employed

Table 4: The Independent MH (I-MH) algorithm.

1. Initialization: Choose an initial state θ_0.
2. FOR t = 1, ..., T:
   (a) Draw a sample θ' ~ q(θ).
   (b) Accept the new state, θ_t = θ', with probability
       α(θ_{t-1}, θ') = min[1, w(θ') / w(θ_{t-1})].   (8)
       Otherwise, set θ_t = θ_{t-1}.
3. Return: {θ_t}_{t=1}^{T}.

Figure 1: Graphical representation of the classical MH method and the MCMC schemes using different candidates at each iteration. (a) Standard MH; (b) MCMC using multiple tries.

within an MH-type algorithm. Namely, they build an IS approximation, and then draw one sample from this approximation (resampling step). Finally, the selected sample is compared with the previous state of the chain, θ_{t-1}, according to a suitable generalized acceptance probability α. It can be proved that all the methodologies presented in this work yield an ergodic chain with the posterior π̄ as invariant density.

4.1 The Multiple Try Metropolis (MTM) algorithm

The Multiple Try Metropolis (MTM) algorithms are examples of this class of methods, where N samples θ^(1), θ^(2), ..., θ^(N) (also called "tries" or "candidates") are drawn from the proposal pdf q(θ) at each iteration [13, 14, 15, 16, 17, 38, 39]. Then, one of them is selected according to some suitable weights. Finally, the selected candidate is accepted or rejected as the new state according to a generalized probability function α. The MTM algorithm is given in Table 5. For the sake of simplicity, we have considered the use of the importance weights w(θ|θ_{t-1}) = π(θ)/q(θ|θ_{t-1}), but this is not the unique possibility, as also

shown below [13, 38]. In its general form, when the proposal depends on the previous state of the chain, q(θ|θ_{t-1}), the MTM requires the generation of N − 1 auxiliary samples, v^(i), which are employed in the computation of the acceptance function α. They are needed in order to guarantee the ergodicity. Indeed, the resulting MTM kernel satisfies the detailed balance condition, so that the chain is reversible [13, 38]. Note that for N = 1, we have θ^(j) = θ^(1), v^(1) = θ_{t-1}, and the acceptance probability of the MTM method becomes

    α(θ_{t-1}, θ^(1)) = min[1, w(θ^(1)|θ_{t-1}) / w(v^(1)|θ^(1))]
                      = min[1, w(θ^(1)|θ_{t-1}) / w(θ_{t-1}|θ^(1))]
                      = min[1, (π(θ^(1)) q(θ_{t-1}|θ^(1))) / (π(θ_{t-1}) q(θ^(1)|θ_{t-1}))],   (9)

that is, the acceptance probability of the classical MH technique. Several variants have been studied, for instance, with correlated tries and considering the use of different proposal pdfs [38, 40].

Remark 1. The MTM method in Table 5 needs, at step 2d, the generation of N − 1 auxiliary samples and, at step 2e, the computation of their weights (and, as a consequence, N − 1 additional evaluations of the target pdf are required), which are only employed in the computation of the acceptance function α.

4.1.1 Generic form of the weights

The importance weights are not the unique possible choice. It is possible to show that the MTM algorithm generates an ergodic chain with invariant density π̄, if the weight function w(θ|θ_{t-1}) is chosen with the form

    w(θ|θ_{t-1}) = π(θ) q(θ_{t-1}|θ) ξ(θ_{t-1}, θ),   (13)

where ξ(θ_{t-1}, θ) = ξ(θ, θ_{t-1}) for all θ, θ_{t-1} ∈ D. For instance, choosing ξ(θ_{t-1}, θ) = 1/(q(θ|θ_{t-1}) q(θ_{t-1}|θ)), we obtain the importance weights w(θ|θ_{t-1}) = π(θ)/q(θ|θ_{t-1}) used above. If we set ξ(θ_{t-1}, θ) = 1, we have w(θ|θ_{t-1}) = π(θ) q(θ_{t-1}|θ). Another interesting example can be employed if the proposal is symmetric, i.e., q(θ|θ_{t-1}) = q(θ_{t-1}|θ). In this case, we can choose ξ(θ_{t-1}, θ) = 1/q(θ_{t-1}|θ) and then w(θ|θ_{t-1}) = w(θ) = π(θ), i.e., the weights only depend on the value of the target density at θ.
Thus, MTM contains the Orientational Bias Monte Carlo (OBMC) scheme [12, Chapter 13] as a special case, when a symmetric proposal pdf is employed, and one candidate is chosen with weights proportional to the target density, i.e., w(θ|θ_{t-1}) = π(θ).

4.1.2 Independent Multiple Try Metropolis (I-MTM) schemes

The MTM method described in Table 5 requires drawing 2N − 1 samples at each iteration (N candidates and N − 1 auxiliary samples), and N − 1 of them are only used in the acceptance probability function. The generation of the auxiliary points v^(1), ..., v^(j-1), v^(j+1), ..., v^(N) ~ q(θ|θ^(j))

Table 5: The MTM algorithm with importance sampling weights.

1. Initialization: Choose an initial state θ_0.
2. FOR t = 1, ..., T:
   (a) Draw θ^(1), θ^(2), ..., θ^(N) ~ q(θ|θ_{t-1}).
   (b) Compute the importance weights
       w(θ^(n)|θ_{t-1}) = π(θ^(n)) / q(θ^(n)|θ_{t-1}), with n = 1, ..., N.   (10)
   (c) Select one sample θ^(j) ∈ {θ^(1), ..., θ^(N)}, according to the probability mass function w̄_n = w(θ^(n)|θ_{t-1}) / Σ_{i=1}^{N} w(θ^(i)|θ_{t-1}).
   (d) Draw N − 1 auxiliary samples from q(θ|θ^(j)), denoted as v^(1), ..., v^(j-1), v^(j+1), ..., v^(N) ~ q(θ|θ^(j)), and set v^(j) = θ_{t-1}.
   (e) Compute the weights of the auxiliary samples
       w(v^(n)|θ^(j)) = π(v^(n)) / q(v^(n)|θ^(j)), with n = 1, ..., N.   (11)
   (f) Set θ_t = θ^(j) with probability
       α(θ_{t-1}, θ^(j)) = min[1, Σ_{n=1}^{N} w(θ^(n)|θ_{t-1}) / Σ_{n=1}^{N} w(v^(n)|θ^(j))],   (12)
       otherwise, set θ_t = θ_{t-1}.
3. Return: {θ_t}_{t=1}^{T}.

can be avoided if the proposal pdf is independent of the previous state, i.e., q(θ|θ_{t-1}) = q(θ). Indeed, in this case, we should draw N − 1 samples again from q(θ) at step 2d of Table 5. Since we have already drawn N samples from q(θ) at step 2a of Table 5, we can set

    v^(1) = θ^(1), ..., v^(j-1) = θ^(j-1), v^(j) = θ^(j+1), ..., v^(N-1) = θ^(N),   (14)

without jeopardizing the ergodicity of the chain (recall that v^(j) = θ_{t-1}). Hence, we can avoid step 2d, and the acceptance function can be rewritten as

    α(θ_{t-1}, θ^(j)) = min[1, (w(θ^(j)) + Σ_{n=1,n≠j}^{N} w(θ^(n))) / (w(θ_{t-1}) + Σ_{n=1,n≠j}^{N} w(θ^(n)))].   (15)

The I-MTM algorithm is provided in Table 6.

Remark 2. An I-MTM method requires only N new evaluations of the target pdf at each iteration, instead of 2N − 1 new evaluations in the generic MTM scheme in Table 5.

Note that we can also write α(θ_{t-1}, θ^(j)) as

    α(θ_{t-1}, θ^(j)) = min[1, Ẑ_1 / Ẑ_2],   (16)

where we have denoted

    Ẑ_1 = (1/N) Σ_{n=1}^{N} w(θ^(n)),   Ẑ_2 = (1/N) [w(θ_{t-1}) + Σ_{n=1,n≠j}^{N} w(θ^(n))].   (17)

Table 6: The Independent Multiple Try Metropolis (I-MTM) algorithm.

1. Initialization: Choose an initial state θ_0.
2. FOR t = 1, ..., T:
   (a) Draw θ^(1), θ^(2), ..., θ^(N) ~ q(θ).
   (b) Compute the importance weights
       w(θ^(n)) = π(θ^(n)) / q(θ^(n)), with n = 1, ..., N.   (18)
   (c) Select one sample θ^(j) ∈ {θ^(1), ..., θ^(N)}, according to the probability mass function w̄_n = w(θ^(n)) / Σ_{i=1}^{N} w(θ^(i)).
   (d) Set θ_t = θ^(j) with probability
       α(θ_{t-1}, θ^(j)) = min[1, (w(θ^(j)) + Σ_{n=1,n≠j}^{N} w(θ^(n))) / (w(θ_{t-1}) + Σ_{n=1,n≠j}^{N} w(θ^(n)))] = min[1, Ẑ_1 / Ẑ_2],   (19)
       otherwise, set θ_t = θ_{t-1}.
3. Return: {θ_t}_{t=1}^{T}.
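A minimal sketch of the I-MTM steps of Table 6 may help fix ideas; the toy Gaussian target, the proposal and all parameter values below are our own illustrative assumptions, not code from the paper.

```python
import numpy as np

# Sketch of I-MTM (Table 6) on a toy problem: unnormalized standard Gaussian
# target pi(theta) = exp(-theta^2/2), independent proposal q = N(0, 3^2).
rng = np.random.default_rng(2)
q_std = 3.0

def target(theta):                         # unnormalized target pi
    return np.exp(-0.5 * theta ** 2)

def q_pdf(theta):
    return np.exp(-0.5 * (theta / q_std) ** 2) / (q_std * np.sqrt(2 * np.pi))

def i_mtm(T=20_000, N=10, theta0=0.0):
    chain = np.empty(T)
    theta = theta0
    for t in range(T):
        cands = rng.normal(0.0, q_std, size=N)   # step (a): N candidates from q
        w = target(cands) / q_pdf(cands)         # step (b): weights, Eq. (18)
        j = rng.choice(N, p=w / w.sum())         # step (c): resample one candidate
        w_prev = target(theta) / q_pdf(theta)
        rest = w.sum() - w[j]                    # sum over n != j
        alpha = min(1.0, (w[j] + rest) / (w_prev + rest))   # step (d), Eq. (19)
        if rng.uniform() < alpha:
            theta = cands[j]
        chain[t] = theta
    return chain

chain = i_mtm()
# pi is a standard Gaussian, so the chain mean ~ 0 and variance ~ 1.
```

Only N target evaluations per iteration are needed here, in line with Remark 2, since the previous state's weight can be cached in practice.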

Alternative version (I-MTM2). From IS theory, we know that Ẑ_1 = (1/N) Σ_{n=1}^{N} w(θ^(n)) is an unbiased estimator of the normalizing constant Z of the target π (a.k.a. Bayesian evidence or marginal likelihood). This suggests replacing Ẑ_2 with other unbiased estimators of Z (without jeopardizing the ergodicity of the chain). For instance, instead of recycling the samples generated in the same iteration as auxiliary points as in Eq. (14), we could reuse samples generated in the previous iteration t − 1. This alternative version of the I-MTM method (I-MTM2) is given in Table 7. Note that, in both cases I-MTM and I-MTM2, the selected candidate θ^(j) is drawn from the following particle approximation of the target π̄,

    π̂(θ|θ^(1:N)) = Σ_{i=1}^{N} w̄(θ^(i)) δ(θ − θ^(i)),   w̄_i = w̄(θ^(i)) = w(θ^(i)) / Σ_{n=1}^{N} w(θ^(n)),   (20)

i.e., θ^(j) ~ π̂(θ|θ^(1:N)). The acceptance probability α used in I-MTM2 can also be justified considering a proper IS weighting of a resampled particle [41] and using the expression (7) related to the standard MH method, as discussed in [29]. Figure 2 provides a graphical representation of the I-MTM schemes.

Table 7: Alternative version of the I-MTM method (I-MTM2).

1. Initialization: Choose an initial state θ_0, and obtain an initial approximation Ẑ_0 ≈ Z.
2. FOR t = 1, ..., T:
   (a) Draw θ^(1), θ^(2), ..., θ^(N) ~ q(θ).
   (b) Compute the importance weights
       w(θ^(n)) = π(θ^(n)) / q(θ^(n)), with n = 1, ..., N.   (21)
   (c) Select one sample θ^(j) ∈ {θ^(1), ..., θ^(N)}, according to the probability mass function w̄_n = w(θ^(n)) / (N Ẑ), where Ẑ = (1/N) Σ_{i=1}^{N} w(θ^(i)).
   (d) Set θ_t = θ^(j) and Ẑ_t = Ẑ with probability
       α(θ_{t-1}, θ^(j)) = min[1, Ẑ / Ẑ_{t-1}],   (22)
       otherwise, set θ_t = θ_{t-1} and Ẑ_t = Ẑ_{t-1}.
3. Return: {θ_t}_{t=1}^{T}.
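The Ẑ-based acceptance test of Table 7 can be sketched in the same style; again, the toy target, proposal and all names below are our own illustrative assumptions, with the initial Ẑ_0 obtained simply by accepting the first iteration.

```python
import numpy as np

# Sketch of I-MTM2 (Table 7) on a toy problem: unnormalized standard Gaussian
# target, independent proposal q = N(0, 2^2). The acceptance compares the current
# evidence estimate Z_hat with the one stored at the last accepted iteration.
rng = np.random.default_rng(3)

def target(theta):                  # unnormalized standard Gaussian
    return np.exp(-0.5 * theta ** 2)

def q_pdf(theta):
    return np.exp(-0.5 * (theta / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

def i_mtm2(T=20_000, N=10, theta0=0.0):
    chain = np.empty(T)
    theta, Z_prev = theta0, None
    for t in range(T):
        cands = rng.normal(0.0, 2.0, size=N)
        w = target(cands) / q_pdf(cands)     # Eq. (21)
        Z_hat = w.mean()                     # unbiased estimate of Z
        j = rng.choice(N, p=w / w.sum())     # resample from Eq. (20)
        # First iteration plays the role of the initial approximation Z_0.
        if Z_prev is None or rng.uniform() < min(1.0, Z_hat / Z_prev):
            theta, Z_prev = cands[j], Z_hat  # Eq. (22)
        chain[t] = theta
    return chain

chain = i_mtm2()
```

The only bookkeeping difference w.r.t. I-MTM is that the denominator of the acceptance ratio is the evidence estimate stored at the previous accepted iteration, rather than one recomputed from the current batch.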

Figure 2: Graphical representation of the I-MTM schemes (generation of the candidates θ^(1), ..., θ^(N) ~ q(θ); construction of the approximation π̂; resampling of θ^(j); acceptance test α(θ_{t-1}, θ^(j))).

4.1.3 Reusing candidates in parallel I-MTM chains

Let us consider running C independent parallel chains yielded by an I-MTM scheme. In this case, we have NC evaluations of the target function π and C resampling steps performed at each iteration (so that we have NCT total target evaluations and CT total resampling steps). In the literature, different authors have suggested recycling the N candidates, θ^(1), θ^(2), ..., θ^(N) ~ q(θ), in order to reduce the number of evaluations of the target pdf [42]. The idea is to perform C times the resampling procedure considering the same set of candidates, θ^(1), θ^(2), ..., θ^(N) (a similar approach was proposed in [20]). Each resampled candidate is then tested as a possible future state of one chain. In this scenario, the number of target evaluations per iteration is only N (hence, the total number of evaluations of π is NT). However, the resulting C parallel chains are no longer independent, and there is a loss of performance w.r.t. the independent chains. There also exists the possibility of reducing the total number of resampling steps, as suggested in the Block Independent MTM scheme [42] (but the dependence among the chains grows even more).

4.2 Particle Metropolis-Hastings (PMH) method

Assume that the variable of interest is formed by only a dynamical variable, i.e., θ = x = x_{1:D} = [x_1, ..., x_D] (see Section 2). This is the case of inferring a hidden state in a state-space model, for instance. More generally, let us assume that we are able to factorize the target density as

    π̄(x) ∝ π(x) = γ_1(x_1) γ_2(x_2|x_1) ··· γ_D(x_D|x_{1:D-1})   (23)
                 = γ_1(x_1) Π_{d=2}^{D} γ_d(x_d|x_{1:d-1}).   (24)

The Particle Metropolis-Hastings (PMH) method [43] is an efficient MCMC technique, proposed independently of the MTM algorithm, specifically designed for being applied in this framework.
Indeed, we can take advantage of the factorization of the target pdf and consider a proposal pdf decomposed in the same fashion,

    q(x) = q_1(x_1) q_2(x_2|x_1) ··· q_D(x_D|x_{1:D-1}) = q_1(x_1) Π_{d=2}^{D} q_d(x_d|x_{1:d-1}).

Then, as in a batch IS scheme, given an n-th sample x^(n) = x^(n)_{1:D} ~ q(x), with x^(n)_d ~ q_d(x_d|x_{1:d−1}), we assign the importance weight

w(x^(n)) = w^(n)_D = π(x^(n))/q(x^(n)) = [γ_1(x^(n)_1) γ_2(x^(n)_2|x^(n)_1) ··· γ_D(x^(n)_D|x^(n)_{1:D−1})] / [q_1(x^(n)_1) q_2(x^(n)_2|x^(n)_1) ··· q_D(x^(n)_D|x^(n)_{1:D−1})].    (25)

The previous expression suggests a recursive procedure for computing the importance weights: starting with w^(n)_1 = π(x^(n)_1)/q(x^(n)_1) and then

w^(n)_d = w^(n)_{d−1} β^(n)_d = ∏_{j=1}^{d} β^(n)_j,    d = 1, ..., D,    (26)

where we have set β^(n)_1 = w^(n)_1 and

β^(n)_d = γ_d(x^(n)_d|x^(n)_{1:d−1}) / q_d(x^(n)_d|x^(n)_{1:d−1}),    (27)

for d = 2, ..., D. This method is usually referred to as Sequential Importance Sampling (SIS). If resampling steps are also employed at some iterations, the method is called Sequential Importance Resampling (SIR), a.k.a. particle filtering (PF) (see Appendix B). PMH uses a SIR approach for providing the particle approximation π̂(x|x^(1:N)) = Σ_{i=1}^{N} w̄^(i)_D δ(x − x^(i)), where w̄^(i)_D = w^(i)_D / Σ_{n=1}^{N} w^(n)_D and w^(i)_D = w(x^(i)), obtained using Eq. (26) (with a proper weighting of a resampled particle [41, 29]). Then, one particle is drawn from this approximation, i.e., with a probability proportional to the corresponding normalized weight.

Estimation of the marginal likelihood Z in particle filtering. SIR combines the SIS approach with the application of resampling procedures. In SIR, a consistent estimator of Z is given by

Z̃ = ∏_{d=1}^{D} [ Σ_{n=1}^{N} w̄^(n)_{d−1} β^(n)_d ],    where    w̄^(n)_{d−1} = w^(n)_{d−1} / Σ_{i=1}^{N} w^(i)_{d−1}.    (28)

Due to the application of the resampling, in SIR the standard estimator

Ẑ = (1/N) Σ_{n=1}^{N} w^(n)_D = (1/N) Σ_{n=1}^{N} w(x^(n)),    (29)

is a possible alternative only if a proper weighting of the resampled particles is applied [29, 41] (otherwise, it is not an estimator of Z). If a proper weighting of a resampled particle is employed, both Z̃ and Ẑ are equivalent estimators of Z [41, 29, 28]. Without the use of resampling steps (i.e., in SIS), Z̃ and Ẑ are always equivalent estimators [29]. See also Appendix B.
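The recursion in Eqs. (26)-(27) can be sketched in a few lines of Python. This is only an illustrative helper (the function names `log_gamma` and `log_q` are placeholders for user-supplied factors of the target and proposal); it accumulates the incremental log-weights over the D components of each path:

```python
import numpy as np

def sis_weights(x, log_gamma, log_q):
    """Sequential Importance Sampling weights, Eqs. (25)-(27).

    x         : array of shape (N, D), N particle paths of length D
    log_gamma : function (d, x_d, x_past) -> log gamma_d(x_d | x_{1:d-1})
    log_q     : function (d, x_d, x_past) -> log q_d(x_d | x_{1:d-1})
    Returns unnormalized weights w_D^{(n)}, computed recursively as
    w_d = w_{d-1} * beta_d, with beta_d = gamma_d / q_d (in log-domain).
    """
    N, D = x.shape
    log_w = np.zeros(N)
    for d in range(D):
        for n in range(N):
            past = x[n, :d]
            log_w[n] += log_gamma(d, x[n, d], past) - log_q(d, x[n, d], past)
    # subtract the max for numerical stability (weights up to a constant)
    return np.exp(log_w - np.max(log_w))
```

Working in the log-domain avoids the underflow that the raw product of D ratios would quickly cause.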
The complete description of PMH is provided in Table 8, considering the use of Z̃. At each iteration, a particle filter is run in order to provide an approximation by weighted samples of the

Table 8: Particle Metropolis-Hastings (PMH) algorithm.

1. Initialization: Choose an initial state x_0 and obtain an initial estimation Z̃_0 ≈ Z.
2. For t = 1, ..., T:
   (a) Employ a SIR approach for drawing N particles and weighting them, {x^(i), w^(i)_D}_{i=1}^N, i.e., obtain sequentially a particle approximation π̂(x) = Σ_{i=1}^{N} w̄^(i)_D δ(x − x^(i)), where x^(i) = [x^(i)_1, ..., x^(i)_D]. Furthermore, obtain also Z̃* as in Eq. (28).
   (b) Draw x* ~ π̂(x|x^(1:N)), i.e., choose a particle x* ∈ {x^(1), ..., x^(N)} with probability w̄^(i)_D, i = 1, ..., N.
   (c) Set x_t = x* and Z̃_t = Z̃* with probability

       α = min[1, Z̃*/Z̃_{t−1}],    (30)

       otherwise set x_t = x_{t−1} and Z̃_t = Z̃_{t−1}.
3. Return: {x_t}_{t=1}^T, where x_t = [x_{1,t}, ..., x_{D,t}].

measure of the target. Then, a sample among the weighted particles is chosen by one resampling step. This selected sample is then accepted or rejected as the next state of the chain according to an MH-type acceptance probability, which involves two estimators of the marginal likelihood Z. PMH is also related to another popular method in molecular simulation, called Configurational Bias Monte Carlo (CBMC) [44].

4.2.1 Relationship among I-MTM2, PMH and I-MH

A simple look at I-MTM2 and PMH shows that they are strictly related [28]. Indeed, the structure of the two algorithms coincides. The main difference lies in the fact that the candidates in PMH are generated sequentially, using a SIR scheme. If no resampling steps are applied, then I-MTM2 and PMH are exactly the same algorithm, with the candidates drawn in a batch or sequential way. Hence, the application of resampling steps is the main difference between the generation procedures of PMH and I-MTM2. Owing to the use of resampling, the candidates {x^(1), ..., x^(N)} proposed by PMH are not independent (differently from I-MTM2). As an example, Figure 3 shows N = 40 particles (with D = 10) generated and weighted by SIS and SIR procedures (each path is a generated particle x^(i) = x^(i)_{1:10}). The generation of correlated samples can also be considered

in MTM methods without jeopardizing the ergodicity of the chain, as shown for instance in [16]. Another difference is the use of Z̃ or Ẑ. However, if a proper weighting of a resampled particle is employed, both estimators coincide [29, 41, 28]. Furthermore, both I-MTM2 and PMH can be considered as I-MH schemes where a proper importance sampling weighting of a resampled particle is employed [41]. Namely, I-MTM2 and PMH are equivalent to an I-MH technique using the following complete proposal pdf,

q̃(θ) = ∫ π̂(θ|θ^(1:N)) [ ∏_{i=1}^{N} q(θ^(i)) ] dθ^(1:N),    (31)

where π̂ is given in Eq. (20), i.e., θ^(j) ~ q̃(θ), and then considering the generalized (proper) IS weighting, w(θ^(j)) = Ẑ, w(θ_{t−1}) = Ẑ_{t−1} [29, 41]. For further details see Appendix A.

Figure 3: Graphical representation of SIS and SIR. We consider as target density a multivariate Gaussian pdf, π̄(x) = ∏_{d=1}^{10} N(x_d|2, 1/2). In each figure, every component of the different particles is represented, so that each particle x^(i) = x^(i)_{1:D} forms a path (with D = 10). We set N = 40. The normalized weights w̄^(i)_D ∝ w(x^(i)) corresponding to each figure are also shown at the bottom. The line-width of each path is proportional to the corresponding weight w̄^(i)_D. The particle corresponding to the greatest weight is always depicted in black. The proposal pdfs used are q_1(x_1) = N(x_1|2, 1) and q(x_d|x_{d−1}) = N(x_d|x_{d−1}, 1) for d ≥ 2. (a) Batch IS or SIS. (b) SIR with resampling steps at the iterations d = 4, 8.

4.2.2 Particle Marginal Metropolis-Hastings (PMMH) method

Assume now that the variable of interest is formed by both dynamical and static variables, i.e., θ = [x, λ]. For instance, this is the case of inferring both a hidden state x in a state-space model and the static parameters λ of the model. The Particle Marginal Metropolis-Hastings (PMMH) technique is an extension of PMH which addresses this problem.
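The PMH loop of Table 8 can be sketched for the toy Gaussian example of Figure 3. This is an illustrative, minimal implementation (resampling at every step, so Z̃ reduces to the product of the averaged incremental weights; all names are ours, not the paper's):

```python
import numpy as np

def run_sir(N, D, rng):
    """SIR generation for the Figure-3 toy target pi(x) = prod_d N(x_d|2, 1/2),
    with q_1 = N(2, 1) and q_d = N(x_{d-1}, 1). Returns paths, final
    normalized weights and the log of the estimator Z~ of Eq. (28)."""
    x = np.zeros((N, D))
    log_Z = 0.0
    for d in range(D):
        mean = np.full(N, 2.0) if d == 0 else x[:, d - 1]
        x[:, d] = mean + rng.standard_normal(N)
        # incremental weight beta_d = gamma_d(x_d) / q_d(x_d | x_{d-1})
        log_beta = (-(x[:, d] - 2.0) ** 2 - 0.5 * np.log(np.pi)) \
                 - (-0.5 * (x[:, d] - mean) ** 2 - 0.5 * np.log(2.0 * np.pi))
        m = log_beta.max()
        beta = np.exp(log_beta - m)
        log_Z += m + np.log(beta.mean())          # accumulate log Z~
        wbar = beta / beta.sum()
        if d < D - 1:                              # multinomial resampling
            x = x[rng.choice(N, size=N, p=wbar)]
    return x, wbar, log_Z

def pmh(T, N=20, D=10, seed=0):
    """PMH (Table 8): accept the resampled path with prob. min(1, Z*/Z_{t-1})."""
    rng = np.random.default_rng(seed)
    x_t, log_Z_t = np.zeros(D), -np.inf            # first proposal is accepted
    chain = np.empty((T, D))
    for t in range(T):
        x, wbar, log_Z = run_sir(N, D, rng)
        x_star = x[rng.choice(N, p=wbar)]          # draw x* ~ pi_hat
        if np.log(rng.uniform()) < log_Z - log_Z_t:
            x_t, log_Z_t = x_star, log_Z
        chain[t] = x_t
    return chain
```

The chain's marginal means should settle near 2, the mean of each target component.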

Let us consider x = x_{1:D} = [x_1, x_2, ..., x_D] ∈ R^{d_x}, and an additional model parameter λ ∈ R^{d_λ} to be inferred as well (θ = [x, λ] ∈ R^D, with D = d_x + d_λ). Assuming a prior pdf g_λ(λ) over λ, and a factorized complete posterior pdf π̄(θ) = π̄(x, λ),

π̄(x, λ) ∝ π(x, λ) = g_λ(λ) π(x|λ),

where π(x|λ) = γ_1(x_1|λ) ∏_{d=2}^{D} γ_d(x_d|x_{1:d−1}, λ). For a specific value of λ, we can use a particle filter approach, obtaining the approximation π̂(x|λ) = Σ_{n=1}^{N} w̄^(n)_D δ(x − x^(n)) and the estimator Z̃(λ), as described above. The PMMH technique is then summarized in Table 9. The pdf q_λ(λ|λ_{t−1}) denotes the proposal density for generating possible values of λ. Observe that, with the specific choice q_λ(λ|λ_{t−1}) = g_λ(λ), the acceptance function becomes

α = min[1, Z̃(λ*)/Z̃(λ_{t−1})].    (32)

Note also that PMMH w.r.t. λ can be interpreted as an MH method where the posterior cannot be evaluated point-wise. Indeed, Z̃(λ) approximates the marginal likelihood p(y|λ) [45].

Table 9: Particle Marginal MH (PMMH) algorithm

1. Initialization: Choose the initial states x_0, λ_0, and an initial approximation Z̃_0(λ) ≈ Z(λ) ∝ p(y|λ).
2. For t = 1, ..., T:
   (a) Draw λ* ~ q_λ(λ|λ_{t−1}).
   (b) Given λ*, run a particle filter obtaining π̂(x|λ*) = Σ_{n=1}^{N} w̄^(n)_D δ(x − x^(n)) and Z̃(λ*), as in Eq. (28).
   (c) Draw x* ~ π̂(x|λ*, x^(1:N)), i.e., choose a particle x* ∈ {x^(1), ..., x^(N)} with probability w̄^(i)_D, i = 1, ..., N.
   (d) Set λ_t = λ*, x_t = x*, with probability

       α = min[1, (Z̃(λ*) g_λ(λ*) q_λ(λ_{t−1}|λ*)) / (Z̃(λ_{t−1}) g_λ(λ_{t−1}) q_λ(λ*|λ_{t−1}))].    (33)

       Otherwise, set λ_t = λ_{t−1} and x_t = x_{t−1}.
3. Return: {x_t}_{t=1}^T and {λ_t}_{t=1}^T.
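In the log-domain, the acceptance probability of Eq. (33) is just a sum of differences. A minimal helper (the argument names are ours; all log-values are assumed to be supplied by the user's particle filter, prior and proposal):

```python
def pmmh_log_alpha(log_Z_new, log_Z_prev, log_prior_new, log_prior_prev,
                   log_q_prev_given_new, log_q_new_given_prev):
    """Log of the PMMH acceptance probability, Eq. (33).

    log_Z_*     : log of the particle-filter estimators Z~(lambda*) / Z~(lambda_{t-1})
    log_prior_* : log g_lambda at the proposed / previous parameter
    log_q_*     : log q_lambda in the two directions
    """
    log_ratio = (log_Z_new + log_prior_new + log_q_prev_given_new) \
              - (log_Z_prev + log_prior_prev + log_q_new_given_prev)
    return min(0.0, log_ratio)
```

With the choice q_λ = g_λ, the prior and proposal terms cancel pairwise and the expression reduces to the Z̃ ratio of Eq. (32).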

4.3 Group Metropolis Sampling

The auxiliary weighted samples in the I-MTM schemes (i.e., the N − 1 samples drawn at each iteration that are not selected to be compared with the previous state θ_{t−1}) can be recycled, providing consistent and more efficient estimators [41, 29]. The so-called Group Metropolis Sampling (GMS) method is shown in Table 10. GMS yields a sequence of sets of weighted samples S_t = {θ_{n,t}, ρ_{n,t}}_{n=1}^N, for t = 1, ..., T, where we have denoted with ρ_{n,t} the importance weights assigned to the samples θ_{n,t} (see Figure 4). All the samples are then employed for a joint particle approximation of the target. Alternatively, GMS can directly provide an approximation of a specific moment of the target pdf (i.e., given a particular function f). The estimator of this specific moment provided by GMS is

Ĩ_T = (1/T) Σ_{t=1}^{T} Σ_{n=1}^{N} [ ρ_{n,t} / Σ_{i=1}^{N} ρ_{i,t} ] f(θ_{n,t}) = (1/T) Σ_{t=1}^{T} Ĩ_N^(t).    (34)

Unlike in the I-MTM schemes, no resampling steps are performed in GMS. However, we can recover an I-MTM chain from the GMS output by applying one resampling step when S_t ≠ S_{t−1}, i.e.,

θ_t ~ Σ_{n=1}^{N} [ ρ_{n,t} / Σ_{i=1}^{N} ρ_{i,t} ] δ(θ − θ_{n,t}),  if S_t ≠ S_{t−1},
θ_t = θ_{t−1},  if S_t = S_{t−1},    (38)

for t = 1, ..., T. More specifically, {θ_t}_{t=1}^T is a Markov chain obtained by one run of an I-MTM2 technique. The consistency of the GMS estimators is discussed in Appendix C. GMS can also be interpreted as an iterative IS scheme where an IS approximation of N samples is built at each iteration and compared with the previous IS approximation. This procedure is iterated T times and all the accepted IS estimators Ĩ_N^(t) are finally combined to provide a unique global approximation of NT samples. Note that the temporal combination of the IS estimators is obtained dynamically by the random repetitions due to the rejections in the acceptance test.

Remark 3.
The complete weighting procedure in GMS can be interpreted as the composition of two weighting schemes: (a) by an IS approach, building {ρ_{n,t}}, and (b) by the possible random repetitions due to the rejections in the acceptance test.

Figure 4 depicts a graphical representation of the GMS outputs as a chain of sets S_t = {θ_{n,t}, ρ_{n,t}}_{n=1}^N.

4.4 Ensemble MCMC algorithms

Another alternative procedure, often referred to as Ensemble MCMC (EnMCMC) methods (a.k.a. Locally weighted MCMC), involves several tries at each iteration [19, 21]. Related techniques have been proposed independently in different works [20, 22]. First, let us define the joint proposal density

q(θ^(1), ..., θ^(N)|θ_t) : R^{DN} → R,    (39)

and, considering N + 1 possible elements, S = {θ^(1), ..., θ^(N), θ^(N+1)}, we define the D × N matrix

Θ_k = [θ^(1), ..., θ^(k−1), θ^(k+1), ..., θ^(N+1)],    (40)

Table 10: Group Metropolis Sampling

1. Initialization: Choose an initial state θ_0 and an initial approximation Ẑ_0 ≈ Z.
2. For t = 1, ..., T:
   (a) Draw θ^(1), θ^(2), ..., θ^(N) ~ q(θ).
   (b) Compute the importance weights

       w(θ^(n)) = π(θ^(n)) / q(θ^(n)),  with n = 1, ..., N,    (35)

       define S' = {θ^(n), w(θ^(n))}_{n=1}^N, and compute Ẑ' = (1/N) Σ_{n=1}^{N} w(θ^(n)).
   (c) Set S_t = S', i.e.,

       S_t = { θ_{n,t} = θ^(n), ρ_{n,t} = w(θ^(n)) }_{n=1}^N,    (36)

       and Ẑ_t = Ẑ', with probability

       α(S_{t−1}, S') = min[1, Ẑ'/Ẑ_{t−1}].

       Otherwise, set S_t = S_{t−1} and Ẑ_t = Ẑ_{t−1}.
3. Return: All the sets {S_t}_{t=1}^T, or {Ĩ_N^(t)}_{t=1}^T, where

   Ĩ_N^(t) = Σ_{n=1}^{N} [ ρ_{n,t} / Σ_{i=1}^{N} ρ_{i,t} ] g(θ_{n,t}),    (37)

   and Ĩ_T = (1/T) Σ_{t=1}^{T} Ĩ_N^(t).

Figure 4: Chain of sets S_t = {θ_{n,t}, ρ_{n,t}}_{n=1}^N generated by the GMS method (graphical representation with N = 4).
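The GMS loop of Table 10 and the global estimator of Eq. (34) can be sketched as follows. This is an illustrative sketch (the arguments `log_target`, `sample_q` and `log_q` are user-supplied; unnormalized log-densities suffice, since the normalizing constants cancel in both the acceptance ratio and the normalized weights):

```python
import numpy as np

def gms(log_target, sample_q, log_q, N, T, rng):
    """Group Metropolis Sampling sketch (Table 10): accept or reject whole
    sets of N weighted candidates; returns the chain of sets S_t."""
    S, log_Z = None, -np.inf
    sets = []
    for _ in range(T):
        theta = sample_q(N)                          # N candidates from q
        log_w = log_target(theta) - log_q(theta)     # IS weights, Eq. (35)
        m = log_w.max()
        log_Z_new = m + np.log(np.mean(np.exp(log_w - m)))
        if S is None or np.log(rng.uniform()) < log_Z_new - log_Z:
            S, log_Z = (theta, np.exp(log_w - m)), log_Z_new   # accept set
        sets.append(S)                               # repetition on rejection
    return sets

def gms_estimator(sets, f):
    """Global GMS moment estimator, Eq. (34)."""
    vals = [np.sum(rho / rho.sum() * f(th)) for th, rho in sets]
    return np.mean(vals)
```

On rejection the previous set is repeated, which is exactly the "random repetition" weighting described in Remark 3.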

with columns all the vectors in S with the exception of θ^(k). For simplicity, in the following we abuse the notation writing, for instance, q(θ^(1), ..., θ^(N)|θ_t) = q(Θ_{N+1}|θ_t). One simple example of a joint proposal pdf is

q(θ^(1), ..., θ^(N)|θ_t) = ∏_{n=1}^{N} q(θ^(n)|θ_t),    (41)

i.e., considering independence among the θ^(n)'s (and having the same marginal proposal pdf q). More sophisticated joint proposal densities can be employed. A generic EnMCMC algorithm is outlined in Table 11.

Table 11: Generic EnMCMC algorithm

1. Initialization: Choose an initial state θ_0.
2. For t = 1, ..., T:
   (a) Draw θ^(1), ..., θ^(N) ~ q(θ^(1), ..., θ^(N)|θ_{t−1}).
   (b) Set θ^(N+1) = θ_{t−1}.
   (c) Set θ_t = θ^(j), resampling θ^(j) within the set {θ^(1), ..., θ^(N), θ^(N+1) = θ_{t−1}}, formed by N + 1 samples, according to the probability mass function

       α(θ_{t−1}, θ^(j)) = π(θ^(j)) q(Θ_j|θ^(j)) / Σ_{l=1}^{N+1} π(θ^(l)) q(Θ_l|θ^(l)),

       for j = 1, ..., N + 1, where Θ_j is defined in Eq. (40).
3. Return: {θ_t}_{t=1}^T.

Remark 4. Note that, with respect to the generic MTM method, EnMCMC does not require drawing N auxiliary samples and weights. Therefore, in EnMCMC a smaller number of target evaluations is required w.r.t. a generic MTM scheme.

4.4.1 Independent Ensemble MCMC

In this section, we present an interesting special case, which employs a single proposal pdf q(θ), independent of the previous state of the chain, i.e.,

q(θ^(1), ..., θ^(N)|θ_t) = q(θ^(1), ..., θ^(N)) = ∏_{n=1}^{N} q(θ^(n)).    (42)
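Under this independent joint proposal, one iteration reduces to resampling among N fresh candidates plus the previous state. A minimal one-dimensional sketch (illustrative names; the resampling probabilities are those of the I-EnMCMC scheme of Table 12, i.e., self-normalized importance weights over the N + 1 samples):

```python
import numpy as np

def i_enmcmc(log_target, sample_q, log_q, N, T, theta0, rng):
    """I-EnMCMC sketch: at each iteration, resample the next state among
    N candidates from q plus the previous state, with probabilities
    proportional to the IS weights w = pi/q."""
    theta = theta0
    chain = np.empty(T)
    for t in range(T):
        cands = np.append(sample_q(N), theta)      # theta^(N+1) = theta_{t-1}
        log_w = log_target(cands) - log_q(cands)   # IS weights
        w = np.exp(log_w - log_w.max())
        theta = cands[rng.choice(N + 1, p=w / w.sum())]
        chain[t] = theta
    return chain
```

For N = 1 this reproduces Barker's acceptance rule discussed next; for larger N, the resampled state is drawn from an increasingly accurate approximation of the target.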

In this case, the technique can be simplified as shown below. At each iteration, the algorithm described in Table 12 generates N new samples θ^(1), θ^(2), ..., θ^(N) and then resamples the new state θ_t within a set of N + 1 samples, {θ^(1), ..., θ^(N), θ_{t−1}} (which includes the previous state), according to the probabilities

α(θ_{t−1}, θ^(j)) = w̄_j = w(θ^(j)) / [ Σ_{i=1}^{N} w(θ^(i)) + w(θ_{t−1}) ],    j = 1, ..., N + 1,    (43)

where w(θ) = π(θ)/q(θ) denotes the importance sampling weight. Note that Eq. (43) for N = 1 becomes

α(θ_{t−1}, θ^(j)) = w(θ^(j)) / [ w(θ^(j)) + w(θ_{t−1}) ]
                 = [π(θ^(j))/q(θ^(j))] / [ π(θ^(j))/q(θ^(j)) + π(θ_{t−1})/q(θ_{t−1}) ]
                 = π(θ^(j)) q(θ_{t−1}) / [ π(θ^(j)) q(θ_{t−1}) + π(θ_{t−1}) q(θ^(j)) ],    (44)

that is, Barker's acceptance function (see [46, 47]).

Table 12: EnMCMC with an independent proposal pdf (I-EnMCMC).

1. Initialization: Choose an initial state θ_0.
2. For t = 1, ..., T:
   (a) Draw θ^(1), θ^(2), ..., θ^(N) ~ q(θ), and set θ^(N+1) = θ_{t−1}.
   (b) Compute the importance weights

       w(θ^(n)) = π(θ^(n)) / q(θ^(n)),  with n = 1, ..., N + 1.    (45)

   (c) Set θ_t = θ^(j), resampling θ^(j) within the set {θ^(1), ..., θ^(N), θ_{t−1}} formed by N + 1 samples, according to the probability mass function

       α(θ_{t−1}, θ^(j)) = w̄_j = w(θ^(j)) / [ Σ_{i=1}^{N} w(θ^(i)) + w(θ_{t−1}) ].

3. Return: {θ_t}_{t=1}^T.

As discussed in [42, Appendix B], [48, Appendix C], [41], the density of a resampled candidate becomes closer and closer to π̄ as N grows, i.e., N → ∞. Hence, the performance of I-EnMCMC

clearly improves with N (see Appendix A). The I-EnMCMC algorithm produces an ergodic chain with invariant density π̄, by resampling N + 1 samples at each iteration (N new samples θ^(1), ..., θ^(N) from q, and setting θ^(N+1) = θ_{t−1}). Figure 5 summarizes the steps of I-EnMCMC.

Figure 5: Graphical representation of the steps of the I-EnMCMC scheme.

4.5 Delayed Rejection Metropolis (DRM) Sampling

An alternative use of different candidates in one iteration of a Metropolis-type method is given in [23, 24, 25]. The idea behind the proposed algorithm, called the Delayed Rejection Metropolis (DRM) algorithm, is the following. As in a standard MH method, at each iteration, one sample is proposed, θ^(1) ~ q_1(θ|θ_{t−1}), and accepted with probability

α_1(θ_{t−1}, θ^(1)) = min( 1, [π(θ^(1)) q_1(θ_{t−1}|θ^(1))] / [π(θ_{t−1}) q_1(θ^(1)|θ_{t−1})] ).

If θ^(1) is accepted, then θ_t = θ^(1) and the chain is moved forward. If θ^(1) is rejected, the DRM method suggests drawing another sample, θ^(2) ~ q_2(θ|θ^(1), θ_{t−1}) (considering a different proposal pdf q_2, possibly taking into account the previous candidate θ^(1)), accepted with a suitable acceptance probability

α_2(θ_{t−1}, θ^(2)) = min( 1, [π(θ^(2)) q_1(θ^(1)|θ^(2)) q_2(θ_{t−1}|θ^(1), θ^(2)) (1 − α_1(θ^(2), θ^(1)))] / [π(θ_{t−1}) q_1(θ^(1)|θ_{t−1}) q_2(θ^(2)|θ^(1), θ_{t−1}) (1 − α_1(θ_{t−1}, θ^(1)))] ).

The acceptance function α_2(θ_{t−1}, θ^(2)) is designed in order to ensure the ergodicity of the chain. If θ^(2) is rejected, we can set θ_t = θ_{t−1} and perform another iteration of the algorithm, or continue with this iterative strategy, drawing θ^(3) ~ q_3(θ|θ^(2), θ^(1), θ_{t−1}) and testing it with a proper probability α_3(θ_{t−1}, θ^(3)). The DRM algorithm with only 2 acceptance stages is outlined in Table 13 and summarized in Figure 6.

Remark 5.
Note that the proposal pdf can be improved at each intermediate stage (θ^(1) ~ q_1(θ|θ_{t−1}), θ^(2) ~ q_2(θ|θ^(1), θ_{t−1}), etc.), using the information provided by the previously generated samples and the corresponding target evaluations. The idea behind DRM of creating a path of intermediate points, then improving the proposal pdf, and hence fostering larger jumps, has also been considered in other works [14, 49].

Table 13: Delayed Rejection Metropolis algorithm with 2 acceptance steps.

1. Initialization: Choose an initial state θ_0.
2. For t = 1, ..., T:
   (a) Draw θ^(1) ~ q_1(θ|θ_{t−1}) and u_1 ~ U([0, 1]).
   (b) Define the probability

       α_1(θ_{t−1}, θ^(1)) = min( 1, [π(θ^(1)) q_1(θ_{t−1}|θ^(1))] / [π(θ_{t−1}) q_1(θ^(1)|θ_{t−1})] ).    (46)

   (c) If u_1 ≤ α_1(θ_{t−1}, θ^(1)), set θ_t = θ^(1).
   (d) Otherwise, if u_1 > α_1(θ_{t−1}, θ^(1)), do:
       (d1) Draw θ^(2) ~ q_2(θ|θ^(1), θ_{t−1}) and u_2 ~ U([0, 1]).
       (d2) Given the function

            ψ(θ_{t−1}, θ^(2)|θ^(1)) = π(θ_{t−1}) q_1(θ^(1)|θ_{t−1}) q_2(θ^(2)|θ^(1), θ_{t−1}) (1 − α_1(θ_{t−1}, θ^(1))),

            define the probability

            α_2(θ_{t−1}, θ^(2)) = min( 1, ψ(θ^(2), θ_{t−1}|θ^(1)) / ψ(θ_{t−1}, θ^(2)|θ^(1)) ).    (47)

       (d3) If u_2 ≤ α_2(θ_{t−1}, θ^(2)), set θ_t = θ^(2).
       (d4) Otherwise, if u_2 > α_2(θ_{t−1}, θ^(2)), set θ_t = θ_{t−1}.
3. Return: {θ_t}_{t=1}^T.

Figure 6: Graphical representation of the DRM scheme with two acceptance tests.
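One DRM iteration of Table 13 can be sketched for a one-dimensional target with Gaussian random-walk proposals. This is an illustrative sketch under those specific assumptions: q_1 and q_2 are symmetric random walks around the current state, so the q_2 ratio cancels in Eq. (47), while the q_1 terms do not and must be kept:

```python
import numpy as np

def drm_step(theta, log_pi, rng, sigma1=1.0, sigma2=0.5):
    """One DRM iteration (Table 13), two stages, Gaussian random walks.
    sigma2 < sigma1 mimics a more conservative second-stage try."""
    log_q1 = lambda a, b: -0.5 * ((a - b) / sigma1) ** 2   # symmetric kernel
    th1 = theta + sigma1 * rng.standard_normal()
    a1 = min(1.0, np.exp(log_pi(th1) - log_pi(theta)))     # Eq. (46)
    if rng.uniform() <= a1:
        return th1
    th2 = theta + sigma2 * rng.standard_normal()
    a1_rev = min(1.0, np.exp(log_pi(th1) - log_pi(th2)))   # alpha_1(th2, th1)
    num = np.exp(log_pi(th2) + log_q1(th1, th2)) * (1.0 - a1_rev)
    den = np.exp(log_pi(theta) + log_q1(th1, theta)) * (1.0 - a1)
    a2 = min(1.0, num / den) if den > 0 else 1.0           # Eq. (47)
    return th2 if rng.uniform() <= a2 else theta
```

Note that the second stage is only reached when α_1 < 1, so the (1 − α_1) factor in the denominator is strictly positive whenever π(θ_{t−1}) > 0.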

5 Summary: computational cost, differences and connections

The performance of the algorithms described above improves as N grows, in general: the correlation among samples vanishes to zero, and the acceptance rate of a new state approaches one (see Section 6.1).^5 As N increases, they become more and more similar to an exact sampler drawing independent samples directly from the target density (for the MTM, PMH and EnMCMC schemes, the explanation is given in Appendix A). However, this occurs at the expense of an additional computational cost. In Table 14, we summarize the total number of target evaluations, E, and the total number of samples used in the final estimators, Q (without considering the removal of any burn-in period). The generic MTM algorithm has the greatest number of target evaluations. However, a random-walk proposal pdf can be used in a generic MTM algorithm and, in general, it fosters the exploration of the state space. In this sense, the generic EnMCMC seems preferable w.r.t. MTM, since E = NT and the random-walk proposal can be applied. A disadvantage of the EnMCMC schemes is that their acceptance function seems worse in terms of Peskun's ordering [50] (see numerical results in Section 6.1). Namely, fixing the number of tries N, the target π̄ and the proposal q pdfs, the MTM schemes seem to provide greater acceptance rates than the corresponding EnMCMC techniques. This is theoretically proved for N = 1 [50], and the difference vanishes to zero as N grows. The GMS technique, like other strategies [20, 42], has been proposed to recycle samples or re-use target evaluations, in order to increase Q (see also Section 4.1.3).

Table 14: Total number of target evaluations, E, and total number of samples, Q, used in the final estimators.

Algorithm | MTM       | I-MTM | I-MTM2 | PMH
E         | (2N − 1)T | NT    | NT     | NT
Q         | T         | T     | T      | T

Algorithm | GMS | EnMCMC | I-EnMCMC | DRM
E         | NT  | NT     | NT       | NT
Q         | NT  | T      | T        | T

In PMH, the components of the different tries are drawn sequentially and they are correlated due to the application of the resampling steps.
In DRM, each candidate is drawn in a batch way (all the components jointly), but the different candidates are drawn in a sequential manner (see Figure 6): θ^(1), then θ^(2), etc. The benefit of this strategy is that the proposal pdf can be improved considering the previously generated tries. Hence, if the proposal takes into account the previous samples, DRM generates correlated candidates as well. The main disadvantage of DRM is that

^5 Generally, an acceptance rate close to 1 is not evidence of good performance for an MCMC algorithm. However, for the techniques tackled in this work, the situation is different: as N grows, the procedure used for proposing a novel possible state (involving N tries, resampling steps, etc.) becomes better and better, yielding a better approximation of the target pdf. See Appendix A for further details.

the implementation for a generic number of stages greater than 2 is not straightforward. I-MTM and I-MTM2 differ in the acceptance function employed. Furthermore, the main difference between the I-MTM2 and PMH schemes is the use of resampling steps during the generation of the different tries. For this reason, the candidates of PMH are correlated (unlike in I-MTM2). I-MTM2 and PMH can be interpreted as I-MH methods using the sophisticated proposal density q̃(θ) in Eq. (31), where an extended IS weighting procedure is employed. Note that, indeed, q̃ cannot be evaluated pointwise, hence a standard IS weighting strategy cannot be employed.

6 Numerical Experiments

We test different MCMC schemes using multiple candidates in different numerical experiments. In the first example, an exhaustive comparison among several techniques with an independent proposal is given. We have considered different numbers of tries, lengths of the chain, parameters of the proposal pdfs, and also different dimensions of the inference problem. In the second numerical simulation, we compare different particle methods. The third one regards the hyperparameter selection for a Gaussian Process (GP) regression model. The last two examples are localization problems in a wireless sensor network (WSN): in the fourth one, some parameters of the WSN are also tuned, whereas in the last example a real data analysis is performed.

6.1 A first comparison of efficiency

In order to compare the performance of different techniques, in this section we consider a multimodal, multidimensional Gaussian target density. More specifically, we have

π̄(θ) = (1/3) Σ_{i=1}^{3} N(θ|µ_i, Σ_i),    θ ∈ R^D,    (48)

where µ_1 = [µ_{1,1}, ..., µ_{1,D}], µ_2 = [µ_{2,1}, ..., µ_{2,D}], µ_3 = [µ_{3,1}, ..., µ_{3,D}], with µ_{1,d} = −3, µ_{2,d} = 0, µ_{3,d} = 2 for all d = 1, ..., D. Moreover, the covariance matrices are diagonal, Σ_i = δ_i I_D (where I_D is the D × D identity matrix), with δ_i = 0.5 for i = 1, 2, 3.
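The moments of the mixture target (48) follow directly from the mixture structure; a quick numerical check (illustrative, per dimension):

```python
import numpy as np

# Equally weighted mixture of three Gaussians (Eq. (48)), per dimension:
# means -3, 0 and 2, common variance delta_i = 0.5.
means = np.array([-3.0, 0.0, 2.0])
var = 0.5

mean_d = means.mean()                  # E[Theta_d] = (-3 + 0 + 2)/3 = -1/3
second_d = np.mean(means ** 2 + var)   # E[Theta_d^2], law of total expectation
var_d = second_d - mean_d ** 2         # Var[Theta_d] = 85/18 ~ 4.722
```

The law of total variance gives Var[Θ_d] = 14.5/3 − 1/9 = 85/18, matching the analytical values quoted below.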
Hence, given a random variable Θ ~ π̄(θ), we know analytically that E[Θ] = [θ̄_1, ..., θ̄_D] with θ̄_d = −1/3 for all d, and diag{Cov[Θ]} = [ξ_1, ..., ξ_D] with ξ_d = 85/18 for all d = 1, ..., D. We apply I-MTM, I-MTM2 and I-EnMCMC in order to estimate all the expected values and all the variances of the marginal target pdfs. Namely, for a given dimension D, we have to estimate all {θ̄_d}_{d=1}^D and {ξ_d}_{d=1}^D, hence 2D values. The results are averaged over 3000 independent runs. At each run, we compute an averaged square error obtained in the estimation of the 2D values, and then calculate the Mean Square Error (MSE) averaged over the 3000 runs. For all the techniques, we consider a Gaussian proposal density q(θ) = N(θ|µ, σ²I_D), with µ = [µ_1 = 0, ..., µ_D = 0] (independent from the previous state), and different values of σ are considered.

6.1.1 Description of the experiments

We perform several experiments varying the number of tries, N, the length of the generated chain, T, the dimension of the inference problem, D, and the scale parameter of the proposal pdf, σ. In

Figures 7(a)-(b)-(c)-(d)-(e), we show the MSE (obtained by the different techniques) as a function of N, T, D and σ, respectively. In Figure 7(d), we only consider I-MTM with N ∈ {1, 5, 100, 1000}, in order to show the effect of using different numbers of tries in different dimensions D. Note that I-MTM with N = 1 coincides with I-MH. Let us denote as φ(τ) the auto-correlation function of the states of the generated chain. Figures 8(a)-(b)-(c) depict the normalized auto-correlation function φ̄(τ) = φ(τ)/φ(0) (recall that |φ(τ)| ≤ φ(0) for τ ≥ 0) at different lags τ = 1, 2, 3, respectively. Furthermore, given the definition of the Effective Sample Size (ESS) [51, Chapter 4],

ESS = T / (1 + 2 Σ_{τ=1}^{∞} φ̄(τ)),    (49)

in Figure 8(d), we show the ratio ESS/T (approximated by cutting off the series in the denominator at lag τ = 10), as a function of N.^6 Finally, in Figures 9(a)-(b), we provide the Acceptance Rate (AR) of a new state (i.e., the expected number of accepted jumps to a novel state), as a function of N and D, respectively.

6.1.2 Comment on the results

Figures 7(a)-(b), 8 and 9(a) clearly show that the performance improves as N grows, for all the algorithms. The MSE values and the correlation decrease, and the ESS and the AR grow. I-MTM seems to provide the best performance. Recall that for N = 1, I-EnMCMC becomes an I-MH with Barker's acceptance function, and I-MTM becomes the I-MH in Table 4 [47]. For N = 1, the results confirm Peskun's ordering about the acceptance function for an MH method [50, 47]. Observing the results, Peskun's ordering appears valid also for the multiple try case, N > 1. I-MTM2 seems to have worse performance than I-MTM for all N. With respect to I-EnMCMC, I-MTM2 performs better for smaller N. The difference among the MSE values obtained by the samplers becomes smaller as N grows, as shown in Figures 7(a)-(b) (note that in the first one D = 1, in the other D = 10, and the range of N is different).
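The truncated ESS approximation of Eq. (49) is straightforward to compute from a chain; a minimal sketch (our own helper, using the empirical autocorrelation and the same lag-10 cutoff used for Figure 8(d)):

```python
import numpy as np

def ess(chain, max_lag=10):
    """Approximate Effective Sample Size, Eq. (49), truncating the sum of
    normalized autocorrelations phi_bar(tau) at tau = max_lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    T = len(x)
    phi0 = np.dot(x, x) / T                      # phi(0), empirical variance
    rho = [np.dot(x[:-k], x[k:]) / (T * phi0)    # phi_bar(tau), tau = 1..max_lag
           for k in range(1, max_lag + 1)]
    return T / (1.0 + 2.0 * np.sum(rho))
```

For an i.i.d. sequence the ratio ESS/T is close to 1, while a positively correlated chain (e.g., an AR(1) process) yields a much smaller value.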
The comparison among I-MTM, I-MTM2 and I-EnMCMC seems not to be affected by changing T and σ, as depicted in Figures 7(c)-(e). Namely, the MSE values change, but the ordering of the methods (e.g., best and worst) seems to depend mainly on N. Obviously, for greater D, more tries are required in order to obtain good performance (see Figures 7(d) and 9(b), for instance). Note that, for N → ∞, I-MTM, I-MTM2 and I-EnMCMC perform similarly to an exact sampler drawing T independent samples from π̄(θ): the correlation φ̄(τ) among the samples approaches zero (for all τ), ESS approaches T, and AR approaches 1.

^6 Since the MCMC algorithms yield positively correlated sequences of states, we have ESS < T in general.