Roberto Casarin, Fabrizio Leisen, David Luengo, and Luca Martino. Adaptive Sticky Generalized Metropolis


Roberto Casarin, Fabrizio Leisen, David Luengo, and Luca Martino. Adaptive Sticky Generalized Metropolis. Working Paper No. 19/WP/2013

Working Papers, Department of Economics, Ca' Foscari University of Venice. No. 19/WP/2013. ISSN

Title: Adaptive Sticky Generalized Metropolis

Luca Martino, Universidad Carlos III de Madrid; Fabrizio Leisen, University of Kent; Roberto Casarin, Università Ca' Foscari Venezia; David Luengo, Universidad Politecnica de Madrid

August 2013

Abstract: We introduce a new class of adaptive Metropolis algorithms, called adaptive sticky algorithms, for efficient general-purpose simulation from a target probability distribution. The transition of the Metropolis chain is based on a multiple-try scheme and the different proposals are generated by adaptive nonparametric distributions. Our adaptation strategy uses the interpolation of support points from the past history of the chain, as in the adaptive rejection Metropolis. The efficiency of the algorithm is strengthened by a step that controls the evolution of the set of support points. This extra stage reduces the computational cost and accelerates the convergence of the proposal distribution to the target. Although the algorithms are presented for univariate target distributions, we show that they can easily be extended to the multivariate context by a Gibbs sampling strategy. We show the ergodicity of the proposed algorithms and illustrate their efficiency and effectiveness through some simulated examples involving target distributions with complex structures.

Keywords: Adaptive Markov chain Monte Carlo, Adaptive rejection Metropolis, Multiple-try Metropolis, Metropolis within Gibbs.

JEL Codes: C1, C15, C11, C4, C63.

Address for correspondence: Roberto Casarin, Department of Economics, Ca' Foscari University of Venice, Cannaregio 873, Fondamenta S. Giobbe, 30121 Venezia, Italy. Phone: (++39), Fax: (++39), r.casarin@unive.it

This Working Paper is published under the auspices of the Department of Economics of the Ca' Foscari University of Venice.
Opinions expressed herein are those of the authors and not those of the Department. The Working Paper series is designed to divulge preliminary or incomplete work, circulated to favour discussion and comments. Citation of this paper should consider its provisional character. The Working Paper Series is available only online. For editorial correspondence, please contact: wp.dse@unive.it, Department of Economics, Ca' Foscari University of Venice, Cannaregio 873, Fondamenta San Giobbe, 30121 Venice, Italy. Fax:

Adaptive Sticky Generalized Metropolis

Luca Martino (Universidad Carlos III de Madrid), Roberto Casarin (University Ca' Foscari, Venice), Fabrizio Leisen (University of Kent), David Luengo (Universidad Politecnica de Madrid)

Abstract

We introduce a new class of adaptive Metropolis algorithms, called adaptive sticky algorithms, for efficient general-purpose simulation from a target probability distribution. The transition of the Metropolis chain is based on a multiple-try scheme and the different proposals are generated by adaptive nonparametric distributions. Our adaptation strategy uses the interpolation of support points from the past history of the chain, as in the adaptive rejection Metropolis. The efficiency of the algorithm is strengthened by a step that controls the evolution of the set of support points. This extra stage reduces the computational cost and accelerates the convergence of the proposal distribution to the target. Although the algorithms are presented for univariate target distributions, we show that they can easily be extended to the multivariate context by a Gibbs sampling strategy. We show the ergodicity of the proposed algorithms and illustrate their efficiency and effectiveness through some simulated examples involving target distributions with complex structures.

Keywords: Adaptive Markov chain Monte Carlo, Adaptive rejection Metropolis, Multiple-try Metropolis, Metropolis within Gibbs.

Corresponding author: Fabrizio Leisen, fabrizio.leisen@gmail.com. Other contacts: r.casarin@unive.it (Roberto Casarin); luca.martino@uc3m.es (Luca Martino).

1 Introduction

Markov Chain Monte Carlo (MCMC) methods (see Liu (2004); Liang et al. (2010); Robert and Casella (2004) and references therein) are now a very

important numerical tool in statistics and in many other fields, because they can generate samples from any target distribution available up to a normalizing constant. Standard MCMC techniques require the specification of a proposal distribution and produce a Markov chain that converges to the target distribution. The main issue in MCMC is the choice of the proposal distribution, which can heavily affect the mixing of the MCMC chain when the target distribution has a complex structure, e.g., multimodality and heavy tails. Thus, in the last decade and after the seminal paper of Haario et al. (2001), a remarkable stream of literature has focused on adaptive proposal distributions, which allow for self-tuning procedures of the MCMC algorithms, for flexible movements within the state space, and for reasonable acceptance probabilities of the adaptive MCMC chain. Adaptive MCMC algorithms are used in many statistical applications (e.g., see Roberts and Rosenthal (2009), Craiu et al. (2009), Giordani and Kohn (2010) and Richardson et al. (2011)) and different adaptive strategies have been proposed in the literature. One strategy consists in updating the proposal distribution according to the past values of the chain (e.g., see Haario et al. (2001) and Andrieu and Robert (2001)). Another strategy relies on the use of auxiliary chains, which are run in parallel (e.g., see Jasra et al. (2007a), Jasra et al. (2007b), Campillo et al. (2009), Atchade (2010), Casarin et al. (2013)) and interact with the principal chain. One of the most widely used classes of MCMC algorithms is the Metropolis-Hastings (MH) algorithm (see Metropolis et al. (1953), Hastings (1970)) and its generalizations. Among the different variants of the MH, in this paper we focus on the multiple-try Metropolis (MTM) (see Liu et al. (2000)), which has proven to be efficient in different applications (e.g., see Craiu and Lemieux (2007) and So (2006)).
While in the MH formulation one accepts or rejects a single proposed move, the MTM is designed so that the next state of the chain is selected among multiple proposals. The multiple-proposal setup can be used effectively to explore the sample space of the target distribution. The MTM has been further generalized with the use of antithetic and quasi-Monte Carlo sampling (Craiu and Lemieux (2007) and Bédard et al. (2010)), the extension to a trans-dimensional setup (Pandolfi et al. (2010)) and the use of general weighting functions in the selection step of the MTM (Martino and Read (2012) and Martino and Read (2013)). In this paper, we contribute to the adaptive MCMC literature by proposing a new class of adaptive generalized Metropolis algorithms, which we call adaptive sticky algorithms. More specifically, we propose a new class of adaptive MTM algorithms called adaptive sticky MTM (ASMTM), which has the adaptive sticky Metropolis (ASM) as a special

case. Adaptation strategies for MTM based on interacting chains have been proposed in Casarin et al. (2013). We follow here an alternative route and use the past history of a single MTM chain to adapt the proposal distribution over the chain iterations. The proposal distribution is nonparametric and the construction method relies upon alternative interpolation strategies. Our adaptation mechanism builds on and extends the mechanism in the adaptive rejection sampling (ARS) of Gilks and Wild (1992) and in the adaptive rejection Metropolis sampling (ARMS) of Gilks et al. (1995b) and its extensions (see Meyer et al. (2008), Cai et al. (2008), Hörmann (1995), Görür and Teh (2011), Martino and Míguez (2011) and Martino et al. (2012)). We note that the interpolation approach has also been used in Krzykowski and Mackowiak (2006) and Shao et al. (2013), but not in an adaptive MH framework. Our extension of the algorithms in the ARMS class is twofold. First, we use the more efficient multiple-proposal transition instead of the single-proposal transition kernel. Secondly, we apply a random test procedure for the inclusion of new points in the support set of the proposal distribution. We discuss different testing procedures for the inclusion of new support points. They represent more efficient generalizations of the accept/reject rule of the ARMS algorithm. Another contribution of the paper regards the convergence of the proposed adaptive algorithms. Adaptive MCMC algorithms, which use previous iterations or auxiliary variables in their future transitions, violate the Markov property which provides the justification for conventional MCMC. Thus, their validity, in terms of convergence to the desired target distribution, has to be demonstrated. We note that convergence of adaptive MCMC is reached under various conditions (Haario et al. (2001), Atchade and Rosenthal (2005), Andrieu and Moulines (2006), Roberts and Rosenthal (2007), Saksman and Vihola (2010), Latuszynski et al. (2013), and Holden et al.
(2009)). In this paper we follow the Holden et al. (2009) approach and show the ergodicity of the adaptive Metropolis chain under suitable conditions on the proposal distribution. Our interpolation approach guarantees that the adaptive proposal distributions satisfy such conditions. These results extend to the adaptive MTM algorithm the previous results on adaptive MH due to Holden et al. (2009). Finally, we discuss some practical issues, such as acceleration techniques for the reduction of the computational cost. A possible extension to the multivariate setup is also proposed following a Gibbs sampling updating rule. The resulting ASM-within-Gibbs sampling algorithm represents an effective simulation technique thanks to the efficiency of the nonparametric proposal distributions used in the ASM. We study the

efficiency and effectiveness of the proposed algorithms on different simulation experiments involving target distributions with multiple modes, heavy tails and skewness. The structure of the paper is as follows. Section 2 introduces the adaptive sticky Metropolis and discusses convergence issues. Section 3 presents different updating schemes for the proposal distributions. Section 4 discusses some practical issues for the implementation and Section 5 presents a multivariate extension of the sampling scheme. Section 6 contains algorithm comparisons using simulated data. Section 7 contains conclusions and suggestions for further research.

2 Adaptive Generalized Metropolis

2.1 Adaptive Sticky Metropolis

Let π(x) be a real target distribution known up to a normalizing constant. Fix an initial state x_0 of the chain x_t, t = 0, 1, 2, ..., and an initial set of support points S_0 = {s_1, ..., s_{m_0}}, with m_0 > 0. Assume that the current state of the chain is x_t; then the general update of the proposed Adaptive Sticky Metropolis (ASM) algorithm is described in Algorithm 1.

Algorithm 1. Adaptive Sticky Metropolis (ASM)

For t = 1, ..., T:

1. Construction of the proposal: Build a proposal q_t(x|S_{t-1}) via a suitable interpolation procedure using the set of support points S_{t-1}.

2. MH step:

2.1 Draw x' from q_t(x|S_{t-1}).

2.2 Set x_{t+1} = x' and z = x_t with probability
$$\alpha(x_t, x', \mathcal{S}_{t-1}) = \min\left[1, \frac{\pi(x')\, q_t(x_t|\mathcal{S}_{t-1})}{\pi(x_t)\, q_t(x'|\mathcal{S}_{t-1})}\right],$$
and set x_{t+1} = x_t and z = x' with probability 1 - α(x_t, x', S_{t-1}).

3. Test to update S_t: Let η : ℝ⁺ → [0, 1] be a strictly increasing continuous function such that η(0) = 0. Then, set
$$\mathcal{S}_t = \begin{cases} \mathcal{S}_{t-1} \cup \{z\} & \text{with prob. } \eta(d_t(z)),\\ \mathcal{S}_{t-1} & \text{with prob. } 1 - \eta(d_t(z)), \end{cases}$$
where d_t(z) is a positive measure (at iteration t) of the distance at z between the target and the proposal distributions.

The proposal distribution changes along the iterations (see step 1 of Algorithm 1) following an adaptation scheme which relies upon a suitable interpolation of a set of support points. In Section 3 we provide several interpolation methods based on a partition of the support of π(x). The insight behind this adaptation strategy is to build a proposal that is closer and closer to the target as the number of iterations increases. The proposals generated in this way are then used in a standard accept/reject Metropolis-Hastings (MH) step (see step 2 of the algorithm); hence the resulting algorithm is in the class of adaptive MH. Another important feature of the proposed adaptation strategy is given by the test for updating the set of support points (see step 3). This step includes, with probability η, the rejected proposal from the MH step in the set of support points by applying an accept/reject rule. The rationale behind

this test is to use information from the target distribution in order to include in the set only the points where the proposal is far from the target. More specifically, we set the acceptance probability η as a function of a distance d_t(z). This allows us to design a strategy that incorporates the point z only if the distance at z between the proposal distribution and the target is large. Moreover, a suitable construction of the proposal leads to a probability of adding a new point that converges to zero. This implies that both the total number of points in the support set and the computational cost of building the proposals are kept bounded along the iterations, provided that η(0) = 0. Different choices of η, which ensure quick convergence of the proposal to the target, are presented in Section 4.1. Finally, it should be noted that Algorithm 1 is a special case of the adaptive sticky MTM presented in the next section (see Algorithm 2); the proof of the validity of the algorithm follows closely the proof given in the next section for the adaptive sticky MTM and, therefore, it is not given here.

2.2 Adaptive Sticky Multiple Try Metropolis

In the ASM one accepts or rejects a single proposed value. We extend the ASM by allowing for multiple proposals in order to further improve the ability of the Metropolis chain to explore the state space. We focus on the multiple-try Metropolis (MTM) (see Liu et al. (2000) and Craiu and Lemieux (2007)) and propose an Adaptive Sticky MTM (ASMTM). The ASMTM can also be seen as a generalization of the MTM which allows for adaptive proposal distributions. Note that our adaptation strategy can be combined with MTM algorithms with different proposal distributions and with interacting MTM algorithms (see Casarin et al. (2013)) to design new adaptive algorithms. Our adaptation can also be used within multi-point algorithms (e.g., Martino and Read (2012); Pandolfi et al. (2010)).
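Before turning to the multiple-try version, the single-proposal update of Algorithm 1 (Section 2.1) can be sketched in a few lines of code. The sketch below is our own illustration, not the authors' implementation: it restricts the target to a bounded interval (so the tail pieces discussed in Section 3 can be omitted), builds a crude piecewise-constant proposal from the support points, and makes the illustrative choices d_t(z) = |π(z) - q_t(z)| and η(d) = 1 - exp(-d); all function names are ours.

```python
import numpy as np

def build_proposal(support, log_target):
    """Piecewise-constant proposal on [support[0], support[-1]]: the height on
    (s_i, s_{i+1}] is max{pi(s_i), pi(s_{i+1})} (an Eq.-(8)-style construction;
    tail pieces are omitted because the domain is bounded)."""
    s = np.asarray(support)
    p = np.exp(log_target(s))                 # requires a vectorized log-target
    heights = np.maximum(p[:-1], p[1:])
    area = heights * np.diff(s)
    Z = area.sum()
    return s, heights / Z, area / Z           # edges, normalized density, piece probs

def q_eval(x, s, dens):
    k = np.clip(np.searchsorted(s, x) - 1, 0, len(dens) - 1)
    return dens[k]

def asm(log_target, support, x0, n_iter, rng):
    x, chain = x0, []
    support = sorted(support)
    for _ in range(n_iter):
        # Step 1: build the adaptive proposal from the current support set.
        s, dens, probs = build_proposal(support, log_target)
        # Step 2: independent MH step.
        k = rng.choice(len(probs), p=probs)
        xp = rng.uniform(s[k], s[k + 1])
        alpha = min(1.0, np.exp(log_target(xp) - log_target(x))
                         * q_eval(x, s, dens) / q_eval(xp, s, dens))
        if rng.uniform() < alpha:
            x, z = xp, x                      # accept: z is the previous state
        else:
            z = xp                            # reject: z is the discarded proposal
        # Step 3: add z with probability eta(d_t(z)), eta(d) = 1 - exp(-d)
        # (an assumed, illustrative choice of eta and d_t).
        d = abs(np.exp(log_target(z)) - q_eval(z, s, dens))
        if rng.uniform() < 1.0 - np.exp(-d):
            support = sorted(support + [z])
    # chain collection
        chain.append(x)
    return np.array(chain), support
```

With a vectorized log-target, e.g. the standard normal restricted to [-4, 4], the chain samples from the target while new support points accumulate only where the proposal is still far from π.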
At iteration t, the ASMTM builds the proposal distribution q_t(x|S_{t-1}) (step 1 of Algorithm 2) using the current set of support points S_{t-1}. Let x_t = x be the current value of the chain and x'_j, j = 1, ..., M, a set of i.i.d. proposals simulated from q_t(x|S_{t-1}) (see step 2). Define the unnormalized selection weights
$$w_{j,t}(x, x'_j) = \pi(x)\, q_t(x'_j|\mathcal{S}_{t-1})\, \lambda_t(x, x'_j|\mathcal{S}_{t-1}),$$
where λ_t(x, x'|S_{t-1}) is a non-negative function, symmetric in x and x'. It is worth noticing that in the adaptive MTM not only the proposal distribution changes over the iterations, but also the function λ_t may adapt following the update in the set of support points.

Algorithm 2. Adaptive Sticky Multiple Try (ASMTM)

For t = 1, ..., T:

1. Construction of the proposal: Build a proposal q_t(x|S_{t-1}) via a suitable interpolation procedure using the set of support points S_{t-1}. In Section 3 we provide several procedures that are based on a partition of the support of π(x).

2. MTM step:

2.1 Draw x'_1, ..., x'_M from q_t(x|S_{t-1}) and compute the unnormalized weights w_t(x'_i) = π(x'_i)/q_t(x'_i|S_{t-1}).

2.2 Select x' = x'_j among the M proposals with probability proportional to w_t(x'_i), i = 1, ..., M.

2.3 Set the auxiliary points x*_i = x'_i, for i ≠ j, and x*_j = x_t.

2.4 Set x_{t+1} = x' and z_i = x*_i with probability
$$\alpha(x_t, x', \mathbf{x}^*_{-j}, \mathcal{S}_{t-1}) = \min\left[1, \frac{w_t(x'_1) + \cdots + w_t(x'_M)}{w_t(x^*_1) + \cdots + w_t(x^*_M)}\right],$$
and set x_{t+1} = x_t and z_i = x'_i with probability 1 - α(x_t, x', x*_{-j}, S_{t-1}).

3. Test to update S_t: Let η_i : ℝ⁺ → [0, 1], i = 1, ..., M, be strictly increasing continuous functions such that η_i(0) = 0 for all i and Σ_{i=1}^M η_i ≤ 1. Then, set
$$\mathcal{S}_t = \begin{cases} \mathcal{S}_{t-1} \cup \{z_i\} & \text{with prob. } \eta_i(d_t(z_i)),\ i = 1, \ldots, M,\\ \mathcal{S}_{t-1} & \text{with prob. } 1 - \sum_{i=1}^M \eta_i(d_t(z_i)), \end{cases}$$
where d_t(z) is a positive measure (at iteration t) of the distance at z between the target and the proposal distributions.

Liu et al. (2000) discussed various possible specifications of the function λ_t and found in their experiments that the efficiency gain when using MTM is generally not sensitive to the choice of this function. However, in some of the experiments of Liu et al. (2000), and in nearly all the simulation experiments of Casarin et al. (2013), the choice λ_t(x, x'|S_{t-1}) = 1/(q_t(x|S_{t-1}) q_t(x'|S_{t-1})) leads to better performance of the MTM algorithms. Thus, in this work we

consider this choice of λ_t and focus on w_{j,t}(x, x') = w_t(x), for all j, where the w_t(x) are unnormalized importance weights,
$$w_t(x) = \frac{\pi(x)}{q_t(x|\mathcal{S}_{t-1})}.$$
The importance weights are used at step 2 of the ASMTM to select one of the proposals. The selected candidate is accepted or rejected with the generalized acceptance probability given at step 2. Finally, step 3 includes the selected proposal in the set of support points with probability η. This updating step can be extended to allow for more than one proposal to be included into the set of support points. This strategy recycles the proposals and possibly improves the adaptation of the proposal distributions. For the sake of simplicity, in the presentation of the ASMTM algorithm we consider the case where only one proposal is added, at each iteration, to S_{t-1}. We show the convergence of the ASMTM algorithm by extending to the MTM the results in Holden et al. (2009), who show the convergence of an independent MH scheme with adaptive proposal, avoiding the requirement of diminishing adaptation. The difference between the adaptive independent MH algorithm of Holden et al. (2009) and a standard independent MH algorithm is that the proposal distribution q_t(x|S_{t-1}) depends on the set of support points S_{t-1}, which can include part of the past history of the MH algorithm except for the current state of the MH chain (see Liang et al. (2010)). The main difference between our adaptive independent MTM algorithm and the adaptive independent MH algorithm of Holden et al. (2009) is that at each iteration multiple proposals can be used in the Metropolis transition. The following theorem implies that the ASMTM chain never leaves the stationary distribution π(x) once it is reached. Theorem 1.
The target distribution π(x) is invariant for the adaptive independent MTM algorithm; that is, p_t(x_t|S_{t-1}) = π(x_t) implies p_{t+1}(x_{t+1}|S_t) = π(x_{t+1}), where p_t(·|S_{t-1}) denotes the distribution of x_t conditional on the past samples.

Proof. Let ρ be the state appended to the history S_{t-1}. Without loss of generality, suppose that η_j = 1, where j is the index sampled at the selection step, so that S_t = {ρ} ∪ S_{t-1} with ρ = z_j. Moreover, let f_t(S_t) be the joint distribution of the history S_t and let q_{t,-j}(x'_{-j}|S_{t-1}) = Π_{i≠j} q_t(x'_i|S_{t-1}), where x'_{-j} = (x'_1, ..., x'_{j-1}, x'_{j+1}, ..., x'_M). Following Liu et al. (2000),

Theorem 1, and Casarin et al. (2013), Theorem 1, the actual transition probability of the MTM step in our ASMTM writes (after integrating out the selection index) as follows:
$$A(\rho, x_{t+1}) = \sum_{j=1}^{M} \int_{\mathcal{X}^{M-1}} q_{t,-j}(\mathbf{x}'_{-j}|\mathcal{S}_{t-1})\, \alpha(\rho, x_{t+1}, \mathbf{x}'_{-j}, \mathcal{S}_{t-1})\, \frac{w_t(x_{t+1})\, q_t(x_{t+1}|\mathcal{S}_{t-1})}{\sum_{k \neq j} w_t(x'_k) + w_t(x_{t+1})}\, d\mathbf{x}'_{-j}.$$
We show that the chain with this transition probability never leaves the stationary distribution once it is reached:
$$\begin{aligned}
p_{t+1}(x_{t+1}|\mathcal{S}_t) f_t(\mathcal{S}_t) &= f_{t-1}(\mathcal{S}_{t-1}) \Bigg\{ \pi(\rho)\, q_t(x_{t+1}|\mathcal{S}_{t-1}) \sum_{j=1}^{M} \int_{\mathcal{X}^{M-1}} \frac{w_t(x_{t+1})}{\sum_{i \neq j} w_t(x'_i) + w_t(x_{t+1})}\, \alpha(\rho, x_{t+1}, \mathbf{x}'_{-j}, \mathcal{S}_{t-1})\, q_{t,-j}(\mathbf{x}'_{-j}|\mathcal{S}_{t-1})\, d\mathbf{x}'_{-j} \\
&\quad + \pi(x_{t+1})\, q_t(\rho|\mathcal{S}_{t-1}) \sum_{j=1}^{M} \int_{\mathcal{X}^{M-1}} \frac{w_t(\rho)}{\sum_{i \neq j} w_t(x'_i) + w_t(\rho)}\, \big[1 - \alpha(x_{t+1}, \rho, \mathbf{x}'_{-j}, \mathcal{S}_{t-1})\big]\, q_{t,-j}(\mathbf{x}'_{-j}|\mathcal{S}_{t-1})\, d\mathbf{x}'_{-j} \Bigg\} \\
&= f_{t-1}(\mathcal{S}_{t-1})\, \pi(x_{t+1})\, q_t(\rho|\mathcal{S}_{t-1}) \sum_{j=1}^{M} \int_{\mathcal{X}^{M-1}} \frac{w_t(\rho)}{\sum_{i \neq j} w_t(x'_i) + w_t(\rho)}\, q_{t,-j}(\mathbf{x}'_{-j}|\mathcal{S}_{t-1})\, d\mathbf{x}'_{-j} \\
&= f_{t-1}(\mathcal{S}_{t-1})\, \pi(x_{t+1})\, q_t(\rho|\mathcal{S}_{t-1})\, g_t(\rho|\mathcal{S}_{t-1}),
\end{aligned}$$
where the second equality follows from the identity π(ρ) q_t(x_{t+1}|S_{t-1}) w_t(x_{t+1}) = π(x_{t+1}) q_t(ρ|S_{t-1}) w_t(ρ) and the form of α, and where
$$g_t(\rho|\mathcal{S}_{t-1}) = \sum_{j=1}^{M} \int_{\mathcal{X}^{M-1}} \frac{w_t(\rho)}{\sum_{i \neq j} w_t(x'_i) + w_t(\rho)}\, q_{t,-j}(\mathbf{x}'_{-j}|\mathcal{S}_{t-1})\, d\mathbf{x}'_{-j},$$
and this concludes the proof.
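The MTM transition analyzed in the proof (steps 2.1-2.4 of Algorithm 2, with the importance-weight choice w_t(x) = π(x)/q_t(x|S_{t-1})) can be sketched in code. This is our own minimal illustration: the proposal is held fixed for clarity, whereas in the ASMTM it is rebuilt from S_{t-1} at every iteration, and unnormalized log-densities suffice because constant factors cancel in both the selection and the acceptance ratio.

```python
import numpy as np

def mtm_step(x, log_target, draw_q, log_q, M, rng):
    """One MTM transition with an independent proposal q and importance
    weights w(x) = pi(x)/q(x), i.e. the choice lambda = 1/(q*q)."""
    props = np.array([draw_q(rng) for _ in range(M)])            # step 2.1
    logw = np.array([log_target(v) - log_q(v) for v in props])
    shift = logw.max()                                           # for stability
    w = np.exp(logw - shift)
    j = rng.choice(M, p=w / w.sum())                             # step 2.2
    aux = props.copy()
    aux[j] = x                                                   # step 2.3
    logw_aux = np.array([log_target(v) - log_q(v) for v in aux])
    c = max(shift, logw_aux.max())
    alpha = min(1.0, np.exp(logw - c).sum() / np.exp(logw_aux - c).sum())  # 2.4
    if rng.uniform() < alpha:
        return props[j]
    return x
```

For instance, iterating mtm_step with a fixed zero-mean Gaussian proposal that is wider than the target reproduces a standard independent MTM chain.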

Let us assume that the proposal distribution q_t(x|S_{t-1}) satisfies the strong Doeblin condition
$$q_t(x|\mathcal{S}_{t-1}) \geq a_t(\mathcal{S}_{t-1})\, \pi(x) \qquad (1)$$
for all x ∈ X and S_{t-1} ∈ X^{t-1}, where X denotes the state space and a_t(S_{t-1}) ∈ (0, 1]. This condition is satisfied by the proposal distributions discussed in the next sections.

Theorem 2. Assume the proposal q_t(x|S_{t-1}) in the ASMTM algorithm satisfies condition (1) for all t. Then
$$\|p_t - \pi\|_{TV} \leq 2 \int_{\mathcal{X}^t} \prod_{j=1}^{t} \big(1 - a_j(\mathcal{S}_{j-1})\big)\, d\mu(\mathcal{S}_{t-1}). \qquad (2)$$
The algorithm converges if the product goes to zero as t → ∞.

Proof. Let x_t be the current value of the chain at iteration t and x' the j-th proposal, which is accepted if u_t < α(x_t, x', x*_{-j}, S_{t-1}), where u_t is a uniform random number on the [0, 1] interval. The acceptance probability satisfies
$$\alpha(x_t, x', \mathbf{x}^*_{-j}, \mathcal{S}_{t-1}) = \min\left\{1, \frac{\sum_{k \neq j} \frac{\pi(x'_k)}{q_t(x'_k|\mathcal{S}_{t-1})} + \frac{\pi(x'_j)}{q_t(x'_j|\mathcal{S}_{t-1})}}{\sum_{k \neq j} \frac{\pi(x'_k)}{q_t(x'_k|\mathcal{S}_{t-1})} + \frac{\pi(x_t)}{q_t(x_t|\mathcal{S}_{t-1})}}\right\} \geq \min\left\{1, \tilde{a}_t(\mathcal{S}_{t-1}, x')\, \frac{\pi(x')}{q_t(x'|\mathcal{S}_{t-1})}\right\},$$
where
$$\tilde{a}_t(\mathcal{S}_{t-1}, x') = \frac{a_t(\mathcal{S}_{t-1})}{M} \left( \sum_{k \neq j} \frac{\pi(x'_k)}{q_t(x'_k|\mathcal{S}_{t-1})} + \frac{\pi(x'_j)}{q_t(x'_j|\mathcal{S}_{t-1})} \right) \frac{q_t(x'_j|\mathcal{S}_{t-1})}{\pi(x'_j)},$$
since, by condition (1), each ratio π(x)/q_t(x|S_{t-1}) is bounded above by 1/a_t(S_{t-1}) and the denominator therefore does not exceed M/a_t(S_{t-1}). Let A_t be the event that u_t q_t(x'|S_{t-1})/π(x') ≤ ã_t(S_{t-1}, x'). Then

the conditional distribution of x', given S_{t-1}, x_t and A_t, is proportional to
$$\begin{aligned}
&\sum_{j=1}^{M} \int_{\mathcal{X}^{M-1}} \frac{w_t(x')}{\sum_{i \neq j} w_t(x'_i) + w_t(x')}\, P(A_t|\mathcal{S}_{t-1}, x', \mathbf{x}'_{-j}, x_t)\, q_{t,-j}(\mathbf{x}'_{-j}|\mathcal{S}_{t-1})\, q_t(x'|\mathcal{S}_{t-1})\, d\mathbf{x}'_{-j} \\
&= \sum_{j=1}^{M} \int_{\mathcal{X}^{M-1}} \frac{w_t(x')}{\sum_{i \neq j} w_t(x'_i) + w_t(x')}\, \tilde{a}_t(\mathcal{S}_{t-1}, x')\, \frac{\pi(x')}{q_t(x'|\mathcal{S}_{t-1})}\, q_{t,-j}(\mathbf{x}'_{-j}|\mathcal{S}_{t-1})\, q_t(x'|\mathcal{S}_{t-1})\, d\mathbf{x}'_{-j} \\
&= M\, \frac{a_t(\mathcal{S}_{t-1})}{M} \int_{\mathcal{X}^{M-1}} \pi(x')\, q_{t,-j}(\mathbf{x}'_{-j}|\mathcal{S}_{t-1})\, d\mathbf{x}'_{-j} = a_t(\mathcal{S}_{t-1})\, \pi(x').
\end{aligned}$$
Following Holden et al. (2009), we define
$$I_{t+1} = \begin{cases} 0 & \text{with probability } 1 - a_{t+1}(\mathcal{S}_t),\ \text{if } I_t = 0,\\ 1 & \text{otherwise}, \end{cases}$$
for t ≥ 1, with I_0 = 0; the probability of not being in the stationary regime after t steps is P(I_t = 0|S_{t-1}) = b_t(S_{t-1}), where
$$b_t(\mathcal{S}_{t-1}) = \prod_{j=1}^{t} \big(1 - a_j(\mathcal{S}_{j-1})\big).$$
Then the conditional distribution of x_t can be written as
$$p_t(x|\mathcal{S}_t) = \pi(x)\big(1 - b_t(\mathcal{S}_{t-1})\big) + v_t(x|\mathcal{S}_t)\, b_t(\mathcal{S}_{t-1}),$$
where v_t is a probability distribution. Then the distance between the limiting distribution and the conditional distribution of x_t can be bounded as follows:
$$\begin{aligned}
\|p_t - \pi\|_{TV} &= \int_{\mathcal{X}} \left| \int_{\mathcal{X}^t} p_t(x|\mathcal{S}_t)\, p_t(\mathcal{S}_t)\, d\mu(\mathcal{S}_t) - \pi(x) \right| d\mu(x) \\
&= \int_{\mathcal{X}} \left| \int_{\mathcal{X}^t} \big(v_t(x|\mathcal{S}_t) - \pi(x)\big)\, b_t(\mathcal{S}_{t-1})\, p_t(\mathcal{S}_t)\, d\mu(\mathcal{S}_t) \right| d\mu(x) \qquad (3) \\
&\leq \int_{\mathcal{X}^t} \int_{\mathcal{X}} \big|v_t(x|\mathcal{S}_t) - \pi(x)\big|\, d\mu(x)\, b_t(\mathcal{S}_{t-1})\, p_t(\mathcal{S}_t)\, d\mu(\mathcal{S}_t) \\
&\leq 2 \int_{\mathcal{X}^t} b_t(\mathcal{S}_{t-1})\, p_t(\mathcal{S}_t)\, d\mu(\mathcal{S}_t).
\end{aligned}$$
Thanks to this bound, the probability of having reached the stationary regime within t steps can be made arbitrarily close to one.
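To make condition (1) and bound (2) concrete, here is a small numeric sketch (our own illustration, not from the paper). For a standard normal target and a fixed zero-mean normal proposal with standard deviation 2, the ratio q/π equals 0.5 exp(3x²/8) and attains its minimum a = 1/2 at the origin; with a constant (non-adaptive) proposal, a_j = a for all j and the bound decays geometrically.

```python
import numpy as np

# Grid evaluation of the target pi = N(0,1) and the proposal q = N(0,4).
x = np.linspace(-8.0, 8.0, 2001)
pi_x = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
q_x = np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))

# Largest a with q >= a * pi on the grid: the minimum of q/pi, here 1/2 at x = 0.
a = (q_x / pi_x).min()

# Bound (2) with constant a_j = a: ||p_t - pi||_TV <= 2 (1 - a)^t.
t = np.arange(1, 21)
bound = 2.0 * (1.0 - a) ** t
```

The bound certifies geometric convergence in total variation for any proposal satisfying the strong Doeblin condition with a uniform lower bound on a_t.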

3 Construction of sticky proposal functions

There are many alternatives available for the construction of a suitable proposal distribution in the ASM and ASMTM algorithms. In this section we focus on some procedures that approximate the target distribution by interpolating points that belong to the graph of the (unnormalized) target. The points are identified by evaluating the target at the support points, and the set of support points changes over the algorithm iterations. The name sticky, which we choose for this algorithm, highlights the ability of the adaptation schemes to generate a sequence of proposal distributions which converges to the target, allowing for a full adaptation of the proposal distribution. The adaptation relies upon interpolation schemes which are easy to improve by adding new points to the support set and easy to sample from. A general approach to interpolation is based on piecewise linear functions. We note that the resulting proposal density can be represented as a mixture of probability density functions, so that to draw from it one needs to compute the mixture weights, to sample from a discrete distribution in order to choose one of the mixture components, and finally to be able to draw samples from the selected component. The use of mixture distributions as proposals is common to many adaptive algorithms. In the class of importance sampling methods, the validity of the algorithms with adaptive proposals can easily be shown by an importance sampling argument. See, for example, the Population Monte Carlo (Cappé et al. (2004)), the iterative importance sampling (Cappé et al. (2008)) and the adaptive importance sampling (Hoogerheide et al. (2012)) algorithms. Adaptive proposal mixtures are less frequently used in Metropolis algorithms, mainly due to the difficulties in showing the validity of the algorithms. One of the recent papers in this direction is Holden et al.
(2009), who propose a Metropolis algorithm with proposals from adaptive mixture distributions and show the geometric ergodicity of the adaptive Metropolis chain. In this paper, we contribute to this stream of literature by proposing new adaptation schemes and extending the results of Holden et al. (2009). We will present three different adaptation strategies for the proposal distributions. Let us assume that a set S_t = {s_1, ..., s_{m_t}} of m_t support points is available at iteration t + 1 of a Metropolis algorithm. Define a sequence of m_t + 1 intervals: I_0 = (-∞, s_1], I_j = (s_j, s_{j+1}] for j = 1, ..., m_t - 1, and I_{m_t} = (s_{m_t}, +∞). In the first type of adaptation scheme, the proposal distribution is a mixture of m_t + 1 densities with disjoint supports I_j, j = 0, ..., m_t.
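Since the piecewise-linear log-domain constructions presented in the following sections yield proposals made of exponential pieces, i.e. densities proportional to exp(a_j x + b_j) on disjoint intervals, the three-stage sampling procedure just described (compute mixture weights, select a component, draw within it) can be sketched as follows. The helper names are ours and the inverse-CDF formula is standard.

```python
import numpy as np

def sample_exp_piece(rng, l, u, a):
    """Inverse-CDF draw from a density proportional to exp(a*x) on (l, u]."""
    if abs(a) < 1e-12:
        return rng.uniform(l, u)          # flat piece: plain uniform
    v = rng.uniform()
    # CDF(x) = (exp(a(x-l)) - 1) / (exp(a(u-l)) - 1); inverted with log1p/expm1
    # for numerical stability when a*(u-l) is small.
    return l + np.log1p(v * np.expm1(a * (u - l))) / a

def sample_piecewise_exp(rng, edges, slopes, icepts):
    """Mixture whose j-th component is proportional to exp(slopes[j]*x + icepts[j])
    on (edges[j], edges[j+1]]: pick a piece with probability proportional to its
    area, then sample inside the selected piece."""
    areas = np.empty(len(slopes))
    for j, (l, u, a, b) in enumerate(zip(edges[:-1], edges[1:], slopes, icepts)):
        areas[j] = np.exp(b) * ((np.exp(a * u) - np.exp(a * l)) / a
                                if abs(a) > 1e-12 else (u - l))
    k = rng.choice(len(areas), p=areas / areas.sum())
    return sample_exp_piece(rng, edges[k], edges[k + 1], slopes[k])
```

The same two-stage pattern (discrete component choice, then within-component draw) applies verbatim to the uniform and trapezoidal pieces discussed below; only the within-piece sampler changes.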

The addition of a new support point, say s', can change the shape of the densities associated with the different intervals. For instance, if s' ∈ I_k, then the algorithm will update the mixture components associated with I_k, I_{k-1} and I_{k+1}. This feature of the adaptation scheme has, as a special case, the construction in Gilks et al. (1995b). The proposal distribution, in the second type of adaptation scheme, is a mixture of densities with disjoint supports, like the one used in the first method, but the addition of a new support point, say s', can change only one component of the mixture. For instance, if s' ∈ I_k, then the k-th density of the mixture will be improved. This proposal updating scheme is a simpler alternative to Gilks et al. (1995b). Finally, we consider proposal adaptation schemes based on mixtures of densities with overlapping supports. This kind of adaptation strategy uses the points in S_t in a quite general approach to the construction of sticky proposal distributions. For instance, Shao et al. (2013) recently proposed B-spline techniques to build proposal distributions for accept/reject algorithms. In an adaptive importance sampling framework, Cappé et al. (2008) and Hoogerheide et al. (2012) use adaptive mixtures of Student-t distributions with overlapping supports. In the following sections, we discuss the three adaptation schemes and illustrate how our sticky proposal construction applies within these schemes. Moreover, we briefly discuss the construction of the tails of the mixture distribution and the different procedures to handle unbounded target distributions.

3.1 Disjoint supports and proposal changes in different intervals

The first adaptation strategy relies upon interpolation of points on the graph of the target. For the sake of simplicity, we describe the interpolation procedure representing the target and proposal densities in the log-domain.
Hence, let us define the log-density functions
$$W_{t+1}(x) \triangleq \log[q_{t+1}(x|\mathcal{S}_t)], \qquad V(x) \triangleq \log[\pi(x)], \qquad (4)$$
where q_{t+1}(x|S_t) is the proposal at iteration t + 1 of Algorithms 1 and 2, and π is the target distribution. Let us denote by L_{j,j+1}(x) the straight line passing through the points (s_j, V(s_j)) and (s_{j+1}, V(s_{j+1})), for j = 1, ..., m_t - 1, where s_j ∈ S_t. Also, set L_{-1,0}(x) = L_{0,1}(x) ≜ L_{1,2}(x),

L_{m_t,m_t+1}(x) = L_{m_t+1,m_t+2}(x) ≜ L_{m_t-1,m_t}(x). In Gilks et al. (1995b), W_{t+1}(x) is a piecewise linear function,
$$W_{t+1}(x) = \max\big[ L_{j,j+1}(x),\ \min[L_{j-1,j}(x),\ L_{j+1,j+2}(x)] \big], \qquad (5)$$
for x ∈ I_j, where I_j = (s_j, s_{j+1}], j = 1, ..., m_t - 1, I_0 = (-∞, s_1] and I_{m_t} = (s_{m_t}, +∞). The function W_{t+1}(x) can be rewritten as follows:
$$W_{t+1}(x) = \begin{cases} L_{1,2}(x), & x \in I_0;\\ \max\{L_{1,2}(x),\ L_{2,3}(x)\}, & x \in I_1;\\ \max\{L_{j,j+1}(x),\ \min\{L_{j-1,j}(x),\ L_{j+1,j+2}(x)\}\}, & x \in I_j,\ 2 \leq j \leq m_t - 2;\\ \max\{L_{m_t-1,m_t}(x),\ L_{m_t-2,m_t-1}(x)\}, & x \in I_{m_t-1};\\ L_{m_t-1,m_t}(x), & x \in I_{m_t}. \end{cases} \qquad (6)$$
Eqs. (5) and (6) show that the construction of the log-density in an interval I_j also depends on the points s_{j-1} and s_{j+2}. Therefore, the addition of a point in an interval can change the construction in the adjacent regions. For instance, let us assume S_t = {s_1, s_2, s_3, s_4, s_5}. Fig. 1(a) illustrates the construction using the points in the set S_t. Fig. 1(b) shows how the construction changes when a new point is added between the points s_1 and s_2 of the set S_t used in Fig. 1(a). As illustrated in Fig. 1(b), this construction requires modifying the lines for the intervals I_0 = (-∞, s_1] and I_1 = (s_1, s_2] of Fig. 1(a) and computing the intersection point between two straight lines (see interval I_2 = (s_2, s_3] of Fig. 1(b)) in order to be able to draw adequately from the corresponding proposal distribution. Note that a similar procedure, using pieces of quadratic functions in the log-domain (namely, pieces of truncated Gaussian densities in the pdf domain), has also been proposed in Meyer et al. (2008).

3.2 Disjoint supports and proposal changes in one interval

Gilks et al. (1995b) introduced, for the ARMS algorithm, the procedure to build q_{t+1}(x|S_t) described in the previous section. The computational complexity of the procedure arises from the need to construct a proposal function above the target in as many regions as possible, in order to take advantage of the rejection sampling step.
We note that a simpler approach to building the proposal is to define W_{t+1}(x) inside the i-th interval as the straight line passing through (s_i, V(s_i)) and (s_{i+1}, V(s_{i+1})), i.e., L_{i,i+1}(x), for 1 ≤ i ≤ m_t - 1, and extending the straight lines corresponding to I_1 and

Figure 1: Examples of the piecewise linear function W_{t+1}(x) built using the procedure described in Gilks et al. (1995b) for the set S_t = {s_1, ..., s_5} of support points (graph (a)) and for the set of support points {s_1, ..., s_6} (graph (b)), obtained by adding a new point between the two points s_1 and s_2 in S_t.

I_{m_t-1}. Formally, this can be expressed as
$$W_{t+1}(x) = \begin{cases} L_{1,2}(x), & x \in I_0 = (-\infty, s_1];\\ L_{i,i+1}(x), & x \in I_i = (s_i, s_{i+1}],\ 1 \leq i \leq m_t - 1;\\ L_{m_t-1,m_t}(x), & x \in I_{m_t} = (s_{m_t}, +\infty). \end{cases} \qquad (7)$$
This construction is illustrated in Fig. 2(a). Although this procedure looks similar to the one used in ARMS by Gilks et al. (1995b), it is in fact much simpler, since no minimization or maximization is involved, and thus it does not require the calculation of intersection points to determine when one straight line is above the other. Observe that the proposal q_{t+1}(x|S_t) = exp{W_{t+1}(x)}, with such a definition, is formed by exponential pieces (in the pdf-domain). Moreover, an even simpler procedure to construct W_{t+1}(x) can be devised using a piecewise constant approximation with two straight lines inside the first and last intervals. Mathematically, it can be expressed as
$$W_{t+1}(x) = \begin{cases} L_{1,2}(x), & x \in I_0 = (-\infty, s_1];\\ \max\{V(s_i),\ V(s_{i+1})\}, & x \in I_i = (s_i, s_{i+1}],\ 1 \leq i \leq m_t - 1;\\ L_{m_t-1,m_t}(x), & x \in I_{m_t} = (s_{m_t}, +\infty). \end{cases} \qquad (8)$$
The construction described above leads to the simplest proposal density, i.e.,

Figure 2: Examples of the construction of $W_{t+1}(x)$ using the procedures described in Eq. (7) (graph (a)) and in Eq. (8) (graph (b)).

a collection of uniform pdfs with two exponential tails. Fig. 2(b) shows an example of the construction of the proposal using this approach. Note that we can also apply the procedure proposed for adaptive trapezoid Metropolis sampling (ATRAMS, Cai et al. (2008)) to build the proposal distribution. However, the structure of the ATRAMS algorithm (Cai et al., 2008) is completely different from that of the ASM and ARMS-type techniques. In both cases the proposal is constructed in the domain of the target pdf, $\pi(x)$, rather than in the domain of the log-pdf, $V(x) = \log(\pi(x))$. For instance, the basic idea proposed for ATRAMS is using straight lines, $\tilde{L}_{i,i+1}(x)$, passing through the points $(s_i, \pi(s_i))$ and $(s_{i+1}, \pi(s_{i+1}))$ for $i = 1, \ldots, m_t - 1$, and two exponential pieces, $E_0(x)$ and $E_{m_t}(x)$, for the tails:

$$q_t(x|S_t) \propto
\begin{cases}
E_0(x), & x \in I_0 = (-\infty, s_1]; \\
\tilde{L}_{i,i+1}(x), & x \in I_i = (s_i, s_{i+1}], \ i = 1, \ldots, m_t - 1; \\
E_{m_t}(x), & x \in I_{m_t} = (s_{m_t}, +\infty).
\end{cases} \quad (9)$$

Unlike in Cai et al. (2008), here the tails $E_0(x)$ and $E_{m_t}(x)$ do not necessarily have to be equivalent in the areas they enclose. Note that $\tilde{L}$ denotes a straight line built directly in the domain of $\pi(x)$, whereas $L$ denotes the linear function constructed in the log-domain. Indeed, we may follow a much simpler approach, calculating two secant lines $L_{1,2}(x)$ and $L_{m_t-1,m_t}(x)$ passing through $(s_1, V(s_1))$, $(s_2, V(s_2))$ and through $(s_{m_t-1}, V(s_{m_t-1}))$, $(s_{m_t}, V(s_{m_t}))$, respectively, so that the two exponential tails are defined as $E_0(x) = \exp\{L_{1,2}(x)\}$ and $E_{m_t}(x) = \exp\{L_{m_t-1,m_t}(x)\}$. Fig. 3 depicts an example of the construction of $q_t(x)$ using this last procedure. Note that drawing samples from these trapezoidal pdfs inside

Figure 3: Example of the construction of the proposal density, $q_{t+1}(x|S_t)$, using a procedure described in Cai et al. (2008) within the ATRAMS algorithm, in the pdf domain (graph (a)) and in the log-domain (graph (b)).

$I_i = (s_i, s_{i+1}]$ can be easily done (Cai et al., 2008; Devroye, 1986). Indeed, given $u, v \sim \mathcal{U}([s_i, s_{i+1}])$ and $w \sim \mathcal{U}([0, 1])$, then

$$x =
\begin{cases}
\min\{u, v\}, & w < \dfrac{\pi(s_i)}{\pi(s_i) + \pi(s_{i+1})}; \\[2mm]
\max\{u, v\}, & w \ge \dfrac{\pi(s_i)}{\pi(s_i) + \pi(s_{i+1})};
\end{cases} \quad (10)$$

is distributed according to a trapezoidal density defined in the interval $I_i = [s_i, s_{i+1}]$.

3.3 Overlapping supports

It is possible to consider proposal densities of the type $q_{t+1}(x|S_t) \propto \sum_{j=1}^{m_t-1} \omega_j f_j(x)$, where the $f_j(x)$ could be B-spline functions (e.g., see Krzykowski and Mackowiak (2006); Shao et al. (2013)) or the densities of Gaussian or Student-t distributions (e.g., see Cappé et al. (2008)), for instance. Clearly, we need to be able to draw from each $f_j(x)$. It is possible to draw from B-splines; however, if the target has unbounded domain then, since B-splines always have finite support, they must be combined with other kinds of proposal densities specifically designed for the tails. In the case

mixtures of Gaussian or Student-t distributions are used instead, this problem is solved, since the densities of these distributions are defined on the whole of $\mathbb{R}$. In this type of adaptation scheme, the weights should be chosen to satisfy the passing conditions through the points $(s_i, \pi(s_i))$, $i = 1, \ldots, m_t$. However, it is necessary that

$$\omega_j \ge 0 \quad \text{for all } j = 1, \ldots, m_t, \quad (11)$$

in order to be able to draw from the proposal $q_t(x|S_t)$. The problem is that, in general, when the passing conditions are satisfied some weights can be negative, $\omega_k < 0$. For this reason, this approach can appear completely useless. To overcome this issue, the passing conditions can be relaxed by finding the weights that minimize a least squares cost function under the constraints $\omega_j \ge 0$, for instance. Several further considerations are needed; however, a detailed treatment of this case oversteps the aims of this work and deserves a specific and separate study.

3.4 Heavy-tail proposal distribution

The adaptation procedures presented in the previous sections refer to proposal distributions with exponential tails. It is worth mentioning that it is not strictly necessary to change the construction of the tails, but there are some benefits in handling the tails with different approaches. Specifically, we can diminish the dependence on the initial points and also speed up the convergence of the chain when the target has heavy tails. For instance, an alternative choice (in the log-domain) for the tails is to use functions of the type

$$h(x) = a + \log[1/x^{\gamma}], \quad \gamma > 1, \quad (12)$$

instead of the straight lines, in the intervals $I_0$ and $I_{m_t}$. In the pdf domain, the corresponding tail function $T(x) = \exp\{h(x)\}$ equals $\exp(a)/x^{\gamma}$, which is proportional to a Pareto-type pdf. By noting that $h(x) = a - \gamma \log[x]$, we can set $a$ and $\gamma$ for the left tail, in $I_0$, by solving the following linear system in $a$ and $\gamma$:

$$\begin{cases}
V(s_1) = \log[\pi(s_1)] = a - \gamma \log[s_1], \\
V(s_2) = \log[\pi(s_2)] = a - \gamma \log[s_2].
\end{cases} \quad (13)$$
Analogously, we can fix $a$ and $\gamma$ for the right tail by considering the points $(s_{m_t-1}, V(s_{m_t-1}))$ and $(s_{m_t}, V(s_{m_t}))$. This approach is suitable when we are in the presence of a heavy-tailed target.
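The system in Eq. (13) has a simple closed-form solution. The sketch below is our own helper (assuming two support points $0 < s_1 < s_2$ in the relevant tail), not part of the original algorithm description:

```python
import math

def fit_pareto_tail(s1, s2, v1, v2):
    """Solve Eq. (13): v1 = a - gamma*log(s1), v2 = a - gamma*log(s2),
    for the Pareto-type tail parameters (a, gamma); assumes 0 < s1 < s2."""
    gamma = (v1 - v2) / math.log(s2 / s1)
    a = v1 + gamma * math.log(s1)
    return a, gamma

# Sanity check: for a target with an exact Pareto tail pi(x) ∝ x^{-2},
# i.e. V(x) = -2 log x, the fit recovers gamma = 2 and a = 0:
a, gamma = fit_pareto_tail(2.0, 4.0, -2 * math.log(2.0), -2 * math.log(4.0))
```

The same helper applies verbatim to the right tail by passing $(s_{m_t-1}, s_{m_t})$ and the corresponding values of $V$.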

It is important to observe that, in the log-pdf domain, if the tails of the function $V(x) = \log[\pi(x)]$ are convex, then a proposal pdf with exponential decay can be perfectly adequate. On the other hand, if the tails of the function $V(x) = \log[\pi(x)]$ are concave, then the Pareto choice can be more suitable. If there is no information about the tails of the target, we can implement an automatic strategy for fixing $T(x)$; the resulting method is then a completely self-tuning algorithm, useful in several different frameworks. Consider the set of support points $S_t = \{s_1, \ldots, s_{m_t}\}$, sorted in ascending order. We can use the first three ($s_1$, $s_2$ and $s_3$) and the last three ($s_{m_t-2}$, $s_{m_t-1}$ and $s_{m_t}$) support points to estimate the concavity of the function $V(x)$ in the tails. Namely, we can consider the quadratic polynomial $y = \alpha x^2 + \beta x + c$ passing through $(s_1, V(s_1))$, $(s_2, V(s_2))$ and $(s_3, V(s_3))$, and/or through $(s_{m_t-2}, V(s_{m_t-2}))$, $(s_{m_t-1}, V(s_{m_t-1}))$ and $(s_{m_t}, V(s_{m_t}))$. Then, if $\alpha \ge 0$ we use light exponential tails, whereas if $\alpha < 0$ we use heavy tails. Clearly, the sign of $\alpha$ can vary along the iterations, depending on $S_t$, so that the type of tails used can change accordingly.

3.5 Unbounded density functions

The ASMTM algorithm, with the constructions described in the previous sections, can be applied to bounded target pdfs $\pi(x)$. A cautionary note is in order if the target pdf is unbounded: in this case, the sticky algorithms may need an unbounded proposal. As an example, consider a target function $\pi(x)$ with a vertical asymptote at $x = a$ and the set of support points $S_t = \{s_1, \ldots, s_k, \ldots, s_{m_t}\}$, sorted in ascending order with $s_k = a$. A suitable construction procedure should use specific functions for the intervals $I_{k-1} = (s_{k-1}, s_k)$ and $I_k = (s_k, s_{k+1}]$. For the rest of the intervals, the constructions in the previous sections are completely adequate.
For instance, we can use functions of the following form:

$$g_j(x) = \frac{1}{|x - a|^{\alpha}} + \beta_j, \quad j = k-1, \ k,$$

with $0 < \alpha < 1$, where the constants

$$\beta_{k-1} = \pi(s_{k-1}) - \frac{1}{|s_{k-1} - a|^{\alpha}} \quad \text{and} \quad \beta_k = \pi(s_{k+1}) - \frac{1}{|s_{k+1} - a|^{\alpha}}$$

are set in order to obtain $g_{k-1}(s_{k-1}) = \pi(s_{k-1})$ and $g_k(s_{k+1}) = \pi(s_{k+1})$, respectively.
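As an illustration of this construction (a sketch under our own choice of target and $\alpha$, not part of the paper), the following builds the piece $g_k$ on the interval $I_k = (a, s_{k+1}]$ to the right of the asymptote and checks that its area is finite even though $g_k$ blows up at $x = a$, thanks to $0 < \alpha < 1$:

```python
import math

def unbounded_piece(pi, a, s_next, alpha=0.5):
    """Proposal piece g(x) = |x - a|^{-alpha} + beta on I_k = (a, s_next],
    with beta chosen so that g(s_next) = pi(s_next), cf. Section 3.5."""
    beta = pi(s_next) - 1.0 / abs(s_next - a) ** alpha
    g = lambda x: 1.0 / abs(x - a) ** alpha + beta
    # Finite area despite the asymptote at x = a, since alpha < 1:
    # int_a^{s_next} |x - a|^{-alpha} dx = (s_next - a)^{1-alpha} / (1 - alpha)
    area = (s_next - a) ** (1 - alpha) / (1 - alpha) + beta * (s_next - a)
    return g, area

# Example: pi(x) = x^{-1/2} is unbounded at a = 0; with alpha = 1/2 the piece
# matches pi exactly on (0, 1], and its area over (0, 1] equals 2.
g, area = unbounded_piece(lambda x: x ** -0.5, a=0.0, s_next=1.0, alpha=0.5)
```

The mirrored piece $g_{k-1}$ on $I_{k-1} = (s_{k-1}, a)$ is obtained in the same way, matching $\pi$ at the left endpoint $s_{k-1}$.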

3.6 Approximation bounds

For the approximation methods presented in the previous sections it is possible to show that the proposal distributions generated by the interpolation procedure converge to the target distribution when the number of support points goes to infinity.

3.6.1 Convergence of the sequence of the unnormalized proposal

Theorem 3. Consider a continuous bounded target density $\pi(x)$, with $x \in \mathcal{X}$, with bounded second order derivative, and denote with $\pi$ the unnormalized density. Let $\{q_t(x|S_{t-1})\}_{t=1}^{+\infty}$ be a sequence of possibly unnormalized proposal density functions such that $q_t(x|S_{t-1}) > 0$ for all $x \in \mathcal{X}$. Then,

$$\int_{\mathcal{X}} |q_t(x|S_{t-1}) - \pi(x)| \, dx \longrightarrow 0 \quad \text{as } t \to \infty.$$

Proof. Let us consider a generic set of support points, $S_{t-1} = \{s_1, \ldots, s_{m_{t-1}}\}$, with $s_1 < \ldots < s_{m_{t-1}}$, at time step $t$. Note that, using any of the procedures described above in this section, the corresponding proposal density function, $q_t(x|S_{t-1})$, is a bounded function, since $\pi(x)$ is bounded. Moreover, since $\int_{\mathcal{X}} \pi(x) dx < +\infty$ and $\int_{\mathcal{X}} q_t(x|S_{t-1}) dx < +\infty$, the $L_1$-distance between $q_t(x|S_{t-1})$ and $\pi(x)$ is bounded for any $t$, i.e., $\int_{\mathcal{X}} |q_t(x|S_{t-1}) - \pi(x)| \, dx < +\infty$. Let us consider the finite interval $I = [s_1, s_{m_t}]$; then all the interpolation methods proposed in this section to build $q_t(x|S_{t-1})$ can be represented as a Taylor approximation of order zero or one inside each interval. Hence, the discrepancy between $q_t(x|S_{t-1})$ and $\pi(x)$ inside $I$ can be bounded as

$$\int_I |q_t(x|S_{t-1}) - \pi(x)| \, dx = \sum_{i=1}^{m_t-1} \int_{I_i} |q_t(x|S_{t-1}) - \pi(x)| \, dx \le \sum_{i=1}^{m_t-1} \int_{I_i} |r_l^{(i)}(x)| \, dx, \quad (14)$$

where $r_l^{(i)}(x)$ is the remainder associated to the $l$-th order (with $l \in \{0, 1\}$ in our case) polynomial approximation of $\pi(x)$ inside the interval $I_i$, as given by Taylor's theorem. Let us recall that the Lagrange form of this remainder is

$$r_l^{(i)}(x) = \frac{(x - s_i)^{l+1}}{(l+1)!} \left. \frac{d^{l+1}\pi(x)}{dx^{l+1}} \right|_{x=\xi}, \quad (15)$$

for a value $\xi \in [s_i, x]$. Moreover, since $x \in I_i = [s_i, s_{i+1}]$, it is straightforward to show that

$$|r_l^{(i)}(x)| \le \frac{(s_{i+1} - s_i)^{l+1}}{(l+1)!} \, C_l^{(i)}, \quad (16)$$

where $C_l^{(i)} = \max_{x \in I_i} |\pi^{(l+1)}(x)|$, and $\pi^{(l+1)}(x)$ denotes the $(l+1)$-th derivative of $\pi(x)$, i.e., $\pi^{(l+1)}(x) = \frac{d^{l+1}\pi(x)}{dx^{l+1}}$. Hence, replacing (16) in (14), we obtain

$$\sum_{i=1}^{m_t-1} \int_{I_i} |r_l^{(i)}(x)| \, dx \le \sum_{i=1}^{m_t-1} \frac{(s_{i+1} - s_i)^{l+2}}{(l+2)!} \, C_l^{(i)}. \quad (17)$$

Now, let us assume that a new point, $s' \in I_k = [s_k, s_{k+1}]$ for $1 \le k \le m_t - 1$, is added at the next iteration. In this case, the construction of the proposal changes only inside the interval $I_k$, as shown in this section. Indeed, assume that $I_k$ is now split into $I^{(1)} = [s_k, s']$ and $I^{(2)} = [s', s_{k+1}]$, i.e., $I_k = I^{(1)} \cup I^{(2)}$. Obviously, $\max_{x \in I^{(j)}} |\pi^{(l+1)}(x)| \le \max_{x \in I_k} |\pi^{(l+1)}(x)|$ with $j \in \{1, 2\}$, and

$$(s' - s_k)^{l+2} + (s_{k+1} - s')^{l+2} < (s_{k+1} - s_k)^{l+2}, \quad (18)$$

for any $l \ge 0$, since $A^{l+2} + B^{l+2} < (A + B)^{l+2}$ for any $A, B > 0$ thanks to Newton's binomial theorem, and here $A = s' - s_k > 0$ and $B = s_{k+1} - s' > 0$. Hence, the bound in Eq. (17) always decreases when a new support point is incorporated, and we can finally ensure that

$$\lim_{t \to +\infty} \sum_{i=1}^{m_t-1} \int_{I_i} |r_l^{(i)}(x)| \, dx = 0, \quad (19)$$

since the support points become arbitrarily close as $t \to \infty$ (i.e., $s_{i+1} - s_i \to 0$), and thus the bound on the right hand side of (17) tends to zero as $t \to \infty$. Hence, we can guarantee that $\int_I |q_t(x|S_{t-1}) - \pi(x)| \, dx \to 0$ for $t \to +\infty$. Note that we cannot guarantee a monotonic decrease of the distance between $q_t(x|S_{t-1})$ and $\pi(x)$ inside $I$, since adding a new support point might occasionally lead to an increase in the discrepancy. However, we can guarantee that the upper bound on this distance decreases monotonically, thus ensuring that $q_t(x|S_{t-1}) \to \pi(x)$ as $t \to \infty$, i.e., adding support points will eventually take us arbitrarily close to $\pi(x)$. Finally, w.r.t. the tails, note that the distance between $q_t$ and $\pi$ remains bounded even for heavy-tailed distributions.
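The behavior established so far can also be checked numerically. The sketch below is our own illustration (a standard Gaussian target, the order-zero construction of Eq. (8) inside $[-2, 2]$, and equally spaced support points): refining the support set reduces the $L_1$ discrepancy, in agreement with the decreasing bound of Eq. (17), even though the decrease of the distance itself need not be monotonic in general.

```python
import numpy as np

def l1_distance(support, target, grid):
    """L1 distance on [support[0], support[-1]] between the target and the
    order-zero (piecewise constant, Eq. (8)-style) approximation that takes
    the value max{pi(s_i), pi(s_{i+1})} inside each interval."""
    v = target(support)
    idx = np.clip(np.searchsorted(support, grid) - 1, 0, len(support) - 2)
    q = np.maximum(v[idx], v[idx + 1])
    err = np.abs(q - target(grid))
    return np.sum(0.5 * (err[:-1] + err[1:]) * np.diff(grid))  # trapezoid rule

target = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)   # N(0,1) pdf
grid = np.linspace(-2, 2, 20001)
d_coarse = l1_distance(np.linspace(-2, 2, 5), target, grid)   # 5 points
d_fine = l1_distance(np.linspace(-2, 2, 9), target, grid)     # + midpoints
```

Here `d_fine < d_coarse`: halving every interval roughly halves the discrepancy of the order-zero approximation.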
Furthermore, the interval I will become greater as t +, since there is always a non-null probability of adding new support points inside the tails. Therefore, the probability 21

mass associated to the tails decreases monotonically as $t \to \infty$. Hence, even though the distance between the target and the proposal may again increase occasionally due to the introduction of a new support point in the tails, we can guarantee that such a distance goes to zero as $t$ goes to infinity.

3.6.2 Convergence to the normalized target

For the sake of simplicity, in this section we denote as $\tilde{q}_t(x|S_{t-1})$ and $\tilde{\pi}(x)$ the unnormalized density functions, whereas $q_t(x|S_{t-1})$ and $\pi(x)$ indicate the normalized densities. We remark, however, that in the rest of this work we have considered $q_t(x|S_{t-1})$ and $\pi(x)$ as unnormalized pdfs. Therefore, so far the interpolation (or approximation) was applied to the unnormalized target $\tilde{\pi}(x)$, to deal with the general case, and the proposal function $\tilde{q}_t(x|S_{t-1})$ is unnormalized as well; namely, we build $\tilde{q}_t(x|S_{t-1})$ via interpolation using the information of $\tilde{\pi}(x)$. We denote the corresponding normalizing constants $1/c_t$ and $1/c_\pi$, respectively. Since $\tilde{q}_t$ converges to $\tilde{\pi}$ in $L_1$ as $t$ goes to infinity, the normalizing constants also converge, i.e., $c_t$ converges to $c_\pi$. Indeed, denoting as $d(f, g) = \int_{\mathcal{X}} |f(x) - g(x)| \, dx$ the $L_1$ distance between $f(x)$ and $g(x)$, we have the following result.

Theorem 4. Let $q_t(x|S_{t-1}) = \frac{1}{c_t} \tilde{q}_t(x|S_{t-1})$ and $\pi(x) = \frac{1}{c_\pi} \tilde{\pi}(x)$, where $c_\pi = \|\tilde{\pi}\| = \int_{\mathcal{X}} \tilde{\pi}(x) dx$ and $c_t = \|\tilde{q}_t\| = \int_{\mathcal{X}} \tilde{q}_t(x|S_{t-1}) dx$. If $d(\tilde{q}_t, \tilde{\pi}) \to 0$ as $t \to \infty$, then $d(q_t, \pi) \to 0$ as $t \to \infty$.

Proof. Let us denote $\tilde{D}_t = d(\tilde{q}_t, \tilde{\pi})$ and $D_t = d(q_t, \pi)$. We can use an extended triangle inequality of the type $d(A, E) \le d(A, B) + d(B, C) + d(C, E)$, with the points $A = q_t$, $B = \frac{1}{c_t} \tilde{q}_t$, $C = \frac{1}{c_\pi} \tilde{q}_t$ and $E = \pi = \frac{1}{c_\pi} \tilde{\pi}$, i.e.,

$$D_t = d(q_t, \pi) \le d\left(q_t, \frac{1}{c_t} \tilde{q}_t\right) + d\left(\frac{1}{c_t} \tilde{q}_t, \frac{1}{c_\pi} \tilde{q}_t\right) + d\left(\frac{1}{c_\pi} \tilde{q}_t, \frac{1}{c_\pi} \tilde{\pi}\right) = 0 + \left|1 - \frac{c_t}{c_\pi}\right| + \frac{1}{c_\pi} \tilde{D}_t,$$

where we have used $q_t = \frac{1}{c_t} \tilde{q}_t$ and $d\left(\frac{1}{c_t} \tilde{q}_t, \frac{1}{c_\pi} \tilde{q}_t\right) = \left|\frac{1}{c_t} - \frac{1}{c_\pi}\right| \|\tilde{q}_t\| = \left|1 - \frac{c_t}{c_\pi}\right|$.

Hence, setting $C_t = \left|1 - \frac{c_t}{c_\pi}\right|$, we can finally write

$$D_t \le C_t + \frac{1}{c_\pi} \tilde{D}_t. \quad (20)$$

Since $D_t \ge 0$, if $\lim_{t \to \infty} C_t = 0$ and $\lim_{t \to \infty} \tilde{D}_t = 0$, then $\lim_{t \to \infty} D_t = 0$ as well. Therefore, we just need to prove that $c_t \to c_\pi$ when $\lim_{t \to \infty} \tilde{D}_t = 0$. Clearly, $\int_{\mathcal{X}} |\tilde{\pi}(x) - \tilde{q}_t(x|S_{t-1})| \, dx \ge \left| \int_{\mathcal{X}} \tilde{\pi}(x) dx - \int_{\mathcal{X}} \tilde{q}_t(x|S_{t-1}) dx \right|$, since $\tilde{\pi}(x), \tilde{q}_t(x|S_{t-1}) \ge 0$, with equality if $\tilde{\pi}(x) \ge \tilde{q}_t(x|S_{t-1})$ for all $x$. Moreover, using again the triangle inequality, we can also write

$$\|\tilde{\pi}\| = \|(\tilde{\pi} - \tilde{q}_t) + \tilde{q}_t\| \le \|\tilde{\pi} - \tilde{q}_t\| + \|\tilde{q}_t\|, \qquad \|\tilde{q}_t\| = \|(\tilde{q}_t - \tilde{\pi}) + \tilde{\pi}\| \le \|\tilde{q}_t - \tilde{\pi}\| + \|\tilde{\pi}\|.$$

Combining the two previous inequalities, we obtain $\big| \|\tilde{\pi}\| - \|\tilde{q}_t\| \big| \le \|\tilde{\pi} - \tilde{q}_t\|$. Since $\tilde{D}_t = d(\tilde{\pi}, \tilde{q}_t) = \|\tilde{\pi} - \tilde{q}_t\|$, $c_\pi = \|\tilde{\pi}\|$ and $c_t = \|\tilde{q}_t\|$, we can finally rewrite this expression as

$$\tilde{D}_t \ge |c_\pi - c_t|. \quad (21)$$

The expression above is also called the reverse triangle inequality. Then, if $\lim_{t \to \infty} \tilde{D}_t = 0$, we also have

$$\lim_{t \to \infty} |c_\pi - c_t| = 0, \quad (22)$$

i.e., $c_t \to c_\pi$ for $t \to \infty$ and $C_t = \left|1 - \frac{c_t}{c_\pi}\right| \to 0$.

4 Practical Implementation

4.1 Updating of the set of support points

In this section, we focus on the update step of Algorithms 1 and 2, where a test is introduced for controlling the evolution of the set of support points. This step can be seen as a measure of similarity, at the proposed point, between the proposal and target distributions. It is a part of the algorithm

that is extremely important, since it controls the trade-off between better performance and greater computational cost. Indeed, the use of more support points improves the performance but, at the same time, increases the computational cost. In this step a choice of two functions, $\eta$ and $d_t$, is needed. The first one is a strictly increasing function with values in $[0, 1]$, and $d_t$ is a distance between the proposal and the target distribution. For instance, following the literature on adaptive mixture proposals, one can choose logistic weights and a local absolute distance between proposal and target, which has a low computational cost. These choices correspond to the following specification:

$$\eta(d_t(z)) = \frac{1}{1 + \exp\{-d_t(z)\}}, \qquad d_t(z) = |\pi(z) - q_t(z|S_{t-1})|. \quad (23)$$

In order to reduce the computational complexity of the algorithm, it is possible to recycle some of the outputs of the Metropolis steps of Algorithm 1. From this perspective, a natural choice of $\eta$ and $d_t$ could be

$$\eta(d_t(z)) = d_t(z), \qquad d_t(z) = \frac{|q_t(z|S_{t-1}) - \pi(z)|}{\max\{\pi(z), q_t(z|S_{t-1})\}}. \quad (24)$$

Note that the choice of a linear function for $\eta$ produces valid weights if $d_t \in [0, 1]$. This condition is satisfied in this case; in fact,

$$\eta(d_t(z)) = \frac{|q_t(z|S_{t-1}) - \pi(z)|}{\max\{\pi(z), q_t(z|S_{t-1})\}} = \frac{\max\{\pi(z), q_t(z|S_{t-1})\} - \min\{\pi(z), q_t(z|S_{t-1})\}}{\max\{\pi(z), q_t(z|S_{t-1})\}},$$

and then

$$\eta(d_t(z)) = 1 - \frac{\min\{\pi(z), q_t(z|S_{t-1})\}}{\max\{\pi(z), q_t(z|S_{t-1})\}}. \quad (25)$$

At first glance, this choice of $\eta(d_t(z))$ may appear arbitrary, but it becomes natural if one thinks of the classical construction in the ARS technique: $\eta(d_t(z))$ recalls the probability of adding a new support point in the ARS method and, if $q_t(z|S_{t-1}) \ge \pi(z)$ for all $z \in \mathcal{D}$ and all $t$, it becomes $1 - \frac{\pi(z)}{q_t(z|S_{t-1})}$, which is exactly the probability of incorporating $z$ into the set of support points in the ARS method. The updating rules presented above for Algorithm 1 require some changes when used in a multiple proposal algorithm such as Algorithm 2. Let us

consider the updating scheme in Eq. (24). Let $z_i$, $i = 1, \ldots, M$, be a set of proposals; then the updating step for $S_{t-1}$ splits into two parts. First, a $z$ is selected among the proposals, $z_1, \ldots, z_M$, with probability proportional to

$$\varphi_t(z_i) = \max\{1, w(z_i)\}, \qquad w(z_i) = \frac{\max\{\pi(z_i), q_t(z_i|S_{t-1})\}}{\min\{\pi(z_i), q_t(z_i|S_{t-1})\}}, \quad (26)$$

$i = 1, \ldots, M$. This step selects with high probability a sample at which the proposal value is far from the target. The second step is a control step, where $z$ is included in the set of support points with probability

$$d_t(z) = 1 - \frac{1}{\varphi_t(z)}.$$

This step is similar to the accept-reject step in the ARMS algorithm, and the probability of the point being included corresponds exactly to the probability of a proposal being rejected (and hence incorporated as a support point) in an ARMS algorithm. It can be shown that this two-step updating procedure corresponds to the one-step procedure

$$S_t =
\begin{cases}
S_{t-1} \cup \{z_i\} & \text{with prob. } \eta_i(d_t(z_i)) = \dfrac{\varphi_t(z_i) - 1}{\sum_{j=1}^{M} \varphi_t(z_j)}, \\[2mm]
S_{t-1} & \text{with prob. } \dfrac{M}{\sum_{i=1}^{M} \varphi_t(z_i)},
\end{cases}$$

where $\varphi_t(z_i) = \frac{1}{1 - d_t(z_i)}$ and $d_t(z_i) = 1 - \frac{\min\{\pi(z_i), q_t(z_i|S_{t-1})\}}{\max\{\pi(z_i), q_t(z_i|S_{t-1})\}}$. Finally, note that the updating rules in Eqs. (23) and (24) can be generalized. For instance, the updating rule

$$\eta(d_t(z)) = \frac{1}{1 + \exp\{-\gamma(d_t(z) - \varepsilon)\}}, \qquad d_t(z) = |\pi(z) - q_t(z|S_{t-1})|, \quad (27)$$

with $\gamma \in (0, +\infty)$ and $\varepsilon \in [0, +\infty)$, has the rule in Eq. (23) as a special case for $\gamma = 1$ and $\varepsilon = 0$. This generalization has an interesting limiting case that will be considered in our experiments: for $\gamma \to +\infty$ we obtain a sort of deterministic updating of the set of support points, since the function $\eta$ tends to $1$ if $d_t(z) > \varepsilon$ and to $0$ if $d_t(z) < \varepsilon$. Through the threshold parameter $\varepsilon$ it is possible to control the number of support points. The parameter can be updated over the iterations following a deterministic rule, in such a way as to stop the adaptation of the proposal and to reduce the computational cost of the algorithm.
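The equivalence between the two-step and the one-step procedures above can be checked numerically. The sketch below is our own illustration (the target and proposal values at the $M = 4$ proposals are arbitrary numbers):

```python
import numpy as np

def one_step_probs(pi_vals, q_vals):
    """Inclusion probabilities of the one-step rule: proposal z_i enters S_t
    with probability (phi_i - 1) / sum_j phi_j, where phi_i = max{pi, q} /
    min{pi, q} evaluated at z_i (cf. Eq. (26))."""
    phi = np.maximum(pi_vals, q_vals) / np.minimum(pi_vals, q_vals)
    return (phi - 1.0) / phi.sum()

def two_step_probs(pi_vals, q_vals):
    """Same probabilities via the two stages: select z_i with probability
    phi_i / sum_j phi_j, then include it with probability 1 - 1/phi_i."""
    phi = np.maximum(pi_vals, q_vals) / np.minimum(pi_vals, q_vals)
    select = phi / phi.sum()
    include = 1.0 - 1.0 / phi
    return select * include

pi_vals = np.array([0.8, 0.1, 0.5, 0.3])   # pi(z_i) at the M = 4 proposals
q_vals = np.array([0.4, 0.2, 0.5, 0.9])    # q_t(z_i | S_{t-1})
p1 = one_step_probs(pi_vals, q_vals)
p2 = two_step_probs(pi_vals, q_vals)
```

Here `p1` and `p2` coincide, and the leftover mass `1 - p1.sum()` equals $M / \sum_i \varphi_t(z_i)$, the probability of leaving $S_{t-1}$ unchanged.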


More information

13 Notes on Markov Chain Monte Carlo

13 Notes on Markov Chain Monte Carlo 13 Notes on Markov Chain Monte Carlo Markov Chain Monte Carlo is a big, and currently very rapidly developing, subject in statistical computation. Many complex and multivariate types of random data, useful

More information

5 Handling Constraints

5 Handling Constraints 5 Handling Constraints Engineering design optimization problems are very rarely unconstrained. Moreover, the constraints that appear in these problems are typically nonlinear. This motivates our interest

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Markov chain Monte Carlo Lecture 9

Markov chain Monte Carlo Lecture 9 Markov chain Monte Carlo Lecture 9 David Sontag New York University Slides adapted from Eric Xing and Qirong Ho (CMU) Limitations of Monte Carlo Direct (unconditional) sampling Hard to get rare events

More information

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm On the Optimal Scaling of the Modified Metropolis-Hastings algorithm K. M. Zuev & J. L. Beck Division of Engineering and Applied Science California Institute of Technology, MC 4-44, Pasadena, CA 925, USA

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture 18-16th March Arnaud Doucet

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture 18-16th March Arnaud Doucet Stat 535 C - Statistical Computing & Monte Carlo Methods Lecture 18-16th March 2006 Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Trans-dimensional Markov chain Monte Carlo. Bayesian model for autoregressions.

More information

Kernel adaptive Sequential Monte Carlo

Kernel adaptive Sequential Monte Carlo Kernel adaptive Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) December 7, 2015 1 / 36 Section 1 Outline

More information

Markov Chain Monte Carlo Lecture 4

Markov Chain Monte Carlo Lecture 4 The local-trap problem refers to that in simulations of a complex system whose energy landscape is rugged, the sampler gets trapped in a local energy minimum indefinitely, rendering the simulation ineffective.

More information

CSCI-6971 Lecture Notes: Monte Carlo integration

CSCI-6971 Lecture Notes: Monte Carlo integration CSCI-6971 Lecture otes: Monte Carlo integration Kristopher R. Beevers Department of Computer Science Rensselaer Polytechnic Institute beevek@cs.rpi.edu February 21, 2006 1 Overview Consider the following

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos Contents Markov Chain Monte Carlo Methods Sampling Rejection Importance Hastings-Metropolis Gibbs Markov Chains

More information

Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018

Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018 Graphical Models Markov Chain Monte Carlo Inference Siamak Ravanbakhsh Winter 2018 Learning objectives Markov chains the idea behind Markov Chain Monte Carlo (MCMC) two important examples: Gibbs sampling

More information

Randomized Quasi-Monte Carlo for MCMC

Randomized Quasi-Monte Carlo for MCMC Randomized Quasi-Monte Carlo for MCMC Radu Craiu 1 Christiane Lemieux 2 1 Department of Statistics, Toronto 2 Department of Statistics, Waterloo Third Workshop on Monte Carlo Methods Harvard, May 2007

More information

A generalization of the Multiple-try Metropolis algorithm for Bayesian estimation and model selection

A generalization of the Multiple-try Metropolis algorithm for Bayesian estimation and model selection A generalization of the Multiple-try Metropolis algorithm for Bayesian estimation and model selection Silvia Pandolfi Francesco Bartolucci Nial Friel University of Perugia, IT University of Perugia, IT

More information

MCMC Methods: Gibbs and Metropolis

MCMC Methods: Gibbs and Metropolis MCMC Methods: Gibbs and Metropolis Patrick Breheny February 28 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/30 Introduction As we have seen, the ability to sample from the posterior distribution

More information

Robert Collins CSE586, PSU Intro to Sampling Methods

Robert Collins CSE586, PSU Intro to Sampling Methods Robert Collins Intro to Sampling Methods CSE586 Computer Vision II Penn State Univ Robert Collins A Brief Overview of Sampling Monte Carlo Integration Sampling and Expected Values Inverse Transform Sampling

More information

Advances and Applications in Perfect Sampling

Advances and Applications in Perfect Sampling and Applications in Perfect Sampling Ph.D. Dissertation Defense Ulrike Schneider advisor: Jem Corcoran May 8, 2003 Department of Applied Mathematics University of Colorado Outline Introduction (1) MCMC

More information

Propp-Wilson Algorithm (and sampling the Ising model)

Propp-Wilson Algorithm (and sampling the Ising model) Propp-Wilson Algorithm (and sampling the Ising model) Danny Leshem, Nov 2009 References: Haggstrom, O. (2002) Finite Markov Chains and Algorithmic Applications, ch. 10-11 Propp, J. & Wilson, D. (1996)

More information

18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture)

18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 18 : Advanced topics in MCMC Lecturer: Eric P. Xing Scribes: Jessica Chemali, Seungwhan Moon 1 Gibbs Sampling (Continued from the last lecture)

More information

ORTHOGONAL PARALLEL MCMC METHODS FOR SAMPLING AND OPTIMIZATION

ORTHOGONAL PARALLEL MCMC METHODS FOR SAMPLING AND OPTIMIZATION ORTHOGONAL PARALLEL MCMC METHODS FOR SAMPLING AND OPTIMIZATION L Martino, V Elvira, D Luengo, J Corander, F Louzada Institute of Mathematical Sciences and Computing, Universidade de São Paulo, São Carlos

More information

A Review of Basic Monte Carlo Methods

A Review of Basic Monte Carlo Methods A Review of Basic Monte Carlo Methods Julian Haft May 9, 2014 Introduction One of the most powerful techniques in statistical analysis developed in this past century is undoubtedly that of Monte Carlo

More information

Bayesian Methods with Monte Carlo Markov Chains II

Bayesian Methods with Monte Carlo Markov Chains II Bayesian Methods with Monte Carlo Markov Chains II Henry Horng-Shing Lu Institute of Statistics National Chiao Tung University hslu@stat.nctu.edu.tw http://tigpbp.iis.sinica.edu.tw/courses.htm 1 Part 3

More information

Jim Lambers MAT 460/560 Fall Semester Practice Final Exam

Jim Lambers MAT 460/560 Fall Semester Practice Final Exam Jim Lambers MAT 460/560 Fall Semester 2009-10 Practice Final Exam 1. Let f(x) = sin 2x + cos 2x. (a) Write down the 2nd Taylor polynomial P 2 (x) of f(x) centered around x 0 = 0. (b) Write down the corresponding

More information

Simultaneous drift conditions for Adaptive Markov Chain Monte Carlo algorithms

Simultaneous drift conditions for Adaptive Markov Chain Monte Carlo algorithms Simultaneous drift conditions for Adaptive Markov Chain Monte Carlo algorithms Yan Bai Feb 2009; Revised Nov 2009 Abstract In the paper, we mainly study ergodicity of adaptive MCMC algorithms. Assume that

More information

Markov chain Monte Carlo methods in atmospheric remote sensing

Markov chain Monte Carlo methods in atmospheric remote sensing 1 / 45 Markov chain Monte Carlo methods in atmospheric remote sensing Johanna Tamminen johanna.tamminen@fmi.fi ESA Summer School on Earth System Monitoring and Modeling July 3 Aug 11, 212, Frascati July,

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 4 Problem: Density Estimation We have observed data, y 1,..., y n, drawn independently from some unknown

More information

Reminder of some Markov Chain properties:

Reminder of some Markov Chain properties: Reminder of some Markov Chain properties: 1. a transition from one state to another occurs probabilistically 2. only state that matters is where you currently are (i.e. given present, future is independent

More information

MH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution

MH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution MH I Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution a lot of Bayesian mehods rely on the use of MH algorithm and it s famous

More information

Function Approximation

Function Approximation 1 Function Approximation This is page i Printer: Opaque this 1.1 Introduction In this chapter we discuss approximating functional forms. Both in econometric and in numerical problems, the need for an approximating

More information

MARKOV CHAIN MONTE CARLO

MARKOV CHAIN MONTE CARLO MARKOV CHAIN MONTE CARLO RYAN WANG Abstract. This paper gives a brief introduction to Markov Chain Monte Carlo methods, which offer a general framework for calculating difficult integrals. We start with

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

Lecture XI. Approximating the Invariant Distribution

Lecture XI. Approximating the Invariant Distribution Lecture XI Approximating the Invariant Distribution Gianluca Violante New York University Quantitative Macroeconomics G. Violante, Invariant Distribution p. 1 /24 SS Equilibrium in the Aiyagari model G.

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Random Walks A&T and F&S 3.1.2

Random Walks A&T and F&S 3.1.2 Random Walks A&T 110-123 and F&S 3.1.2 As we explained last time, it is very difficult to sample directly a general probability distribution. - If we sample from another distribution, the overlap will

More information

Examples of Adaptive MCMC

Examples of Adaptive MCMC Examples of Adaptive MCMC by Gareth O. Roberts * and Jeffrey S. Rosenthal ** (September, 2006.) Abstract. We investigate the use of adaptive MCMC algorithms to automatically tune the Markov chain parameters

More information

Definition 5.1. A vector field v on a manifold M is map M T M such that for all x M, v(x) T x M.

Definition 5.1. A vector field v on a manifold M is map M T M such that for all x M, v(x) T x M. 5 Vector fields Last updated: March 12, 2012. 5.1 Definition and general properties We first need to define what a vector field is. Definition 5.1. A vector field v on a manifold M is map M T M such that

More information

Exponential families also behave nicely under conditioning. Specifically, suppose we write η = (η 1, η 2 ) R k R p k so that

Exponential families also behave nicely under conditioning. Specifically, suppose we write η = (η 1, η 2 ) R k R p k so that 1 More examples 1.1 Exponential families under conditioning Exponential families also behave nicely under conditioning. Specifically, suppose we write η = η 1, η 2 R k R p k so that dp η dm 0 = e ηt 1

More information

19 : Slice Sampling and HMC

19 : Slice Sampling and HMC 10-708: Probabilistic Graphical Models 10-708, Spring 2018 19 : Slice Sampling and HMC Lecturer: Kayhan Batmanghelich Scribes: Boxiang Lyu 1 MCMC (Auxiliary Variables Methods) In inference, we are often

More information

Optimizing and Adapting the Metropolis Algorithm

Optimizing and Adapting the Metropolis Algorithm 6 Optimizing and Adapting the Metropolis Algorithm Jeffrey S. Rosenthal University of Toronto, Toronto, ON 6.1 Introduction Many modern scientific questions involve high-dimensional data and complicated

More information

Lect4: Exact Sampling Techniques and MCMC Convergence Analysis

Lect4: Exact Sampling Techniques and MCMC Convergence Analysis Lect4: Exact Sampling Techniques and MCMC Convergence Analysis. Exact sampling. Convergence analysis of MCMC. First-hit time analysis for MCMC--ways to analyze the proposals. Outline of the Module Definitions

More information

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D.

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Ruppert A. EMPIRICAL ESTIMATE OF THE KERNEL MIXTURE Here we

More information

Sampling multimodal densities in high dimensional sampling space

Sampling multimodal densities in high dimensional sampling space Sampling multimodal densities in high dimensional sampling space Gersende FORT LTCI, CNRS & Telecom ParisTech Paris, France Journées MAS Toulouse, Août 4 Introduction Sample from a target distribution

More information

STAT232B Importance and Sequential Importance Sampling

STAT232B Importance and Sequential Importance Sampling STAT232B Importance and Sequential Importance Sampling Gianfranco Doretto Andrea Vedaldi June 7, 2004 1 Monte Carlo Integration Goal: computing the following integral µ = h(x)π(x) dx χ Standard numerical

More information

Gärtner-Ellis Theorem and applications.

Gärtner-Ellis Theorem and applications. Gärtner-Ellis Theorem and applications. Elena Kosygina July 25, 208 In this lecture we turn to the non-i.i.d. case and discuss Gärtner-Ellis theorem. As an application, we study Curie-Weiss model with

More information

Optimization. Charles J. Geyer School of Statistics University of Minnesota. Stat 8054 Lecture Notes

Optimization. Charles J. Geyer School of Statistics University of Minnesota. Stat 8054 Lecture Notes Optimization Charles J. Geyer School of Statistics University of Minnesota Stat 8054 Lecture Notes 1 One-Dimensional Optimization Look at a graph. Grid search. 2 One-Dimensional Zero Finding Zero finding

More information

I forgot to mention last time: in the Ito formula for two standard processes, putting

I forgot to mention last time: in the Ito formula for two standard processes, putting I forgot to mention last time: in the Ito formula for two standard processes, putting dx t = a t dt + b t db t dy t = α t dt + β t db t, and taking f(x, y = xy, one has f x = y, f y = x, and f xx = f yy

More information

Lecture 6: Markov Chain Monte Carlo

Lecture 6: Markov Chain Monte Carlo Lecture 6: Markov Chain Monte Carlo D. Jason Koskinen koskinen@nbi.ku.dk Photo by Howard Jackman University of Copenhagen Advanced Methods in Applied Statistics Feb - Apr 2016 Niels Bohr Institute 2 Outline

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Karl Oskar Ekvall Galin L. Jones University of Minnesota March 12, 2019 Abstract Practically relevant statistical models often give rise to probability distributions that are analytically

More information

NUMERICAL METHODS. x n+1 = 2x n x 2 n. In particular: which of them gives faster convergence, and why? [Work to four decimal places.

NUMERICAL METHODS. x n+1 = 2x n x 2 n. In particular: which of them gives faster convergence, and why? [Work to four decimal places. NUMERICAL METHODS 1. Rearranging the equation x 3 =.5 gives the iterative formula x n+1 = g(x n ), where g(x) = (2x 2 ) 1. (a) Starting with x = 1, compute the x n up to n = 6, and describe what is happening.

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

Gradient-based Monte Carlo sampling methods

Gradient-based Monte Carlo sampling methods Gradient-based Monte Carlo sampling methods Johannes von Lindheim 31. May 016 Abstract Notes for a 90-minute presentation on gradient-based Monte Carlo sampling methods for the Uncertainty Quantification

More information

Numerical Analysis for Statisticians

Numerical Analysis for Statisticians Kenneth Lange Numerical Analysis for Statisticians Springer Contents Preface v 1 Recurrence Relations 1 1.1 Introduction 1 1.2 Binomial CoefRcients 1 1.3 Number of Partitions of a Set 2 1.4 Horner's Method

More information

Effective Sample Size for Importance Sampling based on discrepancy measures

Effective Sample Size for Importance Sampling based on discrepancy measures Effective Sample Size for Importance Sampling based on discrepancy measures L. Martino, V. Elvira, F. Louzada Universidade de São Paulo, São Carlos (Brazil). Universidad Carlos III de Madrid, Leganés (Spain).

More information