The Wang-Landau algorithm in general state spaces: Applications and convergence analysis


Yves F. Atchadé (Department of Statistics, University of Michigan, yvesa@umich.edu) and Jun S. Liu (Department of Statistics, Harvard University, jliu@stat.harvard.edu)

(First version Nov. 2004; revised Feb. 2007 and Aug. 2008)

Abstract: The Wang-Landau algorithm ([21]) is a recent Monte Carlo method that has generated much interest in the physics literature due to some spectacular simulation performances. The objective of this paper is two-fold. First, we show that the algorithm can be naturally extended to more general state spaces and used to improve on Markov chain Monte Carlo schemes of more interest in statistics. Second, we study the asymptotic behavior of the algorithm. We show that, with an appropriate choice of the step-size, the algorithm is consistent and a strong law of large numbers holds under fairly mild conditions. We also demonstrate by simulation the potential advantage of the WL algorithm for problems in Bayesian inference.

AMS 2000 subject classifications: Primary 60C05, 60J27, 60J35, 65C40.

Keywords and phrases: Monte Carlo methods, Wang-Landau algorithm, multicanonical sampling, trans-dimensional MCMC, adaptive MCMC, geometric ergodicity, stochastic approximation.

1. Introduction

Although the idea of Monte Carlo computation has been around for more than a century, its first real scientific use occurred during World War II, when the first generation of computers became available. Nick Metropolis coined the name "Monte Carlo" for the method while at Los Alamos National Laboratory, and it quickly evolved into an active research area owing to the involvement of leading physicists at the Laboratory. Ever since, physicists have been at the forefront of methodological research in the field. One of their latest additions is an algorithm proposed by F. Wang and D. P. Landau ([21]). The Wang-Landau (WL) algorithm has been successfully applied to some complex sampling problems in physics. The algorithm is closely related to multicanonical sampling, a method due to B. A. Berg and T. Neuhaus ([9]). Briefly,

if $\pi$ is the probability measure of interest, the idea behind multicanonical sampling is to obtain an importance sampling distribution by partitioning the state space along the energy function ($-\log \pi(x)$) and re-weighting each component of the partition appropriately, so that the modified distribution $\pi_\star$ spends an equal amount of time in each component, i.e., is uniform in the energy space. The method is often criticized for the difficulty involved in computing the weights. The main contribution of the WL algorithm is an efficient scheme that simultaneously computes the balancing weights and samples from the re-weighted distribution.

The objective of this paper is to take a more probabilistic look at the WL algorithm and to explore its potential for Monte Carlo simulation problems of more direct interest to statisticians. We achieve this goal by proposing a general state space version of the algorithm. We then show that the WL algorithm offers an effective strategy to improve on simulated tempering and trans-dimensional MCMC.

From a probabilistic standpoint, the WL algorithm is an interesting example of adaptive Markov chain Monte Carlo (MCMC). Adaptive MCMC is an approach to Monte Carlo simulation where the transition kernel of the algorithm is sequentially adjusted over time in order to achieve some prescribed optimality. Some early work on the subject includes [13], [4], [15]; see also [10], [19]. Early theoretical analysis includes [6], [2], [1], [20]. We take a similar path-wise approach to analyze the WL algorithm. The analysis is not a straightforward application of the theory in the aforementioned papers, because of the specific adaptive control involved. The key point is the stability of the algorithm. We say that the WL algorithm is stable if no component of the partition receives infinitely more visits than any other component as $n \to \infty$. On a stable sample path and under appropriate conditions, we show that the WL algorithm learns the optimal weights and satisfies a strong law of large numbers (Theorem 4.1). In the specific cases of multicanonical sampling and simulated tempering, which include the original WL algorithm, we show that the algorithm is stable and that the aforementioned limit results hold.

It came to our attention after the first draft of this paper that a similar extension of the WL algorithm has been proposed independently by F. Liang and coworkers (see e.g. [16]). Their approach differs from ours in that these authors took a more classical approach based on stochastic approximation with step-sizes set deterministically.

Our proposed generalization of the WL algorithm is presented in Section 2; some particular

cases are discussed in Section 3. The theoretical analysis is presented in Section 4, with the proofs postponed to Section 6 to facilitate the flow of ideas.

2. The Wang-Landau algorithm

In multicanonical sampling, we are given a state space $\mathcal{X}$ and a probability measure $\pi$. $\mathcal{X}$ is then partitioned as $\mathcal{X} = \cup_i \mathcal{X}_i$, where $\mathcal{X}_i \cap \mathcal{X}_j = \emptyset$ for $i \neq j$, and $\pi$ is re-weighted on each component $\mathcal{X}_i$. An abstract way to do the same and much more is the following. We start with $(\mathcal{X}_i, \mathcal{B}_i, \lambda_i)$, $i = 1, \ldots, d$, a finite family of measure spaces, where $\lambda_i$ is a $\sigma$-finite measure. We introduce the union space $\mathcal{X} = \cup_{i=1}^d \mathcal{X}_i \times \{i\}$. We equip $\mathcal{X}$ with the $\sigma$-algebra $\mathcal{B}$ generated by $\{(A_i, i),\ i \in \{1, \ldots, d\},\ A_i \in \mathcal{B}_i\}$ and the measure $\lambda$ satisfying $\lambda(A, i) = \lambda_i(A) \mathbf{1}_{\mathcal{B}_i}(A)$. Let $h_i : \mathcal{X}_i \to \mathbb{R}$ be a non-negative measurable function and define $\theta_\star(i) = \int_{\mathcal{X}_i} h_i(x)\, \lambda_i(dx) / Z$, where $Z = \sum_{i=1}^d \int_{\mathcal{X}_i} h_i(x)\, \lambda_i(dx)$. We assume that $\theta_\star(i) > 0$ for all $i = 1, \ldots, d$ and consider the following probability measure on $(\mathcal{X}, \mathcal{B})$:

$$\pi_\star(dx, i) \propto \frac{h_i(x)}{\theta_\star(i)}\, \mathbf{1}_{\mathcal{X}_i}(x)\, \lambda_i(dx). \qquad (1)$$

Our objective is to sample from $\pi_\star$. The problem of sampling from such a distribution arises in a number of different Monte Carlo strategies. For example, and as explained above, if $\pi$ is a probability measure of interest on some space $(\mathcal{X}, \mathcal{B}, \lambda)$, we can partition $\mathcal{X}$ along the energy function $-\log(\pi)$ and re-weight $\pi$ by $\pi(\mathcal{X}_i)$ on each component $\mathcal{X}_i$. The sampling problem then takes the form (1). This powerful strategy appeared first in the physics literature as multicanonical sampling ([9]). It is discussed in more detail in Section 3.1.

Sampling from (1) also arises naturally when optimizing the simulated tempering algorithm ([18], [12]). In simulated tempering, the state space $\mathcal{X}$ is not partitioned; instead, some auxiliary distributions $\pi_2, \ldots, \pi_d$ are introduced (take $\pi_1 = \pi$). These distributions are chosen close to $\pi$ but easier to sample from. For good performance, one typically imposes that all the distributions have the same weight. Taking each probability space $(\mathcal{X}, \mathcal{B}, \pi_i)$ as a component in the formalism above leads to a sampling problem of the form (1). Multicanonical sampling and simulated tempering have been combined in [5], giving an algorithm which can also be framed as (1). Sampling from (1) can also be an efficient strategy to improve on trans-dimensional MCMC samplers for Bayesian inference with model uncertainty. This is detailed in Section 3.3.
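To fix ideas, the following minimal Python sketch evaluates the unnormalized log-density of a measure of the form (1) for a generic weight vector $\theta$ (with $\theta = \theta_\star$ this is $\pi_\star$; with a general $\theta$ it is the family $\pi_\theta$ introduced below). The interface, a list `log_h` of log-density functions, is our own assumption and is not part of the paper.

```python
import numpy as np

def log_pi_theta(x, i, log_h, theta):
    """Unnormalized log-density at the union-space point (x, i):
    log h_i(x) - log theta(i).  `log_h[i]` returns log h_i(x) on X_i and
    `theta` is a vector of positive weights (hypothetical interface)."""
    return log_h[i](x) - np.log(theta[i])
```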

The main obstacle to sampling from $\pi_\star$ is that the normalizing constants $\theta_\star$ are not known. The contribution of the Wang-Landau algorithm ([21]) is an efficient scheme that simultaneously estimates $\theta_\star$ and samples from $\pi_\star$. The algorithm was introduced in a discrete setting with $\pi_\star$ uniform in $i$. In this work we extend the algorithm to general state spaces and to arbitrary probability measures. To carry on the discussion in our general framework, we introduce the family of probability measures $\{\pi_\theta,\ \theta \in (0, \infty)^d\}$ on $(\mathcal{X}, \mathcal{B}, \lambda)$ defined by:

$$\pi_\theta(dx, i) \propto \frac{h_i(x)}{\theta(i)}\, \mathbf{1}_{\mathcal{X}_i}(x)\, \lambda_i(dx). \qquad (2)$$

We assume that for all $\theta \in (0, \infty)^d$, we have at our disposal a transition kernel $P_\theta$ on $(\mathcal{X}, \mathcal{B})$ with invariant distribution $\pi_\theta$. Note that $\pi_\theta$ and $P_\theta$ remain unchanged if we multiply the vector $\theta$ by a positive constant. How to build such a Markov kernel $P_\theta$ typically depends on the particular instance of the algorithm; we give some examples later.

The structure of the WL algorithm is as follows. We start with some initial value $(X_0, I_0) \in \mathcal{X}$, $\phi_0 \in (0, \infty)^d$, and set $\theta_0(i) = \phi_0(i) / \sum_{j=1}^d \phi_0(j)$, $i = 1, \ldots, d$. Here $\theta_0$ serves as an initial guess of $\theta_\star$. At iteration $n+1$, we generate $(X_{n+1}, I_{n+1})$ by sampling from $P_{\theta_n}(X_n, I_n; \cdot)$ and update $\phi_n$ to $\phi_{n+1}$, which is used to form $\theta_{n+1}(i) = \phi_{n+1}(i) / \sum_j \phi_{n+1}(j)$. The updating rule for $\phi_n$ is fairly simple. For $i \in \{1, \ldots, d\}$, if $X_{n+1} \in \mathcal{X}_i$ (equivalently, if $I_{n+1} = i$), then $\phi_{n+1}(i) = \phi_n(i)(1 + \rho)$ for some $\rho > 0$; otherwise $\phi_{n+1}(i) = \phi_n(i)$. This leads to a first version of the WL algorithm.

Algorithm 2.1 (The Wang-Landau algorithm I). Let $\{\rho_n\}$ be a decreasing sequence of positive numbers. Let $(X_0, I_0) \in \mathcal{X}$ be given. Let $\phi_0 \in \mathbb{R}^d$ be such that $\phi_0(i) > 0$ and set $\theta_0(i) = \phi_0(i) / \sum_j \phi_0(j)$, $i = 1, \ldots, d$. At time $n \geq 0$, given $(X_n, I_n) \in \mathcal{X}$, $\phi_n \in \mathbb{R}^d$, $\theta_n \in \mathbb{R}^d$:

i) Sample $(X_{n+1}, I_{n+1}) \sim P_{\theta_n}(X_n, I_n; \cdot)$.

ii) For $i = 1, \ldots, d$, set $\phi_{n+1}(i) = \phi_n(i) \left(1 + \rho_n \mathbf{1}_{\{I_{n+1} = i\}}\right)$ and $\theta_{n+1}(i) = \phi_{n+1}(i) / \sum_j \phi_{n+1}(j)$.

It remains to choose the sequence $\{\rho_n\}$. As we show below, $\{\theta_n\}$ as defined by Algorithm 2.1 is a stochastic approximation process driven by $\{(X_n, I_n)\}$. The general guidelines in the literature for choosing $\{\rho_n\}$ are: $\rho_n > 0$, $\sum \rho_n = \infty$ and $\sum \rho_n^{1+\varepsilon} < \infty$ for some $\varepsilon > 0$, often $\varepsilon = 1$. The typical choice is $\rho_n \propto n^{-1}$. In practice, more careful choices are often necessary for good performance. To the best of our knowledge, there is no general, satisfactory way of choosing the step-size in stochastic approximation. Interestingly, Wang and Landau came up with a clever, adaptive way of choosing $\{\rho_n\}$ which works very well in practice. We describe their approach next, again in more probabilistic terms.
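For concreteness, here is a minimal Python sketch of Algorithm 2.1. The routine `sample_P`, which draws $(X_{n+1}, I_{n+1})$ from the problem-specific kernel $P_{\theta_n}$, is an assumed ingredient and not specified by the algorithm itself; the recursion is carried out on the log scale for numerical stability (see Remark 2.1 below).

```python
import numpy as np

def wang_landau_I(sample_P, d, x0, i0, rho, n_iter):
    """Minimal sketch of Algorithm 2.1.  `sample_P(x, i, theta)` draws
    (X_{n+1}, I_{n+1}) from P_theta(x, i; .) and is user-supplied;
    `rho` is the deterministic step-size sequence, e.g. rho[n] = 1/(n+1)."""
    log_phi = np.zeros(d)          # phi_0 = (1, ..., 1), stored on the log scale
    x, i = x0, i0
    for n in range(n_iter):
        theta = np.exp(log_phi - np.logaddexp.reduce(log_phi))  # theta_n
        x, i = sample_P(x, i, theta)                            # step i)
        log_phi[i] += np.log1p(rho[n])                          # step ii)
    return np.exp(log_phi - np.logaddexp.reduce(log_phi)), (x, i)
```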

Let $v_{n,k}(i)$ denote the proportion of visits to $\mathcal{X}_i \times \{i\}$ between times $n+1$ and $k$. That is, $v_{n,k}(i) = 0$ for $k \leq n$ and, for $k \geq n+1$, $v_{n,k}(i) = \frac{1}{k-n} \sum_{j=n+1}^{k} \mathbf{1}\{I_j = i\}$. Let $c \in (0, 1)$ be a parameter to be specified by the user. We introduce two additional random sequences $\{\kappa_n\}$ and $\{a_n\}$. Initially, $\kappa_0 = 0$. For $n \geq 1$, define

$$\kappa_n = \inf\left\{k > \kappa_{n-1} :\ \max_{1 \leq i \leq d} \left| v_{\kappa_{n-1},k}(i) - \frac{1}{d} \right| \leq \frac{c}{d} \right\}, \qquad (3)$$

with the usual convention that $\inf \emptyset = \infty$. We need another sequence $\{\gamma_n\}$ of positive decreasing numbers, representing step-sizes. Then $\{a_n\}$ represents the index of the element of the sequence $\{\gamma_n\}$ used at time $n$: $a_0 = 0$; if $k = \kappa_j$ for some $j \geq 1$, then $a_k = a_{k-1} + 1$; otherwise $a_k = a_{k-1}$. In other words, we start Algorithm 2.1 with a step-size equal to $\gamma_0$ and keep using it until time $\kappa_1$, when all the components have been visited equally well. Only then do we change the step-size to $\gamma_1$ and keep it constant until time $\kappa_2$, and so on. Combining this with Algorithm 2.1, we get the following.

Algorithm 2.2 (The Wang-Landau algorithm II). Let $\{\gamma_n\}$ be a decreasing sequence of positive numbers. Let $(X_0, I_0) \in \mathcal{X}$ be given. Set $a_0 = 0$, $\kappa = 0$, $c \in (0, 1)$, $\phi_0 \in \mathbb{R}^d$ such that $\phi_0(i) > 0$, and $\theta_0(i) = \phi_0(i) / \sum_j \phi_0(j)$, $i = 1, \ldots, d$. At time $n \geq 0$, given $(X_n, I_n) \in \mathcal{X}$, $\phi_n \in \mathbb{R}^d$, $\theta_n \in \mathbb{R}^d$, $a_n$ and $\kappa$:

i) Sample $(X_{n+1}, I_{n+1}) \sim P_{\theta_n}(X_n, I_n; \cdot)$.

ii) For $i = 1, \ldots, d$, set $\phi_{n+1}(i) = \phi_n(i) \left(1 + \gamma_{a_n} \mathbf{1}_{\{I_{n+1} = i\}}\right)$ and $\theta_{n+1}(i) = \phi_{n+1}(i) / \sum_j \phi_{n+1}(j)$.

iii) If $\max_i \left| v_{\kappa, n+1}(i) - \frac{1}{d} \right| \leq c/d$, then set $\kappa = n+1$ and $a_{n+1} = a_n + 1$; otherwise $a_{n+1} = a_n$.
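The following sketch extends the previous one with the adaptive step-size schedule of Algorithm 2.2; `sample_P` is again an assumed, problem-specific ingredient, and `gamma` is the step-size sequence supplied as a function of the index $a_n$.

```python
import numpy as np

def wang_landau_II(sample_P, d, x0, i0, gamma, c, n_iter):
    """Minimal sketch of Algorithm 2.2.  `gamma(a)` returns the step-size
    gamma_a (e.g. lambda a: 1.0 / (a + 1)); c in (0, 1) is the flatness
    threshold; `sample_P(x, i, theta)` draws from P_theta (assumed given)."""
    log_phi = np.zeros(d)
    visits = np.zeros(d)      # visit counts since the last reset time kappa
    a = 0                     # index a_n into the step-size sequence
    x, i = x0, i0
    for n in range(n_iter):
        theta = np.exp(log_phi - np.logaddexp.reduce(log_phi))
        x, i = sample_P(x, i, theta)                  # step i)
        log_phi[i] += np.log1p(gamma(a))              # step ii), log scale
        visits[i] += 1
        v = visits / visits.sum()                     # occupation frequencies
        if np.max(np.abs(v - 1.0 / d)) <= c / d:      # step iii): flat histogram
            a += 1                                    # move to the next step-size
            visits[:] = 0                             # reset the visit counts
    return np.exp(log_phi - np.logaddexp.reduce(log_phi)), (x, i)
```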

Remark 2.1.

1. The performance of Algorithm 2.2 depends very much on the choice of $\{\gamma_n\}$ and $c$. Theoretically, we show that the choice $\gamma_n \propto n^{-1}$ guarantees the convergence of the algorithm, but in practice this type of step-size can be overly slow. The user might then consider instead $\gamma_n \propto a^{-n}$ ($a > 1$), as originally proposed by Wang and Landau; however, we were not able to obtain the convergence of the algorithm for summable step-sizes. A good compromise is to start the sampler with $\gamma_n \propto a^{-n}$ until $\gamma_n < \varepsilon$ (e.g. $\varepsilon = 10^{-5}$) and then switch to $\gamma_n = n^{-1}$. There is also a bias-variance trade-off involved in the choice of $c$. For $c$ close to 0, Algorithm 2.2 will have a low bias in estimating $\theta_\star(\cdot)$ but a high variance; such values of $c$ are more suitable for $\gamma_n \propto a^{-n}$. For larger values of $c$, the bias will be higher and the variance lower. For $c = d - 1$, the criterion in step iii) is always satisfied and we get a standard stochastic approximation algorithm with deterministic step-size, for which a step-size that sums to $\infty$ is necessary for convergence. We found empirically from our simulations that moderate values of $c$ (the examples below use $c = 0.3$ and $c = 0.4$) yield reasonably good samplers.

2. In an actual implementation of the algorithm, it is not necessary to re-normalize $\phi_n$ into $\theta_n$ as in step ii). In fact, for numerical stability, we recommend carrying out the recursion on a logarithmic scale: $\log \phi_n(i) = \log \phi_{n-1}(i) + \log\left(1 + \gamma_{a_{n-1}}\right) \mathbf{1}_{\{I_n = i\}}$.

3. Another interesting feature of Algorithm 2.2 is that step iii) can serve as a stopping rule: we stop the simulation when $\gamma_{a_n}$ gets smaller than some pre-specified value.

4. Under some regularity conditions, if $f : \mathcal{X} \to \mathbb{R}$ is a $\pi_\star$-integrable function of interest and $\{(X_n, I_n, \theta_n)\}$ is as described in Algorithm 2.2, we will show below that

$$\frac{1}{n} \sum_{k=1}^n f(X_k, I_k) \to \pi_\star(f), \quad \text{a.s. as } n \to \infty.$$

If we denote by $\pi_i$ the distribution on $\mathcal{X}_i$ with density with respect to $\lambda_i$ proportional to $h_i(x) \mathbf{1}_{\mathcal{X}_i}(x)$, we can estimate integrals with respect to $\pi_i$ as well:

$$\frac{\sum_{k=1}^n f(X_k, I_k) \mathbf{1}_{\mathcal{X}_i}(X_k)}{\sum_{k=1}^n \mathbf{1}_{\mathcal{X}_i}(X_k)} \to \pi_i\left(f(\cdot, i)\right), \quad \text{a.s. as } n \to \infty.$$

Now, if we denote by $\pi$ the distribution on $\mathcal{X}$ whose density with respect to $\lambda$ is proportional to $h_i(x)$ on $\mathcal{X}_i$, the ratio $\pi(x, i)/\pi_\star(x, i)$ is $d\,\theta_\star(i)$, and integrals with respect to $\pi$ can also be computed by importance sampling:

$$\frac{d}{n} \sum_{k=1}^n f(X_k, I_k)\, \theta_k(I_k) \to \pi(f), \quad \text{a.s. as } n \to \infty.$$

Various methods for recycling Monte Carlo samples can be implemented as well. All these results follow from the strong law of large numbers of Theorem 4.1.
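As an illustration of the estimators in item 4, the following hypothetical post-processing routine computes the three estimates from the stored output of a WL run; it assumes every component was visited at least once.

```python
import numpy as np

def wl_estimates(f_vals, I, theta, d):
    """Sketch of the estimators of Remark 2.1(4).
    f_vals : array of f(X_k, I_k), k = 1..n
    I      : integer array of component labels I_k
    theta  : (n, d) array whose k-th row is the weight vector theta_k"""
    n = len(f_vals)
    pi_star_f = f_vals.mean()                              # estimate of pi_star(f)
    # within-component estimates pi_i(f(., i)); assumes each i was visited
    pi_i_f = [f_vals[I == i].mean() for i in range(d)]
    # importance-sampling estimate of pi(f): (d/n) * sum f(X_k,I_k) theta_k(I_k)
    pi_f = d * np.mean(f_vals * theta[np.arange(n), I])
    return pi_star_f, pi_i_f, pi_f
```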

3. Some Applications

In this section, we briefly detail some applications of the general algorithm to multicanonical sampling, simulated tempering and trans-dimensional MCMC.

3.1. Multicanonical sampling

Multicanonical sampling is a powerful algorithm proposed by [9]. It holds the potential of improving on the mixing times of classical MCMC algorithms and fits naturally in the framework above, but its implementation can be tedious. Assume that we want to sample from a probability measure $\pi(dx) \propto h(x)\, \lambda(dx)$ on some probability space $(\Sigma, \mathcal{A}, \lambda)$. We use the energy function $E(x) = -\log(h(x))$ to build a $d$-component partition $(\mathcal{X}_i)_i$ of $\Sigma$:

$$\mathcal{X}_i = \{x \in \Sigma :\ E_{i-1} < E(x) \leq E_i\},$$

where $E_0 < E_1 < \ldots < E_d$ are predefined values. Denote $\theta_\star(i) = \pi(\mathcal{X}_i)$ and assume $\theta_\star(i) > 0$. As above, we introduce the union space $\mathcal{X} = \cup\, \mathcal{X}_i \times \{i\}$. The idea of multicanonical sampling is to sample from $\pi_\star$ given by

$$\pi_\star(dx, i) \propto \frac{h(x)}{\theta_\star(i)}\, \mathbf{1}_{\mathcal{X}_i}(x)\, \lambda(dx),$$

which is of the form (1). There is a simpler formulation of the algorithm. Since the component of the partition to which a point $x$ belongs can be obtained from $x$ itself, multicanonical sampling is equivalent to sampling from $\pi_\star$ on $(\Sigma, \mathcal{A})$ given by

$$\pi_\star(dx) \propto \sum_{i=1}^d \frac{h(x)}{\theta_\star(i)}\, \mathbf{1}_{\mathcal{X}_i}(x)\, \lambda(dx), \qquad (4)$$

and the union space formalism is not needed. After sampling from $\pi_\star$ in (4), a straightforward importance sampling estimate allows us to recover $\pi$. The algorithm tries to break the barriers in the energy landscape of the distribution by re-weighting each component $\mathcal{X}_i$. Clearly, its success depends heavily on a good choice of the energy rings $E_0, \ldots, E_d$; this typically requires some prior information on $\pi$ or some pilot simulations. We point out that although the energy function $E$ is a natural candidate for partitioning the space, the idea can be extended to other functions. In the description of multicanonical sampling given above, taking $\Sigma$ discrete, $\pi$ the uniform distribution on $\Sigma$, and $\mathcal{X}_e = \{x \in \Sigma :\ E(x) = e\}$ for $e \in \{e \in \mathbb{R} :\ E(x) = e \text{ for some } x \in \Sigma\}$ yields the Wang-Landau algorithm of [21].
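A small sketch of the bookkeeping involved: given the energy rings $E_0 < \ldots < E_d$, the label of the component containing a point $x$ is read off from its energy $E(x) = -\log h(x)$. The function names below are our own.

```python
import numpy as np

def make_partition(log_h, energy_levels):
    """Return a labelling function x -> i (0-based) for the multicanonical
    partition X_i = {x : E_{i-1} < E(x) <= E_i}, with E(x) = -log h(x).
    `energy_levels` holds the user-supplied rings E_0 < E_1 < ... < E_d."""
    E = np.asarray(energy_levels)

    def label(x):
        e = -log_h(x)
        # index of the first ring boundary >= e gives the component;
        # clip guards against energies falling outside [E_0, E_d]
        return int(np.clip(np.searchsorted(E, e, side='left') - 1,
                           0, len(E) - 2))

    return label
```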

3.2. Simulated Tempering

The method can be applied to the simulated tempering algorithm of [18] and [12] by taking $(\mathcal{X}_i, \mathcal{B}_i, \lambda_i) \equiv (\mathcal{X}_1, \mathcal{B}_1, \lambda_1)$ and $h_i = h^{1/t_i}$, $1 = t_1 < t_2 < \ldots < t_d$. Simulated tempering is a well-known Monte Carlo strategy for sampling from difficult target distributions. Assume that the distribution of interest is $\pi_1(dx) \propto h(x)\, \lambda(dx)$. Typically, for a large temperature $t$, $h^{1/t}$ is a better-behaved distribution for which faster mixing Markov chains can be built. In simulated tempering, we try to take advantage of these faster mixing chains by targeting the distribution

$$\pi_\theta(dx, i) \propto \frac{h^{1/t_i}(x)}{\theta(i)}\, \mathbf{1}_{\mathcal{X}_i}(x)\, \lambda_1(dx), \qquad (5)$$

on the union space $(\mathcal{X}, \mathcal{B}, \lambda)$. An MCMC sampler $P_\theta$ with invariant distribution $\pi_\theta$ is readily designed. Typically, $P_\theta$ takes the form

$$P_\theta\left((x, i);\ A \times \{j\}\right) = B_{x,\theta}(i, j)\, P^{[j]}(x, A), \qquad (6)$$

where $B_{x,\theta}$ is a transition kernel on $\{1, \ldots, d\}$ with invariant distribution $\left(h_j(x)/\theta(j)\right) \left(\sum_l h_l(x)/\theta(l)\right)^{-1}$, and $P^{[j]}$ is a transition kernel on $(\mathcal{X}_1, \mathcal{B}_1)$ with invariant distribution proportional to $h_j(x)\, \lambda(dx)$. Typically, one takes $B_{x,\theta}(i, j) = \left(h_j(x)/\theta(j)\right) \left(\sum_l h_l(x)/\theta(l)\right)^{-1}$. Another common choice is to take $B_{x,\theta}$ as a Metropolis kernel on $\{1, \ldots, d\}$ with proposal $q(i, j)$. By standard importance sampling techniques, we can convert samples from the higher temperature distributions into estimates for $\pi_1$ as well.

The method holds for any $\theta \in \mathbb{R}^d$ with $\theta(i) > 0$, but the choice of $\theta$ can significantly impact the efficiency. Heuristically, to improve mixing we need a $\theta$ that lets the sampler visit the fast-converging (but close to $\pi_1$) distributions; yet since $\pi_1$ is the distribution of interest, statistical efficiency calls for a $\theta$ that favors $\pi_1$. One easy way to resolve this trade-off is to choose $\theta$ such that all the distributions are equally visited. For this we need to choose $\theta(i) = \theta_\star(i) \propto \int h_i(x)\, \lambda_1(dx)$ and sample from

$$\pi_\star(dx, i) \propto \frac{h^{1/t_i}(x)}{\theta_\star(i)}\, \mathbf{1}_{\mathcal{X}_i}(x)\, \lambda_1(dx),$$

which can be done with Algorithm 2.2.
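A minimal sketch of a kernel of the form (6): a Gibbs update of the temperature index drawn from $B_{x,\theta}$, followed by a within-temperature move. The fixed-temperature routine `sample_P_fixed` is an assumed ingredient (e.g. a random-walk Metropolis step targeting $h^{1/t_j}$), and the interface is our own.

```python
import numpy as np

def tempering_kernel(x, i, theta, log_h, temps, sample_P_fixed, rng):
    """One step of the kernel (6).
    log_h(x)       : log h(x) for the target pi_1(dx) ∝ h(x) lambda(dx)
    temps          : temperature ladder 1 = t_1 < ... < t_d
    sample_P_fixed : (x, j) -> x', a fixed-temperature MCMC move with
                     invariant distribution ∝ h(x)^{1/t_j} (assumed given)"""
    # B_{x,theta}(i, j) ∝ h(x)^{1/t_j} / theta(j): sample the new level j
    log_w = log_h(x) / np.asarray(temps) - np.log(theta)
    w = np.exp(log_w - np.logaddexp.reduce(log_w))
    j = rng.choice(len(temps), p=w)
    # then move x at temperature t_j
    return sample_P_fixed(x, j), j
```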

Example 1. We compare a plain simulated tempering algorithm with weights $\theta(i) \equiv 1$ and the Wang-Landau simulated tempering described above for sampling from a multimodal bivariate Gaussian mixture distribution. The target distribution, taken from [17], is

$$\pi(x) = \frac{1}{2\pi\sigma^2} \sum_{i=1}^{20} \omega_i \exp\left\{ -\frac{1}{2\sigma^2}\, (x - \mu_i)'(x - \mu_i) \right\}, \qquad (7)$$

where $\sigma = 0.1$ and the $\omega_i$ are the mixture weights; the $\mu_i$'s are listed in Table 1. The distribution is highly multimodal, and it is clear that a plain Random Walk Metropolis algorithm for this distribution does not mix in a reasonable time. Simulated tempering can be particularly efficient in such situations. We compare two strategies: a plain simulated tempering where $\theta(i) \equiv 1$ in (5), and the WL adaptation of simulated tempering as described above. We use the temperature scale $1 < 7.7 < 31.6 < 100$. Table 2 presents the mean squared errors (MSEs) of the two methods in estimating the first two moments of the two components of $\pi$. The WL version is about three to four times more efficient than the plain version in terms of MSE. The estimates are based on 30 independent replications of the samplers, each run for 100,000 iterations. In applying Algorithm 2.2, we use $\gamma_n = 1/n$ and $c = 0.3$.

Table 1: The 20 means $(\mu_{i1}, \mu_{i2})$, $i = 1, \ldots, 20$, of the two-dimensional Gaussian mixture. (Numeric entries not recovered.)

Table 2: Mean squared errors of the plain and the WL simulated tempering algorithms in estimating $E(X_1)$, $E(X_2)$, $E(X_1^2)$ and $E(X_2^2)$, together with their ratios. Based on 30 independent replications with 100,000 iterations of each sampler. (Numeric entries not recovered.)

3.3. Application to trans-dimensional MCMC

It is often the case in statistics that many alternative models are considered for the same data. One is then interested in issues such as model comparison, model selection and model averaging. Let $f(\text{data} \mid k, x_k)$ be the likelihood of model $k$ with parameter $x_k$. Assume that we have a finite number $d$ of models and that $x_k \in (\mathcal{X}_k, \mathcal{B}_k, \lambda_k)$. Let $\mathcal{X} = \cup_{i=1}^d \mathcal{X}_i \times \{i\}$ be the union space equipped, as above, with the $\sigma$-algebra $\mathcal{B}$ and the $\sigma$-finite measure $\lambda$. $(\mathcal{X}, \mathcal{B}, \lambda)$ is the natural space to consider when dealing with both model uncertainty and parameter estimation. In the Bayesian framework, a prior density (with respect to $\lambda$) $p(x_k, k)$ on $(\mathcal{X}, \mathcal{B})$ is specified for $(x_k, k)$. The posterior distribution of $(x_k, k)$ is therefore $\pi(x_k, k) \propto h_k(x_k) = f(\text{data} \mid k, x_k)\, p(x_k, k)$. In this framework, one is often interested in the Bayes factor of model $i$ to model $j$, defined as $B_{ij} := \left(\theta_\star(i)\, p(j)\right) / \left(\theta_\star(j)\, p(i)\right)$, where $\theta_\star(i) \propto \int_{\mathcal{X}_i} \pi(x_i, i)\, \lambda_i(dx_i)$ and $p(i) = \int p(x_i, i)\, \lambda_i(dx_i)$.

Trans-dimensional MCMC is a set of specialized MCMC algorithms for sampling from distributions like $\pi$ defined on spaces of variable dimension. The reversible-jump algorithm of Green ([14]) is the most popular such sampler. In the spirit of the WL algorithm, an alternative to sampling directly from $\pi$ is to sample from the distribution

$$\pi_\star(dx_i, i) \propto \frac{h_i(x_i)}{\theta_\star(i)}\, \mathbf{1}_{\mathcal{X}_i}(x_i)\, \lambda_i(dx_i). \qquad (8)$$

By such re-weighting, we give the same posterior weight to all the models. The WL algorithm then offers an effective strategy to sample from $\pi_\star$, and we recover $\pi$ by importance sampling. This strategy can improve on the mixing of the sampler.

Example 2. We set $\mathcal{X}_i = \mathbb{R}^i$ for $i = 1, \ldots, 20$ and consider the following rather trivial trans-dimensional target distribution: $\pi(x_i, i) \propto a_i^{-1} e^{-\frac{1}{2}\|x_i\|^2}$, where we let $a_i = 1$ for $i \neq 4$ and $a_4 = (2\pi)^{16/2}$. In this distribution, $x_i \in \mathbb{R}^i$, the $i$-dimensional Euclidean space, and $\pi(x_i, i)$ restricted to $\mathbb{R}^i$ is proportional to the standard normal density. We are interested in the marginal distribution $p(i)$ of $i$. This distribution, as shown in Figure 1, is bimodal with modes at 5 and 20.

We pretend that this distribution is intractable and sample from it using a birth-and-death reversible-jump MCMC sampler. For the fixed-dimensional move, we use a Random Walk Metropolis kernel with a Gaussian proposal with covariance matrix $\sigma_p^2 I_i$, $\sigma_p = 0.1$. We implement a birth-and-death move for the trans-dimensional jump. Given $(x, i)$, we randomly select $j \in \{i-1, i+1\}$ with respective probabilities $\omega_{i,i-1}, \omega_{i,i+1}$. We choose $\omega_{i,i+1} = 1/2$, with the usual correction at the boundaries. If $j = i + 1$, we propose $y = (x_i, u)$, where $u \sim N(0, \sigma^2)$ with $\sigma = 0.1$. We accept $(y, j)$ with probability $\min(1, A)$, where

$$A = \frac{\pi(y, j)}{\pi(x, i)}\, \frac{\omega_{ji}}{\omega_{ij}}\, \sqrt{2\pi}\sigma\, e^{\frac{1}{2\sigma^2} u^2}.$$

Similarly, if $j = i - 1$, we write $x = (y, u)$ with $u \in \mathbb{R}$ and propose $(y, j)$. This value is then

accepted with probability $\min(1, A)$, with

$$A = \frac{\pi(y, j)}{\pi(x, i)}\, \frac{\omega_{ji}}{\omega_{ij}}\, \frac{1}{\sqrt{2\pi}\sigma}\, e^{-\frac{1}{2\sigma^2} u^2}.$$

This vanilla RJMCMC sampler fails to sample from $\pi$. Depending on its starting point, the sampler typically found one of the two modes and got stuck around that mode even after 10 million iterations (see Figure 1a). In contrast, the WL algorithm provided a reasonable estimate of the distribution in only 2 million iterations. In this example, each WL iteration costs roughly as much as 1.2 iterations of the vanilla sampler. For the WL approach we use $c = 0.4$ and $\gamma_n = 2^{-n}$ until $\gamma_n < 10^{-4}$, before switching to $\gamma_n = n^{-1}$.

Figure 1: Marginal posterior distribution of models. (a) Estimate from the plain RJMCMC; (b) estimate from the WL-RJMCMC; (c) true posterior distribution. Estimates are based on 10 million iterations for the plain RJMCMC and 2 million iterations for the WL-RJMCMC.
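For readers wishing to reproduce this kind of experiment, here is a hedged sketch of the birth-and-death move targeting the re-weighted distribution $\pi_\theta(x, i) \propto h_i(x)/\theta(i)$ as in (8). The interface (0-based model indices, a list `log_h` of log-targets) is our own, and the fixed-dimensional move is omitted.

```python
import numpy as np

def omega_up(i, d):
    """Probability of proposing a birth (i -> i+1), with the usual
    boundary corrections: only birth at the lowest model, only death
    at the highest."""
    if i == 0:
        return 1.0
    if i == d - 1:
        return 0.0
    return 0.5

def wl_birth_death(x, i, theta, log_h, rng, sigma=0.1):
    """One birth-and-death move of Example 2 on the WL-re-weighted target.
    Models are 0-based here, so log_h[i] is the log-target on R^{i+1};
    log_h and the calling convention are hypothetical."""
    d = len(theta)
    up = omega_up(i, d)
    if rng.random() < up:                       # birth: j = i + 1
        j = i + 1
        u = sigma * rng.standard_normal()
        y = np.append(x, u)
        log_omega = np.log(1.0 - omega_up(j, d)) - np.log(up)
        # divide by the proposal density q(u) = N(0, sigma^2)
        log_prop = 0.5 * (u / sigma) ** 2 + np.log(np.sqrt(2 * np.pi) * sigma)
    else:                                       # death: j = i - 1
        j = i - 1
        u, y = x[-1], x[:-1]
        log_omega = np.log(omega_up(j, d)) - np.log(1.0 - up)
        # multiply by the density of the discarded coordinate
        log_prop = -0.5 * (u / sigma) ** 2 - np.log(np.sqrt(2 * np.pi) * sigma)
    # WL re-weighting multiplies the target ratio by theta(i)/theta(j)
    log_A = (log_h[j](y) - log_h[i](x)
             + np.log(theta[i]) - np.log(theta[j])
             + log_omega + log_prop)
    if np.log(rng.random()) < log_A:
        return y, j
    return x, i
```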

4. Some theoretical results

We now look at some theoretical aspects of the algorithm. We investigate the convergence of $\theta_n$ to $\theta_\star$ and a strong law of large numbers for $\{(X_n, I_n)\}$. The difficulty lies in proving that the algorithm is stable, in the following sense.

Definition 4.1. Let $v_n(i)$ be the occupation measure of $\mathcal{X}_i$ by time $n$: $v_n(i) = \frac{1}{n} \sum_{k=1}^n \mathbf{1}_{\{I_k = i\}}$. The Wang-Landau algorithm is said to be stable if

$$\max_{i,j}\ \limsup_{n \to \infty}\ n\left(v_n(i) - v_n(j)\right) < \infty, \quad \text{a.s.} \qquad (9)$$

This is an essential property of the algorithm. When the algorithm is stable, we will show that all the stopping times $\kappa_l$ defined in Algorithm 2.2 are finite and the step-size $\gamma_{a_n}$ gradually converges to 0. Moreover (Theorem 4.1 below), on a path where the algorithm is stable, $\theta_n \to \theta_\star$ and a strong law of large numbers holds for $\frac{1}{n} \sum_{k=1}^n f(X_k, I_k)$. We then derive some verifiable conditions under which the algorithm is shown to be stable. To maintain the flow of ideas, some of the proofs are postponed to Section 6.

4.1. Ergodicity

Let $\Theta = \{\theta \in \mathbb{R}^d :\ \sum_{i=1}^d \theta(i) = 1,\ \theta(i) \in (0, 1),\ i = 1, \ldots, d\}$ and let $((X_0, I_0), \theta_0) \in \mathcal{X} \times \Theta$ be the initial state of the algorithm. This initial state will be considered fixed but arbitrary. Let $\{\gamma_n\}$ be the step-size sequence. Let $\Pr$ be the distribution of the process $\{(X_n, I_n, \theta_n)\}$ started at $(X_0, I_0, \theta_0)$ with step-size sequence $\{\gamma_n\}$, and denote by $E$ the expectation with respect to $\Pr$. To simplify the notation, we omit the dependence of $\Pr$ on $(X_0, I_0, \theta_0)$ and $\{\gamma_n\}$. All statements made almost surely will be with respect to $\Pr$. For $\theta \in \mathbb{R}^d$, $|\theta|$ denotes the Euclidean norm of $\theta$. For any $\varepsilon \in (0, \varepsilon_\star)$, where $\varepsilon_\star = \min_i \theta_\star(i)$, define $\Theta_\varepsilon = \{\theta \in \Theta :\ \theta(i) \geq \varepsilon,\ i = 1, \ldots, d\}$.

Our main assumption asserts that the family $\{P_\theta,\ \theta \in \Theta_\varepsilon\}$ is Lipschitz and uniformly $V$-ergodic. See e.g. [2] for some examples of MCMC samplers where these assumptions hold. Before stating these assumptions, we need some notation. For any function $f : \mathcal{X} \to \mathbb{R}$ and $W : \mathcal{X} \to [1, \infty)$, we denote $|f|_W := \sup_{x \in \mathcal{X}} \frac{|f(x)|}{W(x)}$ and introduce the set of $W$-bounded functions $\mathcal{L}_W := \{f \text{ measurable},\ f : \mathcal{X} \to \mathbb{R},\ |f|_W < \infty\}$. A transition kernel on $(\mathcal{X}, \mathcal{B})$ operates on measurable real-valued functions $f$ as $Pf(x) = \int P(x, dy) f(y)$, and the product of two transition kernels $P_1$

and $P_2$ is the transition kernel defined as $P_1 P_2(x, A) = \int P_1(x, dy)\, P_2(y, A)$. For two transition kernels $P_1$ and $P_2$, we define the $W$-distance between $P_1$ and $P_2$ as

$$\|P_1 - P_2\|_W := \sup_{|f|_W \leq 1} \left| P_1 f - P_2 f \right|_W.$$

(A1) We assume the existence of a measurable function $V : \mathcal{X} \to [1, \infty)$, a set $\mathcal{C} \subseteq \mathcal{X}$ and a probability measure $\nu$ on $(\mathcal{X}, \mathcal{B})$ with $\nu(\mathcal{C}) > 0$, with the following property. For all $\varepsilon \in (0, \varepsilon_\star)$ we can find constants $\lambda_\varepsilon \in (0, 1)$, $b_\varepsilon \in [0, \infty)$, $\beta_\varepsilon \in (0, 1]$ and an integer $n_{0,\varepsilon}$ such that:

$$\inf_{\theta \in \Theta_\varepsilon} P_\theta^{n_{0,\varepsilon}}(x, A) \geq \beta_\varepsilon\, \nu(A)\, \mathbf{1}_{\mathcal{C}}(x), \quad x \in \mathcal{X},\ A \in \mathcal{B}, \qquad (10)$$

and

$$\sup_{\theta \in \Theta_\varepsilon} P_\theta V(x) \leq \lambda_\varepsilon V(x) + b_\varepsilon \mathbf{1}_{\mathcal{C}}(x), \quad x \in \mathcal{X}. \qquad (11)$$

Inequality (11) of (A1) is the so-called drift condition, and (10) is the so-called minorization condition. Next, we assume that $P_\theta$ is Lipschitz as a function of $\theta$.

(A2) For all $\alpha \in [0, 1]$ and all $\varepsilon \in (0, \varepsilon_\star)$, there exists $K = K(\alpha, \varepsilon) < \infty$ such that for all $\theta, \theta' \in \Theta_\varepsilon$,

$$\|P_\theta - P_{\theta'}\|_{V^\alpha} \leq K\, |\theta - \theta'|, \qquad (12)$$

where $V$ is defined in (A1). Finally, we assume that the step-size sequence is well-behaved:

(A3) $\{\gamma_n\}$ is non-increasing, $\gamma_n > 0$, $\sum \gamma_n = \infty$ and $\gamma_n = O(n^{-1})$ as $n \to \infty$.

For $B > 0$, we introduce the following stopping time:

$$\tau(B) := \inf\left\{k \geq 0 :\ \max_{i,j}\ k\left(v_k(i) - v_k(j)\right) > B\right\}, \qquad (13)$$

with the usual convention that $\inf \emptyset = \infty$. $\tau(B)$ is the first time at which one component has accumulated $B$ or more visits than some other component; thus Definition 4.1 is precisely equivalent to $\tau(B) = \infty$ for some $B$. With this in mind, the next result says essentially that if the algorithm is stable and (A1-3) hold, then it is ergodic.

Theorem 4.1. Assume (A1-3). Let $B > 0$ be given. Then:

i)

$$\left|\theta_n - \theta_\star\right| \mathbf{1}_{\{\tau(B) > n\}} \to 0, \quad \text{a.s. as } n \to \infty. \qquad (14)$$

ii) For any function $f \in \mathcal{L}_{V^{1/2}}$, writing $\bar{f} = f - \pi_\star(f)$, we have:

$$\frac{1}{n} \sum_{k=1}^n \bar{f}(X_k, I_k)\, \mathbf{1}_{\{\tau(B) > k-1\}} \to 0, \quad \text{a.s. as } n \to \infty. \qquad (15)$$

Proof. See Section 6.2.

4.2. Checking (A1-2)

We impose the drift and minorization conditions (A1) uniformly for $\theta \in \Theta_\varepsilon$, not uniformly for $\theta \in \Theta$. This is an important point. Indeed, as we will now see, a drift and minorization condition uniformly in $\theta$ for $\theta \in \Theta_\varepsilon$ is almost always true as soon as one $P_\theta$ satisfies these conditions (Proposition 4.1), whereas a minorization and drift condition uniformly in $\theta$ for $\theta \in \Theta$ is almost never true. Suppose that each $P_\theta$ is a Metropolis-Hastings kernel with invariant distribution $\pi_\theta$ and proposal kernel $Q$. That is, $P_\theta f(x, i) = M_\theta f(x, i) + r_\theta(x, i) f(x, i)$, where

$$M_\theta f(x, i) = \sum_{j=1}^d \int_{\mathcal{X}_j} \min\left(1,\ \frac{\theta(i)}{\theta(j)}\, R(y, j; x, i)\right) f(y, j)\, Q(x, i; dy, j)$$

and

$$r_\theta(x, i) = 1 - \sum_{j=1}^d \int_{\mathcal{X}_j} \min\left(1,\ \frac{\theta(i)}{\theta(j)}\, R(y, j; x, i)\right) Q(x, i; dy, j),$$

with $R(x, i; y, j)$ the Radon-Nikodym density of $\pi(dx, i)\, Q(x, i; dy, j)$ with respect to $\pi(dy, j)\, Q(y, j; dx, i)$. With $\theta = (1, \ldots, 1)$, we use $\pi$ (resp. $P$ and $M$) to denote $\pi_\theta$ (resp. $P_\theta$ and $M_\theta$). The next result states that we only need to check that $P$ is geometrically ergodic in order to obtain (A1-2).

Proposition 4.1. Suppose that there exist a measurable function $V : \mathcal{X} \to [1, \infty)$, a set $\mathcal{C} \subseteq \mathcal{X}$, a probability measure $\nu$ on $(\mathcal{X}, \mathcal{B})$ with $\nu(\mathcal{C}) > 0$, constants $\lambda \in (0, 1)$, $b \in [0, \infty)$, $\beta \in (0, 1]$ and a finite integer $n_0$ such that:

$$M^{n_0}(x, A) \geq \beta\, \nu(A)\, \mathbf{1}_{\mathcal{C}}(x), \quad x \in \mathcal{X},\ A \in \mathcal{B};$$

and

$$PV(x) \leq \lambda V(x) + b\, \mathbf{1}_{\mathcal{C}}(x), \quad x \in \mathcal{X}.$$

Then (A1-2) hold.

Proof. See Section 6.1.

4.3. Stability

Theorem 4.1 asserts that under (A1-3), the WL algorithm converges to the right limit on stable paths. This raises the question of checking the stability of the algorithm. The stability condition is difficult to check in general. The next theorem gives some easily checked conditions under which the algorithm is stable.

Theorem 4.2. The WL algorithm is stable under either of the following two conditions.

a) There exist $\varepsilon \in (0, 1)$, $K \in (0, \infty)$ and an integer $n_0 \geq 0$ such that for any $i, j \in \{1, \ldots, d\}$ and $\theta \in \mathbb{R}^d$, $\theta(i)/\theta(j) > K$ implies that $P_\theta^{n_0}\left((x, i),\ \mathcal{X}_j \times \{j\}\right) \geq \varepsilon$ for all $x \in \mathcal{X}_i$.

b) There exists $\varepsilon \in (0, 1)$ such that for any $j \in \{1, \ldots, d\}$ and $\theta \in \mathbb{R}^d$, $\theta(j) \leq \min_{1 \leq i \leq d} \theta(i)$ implies $P_\theta\left((x, i),\ \mathcal{X}_j \times \{j\}\right) \geq \varepsilon$ for all $(x, i)$ with $x \notin \mathcal{X}_j$.

Proof. See Section 6.3.

4.4. Application to multicanonical sampling

Consider the multicanonical sampling of Section 3.1. Suppose that $P_\theta$ is the independence sampler with proposal distribution $Q(dx) = q(x)\, \lambda(dx)$ and invariant distribution $\pi_\theta(dx) \propto \sum_{i=1}^d \left(h(x)/\theta(i)\right) \mathbf{1}_{\mathcal{X}_i}(x)\, \lambda(dx)$. Assume that the function $\omega(x) \equiv h(x)/q(x)$ is bounded, with supremum $\omega_0$. Then clearly, for all $x \in \mathcal{X}_i$, as soon as $\theta(i) \geq \theta(j)$,

$$P_\theta(x, \mathcal{X}_j) \geq \int_{\mathcal{X}_j} \min\left(1,\ \frac{\theta(i)}{\theta(j)}\, \frac{\omega(y)}{\omega_0}\right) Q(dy) \geq \int_{\mathcal{X}_j} \min\left(1,\ \frac{\omega(y)}{\omega_0}\right) Q(dy) =: \varepsilon_j > 0$$

(since $\pi(\mathcal{X}_j) > 0$). Thus, for any $i, j \in \{1, \ldots, d\}$, $i \neq j$, condition a) of Theorem 4.2 holds and the WL algorithm is stable.
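Concretely, the kernel considered here is the standard independence Metropolis sampler applied to the re-weighted target; a minimal sketch (with an assumed interface) follows. Stability rests on $\omega = h/q$ being bounded.

```python
import numpy as np

def independence_sampler_step(x, theta, log_h, log_q, sample_q, label, rng):
    """One step of the independence sampler P_theta of Corollary 4.1(i):
    proposal density q, target pi_theta(dx) ∝ h(x)/theta(label(x)).
    `sample_q`, `log_q`, `log_h` and `label` are problem-specific and
    assumed given (hypothetical interface)."""
    y = sample_q(rng)
    i, j = label(x), label(y)
    # acceptance ratio (theta(i)/theta(j)) * (h(y)/q(y)) / (h(x)/q(x))
    log_A = (np.log(theta[i]) - np.log(theta[j])
             + (log_h(y) - log_q(y)) - (log_h(x) - log_q(x)))
    if np.log(rng.random()) < log_A:
        return y
    return x
```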

Now, since $\omega$ is bounded, each $P_\theta$ satisfies a drift condition and a minorization condition, which implies (A1-2) by Proposition 4.1. Similarly, if $P_\theta$ is a Random Walk Metropolis sampler with proposal kernel $q(y - x)$, we have:

$$P_\theta(x, \mathcal{X}_j) \geq \min\left(1,\ \frac{\theta(i)}{\theta(j)}\right) \int_{\mathcal{X}_j} \min\left(1,\ \frac{\pi(y)}{\pi(x)}\right) q(y - x)\, dy.$$

It follows that if $\mathcal{X}$ is compact and $\pi$, $q$ are positive and continuous, then condition a) of Theorem 4.2 holds for all $i, j$. Under the same assumptions, (A1-2) also hold. We can conclude that:

Corollary 4.1. Consider the multicanonical sampling of Section 3.1 and assume either i) or ii):

i) $P_\theta$ is an independence sampler with proposal distribution $Q(dx) = q(x)\, \lambda(dx)$ and $\omega \equiv h/q$ is bounded.

ii) $P_\theta$ is a RWM sampler with proposal $q(y - x)$; $\mathcal{X}$ is compact and $\pi$ and $q$ are positive and continuous.

Then the algorithm is stable and, under (A3), we have:

$$\theta_n \to \theta_\star, \quad \text{and} \quad \frac{1}{n} \sum_{k=1}^n f(X_k) \to \pi_\star(f) \quad \text{a.s. as } n \to \infty,$$

for any bounded measurable function $f$.

4.5. Application to simulated tempering

Theorems 4.1 and 4.2 can also be applied to the WL version of simulated tempering described in Section 3.2. We consider the simulated tempering algorithm with kernel $P_\theta\left((x, i);\ A \times \{j\}\right) = B_{x,\theta}(i, j)\, P^{[j]}(x, A)$, where $B_{x,\theta}(i, j) \propto h_j(x)/\theta(j)$ and $P^{[j]}$ is a transition kernel (not necessarily Metropolis-Hastings) with invariant distribution proportional to $h_j$. The following corollary is easily proved and is left to the reader.

Corollary 4.2. Suppose that $\mathcal{X}_1$ is compact, $h$ is positive and continuous, and $\{\gamma_n\}$ satisfies (A3). Suppose that there exist $\varepsilon > 0$, a probability measure $\nu$ and an integer $n_0$ such that $\nu(\mathcal{X}_i) > 0$ and $\left(P^{[i]}\right)^{n_0}(x, A) \geq \varepsilon\, \nu(A)$, $i \in \{1, \ldots, d\}$. Then

$$\theta_n \to \theta_\star, \quad \text{and} \quad \frac{1}{n} \sum_{k=1}^n f(X_k, I_k) \to \pi_\star(f) \quad \text{a.s. as } n \to \infty,$$

for any bounded measurable function $f$.

5. Discussion and open problems

In this paper, we propose an extension of the WL algorithm to general state spaces. The WL algorithm differs from other adaptive Markov chain Monte Carlo algorithms based on stochastic approximation by the adaptive nature of its step-size. We have shown through examples that the algorithm can be used effectively to improve on simulated tempering and trans-dimensional MCMC algorithms. We have also studied the asymptotic behavior of the WL algorithm: on stable sample paths and with an appropriate step-size, $\theta_n$ converges to $\theta_\star$ and a strong law of large numbers holds. Finally, when the state space is compact, we have shown that in most cases the algorithm is stable and the aforementioned limit results apply.

Two main questions remain unanswered. The first concerns the stability of the algorithm in unbounded state spaces. Secondly, in order to exploit the full potential of the algorithm, a more precise understanding of its efficiency is needed; in particular, we need to understand how the rate of convergence and the asymptotic variance of the algorithm are related to the parameter $c$ and the step-size $\gamma_n$. We are currently investigating some of these questions.

6. Proofs

The techniques used here can be found in various forms in [8], [11], [3]. Throughout, $C_\varepsilon$ denotes a generic constant whose value can differ from one equation to another. The key result in the proof of Theorem 4.1 is Lemma 6.6, which states that the weighted sum of the noise process in the stochastic approximation followed by $\{\theta_n\}$ is summable.

6.1. Proof of Proposition 4.1

Lemma 6.1. Under the conditions of Proposition 4.1, (A2) holds.

Proof. Let $\varepsilon \in (0, \varepsilon_\star)$ and $\alpha \in (0, 1]$. For $\theta \in \Theta_\varepsilon$, $|f| \leq V^\alpha$ and $(x, i) \in \mathcal{X}$, we have:

$$P_\theta f(x, i) = M_\theta f(x, i) + f(x, i)\left(1 - M_\theta e(x, i)\right),$$

where $e(x, i) \equiv 1$, and

$$M_\theta f(x, i) = \sum_{j=1}^d \int_{\mathcal{X}_j} \min\left(1,\ \frac{\theta(i)}{\theta(j)}\, R(y, j; x, i)\right) f(y, j)\, Q(x, i; dy, j).$$

As a consequence, for $\theta_2, \theta_1 \in \Theta_\varepsilon$:

$$\|P_{\theta_2} - P_{\theta_1}\|_{V^\alpha} \leq \|M_{\theta_2} - M_{\theta_1}\|_{V^\alpha} + \|M_{\theta_2} - M_{\theta_1}\|_{TV},$$

where $\|\cdot\|_{TV} = \|\cdot\|_V$ with $V \equiv 1$. For $(x, i) \in \mathcal{X}$, the function $\theta \mapsto M_\theta f(x, i)$ is differentiable and:

$$\sum_{j=1}^d \left| \frac{\partial}{\partial \theta(j)}\, M_\theta f(x, i) \right| \leq C_\varepsilon\, M|f|(x, i).$$

(A2) then follows by the mean-value theorem and the drift condition on $P$.

Next we show that for any $\varepsilon \in (0, \varepsilon_\star)$, the family $(P_\theta)_{\theta \in \Theta_\varepsilon}$ satisfies a (uniform in $\theta$) drift condition.

Lemma 6.2. Under the conditions of Proposition 4.1, (A1) holds.

Proof. The minorization is immediate. Since $\min(1, ab) \geq \min(1, a)\, \min(1, b)$ for $a, b > 0$,

$$P_\theta^{n_0}(x, i; A) \geq M_\theta^{n_0}(x, i; A) \geq \varepsilon^{n_0} M^{n_0}(x, i; A) \geq \varepsilon^{n_0} \beta\, \nu(A)\, \mathbf{1}_{\mathcal{C}}(x, i).$$

The argument for the uniform drift is fairly simple and we only sketch it. Let $B_{\theta', r} = \Theta_\varepsilon \cap B(\theta', r)$ be the open ball of $\Theta_\varepsilon$ with center $\theta' \in \Theta_\varepsilon$ and radius $r > 0$. It follows from Lemma 6.1 that, by choosing $r > 0$ small enough, if $P_{\theta'}$ satisfies a drift condition toward a small set $\mathcal{C}$, then the family $\{P_\theta,\ \theta \in B_{\theta', r}\}$ satisfies a (uniform in $\theta$) drift condition toward $\mathcal{C}$. Therefore, starting from $P$, we can find an open cover of $\Theta_\varepsilon$ by the balls $B_{\theta', r}$ such that on each such ball a uniform drift toward $\mathcal{C}$ holds. Since $\Theta_\varepsilon$ is compact, it admits a finite cover by the $B_{\theta', r}$ and, using the maximum of the constants of the drift condition toward $\mathcal{C}$ over these balls, we get a uniform drift toward $\mathcal{C}$ for $\{P_\theta,\ \theta \in \Theta_\varepsilon\}$.

6.2. Proof of Theorem 4.1

We start with some preliminary remarks. For any $\varepsilon \in (0, \varepsilon_\star)$ and $\alpha \in (0, 1]$, it is known from Markov chain theory that (A1) implies the existence of $C_{\varepsilon,\alpha} < \infty$, $\rho_{\varepsilon,\alpha} \in (0, 1)$ and $b_\varepsilon$ as in (A1) such that:

$$\sup_{\theta \in \Theta_\varepsilon} \left\| P_\theta^n - \pi_\theta \right\|_{V^\alpha} \leq C_{\varepsilon,\alpha}\, \rho_{\varepsilon,\alpha}^n, \qquad (16)$$

and

$$\sup_{\theta \in \Theta_\varepsilon} \pi_\theta(V) \leq b_\varepsilon. \qquad (17)$$

For a proof, see e.g. [7] and the references therein. Define $\xi(\varepsilon) = \inf\{k \geq 0 :\ \theta_k \notin \Theta_\varepsilon\}$. An easy calculation using (A1) gives

$$E\left[V(X_n, I_n)\, \mathbf{1}_{\{\xi(\varepsilon) > n\}}\right] \leq \lambda_\varepsilon^n\, V(X_0, I_0) + b_\varepsilon / (1 - \lambda_\varepsilon),$$

from which we deduce:

$$\sup_n\ E\left(V(X_n, I_n)\, \mathbf{1}_{\{\xi(\varepsilon) > n\}}\right) < \infty. \qquad (18)$$

The proof of the theorem is based on some nice properties of the solutions of the so-called Poisson equation. These solutions allow us to obtain a martingale approximation to the process $\sum_{k=1}^n f(X_k, I_k)$. For $f \in \mathcal{L}_{V^\alpha}$, $\alpha \in (0, 1]$, and $\theta \in \Theta_\varepsilon$, define the function

$$h_\theta = \sum_{k \geq 0} P_\theta^k\left(f - \pi_\theta(f)\right). \qquad (19)$$

$h_\theta$ solves the Poisson equation $f - \pi_\theta(f) = h_\theta - P_\theta h_\theta$. By (A1), $h_\theta$ exists and $h_\theta \in \mathcal{L}_{V^\alpha}$. For $f \in \mathcal{L}_{V^\alpha}$ and $\theta, \theta' \in \Theta_\varepsilon$, writing $f_\theta = f - \pi_\theta(f)$, we have

$$\pi_{\theta'}(f) - \pi_\theta(f) = \pi_{\theta'}\left[P_\theta^k f_\theta\right] + \sum_{j=1}^k \pi_{\theta'}\left[P_{\theta'}^{k-j}\left(P_{\theta'} - P_\theta\right) P_\theta^{j-1} f_\theta\right] \leq C_\varepsilon\, \rho_\varepsilon^k + C_\varepsilon\, |\theta' - \theta| \sum_{j=1}^k \rho_\varepsilon^{j-1},$$

from which, letting $k \to \infty$, we deduce that there exists a finite constant $C_\varepsilon$ such that:

$$\sup_{f:\ |f|_{V^\alpha} \leq 1} \left| \pi_{\theta'}(f) - \pi_\theta(f) \right| \leq C_\varepsilon\, |\theta' - \theta|. \qquad (20)$$

The constant $C_\varepsilon$ is not necessarily the same from one equation to the other. Similarly, denoting by $\bar{P}_\theta$ the operator $P_\theta - \pi_\theta$, we have

$$\bar{P}_{\theta'}^k f_{\theta'} - \bar{P}_\theta^k f_\theta = \bar{P}_{\theta'}^k f - \bar{P}_\theta^k f = \sum_{j=1}^k \bar{P}_{\theta'}^{k-j}\left(\bar{P}_{\theta'} - \bar{P}_\theta\right) \bar{P}_\theta^{j-1} f \leq C_\varepsilon\, |\theta' - \theta|\, k\, \rho_\varepsilon^{k-1}\, V^\alpha,$$

using (A1-2). This, together with (20), implies the existence of a finite constant $C_\varepsilon$ such that for all $\alpha \in (0, 1]$ and $\theta, \theta' \in \Theta_\varepsilon$:

$$\left| h_{\theta'} - h_\theta \right|_{V^\alpha} + \left| P_{\theta'} h_{\theta'} - P_\theta h_\theta \right|_{V^\alpha} \leq C_\varepsilon\, |\theta' - \theta|. \qquad (21)$$

In our analysis, we mainly view $\{\theta_n\}$ as a stochastic approximation sequence. The recursion for $\{\theta_n\}$ writes:

$$\theta_{n+1}(i) = \frac{\phi_{n+1}(i)}{\sum_{e=1}^d \phi_{n+1}(e)} = \frac{\phi_n(i) + \gamma_{a_n}\, \phi_n(i)\, \mathbf{1}_{\{I_{n+1}=i\}}}{\sum_{e=1}^d \phi_n(e) + \gamma_{a_n}\, \phi_n(I_{n+1})} = \theta_n(i)\, \frac{1 + \gamma_{a_n} \mathbf{1}_{\{I_{n+1}=i\}}}{1 + \gamma_{a_n}\, \theta_n(I_{n+1})} = \theta_n(i) + \gamma_{a_n} H_i(\theta_n, I_{n+1}) + \gamma_{a_n}^2\, r_{i,n}(\theta_n, I_{n+1}), \qquad (22)$$

where $H_i(\theta, I) = \theta(i)\left(\mathbf{1}_{\{I=i\}} - \theta(I)\right)$ and $r_{i,n}(\theta, I) = -\theta(i)\,\theta(I)\, \frac{\mathbf{1}_{\{I=i\}} - \theta(I)}{1 + \gamma_{a_n}\theta(I)}$. The mean field function $h_i(\theta) = \pi_\theta(H_i)$ is

$$h_i(\theta) = \frac{\theta_\star(i) - \theta(i)}{\sum_{j=1}^d \theta_\star(j)/\theta(j)}. \qquad (23)$$

Lemma 6.3. Assume (A3). Let $B > 0$ be given and set $c_0 = 2Bd/c$. Then on $\{\tau(B) > n\}$, $a_n \geq n/c_0$. Moreover $\sum \gamma_{a_n} = \infty$, and $\sum \gamma_{a_n}^2\, \mathbf{1}_{\{\tau(B) > n\}} < \infty$ a.s.

Proof. On $\{\tau(B) > n\}$, for any $k < k' \leq n$ and any $i \in \{1, \ldots, d\}$, $\left| v_{k,k'}(i) - 1/d \right| \leq 2B/(k' - k)$. Therefore, if $l c_0 \leq n < \tau(B)$, then $\kappa_l \leq n$; that is, $a_n \geq n/c_0$ on $\{\tau(B) > n\}$. Since $\{\gamma_n\}$ is non-increasing and $a_n \leq n$, $\sum \gamma_{a_n} \geq \sum \gamma_n = \infty$. On the other hand, $\sum \gamma_{a_n}^2\, \mathbf{1}_{\{\tau(B) > n\}} \leq C \sum_n (c_0/n)^2 < \infty$.

For any $\varepsilon > 0$, recall the stopping time $\xi(\varepsilon) := \inf\{k \geq 0 :\ \theta_k \notin \Theta_\varepsilon\}$. We will need the following lemma, whose proof is left to the reader.

Lemma 6.4. Let $\{\gamma_n,\ n \geq 0\}$ be a non-increasing sequence of positive numbers and $\{v_n,\ n \geq 0\}$ a sequence of numbers such that $\left|\sum_{k=0}^N v_k\right| \leq B$ for all $N \geq 0$. Then $\left|\sum_{k=0}^N \gamma_k v_k\right| \leq \gamma_0 B$ for all $N \geq 0$.

The following lemma relates $\tau(B)$ and $\xi(\varepsilon)$.

Lemma 6.5. Assume (A3). For any $B > 0$, we can find $\varepsilon \in (0, \varepsilon_\star)$ such that $\tau(B) \leq \xi(\varepsilon)$.

Proof. Take $\varepsilon = \left(1 + (d-1)\, e^{\gamma_0 B}\right)^{-1} > 0$. Without loss of generality, we assume that $\varepsilon < \varepsilon_\star$ and $\varepsilon \leq \min_i \theta_0(i)$. We need to show that $\min_i \theta_n(i) > \varepsilon$ for all $n < \tau(B)$. But

$$\theta_n(i) = \left(1 + \sum_{j \neq i} \frac{\phi_n(j)}{\phi_n(i)}\right)^{-1}.$$

It is thus enough to show that $\phi_n(j)/\phi_n(i) \leq e^{B\gamma_0}$ for all $i \neq j$ and all $n < \tau(B)$. But we have:

$$\frac{\phi_n(j)}{\phi_n(i)} = \exp\left(\sum_{p \geq 0} \log(1 + \gamma_p)\left(N_{\kappa_p,\, n \wedge \kappa_{p+1}}(j) - N_{\kappa_p,\, n \wedge \kappa_{p+1}}(i)\right)\right),$$

where $N_{l,m}(i) = 0$ if $m \leq l$ and $N_{l,m}(i) = \sum_{q=l+1}^m \mathbf{1}_{\mathcal{X}_i}(X_q)$ otherwise ($N_{l,m}(i)$ is the number of visits to $\mathcal{X}_i$ from time $l+1$ to $m$). For any $n < \tau(B)$, $\left|\sum_{p=0}^P \left(N_{\kappa_p,\, n \wedge \kappa_{p+1}}(i) - N_{\kappa_p,\, n \wedge \kappa_{p+1}}(j)\right)\right| \leq B$ for all $P \geq 0$. Since the sequence $\log(1 + \gamma_p) \leq \gamma_p \leq \gamma_0$ is non-increasing, Lemma 6.4 thus implies that $\phi_n(j)/\phi_n(i) \leq e^{\gamma_0 B}$.

Lemma 6.6. Assume (A1-3). Let $B > 0$ be given. Let $\{\gamma_n'\}$ be a sequence that satisfies (A3) and such that $\sum_n \gamma'_{\lfloor n/a \rfloor}\, \gamma_{\lfloor n/a \rfloor} < \infty$ for all $a > 0$. For $\theta \in \Theta$, let $H_\theta : \mathcal{X} \to \mathbb{R}$ be a measurable function such that $H_\theta \in \mathcal{L}_{V^{1/2}}$. Then:

$$\sum_k \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} \left[H_{\theta_k}(X_{k+1}, I_{k+1}) - \pi_{\theta_k}(H_{\theta_k})\right] < \infty, \quad \text{a.s.} \qquad (24)$$

Proof. Let $\varepsilon > 0$ be as in Lemma 6.5. From (A1), there exists $h_\theta \in \mathcal{L}_{V^{1/2}}$ that solves the Poisson equation $h_\theta - P_\theta h_\theta = H_\theta - \pi_\theta(H_\theta)$ for all $\theta \in \Theta_\varepsilon$. Using this, we can write:

$$\sum_{k=0}^n \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} \left[H_{\theta_k}(X_{k+1}, I_{k+1}) - \pi_{\theta_k}(H_{\theta_k})\right] = \sum_{k=0}^n \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} \left(U^{(1)}_{k+1} + U^{(2)}_{k+1} + U^{(3)}_{k+1}\right),$$

where $U^{(1)}_{k+1} = h_{\theta_k}(X_{k+1}, I_{k+1}) - P_{\theta_k} h_{\theta_k}(X_k, I_k)$, $U^{(2)}_{k+1} = P_{\theta_k} h_{\theta_k}(X_k, I_k) - P_{\theta_{k+1}} h_{\theta_{k+1}}(X_{k+1}, I_{k+1})$, and $U^{(3)}_{k+1} = P_{\theta_{k+1}} h_{\theta_{k+1}}(X_{k+1}, I_{k+1}) - P_{\theta_k} h_{\theta_k}(X_{k+1}, I_{k+1})$. Clearly, $M_n = \sum_{k=0}^n \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} U^{(1)}_{k+1}$ is a martingale and, using Lemma 6.3, (18), and since $h_\theta$

and $P_\theta h_\theta$ belong to $\mathcal{L}_{V^{1/2}}$, we have:

$$E\left(M_n^2\right) \leq C_\varepsilon \sum_{k=0}^n E\left[\left(\gamma'_{a_k}\right)^2 \mathbf{1}_{\{\tau(B) > k\}}\, V(X_{k+1}, I_{k+1})\right] \leq C_\varepsilon \sum_k \left(\gamma'_{\lfloor k/c_0 \rfloor}\right)^2 E\left[\mathbf{1}_{\{\tau(B) > k\}}\, V(X_{k+1}, I_{k+1})\right] \leq C_\varepsilon \sum_k \left(\gamma'_{\lfloor k/c_0 \rfloor}\right)^2 < \infty.$$

By Doob's convergence theorem for martingales, $\sum \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} U^{(1)}_{k+1}$ is finite a.s.

On $\{\tau(B) = l\}$, $l < \infty$, $\sum_k \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} U^{(2)}_{k+1} = \sum_{k=0}^{l-1} \gamma'_{a_k} U^{(2)}_{k+1}$, which is finite almost surely. On $\{\tau(B) = \infty\}$, we can write, by summation by parts:

$$\mathbf{1}_{\{\tau(B) = \infty\}} \sum_{k=0}^n \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} U^{(2)}_{k+1} = \gamma'_{a_0} P_{\theta_0} h_{\theta_0}(X_0, I_0) - \gamma'_{a_n}\, \mathbf{1}_{\{\tau(B) = \infty\}} P_{\theta_{n+1}} h_{\theta_{n+1}}(X_{n+1}, I_{n+1}) + \sum_{k=0}^{n-1} \left(\gamma'_{a_{k+1}} - \gamma'_{a_k}\right) \mathbf{1}_{\{\tau(B) = \infty\}} P_{\theta_{k+1}} h_{\theta_{k+1}}(X_{k+1}, I_{k+1}).$$

Since $E\left[\left(\gamma'_{a_n}\, \mathbf{1}_{\{\tau(B) = \infty\}} P_{\theta_{n+1}} h_{\theta_{n+1}}(X_{n+1}, I_{n+1})\right)^2\right] \leq C_\varepsilon \left(\gamma'_{\lfloor n/c_0 \rfloor}\right)^2$ and $\sum \left(\gamma'_{\lfloor n/c_0 \rfloor}\right)^2 < \infty$, the term $\gamma'_{a_n}\, \mathbf{1}_{\{\tau(B) = \infty\}} P_{\theta_{n+1}} h_{\theta_{n+1}}(X_{n+1}, I_{n+1})$ converges a.s. to 0. Moreover,

$$\left|\sum_{k=0}^{n-1} \left(\gamma'_{a_{k+1}} - \gamma'_{a_k}\right) \mathbf{1}_{\{\tau(B) = \infty\}} P_{\theta_{k+1}} h_{\theta_{k+1}}(X_{k+1}, I_{k+1})\right| \leq C_\varepsilon \sum_{k} \left(\gamma'_{a_k} - \gamma'_{a_{k+1}}\right) \mathbf{1}_{\{\tau(B) = \infty\}}\, V^{1/2}(X_{k+1}, I_{k+1}).$$

Since $a_k$ only changes at the stopping times $\kappa_i$, we have

$$E\left[\sum_k \left(\gamma'_{a_k} - \gamma'_{a_{k+1}}\right) \mathbf{1}_{\{\tau(B) = \infty\}}\, V^{1/2}(X_{k+1}, I_{k+1})\right] = \sum_{k \geq 1} \left(\gamma'_{k-1} - \gamma'_k\right) E\left[\mathbf{1}_{\{\tau(B) = \infty\}}\, V^{1/2}(X_{\kappa_k + 1}, I_{\kappa_k + 1})\right] \leq C_\varepsilon \sum_{k \geq 1} \left(\gamma'_{k-1} - \gamma'_k\right) \leq C_\varepsilon\, \gamma'_0.$$

By Lebesgue's dominated convergence theorem, we conclude that

$$E\left[\left|\sum_k \left(\gamma'_{a_{k+1}} - \gamma'_{a_k}\right) \mathbf{1}_{\{\tau(B) = \infty\}} P_{\theta_{k+1}} h_{\theta_{k+1}}(X_{k+1}, I_{k+1})\right|\right] < \infty.$$

This is sufficient to conclude that $\sum \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} U^{(2)}_{k+1}$ is finite almost surely.

Using (21):

$$\left|\sum_{k=0}^n \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} U^{(3)}_{k+1}\right| \leq C_\varepsilon \sum_k \gamma'_{a_k}\, \gamma_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}}\, V^{1/2}(X_{k+1}, I_{k+1}) \leq C_\varepsilon \sum_k \gamma'_{\lfloor k/c_0 \rfloor}\, \gamma_{\lfloor k/c_0 \rfloor}\, \mathbf{1}_{\{\tau(B) > k\}}\, V^{1/2}(X_{k+1}, I_{k+1}),$$

and, by (18), $E\left[\gamma'_{\lfloor k/c_0 \rfloor}\, \gamma_{\lfloor k/c_0 \rfloor}\, \mathbf{1}_{\{\tau(B) > k\}}\, V^{1/2}(X_{k+1}, I_{k+1})\right] \leq C_\varepsilon\, \gamma'_{\lfloor k/c_0 \rfloor}\, \gamma_{\lfloor k/c_0 \rfloor}$, with $\sum_k \gamma'_{\lfloor k/c_0 \rfloor}\, \gamma_{\lfloor k/c_0 \rfloor} < \infty$. With Lebesgue's dominated convergence theorem, we deduce that

$$E\left[\left|\sum_k \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} U^{(3)}_{k+1}\right|\right] \leq C_\varepsilon \sum_k \gamma'_{\lfloor k/c_0 \rfloor}\, \gamma_{\lfloor k/c_0 \rfloor} < \infty,$$

which implies that $\sum_{k=0}^n \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} U^{(3)}_{k+1}$ converges almost surely to a finite limit. This completes the proof of the lemma.

We are now in a position to prove Theorem 4.1. We start with i).

Proposition 6.1. Assume (A1-3) and let $B > 0$ be given. Then $\left|\theta_n - \theta_\star\right| \mathbf{1}_{\{\tau(B) > n\}} \to 0$ with probability one as $n \to \infty$.

Proof. The idea of the proof is borrowed from [11]. We saw in (22) that

$$\theta_{n+1} = \theta_n + \gamma_{a_n} H(\theta_n, I_{n+1}) + \gamma_{a_n}^2 r_n,$$

where $H = (H_1, \ldots, H_d)$, $r_n = (r_{1,n}, \ldots, r_{d,n})$, $H_i(\theta, I) = \theta(i)\left(\mathbf{1}_{\{I=i\}} - \theta(I)\right)$ and $r_{i,n} = -\theta(i)\,\theta(I)\, \frac{\mathbf{1}_{\{I=i\}} - \theta(I)}{1 + \gamma_{a_n}\theta(I)}$. We note that $|H| \leq 1$ and $|r_n| \leq 1$. Let $\varepsilon \in (0, \varepsilon_\star)$ be such that $\tau(B) \leq \xi(\varepsilon)$ (Lemma 6.5). We recall that the mean field function of the recursion is $h_i(\theta) = \left(\theta_\star(i) - \theta(i)\right) / \sum_{j=1}^d \left(\theta_\star(j)/\theta(j)\right)$. We introduce

$$\theta_n' = \theta_n + \sum_{j \geq n} \gamma_{a_j}\, \mathbf{1}_{\{\tau(B) > j\}} \left[H(\theta_j, I_{j+1}) - h(\theta_j)\right].$$

From Lemma 6.6, $\theta_n' - \theta_n \to 0$ almost surely. $\{\theta_n'\}$ satisfies the recursion

$$\theta_{n+1}' = \theta_n' + \gamma_{a_n} h(\theta_n) + \gamma_{a_n}\, \mathbf{1}_{\{\tau(B) \leq n\}} \left[H(\theta_n, I_{n+1}) - h(\theta_n)\right] + \gamma_{a_n}^2 r_n.$$

We can then deduce:

$$\left|\theta_{n+1}' - \theta_\star\right|^2 \mathbf{1}_{\{\tau(B) > n\}} = \left|\theta_n' - \theta_\star\right|^2 \mathbf{1}_{\{\tau(B) > n\}} + 2\gamma_{a_n} \left\langle \theta_n' - \theta_\star,\ h(\theta_n) \right\rangle \mathbf{1}_{\{\tau(B) > n\}} + \gamma_{a_n}\, \mathbf{1}_{\{\tau(B) > n\}}\, \tilde{r}_n \leq \left(1 - 2\varepsilon\gamma_{a_n}\right) \left|\theta_n' - \theta_\star\right|^2 \mathbf{1}_{\{\tau(B) > n\}} + \gamma_{a_n}\, \mathbf{1}_{\{\tau(B) > n\}} \left(\tilde{r}_n + \left\langle \theta_n' - \theta_\star,\ h(\theta_n) - h(\theta_n') \right\rangle\right),$$

where $\tilde{r}_n \to 0$ almost surely as $n \to \infty$. Since $\theta_n$ remains in the compact set $\Theta_\varepsilon$, $h$ is continuous and $\theta_n' - \theta_n \to 0$, it follows that $\left\langle \theta_n' - \theta_\star,\ h(\theta_n) - h(\theta_n') \right\rangle \to 0$. We can summarize the situation as follows. Writing $U_n = \left|\theta_n' - \theta_\star\right|^2 \mathbf{1}_{\{\tau(B) > n\}}$, we have:

$$U_{n+1} \leq \left(1 - 2\varepsilon\gamma_{a_n}\right) U_n + \gamma_{a_n} r_n', \qquad (25)$$

where $r_n' \to 0$ as $n \to \infty$. This implies that $U_n \to 0$, which, given Lemma 6.6, proves the proposition. To see why $U_n \to 0$, let $\delta > 0$ be given. Take $n_0 > 0$ such that for $n \geq n_0$, $r_n' \leq 2\varepsilon\delta$ and $\left(1 - 2\varepsilon\gamma_{a_n}\right) > 0$. Then, for $n \geq n_0$,

$$U_{n+1} - \delta \leq \left(1 - 2\varepsilon\gamma_{a_n}\right)(U_n - \delta) + 2\varepsilon\gamma_{a_n}\left(r_n'/(2\varepsilon) - \delta\right) \leq \left(1 - 2\varepsilon\gamma_{a_n}\right)(U_n - \delta),$$

which implies that $\limsup (U_n - \delta) \leq 0$; since $\delta > 0$ is arbitrary, we conclude that $\lim U_n = 0$.

Proposition 6.2. Assume (A1-3) and let $B > 0$ be given. For any function $f \in \mathcal{L}_{V^{1/2}}$, writing $\bar{f} = f - \pi_\star(f)$, we have:

$$\frac{1}{n} \sum_{k=1}^n \bar{f}(X_k, I_k)\, \mathbf{1}_{\{\tau(B) > k-1\}} \to 0, \quad \text{a.s. as } n \to \infty. \qquad (26)$$

Proof. In view of Proposition 6.1 and (20), we only need to show that

$$\frac{1}{n} \sum_{k=1}^n \left(f(X_k, I_k) - \pi_{\theta_{k-1}}(f)\right) \mathbf{1}_{\{\tau(B) > k-1\}} \to 0 \quad \text{a.s.} \qquad (27)$$

Kronecker's lemma applied to (24) of Lemma 6.6, with $\gamma_n' = 1/n$ and $H_\theta = f$, yields (27).

6.3. Proof of Theorem 4.2

Proof. Assume that a) holds. Define $\alpha_k = (1 + \gamma_0)^{(1+n_0)k}$ and suppose that $\limsup_{n} n\left(v_n(i) - v_n(j)\right) = \infty$. This implies the existence of an increasing sequence of integers $\{n_k,\ k \geq 1\}$ such that $n_k\left(v_{n_k}(i) - v_{n_k}(j)\right) > \alpha_k$ and $(X_{n_k}, I_{n_k}) \in \mathcal{X}_i \times \{i\}$ but $(X_{n_k + n_0}, I_{n_k + n_0}) \notin \mathcal{X}_j \times \{j\}$ for all $k \geq 1$. Clearly, $n_k\left(v_{n_k}(i) - v_{n_k}(j)\right) > \alpha_k$ implies that $\theta_{n_k}(i)/\theta_{n_k}(j)$ converges to $+\infty$. But, since $i$ leads to $j$, we can then find $\varepsilon > 0$ and $k_0$ such that for $k \geq k_0$:

$$\Pr\left[(X_{n_k + n_0}, I_{n_k + n_0}) \notin \mathcal{X}_j \times \{j\} \ \middle|\ \mathcal{F}_{n_k},\ (X_{n_k}, I_{n_k}) \in \mathcal{X}_i \times \{i\},\ n_k\left(v_{n_k}(i) - v_{n_k}(j)\right) > \alpha_k\right] \leq 1 - \varepsilon.$$

Thus $\Pr\left(\limsup_n n\left(v_n(i) - v_n(j)\right) = \infty\right) \leq \lim_k (1 - \varepsilon)^k = 0$.

Assume that b) holds. Define $\alpha_k = (1 + \gamma_0)^{2k}$ and suppose that $\limsup_n n\left(\max_i v_n(i) - \min_j v_n(j)\right) = \infty$. Then we can find $i_0 \in \{1, \ldots, d\}$ and an increasing sequence of integers $\{n_k,\ k \geq 1\}$ such that $\min_j v_{n_k}(j) = v_{n_k}(i_0)$, $n_k\left(\max_j v_{n_k}(j) - v_{n_k}(i_0)\right) > \alpha_k$, $(X_{n_k}, I_{n_k}) \notin \mathcal{X}_{i_0} \times \{i_0\}$ and $(X_{n_k + 1}, I_{n_k + 1}) \notin \mathcal{X}_{i_0} \times \{i_0\}$ for all $k \geq 1$. We can then proceed as above and conclude.

Acknowledgments: The authors are grateful to Christophe Andrieu, David P. Landau and Eric Moulines for very helpful discussions. We also thank David P. Sanders and the referees for comments that have, we hope, made this paper more readable. This work is partly supported by the National Science Foundation (grant DMS) and by a postdoctoral fellowship from the Natural Sciences and Engineering Research Council of Canada.

References

[1] Andrieu, C. and Atchadé, Y. F. (2007). On the efficiency of adaptive MCMC algorithms. Electronic Communications in Probability.

[2] Andrieu, C. and Moulines, É. (2006). On the ergodicity properties of some adaptive MCMC algorithms. Ann. Appl. Probab.

[3] Andrieu, C., Moulines, É. and Priouret, P. (2005). Stability of stochastic approximation under verifiable conditions. SIAM J. Control Optim. (electronic).

[4] Andrieu, C. and Robert, C. P. (2001). Controlled MCMC for optimal sampling. Technical report, Université Paris Dauphine, Ceremade.

[5] Atchadé, Y. F. and Liu, J. S. (2006). Discussion of the paper by Kou, Zhou and Wong. Annals of Statistics. To appear.

[6] Atchadé, Y. F. and Rosenthal, J. S. (2005). On adaptive Markov chain Monte Carlo algorithms. Bernoulli.

[7] Baxendale, P. H. (2005). Renewal theory and computable convergence rates for geometrically ergodic Markov chains. Annals of Applied Probability.

[8] Benveniste, A., Métivier, M. and Priouret, P. (1990). Adaptive Algorithms and Stochastic Approximations. Applications of Mathematics, Springer, Paris-New York.

[9] Berg, B. A. and Neuhaus, T. (1992). Multicanonical ensemble: A new approach to simulate first-order phase transitions. Phys. Rev. Lett. 68.

[10] Brockwell, A. E. and Kadane, J. B. (2005). Identification of regeneration times in MCMC simulation, with application to adaptive schemes. J. Comput. Graph. Statist.

[11] Delyon, B. (1996). General results on the convergence of stochastic algorithms. IEEE Trans. Automat. Control.

[12] Geyer, C. J. and Thompson, E. (1995). Annealing Markov chain Monte Carlo with applications to pedigree analysis. Journal of the American Statistical Association.

[13] Gilks, W. R., Roberts, G. O. and Sahu, S. K. (1998). Adaptive Markov chain Monte Carlo through regeneration. J. Amer. Statist. Assoc.

[14] Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika.

[15] Haario, H., Saksman, E. and Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli.

[16] Liang, F., Liu, C. and Carroll, R. J. (2007). Stochastic approximation in Monte Carlo computation. JASA.

[17] Liang, F. and Wong, W. H. (2001). Real-parameter evolutionary Monte Carlo with applications to Bayesian mixture models. Journal of the American Statistical Association.

[18] Marinari, E. and Parisi, G. (1992). Simulated tempering: A new Monte Carlo scheme. Europhysics Letters.

[19] Mira, A. and Sargent, D. J. (2003). A new strategy for speeding Markov chain Monte Carlo algorithms. Stat. Methods Appl.

[20] Rosenthal, J. S. and Roberts, G. O. (2007). Coupling and ergodicity of adaptive MCMC. Journal of Applied Probability.

[21] Wang, F. and Landau, D. P. (2001). Efficient, multiple-range random walk algorithm to calculate the density of states. Physical Review Letters.


More information

Probability and Measure

Probability and Measure Probability and Measure Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Convergence of Random Variables 1. Convergence Concepts 1.1. Convergence of Real

More information

Stochastic Approximation in Monte Carlo Computation

Stochastic Approximation in Monte Carlo Computation Stochastic Approximation in Monte Carlo Computation Faming Liang, Chuanhai Liu and Raymond J. Carroll 1 June 26, 2006 Abstract The Wang-Landau algorithm is an adaptive Markov chain Monte Carlo algorithm

More information

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications.

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications. Stat 811 Lecture Notes The Wald Consistency Theorem Charles J. Geyer April 9, 01 1 Analyticity Assumptions Let { f θ : θ Θ } be a family of subprobability densities 1 with respect to a measure µ on a measurable

More information

Winter 2019 Math 106 Topics in Applied Mathematics. Lecture 9: Markov Chain Monte Carlo

Winter 2019 Math 106 Topics in Applied Mathematics. Lecture 9: Markov Chain Monte Carlo Winter 2019 Math 106 Topics in Applied Mathematics Data-driven Uncertainty Quantification Yoonsang Lee (yoonsang.lee@dartmouth.edu) Lecture 9: Markov Chain Monte Carlo 9.1 Markov Chain A Markov Chain Monte

More information

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning for for Advanced Topics in California Institute of Technology April 20th, 2017 1 / 50 Table of Contents for 1 2 3 4 2 / 50 History of methods for Enrico Fermi used to calculate incredibly accurate predictions

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

Adaptive Monte Carlo methods

Adaptive Monte Carlo methods Adaptive Monte Carlo methods Jean-Michel Marin Projet Select, INRIA Futurs, Université Paris-Sud joint with Randal Douc (École Polytechnique), Arnaud Guillin (Université de Marseille) and Christian Robert

More information

A quick introduction to Markov chains and Markov chain Monte Carlo (revised version)

A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) Rasmus Waagepetersen Institute of Mathematical Sciences Aalborg University 1 Introduction These notes are intended to

More information

SPECTRAL GAP FOR ZERO-RANGE DYNAMICS. By C. Landim, S. Sethuraman and S. Varadhan 1 IMPA and CNRS, Courant Institute and Courant Institute

SPECTRAL GAP FOR ZERO-RANGE DYNAMICS. By C. Landim, S. Sethuraman and S. Varadhan 1 IMPA and CNRS, Courant Institute and Courant Institute The Annals of Probability 996, Vol. 24, No. 4, 87 902 SPECTRAL GAP FOR ZERO-RANGE DYNAMICS By C. Landim, S. Sethuraman and S. Varadhan IMPA and CNRS, Courant Institute and Courant Institute We give a lower

More information

Introduction to Markov Chain Monte Carlo & Gibbs Sampling

Introduction to Markov Chain Monte Carlo & Gibbs Sampling Introduction to Markov Chain Monte Carlo & Gibbs Sampling Prof. Nicholas Zabaras Sibley School of Mechanical and Aerospace Engineering 101 Frank H. T. Rhodes Hall Ithaca, NY 14853-3801 Email: zabaras@cornell.edu

More information

Stat 516, Homework 1

Stat 516, Homework 1 Stat 516, Homework 1 Due date: October 7 1. Consider an urn with n distinct balls numbered 1,..., n. We sample balls from the urn with replacement. Let N be the number of draws until we encounter a ball

More information

Perturbed Proximal Gradient Algorithm

Perturbed Proximal Gradient Algorithm Perturbed Proximal Gradient Algorithm Gersende FORT LTCI, CNRS, Telecom ParisTech Université Paris-Saclay, 75013, Paris, France Large-scale inverse problems and optimization Applications to image processing

More information

Sampling from complex probability distributions

Sampling from complex probability distributions Sampling from complex probability distributions Louis J. M. Aslett (louis.aslett@durham.ac.uk) Department of Mathematical Sciences Durham University UTOPIAE Training School II 4 July 2017 1/37 Motivation

More information

Markov chain Monte Carlo methods in atmospheric remote sensing

Markov chain Monte Carlo methods in atmospheric remote sensing 1 / 45 Markov chain Monte Carlo methods in atmospheric remote sensing Johanna Tamminen johanna.tamminen@fmi.fi ESA Summer School on Earth System Monitoring and Modeling July 3 Aug 11, 212, Frascati July,

More information

Computer Practical: Metropolis-Hastings-based MCMC

Computer Practical: Metropolis-Hastings-based MCMC Computer Practical: Metropolis-Hastings-based MCMC Andrea Arnold and Franz Hamilton North Carolina State University July 30, 2016 A. Arnold / F. Hamilton (NCSU) MH-based MCMC July 30, 2016 1 / 19 Markov

More information

ON CONVERGENCE RATES OF GIBBS SAMPLERS FOR UNIFORM DISTRIBUTIONS

ON CONVERGENCE RATES OF GIBBS SAMPLERS FOR UNIFORM DISTRIBUTIONS The Annals of Applied Probability 1998, Vol. 8, No. 4, 1291 1302 ON CONVERGENCE RATES OF GIBBS SAMPLERS FOR UNIFORM DISTRIBUTIONS By Gareth O. Roberts 1 and Jeffrey S. Rosenthal 2 University of Cambridge

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics

Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics Eric Slud, Statistics Program Lecture 1: Metropolis-Hastings Algorithm, plus background in Simulation and Markov Chains. Lecture

More information

Resampling Markov chain Monte Carlo algorithms: Basic analysis and empirical comparisons. Zhiqiang Tan 1. February 2014

Resampling Markov chain Monte Carlo algorithms: Basic analysis and empirical comparisons. Zhiqiang Tan 1. February 2014 Resampling Markov chain Monte Carlo s: Basic analysis and empirical comparisons Zhiqiang Tan 1 February 2014 Abstract. Sampling from complex distributions is an important but challenging topic in scientific

More information

arxiv: v1 [stat.co] 28 Jan 2018

arxiv: v1 [stat.co] 28 Jan 2018 AIR MARKOV CHAIN MONTE CARLO arxiv:1801.09309v1 [stat.co] 28 Jan 2018 Cyril Chimisov, Krzysztof Latuszyński and Gareth O. Roberts Contents Department of Statistics University of Warwick Coventry CV4 7AL

More information

The simple slice sampler is a specialised type of MCMC auxiliary variable method (Swendsen and Wang, 1987; Edwards and Sokal, 1988; Besag and Green, 1

The simple slice sampler is a specialised type of MCMC auxiliary variable method (Swendsen and Wang, 1987; Edwards and Sokal, 1988; Besag and Green, 1 Recent progress on computable bounds and the simple slice sampler by Gareth O. Roberts* and Jerey S. Rosenthal** (May, 1999.) This paper discusses general quantitative bounds on the convergence rates of

More information

Stochastic Approximation Monte Carlo and Its Applications

Stochastic Approximation Monte Carlo and Its Applications Stochastic Approximation Monte Carlo and Its Applications Faming Liang Department of Statistics Texas A&M University 1. Liang, F., Liu, C. and Carroll, R.J. (2007) Stochastic approximation in Monte Carlo

More information

Lecture 8: The Metropolis-Hastings Algorithm

Lecture 8: The Metropolis-Hastings Algorithm 30.10.2008 What we have seen last time: Gibbs sampler Key idea: Generate a Markov chain by updating the component of (X 1,..., X p ) in turn by drawing from the full conditionals: X (t) j Two drawbacks:

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet.

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet. Stat 535 C - Statistical Computing & Monte Carlo Methods Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Introduction to Markov chain Monte Carlo The Gibbs Sampler Examples Overview of the Lecture

More information

eqr094: Hierarchical MCMC for Bayesian System Reliability

eqr094: Hierarchical MCMC for Bayesian System Reliability eqr094: Hierarchical MCMC for Bayesian System Reliability Alyson G. Wilson Statistical Sciences Group, Los Alamos National Laboratory P.O. Box 1663, MS F600 Los Alamos, NM 87545 USA Phone: 505-667-9167

More information

Inference in state-space models with multiple paths from conditional SMC

Inference in state-space models with multiple paths from conditional SMC Inference in state-space models with multiple paths from conditional SMC Sinan Yıldırım (Sabancı) joint work with Christophe Andrieu (Bristol), Arnaud Doucet (Oxford) and Nicolas Chopin (ENSAE) September

More information

Brief introduction to Markov Chain Monte Carlo

Brief introduction to Markov Chain Monte Carlo Brief introduction to Department of Probability and Mathematical Statistics seminar Stochastic modeling in economics and finance November 7, 2011 Brief introduction to Content 1 and motivation Classical

More information

Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond Lecture 2: From Linear Regression to Kalman Filter and Beyond Department of Biomedical Engineering and Computational Science Aalto University January 26, 2012 Contents 1 Batch and Recursive Estimation

More information

TUNING OF MARKOV CHAIN MONTE CARLO ALGORITHMS USING COPULAS

TUNING OF MARKOV CHAIN MONTE CARLO ALGORITHMS USING COPULAS U.P.B. Sci. Bull., Series A, Vol. 73, Iss. 1, 2011 ISSN 1223-7027 TUNING OF MARKOV CHAIN MONTE CARLO ALGORITHMS USING COPULAS Radu V. Craiu 1 Algoritmii de tipul Metropolis-Hastings se constituie într-una

More information

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3 Brownian Motion Contents 1 Definition 2 1.1 Brownian Motion................................. 2 1.2 Wiener measure.................................. 3 2 Construction 4 2.1 Gaussian process.................................

More information

Consistency of the maximum likelihood estimator for general hidden Markov models

Consistency of the maximum likelihood estimator for general hidden Markov models Consistency of the maximum likelihood estimator for general hidden Markov models Jimmy Olsson Centre for Mathematical Sciences Lund University Nordstat 2012 Umeå, Sweden Collaborators Hidden Markov models

More information

Bayesian inference: what it means and why we care

Bayesian inference: what it means and why we care Bayesian inference: what it means and why we care Robin J. Ryder Centre de Recherche en Mathématiques de la Décision Université Paris-Dauphine 6 November 2017 Mathematical Coffees Robin Ryder (Dauphine)

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D.

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Ruppert A. EMPIRICAL ESTIMATE OF THE KERNEL MIXTURE Here we

More information

STA205 Probability: Week 8 R. Wolpert

STA205 Probability: Week 8 R. Wolpert INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and

More information

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk John Lafferty School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 lafferty@cs.cmu.edu Abstract

More information

A Dirichlet Form approach to MCMC Optimal Scaling

A Dirichlet Form approach to MCMC Optimal Scaling A Dirichlet Form approach to MCMC Optimal Scaling Giacomo Zanella, Wilfrid S. Kendall, and Mylène Bédard. g.zanella@warwick.ac.uk, w.s.kendall@warwick.ac.uk, mylene.bedard@umontreal.ca Supported by EPSRC

More information

Bayesian computation for statistical models with intractable normalizing constants

Bayesian computation for statistical models with intractable normalizing constants Bayesian computation for statistical models with intractable normalizing constants Yves F. Atchadé, Nicolas Lartillot and Christian Robert (First version March 2008; this version Jan. 2010) Abstract: This

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model Thai Journal of Mathematics : 45 58 Special Issue: Annual Meeting in Mathematics 207 http://thaijmath.in.cmu.ac.th ISSN 686-0209 The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for

More information

A Geometric Interpretation of the Metropolis Hastings Algorithm

A Geometric Interpretation of the Metropolis Hastings Algorithm Statistical Science 2, Vol. 6, No., 5 9 A Geometric Interpretation of the Metropolis Hastings Algorithm Louis J. Billera and Persi Diaconis Abstract. The Metropolis Hastings algorithm transforms a given

More information

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm On the Optimal Scaling of the Modified Metropolis-Hastings algorithm K. M. Zuev & J. L. Beck Division of Engineering and Applied Science California Institute of Technology, MC 4-44, Pasadena, CA 925, USA

More information

MCMC and Gibbs Sampling. Kayhan Batmanghelich

MCMC and Gibbs Sampling. Kayhan Batmanghelich MCMC and Gibbs Sampling Kayhan Batmanghelich 1 Approaches to inference l Exact inference algorithms l l l The elimination algorithm Message-passing algorithm (sum-product, belief propagation) The junction

More information

Kernel Sequential Monte Carlo

Kernel Sequential Monte Carlo Kernel Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) * equal contribution April 25, 2016 1 / 37 Section

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos Contents Markov Chain Monte Carlo Methods Sampling Rejection Importance Hastings-Metropolis Gibbs Markov Chains

More information

An ABC interpretation of the multiple auxiliary variable method

An ABC interpretation of the multiple auxiliary variable method School of Mathematical and Physical Sciences Department of Mathematics and Statistics Preprint MPS-2016-07 27 April 2016 An ABC interpretation of the multiple auxiliary variable method by Dennis Prangle

More information

A new iterated filtering algorithm

A new iterated filtering algorithm A new iterated filtering algorithm Edward Ionides University of Michigan, Ann Arbor ionides@umich.edu Statistics and Nonlinear Dynamics in Biology and Medicine Thursday July 31, 2014 Overview 1 Introduction

More information

SAMPLING ALGORITHMS. In general. Inference in Bayesian models

SAMPLING ALGORITHMS. In general. Inference in Bayesian models SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be

More information

Expectation Propagation Algorithm

Expectation Propagation Algorithm Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,

More information

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix Infinite-State Markov-switching for Dynamic Volatility Models : Web Appendix Arnaud Dufays 1 Centre de Recherche en Economie et Statistique March 19, 2014 1 Comparison of the two MS-GARCH approximations

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Advances and Applications in Perfect Sampling

Advances and Applications in Perfect Sampling and Applications in Perfect Sampling Ph.D. Dissertation Defense Ulrike Schneider advisor: Jem Corcoran May 8, 2003 Department of Applied Mathematics University of Colorado Outline Introduction (1) MCMC

More information

Chapter 7. Markov chain background. 7.1 Finite state space

Chapter 7. Markov chain background. 7.1 Finite state space Chapter 7 Markov chain background A stochastic process is a family of random variables {X t } indexed by a varaible t which we will think of as time. Time can be discrete or continuous. We will only consider

More information

Almost Sure Convergence of Two Time-Scale Stochastic Approximation Algorithms

Almost Sure Convergence of Two Time-Scale Stochastic Approximation Algorithms Almost Sure Convergence of Two Time-Scale Stochastic Approximation Algorithms Vladislav B. Tadić Abstract The almost sure convergence of two time-scale stochastic approximation algorithms is analyzed under

More information

Auxiliary Particle Methods

Auxiliary Particle Methods Auxiliary Particle Methods Perspectives & Applications Adam M. Johansen 1 adam.johansen@bristol.ac.uk Oxford University Man Institute 29th May 2008 1 Collaborators include: Arnaud Doucet, Nick Whiteley

More information

Lecture 12. F o s, (1.1) F t := s>t

Lecture 12. F o s, (1.1) F t := s>t Lecture 12 1 Brownian motion: the Markov property Let C := C(0, ), R) be the space of continuous functions mapping from 0, ) to R, in which a Brownian motion (B t ) t 0 almost surely takes its value. Let

More information

Monte Carlo Methods. Leon Gu CSD, CMU

Monte Carlo Methods. Leon Gu CSD, CMU Monte Carlo Methods Leon Gu CSD, CMU Approximate Inference EM: y-observed variables; x-hidden variables; θ-parameters; E-step: q(x) = p(x y, θ t 1 ) M-step: θ t = arg max E q(x) [log p(y, x θ)] θ Monte

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries Chapter 1 Measure Spaces 1.1 Algebras and σ algebras of sets 1.1.1 Notation and preliminaries We shall denote by X a nonempty set, by P(X) the set of all parts (i.e., subsets) of X, and by the empty set.

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Bayesian Inference for DSGE Models. Lawrence J. Christiano Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Bayesian inference Bayes rule. Monte Carlo integation.

More information

MCMC 2: Lecture 3 SIR models - more topics. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham

MCMC 2: Lecture 3 SIR models - more topics. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham MCMC 2: Lecture 3 SIR models - more topics Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham Contents 1. What can be estimated? 2. Reparameterisation 3. Marginalisation

More information

Asymptotics and Simulation of Heavy-Tailed Processes

Asymptotics and Simulation of Heavy-Tailed Processes Asymptotics and Simulation of Heavy-Tailed Processes Department of Mathematics Stockholm, Sweden Workshop on Heavy-tailed Distributions and Extreme Value Theory ISI Kolkata January 14-17, 2013 Outline

More information