On an adaptive preconditioned Crank-Nicolson algorithm for infinite dimensional Bayesian inferences
Noname manuscript No. (will be inserted by the editor)

On an adaptive preconditioned Crank-Nicolson algorithm for infinite dimensional Bayesian inferences

Zixi Hu · Zhewei Yao · Jinglai Li

Received: date / Accepted: date

This work was supported by the NSFC. ZH and ZY contributed equally to this work.
Z. Hu, Z. Yao: Department of Mathematics and Zhiyuan College, Shanghai Jiao Tong University, 800 Dongchuan Rd, Shanghai 200240, China.
J. Li: Institute of Natural Sciences, Department of Mathematics, and the MOE Key Laboratory of Scientific and Engineering Computing, Shanghai Jiao Tong University, 800 Dongchuan Rd, Shanghai 200240, China. jinglaili@sjtu.edu.cn

Abstract The preconditioned Crank-Nicolson (pCN) method is an MCMC algorithm for implementing Bayesian inferences in function spaces. A remarkable feature of the algorithm is that, unlike many usual MCMC algorithms, which become arbitrarily slow under mesh refinement, the efficiency of the pCN algorithm is dimension independent. In this work we develop an adaptive version of the pCN algorithm, in which the proposal is adaptively improved based on the sample history. Under the chosen parametrization of the proposal distribution, the proposal parameters can be efficiently updated in our algorithm. We show that the resulting adaptive pCN algorithm is dimension independent and has the correct ergodicity properties. Finally we provide numerical examples to demonstrate the efficiency of the proposed algorithm.

Keywords Bayesian inference · covariance operator · dimension independence · Markov Chain Monte Carlo

Mathematics Subject Classification (2010) 62F15 · 65C05

1 Introduction

Many scientific problems, such as nonparametric regression [12] and inverse problems [14, 26], require performing Bayesian inferences in function spaces. In
practice, the posterior distribution often does not admit a closed form and needs to be computed numerically. Specifically, one first represents the unknown function with a finite-dimensional parametrization, for example by discretizing the function on a pre-determined mesh grid, and then solves the resulting finite dimensional inference problem with Markov Chain Monte Carlo (MCMC) simulations. It is well known that standard MCMC algorithms, such as random walk Metropolis-Hastings (RWMH), can become arbitrarily slow as the discretization mesh of the unknown is refined [21, 23, 4, 18]. That is, the mixing time of an algorithm can increase to infinity as the dimension of the discretized parameter approaches infinity, in which case the algorithm is said to be dimension-dependent. To this end, a very interesting line of research is to develop dimension-independent MCMC algorithms by requiring the algorithms to be well defined in function spaces. In particular, a family of dimension-independent MCMC algorithms was presented in [7], constructed from a Crank-Nicolson discretization of a stochastic partial differential equation (SPDE) that preserves the reference measure. Just as in finite dimensional problems, one can improve the sampling efficiency of infinite dimensional MCMC by incorporating the data information in the proposal design. To this end, a very popular class of methods guides the proposal with local derivative information of the likelihood function. Such derivative based methods include: the stochastic Newton MCMC [17, 19], the operator-weighted proposal method [16], the infinite-dimensional Metropolis-adjusted Langevin algorithm (MALA) [5, 3], the dimension-independent likelihood-informed (DILI) MCMC [8], and the generalized preconditioned Crank-Nicolson (gpCN) algorithm [24], just to name a few. In this work, we focus on an alternative way of utilizing the data information, namely adaptive MCMC (cf.
[1, 2, 22] and the references therein), which adjusts the proposal based on the sample history. A major advantage of adaptive methods is that they do not require knowledge of the gradient, which makes them particularly convenient for problems with black-box models. In a recent work [10], we developed an adaptive independence sampler MCMC algorithm for infinite dimensional problems. A major limitation of independence sampler MCMC algorithms is that their efficiency depends critically on the ability of the chosen proposal, often in a parametrized form, to approximate the posterior in the entire state space, and such an algorithm may perform very poorly if the proposal cannot approximate the posterior distribution well. In this respect, random walk based algorithms may be advantageous, as they do not require such a global proposal. In this work, we present an adaptive random walk MCMC method based on the preconditioned Crank-Nicolson (pCN) algorithm of [7]. Specifically, we adaptively adjust the preconditioning operator in the pCN algorithm to improve the sampling efficiency. We parametrize the preconditioning operator in a specific form that has been used in [2, 10], and we provide an algorithm that can efficiently update the parameter values as the iteration proceeds. By design, the acceptance probability is well defined and thus the algorithm is dimension independent. In addition, our algorithm ensures that the acceptance probability
is the same as in the standard pCN algorithm, which is independent of the proposal distribution. Finally we note that an important issue in designing an adaptive MCMC algorithm is to preserve the ergodicity while allowing the proposal distribution to vary during the iterations. Following the roadmap outlined in [10], we provide some theoretical results regarding the ergodicity of the proposed algorithm. We note that two methods similar to ours are the gpCN in [24] and the dimension independent adaptive Metropolis (DIAM) proposed in [6]. Compared to the gpCN method, our algorithm uses a specific parametrized form of the proposal, and as a result the parameters can be updated very efficiently, which makes an adaptive algorithm feasible. The DIAM is also an adaptive MCMC algorithm, and the major difference between it and our method is that, by design, our method preserves an important feature of the standard pCN algorithm, namely that the acceptance probability is independent of the proposal distribution. It should also be noted that our algorithm is specifically designed for Gaussian priors; there are also works concerning MCMC algorithms for non-Gaussian priors [27, 28]. The rest of the paper is organized as follows. In Section 2 we describe the setup of infinite dimensional inference problems and present our adaptive pCN algorithm in detail. In Section 3 we provide several numerical examples to demonstrate the performance of the proposed algorithm. Finally we offer some concluding remarks in Section 4.

2 The adaptive preconditioned Crank-Nicolson algorithm

2.1 Problem setup

We present the standard setup of the problem following [26]. We consider a separable Hilbert space X with inner product ⟨·,·⟩_X. Our goal is to estimate an unknown u ∈ X from data y ∈ Y, where Y is the data space and y is related to u via a likelihood function L^y(u).
In the Bayesian inference we assume that the prior µ_0 of u is a (without loss of generality) zero-mean Gaussian measure defined on X with covariance operator C_0, i.e., µ_0 = N(0, C_0). Note that C_0 is symmetric, positive and of trace class. The range of C_0^{1/2},
E = {u = C_0^{1/2} x : x ∈ X} ⊂ X,
which is a Hilbert space equipped with the inner product [9]
⟨·,·⟩_E = ⟨C_0^{−1/2}·, C_0^{−1/2}·⟩_X,
is called the Cameron-Martin space of the measure µ_0. In this setting, the posterior measure µ^y of u conditional on the data y is given by the Radon-Nikodym derivative
dµ^y/dµ_0 (u) = L^y(u),   (2.1)
which can be interpreted as the Bayes rule in the infinite dimensional setting. In a standard setting, the likelihood function takes the form
L^y(u) = (1/Z) exp(−Φ^y(u)),   (2.2)
where Z is a normalization constant. In what follows, without causing any ambiguity, we shall drop the superscript y in Φ^y, L^y and µ^y for simplicity, while keeping in mind that these functions depend on the data y. For the inference problem to be well-posed, one typically requires the functional Φ to satisfy the Assumptions (6.1) in [7]. Finally we quote the following lemma ([9], Chapter 1), which will be useful later:

Lemma 1 There exist a complete orthonormal basis {e_j}_{j∈N} on X and a sequence of non-negative numbers {α_j}_{j∈N} such that C_0 e_j = α_j e_j and Σ_{j=1}^∞ α_j < ∞; i.e., {e_j}_{j∈N} and {α_j}_{j∈N} are the eigenfunctions and eigenvalues of C_0, respectively.

2.2 The Crank-Nicolson algorithms

We start by briefly reviewing the family of Crank-Nicolson algorithms for infinite dimensional Bayesian inferences developed in [7]. Simply speaking, the algorithms are based on the stochastic partial differential equation (SPDE)
du/ds = −K L u + √(2K) db/ds,   (2.3)
where L = C_0^{−1} is the precision operator of µ_0, K is a positive operator, and b is a Brownian motion in X whose covariance operator is the identity. The proposal is then derived by applying the Crank-Nicolson (CN) scheme to the SPDE (2.3), yielding
v = u − (δ/2) K L (u + v) + √(2Kδ) ξ,   (2.4)
for a white noise ξ and δ ∈ (0, 2). In [7], two choices of K are proposed, resulting in two different algorithms. First, one can choose K = I, the identity, obtaining
(2C_0^{−1} + δI) v = (2C_0^{−1} − δI) u + √(8δ) ξ,
which is known as the plain Crank-Nicolson (CN) algorithm. Alternatively one can choose K = C_0, resulting in the preconditioned Crank-Nicolson (pCN) proposal:
v = (1 − β²)^{1/2} u + β w,   (2.5)
where w ~ N(0, C_0) and
β = √(8δ) / (2 + δ).
It is easy to see that β ∈ [0, 1]. In both the CN and pCN algorithms, the acceptance probability is
a(v, u) = min{1, L(v)/L(u)}.   (2.6)
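The pCN update (2.5)-(2.6) is straightforward to implement once a prior sample can be drawn via the eigenpairs of Lemma 1. The following is a minimal sketch: the sine eigenbasis, the eigenvalues α_j = j^{-2}, and the toy Gaussian log-likelihood are illustrative assumptions of ours, not the setup used in the paper.

```python
import numpy as np

def sample_prior(alpha, basis, rng):
    """Draw w ~ N(0, C_0) via the Karhunen-Loeve expansion
    w = sum_j sqrt(alpha_j) xi_j e_j, using the eigenpairs of Lemma 1."""
    xi = rng.standard_normal(len(alpha))
    return basis @ (np.sqrt(alpha) * xi)

def pcn_step(u, log_like, alpha, basis, beta, rng):
    """One pCN step, Eqs. (2.5)-(2.6): propose
    v = sqrt(1 - beta^2) u + beta w with w ~ N(0, C_0),
    and accept with probability min{1, L(v)/L(u)}."""
    w = sample_prior(alpha, basis, rng)
    v = np.sqrt(1.0 - beta**2) * u + beta * w
    if np.log(rng.uniform()) < log_like(v) - log_like(u):
        return v, True
    return u, False

# Illustrative setup (not the paper's): sine eigenbasis on [0, 1],
# eigenvalues alpha_j = j^{-2}, and a toy Gaussian log-likelihood.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 101)
j = np.arange(1, 51)
alpha = 1.0 / j**2
basis = np.sqrt(2.0) * np.sin(np.pi * np.outer(t, j))
log_like = lambda u: -0.5 * np.sum((u - 1.0)**2) / 101
u = np.zeros_like(t)
for _ in range(200):
    u, _accepted = pcn_step(u, log_like, alpha, basis, 0.2, rng)
```

Note that, as in the algorithm itself, the likelihood enters only through the acceptance ratio; the proposal uses the prior alone.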
2.3 Parametrizing the operator K

A natural extension of the CN and pCN algorithms (also proposed in [7]) is to consider other choices of the operator K to improve the algorithm efficiency. To this end, we first rewrite the proposal (2.4) as
v = (I + (δ/2) K L)^{−1} (I − (δ/2) K L) u + (I + (δ/2) K L)^{−1} √(2δK) ξ.   (2.7)
Before discussing specific choices of the operator K, we present the following proposition regarding the acceptance probability:

Proposition 1 Suppose the operator K is symmetric, positive and of trace class. Let q(u, ·) be the proposal distribution associated to Eq. (2.7). Define the measures η(du, dv) = q(u, dv) µ(du) and η′(du, dv) = q(v, du) µ(dv) on X × X. If K commutes with C_0, then η′ is absolutely continuous with respect to η, and
dη′/dη (u, v) = L(v)/L(u).

Proof Define η_0(du, dv) = q(u, dv) µ_0(du). The measure η_0 is Gaussian. Since K and C_0 commute, we have
E^{η_0}[v ⊗ v] = (I + (δ/2) K L)^{−2} [(I − (δ/2) K L)² C_0 + 2δK] = C_0 = E^{η_0}[u ⊗ u].
Hence η_0 is symmetric in (u, v). Now η(du, dv) = q(u, dv) µ(du), η_0(du, dv) = q(u, dv) µ_0(du), and µ, µ_0 are equivalent. It follows that η and η_0 are equivalent and
dη/dη_0 (u, v) = dµ/dµ_0 (u) = L(u).
Since η_0 is symmetric in (u, v), we also have that η′ and η_0 are equivalent and that
dη′/dη_0 (u, v) = L(v).
Since equivalence of measures is transitive, it follows that η′ and η are equivalent and
dη′/dη (u, v) = L(v)/L(u).
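Because K and C_0 share eigenfunctions, the covariance identity in the proof can be checked one eigencomponent at a time: with a_j = 1 + (δ/2)λ_j/α_j and b_j = 1 − (δ/2)λ_j/α_j, the proposal (2.7) maps a component with variance α_j to one with variance (b_j² α_j + 2δλ_j)/a_j², which equals α_j. A quick numerical check of this scalar identity, with arbitrary illustrative values of our own choosing:

```python
import numpy as np

# Componentwise check that the proposal (2.7) preserves the prior variance:
# Var(v_j) = (b_j^2 alpha_j + 2 delta lam_j) / a_j^2 should equal alpha_j.
delta, alpha, lam = 0.3, 2.0, 0.7        # arbitrary illustrative values
a = 1.0 + 0.5 * delta * lam / alpha      # eigenvalue of I + (delta/2) K L
b = 1.0 - 0.5 * delta * lam / alpha      # eigenvalue of I - (delta/2) K L
var_v = (b**2 * alpha + 2.0 * delta * lam) / a**2
print(np.isclose(var_v, alpha))          # prints True: prior variance preserved
```

This is exactly the statement E^{η_0}[v ⊗ v] = C_0 restricted to a single eigendirection; it holds for any positive δ, α_j and λ_j.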
Now we discuss how to specify the operator K, and we start by assuming for K an appropriate parametrized form. Note that an essential condition in Proposition 1 is that K must commute with C_0. To satisfy this condition, it is convenient to design a K that has common eigenfunctions with C_0. Namely, we write K in the form
K = C_0 + H,   (2.8a)
where H is defined as
H = Σ_{j=1}^J h_j ⟨e_j, ·⟩ e_j,   (2.8b)
with h_j being coefficients. Here J is a prescribed positive integer that is smaller than or equal to the dimensionality of the problem. It is easy to see that K is a symmetric operator with eigenvalue-eigenfunction pairs {λ_j, e_j}_{j=1}^∞, where λ_j = α_j + h_j for j = 1, ..., J and λ_j = α_j for j = J+1, ..., which implies that K and C_0 commute.

2.4 The adaptive algorithm

A well-adopted rule in designing efficient MCMC algorithms is that the proposal covariance should be close to the covariance operator of the posterior [23, 11]. Next we give a heuristic argument for our method of determining the operator K. In the case of small δ, the proposal (2.7) is approximately equal to
v ≈ u + √(2δ) w, where w ~ N(0, K),
which implies that K provides an approximation to the proposal covariance in this case. Thus we shall require K to be close to the posterior covariance. Note that such an approximation is only valid for small δ in principle, and thus we recommend not using a very large δ in the proposed algorithm. Now suppose the posterior covariance is C; one can then determine K by
min_{{h_j}_{j=1}^J} ‖K − C‖,   (2.9)
where ‖·‖ is the Hilbert-Schmidt norm and K is given by Eq. (2.8). By some basic algebra, we can show that the optimal solution of Eq. (2.9) is
h_j = ⟨C e_j, e_j⟩ − α_j, or equivalently λ_j = ⟨C e_j, e_j⟩, for j = 1, ..., J.
Since C is the posterior covariance, for any v and v′ ∈ X we have [9]
⟨C v, v′⟩ = ∫ ⟨v, u − m⟩ ⟨v′, u − m⟩ µ(du),   (2.10)
where m is the mean of µ. Using Eq. (2.10), we can derive that
h_j = ∫ (x_j − u_j)² dµ − α_j, or λ_j = ∫ (x_j − u_j)² dµ,   (2.11)
where x_j = ⟨m, e_j⟩ and u_j = ⟨u, e_j⟩ for j = 1, ..., J. In practice, the posterior covariance C is not directly available, and so here we determine the operator K with an adaptive MCMC algorithm. Simply speaking, the adaptive algorithm starts with an initial guess of K and then adaptively updates K based on the sample history of the posterior. The essential part of the algorithm is to update K, i.e., to estimate the values of h_j, from the posterior samples. To this end, suppose we have a set of posterior samples {u^i}_{i=0}^n; the values of the parameters h_j are then estimated using the sample average approximation of Eq. (2.11):
x_j^n = (1/(n+1)) Σ_{i=0}^n ⟨u^i, e_j⟩,   (2.12a)
s_j^n = Σ_{i=0}^n (u_j^i)²,   (2.12b)
h_j^n = s_j^n/(n+1) − (x_j^n)² + ε − α_j,   (2.12c)
for j = 1, ..., J. Here ε is a small positive constant, introduced to ensure the stability of the algorithm, i.e., to prevent the eigenvalues λ_j^n = α_j + h_j^n from degenerating. For efficiency's sake, we can rewrite Eq. (2.12) in the recursive form
x_j^n = (n/(n+1)) x_j^{n−1} + (1/(n+1)) ⟨u^n, e_j⟩,   (2.13a)
s_j^n = s_j^{n−1} + (u_j^n)²,   (2.13b)
h_j^n = s_j^n/(n+1) − (x_j^n)² + ε − α_j,   (2.13c)
for j = 1, ..., J and n > 0. Let us denote the operator K resulting from {h_j^n}_{j=1}^J by K_n; it is easy to see that K_n is symmetric, positive and of trace class. As a result we can rewrite the proposal as
v = (I + (δ/2) K_n L)^{−1} (I − (δ/2) K_n L) u + √(2δ) (I + (δ/2) K_n L)^{−1} w,   (2.14)
where w ~ N(0, K_n). Finally we note that it is not robust to estimate the parameter values from a very small number of samples; to address this issue, we first draw a certain number of samples with the standard pCN algorithm and only then start the adaptation. We describe the complete adaptive pCN (ApCN) algorithm in Algorithm 1.
Algorithm 1 The adaptive pCN algorithm
1: Initialize u^0 ∈ S;
2: for n = 0 to n_0 − 1 do
3:   Propose v using Eq. (2.5);
4:   Draw ρ ~ U[0, 1];
5:   Let a := min{1, L(v)/L(u^n)};
6:   if ρ ≤ a then
7:     u^{n+1} = v;
8:   else
9:     u^{n+1} = u^n;
10:  end if
11: end for
12: Compute {x_j^{n_0}, s_j^{n_0}, h_j^{n_0}}_{j=1}^J using Eq. (2.12) and the samples {u^i}_{i=0}^{n_0};
13: for n = n_0 to N − 1 do
14:   Compute K_n from Eq. (2.8) with {h_j^n}_{j=1}^J;
15:   Propose v using Eq. (2.14);
16:   Draw ρ ~ U[0, 1];
17:   Let a := min{1, L(v)/L(u^n)};
18:   if ρ ≤ a then
19:     u^{n+1} = v;
20:   else
21:     u^{n+1} = u^n;
22:   end if
23:   Compute {h_j^{n+1}}_{j=1}^J using Eqs. (2.13);
24: end for

2.5 Ergodicity analysis

As has been mentioned, an important issue for an adaptive MCMC algorithm is to verify that it has the correct ergodic properties. Directly proving the ergodicity in the infinite dimensional setting is rather challenging. However, as the algorithm must eventually be implemented in a finite dimensional setting, it is reasonable to consider the ergodic properties of the finite dimensional implementation instead. Namely, we first approximate u with a d-dimensional representation, say z = P_d u. In this case, the state space X becomes R^d and the prior µ_0(dz) of z reduces to a d-variate Gaussian distribution over R^d. We shall now perform the ergodicity analysis on this finite dimensional problem. In particular we follow the analysis outlined in [10], which requires a small modification of the likelihood function (2.2):
L_S(z) = (1/Z) exp(−Φ(z)) for z ∈ S, and L_S(z) = 0 for z ∉ S.   (2.15)
Here S = {z ∈ R^d : ‖z‖_2² < R}, where R > 0 is a positive constant that can be chosen arbitrarily. The posterior of z becomes dµ^d/dµ_0(z) = L_S(z).
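Before turning to the theory, note that the adaptive phase of Algorithm 1 is easy to sketch in the shared eigenbasis of C_0 and K_n, where the proposal (2.14) acts componentwise. The following is a minimal sketch under that diagonal representation; the function and variable names are ours, not the paper's.

```python
import numpy as np

def apcn_proposal(u, alpha, lam, delta, rng):
    """Proposal (2.14) in the shared eigenbasis of C_0 and K_n:
    componentwise v_j = (b_j u_j + sqrt(2 delta) w_j) / a_j, with
    a_j = 1 + (delta/2) lam_j/alpha_j, b_j = 1 - (delta/2) lam_j/alpha_j,
    and w_j ~ N(0, lam_j)."""
    a = 1.0 + 0.5 * delta * lam / alpha
    b = 1.0 - 0.5 * delta * lam / alpha
    w = np.sqrt(lam) * rng.standard_normal(u.shape)
    return (b * u + np.sqrt(2.0 * delta) * w) / a

def update_params(n, u, x, s, alpha, eps):
    """Recursive updates (2.13): running mean x_j, running second
    moment s_j, and the coefficients h_j = s/(n+1) - x^2 + eps - alpha,
    so that lam_j = alpha_j + h_j stays bounded away from zero."""
    x = n / (n + 1.0) * x + u / (n + 1.0)
    s = s + u**2
    h = s / (n + 1.0) - x**2 + eps - alpha
    return x, s, h
```

As a sanity check, feeding `update_params` i.i.d. samples of known variance drives λ_j = α_j + h_j toward that variance (plus ε), which is exactly the target λ_j = ∫(x_j − u_j)² dµ of Eq. (2.11).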
We emphasize that modifying the likelihood function is only for the convenience of the proof (as the technique employed in [10] requires the posterior support to be bounded), and clearly the modified likelihood function approximates the original one well provided a sufficiently large R is chosen. In this setting we have the following theorem, indicating the ergodicity of our algorithm:

Theorem 1 The chain {z^n} generated by Algorithm 1, with any initial distribution (the distribution of z^0) on S, simulates properly the target distribution µ^d: for any bounded and µ^d-measurable function f : S → R, the equality
lim_{n→∞} (1/(n+1)) Σ_{i=0}^n f(z^i) = E_{µ^d}[f(z)]
holds almost surely.

We leave the proof to Appendix A.

3 Numerical examples

3.1 An ODE example

Our first example is a simple inverse problem where the forward model is governed by an ordinary differential equation (ODE):
dx/dt = u(t) x(t),
with a prescribed initial condition. We assume that we observe the solution x(t) several times in the interval [0, T], and we want to infer the unknown coefficient u(t). In our experiments, we let the initial condition be x(0) = 1 and T = 1. Now suppose that the solution is measured every T/10 time units from 0 to T, and the error in each measurement is assumed to be an independent Gaussian N(0, 0.5²). The data is generated by applying the forward model to a true coefficient and then adding noise to the result. The data and the truth used to generate the data are shown in Fig. 1. In the inference, the unknown u is represented on a set of equally spaced grid points. The prior is chosen to be a zero-mean Gaussian measure on X with an exponential covariance function: K(t_1, t_2) = exp(−|t_1 − t_2|/2). We sample the posterior with both the pCN and the ApCN algorithms, each with 10^6 samples. In the pCN, we choose β = 1/5 and in the ApCN we choose δ = 1/4. These parameter values are chosen so that the two algorithms result in reasonable acceptance probabilities.
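The forward model of this example and the generation of the synthetic data can be sketched as follows. The ODE and x(0) = 1 follow our reading of the text; the trapezoidal quadrature and the noise level passed in below are our own illustrative choices.

```python
import numpy as np

def forward(u_vals, t_grid, x0=1.0):
    """Solve dx/dt = u(t) x(t), x(0) = x0, on t_grid. Since the ODE is
    linear in x, x(t) = x0 * exp(int_0^t u(s) ds); the integral is
    approximated with the trapezoidal rule on the grid."""
    increments = np.diff(t_grid) * 0.5 * (u_vals[:-1] + u_vals[1:])
    return x0 * np.exp(np.concatenate(([0.0], np.cumsum(increments))))

def make_data(u_true, t_grid, obs_idx, sigma, rng):
    """Generate noisy observations y_k = x(t_k) + noise,
    noise ~ N(0, sigma^2) i.i.d., at the grid indices obs_idx."""
    x = forward(u_true, t_grid)
    return x[obs_idx] + sigma * rng.standard_normal(len(obs_idx))
```

For a constant coefficient u ≡ c the sketch reproduces the exact solution x(T) = exp(cT), which gives a convenient correctness check.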
Moreover, in the ApCN algorithm, we choose J = 10 and ε = 10^{−4}. The average acceptance probability of pCN is 28% and that of ApCN is 30%. First we shall show
Fig. 1 (for the ODE example) Left: the true coefficient. Right: the data generated with the true coefficient: the blue solid line is the simulated data without observation noise and the red dashed line is the simulated data with observation noise.

that the adaptation diminishes as the number of iterations increases. Thus, in Fig. 2 we plot the estimated values of two of the parameters λ_j as functions of the number of iterations, and we can see from the plots that the values of these two parameters converge as the iterations proceed. Next we shall compare the performance of the two algorithms; a commonly used performance indicator is the autocorrelation function (ACF). We particularly consider the unknown at t = 0.2, 0.5 and 0.8, and we plot the ACF for all three points in Fig. 3. One can see from the figure that, for all three points, the ACF of the chain generated by the ApCN decreases much faster than that of the standard pCN, suggesting that the ApCN method achieves a significantly higher efficiency. Alternatively, we compute the lag-1 ACF at all the grid points, which is plotted in Fig. 4 (left), and we can see that the ACF of the chain generated by the ApCN is much lower than that of the standard pCN at all the grid points. The effective sample size (ESS) is another popular measure of the sampling efficiency of MCMC [15]. The ESS is computed by
ESS = N / (1 + 2τ),
where τ is the integrated autocorrelation time and N is the total sample size; it gives an estimate of the number of effectively independent draws in the chain. We compute the ESS of the unknown u at each grid point and show the results in Fig. 4 (right). The results show that the ApCN algorithm produces many more effectively independent samples than the standard pCN.

3.2 Estimating the Robin coefficient

In this example we consider the one dimensional heat conduction equation in the region x ∈ [0, L]:
∂u/∂t (x, t) = ∂²u/∂x² (x, t),   (3.1a)
u(x, 0) = g(x),   (3.1b)
Fig. 2 (for the ODE example) The estimates of two of the parameters λ_j plotted as functions of the number of iterations.

Fig. 3 (for the ODE example) Autocorrelation functions (ACF) for the pCN and the ApCN methods at different grid points: from left to right, t = 0.2, t = 0.5 and t = 0.8.

Fig. 4 (for the ODE example) Left: the lag-1 ACF for u at each grid point. Right: the effective sample size (ESS) for u at each grid point.

with the following Robin boundary conditions:
−∂u/∂x (0, t) + ρ(t) u(0, t) = h_0(t),   (3.1c)
∂u/∂x (L, t) + ρ(t) u(L, t) = h_1(t).   (3.1d)
Suppose the functions g(x), h_0(t) and h_1(t) are all known, and we want to estimate the unknown Robin coefficient ρ(t) from certain measurements of the temperature u(x, t). The Robin coefficient ρ(t) characterizes the thermal properties of the conductive medium on the interface, which in turn provides information on certain physical processes near the boundary, e.g., corrosion [13]. In this example we choose L = 1, T = 1 and the functions
g(x) = x² + 1, h_0(t) = t(2t + 1), h_1(t) = 2 + t(2t + 2).
The solution is measured every T/5 time units from 0 to T, and the error in each measurement is assumed to be an independent Gaussian N(0, 0.5²). The true Robin coefficient and the resulting data are shown in Fig. 5. In the computation, the unknown is represented on a set of equally spaced grid points. Moreover, the prior is the same as that used in the ODE example. We sample the posterior with both the pCN and the ApCN algorithms, each with 10^6 samples. In the pCN, we choose β = 1/4 and in the ApCN we choose δ = 2/10. We choose J = 10 and ε = 10^{−4} in the ApCN algorithm. The average acceptance probability of pCN is 28% and that of ApCN is 30%.

Fig. 5 (for the Robin example) Left: the true Robin coefficient. Right: the data generated with the true coefficient: the solid line is the simulated data without observation noise and the dashed line is the simulated data with observation noise.

Fig. 6 (for the Robin example) The estimates of two of the parameters λ_j plotted as functions of the number of iterations.

As in the ODE example, we first plot the estimated values of two of the parameters λ_j as functions of the number of iterations in Fig. 6, where we can observe the convergence of the two parameters. We then plot the ACF for the unknown at the grid points t = 0.2, 0.5 and 0.8 in Fig. 7. Next we compute the lag-1 ACF at all the grid points, and plot the results in Fig. 8 (left). In all these
Fig. 7 (for the Robin example) Autocorrelation functions (ACF) for the pCN and the ApCN methods at different grid points: from left to right, t = 0.2, t = 0.5 and t = 0.8.

Fig. 8 (for the Robin example) Left: the lag-1 ACF for u at each grid point. Right: the effective sample size (ESS) for u at each grid point.

ACF plots, we can see that the results of our ApCN algorithm are significantly better than those of the standard pCN method. Finally we compute the ESS of the unknown u at each grid point and show the results in Fig. 8 (right), which once again indicates that the ApCN algorithm clearly outperforms the standard pCN.

4 Conclusions

In summary, we consider MCMC simulations for Bayesian inferences in function spaces. In particular, we develop an adaptive version of the pCN algorithm to improve the sampling efficiency. The implementation of the ApCN algorithm is rather simple, requiring no information about the underlying models, and during the iteration the proposal can be efficiently updated with explicit formulas. We also show that the adaptive algorithm has the correct ergodicity property. Finally we demonstrate the effectiveness and efficiency of the ApCN algorithm with several numerical examples. We expect the ApCN algorithm to be of use in many practical problems, especially those involving black-box models. It should be noted that, in the present work, we consider the ergodicity properties of the finite dimensional approximation of the algorithm. It is certainly desirable to ensure that the infinite dimensional MCMC algorithm itself has the correct ergodicity properties, which may require certain modifications of the present adaptive algorithm. We plan to work on this problem in the future.

A Proof of Theorem 1

Recall that, in the finite dimensional setting, our target distribution µ^d is supported on S. Let M(S) denote the set of finite measures on S; the norm on M(S) is the total variation norm. Let K_n(z^0, ..., z^{n−2}, z) be the operator K at step n computed from z^0, ..., z^{n−2}, z (i.e., z = z^{n−1}). For simplicity, let ζ_{n−2} = (z^0, ..., z^{n−2}) and K_{n,ζ_{n−2}}(z) = K_n(z^0, ..., z^{n−2}, z). Let q_{n,ζ_{n−2}}(z; ·) be the proposal distribution given by
v = (I + (δ/2) K_{n,ζ_{n−2}}(z) L)^{−1} (I − (δ/2) K_{n,ζ_{n−2}}(z) L) z + √(2δ) (I + (δ/2) K_{n,ζ_{n−2}}(z) L)^{−1} w,
where w ~ N(0, K_{n,ζ_{n−2}}(z)). It should be noted that all the operators reduce to matrices in the finite dimensional setting. Then define
Q_{n,ζ_{n−2}}(z; dv) = acc(z, v) q_{n,ζ_{n−2}}(z, dv) + δ_z(dv) (1 − ∫ acc(z, x) q_{n,ζ_{n−2}}(z, dx))
as the transition probability at step n, where δ_z(·) is a point mass and the acceptance probability is acc(z, v) = min{1, L_S(v)/L_S(z)}. Also define Q_n(z^0, ..., z^{n−2}, z; dv) = Q_{n,ζ_{n−2}}(z; dv) as the transition probability from (z^0, ..., z^{n−2}, z) to v. Let T be a transition probability on S and set
Γ(T) = sup_{µ_1 ≠ µ_2} ‖µ_1 T − µ_2 T‖ / ‖µ_1 − µ_2‖,
where the supremum is taken over distinct probability measures µ_1, µ_2 on S. Now we introduce some notation. First, following [10], we use νT to denote the measure A ↦ ∫_S T(z; A) ν(dz), and for bounded measurable functions we write Tf(z) = ∫_S T(z; dy) f(y) as well as νf = ∫_S f(y) ν(dy). Then we have the following proposition:

Proposition 2 The transition probabilities (Q_n) satisfy the following three conditions:
I. There is a constant γ_1 ∈ (0, 1) such that Γ(Q_{n,ζ_{n−2}}) ≤ γ_1 < 1, for all ζ_{n−2} ∈ S^{n−1} and n ≥ 2.
II. There is a fixed positive constant γ_2 such that ‖Q_{n,ζ_{n−2}} − Q_{n+k,ζ_{n+k−2}}‖_{M(S)→M(S)} ≤ γ_2 k/n, where n, k ≥ 1 and one assumes that ζ_{n+k−2} is a direct continuation of ζ_{n−2}.
III. There is a constant γ_3 such that ‖µ^d Q_{n,ζ_{n−2}} − µ^d‖ ≤ γ_3/n, for all ζ_{n−2} ∈ S^{n−1} and n ≥ 2.
Proof (I). Let
A_{n,ζ_{n−2}}(z) = I + (δ/2) K_{n,ζ_{n−2}}(z) L, B_{n,ζ_{n−2}}(z) = I − (δ/2) K_{n,ζ_{n−2}}(z) L.
Define, for j = 1, ..., d,
a_{n,ζ_{n−2},j}(z) = 1 + (δ/2) λ_{n,ζ_{n−2},j}(z)/α_j and b_{n,ζ_{n−2},j}(z) = 1 − (δ/2) λ_{n,ζ_{n−2},j}(z)/α_j,
where λ_{n,ζ_{n−2},j}(z) are the eigenvalues of K_{n,ζ_{n−2}}(z). Obviously, a_{n,ζ_{n−2},j}(z) and b_{n,ζ_{n−2},j}(z) are the eigenvalues of A_{n,ζ_{n−2}}(z) and B_{n,ζ_{n−2}}(z), respectively, and for j = 1, ..., d we have 1 < a_{n,ζ_{n−2},j}(z) < M_1 and |b_{n,ζ_{n−2},j}(z)| < M_1 for a positive constant M_1. According to the proposal,
q_{n,ζ_{n−2}}(z; ·) = N(A_{n,ζ_{n−2}}(z)^{−1} B_{n,ζ_{n−2}}(z) z, 2δ A_{n,ζ_{n−2}}(z)^{−2} K_{n,ζ_{n−2}}(z)).
Since 1 < a_{n,ζ_{n−2},j}(z) < M_1 and, by design, M_2 ≤ λ_{n,ζ_{n−2},j}(z) ≤ M_3 for some constants M_2, M_3 > 0, we have
M_4 I ≤ 2δ A_{n,ζ_{n−2}}(z)^{−2} K_{n,ζ_{n−2}}(z) ≤ M_5 I
for some constants M_4, M_5 > 0. Moreover, for any z ∈ S there exists a constant M_6 > 0 such that
‖A_{n,ζ_{n−2}}(z)^{−1} B_{n,ζ_{n−2}}(z) z‖² = Σ_{j=1}^d a_{n,ζ_{n−2},j}(z)^{−2} b_{n,ζ_{n−2},j}(z)² ⟨z, e_j⟩² ≤ M_6.
Thus the density of q_{n,ζ_{n−2}}(z; ·) is bounded below on S. It then follows that q_{n,ζ_{n−2}}(z; A) ≥ c µ_0(A) for all z ∈ S, all A ⊂ S, and a constant c > 0, and hence Γ(Q_{n,ζ_{n−2}}) ≤ γ_1 < 1 (cf. [?]).

(II). For any given ζ_{n−2}, one has
‖Q_{n,ζ_{n−2}} − Q_{n+k,ζ_{n+k−2}}‖_{M(S)→M(S)} ≤ 2 sup_{z∈S, A⊂S} |Q_{n,ζ_{n−2}}(z; A) − Q_{n+k,ζ_{n+k−2}}(z; A)|.
We then can show that
|Q_{n,ζ_{n−2}}(z; A) − Q_{n+k,ζ_{n+k−2}}(z; A)| ≤ 2 ∫_{R^d} |q_{n,ζ_{n−2}}(z; v) − q̄(v)| dv + 2 ∫_{R^d} |q̄(v) − q_{n+k,ζ_{n+k−2}}(z; v)| dv,   (A.1)
where q̄ is the Gaussian density that has the same mean as q_{n,ζ_{n−2}}(z; ·) and the same covariance as q_{n+k,ζ_{n+k−2}}(z; ·). Let
I_1 = ∫_{R^d} |q_{n,ζ_{n−2}}(z; v) − q̄(v)| dv and I_2 = ∫_{R^d} |q̄(v) − q_{n+k,ζ_{n+k−2}}(z; v)| dv.
Let β_{n,j} = 2δ a_{n,ζ_{n−2},j}(z)^{−2} λ_{n,ζ_{n−2},j}(z). Then the β_{n,j} are the eigenvalues of the covariance of q_{n,ζ_{n−2}}(z, ·). It is easy to see that
|λ_{n,ζ_{n−2},j}(z) − λ_{n+k,ζ_{n+k−2},j}(z)| ≤ M_21 k/n,   (A.2)
for a constant M_21 > 0, and it follows that
|β_{n,j} − β_{n+k,j}| ≤ M_22 k/n,   (A.3)
for a constant M_22 > 0. Obviously, there is also a positive constant M_23 such that β_{n,j}, β_{n+k,j} ≥ M_23. We first consider I_1. Since the two densities share the same mean, we may center them and write
I_1 ≤ ∫_{R^d} | Π_{j=1}^d (2πβ_{n,j})^{−1/2} exp(−z_j²/(2β_{n,j})) − Π_{j=1}^d (2πβ_{n+k,j})^{−1/2} exp(−z_j²/(2β_{n+k,j})) | dz_1 ... dz_d.
Thanks to Eq. (A.3), by some elementary calculations we can show that I_1 ≤ M_24 k/n for some constant M_24 > 0. We now consider I_2, where the two densities share the same covariance and differ only in their means. Let
z̄ = A_{n,ζ_{n−2}}(z)^{−1} B_{n,ζ_{n−2}}(z) z − A_{n+k,ζ_{n+k−2}}(z)^{−1} B_{n+k,ζ_{n+k−2}}(z) z.
Here we have
I_2 ≤ ∫_{R^d} | Π_{j=1}^d (2πβ_{n+k,j})^{−1/2} exp(−(z_j − ⟨z̄, e_j⟩)²/(2β_{n+k,j})) − Π_{j=1}^d (2πβ_{n+k,j})^{−1/2} exp(−z_j²/(2β_{n+k,j})) | dz_1 ... dz_d.
Using Eq. (A.2), we have
|a_{n,ζ_{n−2},j}(z) − a_{n+k,ζ_{n+k−2},j}(z)| < M_26 k/n and |b_{n,ζ_{n−2},j}(z) − b_{n+k,ζ_{n+k−2},j}(z)| < M_26 k/n,
for some constant M_26 > 0. Thus we have
|⟨z̄, e_j⟩| = | ( b_{n,ζ_{n−2},j}(z)/a_{n,ζ_{n−2},j}(z) − b_{n+k,ζ_{n+k−2},j}(z)/a_{n+k,ζ_{n+k−2},j}(z) ) ⟨z, e_j⟩ | ≤ M_27 k/n,
and so I_2 ≤ M_28 k/n for a constant M_28 > 0. We thus conclude that
‖Q_{n,ζ_{n−2}} − Q_{n+k,ζ_{n+k−2}}‖_{M(S)→M(S)} ≤ γ_2 k/n
for some constant γ_2 > 0.
III. Assume that $\hat{K} = K_{n,\zeta_{n-3}}(z_{n-2})$. Define $\hat{q}(z; dv)$ to be the transition kernel associated with
\[
\Big(I - \tfrac{1}{2}\delta \hat{K}\mathcal{L}\Big)\, v = \Big(I + \tfrac{1}{2}\delta \hat{K}\mathcal{L}\Big)\, z + \sqrt{2\delta}\,\mathcal{N}(0, \hat{K}),
\]
and let
\[
\hat{Q}(z; dv) = acc(z, v)\,\hat{q}(z; dv) + \delta_z(dv)\Big(1 - \int acc(z, x)\,\hat{q}(z; dx)\Big).
\]
It is easy to see that the transition kernel $\hat{Q}$ satisfies the detailed balance condition, and thus $\mu_d \hat{Q} = \mu_d$. Note that $|\lambda_{n,\zeta_{n-3},j}(z_{n-2}) - \lambda_{n,\zeta_{n-2},j}(u)| \le M_{31}/n$ for $j = 1, \dots, d$ and a constant $M_{31} > 0$; moreover, there exist constants $M_{32}, M_{33} > 0$ such that $M_{32} < \lambda_{n,\zeta_{n-3},j}(z_{n-2}),\ \lambda_{n,\zeta_{n-2},j}(z) < M_{33}$. By a procedure similar to that used for condition (II), we obtain
\[
\| Q_{n,\zeta_{n-2}} - \hat{Q} \|_{M(s)} \le M_{34}/n
\]
for some constant $M_{34} > 0$. It follows that
\[
\| \mu_d Q_{n,\zeta_{n-2}} - \mu_d \| = \| \mu_d (Q_{n,\zeta_{n-2}} - \hat{Q}) \| \le \gamma_3/n
\]
for some constant $\gamma_3 > 0$.

Now we have verified Proposition 2 for our algorithm, and thus Theorem 1 follows immediately from Theorem 2 in [10]. Finally, it is worth noting that, following the analysis of [25], it may be possible to relax the requirement that the posterior must have bounded support. Nevertheless, the investigation of the unbounded-support case is beyond the scope of the present work.

References

1. Christophe Andrieu and Johannes Thoms, A tutorial on adaptive MCMC, Statistics and Computing, 18 (2008), pp. 343-373.
2. Yves Atchade, Gersende Fort, Eric Moulines, and Pierre Priouret, Adaptive Markov chain Monte Carlo: theory and methods, Preprint, (2009).
3. Alexandros Beskos, A stable manifold MCMC method for high dimensions, Statistics & Probability Letters, 90 (2014).
4. Alexandros Beskos, Gareth Roberts, and Andrew Stuart, Optimal scalings for local Metropolis-Hastings chains on nonproduct targets in high dimensions, The Annals of Applied Probability, 19 (2009).
5. Alexandros Beskos, Gareth Roberts, Andrew Stuart, and Jochen Voss, MCMC methods for diffusion bridges, Stochastics and Dynamics, 8 (2008).
6. Yuxin Chen, David Keyes, Kody J. H. Law, and Hatem Ltaief, Accelerated dimension-independent adaptive Metropolis, arXiv preprint arXiv:1506.05741, (2015).
7. Simon L. Cotter, Gareth O. Roberts, Andrew M. Stuart, and David White, MCMC methods for functions: modifying old algorithms to make them faster, Statistical Science, 28 (2013), pp. 424-446.
8. Tiangang Cui, Kody J. H. Law, and Youssef M. Marzouk, Dimension-independent likelihood-informed MCMC, arXiv preprint arXiv:1411.3688, (2014).
9. Giuseppe Da Prato, An Introduction to Infinite-Dimensional Analysis, Springer, 2006.
10. Zhe Feng and Jinglai Li, An adaptive independence sampler MCMC algorithm for infinite dimensional Bayesian inferences, arXiv preprint, (2015).
11. Heikki Haario, Eero Saksman, and Johanna Tamminen, An adaptive Metropolis algorithm, Bernoulli, 7 (2001), pp. 223-242.
12. Nils Lid Hjort, Chris Holmes, Peter Muller, and Stephen G. Walker, Bayesian Nonparametrics, vol. 28, Cambridge University Press, 2010.
13. Gabriele Inglese, An inverse problem in corrosion detection, Inverse Problems, 13 (1997), p. 977.
14. Jari Kaipio and Erkki Somersalo, Statistical and Computational Inverse Problems, vol. 160, Springer, 2005.
15. Robert E. Kass, Bradley P. Carlin, Andrew Gelman, and Radford M. Neal, Markov chain Monte Carlo in practice: a roundtable discussion, The American Statistician, 52 (1998), pp. 93-100.
16. Kody J. H. Law, Proposals which speed up function-space MCMC, Journal of Computational and Applied Mathematics, 262 (2014).
17. James Martin, Lucas C. Wilcox, Carsten Burstedde, and Omar Ghattas, A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion, SIAM Journal on Scientific Computing, 34 (2012), pp. A1460-A1487.
18. Jonathan C. Mattingly, Natesh S. Pillai, and Andrew M. Stuart, Diffusion limits of the random walk Metropolis algorithm in high dimensions, The Annals of Applied Probability, 22 (2012).
19. Noemi Petra, James Martin, Georg Stadler, and Omar Ghattas, A computational framework for infinite-dimensional Bayesian inverse problems, part II: stochastic Newton MCMC with application to ice sheet flow inverse problems, SIAM Journal on Scientific Computing, 36 (2014), pp. A1525-A1555.
20. Frank J. Pinski, Gideon Simpson, Andrew M. Stuart, and Hendrik Weber, Algorithms for Kullback-Leibler approximation of probability measures in infinite dimensions, arXiv preprint arXiv:1408.1920, (2014).
21. Gareth O. Roberts, Andrew Gelman, and Walter R. Gilks, Weak convergence and optimal scaling of random walk Metropolis algorithms, The Annals of Applied Probability, 7 (1997), pp. 110-120.
22. Gareth O. Roberts and Jeffrey S. Rosenthal, Examples of adaptive MCMC, Journal of Computational and Graphical Statistics, 18 (2009), pp. 349-367.
23. Gareth O. Roberts and Jeffrey S. Rosenthal, Optimal scaling for various Metropolis-Hastings algorithms, Statistical Science, 16 (2001), pp. 351-367.
24. Daniel Rudolf and Bjorn Sprungk, On a generalization of the preconditioned Crank-Nicolson Metropolis algorithm, arXiv preprint arXiv:1504.03461, (2015).
25. Eero Saksman and Matti Vihola, On the ergodicity of the adaptive Metropolis algorithm on unbounded domains, The Annals of Applied Probability, 20 (2010).
26. A. M. Stuart, Inverse problems: a Bayesian perspective, Acta Numerica, 19 (2010), pp. 451-559.
27. Sebastian J. Vollmer, Dimension-independent MCMC sampling for inverse problems with non-Gaussian priors, arXiv preprint arXiv:1302.2213, (2013).
28. Zhewei Yao, Zixi Hu, and Jinglai Li, A TV-Gaussian prior for infinite-dimensional Bayesian inverse problems and its numerical implementations, arXiv preprint arXiv:1510.05239, (2015).
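Appendix: a numerical illustration. The transition kernel analyzed in condition III above, i.e. a Crank-Nicolson discretized proposal $(I - \tfrac{1}{2}\delta K\mathcal{L})v = (I + \tfrac{1}{2}\delta K\mathcal{L})z + \sqrt{2\delta}\,\mathcal{N}(0,K)$ combined with a Metropolis-Hastings accept/reject step, can be sketched in a finite-dimensional toy setting. This is a minimal illustration, not the algorithm of the paper: the names `cn_proposal`, `mh_step`, and `Phi` are ours, the acceptance function `acc` is supplied by the user (its exact form depends on the choice of the preconditioner $K$), and in the toy example we take $\mathcal{L}$ to be the negative prior precision so that the proposal preserves the Gaussian prior and the acceptance ratio takes the standard pCN form; the paper's sign convention for $\mathcal{L}$ may differ.

```python
import numpy as np

def cn_proposal(z, K, L, delta, rng):
    # Draw v from the Crank-Nicolson discretized proposal
    # (I - (delta/2) K L) v = (I + (delta/2) K L) z + sqrt(2 delta) N(0, K).
    d = z.size
    I = np.eye(d)
    A = I - 0.5 * delta * K @ L
    B = I + 0.5 * delta * K @ L
    xi = np.linalg.cholesky(K) @ rng.standard_normal(d)  # one draw of N(0, K)
    return np.linalg.solve(A, B @ z + np.sqrt(2.0 * delta) * xi)

def mh_step(z, K, L, delta, acc, rng):
    # One step of the Metropolis-Hastings kernel
    #   Q(z, dv) = acc(z, v) q(z, dv) + delta_z(dv) (1 - \int acc(z, x) q(z, dx)):
    # propose v from q, accept it with probability acc(z, v), otherwise stay at z.
    v = cn_proposal(z, K, L, delta, rng)
    return v if rng.uniform() < acc(z, v) else z

# Toy example (our assumptions): Gaussian prior N(0, C) with K = C and
# L = -C^{-1}, so the proposal preserves the prior exactly; the target is then
# proportional to exp(-Phi(u)) N(0, C) with acc(z, v) = min(1, exp(Phi(z) - Phi(v))).
rng = np.random.default_rng(0)
C = np.array([[1.0, 0.3], [0.3, 0.5]])
K, L = C, -np.linalg.inv(C)
Phi = lambda u: 0.5 * np.sum((u - 1.0) ** 2)          # a toy likelihood potential
acc = lambda z, v: min(1.0, np.exp(Phi(z) - Phi(v)))
z = np.zeros(2)
samples = []
for _ in range(2000):
    z = mh_step(z, K, L, 0.5, acc, rng)
    samples.append(z)
samples = np.array(samples)
```

With $K = C$ and $\mathcal{L} = -C^{-1}$ the update reduces to $v = a z + \sqrt{1-a^2}\, C^{1/2}\xi$ with $a = (1 - \delta/2)/(1 + \delta/2)$, which is exactly the prior-preserving pCN map; the chain is therefore reversible with respect to the toy posterior, mirroring the detailed-balance argument used in the proof.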