
Zuse Institute Berlin, Takustr. 7, 14195 Berlin, Germany

ADAM NIELSEN

The Monte Carlo Computation Error of Transition Probabilities

ZIB-Report 16-37 (July 2016)

Zuse Institute Berlin, Takustr. 7, D-14195 Berlin. Telefon: +49 30-84185-0, Telefax: +49 30-84185-125, e-mail: bibliothek@zib.de, URL: http://www.zib.de. ZIB-Report (Print) ISSN 1438-0064, ZIB-Report (Internet) ISSN 2192-7782.

The Monte Carlo Computation Error of Transition Probabilities

Adam Nielsen

Abstract

In many applications one is interested in computing transition probabilities of a Markov chain. This can be achieved by using Monte Carlo methods with local or global sampling points. In this article, we analyze the error, measured as the difference in the $L^2$ norm between the true transition probabilities and the approximation achieved through a Monte Carlo method. We give a formula for the error for Markov chains with locally computed sampling points. Further, in the case of reversible Markov chains, we deduce a formula for the error when sampling points are computed globally. We will see that in both cases the error itself can be approximated with Monte Carlo methods. As a consequence of the result, we derive surprising properties of reversible Markov chains.

1 Introduction

In many applications, one is interested in approximating the term $P[X_1 \in B \mid X_0 \in A]$ for a Markov chain $(X_n)$ with stationary measure $\mu$. A solid method to do this is a Monte Carlo method. In this article, we give the exact error between this term and the term approximated by the Monte Carlo method. It will turn out that when we approximate the term with $N$ trajectories whose starting points are distributed locally in $A$, then the squared error in the $L^2$ norm is exactly given by
\[
\frac{1}{N} \Big( \mathbb{E}_{\mu_A}\big[ P_x[\tilde X_1 \in B]^2 \big] - \mathbb{E}_{\mu_A}\big[ P_x[\tilde X_1 \in B] \big]^2 \Big),
\]
where $\mu_A(B) := \mu(A \cap B)/\mu(A)$ is the stationary distribution restricted to $A$ and $(\tilde X_n)$ is the reversed Markov chain. If the Markov chain $(X_n)$ is reversible and we approximate the term with $N$ trajectories with global starting points, then the squared error in the $L^2$ norm is given by
\[
\frac{1}{N} \left( \frac{P[X_2 \in A, X_1 \in B \mid X_0 \in A]}{P[X_0 \in A]} - P[X_1 \in B \mid X_0 \in A]^2 \right).
\]
We even give the exact error for a more general term which is of interest whenever one wants to compute a Markov State Model of a Markov operator based on an arbitrary function space, which has applications in computational drug design [10, 11]. Further, we derive from the result some surprising properties of reversible Markov chains. For example, we will show that reversible Markov chains rather return to a set than stay in it, i.e.
\[
P[X_2 \in A \mid X_0 \in A] \ge P[X_0 \in A]
\]
for any measurable set $A$.

2 Basics

We denote by $(E, \Sigma, \mu)$ and $(\Omega, \mathcal{A}, P)$ probability spaces on any given sets $E$ and $\Omega$. We call a map $p \colon E \times \Sigma \to [0, 1]$ a Markov kernel if $A \mapsto p(x, A)$ is almost surely a probability measure on $\Sigma$ and $x \mapsto p(x, A)$ is measurable for all $A \in \Sigma$. We denote by $(X_n)_{n \in \mathbb{N}}$, $X_n \colon \Omega \to E$, a Markov chain on a measurable state space [8, 6], i.e. there exists a Markov kernel $p$ with
\[
P[X_0 \in A_0, X_1 \in A_1, \dots, X_n \in A_n] = \int_{A_0} \cdots \int_{A_{n-1}} p(y_{n-1}, A_n)\, p(y_{n-2}, dy_{n-1}) \cdots p(y_0, dy_1)\, \mu(dy_0)
\]
for any $n \in \mathbb{N}$, $A_0, \dots, A_n \in \Sigma$. In this article, we only need this formula for $n \le 2$, where it simplifies to
\[
P[X_0 \in A_0, X_1 \in A_1, X_2 \in A_2] = \int_{A_0} \int_{A_1} p(y, A_2)\, p(x, dy)\, \mu(dx)
\]
for any $A_0, A_1, A_2 \in \Sigma$. We assume throughout this paper that $\mu$ is a stationary measure, i.e. $P[X_n \in A] = \mu(A)$ for any $n \in \mathbb{N}$, $A \in \Sigma$. We call a Markov chain $(X_n)$ reversible if
\[
\int_A p(x, B)\, \mu(dx) = \int_B p(x, A)\, \mu(dx)
\]
holds for any $A, B \in \Sigma$. This is equivalent to
\[
P[X_0 \in A, X_1 \in B] = P[X_0 \in B, X_1 \in A]
\]
for any $A, B \in \Sigma$. In particular, it implies that $\mu$ is a stationary measure. We denote by $L^1(\mu)$ the space of all $\mu$-integrable functions [1, Page 99]. It is known [3, Theorem 2.1] that there exists a Markov operator $P \colon L^1(\mu) \to L^1(\mu)$ with $\|P f\|_{L^1} = \|f\|_{L^1}$ and $P f \ge 0$ for all non-negative $f \in L^1(\mu)$ which is associated to the Markov chain, i.e.
\[
\int_E p(x, A)\, f(x)\, \mu(dx) = \int_A (P f)(x)\, \mu(dx)
\]
for all $A \in \Sigma$ and $f \in L^1(\mu)$. Since $\mu$ is a stationary measure, we have $P(L^2(\mu)) \subseteq L^2(\mu)$ and we can restrict $P$ to $L^2(\mu)$ [2, Lemma 1].
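To make these definitions concrete, here is a minimal numerical sketch (an illustration added here, using the three-state chain that reappears in Section 5): on a finite state space the kernel is a row-stochastic matrix, stationarity reads $\mu P = \mu$, and reversibility is the detailed balance condition $\mu_i p_{ij} = \mu_j p_{ji}$.

```python
# Finite-state illustration: row-stochastic kernel, stationary measure,
# and reversibility as detailed balance (a sketch, not from the report).
import numpy as np

P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])          # the three-state chain of Section 5

# Stationary measure: left eigenvector of P for eigenvalue 1, normalized.
w, V = np.linalg.eig(P.T)
mu = np.real(V[:, np.argmin(np.abs(w - 1.0))])
mu /= mu.sum()

assert np.allclose(mu @ P, mu)           # stationarity: mu P = mu
D = mu[:, None] * P                      # D_ij = mu_i p_ij
assert np.allclose(D, D.T)               # reversibility: detailed balance
print(mu)                                # [1/3 1/3 1/3]
```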

It is known [3] that a reversed Markov chain $(\tilde X_n)$ exists with
\[
P[X_1 \in B, X_0 \in A] = P[\tilde X_1 \in A, \tilde X_0 \in B]
\]
and that the Markov operator can be evaluated point-wise as
\[
P f(x) = \mathbb{E}_x[f(\tilde X_1)]. \tag{1}
\]
In the case where $(X_n)$ is reversible, the Markov operator $P \colon L^2(\mu) \to L^2(\mu)$ is self-adjoint [4, Proposition 1.1] and $\tilde X_n = X_n$; thus it can be evaluated point-wise by
\[
P f(x) = \mathbb{E}_x[f(X_1)]. \tag{2}
\]
Throughout the paper we write for any probability measure $\mu$ the expectation as $\mathbb{E}_\mu[f] = \int f(x)\, \mu(dx)$ and use the shortcut $\mathbb{E} := \mathbb{E}_P$. We denote by $\langle f, g \rangle_\mu = \int f(x)\, g(x)\, \mu(dx)$ the scalar product on $L^2(\mu)$. In this article, we are interested in the quantity
\[
C := \frac{\langle f, P g \rangle_\mu}{\langle f, \mathbb{1} \rangle_\mu}, \tag{3}
\]
where $\mathbb{1}$ denotes the constant function that is everywhere one and $f, g \in L^2(\mu)$ with $\langle f, \mathbb{1} \rangle_\mu \neq 0$. For the special case where $f(x) = \mathbb{1}_A(x)$ and $g(x) = \mathbb{1}_B(x)$, we obtain
\[
C = \frac{1}{\mu(A)} \int_B p(x, A)\, \mu(dx) = P[X_1 \in B \mid X_0 \in A].
\]
There are many applications that involve an approximation of $C$ for different types of functions $f, g$; they can be found in [10].
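On a finite state space, $C$ can be evaluated exactly and compared against a naive sampling estimate. The following sketch does this for indicator functions on the three-state chain from above; the sets $A$ and $B$ are hypothetical choices made for illustration.

```python
# C = <f, Pg>_mu / <f, 1>_mu for f = 1_A, g = 1_B on a finite reversible
# chain, compared with a Monte Carlo estimate of P[X_1 in B | X_0 in A].
# For reversible chains, Pg(x) = E_x[g(X_1)] is a matrix-vector product (Eq. (2)).
import numpy as np

rng = np.random.default_rng(0)
P   = np.array([[0.0, 0.5, 0.5],
                [0.5, 0.0, 0.5],
                [0.5, 0.5, 0.0]])
mu  = np.array([1/3, 1/3, 1/3])

f = np.array([1.0, 0.0, 1.0])            # f = 1_A with A = {0, 2}
g = np.array([0.0, 1.0, 0.0])            # g = 1_B with B = {1}

C = (f * (P @ g) * mu).sum() / (f * mu).sum()

# Monte Carlo: draw X_0 from mu restricted to A, make one step, count hits of B.
N  = 20_000
x0 = rng.choice(3, size=N, p=(mu * f) / (mu * f).sum())
x1 = np.array([rng.choice(3, p=P[i]) for i in x0])
print(C, g[x1].mean())                   # exact value 0.5 and an estimate near it
```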

3 Computation of the Local Error

To compute the error locally, we rewrite
\[
C = \frac{\langle f, P g \rangle_\mu}{\langle f, \mathbb{1} \rangle_\mu} = \int_E P g(x)\, \mu_i(dx) \qquad \text{with} \qquad \mu_i(A) = \frac{1}{\langle f, \mathbb{1} \rangle_\mu} \int_A f(x)\, \mu(dx).
\]
This term can be approximated by Monte Carlo methods [5]. To do so, we need random variables $Y_1, \dots, Y_N \colon \Omega \to E$ distributed according to $\mu_i$. Then we can approximate the term $C$ by
\[
\tilde C := \frac{1}{N} \sum_{k=1}^N P g(Y_k).
\]
The error can be stated as
\[
\| C - \tilde C \|_{L^2(P)}^2 = \frac{\mathrm{VAR}[P g(Y_1)]}{N}.
\]
This follows from $\mathbb{E}[\tilde C] = C$ and
\[
\| C - \tilde C \|_{L^2(P)}^2 = \mathrm{VAR}[\tilde C] = \frac{1}{N^2} \sum_{i=1}^N \mathrm{VAR}[P g(Y_1)] = \frac{\mathrm{VAR}[P g(Y_1)]}{N}.
\]
In this case the variance is simply given as
\[
\mathrm{VAR}[P g(Y_1)] = \mathbb{E}[P g(Y_1)^2] - \mathbb{E}[P g(Y_1)]^2 = \mathbb{E}_{\mu_i}\big[ \mathbb{E}_x[g(\tilde X_1)]^2 \big] - \mathbb{E}_{\mu_i}\big[ \mathbb{E}_x[g(\tilde X_1)] \big]^2.
\]
In the case where the Markov chain $(X_n)$ is reversible, the variance simplifies to
\[
\mathrm{VAR}[P g(Y_1)] = \mathbb{E}_{\mu_i}\big[ \mathbb{E}_x[g(X_1)]^2 \big] - \mathbb{E}_{\mu_i}\big[ \mathbb{E}_x[g(X_1)] \big]^2.
\]
If further $g = \mathbb{1}_A$, we obtain
\[
\mathrm{VAR}[P g(Y_1)] = \mathbb{E}_{\mu_i}\big[ P_x[X_1 \in A]^2 \big] - \mathbb{E}_{\mu_i}\big[ P_x[X_1 \in A] \big]^2.
\]

3.1 Example

Consider the solution $(Y_t)_{t \ge 0}$ of the stochastic differential equation
\[
dY_t = -\nabla V(Y_t)\, dt + \sigma\, dB_t
\]
with $\sigma > 0$ and $V(x) = (x-2)^2 (x+2)^2$. The potential $V$ and a realization of $(Y_t)$ are shown in Figure 1. The solution $(Y_t)$ is known to be reversible, see [7, Proposition 4.5]. For any value $\tau > 0$ the family $(X_n)_{n \in \mathbb{N}}$ with $X_n := Y_{n\tau}$ is a reversible Markov chain. Let us partition $[-3, 3]$ into twenty equidistant sets $(A_l)_{l=1,\dots,20}$. We compute a matrix $M \in \mathbb{R}^{20 \times 20}$ with
\[
M(i, j) = \mathbb{E}_{\mu_i}\big[ P_x[X_1 \in A_j]^2 \big] - \mathbb{E}_{\mu_i}\big[ P_x[X_1 \in A_j] \big]^2
\]
with $\mu_i(B) = \mu(A_i \cap B) / \mu(A_i)$. In each set $A_i$ we sample points $x_1^i, \dots, x_N^i$ with the Metropolis Monte Carlo method distributed according to $\mu_i$. For each point $x_j^i$ we sample $M$ trajectories starting in $x_j^i$ with endpoints $y_{k,j}^i$ for $k = 1, \dots, M$. We then approximate
\[
\mathbb{E}_{\mu_i}\big[ P_x[X_1 \in A_j]^2 \big] \approx \frac{1}{N} \sum_{l=1}^N \left( \frac{1}{M} \sum_{k=1}^M \mathbb{1}_{A_j}(y_{k,l}^i) \right)^2
\]
and
\[
\mathbb{E}_{\mu_i}\big[ P_x[X_1 \in A_j] \big]^2 \approx \left( \frac{1}{N} \sum_{l=1}^N \frac{1}{M} \sum_{k=1}^M \mathbb{1}_{A_j}(y_{k,l}^i) \right)^2.
\]
The locally sampled points and the matrix $M$ are shown in Figure 2. It shows that in order to compute the transition probability from $0$ to $2$ or from $0$ to $-2$, many trajectories are needed.
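A sketch of this experiment could look as follows; the choices below are not from the report: an Euler-Maruyama discretization of the SDE, starting points drawn uniformly within each cell as a crude stand-in for $\mu_i$, and small illustrative values for $\sigma$, $\tau$, $N$ and $M$.

```python
# Local scheme of Section 3.1 (a sketch): estimate
#   M(i,j) = E_{mu_i}[P_x[X_1 in A_j]^2] - E_{mu_i}[P_x[X_1 in A_j]]^2
# for the double-well SDE dY = -V'(Y) dt + sigma dB, V(x) = (x-2)^2 (x+2)^2.
import numpy as np

rng   = np.random.default_rng(1)
sigma = 1.0
tau   = 0.1                  # lag time of the chain X_n = Y_{n tau}
dt    = 1e-3                 # integrator step (illustrative)

def grad_V(x):
    # V(x) = (x-2)^2 (x+2)^2  =>  V'(x) = 4 x (x^2 - 4)
    return 4.0 * x * (x**2 - 4.0)

def evolve(x, t):
    # Euler-Maruyama integration of dY = -V'(Y) dt + sigma dB over time t.
    for _ in range(int(round(t / dt))):
        x = x - grad_V(x) * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

edges = np.linspace(-3.0, 3.0, 21)   # twenty equidistant cells A_1, ..., A_20
N, M  = 50, 20                       # points per cell, trajectories per point

M_loc = np.zeros((20, 20))
for i in range(20):
    x0 = rng.uniform(edges[i], edges[i + 1], size=N)      # stand-in for mu_i
    y  = evolve(np.repeat(x0, M), tau).reshape(N, M)      # endpoints y^i_{k,l}
    for j in range(20):
        p = ((y >= edges[j]) & (y < edges[j + 1])).mean(axis=1)  # P_x[X_1 in A_j]
        M_loc[i, j] = (p**2).mean() - p.mean()**2

print(M_loc.max())   # largest local variance entry
```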

Figure 1: Potential (left) and trajectory (right).

Figure 2: Local starting points (left) and matrix $M$ (right).

4 Computation of the Global Error

We assume in this section that $(X_n)$ is reversible. If we denote by
\[
\varphi(x) := \frac{P f(x)\, g(x)}{\langle f, \mathbb{1} \rangle_\mu},
\]
then one can rewrite
\[
C = \frac{\langle f, P g \rangle_\mu}{\langle f, \mathbb{1} \rangle_\mu} = \frac{\langle P f, g \rangle_\mu}{\langle f, \mathbb{1} \rangle_\mu} = \int_E \varphi(x)\, \mu(dx).
\]
Analogous to the local case, we can approximate $C$ by Monte Carlo methods. However, this time we need random variables $Y_1, \dots, Y_N \colon \Omega \to E$ distributed according to $\mu$. Then we can approximate the term $C$ by
\[
\tilde C := \frac{1}{N} \sum_{k=1}^N \varphi(Y_k).
\]
Again, the error can be stated as
\[
\| C - \tilde C \|_{L^2(P)}^2 = \frac{\mathrm{VAR}[\varphi(Y_1)]}{N}.
\]
The main result of the article is the following theorem.

Theorem 1. It holds
\[
\mathrm{VAR}[\varphi(Y_1)] = \frac{\mathbb{E}_{\nu_f}[g^2(X_1)\, f(X_2)]}{\mathbb{E}[f(X_0)]} - \mathbb{E}_{\nu_f}[g(X_1)]^2
\]
with
\[
\nu_f(A) = \int_A \frac{f(X_0(\omega))}{\int_\Omega f(X_0(\tilde\omega))\, P(d\tilde\omega)}\, P(d\omega).
\]
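On a finite state space both sides of Theorem 1 reduce to finite sums over two-step paths, so the identity can be checked numerically. The sketch below does this for the reversible three-state chain used earlier, with arbitrarily chosen non-negative $f$ and $g$ (hypothetical values, not from the report); here $\mathbb{E}_{\nu_f}[h(X_1, X_2)] = \mathbb{E}[f(X_0)\, h(X_1, X_2)] / \mathbb{E}[f(X_0)]$.

```python
# Numerical check of Theorem 1: VAR[phi(Y_1)] computed directly equals
# E_{nu_f}[g^2(X_1) f(X_2)] / E[f(X_0)] - E_{nu_f}[g(X_1)]^2 (exact sums).
import numpy as np

P  = np.array([[0.0, 0.5, 0.5],
               [0.5, 0.0, 0.5],
               [0.5, 0.5, 0.0]])
mu = np.array([1/3, 1/3, 1/3])

f = np.array([1.0, 0.2, 0.7])       # hypothetical non-negative f, g in L^2(mu)
g = np.array([0.3, 1.0, 0.5])

Pf  = P @ f                          # pointwise evaluation, Equation (2)
phi = Pf * g / (f @ mu)              # phi = Pf g / <f, 1>_mu
lhs = (phi**2 @ mu) - (phi @ mu)**2  # VAR[phi(Y_1)] with Y_1 ~ mu

Ef    = mu @ f                       # E[f(X_0)]
Efg   = (mu * f) @ P @ g             # E[f(X_0) g(X_1)]
Efggf = (mu * f) @ P @ (g**2 * Pf)   # E[f(X_0) g^2(X_1) f(X_2)]
rhs   = Efggf / Ef**2 - (Efg / Ef)**2

assert np.isclose(lhs, rhs)
print(lhs, rhs)
```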

Figure 3: Global starting points (left) and matrix $M$ (right).

The theorem shows that one can compute the exact error. Before we step into the proof, we will first show some applications of the theorem.

4.1 Example

In the case where $f(x) = \mathbb{1}_A(x)$ and $g(x) = \mathbb{1}_B(x)$, the variance simplifies to
\[
\mathrm{VAR}[\varphi(Y_1)] = \frac{P[X_2 \in A, X_1 \in B \mid X_0 \in A]}{P[X_0 \in A]} - P[X_1 \in B \mid X_0 \in A]^2.
\]
We consider again the example from Section 3.1. This time, we compute starting points globally with the Metropolis Monte Carlo method distributed according to $\mu$, and we compute the matrix
\[
M_{ij} = \frac{P[X_2 \in A_i, X_1 \in A_j \mid X_0 \in A_i]}{P[X_0 \in A_i]} - P[X_1 \in A_j \mid X_0 \in A_i]^2.
\]
The results are shown in Figure 3. One may note that the entries of the global variance matrix are significantly higher than those of the local variance matrix. The reason for this is the following. In the global case, the entry $M(i, j)$ indicates how many global trajectories one needs to keep the error of $P[X_1 \in A_j \mid X_0 \in A_i]$ small. However, one can only use those trajectories with starting points in $A_i$. In the local case, we only sample starting points in $A_i$ and thus do not need to dismiss any trajectories; this implies that we need fewer trajectories, and it becomes clear why $M(i, j)$ is smaller in the local case. Especially, the global scheme has difficulties capturing the transition probability from a transition region (here the area near $0$) to a transition region, which is in turn no challenge for the local scheme.
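For brevity, the following sketch illustrates the global error formula on the three-state chain rather than on the SDE example, taking the sets $A_i$ to be the singletons $\{i\}$ (an assumption made here, not in the paper). It compares the exact entries with an estimate from globally sampled two-step trajectories.

```python
# Global scheme (a sketch): M_ij = P[X_2 in A_i, X_1 in A_j | X_0 in A_i]
#   / P[X_0 in A_i] - P[X_1 in A_j | X_0 in A_i]^2, with A_i = {i}.
import numpy as np

rng = np.random.default_rng(2)
P   = np.array([[0.0, 0.5, 0.5],
                [0.5, 0.0, 0.5],
                [0.5, 0.5, 0.0]])
mu  = np.array([1/3, 1/3, 1/3])
n   = len(mu)

# Exact entries: P[X_2=i, X_1=j | X_0=i] = P_ij P_ji and P[X_0=i] = mu_i.
M_exact = P * P.T / mu[:, None] - P**2

# Monte Carlo estimate from N global two-step trajectories with X_0 ~ mu.
N  = 50_000
x0 = rng.choice(n, size=N, p=mu)
x1 = np.array([rng.choice(n, p=P[k]) for k in x0])
x2 = np.array([rng.choice(n, p=P[k]) for k in x1])

M_mc = np.empty((n, n))
for i in range(n):
    inA = x0 == i
    for j in range(n):
        p_j  = (inA & (x1 == j)).mean() / inA.mean()              # P[X_1=j | X_0=i]
        p_ij = (inA & (x1 == j) & (x2 == i)).mean() / inA.mean()  # P[X_2=i, X_1=j | X_0=i]
        M_mc[i, j] = p_ij / inA.mean() - p_j**2

print(np.abs(M_exact - M_mc).max())   # shrinks like 1/sqrt(N)
```

As in the figures, only trajectories that start in $A_i$ contribute to the entry $M_{ij}$, which is why the global scheme wastes samples compared to the local one.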

5 Inequalities for Reversible Markov Chains

Since $\mathrm{VAR}[\varphi(Y_1)] \ge 0$, we obtain
\[
\frac{\mathbb{E}_{\nu_f}[g^2(X_1)\, f(X_2)]}{\mathbb{E}[f(X_0)]} \ge \mathbb{E}_{\nu_f}[g(X_1)]^2
\]
for any $f, g \in L^2(\mu)$. If we set $g(x) = \mathbb{1}$, we obtain
\[
\mathbb{E}_{\nu_f}[f(X_2)] \ge \mathbb{E}[f(X_0)].
\]
For the case where $f = \mathbb{1}_A$ for $A \in \Sigma$, this simplifies to
\[
P[X_2 \in A \mid X_0 \in A] \ge P[X_0 \in A].
\]
In short: reversible Markov chains prefer to return to a set over simply being in it. If $E = \{1, \dots, n\}$ is finite and the Markov chain is described by a transition matrix $P \in \mathbb{R}^{n \times n}$ which is reversible with respect to the probability vector $\pi \in \mathbb{R}^n$, i.e. $\pi_i p_{ij} = \pi_j p_{ji}$, then the inequality states
\[
P^2(i, i) \ge \pi_i.
\]
This shows, by the way, that from
\[
\sum_{i=1}^n P^2(i, i) = 1
\]
it immediately follows that $P^2(i, i) = \pi_i$ for all $i$.

Another inequality can be obtained as follows. Fix some sets $A, B \in \Sigma$ with $A \cap B = \emptyset$. Define the function $f$ at a point $x$ to be the probability that the Markov chain $(X_n)$ next hits set $A$ before hitting set $B$ when starting in $x$. Since the Markov chain is reversible, this is equivalent to the probability that one last visited set $A$ instead of set $B$. This function is also known as the committor function [9]. The term $\mathbb{E}_{\nu_f}[f(X_2)]$ is then the conditional probability that one came last from $A$ instead of $B$, moves for two steps, and then next returns to $A$ instead of $B$. The probability of this event is thus always at least $\mathbb{E}[f(X_0)]$, which simply represents the probability that one came last from $A$ instead of $B$.

Example. Consider the finite reversible Markov chain
\[
P = \begin{pmatrix} 0 & \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} & 0 \end{pmatrix}
\]
visualized in Figure 4 with stationary measure $\pi = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$. In this case we have
\[
P^2(i, i) = \tfrac{1}{2} \ge \tfrac{1}{3} = \pi_i.
\]
The probability that one came last from $A := \{1\}$ instead of $B := \{2\}$ is given by
\[
\mathbb{E}[f(X_0)] = \tfrac{1}{3} \cdot 1 + \tfrac{1}{3} \cdot 0 + \tfrac{1}{3} \cdot \tfrac{1}{2} = \tfrac{1}{2}.
\]
The conditional probability that one came last from $A$ instead of $B$, moves for two steps, and then next returns to $A$ instead of $B$ is given by
\[
\mathbb{E}_{\nu_f}[f(X_2)] = \tfrac{2}{3} \left( \tfrac{1}{4} \cdot 1 + \tfrac{1}{4} \cdot \tfrac{1}{2} + \tfrac{1}{4} \cdot 1 + \tfrac{1}{4} \cdot 0 \right) + 0 + \tfrac{1}{3} \left( \tfrac{1}{4} \cdot \tfrac{1}{2} + \tfrac{1}{4} \cdot 1 + \tfrac{1}{4} \cdot \tfrac{1}{2} + \tfrac{1}{4} \cdot 0 \right) = \tfrac{7}{12},
\]
where we have used that $\nu_f(\{1\}) = \tfrac{2}{3}$, $\nu_f(\{2\}) = 0$ and $\nu_f(\{3\}) = \tfrac{1}{3}$. Thus, we have as predicted
\[
\tfrac{7}{12} = \mathbb{E}_{\nu_f}[f(X_2)] \ge \mathbb{E}[f(X_0)] = \tfrac{1}{2}.
\]

Figure 4: Visualized simple Markov chain.
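These numbers can be confirmed in exact arithmetic; the short sketch below recomputes them, taking only the committor values $f = (1, 0, \tfrac{1}{2})$ from the text.

```python
# Exact-arithmetic check of the Section 5 example. Only the committor values
# f = (1, 0, 1/2) are taken from the text; the rest is recomputed.
from fractions import Fraction as F

P  = [[F(0), F(1, 2), F(1, 2)],
      [F(1, 2), F(0), F(1, 2)],
      [F(1, 2), F(1, 2), F(0)]]
pi = [F(1, 3)] * 3

# Two-step transition matrix and the inequality P^2(i, i) >= pi_i:
P2 = [[sum(P[i][k] * P[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
print([P2[i][i] for i in range(3)])        # [1/2, 1/2, 1/2], each >= 1/3

f = [F(1), F(0), F(1, 2)]                  # committor between A = {1} and B = {2}

Ef0 = sum(pi[i] * f[i] for i in range(3))  # E[f(X_0)] = 1/2
Ef2 = sum(pi[i] * f[i] * P2[i][j] * f[j]
          for i in range(3) for j in range(3)) / Ef0   # E_{nu_f}[f(X_2)] = 7/12
print(Ef0, Ef2, Ef2 >= Ef0)                # 1/2 7/12 True
```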

6 The proof

We will now give the proof of Theorem 1. First, note that we have
\[
\mathrm{VAR}[\varphi(Y_1)] = \mathbb{E}[\varphi^2(Y_1)] - \mathbb{E}[\varphi(Y_1)]^2.
\]
For sets $A, B \in \Sigma$, we have
\[
\mathbb{E}[\mathbb{1}_A(X_0)\, \mathbb{1}_B(X_1)] = P[X_0 \in A, X_1 \in B] = \int_E \int_E \mathbb{1}_A(x)\, \mathbb{1}_B(y)\, p(x, dy)\, \mu(dx).
\]
For measurable functions $f, g \ge 0$, we then obtain
\[
\mathbb{E}[f(X_0)\, g(X_1)] = \int_E \int_E f(x)\, g(y)\, p(x, dy)\, \mu(dx). \tag{4}
\]
For $\varphi(x) = P f(x)\, g(x) / \langle f, \mathbb{1} \rangle_\mu$ we first compute
\[
\begin{aligned}
\mathbb{E}[\varphi^2(Y_1)]
&= \frac{1}{\langle f, \mathbb{1} \rangle_\mu^2} \int_E \big( P f(x)\, g(x) \big)^2\, \mu(dx)
 = \frac{1}{\langle f, \mathbb{1} \rangle_\mu^2} \int_E P f(x)\, \big( g^2\, P f \big)(x)\, \mu(dx) \\
&= \frac{1}{\langle f, \mathbb{1} \rangle_\mu^2} \int_E f(x)\, P\big( g^2\, P f \big)(x)\, \mu(dx)
 = \frac{1}{\langle f, \mathbb{1} \rangle_\mu^2} \int_E \int_E f(x)\, g^2(y)\, P f(y)\, p(x, dy)\, \mu(dx),
\end{aligned}
\]
where the second line uses that $P$ is self-adjoint on $L^2(\mu)$, leading to
\[
\mathbb{E}[\varphi^2(Y_1)] = \frac{1}{\langle f, \mathbb{1} \rangle_\mu^2} \int_E \int_E f(x)\, g^2(y)\, \mathbb{E}_y[f(X_1)]\, p(x, dy)\, \mu(dx),
\]
where we have used Equation (2). From Equation (4) we get
\[
\mathbb{E}[\varphi^2(Y_1)] = \frac{1}{\langle f, \mathbb{1} \rangle_\mu^2}\, \mathbb{E}\big[ f(X_0)\, g^2(X_1)\, \mathbb{E}_{X_1}[f(X_1)] \big].
\]
Further, according to [6, Equation (3.28) in Chapter 3], we have
\[
\mathbb{E}_{X_1}[f(X_1)] = \mathbb{E}[f(X_2) \mid \mathcal{F}_1],
\]
where $\mathcal{F}_1 = \sigma(X_0, X_1)$. Thus, we obtain
\[
\mathbb{E}\big[ f(X_0)\, g^2(X_1)\, \mathbb{E}_{X_1}[f(X_1)] \big] = \mathbb{E}\big[ f(X_0)\, g^2(X_1)\, \mathbb{E}[f(X_2) \mid \mathcal{F}_1] \big].
\]
Also note that we have $\mathbb{E}\big[ \mathbb{1}_A\, \mathbb{E}[f(X_2) \mid \mathcal{F}_1] \big] = \mathbb{E}[\mathbb{1}_A\, f(X_2)]$ for any $A \in \mathcal{F}_1$. Since $f(X_0)\, g^2(X_1)$ is measurable with respect to $\mathcal{F}_1$, we obtain
\[
\mathbb{E}\big[ f(X_0)\, g^2(X_1)\, \mathbb{E}[f(X_2) \mid \mathcal{F}_1] \big] = \mathbb{E}\big[ f(X_0)\, g^2(X_1)\, f(X_2) \big].
\]
Thus we finally get
\[
\mathbb{E}[\varphi^2(Y_1)] = \frac{\mathbb{E}_{\nu_f}[g^2(X_1)\, f(X_2)]}{\mathbb{E}_\mu[f]}
\]
with
\[
\nu_f(A) = \int_A \frac{f(X_0(\omega))}{\int_\Omega f(X_0(\tilde\omega))\, P(d\tilde\omega)}\, P(d\omega)
\]
for all measurable sets $A$. Finally, we have
\[
\begin{aligned}
\mathbb{E}[\varphi(Y_1)]
&= \frac{1}{\mathbb{E}_\mu[f]} \int_E P f(x)\, g(x)\, \mu(dx)
 = \frac{1}{\mathbb{E}_\mu[f]} \int_E f(x)\, P g(x)\, \mu(dx) \\
&= \frac{1}{\mathbb{E}_\mu[f]} \int_E \int_E f(x)\, g(y)\, p(x, dy)\, \mu(dx)
 = \frac{1}{\mathbb{E}_\mu[f]}\, \mathbb{E}[f(X_0)\, g(X_1)]
 = \mathbb{E}_{\nu_f}[g(X_1)].
\end{aligned}
\]
This gives us the exact formula for the variance.

Acknowledgement

The author gratefully thanks the Berlin Mathematical School for financial support.

References

[1] Heinz Bauer. Maß- und Integrationstheorie. Walter de Gruyter, Berlin; New York, 2nd edition, 1992.

[2] John R. Baxter and Jeffrey S. Rosenthal. Rates of convergence for everywhere-positive Markov chains. Statistics & Probability Letters, 22:333–338, 1995.

[3] Eberhard Hopf. The general temporally discrete Markoff process, volume 3. 1954.

[4] Wilhelm Huisinga. Metastability of Markovian systems. PhD thesis, Freie Universität Berlin, 2001.

[5] David J. C. MacKay. Introduction to Monte Carlo methods. In M. I. Jordan, editor, Learning in Graphical Models, NATO Science Series, pages 175–204. Kluwer Academic Press, 1998.

[6] Sean Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Cambridge University Press, New York, second edition, 2009.

[7] Grigorios A. Pavliotis. Stochastic Processes and Applications. Springer, 2014.

[8] Daniel Revuz. Markov Chains. North-Holland Mathematical Library, 2nd edition, 1984.

[9] Christof Schütte and Marco Sarich. Metastability and Markov State Models in Molecular Dynamics: Modeling, Analysis, Algorithmic Approaches. 2014. A co-publication of the AMS and the Courant Institute of Mathematical Sciences at New York University.

[10] Christof Schütte and Marco Sarich. A critical appraisal of Markov state models. The European Physical Journal Special Topics, pages 1–18, 2015.

[11] Marcus Weber. Meshless Methods in Conformation Dynamics. PhD thesis, Freie Universität Berlin, 2006.