
Zuse Institute Berlin, Takustr. 7, 14195 Berlin, Germany

ADAM NIELSEN

The Monte Carlo Computation Error of Transition Probabilities

ZIB-Report 16-37 (July 2016)

Zuse Institute Berlin, Takustr. 7, D-14195 Berlin. Telefon: +49 30-84185-0, Telefax: +49 30-84185-125, e-mail: bibliothek@zib.de, URL: http://www.zib.de. ZIB-Report (Print) ISSN 1438-0064, ZIB-Report (Internet) ISSN 2192-7782.

The Monte Carlo Computation Error of Transition Probabilities

Adam Nielsen

Abstract

In many applications one is interested in computing transition probabilities of a Markov chain. This can be achieved by using Monte Carlo methods with local or global sampling points. In this article, we analyze the error, measured as the difference in the $L^2$ norm between the true transition probabilities and the approximation achieved through a Monte Carlo method. We give a formula for the error for Markov chains with locally computed sampling points. Further, in the case of reversible Markov chains, we deduce a formula for the error when sampling points are computed globally. We will see that in both cases the error itself can be approximated with Monte Carlo methods. As a consequence of the result, we derive surprising properties of reversible Markov chains.

1 Introduction

In many applications, one is interested in approximating the term $P[X_1 \in B \mid X_0 \in A]$ for a Markov chain $(X_n)$ with stationary measure $\mu$. A solid method to do this is a Monte Carlo method. In this article, we give the exact error between this term and the term approximated by the Monte Carlo method. It will turn out that when we approximate the term with $N$ trajectories whose starting points are distributed locally in $A$, then the squared error in the $L^2$ norm is exactly given by
\[
\frac{1}{N} \Big( \mathbb{E}_{\mu_A}\big[ P_x[\tilde X_1 \in B]^2 \big] - \mathbb{E}_{\mu_A}\big[ P_x[\tilde X_1 \in B] \big]^2 \Big),
\]
where $\mu_A(B) := \mu(A \cap B)/\mu(A)$ is the stationary distribution restricted to $A$ and $(\tilde X_n)$ is the reversed Markov chain. If the Markov chain $(X_n)$ is reversible and we approximate the term with $N$ trajectories with global starting points, then the squared error in the $L^2$ norm is given by
\[
\frac{1}{N} \left( \frac{P[X_2 \in A, X_1 \in B \mid X_0 \in A]}{P[X_0 \in A]} - P[X_1 \in B \mid X_0 \in A]^2 \right).
\]
We even give the exact error for a more general term which is of interest whenever one wants to compute a Markov State Model of a Markov operator based on an arbitrary function space, which has applications in computational drug design [10, 11]. Further, we derive from the result some surprising properties of reversible Markov chains. For example, we will show that reversible Markov chains rather return to a set than stay in it, i.e.
\[
P[X_2 \in A \mid X_0 \in A] \ge P[X_0 \in A]
\]
for any measurable set $A$.

2 Basics

We denote by $(E, \Sigma, \mu)$ and $(\Omega, \mathcal{A}, P)$ probability spaces on any given sets $E$ and $\Omega$. We call a map $p \colon E \times \Sigma \to [0, 1]$ a Markov kernel if $A \mapsto p(x, A)$ is almost surely a probability measure on $\Sigma$ and $x \mapsto p(x, A)$ is measurable for all $A \in \Sigma$. We denote by $(X_n)_{n \in \mathbb{N}}$, $X_n \colon \Omega \to E$, a Markov chain on a measurable state space [8, 6], i.e. there exists a Markov kernel $p$ with
\[
P[X_0 \in A_0, X_1 \in A_1, \dots, X_n \in A_n] = \int_{A_0} \cdots \int_{A_{n-1}} p(y_{n-1}, A_n)\, p(y_{n-2}, dy_{n-1}) \cdots p(y_0, dy_1)\, \mu(dy_0)
\]
for any $n \in \mathbb{N}$, $A_0, \dots, A_n \in \Sigma$. In this article, we only need this formula for $n \le 2$, where it simplifies to
\[
P[X_0 \in A_0, X_1 \in A_1, X_2 \in A_2] = \int_{A_0} \int_{A_1} p(y, A_2)\, p(x, dy)\, \mu(dx)
\]
for any $A_0, A_1, A_2 \in \Sigma$. We assume throughout this paper that $\mu$ is a stationary measure, i.e. $P[X_n \in A] = \mu(A)$ for any $n \in \mathbb{N}$, $A \in \Sigma$. We call a Markov chain $(X_n)$ reversible if
\[
\int_A p(x, B)\, \mu(dx) = \int_B p(x, A)\, \mu(dx)
\]
holds for any $A, B \in \Sigma$. This is equivalent to
\[
P[X_0 \in A, X_1 \in B] = P[X_0 \in B, X_1 \in A]
\]
for any $A, B \in \Sigma$. In particular, it implies that $\mu$ is a stationary measure. We denote by $L^1(\mu)$ the space of all $\mu$-integrable functions [1, Page 99]. It is known [3, Theorem 2.1] that there exists a Markov operator $P \colon L^1(\mu) \to L^1(\mu)$ with $\|P f\|_{L^1} = \|f\|_{L^1}$ and $P f \ge 0$ for all non-negative $f \in L^1(\mu)$ which is associated to the Markov chain, i.e.
\[
\int_E p(x, A)\, f(x)\, \mu(dx) = \int_A (P f)(x)\, \mu(dx)
\]
for all $A \in \Sigma$ and $f \in L^1(\mu)$. Since $\mu$ is a stationary measure, we have $P(L^2(\mu)) \subseteq L^2(\mu)$ and we can restrict $P$ to $L^2(\mu)$ [2, Lemma 1].
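To make these definitions concrete, here is a minimal numerical sketch (an illustration added here, using the three-state chain that reappears in Section 5): on a finite state space the kernel is a row-stochastic matrix, stationarity reads $\mu P = \mu$, and reversibility is the detailed balance condition $\mu_i p_{ij} = \mu_j p_{ji}$.

```python
# Finite-state illustration: row-stochastic kernel, stationary measure,
# and reversibility as detailed balance (a sketch, not from the report).
import numpy as np

P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])          # the three-state chain of Section 5

# Stationary measure: left eigenvector of P for eigenvalue 1, normalized.
w, V = np.linalg.eig(P.T)
mu = np.real(V[:, np.argmin(np.abs(w - 1.0))])
mu /= mu.sum()

assert np.allclose(mu @ P, mu)           # stationarity: mu P = mu
D = mu[:, None] * P                      # D_ij = mu_i p_ij
assert np.allclose(D, D.T)               # reversibility: detailed balance
print(mu)                                # [1/3 1/3 1/3]
```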

It is known [3] that a reversed Markov chain $(\tilde X_n)$ exists with
\[
P[X_1 \in B, X_0 \in A] = P[\tilde X_1 \in A, \tilde X_0 \in B]
\]
and that the Markov operator can be evaluated point-wise as
\[
P f(x) = \mathbb{E}_x[f(\tilde X_1)]. \tag{1}
\]
In the case where $(X_n)$ is reversible, the Markov operator $P \colon L^2(\mu) \to L^2(\mu)$ is self-adjoint [4, Proposition 1.1] and $\tilde X_n = X_n$; thus it can be evaluated point-wise by
\[
P f(x) = \mathbb{E}_x[f(X_1)]. \tag{2}
\]
Throughout the paper we write for any probability measure $\mu$ the expectation as $\mathbb{E}_\mu[f] = \int f(x)\, \mu(dx)$ and use the shortcut $\mathbb{E} := \mathbb{E}_P$. We denote by $\langle f, g \rangle_\mu = \int f(x)\, g(x)\, \mu(dx)$ the scalar product on $L^2(\mu)$. In this article, we are interested in the quantity
\[
C := \frac{\langle f, P g \rangle_\mu}{\langle f, \mathbb{1} \rangle_\mu}, \tag{3}
\]
where $\mathbb{1}$ denotes the constant function that is everywhere one and $f, g \in L^2(\mu)$ with $\langle f, \mathbb{1} \rangle_\mu \neq 0$. For the special case where $f(x) = \mathbb{1}_A(x)$ and $g(x) = \mathbb{1}_B(x)$, we obtain
\[
C = \frac{1}{\mu(A)} \int_B p(x, A)\, \mu(dx) = P[X_1 \in B \mid X_0 \in A].
\]
There are many applications that involve an approximation of $C$ for different types of functions $f, g$; they can be found in [10].
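On a finite state space, $C$ can be evaluated exactly and compared against a naive sampling estimate. The following sketch does this for indicator functions on the three-state chain from above; the sets $A$ and $B$ are hypothetical choices made for illustration.

```python
# C = <f, Pg>_mu / <f, 1>_mu for f = 1_A, g = 1_B on a finite reversible
# chain, compared with a Monte Carlo estimate of P[X_1 in B | X_0 in A].
# For reversible chains, Pg(x) = E_x[g(X_1)] is a matrix-vector product (Eq. (2)).
import numpy as np

rng = np.random.default_rng(0)
P   = np.array([[0.0, 0.5, 0.5],
                [0.5, 0.0, 0.5],
                [0.5, 0.5, 0.0]])
mu  = np.array([1/3, 1/3, 1/3])

f = np.array([1.0, 0.0, 1.0])            # f = 1_A with A = {0, 2}
g = np.array([0.0, 1.0, 0.0])            # g = 1_B with B = {1}

C = (f * (P @ g) * mu).sum() / (f * mu).sum()

# Monte Carlo: draw X_0 from mu restricted to A, make one step, count hits of B.
N  = 20_000
x0 = rng.choice(3, size=N, p=(mu * f) / (mu * f).sum())
x1 = np.array([rng.choice(3, p=P[i]) for i in x0])
print(C, g[x1].mean())                   # exact value 0.5 and an estimate near it
```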

3 Computation of the Local Error

To compute the error locally, we rewrite
\[
C = \frac{\langle f, P g \rangle_\mu}{\langle f, \mathbb{1} \rangle_\mu} = \int_E P g(x)\, \mu_i(dx) \qquad \text{with} \qquad \mu_i(A) = \frac{1}{\langle f, \mathbb{1} \rangle_\mu} \int_A f(x)\, \mu(dx).
\]
This term can be approximated by Monte Carlo methods [5]. To do so, we need random variables $Y_1, \dots, Y_N \colon \Omega \to E$ distributed according to $\mu_i$. Then we can approximate the term $C$ by
\[
\tilde C := \frac{1}{N} \sum_{k=1}^N P g(Y_k).
\]
The error can be stated as
\[
\| C - \tilde C \|_{L^2(P)}^2 = \frac{\mathrm{VAR}[P g(Y_1)]}{N}.
\]
This follows from $\mathbb{E}[\tilde C] = C$ and
\[
\| C - \tilde C \|_{L^2(P)}^2 = \mathrm{VAR}[\tilde C] = \frac{1}{N^2} \sum_{i=1}^N \mathrm{VAR}[P g(Y_1)] = \frac{\mathrm{VAR}[P g(Y_1)]}{N}.
\]
In this case the variance is simply given as
\[
\mathrm{VAR}[P g(Y_1)] = \mathbb{E}[P g(Y_1)^2] - \mathbb{E}[P g(Y_1)]^2 = \mathbb{E}_{\mu_i}\big[ \mathbb{E}_x[g(\tilde X_1)]^2 \big] - \mathbb{E}_{\mu_i}\big[ \mathbb{E}_x[g(\tilde X_1)] \big]^2.
\]
In the case where the Markov chain $(X_n)$ is reversible, the variance simplifies to
\[
\mathrm{VAR}[P g(Y_1)] = \mathbb{E}_{\mu_i}\big[ \mathbb{E}_x[g(X_1)]^2 \big] - \mathbb{E}_{\mu_i}\big[ \mathbb{E}_x[g(X_1)] \big]^2.
\]
If further $g = \mathbb{1}_A$, we obtain
\[
\mathrm{VAR}[P g(Y_1)] = \mathbb{E}_{\mu_i}\big[ P_x[X_1 \in A]^2 \big] - \mathbb{E}_{\mu_i}\big[ P_x[X_1 \in A] \big]^2.
\]

3.1 Example

Consider the solution $(Y_t)_{t \ge 0}$ of the stochastic differential equation
\[
dY_t = -\nabla V(Y_t)\, dt + \sigma\, dB_t
\]
with $\sigma > 0$ and $V(x) = (x-2)^2 (x+2)^2$. The potential $V$ and a realization of $(Y_t)$ are shown in Figure 1. The solution $(Y_t)$ is known to be reversible, see [7, Proposition 4.5]. For any value $\tau > 0$ the family $(X_n)_{n \in \mathbb{N}}$ with $X_n := Y_{n\tau}$ is a reversible Markov chain. Let us partition $[-3, 3]$ into twenty equidistant sets $(A_l)_{l=1,\dots,20}$. We compute a matrix $M \in \mathbb{R}^{20 \times 20}$ with
\[
M(i, j) = \mathbb{E}_{\mu_i}\big[ P_x[X_1 \in A_j]^2 \big] - \mathbb{E}_{\mu_i}\big[ P_x[X_1 \in A_j] \big]^2
\]
with $\mu_i(B) = \mu(A_i \cap B) / \mu(A_i)$. In each set $A_i$ we sample points $x_1^i, \dots, x_N^i$ with the Metropolis Monte Carlo method distributed according to $\mu_i$. For each point $x_j^i$ we sample $M$ trajectories starting in $x_j^i$ with endpoints $y_{k,j}^i$ for $k = 1, \dots, M$. We then approximate
\[
\mathbb{E}_{\mu_i}\big[ P_x[X_1 \in A_j]^2 \big] \approx \frac{1}{N} \sum_{l=1}^N \left( \frac{1}{M} \sum_{k=1}^M \mathbb{1}_{A_j}(y_{k,l}^i) \right)^2
\]
and
\[
\mathbb{E}_{\mu_i}\big[ P_x[X_1 \in A_j] \big]^2 \approx \left( \frac{1}{N} \sum_{l=1}^N \frac{1}{M} \sum_{k=1}^M \mathbb{1}_{A_j}(y_{k,l}^i) \right)^2.
\]
The locally sampled points and the matrix $M$ are shown in Figure 2. It shows that in order to compute the transition probability from $0$ to $2$ or from $0$ to $-2$, many trajectories are needed.
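A sketch of this experiment could look as follows; the choices below are not from the report: an Euler-Maruyama discretization of the SDE, starting points drawn uniformly within each cell as a crude stand-in for $\mu_i$, and small illustrative values for $\sigma$, $\tau$, $N$ and $M$.

```python
# Local scheme of Section 3.1 (a sketch): estimate
#   M(i,j) = E_{mu_i}[P_x[X_1 in A_j]^2] - E_{mu_i}[P_x[X_1 in A_j]]^2
# for the double-well SDE dY = -V'(Y) dt + sigma dB, V(x) = (x-2)^2 (x+2)^2.
import numpy as np

rng   = np.random.default_rng(1)
sigma = 1.0
tau   = 0.1                  # lag time of the chain X_n = Y_{n tau}
dt    = 1e-3                 # integrator step (illustrative)

def grad_V(x):
    # V(x) = (x-2)^2 (x+2)^2  =>  V'(x) = 4 x (x^2 - 4)
    return 4.0 * x * (x**2 - 4.0)

def evolve(x, t):
    # Euler-Maruyama integration of dY = -V'(Y) dt + sigma dB over time t.
    for _ in range(int(round(t / dt))):
        x = x - grad_V(x) * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

edges = np.linspace(-3.0, 3.0, 21)   # twenty equidistant cells A_1, ..., A_20
N, M  = 50, 20                       # points per cell, trajectories per point

M_loc = np.zeros((20, 20))
for i in range(20):
    x0 = rng.uniform(edges[i], edges[i + 1], size=N)      # stand-in for mu_i
    y  = evolve(np.repeat(x0, M), tau).reshape(N, M)      # endpoints y^i_{k,l}
    for j in range(20):
        p = ((y >= edges[j]) & (y < edges[j + 1])).mean(axis=1)  # P_x[X_1 in A_j]
        M_loc[i, j] = (p**2).mean() - p.mean()**2

print(M_loc.max())   # largest local variance entry
```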

Figure 1: Potential (left) and trajectory (right).

Figure 2: Local starting points (left) and matrix $M$ (right).

4 Computation of the Global Error

We assume in this section that $(X_n)$ is reversible. If we denote by
\[
\varphi(x) := \frac{P f(x)\, g(x)}{\langle f, \mathbb{1} \rangle_\mu},
\]
then one can rewrite
\[
C = \frac{\langle f, P g \rangle_\mu}{\langle f, \mathbb{1} \rangle_\mu} = \frac{\langle P f, g \rangle_\mu}{\langle f, \mathbb{1} \rangle_\mu} = \int_E \varphi(x)\, \mu(dx).
\]
Analogous to the local case, we can approximate $C$ by Monte Carlo methods. However, this time we need random variables $Y_1, \dots, Y_N \colon \Omega \to E$ distributed according to $\mu$. Then we can approximate the term $C$ by
\[
\tilde C := \frac{1}{N} \sum_{k=1}^N \varphi(Y_k).
\]
Again, the error can be stated as
\[
\| C - \tilde C \|_{L^2(P)}^2 = \frac{\mathrm{VAR}[\varphi(Y_1)]}{N}.
\]
The main result of the article is the following theorem.

Theorem 1. It holds
\[
\mathrm{VAR}[\varphi(Y_1)] = \frac{\mathbb{E}_{\nu_f}[g^2(X_1)\, f(X_2)]}{\mathbb{E}[f(X_0)]} - \mathbb{E}_{\nu_f}[g(X_1)]^2
\]
with
\[
\nu_f(A) = \int_A \frac{f(X_0(\omega))}{\int_\Omega f(X_0(\tilde\omega))\, P(d\tilde\omega)}\, P(d\omega).
\]
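On a finite state space both sides of Theorem 1 reduce to finite sums over two-step paths, so the identity can be checked numerically. The sketch below does this for the reversible three-state chain used earlier, with arbitrarily chosen non-negative $f$ and $g$ (hypothetical values, not from the report); here $\mathbb{E}_{\nu_f}[h(X_1, X_2)] = \mathbb{E}[f(X_0)\, h(X_1, X_2)] / \mathbb{E}[f(X_0)]$.

```python
# Numerical check of Theorem 1: VAR[phi(Y_1)] computed directly equals
# E_{nu_f}[g^2(X_1) f(X_2)] / E[f(X_0)] - E_{nu_f}[g(X_1)]^2 (exact sums).
import numpy as np

P  = np.array([[0.0, 0.5, 0.5],
               [0.5, 0.0, 0.5],
               [0.5, 0.5, 0.0]])
mu = np.array([1/3, 1/3, 1/3])

f = np.array([1.0, 0.2, 0.7])       # hypothetical non-negative f, g in L^2(mu)
g = np.array([0.3, 1.0, 0.5])

Pf  = P @ f                          # pointwise evaluation, Equation (2)
phi = Pf * g / (f @ mu)              # phi = Pf g / <f, 1>_mu
lhs = (phi**2 @ mu) - (phi @ mu)**2  # VAR[phi(Y_1)] with Y_1 ~ mu

Ef    = mu @ f                       # E[f(X_0)]
Efg   = (mu * f) @ P @ g             # E[f(X_0) g(X_1)]
Efggf = (mu * f) @ P @ (g**2 * Pf)   # E[f(X_0) g^2(X_1) f(X_2)]
rhs   = Efggf / Ef**2 - (Efg / Ef)**2

assert np.isclose(lhs, rhs)
print(lhs, rhs)
```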

Figure 3: Global starting points (left) and matrix $M$ (right).

The theorem shows that one can compute the exact error. Before we step into the proof, we will first show some applications of the theorem.

4.1 Example

In the case where $f(x) = \mathbb{1}_A(x)$ and $g(x) = \mathbb{1}_B(x)$, the variance simplifies to
\[
\mathrm{VAR}[\varphi(Y_1)] = \frac{P[X_2 \in A, X_1 \in B \mid X_0 \in A]}{P[X_0 \in A]} - P[X_1 \in B \mid X_0 \in A]^2.
\]
We consider again the example from Section 3.1. This time, we compute starting points globally with the Metropolis Monte Carlo method distributed according to $\mu$, and we compute the matrix
\[
M_{ij} = \frac{P[X_2 \in A_i, X_1 \in A_j \mid X_0 \in A_i]}{P[X_0 \in A_i]} - P[X_1 \in A_j \mid X_0 \in A_i]^2.
\]
The results are shown in Figure 3. One may note that the entries of the global variance matrix are significantly higher than those of the local variance matrix. The reason for this is the following. In the global case, the entry $M(i, j)$ indicates how many global trajectories one needs to keep the error of $P[X_1 \in A_j \mid X_0 \in A_i]$ small. However, one can only use those trajectories with starting points in $A_i$. In the local case, we only sample starting points in $A_i$ and thus do not need to dismiss any trajectories; this implies that we need fewer trajectories, and it becomes clear why $M(i, j)$ is smaller in the local case. Especially, the global scheme has difficulties capturing the transition probability from a transition region (here the area near $0$) to a transition region, which is in turn no challenge for the local scheme.
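For brevity, the following sketch illustrates the global error formula on the three-state chain rather than on the SDE example, taking the sets $A_i$ to be the singletons $\{i\}$ (an assumption made here, not in the paper). It compares the exact entries with an estimate from globally sampled two-step trajectories.

```python
# Global scheme (a sketch): M_ij = P[X_2 in A_i, X_1 in A_j | X_0 in A_i]
#   / P[X_0 in A_i] - P[X_1 in A_j | X_0 in A_i]^2, with A_i = {i}.
import numpy as np

rng = np.random.default_rng(2)
P   = np.array([[0.0, 0.5, 0.5],
                [0.5, 0.0, 0.5],
                [0.5, 0.5, 0.0]])
mu  = np.array([1/3, 1/3, 1/3])
n   = len(mu)

# Exact entries: P[X_2=i, X_1=j | X_0=i] = P_ij P_ji and P[X_0=i] = mu_i.
M_exact = P * P.T / mu[:, None] - P**2

# Monte Carlo estimate from N global two-step trajectories with X_0 ~ mu.
N  = 50_000
x0 = rng.choice(n, size=N, p=mu)
x1 = np.array([rng.choice(n, p=P[k]) for k in x0])
x2 = np.array([rng.choice(n, p=P[k]) for k in x1])

M_mc = np.empty((n, n))
for i in range(n):
    inA = x0 == i
    for j in range(n):
        p_j  = (inA & (x1 == j)).mean() / inA.mean()              # P[X_1=j | X_0=i]
        p_ij = (inA & (x1 == j) & (x2 == i)).mean() / inA.mean()  # P[X_2=i, X_1=j | X_0=i]
        M_mc[i, j] = p_ij / inA.mean() - p_j**2

print(np.abs(M_exact - M_mc).max())   # shrinks like 1/sqrt(N)
```

As in the figures, only trajectories that start in $A_i$ contribute to the entry $M_{ij}$, which is why the global scheme wastes samples compared to the local one.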

5 Inequalities for Reversible Markov Chains

Since $\mathrm{VAR}[\varphi(Y_1)] \ge 0$, we obtain
\[
\frac{\mathbb{E}_{\nu_f}[g^2(X_1)\, f(X_2)]}{\mathbb{E}[f(X_0)]} \ge \mathbb{E}_{\nu_f}[g(X_1)]^2
\]
for any $f, g \in L^2(\mu)$. If we set $g(x) = \mathbb{1}$, we obtain
\[
\mathbb{E}_{\nu_f}[f(X_2)] \ge \mathbb{E}[f(X_0)].
\]
For the case where $f = \mathbb{1}_A$ for $A \in \Sigma$, this simplifies to
\[
P[X_2 \in A \mid X_0 \in A] \ge P[X_0 \in A].
\]
In short: reversible Markov chains prefer to return to a set over simply being in it. If $E = \{1, \dots, n\}$ is finite and the Markov chain is described by a transition matrix $P \in \mathbb{R}^{n \times n}$ which is reversible with respect to the probability vector $\pi \in \mathbb{R}^n$, i.e. $\pi_i p_{ij} = \pi_j p_{ji}$, then the inequality states
\[
P^2(i, i) \ge \pi_i.
\]
This shows, by the way, that from
\[
\sum_{i=1}^n P^2(i, i) = 1
\]
it immediately follows that $P^2(i, i) = \pi_i$ for all $i$.

Another inequality can be obtained as follows. Fix some sets $A, B \in \Sigma$ with $A \cap B = \emptyset$. Define the function $f$ at a point $x$ to be the probability that the Markov chain $(X_n)$ next hits set $A$ before hitting set $B$ when starting in $x$. Since the Markov chain is reversible, this is equivalent to the probability that one last visited set $A$ instead of set $B$. This function is also known as the committor function [9]. The term $\mathbb{E}_{\nu_f}[f(X_2)]$ is then the conditional probability that one came last from $A$ instead of $B$, moves for two steps, and then next returns to $A$ instead of $B$. The probability of this event is thus always at least $\mathbb{E}[f(X_0)]$, which simply represents the probability that one came last from $A$ instead of $B$.

Example. Consider the finite reversible Markov chain
\[
P = \begin{pmatrix} 0 & \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} & 0 \end{pmatrix}
\]
visualized in Figure 4 with stationary measure $\pi = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$. In this case we have
\[
P^2(i, i) = \tfrac{1}{2} \ge \tfrac{1}{3} = \pi_i.
\]
The probability that one came last from $A := \{1\}$ instead of $B := \{2\}$ is given by
\[
\mathbb{E}[f(X_0)] = \tfrac{1}{3} \cdot 1 + \tfrac{1}{3} \cdot 0 + \tfrac{1}{3} \cdot \tfrac{1}{2} = \tfrac{1}{2}.
\]
The conditional probability that one came last from $A$ instead of $B$, moves for two steps, and then next returns to $A$ instead of $B$ is given by
\[
\mathbb{E}_{\nu_f}[f(X_2)] = \tfrac{2}{3} \left( \tfrac{1}{4} \cdot 1 + \tfrac{1}{4} \cdot \tfrac{1}{2} + \tfrac{1}{4} \cdot 1 + \tfrac{1}{4} \cdot 0 \right) + 0 + \tfrac{1}{3} \left( \tfrac{1}{4} \cdot \tfrac{1}{2} + \tfrac{1}{4} \cdot 1 + \tfrac{1}{4} \cdot \tfrac{1}{2} + \tfrac{1}{4} \cdot 0 \right) = \tfrac{7}{12},
\]
where we have used that $\nu_f(\{1\}) = \tfrac{2}{3}$, $\nu_f(\{2\}) = 0$ and $\nu_f(\{3\}) = \tfrac{1}{3}$. Thus, we have as predicted
\[
\tfrac{7}{12} = \mathbb{E}_{\nu_f}[f(X_2)] \ge \mathbb{E}[f(X_0)] = \tfrac{1}{2}.
\]

Figure 4: Visualized simple Markov chain.
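These numbers can be confirmed in exact arithmetic; the short sketch below recomputes them, taking only the committor values $f = (1, 0, \tfrac{1}{2})$ from the text.

```python
# Exact-arithmetic check of the Section 5 example. Only the committor values
# f = (1, 0, 1/2) are taken from the text; the rest is recomputed.
from fractions import Fraction as F

P  = [[F(0), F(1, 2), F(1, 2)],
      [F(1, 2), F(0), F(1, 2)],
      [F(1, 2), F(1, 2), F(0)]]
pi = [F(1, 3)] * 3

# Two-step transition matrix and the inequality P^2(i, i) >= pi_i:
P2 = [[sum(P[i][k] * P[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
print([P2[i][i] for i in range(3)])        # [1/2, 1/2, 1/2], each >= 1/3

f = [F(1), F(0), F(1, 2)]                  # committor between A = {1} and B = {2}

Ef0 = sum(pi[i] * f[i] for i in range(3))  # E[f(X_0)] = 1/2
Ef2 = sum(pi[i] * f[i] * P2[i][j] * f[j]
          for i in range(3) for j in range(3)) / Ef0   # E_{nu_f}[f(X_2)] = 7/12
print(Ef0, Ef2, Ef2 >= Ef0)                # 1/2 7/12 True
```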

6 The proof

We will now give the proof of Theorem 1. First, note that we have
\[
\mathrm{VAR}[\varphi(Y_1)] = \mathbb{E}[\varphi^2(Y_1)] - \mathbb{E}[\varphi(Y_1)]^2.
\]
For sets $A, B \in \Sigma$, we have
\[
\mathbb{E}[\mathbb{1}_A(X_0)\, \mathbb{1}_B(X_1)] = P[X_0 \in A, X_1 \in B] = \int_E \int_E \mathbb{1}_A(x)\, \mathbb{1}_B(y)\, p(x, dy)\, \mu(dx).
\]
For measurable functions $f, g \ge 0$, we then obtain
\[
\mathbb{E}[f(X_0)\, g(X_1)] = \int_E \int_E f(x)\, g(y)\, p(x, dy)\, \mu(dx). \tag{4}
\]
For $\varphi(x) = P f(x)\, g(x) / \langle f, \mathbb{1} \rangle_\mu$ we first compute
\[
\begin{aligned}
\mathbb{E}[\varphi^2(Y_1)]
&= \frac{1}{\langle f, \mathbb{1} \rangle_\mu^2} \int_E \big( P f(x)\, g(x) \big)^2\, \mu(dx)
 = \frac{1}{\langle f, \mathbb{1} \rangle_\mu^2} \int_E P f(x)\, \big( g^2\, P f \big)(x)\, \mu(dx) \\
&= \frac{1}{\langle f, \mathbb{1} \rangle_\mu^2} \int_E f(x)\, P\big( g^2\, P f \big)(x)\, \mu(dx)
 = \frac{1}{\langle f, \mathbb{1} \rangle_\mu^2} \int_E \int_E f(x)\, g^2(y)\, P f(y)\, p(x, dy)\, \mu(dx),
\end{aligned}
\]
where the second line uses that $P$ is self-adjoint on $L^2(\mu)$, leading to
\[
\mathbb{E}[\varphi^2(Y_1)] = \frac{1}{\langle f, \mathbb{1} \rangle_\mu^2} \int_E \int_E f(x)\, g^2(y)\, \mathbb{E}_y[f(X_1)]\, p(x, dy)\, \mu(dx),
\]
where we have used Equation (2). From Equation (4) we get
\[
\mathbb{E}[\varphi^2(Y_1)] = \frac{1}{\langle f, \mathbb{1} \rangle_\mu^2}\, \mathbb{E}\big[ f(X_0)\, g^2(X_1)\, \mathbb{E}_{X_1}[f(X_1)] \big].
\]
Further, according to [6, Equation (3.28) in Chapter 3], we have
\[
\mathbb{E}_{X_1}[f(X_1)] = \mathbb{E}[f(X_2) \mid \mathcal{F}_1],
\]
where $\mathcal{F}_1 = \sigma(X_0, X_1)$. Thus, we obtain
\[
\mathbb{E}\big[ f(X_0)\, g^2(X_1)\, \mathbb{E}_{X_1}[f(X_1)] \big] = \mathbb{E}\big[ f(X_0)\, g^2(X_1)\, \mathbb{E}[f(X_2) \mid \mathcal{F}_1] \big].
\]
Also note that we have $\mathbb{E}\big[ \mathbb{1}_A\, \mathbb{E}[f(X_2) \mid \mathcal{F}_1] \big] = \mathbb{E}[\mathbb{1}_A\, f(X_2)]$ for any $A \in \mathcal{F}_1$. Since $f(X_0)\, g^2(X_1)$ is measurable with respect to $\mathcal{F}_1$, we obtain
\[
\mathbb{E}\big[ f(X_0)\, g^2(X_1)\, \mathbb{E}[f(X_2) \mid \mathcal{F}_1] \big] = \mathbb{E}\big[ f(X_0)\, g^2(X_1)\, f(X_2) \big].
\]
Thus we finally get
\[
\mathbb{E}[\varphi^2(Y_1)] = \frac{\mathbb{E}_{\nu_f}[g^2(X_1)\, f(X_2)]}{\mathbb{E}_\mu[f]}
\]
with
\[
\nu_f(A) = \int_A \frac{f(X_0(\omega))}{\int_\Omega f(X_0(\tilde\omega))\, P(d\tilde\omega)}\, P(d\omega)
\]
for all measurable sets $A$. Finally, we have
\[
\begin{aligned}
\mathbb{E}[\varphi(Y_1)]
&= \frac{1}{\mathbb{E}_\mu[f]} \int_E P f(x)\, g(x)\, \mu(dx)
 = \frac{1}{\mathbb{E}_\mu[f]} \int_E f(x)\, P g(x)\, \mu(dx) \\
&= \frac{1}{\mathbb{E}_\mu[f]} \int_E \int_E f(x)\, g(y)\, p(x, dy)\, \mu(dx)
 = \frac{1}{\mathbb{E}_\mu[f]}\, \mathbb{E}[f(X_0)\, g(X_1)]
 = \mathbb{E}_{\nu_f}[g(X_1)].
\end{aligned}
\]
This gives us the exact formula for the variance.

Acknowledgement

The author gratefully thanks the Berlin Mathematical School for financial support.

References

[1] Heinz Bauer. Maß- und Integrationstheorie. Walter de Gruyter, Berlin; New York, 2nd edition, 1992.

[2] John R. Baxter and Jeffrey S. Rosenthal. Rates of convergence for everywhere-positive Markov chains. Statistics & Probability Letters, 22:333–338, 1995.

[3] Eberhard Hopf. The general temporally discrete Markoff process, volume 3. 1954.

[4] Wilhelm Huisinga. Metastability of Markovian systems. PhD thesis, Freie Universität Berlin, 2001.

[5] David J. C. MacKay. Introduction to Monte Carlo methods. In M. I. Jordan, editor, Learning in Graphical Models, NATO Science Series, pages 175–204. Kluwer Academic Press, 1998.

[6] Sean Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Cambridge University Press, New York, second edition, 2009.

[7] Grigorios A. Pavliotis. Stochastic Processes and Applications. Springer, 2014.

[8] Daniel Revuz. Markov Chains. North-Holland Mathematical Library, 2nd edition, 1984.

[9] Christof Schütte and Marco Sarich. Metastability and Markov State Models in Molecular Dynamics: Modeling, Analysis, Algorithmic Approaches. 2014. A co-publication of the AMS and the Courant Institute of Mathematical Sciences at New York University.

[10] Christof Schütte and Marco Sarich. A critical appraisal of Markov state models. The European Physical Journal Special Topics, pages 1–18, 2015.

[11] Marcus Weber. Meshless Methods in Conformation Dynamics. PhD thesis, Freie Universität Berlin, 2006.