Lecture notes for Numerik IVc - Numerics for Stochastic Processes, Wintersemester 2012/2013. Instructor: Prof. Carsten Hartmann

Size: px

Start display at page:

Download "Lecture notes for Numerik IVc - Numerics for Stochastic Processes, Wintersemester 2012/2013. Instructor: Prof. Carsten Hartmann"

Scott Moody
5 years ago
Views:

1 Lecture notes for Numerik IVc - Numerics for Stochastic Processes, Wintersemester 212/213. Instructor: Prof. Carsten Hartmann Scribe: H. Lie December 15, 215 (updated version) Contents 1 Day 1, : Modelling issues Time-discrete Markov chains Markov jump processes Stochastic differential equations Day 2, : Probability theory basics Conditioning of random variables Stochastic processes Markov processes Day 3, : Brownian motion Scaled random walks Constructing Brownian motion Karhunen-Loève expansion Day 4, : Brownian motion, cont d More properties of Brownian motion Brownian bridge Simulating a Brownian bridge Day 5, : Stochastic integration Integrating against Brownian motion The Itô integral for simple functions Ambiguities in defining stochastic integrals Day 6, : Stochastic integration, cont d Itô isometry Itô Integral for random integrands Ornstein-Uhlenbeck process Day 7, : Itô calculus Functions of bounded and quadratic variation Itô s formula Geometric Brownian motion

2 8 Day 8, : Itô stochastic differential equations Martingale representation theorem Stochastic differential equations: Existence and uniqueness Applications from physics, biology and finance Day 9, : Stochastic Euler method The Euler-Maruyama scheme Convergence of Euler s method Some practical issues Day 1, : Kolmogorov backward equation Further remarks on the Euler-Maruyama scheme Strongly continuous semigroups Feynman-Kac theorem (backward equation) Day 11, : Exit times Some more semigroup theory Exit times: more Feynman-Kac formulae Applications of mean first exit time formulae Day 12, : Fokker-Planck equation Propagation of densities Adjoint semigroup: transfer operator Invariant measures Day 13, : Long-term behaviour of SDEs Invariant measures, cont d Convergence to equilibrium Spectral gaps and the space L 2 (R n, µ ) Day 14, : Markov chain approximations Ergodicity Markov Chain Monte Carlo (MCMC) Metropolis-Adjusted Langevin Algorithm (MALA) Day 15, : Small-noise asymptotics The Ising model as an illustration Laplace s method Kramers problem References 79 2

Figure 1.1: Simulation of a butane molecule and its approximation by a 3-state Markov chain (states in blue, green and yellow; solvent molecules not shown). 1 Day 1, 16.1.212: Modelling issues 1.

For a state space S = {1,..., n}, the transition probabilities p ij satisfy p ij = P (X t+1 = j X t = i) and yield a row-stochastic matrix P = (p ij ) i,j S. 1.

3 Figure 1.1: Simulation of a butane molecule and its approximation by a 3-state Markov chain (states in blue, green and yellow; solvent molecules not shown). 1 Day 1, : Modelling issues 1.1 Time-discrete Markov chains Time index set I is discrete, e.g. I N and state space S is countable or finite, e.g. S = {s 1, s 2, s 3 } (see Figure 1.1). Key objects are transition probabilities. For a state space S = {1,..., n}, the transition probabilities p ij satisfy p ij = P (X t+1 = j X t = i) and yield a row-stochastic matrix P = (p ij ) i,j S. 1.2 Markov jump processes These are time-continuous, discrete state-space Markov chains. Time index set I R +, S discrete. For a fixed time step h >, the transition probabilities are given by (see Figure 1.2) P (X t+h = s j X t = s i ) = hl ij + o(h) where L = (l ij ) i,j S and P h are matrices satisfying P h = exp (hl). Note: the matrix L is row sum zero, i.e. j l ij =. The waiting times for the Markov chain in any state s i are exponentially distributed in the sense that P (X t+s = s i, s [, τ) X t = s i ) = exp (l ii τ) 3

4 Figure 1.2: Simulation of butane: typical time series of the central dihedral angle (blue: metastable diffusion process, red: Markov jump process) and the average waiting time is l ii (by definition of the exponential distribution). Note: the spectrum of the matrix P h is contained within the unit disk, i.e. for every eigenvalue λ of P h, λ 1. This property is a consequence of P h being row-stochastic, i.e. that j P h,ij = 1. Since P h = exp(hl) it follows that σ(p h ) D := { x R 2 x 1 } σ(l) C = {y C Re(y) } Example 1.1. Suppose one has a reversible reaction in which one has a large collection of N molecules of the same substance. The molecules can be either in state A or state B and the molecules can change between the two states. Let k + denote the rate of the reaction in which molecules change from state A to B and let k denote the rate at which molecules change from state B to A. For t >, consider the quantity µ A i (t) := P (number of molecules in state A at time t is i) where i = {,..., N}. One can define quantities µ B i (t) in a similar way, and one can construct balance laws for these quantities, e.g. dµ A i (t) dt = k + µ A i+1(t) + k µ A i 1(t) (k + + k )µ A i (t). The above balance law can be written in vector notation using a tridiagonal matrix L. By adding an initial condition one can obtain an initial value problem dµ A (t) dt = L µ A (t), µ A () = µ. The solution of the initial value problem above is µ A (t) = µ exp ( tl ). 4

5 1.3 Stochastic differential equations These are time-continuous, continuous state space Markov chains. Stochastic differential equations (SDEs) may be considered to be ordinary differential equations (ODEs) with an additional noise term (cf. Figure 1.2). Let b : R n R n be a smooth vector field and let x(t) be a deterministic dynamical system governed by the vector field b( ). Then x(t) evolves according to dx dt = b(x), x() = x. (1.1) Now let (B t ) t> be Brownian motion in R d, and let (X t ) t> be a dynamical system in R d which evolves according to the equation dx t dt = b(x t ) + db t dt. (1.2) The additional term dbt dt represents noise, or random perturbations from the environment, but is not well-defined because the paths of Brownian motion are nowhere differentiable. Therefore, one sometimes writes which is shorthand for X t = X + dx t = b(x t )dt + db t, t b(x t )dt + t db t. The most common numerical integration method for SDEs is the forward Euler method. If x is a C 1 function of time t, then dx x(s + h) x(s) dt = lim. t=s h h The forward Euler method for ODEs of the form (1.1) is given by X t+h = X t + hb(x t ) and for SDEs of the form (1.2) it is given by X t+h = X t + hb(x t ) + ξ h where < h 1 is the integration time step and the noise term ξ in the Euler method for SDEs is modelled by a mean-zero Gaussian random variable. For stochastic dynamical systems which evolve according to SDEs as in (1.2), one can consider the probability that a system at some point x R d will be in a set A R d after a short time h > : P (X t+h A X t = x). The associated transition probability density functions of these stochastic dynamical systems are Gaussian because the noise term in (1.2) is Gaussian. What has been the generator matrix L in case of a Markov jump process is an infinite-dimensional operator acting on a suitable Banach space. Specifically, E x [f(x t )] f(x ) Lf(x ) = lim, t t provided that the limit exists. Here f : R n R is any measurable function and E x [ ] denotes the expectation over all random paths of X t satisfying X = x. L is a second-order differential operator if f is twice differentiable. 5

6 2 Day 2, : Probability theory basics Suggested references: [1, 16] Let (Ω, E, P) be a probability space, where Ω is a set and E 2 Ω is a σ-field or σ-algebra on Ω, and P is a probability measure (i.e., P is a nonnegative, countably additive measure on (Ω, E) with the property P(Ω) = 1). 2.1 Conditioning of random variables Let A E be a set of nonzero measure, i.e. P(A) > and define E A to be the set of all subsets of A which are elements of E, i.e. E A := {E A E E}. Definition 2.1 (Conditional probability, part I). For an event A and an event E E A, the conditional probability of E given A is P(E A) := P(E A). P(A) Remark 2.2. Think of P A := P( A) as a probability measure on the measurable space (A, E A ). Given a set B E, the characteristic or indicator function χ B : Ω {, 1} satisfies { 1 x B χ B (x) = x / B. Definition 2.3 (Conditional expectation, part I). Let X : Ω R be a random variable with finite expectation with respect to P. The conditional expectation of X given an event A is Remark 2.4. We have E(X A) = E [Xχ A] P(A) E(X A) = 1 P(A) Remark 2.5. Observe that P(E A) = E [χ E A]. A. XdP = XdP A. Up to this point we have only considered the case where A satisfies P(A) >. We now consider the general case. Definition 2.6 (Conditional expectation, part II). Let X : Ω R be an integrable random variable with respect to P and let F E be any sub-sigma algebra of E. The conditional expectation of X given F is a random variable Y := E [X F] with the following properties: Y is measurable with respect to F: B B(R), Y 1 (B) F. We have XdP = Y dp F F. F F 6

7 Remark 2.7. The second condition in the last definition amounts to the projection property as can be seen by noting that E [Xχ F ] = XdP = Y dp = E [Y χ F ] = E [E [X F] χ F ]. F F By the Radon-Nikodym theorem [16], the conditional expectation exists and is unique up to P-null sets. Definition 2.8 (Conditional probability, part II). Define the conditional probability of an event E E given A by P(E A) := E [χ E A] Exercise 2.9. Let X, Y : Ω R and scalars a, b R. Prove the following properties of the conditional expectation: (Linearity): (Law of total expectation): (Law of total probability): E [ax + by A] = ae [X A] + be [Y A]. E [X] = E [X A] + P(A) + E [X A c ] P(A c ) P(B) = P(B A)P(A) + P(B A c )P(A c ). Example 2.1. The following is a collection of standard examples. Gaussian random variables: Let X 1, X 2 be jointly Gaussian with distribution N(µ, Σ), where ( ) ( ) E[X1 ] a b µ =, Σ = E[X 2 ] b c such that Σ is positive definite. The density of the distribution is [ 1 ρ(x) = exp 1 ] det(2πσ) 2 (x µ) Σ (x µ) (Ex.: Compute the distribution of X 1 given that X 2 = a for some a R.) (Conditioning as coarse-graining): Let Z = {Z i } M i=1 i.e. Ω = M i=1 Z i with Z i Z j = and define be a partition of Ω, Y (ω) = M E [X Z i ] χ Zi (ω). i=1 Then Y = E [X Z] is a conditional expectation (cf. Figure 2.1) (Exponential waiting times): exponential waiting times are random variables T : Ω [, ) with the memoryless property: P (T > s + t T > s) = P (T > t). This property is equivalent to the statement that T has an exponential distribution, i.e. that P(T > t) = exp( λt) for a parameter value λ >. 7

8 Figure 2.1: Simulation of butane, coarse-grained into three states Z 1, Z 2, Z Stochastic processes Definition 2.11 (Stochastic process). A stochastic process X = {X t } t I is a collection of random variables on a probability space (Ω, E, P) indexed by a parameter t I [, ). We call X discrete in time if I N continuous in time if I = [, T ] for any T <. How does one define probabilities for X? We provide a basic argument to illustrate the possible difficulties in defining the probability of a stochastic process in an unambiguous way. By definition of a stochastic process, X t = X t (ω) is measurable for every fixed t I, but if one has an event of the form E = {ω Ω X t (ω) [a, b] t I} how does one define the probability of this event? If t is discrete, the σ-additivity of P saves us, together with the measurability of X t for every t. If, however, the process is time-continuous, X t is defined only almost surely (a.s.) and we are free to change X t on a set A t with P(A t ) =. By this method we can change X t on A = t I A t, The problem now is that P(A) need not be equal to zero even though P(A t ) = t I. Furthermore, P(E) may not be uniquely defined. So what can we do? The solution to the question of how to define probabilities for stochastic processes is to use finite-dimensional distributions or marginals. Definition (Finite dimensional distributions): Fix d N, t 1,..., t d I. The finite-dimensional distributions of the stochastic process X for (t 1,..., t d ) are defined as µ t1,...,t d (B) := P (Xtk ) k=1,...,d (B) = P ({ω Ω (X t1 (ω),..., X td (ω)) B}) for B B(R d ). 8

9 Here and in the following we use the shorthand notation P Y := P Y 1 to denote the push forward of P by the random variable Y. Theorem (Kolmogorov Extension Theorem): Fix d N, t 1,..., t d I, and let µ t1,...,t d be a consistent family of finite-dimensional distributions, i.e. for any permutation π of (1,..., d), µ t1,...,t d (B 1... B d ) = µ (tπ(1),...,t π(d) (B π(1)... B π(d) ) For t 1,..., t d+1 I, we have that µ t1,...,t d+1 (B 1... B d R) = µ t1,...,t d (B 1... B d ). Then there exists a stochastic process X = (X t ) t I with µ t1,...t d as its finitedimensional distribution. Remark The Kolmogorov Extension Theorem does not guarantee uniqueness, not even P-a.s. uniqueness, and, as we will see later on, such a kind of uniqueness would not be a desirable property of a stochastic process. Definition (Filtration generated by a stochastic process X): Let F = {F t } t I with F s F t for s < t be a filtration generated by F t = σ ({X s s t}) is called the filtration generated by X. 2.3 Markov processes Definition A stochastic process X is a Markov process if where for some event E. P (X t+s A F s ) = P (X t+s A X s ) (2.1) P ( X s ) := P ( σ(x s )), P (E σ(x s )) := E [χ E σ(x s )] Remark If I is discrete, then X is a Markov process if P (X n+1 A X = x,..., X n = x n ) = P (X n+1 A X n = x n ) Example Consider a Markov Chain (X t ) t N on a continuous state space S R and let S be a σ-algebra on S. Let the evolution of (X t ) t N be described by the transition kernel p(, ) : S S [, 1] which gives the single-step transition probabilities: p(x, A) := P (X t+1 A X t = x) = q(x, y)dy. A In the above, A B(S) and q = dp dλ is the density of the transition kernel with respect to Lebesgue measure. The transition kernel has the property that 9

10 x S, p(x, ) is a probability measure on S, while for every A S, p(, A) is a measurable function on S. For a concrete example, consider the Euler-Maruyama discretization of an SDE for a fixed time step t, X n+1 = X n + tξ n+1, X =, where (ξ i ) i N are independent, identically distributed (i.i.d) Gaussian N (, 1) random variables. The process (X i ) i N is a Markov Chain on R. The transition kernel p(x, A) has the Gaussian transition density q(x, y) = 1 2π t exp [ 1 2 y x 2 Thus, if X n = x, then the probability that X n+1 A R is given by P (X n+1 A X n = x) = q(x, y)dy. 3 Day 3, : Brownian motion Suggested references: [1, 16, 19] 3.1 Scaled random walks Recapitulation: A stochastic process X = (X t ) t I is a collection of random variables X t : Ω R indexed by t I (e.g. I = [, )) on some probability space (Ω, E, P). A filtration F := (F t ) t I is a collection of increasing sigma-algebras satisfying F t F s for t < s. A stochastic process X is said to be adapted to F if (X s ) s t is F t -measurable. For example, if we define F t := σ(x s : s t), then X is adapted to F. The probability distribution of a random variable X is given in terms of its finite dimensional distributions. Example 3.1 (Continued from last week). Let I = N and consider a sequence (X n ) n N of random variables X n = Xn t governed by the relation A X t n+1 = X t n + tξ n+1, X t = a.s. (3.1) where t >, and (ξ k ) k N are i.i.d. random variables with E [ξ k ] = and E [ ξk] 2 = 1 (not necessarily Gaussian). To obtain a continuous-time stochastic process, the values of the stochastic process on non-integer time values may be obtained by linear interpolation (cf. Figure 3.1 below). We want to consider the limiting behaviour of the stochastic process in the limit as t goes to zero. Set t = t/n for a fixed terminal time t < and let N ( t ). Then, by the central limit theorem, t N XN t = ξ k tz (3.2) N k=1 1 t ].

11 X t X t t t 2 x X t t 5 5 x Figure 3.1: Sample paths of (Xn t ) n for t =.5,.2,.1 over the unit time interval [, 1], with piecewise constant interpolation. The lower right plot shows the histogram (i.e., the unnormalized empirical distribution) of (X1) t at time t = 1, averaged over 1 independent realizations. where Z N (, 1), and means convergence in distribution, i.e., weak- convergence of probability measures in duality with bounded continuous functions; equivalently, the limiting random variable is distributed according to N (, t). for fixed t = N t is the same as the distribution of a centred Gaussian random variable with variance t. As this is true for any t >, we can think of the limiting process as a continuous-time Markov process B = (B t ) t> with Gaussian transition probabilities, In other words the limiting distribution of the random variable X t N P (B t+s A B s = x) = = A q s,t (x, y)dy 1 2π t s A exp ( y x 2 2 t s ) dy. The stochastic process B is homogeneous or time-homogeneous because the transition probability density q s,t (, ) does not depend on the actual values of t and s, but only on their difference, i.e., q s,t (, ) = q s t (, ) (3.3) Remark 3.2. The choice of exponent 1/2 in t = ( t) 1/2 in (3.2) is unique. For ( t) α with α (, 1 2 ), the limit of X t n explodes in the sense that the 11

12 variance of the process blows up, i.e., E[(XN t)2 ] as N. On the other hand, for ( t) α with α > 1/2, XN t in probability as N. 3.2 Constructing Brownian motion Brownian motion is named after the British botanist, Robert Brown ( ), who first observed the random motion of pollen particles suspended in water. Einstein called the Brownian process Zitterbewegung in his 195 paper, Über die von der molekularkinetischen Theorie der Wärme gefordete Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen. The Brownian motion is a continuous-time stochastic process which is nowhere differentiable. It is also a martingale in the sense that on average, the particle stays in the same location at which it was first observed. In other words, the best estimate of where the particle will be after a time t > is its initial location. Definition 3.3. (Brownian motion) The stochastic process B = (B t ) t> with B t R is called the 1-dimensional Brownian motion or the 1-dimensional Wiener process if it has the following properties: (i) B = P-a.s. (ii) B has independent increments, i.e., for all s < t, (B t B s ) is a random variable which is independent of B r for r s. (iii) B has stationary, Gaussian increments, i.e., for t > s we have 1 B t B s D = Bt s (3.4a) D = N (, t s). (3.4b) (iv) Trajectories of Brownian motion are continuous functions of time. We now make precise some important notions: Definition 3.4. (Filtered probability space) A filtered probability space is a probability space (Ω, F, P) with a filtration (F t ) t such that t, F t F. Remark 3.5. One may write (Ω, F, F t, P) to refer to a filtered probability space. However, if one is working with a particular stochastic process X, one may consider the sigma-algebra F on Ω to simply be the smallest sigma-algebra which contains the union of the F X t, where F X t := σ(x s : s t). In symbols, we define the sigma-algebra in the probability space to be F := t F t := σ ( t F t ). Definition 3.6 (Martingale). A stochastic process X = (X t ) t> is a martingale with respect to a filtered probability space (Ω, F, F t, P) if X satisfies the following properties: (i) X is adapted to F, i.e. X t is measurable with respect to F t for every t 1 The notation X D = Y means X has the same distribution as Y. 12

13 (ii) X is integrable: X L 1 (Ω, P), i.e. E [ X ] = X(ω) dp(ω) < (iii) X has the martingale property: t > s Ω E [X t F s ] = X s. Definition 3.7. (Gaussian process) A 1-dimensional process G = (G t ) t> is called a Gaussian process if for any collection (t 1,..., t m ) I for arbitrary m N, the random variable (G t1,..., G tm ) has a Gaussian distribution, i.e. it has a density [ 1 f(g) = exp 1 ] det(2πσ) 2 (g µ) Σ 1 (g µ) (3.5) where g = (g 1,..., g m ), µ R m is a constant vector of means and Σ = Σ R m m is a symmetric positive semi-definite matrix. Remark 3.8. The Brownian motion process is a Gaussian process with the vector of means µ = and covariance matrix t Σ = t 2 t (3.6)..... t m t m 1 The covariance matrix is diagonal due to the independence of the increments of Brownian motion. Remark 3.9. Further remarks are in order. (a) Conditions (i)-(iii) define a consistent family of finite-dimensional distributions. Hence, the existence of the process B is guaranteed by the Kolmogorov Extension Theorem. (b) Conditions (i)-(iii) imply that E [B t ] = and E [B t B s ] = min(t, s) s, t R. The proof is left as an exercise. ) t> (which is obtained by linear interpolation between the XN t ) and B as random variables on the space of continuous trajectories (C(R + ) and B(C(R + ))), then the process (Xt t ) t> converges in distribution to B. (c) The discrete process (Xn t ) n N converges in distribution to a Brownian motion (B t ) t if the time discrete is linearly interpolated between two successive points. In other words, if we consider the continuous-time stochastic processes (X t t (d) We have that E [ (B t B s ) 2] = E [ (B t s ) 2] by (3.4a) in Definition 3.3 = t s by (3.4b) in Definition 3.3. (e) Brownian motion enjoys the following scaling invariance, also known as self-similarity of Brownian motion: for every t > and α >, B t D = α 1/2 B αt. 13

14 3.3 Karhunen-Loève expansion Observe that we have constructed Brownian motion by starting with the scaled random walk process and using the Kolmogorov Extension Theorem. Now we present an alternative method for constructing Brownian motion that is useful for numerics, called the Karhunen-Loève expansion of Brownian motion. We will consider this expansion for Brownian motion on the unit time interval [, 1]. Let {η k } k N be a collection of independent, identically distributed (i.i.d) Gaussian random variables distributed according to N (, 1), and let {φ k (t)} k N be an orthonormal basis of L 2 ([, 1]) = { u : [, 1] R : By construction, the basis functions satisfy φ i, φ j = 1 1 φ i (t)φ j (t)dt = δ ij, and we can represent any function f L 2 ([, 1]) by f(t) = k N α k φ k (t) } u(t) 2 dt <. (3.7) for α k = f, φ k. We have the following result. Theorem 3.1. (Karhunen-Loève): The process (W t ) t 1 defined by is a Brownian motion. W t = k N t η k φ k (s)ds (3.8) Proof. We give only a sketch of the proof. For details, see the Appendix in [16], or [9]). The key components of the proof are to show the following: (i) The infinite sum which defines the Karhunen-Loève expansion is absolutely convergent, uniformly on [, 1]. (ii) It holds that E [W t ] = and E [W t W s ] = min(s, t). 4 Day 4, : Brownian motion, cont d Suggested references: [9, 22] 4.1 More properties of Brownian motion From last week, we saw that the Brownian motion (B t ) t is a continuous-time stochastic process on R with stationary, independent, Gaussian increments 14

15 a.s. continuous paths. That is, for fixed ω, each (B t ) t (ω) is a continuous trajectory in R. Moreover the scaled random walk defined by X t n+1 = X t n + tξ n+1 with linear interpolation converges weakly (i.e. converges in distribution) to the Brownian motion process. Above, the (ξ n ) n N are independent, identically distributed (i.i.d) normalized Gaussian random variables (i.e. ξ n is Gaussian with mean zero and variance 1). Remark 4.1. Two remarks are in order. (i) Continuity can be understood using the Lévy construction of Brownian motion on the set of dyadic rationals, D := { } k D n, D n := : k =...., 2n. 2n n N The construction of Brownian motion on the unit time interval is as follows. Let {Z t } t D be a collection of independent, normalized random variables defined on a probability space. Define the collection of functions (F n ) n N, where F n : [, 1] R are given by t D n 1 F n (t) := 2 (j+1)/2 Z t t D j \ D j 1 lin. interp. in between. Then the process B(t) = F n (t). n=1 is indeed a Brownian motion on [, 1]. The Gaussianity of the {Z t } t D leads to the stationary, independent Gaussian increments of the process (B t ) t [,1]. The continuity of the process follows from an application of the Borel-Cantelli Lemma, which states that there exists a random and almost surely finite number N N such that for all n N and d D n, Z d < c n holds. This boundedness condition implies that n N we have a decay condition for the F n : F n < c n2 n/2. Therefore the sum j F j( ) converges uniformly on [, 1]. As each F j is continuous and the uniform limit of continuous functions is continuous, the process (B t ) t [,1] is continuous. For more details, see [18]. (ii) The Hausdorff dimension dim H of Brownian motion paths depends on the dimension of the space R d in which the Brownian motion paths live. 2 Let B [,1] = {B t R d : t [, 1]} be the graph of B t over I = [, 1]. Then { 3/2 d = 1 dim H B [,1] = 2 d 2. 2 If you do not know what this is, just think of the box counting dimension that is an upper limit of the Hausdorff dimension. 15

16 W t.2 W t t t W t t x Figure 4.1: Sample paths of the Karhunen-Loève expansion of (W t for M = 2, 64, 248 basis functions (you can guess which one is which). The lower right plot shows the unnormalized histogram of W t at time t = 1, using M = 64 basis functions and averaged over 1 independent realizations. The significance of this is as follows: if you consider Brownian motion paths confined to a smooth and compact two-dimensional domain and impose reflecting boundary conditions, then the Brownian motion paths will fill the domain in the limit as t. 4.2 Brownian bridge Recall the Karhunen-Loève expansion of Brownian motion: Theorem 4.2. Let {η k } k N be i.i.d. normalized random variables and {φ k } k N form a real orthonormal basis of L 2 ([, 1]). Then W t = k N t η k φ k (s)ds is a Brownian motion on the interval I = [, 1]. Exercise 4.3. Show that, for the definition of (W t ) t [,1] above, it holds that E [W t W s ] = min(s, t). Remark 4.4. Unlike the scaled random walk construction of Brownian motion, no forward iterations are required here. This helps for the consideration of round-off errors in the construction of (W t ) t [,1]. Furthermore: 16

17 (i) Standard choices for the orthonormal basis {φ k } k N are Haar wavelets or trigonometric functions. Hence the numerical error can be controlled by truncating the series and by the choice of the basis. (ii) To obtain a Brownian motion on any general time interval [, T ], it suffices to use the scaling property, e.g. W [,T ] D = T W[,1]/T = T k N Application: filtering of Brownian motion t/t η k φ k (s)ds. Suppose we know that W = and W 1 is equal to some constant ω. Without loss of generality, let ω =. Suppose we wanted to generate a Brownian motion path which interpolated between the values W = and W 1 =. Definition 4.5. A continuous, mean-zero Gaussian process (BB t ) t is called a Brownian bridge to ω if it has the same distribution as (W t ) t [,1] conditional on the terminal value W 1 = ω. Equivalently, (BB t ) t is a Brownian bridge if Cov [BB t BB s ] = min(s, t) st. Lemma 4.6. If (W t ) t [,1] is a Brownian motion, then BB t = W t tw 1 is a Brownian bridge. Proof. Observe that E [BB t ] = E [W t tw 1 ] = t =, so that (BB t ) t [,1] is indeed mean-zero. The process (BB t ) t [,1] inherits continuity from the process (W t ) t [,1]. The covariance process is given by Cov(BB t BB s ) = E [BB t BB s ] = E [(W t tw 1 ) (W s sw 1 )] = E [W t W s ] t E [W 1 W s ] s E [W 1 W t ] +tse [W 1 W 1 ] }{{}}{{} =min(s,1) =min(t,1) = min(t, s) ts st + ts. 4.3 Simulating a Brownian bridge First approach: forward iteration, using Euler s method. The time interval is [, 1] and we have a time step of t := 1/N, so we have (N + 1) discretized time nodes (t n = n t) n=,...,n and (N +1) values (Yn t ) n=,...,n. Let {ξ n } n=,...,n 1 be a collection of i.i.d. normalized random variables. Forward iteration gives ( Yn+1 t = Yn t 1 t ) + tξ n+1. 1 t n It holds that 1 t N 1 = t by definition of t = 1/N. Therefore from the formula above we have YN t = tξ N+1. 17

18 Therefore YN t is a mean zero Gaussian random variable with variance t. While this implies that YN t should converge in probability to the value as the step size t, the forward iteration approach is not optimal because the random variable ξ N+1 is continuous, so P ( Y t N = ) =. Therefore this construction of the Brownian bridge to the value ω = will in general not yield processes which are at at time t = 1. As a matter of fact, YN t is unbounded and can be arbitrarily far away from zero. Second approach: Recall the Karhunen-Loéve construction of Brownian motion and choose trigonometric functions as an orthonormal basis. Then the process (W t ) t [,1] given by W t (ω) = 2 M k=1 η k (ω) sin((k 1 2 )πt (k 1 2 )π is a Brownian motion and we can define the Brownian bridge to ω at t = 1 by Remark 4.7. It holds that BB t = W t t(w 1 ω). BB t = 2 k N η k sin(kπt) kπ = k N η k λk ψ k (t), where {λ k, ψ k } k N = { 2/kπ, sin(kπt) } is the eigensystem of the covariance k N operator T : L 1 ([, 1]) L 1 ([, 1]) of the process (BB t ) t [,1], defined by i.e., (T u)(t) = 1 Cov(BB t BB s ) u(s)ds, }{{} =min(t,s) st T ψ k ( ) = λ k ψ k ( ). The second approach works for any stochastic process which has finite variance over a finite time interval. For details, see [22]. 5 Day 5, : Stochastic integration Suggested references: [1, 16, 19] 5.1 Integrating against Brownian motion Recall that Brownian motion (B t ) t> is a stochastic process with the following properties: B = P-a.s. 18

19 1.4.6 Y n BB 2 t t n t Figure 4.2: Sample paths of the Brownian bridge approximation, using the Euler scheme with t =.5 (left panel) and Karhunen-Loève expansion with M = 2 basis functions (right panel). t < t 1 < t 2 <... < t n, the increments B ti B ti 1 are independent for i = 1,..., n and Gaussian with mean and variance t i t i 1. t B t (ω) is continuous P-a.s. but is P-a.s. nowhere differentiable. One of the motivations for the development of the stochastic integral lies in financial mathematics, where one wishes to determine the price of an asset that evolves randomly. The French mathematician Louis Bachelier is generally considered one of the first people to model random asset prices. In his PhD thesis, Bachelier considered the following problem. Let the value S t of an asset at time t > be modelled by S t = σb t where σ > is a scalar that describes the volatility of the stock price. Let f(t) be the amount of money an individual invests in the asset in some infinitesimal time interval [t, t + dt]. Then the wealth of the individual at the end of a time interval [, T ] is given by T T f(t)ds t = σ f(t)db t. However, it is not clear what the expression db t means. In this section, we will consider what an integral with respect to db t means, and we will also consider the case when the function f depends not only on time but on the random element ω. 5.2 The Itô integral for simple functions The first idea is to rewrite f(t)db t = f(t) db t dt dt but as Brownian motion is almost surely nowhere differentiable, we cannot write db t dt. 19

20 The second idea is to proceed as in the definition of the Lebesgue integral: start with simple step functions and later extend the definition to more general functions by the Itô Isometry. Step 1: Consider simple functions f(t) = n a i χ (ti,t i+1](t) i=1 where χ A is the indicator function of a set A satisfying { 1 x A χ A (x) = x / A. Observe that f takes a finite number n of values. By the theory of Lebesgue integration, we know that the set of these simple functions is dense in L 2 ([, )). We also know that the usual Riemann integral of such a function f corresponds to the area under the graph of f, with f(t)dt = i a i (t i+1 t i ) Step 2: We now extend the method above to stochastic integral with respect to Brownian motion: f(t)db t = a i (B ti+1 B ti ). Remark 5.1. By the equation above, it follows that the integral f(t)db t is a random variable, since the B ti are random variables. Since increments of Brownian motion are independent and Gaussian, the integral f(t)db t is normally distributed with zero mean. What about its variance? Lemma 5.2. (Itô Isometry for simple functions) For a simple function f(t) = i a iχ (ti,t i+1](t), it holds that [ ( ) ] 2 E f(t)db t = (f(t)) 2 dt. 2

21 Proof. ( ) ( ( ) ) var f(t)db t = var a i Bti+1 B ti i n = a 2 i var ( ) B ti+1 B ti = = = i=1 n a 2 i (t i+1 t i ) i=1 n a 2 i χ (ti,t i+1]dt i=1 n i=1 a 2 i χ (ti,t i+1]dt = (f(t)) 2 dt. Therefore ( var ) [ ( ) ] 2 [ ] f(t)db t = E f(t)db t E f(t)db t }{{} = [ ( ) ] 2 f(t)db t = E. 2 Step 3: Now we extend the definition of the integral to L 2 ([, )). result is the following The main Theorem 5.3. (Itô integral for L 2 ([, )) functions) The definition of the Itô integral can be extended to elements f L 2 ([, )) by setting f(t)db t := lim n f n (t)db t where the sequence (f n ) n N is a sequence of a simple functions satisfying f n f in L 2 ([, )), i.e. ( f n f L 2 ([, )) = (f n f) 2 (t)dt) 1/2 n. By the Itô isometry, we can show that ( f n (t)db t ) n N is a Cauchy sequence in the weighted L 2 space L 2 (Ω, P) := { F : Ω R : F L 2 (Ω,P) < } 21

22 of measurable functions F where F 2 L 2 (Ω,P) = F 2 (ω) dp(ω). To show that the sequence ( f n (t)db t ) n N is a Cauchy sequence, let (f l ) l N be a sequence of functions converging to f in L 2 ([, )) and consider for m, n N L f n (t)db t f m (t)db t 2 (Ω,P) ( [ ( ) ]) 2 1/2 = E f n (t)db t f m (t)db t ( [ ( ) ]) 2 1/2 = E f n (t) f m (t)db t ( = (f n (t) f m (t)) 2 dt) 1/2 (Itô isometry) = f n f m L 2 ([, )) f n f L 2 ([, )) + f m f L 2 ([, )) and using that f n f L 2 ([, )) and f m f L 2 ([, )) as m, n, the result follows. Since L 2 (Ω, P) is complete, the limit exists and is in the same space. Moreover, by the Itô isometry, the limit is independent of the sequence (f n ) n N used to approximate f in L 2 ([, )) (see [6] for an example). Example 5.4. Consider the random variable exp( t)db t. How is it distributed? Using the Itô Isometry, the random variable is Gaussian with mean zero and variance 1 2 = exp( 2t)dt. Corollary 5.5. The Itô Isometry holds as well for f L 2 ([, )), not just simple functions. Step 4: Now we consider functions f which depend both on the random element ω as well as time t. That is, we consider stochastic integrals of stochastic processes f : [, ) Ω R with the following properties: (i) f is B F-measurable, where B is the Borel sigma-algebra on [, ) and F is a given sigma-algebra on Ω. (ii) f(t, ω) is adapted with respect to F t, where F t := σ (B s : s t) (iii) E [ f(t, ω) 2 dt ] <. Consider simple stochastic processes of the form n f(t, ω) = a i (ω)χ (ti,t i+1](t). Then f(t, ω)db t = i=1 n a i (ω) ( ) B ti+1 B ti. i=1 22

23 5.3 Ambiguities in defining stochastic integrals Example 5.6. Fix n N, fix a time step t := 2 n and define the time nodes t i := i t for i =, 1, 2,.... Let (B t ) t> be the standard Brownian motion. Define the following processes on [, ): f 1 (t, ω) = i N B ti (ω)χ [ti,t i+1)(t) f 2 (t, ω) = i N B ti+1 (ω)χ [ti,t i+1)(t). Now fix T > and N such that T = t N = N t = N2 n and compute the expected values of the integrals of f 1 and f 2 over [, T ]. By the independent increments property of Brownian motion (or the martingale property of Brownian motion), we have [ ] T E f 1 (t, ω)db t = N 1 i= E [ B ti ( Bti+1 B ti )] =. Using the fact above with linearity of expectation, we also have [ ] T N 1 E f 2 (t, ω)db t = E [ ( )] B ti+1 Bti+1 B ti = = = i= N 1 i= N 1 i= N 1 i= ( [ ( )] [ ( )]) E Bti+1 Bti+1 B ti E Bti Bti+1 B ti [ ) ] 2 E (Bti+1 B ti t i+1 t i = T. In the case of Riemann integration of deterministic integrals, letting n would lead to the result that both integrals above are equal. We see that for stochastic integration, this is not the case; even if we let n, the expectations of the Itô integrals would not be equal. This is because the choice of endpoint of the interval matters in stochastic integration. Choosing the left endpoint (i.e. choosing B ti ) for f 1 and the right endpoint (i.e. B ti+1 for f 2 leads to different expectations. Note also that taking the right endpoint in f 2 leads to f 2 not being adapted, since B ti+1 is not measurable with respect to F t for t < t i+1. Therefore, by property (ii) above, we may not integrate f 2 with respect to db t in the way we have just described. 6 Day 6, : Stochastic integration, cont d 6.1 Itô isometry We extend the Itô integral to the case I[f](ω) = t f(s, ω)db s (ω), 23

24 where B t is one-dimensional Brownian motion. One aim is to understand that is SDE shorthand for dx t = b(t, X t )dt + σ(t, X t )db t, X = x X t = x + t b(s, X s ) ds + t σ(s, X s ) db s. A second objective later on will be to analyze discretizations of SDEs, such as X n+1 X n = b(t n, X n ) t + σ(t n, X n ) B n. We begin with a couple of definitions. Definition 6.1. We call V the norm defined by [ t ] f 2 V = E f(u, ) 2 du = s Ω t s f(u, ω) 2 du dp(ω). Definition 6.2 (Cf. the considerations at the bottom of p. 22). Let V = V(s, t) be the class of functions f : [, ) Ω R with (i) (t, ω) f(t, ω) is B F-measurable 3 (ii) f(t, ) is F t -adapted (iii) f V <. Definition 6.3. A simple function ϕ : [, ) Ω R is a function of the form ϕ(t, ω) = j e j (ω)χ [tj,t j+1)(t) where each e j is F tj -measurable and {F t } t with F t = σ(b s : s t) is the filtration generated by Brownian motion. Definition 6.4. The Itô integral for a simple function ϕ is defined by I[ϕ](ω) = t ϕ(s, ω)db s (ω) = j e j (ω)(b tj+1 B tj ). Lemma 6.5. Itô Isometry: If ϕ(t, ω) is a bounded, simple function, then [ ( t ) 2 ] [ t ] E ϕ(u, ω)db u (ω) = E ϕ(u, ω) 2 du. s s Proof. Define B j := B tj+1 B tj. Then I[ϕ] = j e j B j. By independence of e i e j B i from B j when i j, we have that { i j E [e i e j B i B j ] = E [ ej] 2 (tj+1 t j ) i = j. 3 Here again: B = B([, )) is the σ-algebra od Borel sets over [, ). 24

25 Therefore, [ ( t ) 2 ] E ϕ(u, )db u = s i,j = j E [e i e j B i B j ] E [ [ t ] e 2 j] (tj+1 t j ) = E ϕ(u, )du. s 6.2 Itô Integral for random integrands Now we will extend the Itô integral to V = V(s, t), by extending the Itô integral to progressively larger classes of functions. Step 1: Let g V be a uniformly bounded function which is continuous for each ω. Then there exists a sequence of simple functions (ϕ n ) n N such that as n. ϕ n g V Proof. Choose ϕ n (t, ω) = j g(t j, ω)χ [tj,t j+1)(t). Then ϕ n g in L 2 ([s, t]) for each ω Ω, and hence ϕ n g 2 V, i.e., [ t ] ( t ) E (ϕ n g) 2 du = (ϕ n g) 2 du dp s Ω s as n. Step 2: Let h V be bounded. Then there exists a bounded sequence of functions (g n ) n N V such that each g n is continuous in t for each ω and for each n N, such that g n h V. Proof. Suppose that h(t, ω) M <. For each n, let ψ n be defined by (i) ψ n (x) = for x (, 1 n ] [, ) (ii) R ψ n(x)dx = 1. Now define the functions g n : [, ) Ω R by g n (t, ω) = t ψ n (s t)h(s, ω)ds. Then it holds that g n h in L 2 ([s, t]), i.e., g n h L 2 ([s,t]) as n. As h is bounded, we can apply the bounded convergence theorem to obtain [ t ] E (g n h) 2 du s as n. 25

26 Remark 6.6. In the limit as n, the ψ n (x) become more sharply peaked at x = in other words, they approach a Dirac delta distribution: h(t, ω) = δ(s t)h(s, ω)ds. Step 3: Let f V. Then there exists a sequence of functions (h n ) n N V such that h n is bounded for each n N and h n f V as n. Proof. Define n f(t, ω) < n h n (t, ω) = f(t, ω) n f(t, ω) n n f(t, ω) > n. Then the assertion follows by the dominated convergence theorem. By Steps 1-3, f V can be approximated by sequences of simply functions ϕ n in the sense that f ϕ n V. Therefore we can define I[f](ω) = t s f(u, ω)db u (ω) = lim n t s ϕ n (u, ω)db u (ω), where by the Itô isometry, the limit exists in L 2 (Ω, P) because ( ) ϕ n (u, ω)db u (ω) is a Cauchy sequence in L 2 (Ω, P); see p. 22. Definition 6.7. Let f V = V(s, t). Then the Itô integral of f is defined by t lim n s n N ϕ n (u, ω)db u (ω) where (ϕ n ) n N is a sequence of simple functions with ϕ n f in V, i.e., [ t ] E (f(u, ω) ϕ(u, ω)) 2 du as n. s Corollary 6.8. (Itô isometry) For all f V = V(s, t), we have [ ( t ) 2 ] [ t ] E f(u, ω)db u (ω) = E f(u, ω) 2 du. s s Theorem 6.9. Let f, g V(, t) and s u t. Then (a.s.): (i) t f(τ, ω)db τ = u f(τ, ω)db τ + t s s u f(τ, ω)db τ. 26

27 (ii) (iii) (iv) t is F t -measurable. Proof. Exercise. s (αf + βgdb u = α t s fdb u + β t [ t ] E f(s, ω)db s = t s f(u, ω)db u s gdb u α, β R 6.3 Ornstein-Uhlenbeck process Example 6.1 (OU process). Consider the linear SDE for constants A, B R which means dx t = AX t dt + BdW t, X = x, X t = x + A t X s ds + B t dw s. One can show that the solution to the linear SDE can be expressed using the variation-of-constants-formula X t = e At x + t e A(t s) BdW s. The solution (X t ) t> is a Gaussian process, so it is completely specified by its mean and variance E [X t ] = e At x by property (iv) above [ [ ( E (X t E [X t ]) 2] t ) 2 ] = E e A(t s) BdW s [ t = E = t = B2 2A ( e A(t s) B e 2A(t s) B 2 ds ( e 2At 1 ) ) 2 ds ] by Itô isometry Remark The main things to remember are that the approximation procedure for defining the Itô integral reduces to the Itô isometry for elementary functions ϕ n f (convergence in V) and that the limiting integral I[f] is in L 2 (Ω, P). Specifically, we have proved that I[ϕ n ] I[f] in L 2 (Ω, P), i.e., E [ (I[ϕ n ] I[f]) 2] = (I[ϕ n ](ω) I[f](ω)) 2 dp(ω) as n. Ω 27

28 7 Day 7, : Itô calculus Recapitulation: The Itô integral for functions f V = L 2 (Ω [, T ], P λ) is defined by I[f](ω) = T f(ω, s)db s (ω) = lim n j e n j ) (B t B nj+1 t, nj with convergence in L 2 (Ω, P). Here the (e n j ) n,j N is a sequence of random variables that are measurable with respect to σ(b s : s t n j ), and (ϕ n ) n N, ϕ n (ω, t) = j e n j (ω)χ [t n j,t n j+1 ) (t) is a sequence of simple functions such that ϕ n f V. The Itô integral provides the solution to the stochastic differential equation dx t (ω) = b(x t (ω), t)dt + σ(x t (ω), t)db t (ω), X = x. (7.1) Specifically, assuming that (X t ) t is adapted to the filtration generated by B t, i.e., X t is measurable with respect to σ(b s : s t), we have X t = x + t b(x s, s) ds + t σ(x s, s) db s. 7.1 Functions of bounded and quadratic variation What we now need is a theory of differentiation that is useful in solving equations such as (7.1) and which can explain properties of the Itô integral, such as t Exercise 7.1. Prove the above equation. B t db t = 1 2 B2 t 1 2 t. Definition 7.2. Let T >. A sequence ( n ) n N of partitions of [, T ], with n = {t n,..., t n k n} [, T ], = tn < t n 1 < < t n k n = T is called a refinement of partitions of [, T ] if the sequence satisfies n+1 n & n := max t n i t n i 1 as n. i Example 7.3. An example of a refinement of partitions is the sequence of dyadic partitions. Let T = 1, and define { } j n = 2 n : j =, 1,..., 2n 1, 2 n. Definition 7.4. A function f : [, T ] R is of bounded variation (BV) if sup f(t n i ) f(t n i 1) < n N for all refinements of partitions ( n ) n N. 28

29 Definition 7.5. A function f : [, T ] R is of quadratic variation (QV) if its quadratic variation f t := sup f(t n i ) f(t n i 1) 2 <. n t n i t is finite for every t [, T ] and over all refinements of partitions. Remark 7.6. We make some remarks which we will not prove, with the exception of (iii). (i) Continuously differentiable functions are BV functions. (ii) If one integrates against a BV function, the resulting Riemann-Stieltjes integral is independent of the refinement of partitions. (iii) Continuous BV functions have zero QV: f t = sup f(t n i ) f(t n i 1) 2 n N t n i t = lim f(t n i ) f(t n i 1) 2 n t n i t max f(t n i ) f(t n i 1) lim f(t n i ) f(t n i 1) i n t n i t C lim n f(tn i ) f(t n i 1) (since f is a BV function) = (by continuity of f). (iv) The quadratic variation of Brownian motion at time t is equal to t: B t = t. Brownian motion is not of bounded variation. (v) Given an interval [, T ] for T > and a function f : [, T ] R of QV, the quadratic variation f t is a BV function of time, which follows from the fact that is monotonic as a function of time. 7.2 Itô s formula Theorem 7.7 (Itô s formula I). Let F C 2,1 (R, [, T ]) and let X = B C([, T ]) be Brownian motion. Then t F F (X t, t) = F (, ) + x (X s, s) dx s + t ( x 2 + ) F (X s, s) ds. s Proof. For convenience, we will drop the time-dependence of F, so that F (x, s) = F (x). By Taylor s theorem, F (X t n i ) F (X t n i 1 ) = F (X t n i 1 )(X t n i X t n i 1 ) F (ξ n i )(X t n i X t n i 1 ) 2 29

30 for a number ξi n (X t n i, X t n i 1 ). Then F (X t ) F (X ) = F (X )(X tn i 1 t n X i t n ) + 1 F (ξ i 1 i n )(X t n 2 i X t n i 1 ) 2. t n i } t t {{} n i } t {{} I n Q n We consider the two sums separately. As for the first sum, we observe that I n is a discrete version of the Itô integral, and therefore t 2 I F n x (X s, s) dx s L 2 (Ω,P) as n. Using (v) from the preceding remark, we know that the quadratic variation of (X t ) t [,T ] is itself a BV function of time. Therefore Q n converges to the standard Riemann-Stieltjes integral, 1 2 t F (X s ) d X s = 1 2 where, again, convergence is in L 2 (Ω, P). t F (X s ) ds, Corollary 7.8 (Itô s Formula II). Let B = (B t ) t be d-dimensional Brownian motion and let X = (X t ) t be the n-dimensional solution to the Itô stochastic differential equation dx t = b(x t, t)dt + σ(x t, t)db t, where b: R n R R n, σ : R n R R n d. Let F C 2,1 (R n, [, )). Then Y t := F (X t, t) solves the Itô equation dy t = x F (X t, t) dx t + F t (X t, t)dt }{{} = BV part (by chain rule) ( F t + n i= dx t 2 xf (X t, t)dx t }{{} QV part F b i + 1 ( σσ : 2 x i 2 xf )) (X t, t) dt + ( σ x F ) (X t, t) db t where A: B = (A T B) denotes the matrix inner product, and we have obtained the last equation by substituting dx t = b(x t, t)dt + σ(x t, t)db t, using the rules dtdt = dtdb i t = db j t dt = and db i tdb j t = δ ij dt (i, j = 1,..., d), where B i t denotes the i-th component of B t. Remark 7.9. The matrix family a(, ) := σσ : R n R R n n is sometimes called the diffusion matrix. Remark 7.1. Note that, for functions that have a quadratic variation, Itô s formula is what is chain rule for functions of bounded variation. Dropping the dependence on (X t, t) for the moment, one may rewrite the last equation as dy t = F n t + F b i + 1 n 2 F d n a ij dt + F σ ij dbt i. x i 2 x i x j x j i=1 i,j=1 i=1 j=1 3

31 Historical remarks Itô s original work was published in However it was recently revealed that in 194 Wolfgang Döblin, brother of novelist Alfred Döblin, French-German mathematician and student of Maurice Fréchet and Paul Lévy, sent a sealed letter to the Académie Française, while he was on the German front with the French army (as a telephone operator). Döblin committed suicide before he was captured by the German troops and burned all his mathematical notes. According to Döblin s last will, the letter to the Académie Française was opened in the year 2 and found to contain a proof of Itô s lemma Geometric Brownian motion Consider the Geometric Brownian motion S = (S t ) t that is the solution of the Itô stochastic differential equation (SDE) ds t = µs t dt + σs t db t, S >. We claim that the solution to the SDE is S t = S exp [(µ σ2 2 ) t + σb t ]. This can be seen as follows: using Itô s formula for F (x) = log x, we find and therefore Y t = log S t dy t = ds t Y t = Y + = Y + = t S t σ2 St 2 2 St 2 (µ σ2 2 (µ σ2 (µ σ2 2 dt ) dt + σdb t ) dt + σ 2 ) t + σb t t db t ( which proves that S t follows the log-normal distribution with mean and variance σ 2. Moreover, E [S t ] = exp (µt). µ σ2 2 ) t Remark The geometric Brownian motion is sometimes used to model the growth of one s wealth subject to some positive interest rate µ > and random fluctuations due to market conditions, represented by the volatility-modified Brownian motion term σb t. Good to know that E[S t ] = exp (µt). 4 Kiyoshi Itô, , Japanese Mathematician; the famous lemma appeared in [K. Itô, On stochastic differential equations, Memoirs AMS 4, 1 51, 1951]. 5 For a summary of Döblin s work, see [B. Bru and M. Yor, Comments on the life and mathematical legacy of Wolfgang Doeblin, Finance Stochast. 6, 4 47, 22] 31

32 S t t Figure 7.1: Typical realization of Geometric Brownian Motion (S t ) t [,1] for µ = 2 and σ = 1. The red dashed line shows the mean E[S t ]. It is also known, however, that the Brownian motion satisfies the Law of the Iterated Logarithm (see, e.g., [19, Thm ]) lim sup t lim inf t B t 2t log log t = +1 B t 2t log log t = 1, which states that Brownian grows sublinearly. Since ) ] S t = S exp [(µ σ2 t + σb t 2 it follows that depending on different values of µ and σ, the wealth process (S t ) t is dominated by the linear drift term. Indeed S, can exhibit rather different behaviours in the limit as t, depending on the values of µ and σ: (a) If µ < σ2 2, then S t as t. (b) If µ = σ2 2, then lim sup t S t =, lim inf t S t =. (c) If µ > σ2 2, then S t as t. (All the statements above hold P-almost surely.) The mind-blowing aspect of geometric Brownian motion is the seemingly contradictory property that, even though the expected value grows exponentially with time for every volatility value σ, the process will hit zero with probability 1 whenever the volatility is sufficiently large, i.e. when σ > 2µ. That is, even though the expected wealth grows exponentially (and thus never hits zero), for P-almost all ω, all the wealth will vanish due to fluctuations in the long time limit, i.e., every single market player goes broke with probability one. Think about it! 32

33 8 Day 8, : Itô stochastic differential equations 8.1 Martingale representation theorem Let us repeat Itô s Lemma: given F C 2 (R n ) and an Itô SDE dx t = b(x t, t)dt + σ(x t, t)db t, the process (Y t ) t, Y t := F (X t ) solves the Itô SDE dy t = F (X t ) dx t dx t 2 F (X t )dx t ( = F b + 1 ) 2 σσ : 2 F dt + ( σ F ) db t where u v = u v denotes the usual inner product between vectors and A : B = tr(a B) is the inner product between matrices. Furthermore, in the step from the first to the second line, we have used the rule that dtdt = dtdb i t = db i tdt = db i tdb j t = δ ij dt, i, j = 1,..., n. If we define the second-order differential operator (the infinitesimal generator of the stochastic flow X t that will be introduced later on) Itô s formula may be rewritten as Lϕ = 1 2 σσ : 2 ϕ + b ϕ, dy t = (LF ) (X t, t)dt + ( σ F ) (X t, t) db t. Remark 8.1. The fact that the usual chain does not apply for Itô processes has to do with the definition of the corresponding stochastic integral. We shall briefly comment on what makes Itô integral special. (i) The Martingale Representation Theorem states that every F t -martingale (X t ) t (i.e., X t is adapted to the filtration generated by B t and satisfies X s = E [X t F s ] ) can be written as the integral X t = X + t φ(ω, s)db s for a function φ V(, t), that is uniquely determined. Conversely, every Itô integral of the form t φ s db s is a martingale with respect to (F t ) t (ii) The Stratonovich integral is another stochastic integral, distinct from the Itô integral, that is based on the midpoint rule, i.e., ) ) (Btj+1 B tj ψ(ω, s) db s = lim n j ( ψ ω, t j

34 where we emphasize the different notation using the symbol and where t j+ 1 := t j + t j The Stratonovich integral has the property that [ ] E ψ s db s, hence the Stratonovich integral is not a martingale, unlike the Itô integral. Furthermore, the thus defined integral does not satisfy the Itô Isometry. On the other hand, the usual chain rule applies. (iii) The Stratonovich integral is used for integrating Stratonovich SDEs dx t = b(x t, t)dt + σ(x t, t) db t, where, again, the indicates that the SDE has to be interpreted in the Stratonovich sense (i.e., integrated using the Stratonovich integral). One can also convert Stratonovich SDEs to Itô SDEs using the conversion rule dx t = b(x t, t)dt + σ(x t, t) db t = (b + 12 ) σ σ (X t, t)dt + σ(x t, t)db t. 8.2 Stochastic differential equations: Existence and uniqueness We now want to find possible solutions (X t ) t R n for Itô SDEs of the form dx t = b(x t, t)dt + σ(x t, t)db t. (8.1) where b : R n R R n and σ : R n R R n d are measurable functions. Definition 8.2 (Strong solution). Let T >. A process (X t ) t [,T ] is called a strong solution of (8.1) if the map t X t is almost surely continuous and adapted to the filtration generated by the Brownian motion (B t ) t [,T ] and if it holds for P-almost all ω (i.e., if it holds P-almost surely) that X t (ω) = X (ω) + t b(x s (ω), s)ds + t σ(x s, ω)db s (ω) (ω fixed). Definition 8.3 (Uniqueness). The solution of (8.1) is called unique or pathwise unique if P(X = X ) = 1 P(X t = X t ) = 1 t [, T ] for any two solutions (X t ) t [,T ] and ( X t ) t [,T ] of (8.1). Theorem 8.4 (Existence and uniqueness). Let T > and b, σ in (8.1) satisfy (i) (Global Lipschitz condition): x, y R n, t [, T ], b(x, t) b(y, t) + σ(x, t) σ(y, t) L x y for some constant < L <. 6 6 Here the norm on the matrix terms σ(, ) is arbitrary and may be taken to be, e.g., the Frobenius norm σ = ( i,j σ ij 2 ) 1/2. 34

35 (ii) (Sublinear growth condition): x, R n and t [, T ], for some < G <. b(x, t) + σ(x, t) G(1 + x ) Given that the above conditions hold, if we have E [ X 2 ] <, then (8.1) has a pathwise unique, strong solution for any T >. Proof. See [19, Thm 5.2.1]. The main elements of the proof of the above theorem are the Itô Isometry and a Picard-Lindelöf-like fixed-point iteration, just as in case of ordinary differential equations. 8.3 Applications from physics, biology and finance Example 8.5. We now consider some examples which use Itô s formula. (i) (Geometric Brownian motion): the Itô SDE in this example is and the solution to this SDE is ds t = µs t dt + σs t db t, S > S t = S exp ) ] [(µ σ2 t + σb t 2 (ii) (Ornstein-Uhlenbeck process): the OU process is a Gaussian process whose evolution is given by dx t = AX t dt + BdW t, X = x for A R n n, B R n d (i.e., W t is d-dimensional Brownian motion). The solution to the above SDE is X t = e At x + t e A(t s) B dw s (iii) (Brownian Bridge): the Brownian bridge (BB t ) t [,1] is a Gaussian process with the property BB = BB 1 =. The associated SDE is The solution to the above SDE is dbb t = 1 BB t dt + db t, BB =. 1 t BB t = (1 t) The proof is left as an exercise. t (iv) (Logistic growth model): Consider the ODE dz dt db s 1 s. = rz (C z), z() > 35

36 z time x 1 4 Figure 8.1: Typical realization of the logistic growth model with r =.5, C = 3 and σ =.1; for comparison, the dashed green curve shows the deterministic model with σ = ; the red straight line shows the capacity bound C. that describes logistic growth of a population where r > is the (initial) growth rate and C > is the capacity bound. If we add random perturbations we obtain the Itô SDE dz t = rz t (C Z t )dt + σz t db t, Z >. which describes logistic growth in a random environment (see Figure 8.1). This is an interesting example because the drift coefficient is not globally Lipschitz even worse, the drift term can become unbounded. Despite this, one can obtain an analytic solution to the SDE above, given by [( ) ] exp rc σ2 2 t + σb t Ẑ t = Ẑ 1 + t exp [( ) ]. rc σ2 2 s + σbs ds Central issues for numerical methods for SDEs We now consider some key properties which we shall use in evaluating the quality of a numerical method for solving Itô SDEs of the form dx t = V (X t )dt + 2ɛdB t (8.2) with a smooth function V : R R (8.1) that will not meet the requirements of the existence and uniqueness theorem in general. (i) There are various choices for stable numerical schemes for solving equations like (8.2). As we will see below, one such choice is X n+1 X n = t ( X n ) + 2ɛ t ξ n+1, where t > and the ξ n are suitable i.i.d. random variables, e.g., standard normal or uniform on the set { 1, 1}, such that X n X(t n ) on a sufficiently fine grid = t < t 1 < t 3 <... with t = t n+1 t n. 36

37 x time V, histogram x Figure 8.2: Typical realization of (8.2) with a bistable potential V ; the solution has been computed using the Euler method (see the next section). (ii) The numerical scheme under (i) can be shown to yield an approximation to the continuous SDE on any finite time interval, but diverges when n. On the other hand, we may be interested in the limiting behaviour, specifically in the stationary distribution of the process (if it exists). In our case, (an under certain technical assumption on V ), the process satisfies (P Xt 1 )(A) e V/ɛ dx as t and for all Borel sets A R (assuming that the integral on the right hand side is properly normalized). For the continuous process convergence of the distribution can be shown to hold in L 1, but it may be very slow if ɛ in (8.2) is small. For the discrete approximation this question of convergence does not have an easy answer, for the standard numerical schemes are not asymptotically stable and the numerical discretization introduces a bias in the stationary distribution (see Figure 8.2). (iii) Can we compute functionals of paths of X t? For example, can we compute quantities, such as [ ] T E [φ(x T )], E ψ(x t, t)dt for bounded continuous functions φ, ψ and T >, or can we compute A E[τ X = x] with τ being some random stopping time (e.g., a first hitting time of a set E R). Questions dealing with such functionals, but also the longterm stability issue under (ii) will lead us to Markov Chain Monte-Carlo (MCMC) methods for PDEs and the celebrated Feynman-Kac formula. 9 Day 9, : Stochastic Euler method Suggested reference: [11] 37

38 We motivate the ideas in this method by considering the deterministic initial value problem dx dt = b(x, t), x() = x, for t [, T ]. The initial value problem has the solution x(t) = x + t b(x(s), s)ds, t [, T ] and we can approximate the true solution, employing a suitable quadrature rule for the integral, e.g., the rectangle rule : tn+1 x(t n+1 ) = x(t n ) + x(t n ) + t n tn+1 t n b(x(s), s)ds b(x(t n ), t n )ds = x(t n ) + b(x(t n ), t n ) (t n+1 t n ). Given a sufficiently fine grid of time nodes = t < t 1 <... < t N = T with fixed time step t = t n+1 t n, we recognize the forward Euler method x n+1 = x n + tb(x n, t n ). The discretization error induced by the Euler scheme can be shown to satisfy sup x(t n ) x n C t. n=1,...,n for a < C < that is independent of t. 9.1 The Euler-Maruyama scheme We construct a simple quadrature rule for the Itô integral. (X t ) t [,T ] solve the Itô SDE To this end let dx t = b(x t, t)dt + σ(x t, t)db t, X = x with b : R n R R n and σ : R n R R n m for (B t ) t> a m-dimensional Brownian motion. We wish to approximate (X t ) t> on the uniform grid { = t < t 1 <... < t N = T }, t = t n+1 t n. The rectangle rule for the solution between [t n, t n+1 ] [, T ] is tn+1 X tn+1 = X tn + b(x s, s)ds + t n tn+1 tn+1 t n tn+1 σ(x s, s)db s X tn + b(x tn, t n )ds + σ(x tn, t n )db s t n t n = X tn + tb(x tn, t n ) + ( ) B tn+1 B tn σ(x tn, t n ). }{{} =: B n N(, t) 38

39 Definition 9.1. (Euler-Maruyama scheme): For n =,..., N 1, the Euler- Maruyama scheme or Euler s method gives the n-th iterate as X n+1 = X n + tb( X n, t n ) + σ( X n, t n ) B n, X = x Remark 9.2. A few remarks are in order. (i) Euler s method is consistent with the definition of the Itô integral, in that is evaluates the integrand at the left endpoint of the interval. (ii) The Euler method gives the values of the numerical path at the time nodes. A numerical path is obtained by linear interpolation: for t [t n, t n+1 ], X t (ω) = X n (ω) + (t t n) t ( Xn+1 (ω) X n (ω)) = X n + (t t n ) b( X n, t n ) + t t n σ( t X n, t n ) ( ) B tn+1 B tn. Note that Xt, t t n+1 depends on B tn+1, i.e., the interpolant Xt is not non-anticipating. (iii) Sometimes one wishes to refine the partition for a specific realization, i.e. using the same path ω. For example, by halving the time step from from t to t/2, one obtains new grid points t n+ 1 2 = t n + t 2 The values of the refined Brownian motion can be computed by the rule B tn+ 1 2 (ω) = 1 2 [ Btn (ω) + B tn+1 (ω)] t ξn (ω) where the ξ n N(, 1) are i.i.d. standard normal random variables. 9.2 Convergence of Euler s method We can guess that X X(t n ), but in which sense? Example 9.3. Let X t = B t and Y t = B t ; then X t Y t (that is, (X t ) t> and (Y t ) t> have the same distribution) but X t Y t = 2 B t is unbounded for all t. Hence pathwise comparisons may not be very informative. Definition 9.4. (Strong convergence): Let X tn denotes the value of the true solution (X t ) t [,T ] of our SDE at the time t n. A numerical scheme ( X n ) n = ( ) n=,...,n is called strongly convergent of order γ > if X t n max E[ X n X tn ] C t γ n=,...,n where < C < is independent of t but can depend on the length T = N t of the time interval. 39

40 Definition 9.5. (Weak convergence): A numerical scheme ( X n ) n is called weakly convergent of order δ > if max E[f( X n )] E[f(X tn )] D t δ n=,...,n for all functions f in a suitably chosen class of functions (e.g. the space of continuous, bounded functions C b (R n, or the space of polynomials of degree k). The constant D is independent of t, but may depend on the class of test functions being considered. Remark 9.6. A mnemonic for the difference between strong and weak convergence is that strong convergence is about the mean of the error, while weak convergence is about the error of the mean. Lemma 9.7. Let f be globally Lipschitz. Then strong convergence implies weak convergence. The converse does not hold Proof. Since f is globally Lipschitz, < L < such that f(x) f(y) L x y x, y R n. Then E[f( X n )] E[f(X tn )] = E[f( X n ) f(x tn )] E[ f( X n ) f(x tn ) ] LE[ X n X tn ]. Now consider the converse statement. Let X n = X tn, with E[ X n ] = and E[ X n ]. Then E[ X n E[X tn ] = =, but which concludes the proof. E[ X n X tn ] = 2E[ X n ], The next theorem states that Euler s method is strongly (and hence weakly) convergent. Theorem 9.8. Let T > and b, σ in (8.1) satisfy (i) (Global Lipschitz condition): x, y R n, t [, T ], b(x, t) b(y, t) + σ(x, t) σ(y, t) L x y for some constant < L <. (ii) (Growth condition): x, R n and t [, T ], for some < G <. b(x, t) 2 + σ(x, t) 2 G(1 + x 2 ) Then Euler s method is strongly convergent of order γ = 1/2 and weakly convergent of order δ = 1. 4

41 The proof is essentially based on the integral version of Gronwall s lemma (as is common when Lipschitz conditions are involved): Lemma 9.9 (Gronwall Lemma). Let y : [, T ] R be non-negative and integrable such that y(t) A + B for some constants A, B >. Then t y(s) ds, t T, y(t) Ae Bt, t T. Proof. We have dy dt By(t) with initial value y() A. Multiplying the last equation by exp( Bt) yields dy exp( Bt) By(t) exp( Bt) dt and thus d (y(t) exp( Bt)). dt Integrating the last equation from to t [, T ] and multiplying through by exp(bt), it follows that which proves the assertion. y(t) y() exp(bt) A exp(bt) t [, T ], We are now ready to prove that the Euler-Maruyama scheme convergences. Proof of Theorem 9.8. We will show only strong convergence and leave the weak convergence part as an exercise. Without loss of generality, we assume that b(x, t) = b(x) and σ(x, t) = σ(x) are independent of t and that x R is scalar this will greatly simplify the notation. Since L 2 (Ω, P ) L 1 (Ω, P ), i.e., E[ X n X tn ] E[ X n X tn 2 ], it suffices to prove that E[ X n X tn 2 ] C 2 t for sufficiently small t. Now let τ [, T ) and define n τ N by τ [t nτ, t nτ +1) with t k = k t. Further let X τ = X nτ be the piecewise constant interpolant of the time-discrete 41

42 Markov chain X, X1, X2,.... Then ( τ X τ X τ = X nτ x + b(x s ) ds + ( X i+1 X i ) n τ 1 = i= n τ 1 = = τ τ b(x s ) ds (b( X i ) t + σ( X i ) B i+1 ) i= tnτ tnτ σ(x s ) db s ) τ τ b( X tnτ τ s ) ds + σ( X s ) db s tnτ σ(x s ) db s b(x s ) ds τ b(x s ) ds τ = (b( X s ) b(x s )) ds + (σ( X s ) σ(x s )) db s }{{} discretization error τ ( ) τ b(x s ) ds + σ(x s ) db s t nτ t nτ }{{} interpolation error σ(x s ) db s σ(x s ) db s Squaring both sides of the equality and taking the expectation, it follows with the inequality (a + b + c + d) 2 4(a 2 + b 2 + c 2 + d 2 ), [ ( E[ X tnτ ) 2 ] τ X τ 2 ] 4E (b( X s ) b(x s )) ds [ ( tnτ ) 2 ] + 4E (σ( X s ) σ(x s )) db s ( ) 2 ( τ ) 2 τ + 4E b(x s ) ds + 4E σ(x s ) db s t nτ t nτ We will now estimate the right hand side of the inequality term by term, using Lipschitz and growth conditions: (i) Noting that the inner Riemann integral can be interpreted as a scalar product between the functions g(s) = 1 and f(s) = b( X s ) b(x s ), we find [ ( tnτ ) 2 ] E (b( X tnτ s ) b(x s )) ds t nτ E[ b( X s ) b(x s ) 2 ] ds }{{} Cauchy-Schwarz & Fubini tnτ T L 2 E[ X s X s 2 ] ds. }{{} Lipschitz bound 42

43 (ii) For the stochastic integral, the Itô isometry implies [ ( tnτ ) 2 ] E (σ( X tnτ s ) σ(x s )) db s = E[ σ( X s ) σ(x s ) 2 ] ds } {{ } Itô isometry tnτ L 2 E[ X s X s 2 ] ds. }{{} Lipschitz bound (iii) For the interpolation error coming from the drift, we have ( ) 2 τ τ E b(x s ) ds (τ t nτ ) E[ b(x s ) 2 ] ds t nτ t nτ }{{} Cauchy-Schwarz & Fubini τ t G (1 + E[ X s 2 ]) ds t nτ }{{} sublinear growth M 1 ( t) 2, for a constant < M 1 <. In the last step we have used that E[ X s 2 ] is finite by the assumptions on the coefficients b and σ. (iv) Finally, using the Itô isometry again, we can bound the remaining stochastic integral by ( ) 2 τ τ E σ(x s ) db s = E[ σ(x s ) 2 ] ds t nτ t nτ }{{} for a constant < M 2 <. Itô isometry τ G (1 + E[ X s 2 ]) ds t nτ }{{} sublinear growth M 2 t, }{{} E[ X s 2 ]< Setting y(t) = E[ X t X t 2 ], the assertion follows from Gronwall s lemma with A = M t for a M > (M 1 t + M 2 ) and B = L 2 (1 + T ). 9.3 Some practical issues Remark 9.1. Note that the (strong) error bound for Euler s method grows exponentially with T, hence becomes essentially of order one if T = O( log t). Remark We shall briefly comment on some implementation issues. 43

44 (i) The standard implementation of Euler s method is X n+1 = X n + tb( X n, t n ) + tσ( X n, t n )ξ n+1 where the ξ n are standard normal, i.i.d. random variables. (ii) If σ in (8.1) is independent of x (so called additive noise), then the Euler scheme is strongly convergent of order γ = 1. (iii) The simplified Euler method uses i.i.d. ξ n U({±1}) random variables, i.e. P (ξ n = 1) = P (ξ n = 1) = 1/2. The advantage of the simplified Euler scheme is that it is much faster to generate U({±1}) random variables than N(, 1) random variables. The disadvantage of the simplified Euler scheme is that it gives only weak convergence. Exercise Show that the simplified Euler method is weakly convergent. 1 Day 1, : Kolmogorov backward equation Suggested references: [1, 11, 19] 1.1 Further remarks on the Euler-Maruyama scheme For an inhomogeneous Itô SDE of the form dx t = b(x t, t)dt + σ(x t, t)db t, X = x, t [, T ], with b: R n [, T ] R n and σ : R n [, T ] R n m, Euler s method reads X n+1 = X n + t b( X n, t n ) + 2 t σ( X n, t n ) ξ n+1 (n =,..., N 1) where = t < t 1 < t 2... < t N = T with t n+1 t n = t, and ξ n N (, I) i.i.d. Recall that Euler s method is strongly convergent of order 1/2 and weakly convergent of order 1, i.e., sup E[ X n X tn ] C t 1/2 n sup E[f( X n )] E[f(X tn )] D t n for C, D, independent of t. In the last inequality, f : R n R is chosen from a suitable class of functions, e.g., C b = {f : continuous and bounded} or P k = {f : polynomials of degree at most k}. 7 Note that D may depend on f. Remark 1.1. A few remarks are in order. (i) One cannot construct a fully implicit numerical method for SDEs analogous to the backward Euler method for ODEs, due to a lack of measurability of the resulting discretized solution (see Example 5.6 on p. 23). 7 The space C b (R n ) is dual to the space of all finite Borel measures on (R n, B(R n )). 44

45 (ii) For functionals of (X t ) t> of the form E[f(X T )] we can use a Monte-Carlo method to get unbiased estimates, e.g., E [f(x T )] = 1 M M f( X N (ω i )) + O ( t + M 1/2) i=1 Here the error term O ( t + M 1/2) represents the width of the confidence interval for estimating the mean E[ ]. The numerical complexity, if we want to achieve an accuracy ɛ > with a certain confidence level, goes like O(ɛ 3 ). 1.2 Strongly continuous semigroups Starting from we want to compute dx t = b(x t )dt + σ(x t )db t, E[f(X t )] = = f(x t (ω))dp(ω) Ω }{{} X = x avg. over realizations of X t R n f(y)dp Xt (y) }{{} avg. over distribution of X t where f is integrable and P Xt = P Xt 1 denotes the distribution of X t at time t > (i.e., the push forward of the base measure P by X t ). Remark 1.2. Note that, for a fixed initial value X = x, the expectation E[f(X t )] = E[f(X t ) X = x] is always a function of x and t. Example 1.3 (Heat equation). Let X t = B t + x, where (B t ) t> is Brownian motion in R. Then X t N(x, t) with density 1 ρ(y, t) := ρ(y, t; x) = (2πt) exp ( 1/2 ) y x 2. 2t In the following we will assume that P Xt has a smooth density ρ with respect to Lebesgue measure, i.e., dp Xt (y) = ρ(y, t)dy. We seek an evolution equation for the conditional expectation u(x, t) := f(y)ρ(y, t; x)dy = E[f(X t ) X = x]. R n in terms of the functions u or ρ. We will first need a couple of definitions. Definition 1.4 (Transition kernel). Let (X t ) t> R n be a homogeneous Markov process. The transition kernel (also called transition function) is a function p: [, ) R n B(R n ) [, 1] defined by p(t, x, A) = P (X t+s A X s = x) s, t, x R n, A B(R n ) 45

46 that satisfies the Chapman-Kolmogorov equation ( ) p(t + s, x, A) = p(t, y, A)p(s, x, dy) = q(t, y, z)q(s, x, y)dy dz. R n A R n Here q(t, x, ) is the kernel density with respect to the Lebesgue measure (that we tacitly assume to exist if not stated otherwise). Definition 1.5 (Strongly continuous contraction semigroup). Let (K, ) be a Banach space with norm. A strongly continuous contraction semigroup is a one-parameter family of operators S t : K K such that (i) S = Id and S t+u = S t S u for all t, u > (semigroup property) (ii) lim t S t f = f where the limit is understood with respect to the strong operator topology, i.e., S t f f as t (continuity) (iii) S t 1 for all t > (contraction property) Definition 1.6. (Feller process): A Markov process (X t ) t> is called a Feller process if the operators defined by (S t f)(x) := E [f(x t ) X = x] = f(y)q(t, x, y)dy form a strongly continuous contraction semigroup in the Banach space (C, ) where C is the space of bounded functions which vanish at infinity, i.e. the space of bounded functions with compact support, and is the supremum norm. Example 1.7 (Heat equation, cont d). Let (B t ) t> be a Brownian motion. The Markov process (X t ) t> defined by X t = B t + x is a Feller process with transition density 1 q(t, x, y) = (2πt) exp ( 1/2 ) y x 2. 2t 1.3 Feynman-Kac theorem (backward equation) For all practical purposes the solutions to the SDEs of the above form are of Feller type. Let us introduce the shorthand E x [f(x t )] = E[f(X t ) X = x]. We will now state one version of the famous Feynman-Kac theorem 8 that allows for expressing the semigroup S t in terms of the solution to a partial differential equation (PDE); for further information the reader is referred to [19]. Theorem 1.8 (Feynman-Kac formula). Fix T >. Let u(x, t) be the solution of the linear, parabolic PDE ( ) t + L u(x, t) =, u(x, T ) = f(x), (1.1) 8 Richard P. Feynman ( ), American Physicist and Nobel Prize Winner; Mark (Marek) Kac ( ), Polish-American mathematician. 46

47 where Then Lφ = 1 2 σσ : 2 φ + b φ. u(x, t = ) = E x [f(x T )] Proof. The proof uses Itô s formula for Y t = u(x t, t): dy t = u dx t + u t dt + 1 ( σσ : 2 u ) dt ( ) 2 = t + L u(x t, t)dt + ( σ u ) (X t, t) db t. Now integrate both sides of the equation from to T and use (1.1) to obtain T ( ) T u(x T, T ) u(x, ) = }{{} t + L ( u(x t, t)dt + σ u ) (X t, t) db t =f(x T ) }{{} = Note that the rightmost integral is a martingale, given suitable conditions on the coefficients of (1.1) that guarantee that u is continuously differentiable. Taking expectations then yields the assertion: E x [f(x T )] = u(x, ). Remark 1.9. Equation (1.1) is a so-called backward evolution equation, a PDE with a terminal condition, and is known as the Kolmogorov backward equation. By altering the proof of Theorem 1.8, it is now straightforward to derive variants of the Feynman-Kac formula and even a forward evolution equation for the transition density. (i) Let v(x, t) solve (ii) Let Then v(x, t) = E x [f(x t )]. ( ) t L v(x, t) =, v(x, ) = f(x). (1.2) ρ(y, t; x)dy = P (X t [y, y + dy] X = x). We will also use the notation ρ(y, t) := ρ(y, t; x) when the starting point x does not change; note that ρ(y, t; x) = q(t, x, y) is the same as the transition density for fixed starting value X = x. Using integration by parts it can be shown that ρ solves the Kolmogorov forward equation (also known as the Fokker-Planck Equation), ( ) t A ρ(y, t) =, lim ρ(, t) = δ x (1.3) t where A = L is the formal L 2 -adjoint of the operator L, i.e. f, Lg = f(x) (Lg) (x)dx = (Af) (x)g(x)dy = Af, g. 47

48 We leave the proofs as an exercise to the reader. Example 1.1 (Heat equation, cont d). The heat kernel ) 1 y x 2 ρ(y, t) = exp ( 1/2 (2πt) 2t solves the PDE Semigroup interpretation ρ t = 1 2 ρ 2 y 2, lim t ρ(, t) = δ x The Feynman-Kac theorem provides a connection between certain Itô SDEs and partial differential equations (most notably Kolmogorov forward and backward equations). In terms of the solution to (1.2) our semigroup reads (S t f) (x) = E x [f(x t )] = ( e tl f ) (x), where v(x, t) = (e tl f)(x) is the formal solution of the PDE v t = Lv, v(x, ) = f(x) In other words, the semigroup has the suggestive representation We say that S t is generated by L. S t = e tl. 11 Day 11, : Exit times 11.1 Some more semigroup theory Let (X t ) t R n be the solution of the time-homogenous SDE with, e.g., dx t = b(x t )dt + σ(x t )db t, X = x, b(x) = V (x), σ(x) = 2ɛ, with ɛ > and a smooth function V : R n R that is bounded below and sufficiently growing at infinity (see Fig. 8.2 on p. 37). Further let (S t ) t be the corresponding semigroup, defined by (S t f)(x) = E x [f(x t )] where f C (R n ). Here C (R n ) denotes the space of continuous functions that vanish at infinity, and it comes with a natural norm f = sup x R n f(x) ; we have already seen that S t defines a strongly continuous contraction semigroup on (C (R n ), ), and that it can be represented by the exponential S t = exp(tl) of the the second-order differential operator Lφ = 1 2 σσ : 2 φ + b φ. There is more to say about the relation between the semigroup and the diffusion operator L that is expressed in the following definition. 48

49 Definition 11.1 (Infinitesimal generator). Let X = (X t ) t be a continuoustime Markov process with semigroup (S t ) t on a Banach space (K, ). The infinitesimal generator of X is defined as provided that the limit exists in (K, ). (S t φ)(x) φ(x) Lφ(x) = lim, (11.1) t t Remark If X = (X t ) t is time-discrete on a, say, countable state space (i.e., X is a Markov chain with transition matrix P = (p xy ) x,y ), then (S n f)(x) = y f(y)p (n) xy = ( P n f ) x, which suggests to define the discrete generator by Lf = P f f, i.e., L = P Id. Theorem If (X t ) t is the solution of dx t = b(x t )dt + σ(x t )db t, X = x and ϕ C 2 (R n ) then the infinitesimal generator from (11.1) is given by L = 1 2 σσ : 2 + b. Proof. The theorem largely follows from Itô s formula; see [19, Thm ]. Example 11.4 (Heat equation, cont d). The operator in the n-dimensional heat equation ρ t = 1 n 2 ρ 2 x 2 i is the L 2 -adjoint of the generator associated of the n-dimensional Brownian motion B t, i.e., the generator of B t is one half the Laplacian, L = 1 2 i=1 n 2 x 2 i=1 i (Note that the Laplacian is essentially self-adjoint.) Exercise Prove that LS t = S t L, i.e., L (E x [f(x t )]) = E x [Lf(X t )] Exit times: more Feynman-Kac formulae Let Ω R n be an open and bounded domain with smooth boundary, Ω C. In this section, we investigate connections between stopped SDEs and linear boundary value problems (BVPs) of the form Au = f u = g in Ω on Ω where A is a second-order differential operator, f C(Ω), g C( Ω) and the solution u C 2 (Ω) C(Ω). The function g prescribes the values of u on the boundary Ω. The next theorem is a stopping-time version of Itô s formula. 49

50 Figure 11.1: How long does 3-dimensional Brownian motion need on average to leave a sphere of radius R = 1? See Example 11.7 below. Theorem (Dynkin s formula): Let τ be a stopping time with E x [τ] <. Then for f C(R 2 n ), we have [ τ ] E x [f(x τ )] = f(x) + E x Lf(X s )ds. Proof. The assertion follows from Itô s formula and the Martingale property of stopped stochastic integrals; see [19, Lem ]. Example We want to compute the mean first exit time (MFET) of n- dimensional Brownian Motion from the open ball D = {x R n : x < R} of radius R > (see Fig. 11.1). To this end let X t = B t + x where we assume that X = x D. We choose k N and define the stopping time τ k = k τ D, with τ D = inf {t > : X t / D}. By definition our stopping time τ k is a.s. finite, so that Dynkin s formula for the function f(x) = x 2 for x D yields [ τk ] 1 E x [f (X τk )] = f(x) + E x 2 f(x s)ds = x 2 + ne x [τ k ]. In particular, since E x [f(x τk )] E x [f(x τd )] = R 2, it follows that E x [τ k ] R2 x 2. n 5

51 PDE solution Monte Carlo 4 12 V 3 E x [τ] x x Figure 11.2: Tilted double well potential (left panel) and mean first passage time of the light blue region (right panel). The solid line in the right panel shows the finite difference solution of the MFPT using 15 grid points, the dots are the Monte-Carlo estimates for 1 independent realizations, starting from each initial value (from x = 2 to x = 2 in.1 steps). Now τ k τ D as k and hence, by monotone convergence, E x [τ D ] = R2 x 2. n Observe that the MFET is inversely proportional to the dimension of the ball, which can be understood as follows. If the initial point x is ɛ away from the set boundary, i.e., if R 2 x 2 = ɛ 2 then every dimension adds another direction in which the particle can exit. Therefore E x [τ D ] 1/n. Remark If τ is the first exit time of an open, bounded set Ω R n, then E x [τ] <. τ Ω := inf {t > : X t / Ω} Our next result establishes the connection between the mean first exit time from a set with a boundary value problem specified on that set. Lemma 11.9 (Mean first exit time). Let Ω R n be an open, bounded domain with sufficiently nice boundary Ω. τ = inf {t > : X t / Ω} Let ϕ C 2 (Ω) C(Ω) solve the boundary value problem Lϕ(x) = 1, ϕ(x) =, x Ω x Ω. (11.2) Then ϕ(x) = E x [τ]. Proof. We can use Itô s formula (or Dynkin s formula) for ϕ: ϕ(x τ ) = ϕ(x ) + τ Lϕ(X s )ds + τ ( σ ϕ ) (X s )db s. 51

52 Figure 11.3: Process of exocytosis (source: Wikipedia). By the assumption that ϕ is C 2 inside the domain Ω, it follows that the rightmost term is a Martingale; taking expectations and using the PDE for ϕ, we find [ τ ] [ τ ( E x [ϕ(x τ )] = ϕ(x) + E x Lϕ(X s ) ds + E x σ ϕ ) ] (X s )db s, }{{} }{{} = = 1 where the leftmost term follows from the boundary condition ϕ Ω =. Hence which proves the lemma. = ϕ(x) E x [τ] 11.3 Applications of mean first exit time formulae The boundary value problem (11.2) for the MFET can be used to accurately compute the exit time rather when the system is not too high-dimensional (say, at most three-dimensional) and Monte-Carlo sampling would be inefficient (e.g, if the exit times are extremely large). As an example let us consider our onedimensional paradigm, diffusion in a double-well potential: dx t = V (X t )dt + 2ɛ db t, X = x, (11.3) with V as shown in the left panel of Figure 11.2 below and ɛ 1. Further let A = [ 1.1, 1]. We want to compute the mean first passage time (MFPT) of A, i.e., the mean first exit time of the set R\A R. The right panel of Figure 11.2 shows the comparison of a brute force Monte-Carlo and a PDE solution of the MFPT. Here the Monte-Carlo solution was based on an the unbiased estimator N ˆϕ(x) = 1 N i=1 τ x i where τ x denotes the numerical approximation of the first passage time, using an Euler discretization of (11.3) with initial value x { 2, 1.9,..., 1.9, 2} and 52

53 N = 1 independent realizations. The PDE solution was obtained using a finite difference discretization of the mixed BVP Lϕ(x) = 1, x ( 3, 3) \ [ 1.1, 1] ϕ(x) =, x [ 1.1, 1] ϕ (x) =, x { 3, 3} on a fine grid with grid size x =.4 (i.e., 15 grid points). In terms of run time, the PDE solution clearly rules out the Monte-Carlo method (1 sec. vs. 1 hrs.), moreover the PDE solution is much more accurate as can be seen from the figure. However Monte-Carlo becomes competitive if the dimension grows, for the computational costs are independent of the dimension (for a single initial condition), whereas they grow exponentially with the dimension in case of a grid-based PDE solution the number of grid points grows like O(2 n ). We briefly mention some applications in which MFETs are sought. Exocytosis Exocytosis is a process by which a cell ejects the content of secretory vesicles into the extracellular domain (see Fig. 11.3). Here the cell can be modelled by an open bounded domain in R 3, and the elastic plasma membrane by a nice boundary Ω. The whole process is diffusion-dominated but strongly affected by the cell geometry and crowding. A typical question that biologists ask is, e.g., how long it takes on average for a secretory vesicle (that can be well modelled as a point particle) that is produced in the Golgi apparatus to reach the cell membrane. Vacancy diffusion in crystals A vacancy is a missing atom in an atomic lattice. If a vacancy is present, one of the adjacent atoms can move into the vacancy, creating a new vacancy on the former position of the atom. By the symmetry of the crystal lattice there is an equal probability that any of the adjacent atoms will move into the vacancy, including the atom that has just created the vacancy (see Fig. 11.4). It is possible to think of this mechanism as a moving vacancy, rather than the motion of atoms into the vacancy. In fact the vacancy undergoes some kind of random walk or diffusion between lattice sites and. Physicists are interested in measuring the diffusivity of the vacancy, i.e., how long on average it takes for the vacancy to propagate through the crystal, which can be expressed in terms of the mean exit time of a test particle from the periodic unit cell of the crystal. Transition state theory Consider the reaction of a molecule A to a different molecule B by formation of a molecular complex AB, just in the way that is depicted in Figure If the reaction is temperature-activated the situation can be conveniently modelled by a one-dimensional diffusion in a bistable potential as is described by (11.3), with V = G being the thermodynamic free energy associated with the reaction and < ɛ 1 being the temperature in the system. If we let Ω = (, b) denote the product state and Ω = b, the exit time problem (11.2) can be shown to have the explicit solution E x [τ] = 1 x ( ) G(z) G(x) exp dzdy. ɛ ɛ b x 53

Figure 11.4: Migration of atoms into the vacancy in a crystal lattice that can be interpreted as the diffusion of the vacancy (source: University of Cambridge). Figure 11.

54 Figure 11.4: Migration of atoms into the vacancy in a crystal lattice that can be interpreted as the diffusion of the vacancy (source: University of Cambridge). Figure 11.5: Temperature-activated reaction of molecule A to a molecule B by formation of a transition state AB (source: Wikipedia). Now suppose that a is the unique minimum of G in Ω and define G = G(b) G(a). Then it follows by Laplace s method (see [5]) that lim ɛ log E x[τ] = G. (11.4) ɛ Observe that in the limit ɛ the MEFT E x [τ] becomes essentially E a [τ] and hence independent of the actual initial condition. In other words, at small temperature the system is always close to the minimum at x = a. If we define the rate k AB of transitions between reactant and product state as 1/E x [τ] the transition time between the transition state and the product state is negligible and drop the limit, our asymptotic formula turns into ) k AB exp ( G, k B T with ɛ = k B T where T is the physical temperature and k B is Boltzmann s constant. The last equation is known as Kramers-Eyring law in physics or Arrhenius formula in chemistry. 54

55 Yet another Feynman-Kac formula One possible generalization of Lemma 11.9 is here: Let ψ C 2 (Ω) C( Ω) solve the linear boundary value problem Lψ(x) = f, ψ(x) = g, x Ω x Ω. (11.5) on the open bounded set Ω R n for bounded continuous functions f and g. Letting τ denote the first exit time of X t (generated by L) from Ω, we have τ ] ψ(x) = E x [g(x τ ) f(x s )ds. Again, the formula follows by using Dynkin s formula: [ τ ] E x [ψ(x τ )] = ψ(x) + E x Lψ(X s )ds τ ] ψ(x) = E x [g(x τ ) f(x s )ds. 12 Day 12, : Fokker-Planck equation 12.1 Propagation of densities We begin our discussion by considering again the SDE dx t = V (X t )dt + 2ɛdB t (12.1) with a bistable potential V : R n R. Let τ A be the stopping time τ A = inf {t > : X t / Ω A }, of an open bounded set Ω A R n (see Fig below). Recall the small noise asymptotics for τ A, which should be read as in (11.4): E x [τ A ] e V/ɛ as ɛ. Let (X t ) t be an infinitely long realization of (12.1) and consider the situation depicted in Figure 12.1 where the wells Ω A, Ω B may represent, e.g., biologically relevant conformations of a molecule, ice age and interglacial periods of our climate system, or reactant and product of a catalytic reaction. An interesting question now is how much time does X t spend in either Ω A or Ω B? To this end recall that the distribution of X t at time t is given by p(t, x, A) = P (X t A X = x), A R n Borel set with density ρ(y, t; x), i.e. p(t, x, Ω B ) = ρ(y, t; x)dy. Ω B 55

ΔV Ω A Ω B Figure 12.1: Bistable potential V : R 2 R with basins of attraction Ω A and Ω B ; V > is the potential barrier between the left well and the saddle point. 12.2 Adjoint semigroup: transfer operator We seek an evolution equation for the density ρ and study its properties for the long time limit t (and for ɛ being small).

56 ΔV Ω A Ω B Figure 12.1: Bistable potential V : R 2 R with basins of attraction Ω A and Ω B ; V > is the potential barrier between the left well and the saddle point Adjoint semigroup: transfer operator We seek an evolution equation for the density ρ and study its properties for the long time limit t (and for ɛ being small). When we do not need to consider the initial value x, we will use ρ(y, t) := ρ(y, t; x) to refer to the density. Theorem Let (X t ) t solve the SDE dx t = b(x t )dt + σ(x t )db t, X = x with the initial value x being randomly distributed according to the density ρ. Then the density ρ(, t; x) = ρ(, t) solves the Fokker-Planck equation ( ) t L ρ(y, t) = where ρ(y, ) = ρ (y) and L is the L 2 -adjoint of the infinitesimal generator L = 1 2 σσ : 2 + b. Proof. Let u(x, t) = E x [f(x t )] = (S t f)(x) where we recall that (S t ) t> is the one-parameter family of semigroups S t = e tl. Then E [f(x t )] = E x [f(x t )] ρ (x)dx R n = u(x, t)ρ (x)dx R n = (S t f)(x)ρ (x)dx R n = f(x) ( S ) t ρ (x)dx. R n 56

57 Note that u(x, t)ρ (x)dx = (Lu)(x, t)ρ (x)dx t R t n R n = (S t Lf)(x)ρ (x)dx R n = f(x) ( L S ) t ρ (x)dx, R n which implies that (exp(tl)) = exp(tl ), i.e., L generates the evolution semigroup St : (C (R n )) (C (R n )) on the dual space (C (R n )) this the space of (nonnegative) finite Borel measures. 9 By definition of ρ, we also have E [f(x t )] = f(y) dp Xt (y) = f(y)ρ(y, t)dy. Therefore, f(y) ( e tl ρ ) (y)dy = f(y)ρ(y, t)dy and, since f C (R n ) was arbitrary, it follows that ( e tl ρ ) (y) = ρ(y, t) with ρ(y, ) = ρ (y). This yields so ρ solves the Fokker-Planck equation. ρ t = L ( e L t ) ρ = L ρ Example 12.2 (Ornstein-Uhlenbeck Process). The Ornstein-Uhlenbeck (OU) process is described by the SDE dx t = AX t dt + BdW t, X = x, where we assume that X = x is fixed, A R n n and B R n m. The infinitesimal generator of the process X t satisfies (for all scalar-valued twice continuously differentiable functions h in its domain) Lh(x) = 1 2 BB : 2 h(x) + (Ax) h(x) and the formal L 2 -adjoint of L, L satisfies L u, v L 2 = (L u) vdx = u (Lv) dx = u, Lv L 2. 9 Note that the adjoint is defined with respect to the inner product in the space L 2 (R n ), but using a density argument it can be shown that it is sufficient to consider f C (R n ), since C (R n ) is a dense subspace of L 2 (R n ). 57

58 We calculate L using the above and integration by parts: ( ) 1 u (Lv) dx = u 2 BB : 2 v + (Ax) v dx = 1 ( ) 1 u 2 2 BB v dx v (uax)dx = 1 ( ) 1 v 2 2 BB : 2 u dx v ( u (Ax) + u (Ax)) dx ( ) 1 = v 2 BB : 2 u (Ax) u tr(a)u dx which implies that (here tr( ) denotes the trace operator) L h(x) = 1 2 BB : 2 h(x) (Ax) h(x) tr(a)h(x). This gives us the Fokker-Planck equation, ρ t 1 2 BB : 2 ρ + (ρax) =, the solution of which we know already (cf. Example 8.5), namely, ( t ) X t N e At x, e As BB e A s ds In other words, the OU process at time t > is Gaussian with mean and covariance E[X t ] = e At x E [ (X t E [X t ]) (X t E [X t ]) ] = In particular we observe that lim ρ(y, t) = δ x(y) t t (weakly). e At BB e A tdt, If the eigenvalue of A lie entirely in the left complex half-plane C (i.e., have strictly negative real part), then both mean and covariance converge as t : and E[X t ] E [ (X t E [X t ]) (X t E [X t ]) ] e At BB e A tdt, i.e., for almost all initial conditions, the solution ρ(x, t) of the Fokker-Planck equation converges to the Gaussian density ( ) ρ = N, e As BB e A s ds Exercise Compute the solution of the Fokker-Planck equation ( ρ t = 2 σ 2 x 2 ) x 2 2 ρ x (µxρ), lim ρ(x, t) = δ a(x) t for some a >. (Hint: write down the corresponding SDE.) 58

59 12.3 Invariant measures We now ask whether the solution (X t ) t associated with a given SDE has an invariant measure, and if so, whether it is approached independently of the initial distribution of X. Let us make these questions precise. Definition 12.4 (Invariant measure). A finite, nonnegative Borel measure µ is called an invariant measure if X µ implies that X t µ for all t >. It readily follows from the properties of the Fokker-Planck equation, that if ρ is the Lebesgue density associated with the invariant measure µ, then L ρ =. In other words, stationary distributions are steady-state solutions of the Fokker- Planck equation. Example 12.5 (Ornstein-Uhlenbeck process, cont d). If λ(a) C, i.e., all eigenvalues of A have strictly negative real part, then ( ρ (x) = (det(2πc )) 1/2 exp 1 ) 2 x C 1 x, where C = (If C is not invertible, then C 1 Boltzmann-Gibbs measures e As BB e A s ds. is the Moore-Penrose pseudo-inverse.) An interesting special class of SDEs with unique known invariant measure are gradient systems of the form dx t = V (X t )dt + 2ɛdB t, X = x. For this class of SDEs, the density of the invariant measure is given by the ρ = 1 Z e V/ɛ, Z = e V (x)/ɛ dx, R n provided that the integral exists. The corresponding invariant measure is called Boltzmann distribution or Gibbs measure. Recall that our initial question was to ask how much time does the process (X t ) t> spend in Ω B. We now answer that question: 1 T µ (Ω B ) = lim χ ΩB (X t )dt T T a.s. for µ -almost all initial conditions X = x. (We are anticipating results from the forthcoming lectures.) The above expression means that the invariant measure of Ω B is given by the average time the process (X t ) t> spends in Ω B, where the average is taken over an infinitely long realization of time. Note that we 59

60 can also express the invariant measure of Ω B by a spatial average using the associated density of the invariant measure, µ (Ω B ) = 1 e V (x)/ɛ dx. Z Ω B The spatial average is usually difficult to compute in dimensions greater than two or three, but the time average can be computed in a Markov Chain Monte- Carlo (MCMC) fashion by running a long realization of the SDE. 13 Day 13, : Long-term behaviour of SDEs Suggested references: [7, 21] 13.1 Invariant measures, cont d We confine our attention to gradient-type systems of the form dx t = V (X t )dt + 2ɛdB t, (13.1) with a smooth potential V : R n R that is bounded below and sufficiently growing at infinity (e.g., coercive). For sufficiently smooth functions (e.g., C(R 2 n )), the infinitesimal generator of X t and its formal L 2 -adjoint are L = ɛ = V, L = ɛ + V + V. The corresponding Fokker-Planck equation ρ t = ɛ ρ + (ρ V ), ρ(x, ) = ρ (x) (13.2) with ρ has the unique stationary solution ρ = 1 Z e V/ɛ, Z = e V (x)/ɛ dx, R n where Z < by the assumptions on V. Example 13.1 (OU process with symmetric stiffness matrix). The SDE dx t = KX t dt + 2ɛdW t, X = x, with symmetric positive definite (s.p.d.) stiffness matrix K R n n is a gradient system with quadratic potential V (x) = 1 2 x Kx. Its unique invariant measure is Gaussian with mean zero and covariance C = ɛk 1 and has full topological support (i.e., has a strictly positive density ρ ). Remark A few remarks are in order. 6

61 V, 3 V, t ΔV x x Figure 13.1: Bistable potential V : R 2 R with basins of attraction Ω A and Ω B ; V > is the potential barrier between the left well and the saddle point. (i) The Fokker-Planck equation can be written in divergence form. In our case it reads ρ = J, J = ɛ ρ + ρ V. t Integrating ρ(x, t) over x R n and using Gauss theorem, it follows that d ρ(x, t)dx =, dt R n which implies that ρ(, t) L 1 (R n ) = ρ L 1 (R n ) = 1. That is, normalized densities stay properly normalized throughout time. (ii) It follows by PDE arguments (strong maximum principle for elliptic equations) that ρ is the only stationary solution of (13.2). (iii) It is often helpful to think of ρ as the solution to the eigenvalue problem L v = λv for the simple eigenvalue λ = (it must be simple, because ρ is unique). Equivalently, we may think of ρ as the solution to the fixed-point equation S t ρ = ρ, where (St ) t is the adjoint semigroup or transfer operator (propagator). In other words, ρ is the eigenfunction of St = exp(tl ) to the eigenvalue σ = 1. Note that St and L have the same eigenvectors; their eigenvalues are related by σ(t) = exp(λt), i.e., the eigenvalues of the transfer operator depend on time, with the sole exception of the simple eigenvalue σ = Convergence to equilibrium We want to study the long-term behaviour of solutions to (13.2). In particular, we want to know at which speed the stationary distribution is approached (if at all). Let us start with some preliminary considerations: 61

62 Contraction property. Recall that the semigroup (S t ) t : C (R n ) C (R n ) associated with (X t ) t is a contraction, i.e., S t 1 for all t in the operator norm that is induced by the (supremum) norm on C (R n ). We further know that S t is bounded from below by the spectral radius r σ (S t ). Since the eigenvalue σ = 1 is simple and the generator L has real spectrum (as we will see below), it follows that r σ (S t ) = 1 and that all other eigenvalues are strictly less than 1 in modulus. The same goes for S t and L. Hence we can expect that all solutions of (13.2) converge to the stationary solution ρ, i.e., ρ = S t ρ ρ in some suitable norm (e.g., L 1 ). This is the good news. Relaxation time scale. Now comes the bad news. Imagine a situation as in Figure 13.1 with small noise and recall the asymptotic expression (11.4) for the mean first exit time from the left well of the potential V : if τ = inf{t > : X t / left well} denotes the first exit time from the left energy well of V as shown in Figure 13.1, then the mean first exit time satisfies lim ɛ log E x[τ] = V. ɛ This implies that for the initial density in the left well to propagate to the right well and relax to the stationary distribution, we have to wait at least ( T = O e V/ɛ). Hence when V ɛ, as is the case in most interesting applications, the typical time scale on which the convergence ρ ρ takes place is exponentially large in the dominant barrier height of the potential V. Theorem 13.3 (Bakry & Emery, 1983). Let V satisfy the convexity condition 2 V αi for some α >. Further let ρ t = ρ(, t) denote the solution to the Fokker-Planck equation (13.2) with initial density ρ L 1 (R n ). Then ρ t ρ L 1 (R n ) e αɛt ρ ρ L 1 (R n ). The proof of the theorem is quite involved (e.g., see [15]), and we will prove a slightly simpler version instead. This will suffice to get the idea of the general proof that is based on entropy estimates. We will need the following definition. Definition 13.4 (Relative entropy). Let µ be a probability measure on (R n, B(R n )) that is absolutely continuous with respect to dµ = ρ dx. The quantity ( ) dµ H(µ µ ) = log dµ R dµ n is called relative entropy or Kullback-Leibler divergence of µ and µ. declare that H = if µ is not absolutely continuous with respect to µ. We 62

63 We use the agreement that log =. It can be shown, using Jensen s inequality, that H is nonnegative, with H(µ µ ) = if and only if µ = µ a.e. Even though H is not symmetric in its arguments, it can be used to bound the variation distance between µ and µ : if ρ and ρ denote the Lebesgue densities with respect to µ and µ, then ρ ρ L1 (R n ) 8H(µ µ ). (13.3) The inequality is known as Csiszár-Kullback inequality [4, 13]. Lemma Let ρ t = ρ(x, t) be the solution of the Fokker-Planck equation for a quadratic potential V (x) = α 2 x 2, α >. Then ρ t ρ L1 (R n ) e αɛt ρ ρ L1 (R n ). Proof. The sketch of proof is based on a Gronwall estimate of the entropy production rate dh/dt. Specifically, let (in a slight abuse of notation) ( ) ρt H(ρ t ρ ) = log ρ t dx. R ρ n Further let ρ L 1 (R n ) = 1. Using integration by parts and the norm conservation property of the Fokker-Planck equation, it follows that dh dt = ρ t R t log ρ t dx + α x 2 ρ t 2ɛ n R t dx n = ( log ρ t ) (ɛ ρ t + αxρ t ) dx α x (ɛ ρ t + αxρ t ) dx R ɛ n R n = (ɛ ρ t 2 ) + 2αx ρ t + α2 R ρ n t ɛ x 2 ρ t dx = ɛ log ρt + α 2 x ɛ ρ t dx, R n which shows that the entropy H is nonincreasing along solutions of the Fokker- Planck equation, i.e., dh/dt. (This is to say that H is a Lyapunov function for ρ t.) Now comes a bit of magic: by using a functional inequality for the entropy, called logarithmic Sobolev inequality (e.g., see [15]), it can be shown that the last integral can be bounded from above by 2αɛH, i.e., d dt H(ρ t ρ ) 2αɛH(ρ t ρ ). The inequality is a (functional) differential inequality, and it follows by the Gronwall Lemma on page 41 that H(ρ t ρ ) e 2αɛt H(ρ ρ ). The Csiszár-Kullback inequality (13.3) then yields the desired result. 63

64 Most of the rigorous results that prove exponential convergence of ρ t towards the stationary distribution are for strictly convex potentials. From the properties of the semigroup S t or its adjoint S t we may nonetheless expect that ρ t ρ Ce λt holds true in some suitable norm and for some constants C, λ >. Let us briefly explain why Spectral gaps and the space L 2 (R n, µ ) We begin with a short excursus on the spectral properties of the generator and its semigroup. To this end, let L 2 (R n, µ ) denote the space of functions that are square-integrable with respect to the invariant measure µ, i.e., { } L 2 (R n, µ ) = u: R n R : u(x) 2 ρ (x) dx <. R n This is a Hilbert space with scalar product u, v µ = u(x)v(x) ρ (x) dx R n and induced norm µ. It can be readily seen that L is symmetric as an operator on C(R 2 n ) L 2 (R n, µ ): Lu, v µ = ɛ u(x) v(x)ρ (x) dx = u, Lv µ. (13.4) R n Technical details aside, any (bounded) symmetric operator on a Hilbert space has real eigenvalues and orthogonal eigenvectors. 1 Moreover, by the above, Lf, f µ = ɛ f µ, which proves that L is negative definite on L 2 (R n, µ ). eigenvalue problem Lφ n = λ n φ n, n =, 1, 2,... is well posed with discrete eigenvalues The corresponding λ > λ 1 λ 2..., where λ = is simple, and pairwise orthogonal eigenvectors {φ n } n= that (if properly normalized) form an orthonormal basis of L 2 (R n, µ ), i.e., φ i, φ j µ = δ ij. (13.5) We can now represent any function f(, t) L 2 (R n, µ ) in terms of the eigenfunctions by the expansion f(x, t) = c j (t)φ j (x), c j (t) = f(, t), φ j µ. j= 1 Clearly L here is an unbounded operator, but it can be shown that it is essentially selfadjoint on a suitable subspace of C 2 (Rn ) L 2 (R n, µ ) which implies that its spectrum is discrete and real, with eigenfunctions that form an orthonormal basis of L 2 (R n, µ ). 64

65 Fokker-Planck equation in L 2 (R n, µ ) For our purposes it is convenient to normalize the distribution by ρ ; instead of ρ t, we consider the evolution of a function ψ t = ψ(x, t) defined by ρ(x, t) = ψ(x, t)ρ (x). Inserting the last expression into the Fokker-Planck equation (13.2) it is easy to show that ψ t solves the backward Kolmogorov equation ψ t = Lψ, ψ = ρ /ρ. We assume that ψ L 2 (R n, µ ), which is the case whenever ρ L 1 (R n ) is continuous and, e.g., compactly supported, such that ψ admits the expansion ψ (x) = ψ, φ j µ φ j (x). j= Also ψ t = exp(tl)ψ can be expanded into eigenfunctions. This yields ψ t = ψ, φ j µ e tl φ j = ψ, φ j µ e tλj φ j j= where we have tacitly assumed that the sum is uniformly convergent and used that L and exp(tl) have the same eigenvectors φ j. Now recall that φ = 1 is the unique eigenvector to the eigenvalue λ =, corresponding to the stationary distribution ρ, and observe that ψ, φ µ = Therefore, with (13.5) and ψ = 1, j= R n ρ ρ ρ dx = ρ L 1 (R n ) = 1. ψ t ψ 2 L 2 (R n,µ = ) e 2λjt ψ, φ j 2 µ e 2λ1t ψ 2 L 2 (R n,µ, ) j=1 which shows that ψ t ψ in L 2 (R n, µ ), exponentially fast with rate given by the first eigenvalue or spectral gap λ 1 <. In other words, with C = ψ L 2 (R n,µ ). ψ t ψ L 2 (R n,µ ) Ce λ1t, (13.6) Remark If we remove the scaling by the invariant density and translate (13.6) into probability densities again, we see that ρ t ρ 2 ρ 1 dx Ce λ1t. R n This is not quite what we would like to have as the assumption C < requires that ρ is square integrable with respect to ρ exp(v/ɛ); a more natural norm would be L 1, but at least our short calculation should have made clear that convergence to equilibrium is related to the first nonzero eigenvalue(s) of the generator L. Unfortunately, one can show that λ 1 e V/ɛ as ɛ, that is, in the presence of large energy barriers, the convergence is rather slow as λ 1 (even though it is still exponential). 65

66 14 Day 14, : Markov chain approximations Recall that the probability distribution associated to an SDE of the form satisfies for all Borel sets A R n where dx t = V (X t )dt + 2ɛdB t, X ρ (14.1) lim P(X t A) = µ (A) t µ (A) = A ρ (x)dx. As we have seen, the convergence above is exponentially fast with a rate λ 1 < given by the first nonzero eigenvalue of the generator L = ɛ V. Unfortunately, for small ɛ >, λ 1 (E x [τ]) 1, with τ being the first exit time from the deepest well in the potential energy landscape, in other words, λ 1 may be arbitrarily close to zero Ergodicity We can evaluate ρ (x) e V (x)/ɛ but we cannot sample from it. There are two central issues related to sampling the stationary from simulations of (14.1). How do we compute expectation value with respect to ρ? Put differently, given a sufficiently long realization (X t ) t [,T ] of (14.1), can we assume that fdµ 1 T f(x t ) dt, R T n i.e., does X t sample ρ? And if so, how large must T be? This brings us to the second question: Can we say anything about the small-noise asymptotics for the distribution of X t and time t? If yes, it might tell us something about a reasonable choice of T, as the typical relaxation time is exponentially in the eigenvalue that depends on ɛ. 11 Remark In the case of a two-well potential with V (x) = at the two minima x = ±1, it can be shown that µ 1 2 δ δ +1, i.e., that the invariant density converges (in distribution, or in the weak- sense) to the weighted sum of two Dirac measures concentrated at the minima. Theorem 14.2 (Strong Law of Large Numbers). Let f C b (R n ). Then (given the usual growth, smoothness and boundedness conditions on V ), 1 E [f] = lim T T for almost all initial conditions X = x. T f(x t )dt 11 We may expect that this choice will have to do with Kramer s formula, equation (11.4). a.s. a.s. 66

67 12 x Euler M Euler ρ x Figure 14.1: Empirical distribution and exact Boltzmann density (red curve) for the symmetric double-well potential. Proof. See [8, Thm. 2.5]. The interpretation of the above statement is that if the SDE has a unique invariant measure, then almost all realizations converge to the same invariant distribution, in the sense that expectations agree, a property that is known as ergodicity. Remark The Law of Large Numbers does not hold for general SDEs. For example, the infinitesimal generator L may have pure imaginary eigenvalues or the invariant density ρ may not be everywhere positive. In these cases, the system may not be ergodic or there may depend on the initial conditions Markov Chain Monte Carlo (MCMC) Now we consider two possible methods for sampling ρ, i.e., for computing expectation values of an integrable function f. First attempt: We use the Euler-Maruyama scheme and take the sample mean X n+1 = X n t V ( X n ) + 2ɛ B n+1, E [f] 1 N N f( X i ). i=1 We make the following observations: 67

68 (i) We know that on any finite time interval [, T ] the weak error satisfies max E[f(X t n )] E[f( X n )] C t n where C = Ae BT for suitable constants A, B > (see Section 9.2). This implies that the error in this attempt is essentially of order 1 as T. (ii) We may think of ( X n ) n N transition probability kernel as a homogeneous Markov chain with the Q(x, A) = P( X n+1 A Xn = x) for a Borel set A R n, with the Gaussian transition density ) q(x, y) = (4πɛ t) n/2 y x + t V (x) 2 exp (. 4ɛ t The associated invariant distribution satisfies µ (A) = Q(x, A)d µ (x) (14.2) R n or equivalently ρ (y) = q(x, y) ρ (x)dx. R n (Recall that in the discrete-state space setting the invariant distribution is the left eigenvector of the transition matrix to the eigenvalue λ = 1, i.e., we have µ (x) = y µ (y)q xy ; equation (14.2) is simply the continuous analogue of the last identity.) It is easy to see that (see Fig. 14.1) µ (A) Q(x, A)dµ (x), (14.3) R n i.e., µ is not invariant under the Euler scheme. 12 In fact for an arbitrary potential V, the Markov chain with transition probability Q must not have an invariant measure at all (i.e., the chain may be transient). Remark It is not advisable to let t and N in the attempt above to reduce the numerical error, because even with large time steps, simulations take a long time. Decreasing the time steps would make simulations too long to be practical. Second attempt: By slightly altering the transition rule of the Euler scheme we can construct a Markov chain (Y n ) n that is similar to ( X n ) n with transition probability P (x, A) satisfying µ (A) = P (x, A)dµ (x). R n We will do this by Metropolizing the euler scheme. To this end let us recall some facts from the theory for Markov chains: 12 Try V (x) = x 2. 68

69 (i) A sufficient condition for invariance is detailed balance, i.e. ρ (x)p(x, y) = ρ (y)p(y, x) x, y R n where p(x, y) is the Lebesgue density of the transition kernel P (x, ). Then ρ (x)p(x, y)dy = ρ (x)p (x, R n ) = ρ (x) = ρ (y)p(y, x)dy, R n R n where the last equality is exactly the condition that ρ must fulfil in order to be the invariant density under P. (ii) If the Markov chain is irreducible and aperiodic with strictly positive invariant measure µ >, then the Markov chain is ergodic, i.e. for almost all initial conditions x it follows that lim sup P n (x, A) µ (A) = n A R n where P n (x, A) := P (X n A X = x) and the supremum runs over all measurable sets A R n (total variation norm); e.g., see [17] Metropolis-Adjusted Langevin Algorithm (MALA) We combine the Euler scheme for equation (14.1), also known as Langevin algorithm, with the Metropolis Hastings scheme. Algorithm MALA proceeds by iterating the following two steps (proposal and acceptance-rejection step): 1. Let Y n = x and generate a proposal according to y = x t V (x) + 2ɛ B n+1. (The proposal step x y has transition probability density q(x, y).) 2. Accept Y n+1 = y with probability otherwise, set Y n+1 = x. a(x, y) := min { 1, ρ } (y)q(y, x) ; ρ (x)q(x, y) The above acceptance criterion yields a Markov chain with transition probability density p(x, y) = q(x, y)a(x, y) for x y. The total rejection probability, i.e., the probability that the chain remains at x then is r(x) = 1 q(x, y)a(x, y)dy R n The transition kernel of MALA can be expressed as ( ) P (x, dy) = q(x, y)a(x, y)dy + 1 q(x, y)a(x, y)dy δ x (dy). R n 69

70 Euler exact Metropolized Euler exact density density x x Figure 14.2: Empirical distributions and exact Boltzmann density (orange curve) for the symmetric double-well potential: The left panel shows the systematic bias of Euler s method, the right panel shows the Metropolized simulation. The Markov chain described by the transition density above satisfies detailed balance with respect to the invariant density ρ and inherits all the nice properties from the Euler scheme (such as irreducibility and aperiodicity), so the Markov chain is ergodic. In particular, for any stable time step t >, sup P n (x, A) µ (A) A R n as n and 1 E [f] = lim N N N f(y i ) Example Figure 14.2 shows a comparison between the ordinary and the Metropolis-adjusted Euler scheme for a quadratic potential. We observe that for a reasonable time step, the Euler scheme shows a systematic bias that is corrected by the acceptance-rejection procedure of MALA. In some cases the convergence can be shown to be geometric, which is the analogue of the exponential convergence in the time-continuous case. Theorem 14.7 (Geometric ergodicity of MALA). If V (x) x q for x with 1 q < 2, then for some < ϱ < 1. Proof. See [2] i=1 a.s. sup P n (x, A) µ (A) Cϱ n A R n Remark We conclude with some comments on MALA. (i) Convergence of MALA holds for all stable step sizes t and under fairly mild conditions on the potential V, basically polynomial growth is enough. (ii) The speed of convergence does depend on the details of the potential landscape; for potentials that do not fall under the category of Theorem 14.7, the convergence can be shown to be slower than geometric (see [2]). 7

Hence there is a trade-off between many integration steps and small rejection rate ( t small) and fewer integration steps with a possibly large rejection rate ( t large). 15 Day 15, 5.2.

71 Figure 15.1: 3 3 lattice with states o and x, also known as Tic Tac Toe. The total number of lattice configurations is S = 2 9 (source: Wikipedia). (iii) Even though MALA converges independently of t, the acceptance probability decreases when t is increased. Hence there is a trade-off between many integration steps and small rejection rate ( t small) and fewer integration steps with a possibly large rejection rate ( t large). 15 Day 15, : Small-noise asymptotics Suggested references: [2] 15.1 The Ising model as an illustration We consider the approximation of SDEs by jump processes. The motivation for this is the 2-dimensional Ising model, which is a physical model for ferromagnetism and that is one of the best-studied models in statistical physics. 13 The model consists of an n n lattice consisting of N = n 2 sites, with a spin s i {±1} for each i = 1,..., N. In this case, the process takes values in the state space consisting of all possible configurations s = (s 1,..., s N ) of spins on the lattice (2 N possible configurations; see Fig. 15.1). The interesting element comes in via interactions between spins on adjacent lattice sites; these interactions are determined by the Hamiltonian H(s) := i,j Js i s j for J > constant. The summation index i, j indicates that the sum goes over all nearest neighbours as we consider only interactions between nearest neighbours connected by an edge on the lattice. The dynamics of the process are described by a Metropolis Markov chain with, e.g., single-spin flips which are uniformly chosen and accepted with probability min {1, exp( H/ɛ)}, H = H(s ) H(s) 13 Ernst Ising ( ), German physicist and mathematician 71

Lecture notes for Numerik IVc - Numerics for Stochastic Processes, Wintersemester 2012/2013. Instructor: Prof. Carsten Hartmann

Lecture notes for Numerik IVc - Numerics for Stochastic Processes, Wintersemester 22/23. Instructor: Prof. Carsten Hartmann Scribe: H. Lie Course outline. Probability theory (a) Some basics: stochastic