Solving stochastic differential equations and Kolmogorov equations by means of deep learning
Christian Beck¹, Sebastian Becker², Philipp Grohs³, Nor Jaafari⁴, and Arnulf Jentzen⁵

arXiv preprint, v1 [math.NA], 1 Jun 2018

¹ Department of Mathematics, ETH Zurich, Zurich, Switzerland, christian.beck@math.ethz.ch
² ZENAI AG, Zurich, Switzerland, sebastian.becker@zenai.ch
³ Faculty of Mathematics and Research Platform Data Science, University of Vienna, Vienna, Austria, philipp.grohs@univie.ac.at
⁴ ZENAI AG, Zurich, Switzerland, nor.jaafari@zenai.ch
⁵ Department of Mathematics, ETH Zurich, Zurich, Switzerland, arnulf.jentzen@sam.math.ethz.ch

June 4, 2018

Abstract

Stochastic differential equations (SDEs) and the Kolmogorov partial differential equations (PDEs) associated to them have been widely used in models from engineering, finance, and the natural sciences. In particular, SDEs and Kolmogorov PDEs, respectively, are highly employed in models for the approximative pricing of financial derivatives. Kolmogorov PDEs and SDEs, respectively, can typically not be solved explicitly, and it has been and still is an active topic of research to design and analyze numerical methods which are able to approximately solve Kolmogorov PDEs and SDEs, respectively. Nearly all approximation methods for Kolmogorov PDEs in the literature suffer from the curse of dimensionality or only provide approximations of the solution of the PDE at a single fixed space-time point. In this paper we derive and propose a numerical approximation method which aims to overcome both of the above mentioned drawbacks and intends to deliver a numerical approximation of the Kolmogorov PDE on an entire region $[a,b]^d$ without suffering from the curse of dimensionality.
Numerical results on examples including the heat equation, the Black-Scholes model, the stochastic Lorenz equation, and the Heston model suggest that the proposed approximation algorithm is quite effective in high dimensions in terms of both accuracy and speed.
Contents

1 Introduction
2 Derivation and description of the proposed approximation algorithm
  2.1 Kolmogorov partial differential equations (PDEs)
  2.2 On stochastic differential equations and Kolmogorov PDEs
  2.3 Formulation as minimization problem
  2.4 Discretization of the stochastic differential equation
  2.5 Deep artificial neural network approximations
  2.6 Stochastic gradient descent-type minimization
  2.7 Description of the algorithm in a special case
  2.8 Description of the algorithm in the general case
3 Examples
  3.1 Setting
  3.2 Heat equation
  3.3 Geometric Brownian motions
  3.4 Black-Scholes model with correlated noise
  3.5 Stochastic Lorenz equations
  3.6 Heston model
4 Python source codes
  4.1 Python source code for the algorithm
  4.2 A Python source code associated to Subsection 3.2
  4.3 A Python source code associated to Subsection 3.3
  4.4 A Python source code associated to Subsection 3.4
  4.5 A Python source code associated to Subsection 3.5
  4.6 A Python source code associated to Subsection 3.6

1 Introduction

Stochastic differential equations (SDEs) and the Kolmogorov partial differential equations (PDEs) associated to them have been widely used in models from engineering, finance, and the natural sciences. In particular, SDEs and Kolmogorov PDEs, respectively, are highly employed in models for the approximative pricing of financial derivatives. Kolmogorov PDEs and SDEs, respectively, can typically not be solved explicitly, and it has been and still is an active topic of research to design and analyze numerical methods which are able to approximately solve Kolmogorov PDEs and SDEs, respectively (see, e.g., [17], [20], [21], [27], [28], [29], [32], [40], [41], [42], [44], [45], [46], [47], [49], [52]). In particular, there are nowadays several different types of numerical approximation methods for Kolmogorov
PDEs in the literature including deterministic numerical approximation methods such as finite differences based approximation methods (cf., for example, [7], [8], [23], [43], [58], [56]) and finite elements based approximation methods (cf., for example, [9], [10], [59]) as well as random numerical approximation methods based on Monte Carlo methods (cf., for example, [17], [20]) and discretizations of the underlying SDEs (cf., for example, [21], [27], [28], [29], [32], [40], [41], [42], [44], [45], [46], [47], [49], [52]). The above mentioned deterministic approximation methods for PDEs work quite efficiently in one or two space dimensions but cannot be used in the case of high-dimensional PDEs as they suffer from the so-called curse of dimensionality (cf. Bellman [5]) in the sense that the computational effort of the considered approximation algorithm grows exponentially in the PDE dimension. The above mentioned random numerical approximation methods involving Monte Carlo approximations typically overcome this curse of dimensionality but only provide approximations of the Kolmogorov PDE at a single fixed space-time point. The key contribution of this paper is to derive and propose a numerical approximation method which aims to overcome both of the above mentioned drawbacks and intends to deliver a numerical approximation of the Kolmogorov PDE on an entire region $[a,b]^d$ without suffering from the curse of dimensionality. The numerical scheme which we propose in this work is inspired by recently developed deep learning based approximation algorithms for PDEs in the literature (cf., for example, [3], [4], [14], [15], [16], [24], [26], [51], [57]). To derive the proposed approximation scheme we first reformulate the considered Kolmogorov PDE as a suitable infinite dimensional stochastic optimization problem (see items (ii)–(iii) in Proposition 2.7 below for details).
This infinite dimensional stochastic optimization problem is then temporally discretized by means of suitable discretizations of the underlying SDE, and it is spatially discretized by means of fully connected deep artificial neural network approximations (see (17) in Subsection 2.6 as well as Subsections 2.4–2.5 below). The resulting finite dimensional stochastic optimization problem is then solved by means of stochastic gradient descent type optimization algorithms (see (19) in Subsection 2.6, Framework 2.9 in Subsection 2.7, Framework 2.10 in Subsection 2.8, as well as (124)–(125) in Subsection 3.1). We test the proposed approximation method numerically in the case of several examples of SDEs and PDEs, respectively (see Subsections 3.2–3.6 below for details). The obtained numerical results indicate that the proposed approximation algorithm is quite effective in high dimensions in terms of both accuracy and speed. The remainder of this article is organized as follows. In Section 2 we derive the proposed approximation algorithm (see Subsections 2.1–2.6 below) and we present a detailed description of the proposed approximation algorithm in a special case (see Subsection 2.7 below) as well as in the general case (see Subsection 2.8 below). In Section 3 we test the proposed algorithm numerically in the case of several examples of SDEs and PDEs, respectively. The employed source codes for the numerical simulations in Section 3 are postponed to Section 4.
2 Derivation and description of the proposed approximation algorithm

In this section we describe the approximation problem which we intend to solve (see Subsection 2.1 below) and we derive (see Subsections 2.2–2.6 below) and specify (see Subsections 2.7–2.8 below) the numerical scheme which we suggest to use to solve this approximation problem (cf., for example, E et al. [14], Han et al. [24], Sirignano & Spiliopoulos [57], Beck et al. [3], Fujii, Takahashi, A., & Takahashi, M. [16], and Henry-Labordere [26] for related derivations and related approximation schemes).

2.1 Kolmogorov partial differential equations (PDEs)

Let $T \in (0,\infty)$, $d \in \mathbb{N}$, let $\mu \colon \mathbb{R}^d \to \mathbb{R}^d$ and $\sigma \colon \mathbb{R}^d \to \mathbb{R}^{d \times d}$ be Lipschitz continuous functions, let $\varphi \colon \mathbb{R}^d \to \mathbb{R}$ be a function, and let $u = (u(t,x))_{(t,x) \in [0,T] \times \mathbb{R}^d} \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ be a function with at most polynomially growing partial derivatives which satisfies for every $t \in [0,T]$, $x \in \mathbb{R}^d$ that $u(0,x) = \varphi(x)$ and

$$ \tfrac{\partial u}{\partial t}(t,x) = \tfrac{1}{2} \operatorname{Trace}_{\mathbb{R}^d}\!\big( \sigma(x) [\sigma(x)]^* (\operatorname{Hess}_x u)(t,x) \big) + \langle \mu(x), (\nabla_x u)(t,x) \rangle_{\mathbb{R}^d}. \quad (1) $$

Our goal is to approximately calculate the function $\mathbb{R}^d \ni x \mapsto u(T,x) \in \mathbb{R}$ on some subset of $\mathbb{R}^d$. To fix ideas we consider real numbers $a, b \in \mathbb{R}$ with $a < b$ and we suppose that our goal is to approximately calculate the function $[a,b]^d \ni x \mapsto u(T,x) \in \mathbb{R}$.

2.2 On stochastic differential equations and Kolmogorov PDEs

In this subsection we provide a probabilistic representation for the solutions of the PDE (1), that is, we recall the classical Feynman-Kac formula for the PDE (1) (cf., for example, Øksendal [50, Chapter 8]). Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space with a normal filtration $(\mathbb{F}_t)_{t \in [0,T]}$, let $W \colon [0,T] \times \Omega \to \mathbb{R}^d$ be a standard $(\Omega, \mathcal{F}, \mathbb{P}, (\mathbb{F}_t)_{t \in [0,T]})$-Brownian motion, and for every $x \in \mathbb{R}^d$ let $X^x = (X^x_t)_{t \in [0,T]} \colon [0,T] \times \Omega \to \mathbb{R}^d$ be an $(\mathbb{F}_t)_{t \in [0,T]}$-adapted stochastic process with continuous sample paths which satisfies that for every $t \in [0,T]$ it holds $\mathbb{P}$-a.s. that

$$ X^x_t = x + \int_0^t \mu(X^x_s) \, ds + \int_0^t \sigma(X^x_s) \, dW_s. \quad (2) $$

The Feynman-Kac formula (cf., for example, Hairer et al.
[22, Corollary 4.17 and Remark 4.1]) and (1) hence yield that for every $x \in \mathbb{R}^d$ it holds that

$$ u(T,x) = \mathbb{E}\big[ u(0, X^x_T) \big] = \mathbb{E}\big[ \varphi(X^x_T) \big]. \quad (3) $$
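The representation (3) can be checked numerically in a case where the SDE (2) has an explicit solution. The sketch below (our illustration, not part of the paper) takes $\mu = 0$ and $\sigma = \sqrt{2} \cdot \mathrm{Id}$, so that (1) becomes the heat equation $\tfrac{\partial u}{\partial t} = \Delta_x u$, the process is $X^x_T = x + \sqrt{2} \, W_T$, and for $\varphi(x) = \|x\|^2$ the exact solution is $u(T,x) = \|x\|^2 + 2 d T$.

```python
import numpy as np

def feynman_kac_mc(x, T, n_samples=200_000, seed=0):
    # Monte Carlo estimate of u(T, x) = E[phi(X_T^x)] for the heat
    # equation case mu = 0, sigma = sqrt(2) * Id, phi(y) = ||y||^2,
    # where X_T^x = x + sqrt(2) * W_T and W_T ~ N(0, T * Id).
    rng = np.random.default_rng(seed)
    d = len(x)
    X_T = x + np.sqrt(2.0 * T) * rng.standard_normal((n_samples, d))
    return np.mean(np.sum(X_T**2, axis=1))

d, T = 10, 1.0
x = np.ones(d)
exact = float(x @ x) + 2.0 * d * T        # u(T, x) = ||x||^2 + 2 d T
estimate = feynman_kac_mc(x, T)
assert abs(estimate - exact) / exact < 0.02
```

As the paper emphasizes, such a Monte Carlo evaluation is dimension-robust but yields $u(T,x)$ only at the single point $x$; the algorithm derived below approximates $u(T,\cdot)$ on all of $[a,b]^d$ at once.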
2.3 Formulation as minimization problem

In the next step we exploit (3) to formulate a minimization problem which is uniquely solved by the function $[a,b]^d \ni x \mapsto u(T,x) \in \mathbb{R}$ (cf. (1) above). For this we first recall the $L^2$-minimization property of the expectation of a real-valued random variable (see Lemma 2.1 below). Then we extend this minimization result to certain random fields (see Proposition 2.2 below). Thereafter, we apply Proposition 2.2 to random fields in the context of the Feynman-Kac representation (3) to obtain Proposition 2.7 below. Proposition 2.7 provides a minimization problem (see, for instance, (93) below) which has the function $[a,b]^d \ni x \mapsto u(T,x) \in \mathbb{R}$ as the unique global minimizer. Our proof of Proposition 2.7 is based on the elementary auxiliary results in Lemmas 2.1–2.6. For completeness we also present the proofs of these lemmas here. The statement and the proof of Lemma 2.3 are based on the proof of Da Prato & Zabczyk [13, Lemma 1.1].

Lemma 2.1. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $X \colon \Omega \to \mathbb{R}$ be an $\mathcal{F}/\mathcal{B}(\mathbb{R})$-measurable random variable which satisfies $\mathbb{E}[|X|^2] < \infty$. Then

(i) it holds for every $y \in \mathbb{R}$ that

$$ \mathbb{E}\big[ |X - y|^2 \big] = \mathbb{E}\big[ |X - \mathbb{E}[X]|^2 \big] + |\mathbb{E}[X] - y|^2, \quad (4) $$

(ii) it holds that there exists a unique real number $z \in \mathbb{R}$ such that

$$ \mathbb{E}\big[ |X - z|^2 \big] = \inf_{y \in \mathbb{R}} \mathbb{E}\big[ |X - y|^2 \big], \quad (5) $$

and

(iii) it holds that

$$ \mathbb{E}\big[ |X - \mathbb{E}[X]|^2 \big] = \inf_{y \in \mathbb{R}} \mathbb{E}\big[ |X - y|^2 \big]. \quad (6) $$

Proof of Lemma 2.1. Observe that the fact that $\mathbb{E}[|X|] < \infty$ ensures that for every $y \in \mathbb{R}$ it holds that

$$ \begin{aligned} \mathbb{E}\big[ |X - y|^2 \big] &= \mathbb{E}\big[ |X - \mathbb{E}[X] + \mathbb{E}[X] - y|^2 \big] \\ &= \mathbb{E}\big[ |X - \mathbb{E}[X]|^2 + 2 (X - \mathbb{E}[X]) (\mathbb{E}[X] - y) + |\mathbb{E}[X] - y|^2 \big] \\ &= \mathbb{E}\big[ |X - \mathbb{E}[X]|^2 \big] + 2 (\mathbb{E}[X] - y) \, \mathbb{E}\big[ X - \mathbb{E}[X] \big] + |\mathbb{E}[X] - y|^2 \\ &= \mathbb{E}\big[ |X - \mathbb{E}[X]|^2 \big] + |\mathbb{E}[X] - y|^2. \end{aligned} \quad (7) $$

This establishes item (i). Item (ii) and item (iii) are immediate consequences of item (i). The proof of Lemma 2.1 is thus completed.
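Identity (4) says that the $L^2$ error of any constant predictor splits into an irreducible variance term plus a squared bias, so the expectation is the best constant predictor in the $L^2$ sense. This is easy to verify numerically; the sketch below (our illustration) checks (4) on simulated data.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(scale=2.0, size=100_000)  # any square-integrable X

def mse(y):
    # empirical counterpart of E[|X - y|^2]
    return np.mean((X - y) ** 2)

m = X.mean()
for y in (-1.0, 0.5, 3.7):
    # item (i): E[|X - y|^2] = E[|X - m|^2] + |m - y|^2
    assert abs(mse(y) - (mse(m) + (m - y) ** 2)) < 1e-9
    # item (iii): the mean is the L^2-best constant predictor
    assert mse(m) <= mse(y)
```

The identity holds exactly for the empirical measure (up to floating-point error), since the sample mean of $X - m$ vanishes by construction.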
Proposition 2.2. Let $a \in \mathbb{R}$, $b \in (a,\infty)$, $d \in \mathbb{N}$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $X = (X^x)_{x \in [a,b]^d} \colon [a,b]^d \times \Omega \to \mathbb{R}$ be a $(\mathcal{B}([a,b]^d) \otimes \mathcal{F})/\mathcal{B}(\mathbb{R})$-measurable function, assume for every $x \in [a,b]^d$ that $\mathbb{E}[|X^x|^2] < \infty$, and assume that the function $[a,b]^d \ni x \mapsto \mathbb{E}[X^x] \in \mathbb{R}$ is continuous. Then

(i) it holds that there exists a unique continuous function $u \colon [a,b]^d \to \mathbb{R}$ such that

$$ \int_{[a,b]^d} \mathbb{E}\big[ |X^x - u(x)|^2 \big] \, dx = \inf_{v \in C([a,b]^d, \mathbb{R})} \bigg( \int_{[a,b]^d} \mathbb{E}\big[ |X^x - v(x)|^2 \big] \, dx \bigg) \quad (8) $$

and

(ii) it holds for every $x \in [a,b]^d$ that $u(x) = \mathbb{E}[X^x]$.

Proof of Proposition 2.2. Observe that item (i) in Lemma 2.1 and the hypothesis that $\forall \, x \in [a,b]^d \colon \mathbb{E}[|X^x|^2] < \infty$ ensure that for every function $u \colon [a,b]^d \to \mathbb{R}$ and every $x \in [a,b]^d$ it holds that

$$ \mathbb{E}\big[ |X^x - u(x)|^2 \big] = \mathbb{E}\big[ |X^x - \mathbb{E}[X^x]|^2 \big] + |\mathbb{E}[X^x] - u(x)|^2. \quad (9) $$

Fubini's theorem (see, e.g., Klenke [39, Theorem 14.16]) hence proves that for every continuous function $u \colon [a,b]^d \to \mathbb{R}$ it holds that

$$ \int_{[a,b]^d} \mathbb{E}\big[ |X^x - u(x)|^2 \big] \, dx = \int_{[a,b]^d} \mathbb{E}\big[ |X^x - \mathbb{E}[X^x]|^2 \big] \, dx + \int_{[a,b]^d} |\mathbb{E}[X^x] - u(x)|^2 \, dx. \quad (10) $$

The hypothesis that the function $[a,b]^d \ni x \mapsto \mathbb{E}[X^x] \in \mathbb{R}$ is continuous (and hence an admissible choice of $v$) therefore demonstrates that

$$ \begin{aligned} \int_{[a,b]^d} \mathbb{E}\big[ |X^x - \mathbb{E}[X^x]|^2 \big] \, dx &\geq \inf_{v \in C([a,b]^d, \mathbb{R})} \bigg( \int_{[a,b]^d} \mathbb{E}\big[ |X^x - v(x)|^2 \big] \, dx \bigg) \\ &= \inf_{v \in C([a,b]^d, \mathbb{R})} \bigg( \int_{[a,b]^d} \mathbb{E}\big[ |X^x - \mathbb{E}[X^x]|^2 \big] \, dx + \int_{[a,b]^d} |\mathbb{E}[X^x] - v(x)|^2 \, dx \bigg) \\ &\geq \inf_{v \in C([a,b]^d, \mathbb{R})} \bigg( \int_{[a,b]^d} \mathbb{E}\big[ |X^x - \mathbb{E}[X^x]|^2 \big] \, dx \bigg) = \int_{[a,b]^d} \mathbb{E}\big[ |X^x - \mathbb{E}[X^x]|^2 \big] \, dx. \end{aligned} \quad (11) $$

Hence, we obtain that

$$ \int_{[a,b]^d} \mathbb{E}\big[ |X^x - \mathbb{E}[X^x]|^2 \big] \, dx = \inf_{v \in C([a,b]^d, \mathbb{R})} \bigg( \int_{[a,b]^d} \mathbb{E}\big[ |X^x - v(x)|^2 \big] \, dx \bigg). \quad (12) $$
Again the fact that the function $[a,b]^d \ni x \mapsto \mathbb{E}[X^x] \in \mathbb{R}$ is continuous therefore proves that there exists a continuous function $u \colon [a,b]^d \to \mathbb{R}$ such that

$$ \int_{[a,b]^d} \mathbb{E}\big[ |X^x - u(x)|^2 \big] \, dx = \inf_{v \in C([a,b]^d, \mathbb{R})} \bigg( \int_{[a,b]^d} \mathbb{E}\big[ |X^x - v(x)|^2 \big] \, dx \bigg). \quad (13) $$

Next observe that (10) and (12) yield that for every continuous function $u \colon [a,b]^d \to \mathbb{R}$ with

$$ \int_{[a,b]^d} \mathbb{E}\big[ |X^x - u(x)|^2 \big] \, dx = \inf_{v \in C([a,b]^d, \mathbb{R})} \bigg( \int_{[a,b]^d} \mathbb{E}\big[ |X^x - v(x)|^2 \big] \, dx \bigg) \quad (14) $$

it holds that

$$ \begin{aligned} \int_{[a,b]^d} \mathbb{E}\big[ |X^x - \mathbb{E}[X^x]|^2 \big] \, dx &= \inf_{v \in C([a,b]^d, \mathbb{R})} \bigg( \int_{[a,b]^d} \mathbb{E}\big[ |X^x - v(x)|^2 \big] \, dx \bigg) = \int_{[a,b]^d} \mathbb{E}\big[ |X^x - u(x)|^2 \big] \, dx \\ &= \int_{[a,b]^d} \mathbb{E}\big[ |X^x - \mathbb{E}[X^x]|^2 \big] \, dx + \int_{[a,b]^d} |\mathbb{E}[X^x] - u(x)|^2 \, dx. \end{aligned} \quad (15) $$

Hence, we obtain that for every continuous function $u \colon [a,b]^d \to \mathbb{R}$ with

$$ \int_{[a,b]^d} \mathbb{E}\big[ |X^x - u(x)|^2 \big] \, dx = \inf_{v \in C([a,b]^d, \mathbb{R})} \bigg( \int_{[a,b]^d} \mathbb{E}\big[ |X^x - v(x)|^2 \big] \, dx \bigg) \quad (16) $$

it holds that

$$ \int_{[a,b]^d} |\mathbb{E}[X^x] - u(x)|^2 \, dx = 0. \quad (17) $$

This and again the hypothesis that the function $[a,b]^d \ni x \mapsto \mathbb{E}[X^x] \in \mathbb{R}$ is continuous yield that for every continuous function $u \colon [a,b]^d \to \mathbb{R}$ with

$$ \int_{[a,b]^d} \mathbb{E}\big[ |X^x - u(x)|^2 \big] \, dx = \inf_{v \in C([a,b]^d, \mathbb{R})} \bigg( \int_{[a,b]^d} \mathbb{E}\big[ |X^x - v(x)|^2 \big] \, dx \bigg) \quad (18) $$

and every $x \in [a,b]^d$ it holds that $u(x) = \mathbb{E}[X^x]$. Combining this with (13) completes the proof of Proposition 2.2.

Lemma 2.3 (Projections in metric spaces). Let $(E, d)$ be a metric space, let $n \in \mathbb{N}$, $e_1, e_2, \ldots, e_n \in E$, and let $P \colon E \to E$ be the function which satisfies for every $x \in E$ that

$$ P(x) = e_{\min\{ k \in \{1,2,\ldots,n\} \colon d(x,e_k) = \min\{ d(x,e_1), d(x,e_2), \ldots, d(x,e_n) \} \}}. \quad (19) $$

Then
(i) it holds for every $x \in E$ that

$$ d(x, P(x)) = \min_{k \in \{1,2,\ldots,n\}} d(x, e_k) \quad (20) $$

and

(ii) it holds for every $A \subseteq E$ that $P^{-1}(A) \in \mathcal{B}(E)$.

Proof of Lemma 2.3. Throughout this proof let $D = (D_1, \ldots, D_n) \colon E \to \mathbb{R}^n$ be the function which satisfies for every $x \in E$ that

$$ D(x) = (D_1(x), D_2(x), \ldots, D_n(x)) = (d(x,e_1), d(x,e_2), \ldots, d(x,e_n)). \quad (21) $$

Note that (19) ensures that for every $x \in E$ it holds that

$$ d(x, P(x)) = d\big( x, e_{\min\{ k \in \{1,2,\ldots,n\} \colon d(x,e_k) = \min\{ d(x,e_1), d(x,e_2), \ldots, d(x,e_n) \} \}} \big) = \min_{k \in \{1,2,\ldots,n\}} d(x, e_k). \quad (22) $$

This establishes item (i). It thus remains to prove item (ii). For this observe that the fact that the function $d \colon E \times E \to [0,\infty)$ is continuous ensures that the function $D \colon E \to \mathbb{R}^n$ is continuous. Hence, we obtain that the function $D \colon E \to \mathbb{R}^n$ is $\mathcal{B}(E)/\mathcal{B}(\mathbb{R}^n)$-measurable. Next note that item (i) demonstrates that for every $k \in \{1,2,\ldots,n\}$, $x \in P^{-1}(\{e_k\})$ it holds that

$$ d(x, e_k) = d(x, P(x)) = \min_{l \in \{1,2,\ldots,n\}} d(x, e_l). \quad (23) $$

Hence, we obtain that for every $k \in \{1,2,\ldots,n\}$, $x \in P^{-1}(\{e_k\})$ it holds that

$$ k \geq \min\{ l \in \{1,2,\ldots,n\} \colon d(x,e_l) = \min\{ d(x,e_1), d(x,e_2), \ldots, d(x,e_n) \} \}. \quad (24) $$

Moreover, note that (19) ensures that for every $k \in \{1,2,\ldots,n\}$, $x \in P^{-1}(\{e_k\})$ it holds that

$$ \min\Big\{ l \in \{1,2,\ldots,n\} \colon d(x,e_l) = \min_{u \in \{1,2,\ldots,n\}} d(x,e_u) \Big\} \in \big\{ l \in \{1,2,\ldots,n\} \colon e_l = e_k \big\}. \quad (25) $$

Therefore, since for every $k \in \{1,2,\ldots,n\}$ with $e_k \notin ( \cup_{l \in \mathbb{N} \cap [0,k)} \{e_l\} )$ it holds that $\{ l \in \{1,2,\ldots,n\} \colon e_l = e_k \} \subseteq \{ k, k+1, \ldots, n \}$, we obtain that for every $k \in \{1,2,\ldots,n\}$, $x \in P^{-1}(\{e_k\})$ with $e_k \notin ( \cup_{l \in \mathbb{N} \cap [0,k)} \{e_l\} )$ it holds that

$$ \min\Big\{ l \in \{1,2,\ldots,n\} \colon d(x,e_l) = \min_{u \in \{1,2,\ldots,n\}} d(x,e_u) \Big\} \geq k. \quad (26) $$

Combining this with (24) yields that for every $k \in \{1,2,\ldots,n\}$, $x \in P^{-1}(\{e_k\})$ with $e_k \notin ( \cup_{l \in \mathbb{N} \cap [0,k)} \{e_l\} )$ it holds that

$$ \min\Big\{ l \in \{1,2,\ldots,n\} \colon d(x,e_l) = \min_{u \in \{1,2,\ldots,n\}} d(x,e_u) \Big\} = k. \quad (27) $$
Hence, we obtain that for every $k \in \{1,2,\ldots,n\}$ with $e_k \notin ( \cup_{l \in \mathbb{N} \cap [0,k)} \{e_l\} )$ it holds that

$$ P^{-1}(\{e_k\}) \subseteq \Big\{ x \in E \colon \min\Big\{ l \in \{1,2,\ldots,n\} \colon d(x,e_l) = \min_{u \in \{1,2,\ldots,n\}} d(x,e_u) \Big\} = k \Big\}. \quad (28) $$

This and (19) show that for every $k \in \{1,2,\ldots,n\}$ with $e_k \notin ( \cup_{l \in \mathbb{N} \cap [0,k)} \{e_l\} )$ it holds that

$$ P^{-1}(\{e_k\}) = \Big\{ x \in E \colon \min\Big\{ l \in \{1,2,\ldots,n\} \colon d(x,e_l) = \min_{u \in \{1,2,\ldots,n\}} d(x,e_u) \Big\} = k \Big\}. \quad (29) $$

Combining (21) with the fact that the function $D \colon E \to \mathbb{R}^n$ is $\mathcal{B}(E)/\mathcal{B}(\mathbb{R}^n)$-measurable therefore demonstrates that for every $k \in \{1,2,\ldots,n\}$ with $e_k \notin ( \cup_{l \in \mathbb{N} \cap [0,k)} \{e_l\} )$ it holds that

$$ \begin{aligned} P^{-1}(\{e_k\}) &= \Big\{ x \in E \colon \min\Big\{ l \in \{1,2,\ldots,n\} \colon d(x,e_l) = \min_{u \in \{1,2,\ldots,n\}} d(x,e_u) \Big\} = k \Big\} \\ &= \Big\{ x \in E \colon \min\Big\{ l \in \{1,2,\ldots,n\} \colon D_l(x) = \min_{u \in \{1,2,\ldots,n\}} D_u(x) \Big\} = k \Big\} \\ &= \Big\{ x \in E \colon \big( \forall \, l \in \mathbb{N} \cap [0,k) \colon D_k(x) < D_l(x) \big) \text{ and } \big( \forall \, l \in \{1,2,\ldots,n\} \colon D_k(x) \leq D_l(x) \big) \Big\} \\ &= \bigg[ \bigcap_{l=1}^{k-1} \underbrace{\{ x \in E \colon D_k(x) < D_l(x) \}}_{\in \, \mathcal{B}(E)} \bigg] \cap \bigg[ \bigcap_{l=1}^{n} \underbrace{\{ x \in E \colon D_k(x) \leq D_l(x) \}}_{\in \, \mathcal{B}(E)} \bigg] \in \mathcal{B}(E). \end{aligned} \quad (30) $$

Hence, we obtain that for every $f \in \{e_1, e_2, \ldots, e_n\}$ it holds that

$$ P^{-1}(\{f\}) \in \mathcal{B}(E). \quad (31) $$

Therefore, we obtain that for every $A \subseteq E$ it holds that

$$ P^{-1}(A) = P^{-1}(A \cap \{e_1, e_2, \ldots, e_n\}) = \bigcup_{f \in A \cap \{e_1, e_2, \ldots, e_n\}} \underbrace{P^{-1}(\{f\})}_{\in \, \mathcal{B}(E)} \in \mathcal{B}(E). \quad (32) $$

This establishes item (ii). The proof of Lemma 2.3 is thus completed.

Lemma 2.4. Let $(E, d)$ be a separable metric space, let $(\mathcal{E}, \delta)$ be a metric space, let $(\Omega, \mathcal{F})$ be a measurable space, let $X \colon E \times \Omega \to \mathcal{E}$ be a function, assume for every $e \in E$ that the function $\Omega \ni \omega \mapsto X(e, \omega) \in \mathcal{E}$ is $\mathcal{F}/\mathcal{B}(\mathcal{E})$-measurable, and assume for every $\omega \in \Omega$ that the function $E \ni e \mapsto X(e, \omega) \in \mathcal{E}$ is continuous. Then it holds that the function $X \colon E \times \Omega \to \mathcal{E}$ is $(\mathcal{B}(E) \otimes \mathcal{F})/\mathcal{B}(\mathcal{E})$-measurable.
Proof of Lemma 2.4. Throughout this proof let $(e_m)_{m \in \mathbb{N}} \subseteq E$ be a sequence which satisfies that $\overline{\{ e_m \colon m \in \mathbb{N} \}} = E$, let $P_n \colon E \to E$, $n \in \mathbb{N}$, be the functions which satisfy for every $n \in \mathbb{N}$, $x \in E$ that

$$ P_n(x) = e_{\min\{ k \in \{1,2,\ldots,n\} \colon d(x,e_k) = \min\{ d(x,e_1), d(x,e_2), \ldots, d(x,e_n) \} \}}, \quad (33) $$

and let $X_n \colon E \times \Omega \to \mathcal{E}$, $n \in \mathbb{N}$, be the functions which satisfy for every $n \in \mathbb{N}$, $x \in E$, $\omega \in \Omega$ that

$$ X_n(x, \omega) = X(P_n(x), \omega). \quad (34) $$

Note that (34) shows that for all $n \in \mathbb{N}$, $B \in \mathcal{B}(\mathcal{E})$ it holds that

$$ \begin{aligned} (X_n)^{-1}(B) &= \{ (x,\omega) \in E \times \Omega \colon X_n(x,\omega) \in B \} = \bigcup_{y \in \operatorname{Im}(P_n)} \Big( \big[ (X_n)^{-1}(B) \big] \cap \big[ (P_n)^{-1}(\{y\}) \times \Omega \big] \Big) \\ &= \bigcup_{y \in \operatorname{Im}(P_n)} \big\{ (x,\omega) \in E \times \Omega \colon X_n(x,\omega) \in B \text{ and } x \in (P_n)^{-1}(\{y\}) \big\} \\ &= \bigcup_{y \in \operatorname{Im}(P_n)} \big\{ (x,\omega) \in E \times \Omega \colon X(P_n(x),\omega) \in B \text{ and } x \in (P_n)^{-1}(\{y\}) \big\}. \end{aligned} \quad (35) $$

Item (ii) in Lemma 2.3 hence implies that for all $n \in \mathbb{N}$, $B \in \mathcal{B}(\mathcal{E})$ it holds that

$$ \begin{aligned} (X_n)^{-1}(B) &= \bigcup_{y \in \operatorname{Im}(P_n)} \big\{ (x,\omega) \in E \times \Omega \colon X(y,\omega) \in B \text{ and } x \in (P_n)^{-1}(\{y\}) \big\} \\ &= \bigcup_{y \in \operatorname{Im}(P_n)} \Big( \{ (x,\omega) \in E \times \Omega \colon X(y,\omega) \in B \} \cap \big[ (P_n)^{-1}(\{y\}) \times \Omega \big] \Big) \\ &= \bigcup_{y \in \operatorname{Im}(P_n)} \Big( \underbrace{\big[ E \times \big( (X(y,\cdot))^{-1}(B) \big) \big]}_{\in \, \mathcal{B}(E) \otimes \mathcal{F}} \cap \underbrace{\big[ (P_n)^{-1}(\{y\}) \times \Omega \big]}_{\in \, \mathcal{B}(E) \otimes \mathcal{F}} \Big) \in \mathcal{B}(E) \otimes \mathcal{F}. \end{aligned} \quad (36) $$

This proves that for every $n \in \mathbb{N}$ it holds that the function $X_n$ is $(\mathcal{B}(E) \otimes \mathcal{F})/\mathcal{B}(\mathcal{E})$-measurable. In addition, note that item (i) in Lemma 2.3 and the hypothesis that for every $\omega \in \Omega$ it holds that the function $E \ni x \mapsto X(x,\omega) \in \mathcal{E}$ is continuous imply that for every $x \in E$, $\omega \in \Omega$ it holds that

$$ \lim_{n \to \infty} X_n(x,\omega) = \lim_{n \to \infty} X(P_n(x),\omega) = X(x,\omega). \quad (37) $$

Combining this with the fact that for every $n \in \mathbb{N}$ it holds that the function $X_n \colon E \times \Omega \to \mathcal{E}$ is $(\mathcal{B}(E) \otimes \mathcal{F})/\mathcal{B}(\mathcal{E})$-measurable shows that the function $X \colon E \times \Omega \to \mathcal{E}$ is $(\mathcal{B}(E) \otimes \mathcal{F})/\mathcal{B}(\mathcal{E})$-measurable. The proof of Lemma 2.4 is thus completed.
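The projection $P_n$ from (33) is just a nearest-neighbour map with ties broken by the smallest index, and along a dense sequence it converges pointwise to the identity, which is exactly what (37) uses. A small sketch (our illustration; the dyadic sequence below is a hypothetical choice of a sequence dense in the metric space $[0,1]$):

```python
import numpy as np

def project(x, points):
    # P_n from (33): the nearest point among `points`, ties broken
    # by the smallest index (np.argmin returns the first minimizer).
    dists = np.abs(np.asarray(points) - x)
    return points[int(np.argmin(dists))]

def dense_sequence(n):
    # e_1, e_2, ...: dyadic rationals, dense in [0, 1]:
    # 0, 1, 1/2, 1/4, 3/4, 1/8, 3/8, 5/8, 7/8, ...
    pts = [0.0, 1.0]
    level = 2
    while len(pts) < n:
        pts.extend(k / level for k in range(1, level, 2))
        level *= 2
    return pts[:n]

x = 0.3
errors = [abs(project(x, dense_sequence(n)) - x) for n in (2, 8, 32, 128)]
assert all(e2 <= e1 for e1, e2 in zip(errors, errors[1:]))  # P_n(x) -> x
assert errors[-1] < 0.01
```

Measurability of $P_n$ (item (ii) of Lemma 2.3) is what makes the approximations $X_n(x,\omega) = X(P_n(x),\omega)$ jointly measurable, and the pointwise limit then transfers joint measurability to $X$ itself.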
Lemma 2.5. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $(E, d)$ and $(\mathcal{E}, \delta)$ be separable metric spaces, let $X_n \colon \Omega \to E$, $n \in \mathbb{N}_0$, be random variables which satisfy for every $\varepsilon \in (0,\infty)$ that

$$ \limsup_{n \to \infty} \mathbb{P}\big( d(X_n, X_0) \geq \varepsilon \big) = 0, \quad (38) $$

and let $\Phi \colon E \to \mathcal{E}$ be a continuous function. Then it holds for every $\varepsilon \in (0,\infty)$ that

$$ \limsup_{n \to \infty} \mathbb{P}\big( \delta(\Phi(X_n), \Phi(X_0)) \geq \varepsilon \big) = 0. \quad (39) $$

Proof of Lemma 2.5. Note that (38), e.g., Cox et al. [12, Lemma 2.4], and, e.g., Hutzenthaler et al. [35, Lemma 4.2] establish (39). The proof of Lemma 2.5 is thus completed.

Lemma 2.6. Let $d, m \in \mathbb{N}$, $T \in (0,\infty)$, $L, a \in \mathbb{R}$, $b \in (a,\infty)$, let $\mu \colon \mathbb{R}^d \to \mathbb{R}^d$ and $\sigma \colon \mathbb{R}^d \to \mathbb{R}^{d \times m}$ be functions which satisfy for every $x, y \in \mathbb{R}^d$ that

$$ \max\{ \| \mu(x) - \mu(y) \|_{\mathbb{R}^d}, \| \sigma(x) - \sigma(y) \|_{\mathrm{HS}(\mathbb{R}^m, \mathbb{R}^d)} \} \leq L \, \| x - y \|_{\mathbb{R}^d}, $$

let $\Phi \colon C([0,T], \mathbb{R}^d) \to \mathbb{R}$ be an at most polynomially growing continuous function, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space with a normal filtration $(\mathbb{F}_t)_{t \in [0,T]}$, let $\xi \colon \Omega \to [a,b]^d$ be a continuously uniformly distributed $\mathbb{F}_0/\mathcal{B}([a,b]^d)$-measurable random variable, let $W \colon [0,T] \times \Omega \to \mathbb{R}^m$ be a standard $(\mathbb{F}_t)_{t \in [0,T]}$-Brownian motion, for every $x \in [a,b]^d$ let $X^x = (X^x_t)_{t \in [0,T]} \colon [0,T] \times \Omega \to \mathbb{R}^d$ be an $(\mathbb{F}_t)_{t \in [0,T]}$-adapted stochastic process with continuous sample paths which satisfies that for every $t \in [0,T]$ it holds $\mathbb{P}$-a.s. that

$$ X^x_t = x + \int_0^t \mu(X^x_s) \, ds + \int_0^t \sigma(X^x_s) \, dW_s, \quad (40) $$

and let $\mathbb{X} \colon [0,T] \times \Omega \to \mathbb{R}^d$ be an $(\mathbb{F}_t)_{t \in [0,T]}$-adapted stochastic process with continuous sample paths which satisfies that for every $t \in [0,T]$ it holds $\mathbb{P}$-a.s. that

$$ \mathbb{X}_t = \xi + \int_0^t \mu(\mathbb{X}_s) \, ds + \int_0^t \sigma(\mathbb{X}_s) \, dW_s. \quad (41) $$

Then

(i) it holds for every $x \in [a,b]^d$ that the functions $\Omega \ni \omega \mapsto \Phi((X^x_t(\omega))_{t \in [0,T]}) \in \mathbb{R}$ and $\Omega \ni \omega \mapsto \Phi((\mathbb{X}_t(\omega))_{t \in [0,T]}) \in \mathbb{R}$ are $\mathcal{F}/\mathcal{B}(\mathbb{R})$-measurable,

(ii) it holds for every $p \in [2,\infty)$, $x, y \in [a,b]^d$ that

$$ \Big( \mathbb{E}\Big[ \sup_{t \in [0,T]} \| X^x_t - X^y_t \|_{\mathbb{R}^d}^p \Big] \Big)^{1/p} \leq \sqrt{2} \, \exp\!\big( L^2 T \big[ \sqrt{p} + \sqrt{T} \big]^2 \big) \, \| x - y \|_{\mathbb{R}^d}, \quad (42) $$

(iii) it holds for every $x \in [a,b]^d$ that $\mathbb{E}\big[ |\Phi((X^x_t)_{t \in [0,T]})| + |\Phi((\mathbb{X}_t)_{t \in [0,T]})| \big] < \infty$,

(iv) it holds that the function $[a,b]^d \ni x \mapsto \mathbb{E}[\Phi((X^x_t)_{t \in [0,T]})] \in \mathbb{R}$ is continuous, and
(v) it holds that

$$ \mathbb{E}\big[ \Phi((\mathbb{X}_t)_{t \in [0,T]}) \big] = \frac{1}{(b-a)^d} \int_{[a,b]^d} \mathbb{E}\big[ \Phi((X^x_t)_{t \in [0,T]}) \big] \, dx. \quad (43) $$

Proof of Lemma 2.6. Throughout this proof let $c \in [1,\infty)$ be a real number which satisfies for every $w \in C([0,T], \mathbb{R}^d)$ that

$$ |\Phi(w)| \leq c \Big[ 1 + \sup_{t \in [0,T]} \| w_t \|_{\mathbb{R}^d}^c \Big], \quad (44) $$

let $p_t \colon C([0,T], \mathbb{R}^d) \to \mathbb{R}^d$, $t \in [0,T]$, be the functions which satisfy for every $t \in [0,T]$, $w = (w_s)_{s \in [0,T]} \in C([0,T], \mathbb{R}^d)$ that $p_t(w) = w_t$, and let $\Psi^N_{x,w} \colon [0,T] \to \mathbb{R}^d$, $N \in \mathbb{N}$, $x \in \mathbb{R}^d$, $w \in C([0,T], \mathbb{R}^m)$, be the functions which satisfy for every $w \in C([0,T], \mathbb{R}^m)$, $x \in \mathbb{R}^d$, $N \in \mathbb{N}$, $n \in \{0,1,\ldots,N-1\}$, $t \in [\frac{nT}{N}, \frac{(n+1)T}{N}]$ that $\Psi^N_{x,w}(0) = x$ and

$$ \Psi^N_{x,w}(t) = \Psi^N_{x,w}\big( \tfrac{nT}{N} \big) + \big( \tfrac{tN}{T} - n \big) \Big[ \mu\big( \Psi^N_{x,w}\big( \tfrac{nT}{N} \big) \big) \tfrac{T}{N} + \sigma\big( \Psi^N_{x,w}\big( \tfrac{nT}{N} \big) \big) \big( w_{\frac{(n+1)T}{N}} - w_{\frac{nT}{N}} \big) \Big]. \quad (45) $$

Observe that the fact that the Borel sigma-algebra $\mathcal{B}(C([0,T], \mathbb{R}^d))$ is generated by the set $\cup_{t \in [0,T]} \cup_{A \in \mathcal{B}(\mathbb{R}^d)} \{ (p_t)^{-1}(A) \}$ (cf., for example, Klenke [39, Theorem 21.31]), the hypothesis that for every $x \in [a,b]^d$ it holds that $X^x \colon [0,T] \times \Omega \to \mathbb{R}^d$ is an $(\mathbb{F}_t)_{t \in [0,T]}$-adapted stochastic process with continuous sample paths, and the hypothesis that $\mathbb{X} \colon [0,T] \times \Omega \to \mathbb{R}^d$ is an $(\mathbb{F}_t)_{t \in [0,T]}$-adapted stochastic process with continuous sample paths demonstrate that the functions

$$ \Omega \ni \omega \mapsto (X^x_t(\omega))_{t \in [0,T]} \in C([0,T], \mathbb{R}^d) \quad (46) $$

and

$$ \Omega \ni \omega \mapsto (\mathbb{X}_t(\omega))_{t \in [0,T]} \in C([0,T], \mathbb{R}^d) \quad (47) $$

are $\mathcal{F}/\mathcal{B}(C([0,T], \mathbb{R}^d))$-measurable. Combining this with the fact that the function $\Phi \colon C([0,T], \mathbb{R}^d) \to \mathbb{R}$ is $\mathcal{B}(C([0,T], \mathbb{R}^d))/\mathcal{B}(\mathbb{R})$-measurable implies that for every $x \in [a,b]^d$ it holds that the functions $\Omega \ni \omega \mapsto \Phi((X^x_t(\omega))_{t \in [0,T]}) \in \mathbb{R}$ and $\Omega \ni \omega \mapsto \Phi((\mathbb{X}_t(\omega))_{t \in [0,T]}) \in \mathbb{R}$ are $\mathcal{F}/\mathcal{B}(\mathbb{R})$-measurable. This proves item (i).
Next observe that (45), the hypothesis that $\mu \colon \mathbb{R}^d \to \mathbb{R}^d$ and $\sigma \colon \mathbb{R}^d \to \mathbb{R}^{d \times m}$ are globally Lipschitz continuous, and the fact that for every $p \in (0,\infty)$ it holds that $\mathbb{E}[\| \xi \|_{\mathbb{R}^d}^p] < \infty$ ensure that for every $p \in (0,\infty)$ it holds that

$$ \begin{aligned} & \sup_{N \in \mathbb{N}} \sup_{x \in [a,b]^d} \Big( \mathbb{E}\Big[ \sup_{t \in [0,T]} \| \Psi^N_{x,W}(t) \|_{\mathbb{R}^d}^p \Big] + \mathbb{E}\Big[ \sup_{t \in [0,T]} \| \Psi^N_{\xi,W}(t) \|_{\mathbb{R}^d}^p \Big] \Big) \\ & = \sup_{N \in \mathbb{N}} \sup_{x \in [a,b]^d} \Big( \mathbb{E}\Big[ \max_{n \in \{0,1,\ldots,N\}} \big\| \Psi^N_{x,W}\big( \tfrac{nT}{N} \big) \big\|_{\mathbb{R}^d}^p \Big] + \mathbb{E}\Big[ \max_{n \in \{0,1,\ldots,N\}} \big\| \Psi^N_{\xi,W}\big( \tfrac{nT}{N} \big) \big\|_{\mathbb{R}^d}^p \Big] \Big) < \infty \quad (48) $$
(cf., for example, Kloeden & Platen [41, Section 10.6]). Next note that (40), (41), (45), the fact that $\mu \colon \mathbb{R}^d \to \mathbb{R}^d$ and $\sigma \colon \mathbb{R}^d \to \mathbb{R}^{d \times m}$ are locally Lipschitz continuous functions, and, e.g., Hutzenthaler & Jentzen [32, Theorem 3.3] ensure that for every $x \in [a,b]^d$, $\varepsilon \in (0,\infty)$ it holds that

$$ \limsup_{N \to \infty} \mathbb{P}\Big( \sup_{t \in [0,T]} \| X^x_t - \Psi^N_{x,W}(t) \|_{\mathbb{R}^d} \geq \varepsilon \Big) = 0 \quad (49) $$

and

$$ \limsup_{N \to \infty} \mathbb{P}\Big( \sup_{t \in [0,T]} \| \mathbb{X}_t - \Psi^N_{\xi,W}(t) \|_{\mathbb{R}^d} \geq \varepsilon \Big) = 0. \quad (50) $$

Combining (46), (47), and, e.g., Hutzenthaler & Jentzen [32, Lemma 3.1] hence demonstrates that for every $x \in [a,b]^d$, $p \in (0,\infty)$ it holds that

$$ \mathbb{E}\Big[ \sup_{t \in [0,T]} \| X^x_t \|_{\mathbb{R}^d}^p \Big] \leq \liminf_{N \to \infty} \mathbb{E}\Big[ \sup_{t \in [0,T]} \| \Psi^N_{x,W}(t) \|_{\mathbb{R}^d}^p \Big] \leq \sup_{N \in \mathbb{N}} \mathbb{E}\Big[ \sup_{t \in [0,T]} \| \Psi^N_{x,W}(t) \|_{\mathbb{R}^d}^p \Big] \quad (51) $$

and

$$ \mathbb{E}\Big[ \sup_{t \in [0,T]} \| \mathbb{X}_t \|_{\mathbb{R}^d}^p \Big] \leq \liminf_{N \to \infty} \mathbb{E}\Big[ \sup_{t \in [0,T]} \| \Psi^N_{\xi,W}(t) \|_{\mathbb{R}^d}^p \Big] \leq \sup_{N \in \mathbb{N}} \mathbb{E}\Big[ \sup_{t \in [0,T]} \| \Psi^N_{\xi,W}(t) \|_{\mathbb{R}^d}^p \Big]. \quad (52) $$

This and (48) assure that for every $p \in (0,\infty)$ it holds that

$$ \sup_{x \in [a,b]^d} \Big( \mathbb{E}\Big[ \sup_{t \in [0,T]} \| X^x_t \|_{\mathbb{R}^d}^p \Big] + \mathbb{E}\Big[ \sup_{t \in [0,T]} \| \mathbb{X}_t \|_{\mathbb{R}^d}^p \Big] \Big) < \infty. \quad (53) $$

Combining (44) and the fact that $\forall \, r \in (0,\infty), \, a, b \in \mathbb{R} \colon |a+b|^r \leq 2^r (|a|^r + |b|^r)$ therefore demonstrates that for every $p \in (0,\infty)$ it holds that

$$ \begin{aligned} & \sup_{x \in [a,b]^d} \Big( \mathbb{E}\big[ |\Phi((X^x_t)_{t \in [0,T]})|^p \big] + \mathbb{E}\big[ |\Phi((\mathbb{X}_t)_{t \in [0,T]})|^p \big] \Big) \\ & \leq \sup_{x \in [a,b]^d} \Big( c^p \, \mathbb{E}\Big[ \big| 1 + \sup_{t \in [0,T]} \| X^x_t \|_{\mathbb{R}^d}^c \big|^p \Big] + c^p \, \mathbb{E}\Big[ \big| 1 + \sup_{t \in [0,T]} \| \mathbb{X}_t \|_{\mathbb{R}^d}^c \big|^p \Big] \Big) \\ & \leq 2^p c^p \sup_{x \in [a,b]^d} \mathbb{E}\Big[ 2 + \sup_{t \in [0,T]} \| X^x_t \|_{\mathbb{R}^d}^{cp} + \sup_{t \in [0,T]} \| \mathbb{X}_t \|_{\mathbb{R}^d}^{cp} \Big] < \infty. \end{aligned} \quad (54) $$
This establishes item (iii). In the next step we observe that (40) ensures that for every $x, y \in [a,b]^d$, $t \in [0,T]$ it holds $\mathbb{P}$-a.s. that

$$ X^x_t - X^y_t = x - y + \int_0^t \big( \mu(X^x_s) - \mu(X^y_s) \big) \, ds + \int_0^t \big( \sigma(X^x_s) - \sigma(X^y_s) \big) \, dW_s. \quad (55) $$

The triangle inequality hence ensures that for every $x, y \in [a,b]^d$, $t \in [0,T]$ it holds $\mathbb{P}$-a.s. that

$$ \begin{aligned} \sup_{s \in [0,t]} \| X^x_s - X^y_s \|_{\mathbb{R}^d} & \leq \| x - y \|_{\mathbb{R}^d} + \sup_{s \in [0,t]} \int_0^s \| \mu(X^x_r) - \mu(X^y_r) \|_{\mathbb{R}^d} \, dr + \sup_{s \in [0,t]} \Big\| \int_0^s \big( \sigma(X^x_r) - \sigma(X^y_r) \big) \, dW_r \Big\|_{\mathbb{R}^d} \\ & \leq \| x - y \|_{\mathbb{R}^d} + L \int_0^t \| X^x_r - X^y_r \|_{\mathbb{R}^d} \, dr + \sup_{s \in [0,t]} \Big\| \int_0^s \big( \sigma(X^x_r) - \sigma(X^y_r) \big) \, dW_r \Big\|_{\mathbb{R}^d}. \end{aligned} \quad (56) $$

Therefore, we obtain for every $p \in [1,\infty)$, $x, y \in [a,b]^d$, $t \in [0,T]$ that

$$ \Big( \mathbb{E}\Big[ \sup_{s \in [0,t]} \| X^x_s - X^y_s \|_{\mathbb{R}^d}^p \Big] \Big)^{1/p} \leq \| x - y \|_{\mathbb{R}^d} + L \int_0^t \big( \mathbb{E}\big[ \| X^x_r - X^y_r \|_{\mathbb{R}^d}^p \big] \big)^{1/p} \, dr + \Big( \mathbb{E}\Big[ \sup_{s \in [0,t]} \Big\| \int_0^s \big( \sigma(X^x_r) - \sigma(X^y_r) \big) \, dW_r \Big\|_{\mathbb{R}^d}^p \Big] \Big)^{1/p}. \quad (57) $$

The Burkholder-Davis-Gundy type inequality in Da Prato & Zabczyk [13, Lemma 7.2] hence shows that for every $p \in [2,\infty)$, $x, y \in [a,b]^d$, $t \in [0,T]$ it holds that

$$ \Big( \mathbb{E}\Big[ \sup_{s \in [0,t]} \| X^x_s - X^y_s \|_{\mathbb{R}^d}^p \Big] \Big)^{1/p} \leq \| x - y \|_{\mathbb{R}^d} + L \int_0^t \big( \mathbb{E}\big[ \| X^x_r - X^y_r \|_{\mathbb{R}^d}^p \big] \big)^{1/p} \, dr + \sqrt{p} \, \bigg[ \int_0^t \big( \mathbb{E}\big[ \| \sigma(X^x_r) - \sigma(X^y_r) \|_{\mathrm{HS}(\mathbb{R}^m,\mathbb{R}^d)}^p \big] \big)^{2/p} \, dr \bigg]^{1/2}. \quad (58) $$

This demonstrates that for every $p \in [2,\infty)$, $x, y \in [a,b]^d$, $t \in [0,T]$ it holds that

$$ \Big( \mathbb{E}\Big[ \sup_{s \in [0,t]} \| X^x_s - X^y_s \|_{\mathbb{R}^d}^p \Big] \Big)^{1/p} \leq \| x - y \|_{\mathbb{R}^d} + L \int_0^t \big( \mathbb{E}\big[ \| X^x_r - X^y_r \|_{\mathbb{R}^d}^p \big] \big)^{1/p} \, dr + L \sqrt{p} \, \bigg[ \int_0^t \big( \mathbb{E}\big[ \| X^x_r - X^y_r \|_{\mathbb{R}^d}^p \big] \big)^{2/p} \, dr \bigg]^{1/2}. \quad (59) $$
Hölder's inequality hence proves that for every $p \in [2,\infty)$, $x, y \in [a,b]^d$, $t \in [0,T]$ it holds that

$$ \begin{aligned} \Big( \mathbb{E}\Big[ \sup_{s \in [0,t]} \| X^x_s - X^y_s \|_{\mathbb{R}^d}^p \Big] \Big)^{1/p} & \leq \| x - y \|_{\mathbb{R}^d} + L \sqrt{t} \, \bigg[ \int_0^t \big( \mathbb{E}\big[ \| X^x_r - X^y_r \|_{\mathbb{R}^d}^p \big] \big)^{2/p} \, dr \bigg]^{1/2} + L \sqrt{p} \, \bigg[ \int_0^t \big( \mathbb{E}\big[ \| X^x_r - X^y_r \|_{\mathbb{R}^d}^p \big] \big)^{2/p} \, dr \bigg]^{1/2} \\ & \leq \| x - y \|_{\mathbb{R}^d} + L \big[ \sqrt{p} + \sqrt{T} \big] \bigg[ \int_0^t \big( \mathbb{E}\big[ \| X^x_r - X^y_r \|_{\mathbb{R}^d}^p \big] \big)^{2/p} \, dr \bigg]^{1/2}. \end{aligned} \quad (60) $$

The fact that $\forall \, v, w \in \mathbb{R} \colon |v+w|^2 \leq 2 v^2 + 2 w^2$ therefore shows that for every $p \in [2,\infty)$, $x, y \in [a,b]^d$, $t \in [0,T]$ it holds that

$$ \begin{aligned} \Big( \mathbb{E}\Big[ \sup_{s \in [0,t]} \| X^x_s - X^y_s \|_{\mathbb{R}^d}^p \Big] \Big)^{2/p} & \leq 2 \, \| x - y \|_{\mathbb{R}^d}^2 + 2 L^2 \big[ \sqrt{p} + \sqrt{T} \big]^2 \int_0^t \big( \mathbb{E}\big[ \| X^x_r - X^y_r \|_{\mathbb{R}^d}^p \big] \big)^{2/p} \, dr \\ & \leq 2 \, \| x - y \|_{\mathbb{R}^d}^2 + 2 L^2 \big[ \sqrt{p} + \sqrt{T} \big]^2 \int_0^t \Big( \mathbb{E}\Big[ \sup_{s \in [0,r]} \| X^x_s - X^y_s \|_{\mathbb{R}^d}^p \Big] \Big)^{2/p} \, dr. \end{aligned} \quad (61) $$

Combining the Gronwall inequality (cf., e.g., Andersson et al. [2, Lemma 2.6] (with $\alpha = 0$, $\beta = 0$, $a = 2 \| x - y \|_{\mathbb{R}^d}^2$, $b = 2 L^2 [\sqrt{p} + \sqrt{T}]^2$, $e = ([0,T] \ni t \mapsto ( \mathbb{E}[ \sup_{s \in [0,t]} \| X^x_s - X^y_s \|_{\mathbb{R}^d}^p ] )^{2/p} \in [0,\infty])$ in the notation of that lemma)) and (53) hence establishes that for every $p \in [2,\infty)$, $x, y \in [a,b]^d$, $t \in [0,T]$ it holds that

$$ \Big( \mathbb{E}\Big[ \sup_{s \in [0,t]} \| X^x_s - X^y_s \|_{\mathbb{R}^d}^p \Big] \Big)^{2/p} \leq 2 \, \| x - y \|_{\mathbb{R}^d}^2 \exp\!\big( 2 L^2 t \big[ \sqrt{p} + \sqrt{T} \big]^2 \big). \quad (62) $$

Therefore, we obtain that for every $p \in [2,\infty)$, $x, y \in [a,b]^d$ it holds that

$$ \Big( \mathbb{E}\Big[ \sup_{t \in [0,T]} \| X^x_t - X^y_t \|_{\mathbb{R}^d}^p \Big] \Big)^{1/p} \leq \sqrt{2} \, \exp\!\big( L^2 T \big[ \sqrt{p} + \sqrt{T} \big]^2 \big) \, \| x - y \|_{\mathbb{R}^d}. \quad (63) $$

This establishes item (ii). Next observe that item (ii) and Jensen's inequality imply that for every $p \in (0,\infty)$ it holds that

$$ \sup_{x, y \in [a,b]^d, \, x \neq y} \frac{ \big( \mathbb{E}\big[ \sup_{t \in [0,T]} \| X^x_t - X^y_t \|_{\mathbb{R}^d}^p \big] \big)^{1/p} }{ \| x - y \|_{\mathbb{R}^d} } < \infty. \quad (64) $$
The hypothesis that the function $\Phi \colon C([0,T], \mathbb{R}^d) \to \mathbb{R}$ is continuous and Lemma 2.5 hence ensure that for every $\varepsilon \in (0,\infty)$, $x \in [a,b]^d$, $(x_n)_{n \in \mathbb{N}} \subseteq [a,b]^d$ with $\limsup_{n \to \infty} \| x - x_n \|_{\mathbb{R}^d} = 0$ it holds that

$$ \limsup_{n \to \infty} \mathbb{P}\big( |\Phi((X^x_t)_{t \in [0,T]}) - \Phi((X^{x_n}_t)_{t \in [0,T]})| \geq \varepsilon \big) = 0. \quad (65) $$

Combining (54) with, e.g., Hutzenthaler et al. [35, Proposition 4.5] therefore implies that for every $x \in [a,b]^d$, $(x_n)_{n \in \mathbb{N}} \subseteq [a,b]^d$ with $\limsup_{n \to \infty} \| x - x_n \|_{\mathbb{R}^d} = 0$ it holds that

$$ \limsup_{n \to \infty} \mathbb{E}\big[ |\Phi((X^x_t)_{t \in [0,T]}) - \Phi((X^{x_n}_t)_{t \in [0,T]})| \big] = 0. \quad (66) $$

This establishes item (iv). In the next step we observe that (49) and (50) ensure that for every $\varepsilon \in (0,\infty)$, $x \in [a,b]^d$ it holds that

$$ \limsup_{N \to \infty} \Big[ \mathbb{P}\Big( \sup_{t \in [0,T]} \| \Psi^N_{x,W}(t) - X^x_t \|_{\mathbb{R}^d} \geq \varepsilon \Big) + \mathbb{P}\Big( \sup_{t \in [0,T]} \| \Psi^N_{\xi,W}(t) - \mathbb{X}_t \|_{\mathbb{R}^d} \geq \varepsilon \Big) \Big] = 0. \quad (67) $$

The hypothesis that the function $\Phi$ is continuous and Lemma 2.5 therefore demonstrate that for every $\varepsilon \in (0,\infty)$, $x \in [a,b]^d$ it holds that

$$ \limsup_{N \to \infty} \Big[ \mathbb{P}\Big( |\Phi(\Psi^N_{x,W}) - \Phi((X^x_t)_{t \in [0,T]})| + |\Phi(\Psi^N_{\xi,W}) - \Phi((\mathbb{X}_t)_{t \in [0,T]})| \geq \varepsilon \Big) \Big] = 0. \quad (68) $$

Next observe that (44) assures that for every $N \in \mathbb{N}$, $x \in [a,b]^d$, $p \in (0,\infty)$ it holds that

$$ \begin{aligned} & \mathbb{E}\big[ |\Phi(\Psi^N_{x,W}) - \Phi((X^x_t)_{t \in [0,T]})|^p + |\Phi(\Psi^N_{\xi,W}) - \Phi((\mathbb{X}_t)_{t \in [0,T]})|^p \big] \\ & \leq 2^p \, \mathbb{E}\big[ |\Phi(\Psi^N_{x,W})|^p + |\Phi((X^x_t)_{t \in [0,T]})|^p \big] + 2^p \, \mathbb{E}\big[ |\Phi(\Psi^N_{\xi,W})|^p + |\Phi((\mathbb{X}_t)_{t \in [0,T]})|^p \big] \\ & \leq 2^p c^p \, \mathbb{E}\Big[ \big| 1 + \sup_{t \in [0,T]} \| \Psi^N_{x,W}(t) \|_{\mathbb{R}^d}^c \big|^p + \big| 1 + \sup_{t \in [0,T]} \| X^x_t \|_{\mathbb{R}^d}^c \big|^p \Big] \\ & \quad + 2^p c^p \, \mathbb{E}\Big[ \big| 1 + \sup_{t \in [0,T]} \| \Psi^N_{\xi,W}(t) \|_{\mathbb{R}^d}^c \big|^p + \big| 1 + \sup_{t \in [0,T]} \| \mathbb{X}_t \|_{\mathbb{R}^d}^c \big|^p \Big] \\ & \leq 4^p c^p \, \mathbb{E}\Big[ 2 + \sup_{t \in [0,T]} \| \Psi^N_{x,W}(t) \|_{\mathbb{R}^d}^{cp} + \sup_{t \in [0,T]} \| X^x_t \|_{\mathbb{R}^d}^{cp} \Big] + 4^p c^p \, \mathbb{E}\Big[ 2 + \sup_{t \in [0,T]} \| \Psi^N_{\xi,W}(t) \|_{\mathbb{R}^d}^{cp} + \sup_{t \in [0,T]} \| \mathbb{X}_t \|_{\mathbb{R}^d}^{cp} \Big]. \end{aligned} \quad (69) $$

Combining (48) and (53) hence shows that for every $p \in (0,\infty)$ it holds that

$$ \sup_{N \in \mathbb{N}} \sup_{x \in [a,b]^d} \Big( \mathbb{E}\big[ |\Phi(\Psi^N_{x,W}) - \Phi((X^x_t)_{t \in [0,T]})|^p + |\Phi(\Psi^N_{\xi,W}) - \Phi((\mathbb{X}_t)_{t \in [0,T]})|^p \big] \Big) < \infty. \quad (70) $$

This, (68), and, e.g., Hutzenthaler et al. [35, Proposition 4.5] imply that for every $x \in [a,b]^d$ it holds that

$$ \limsup_{N \to \infty} \Big( \mathbb{E}\big[ |\Phi(\Psi^N_{x,W}) - \Phi((X^x_t)_{t \in [0,T]})| \big] + \mathbb{E}\big[ |\Phi(\Psi^N_{\xi,W}) - \Phi((\mathbb{X}_t)_{t \in [0,T]})| \big] \Big) = 0. \quad (71) $$
Combining (70) and (71) with Lebesgue's dominated convergence theorem therefore demonstrates that

$$ \limsup_{N \to \infty} \bigg( \int_{[a,b]^d} \mathbb{E}\big[ |\Phi(\Psi^N_{x,W}) - \Phi((X^x_t)_{t \in [0,T]})| \big] \, dx \bigg) = 0. \quad (72) $$

In addition, observe that (69), (48), and (53) prove that for all $p \in (0,\infty)$ it holds that

$$ \sup_{N \in \mathbb{N}} \sup_{x \in [a,b]^d} \mathbb{E}\big[ |\Phi(\Psi^N_{x,W})|^p \big] < \infty. \quad (73) $$

Next observe that (71) and the fact that $\xi$ and $W$ are independent imply that

$$ \begin{aligned} \mathbb{E}\big[ \Phi((\mathbb{X}_t)_{t \in [0,T]}) \big] &= \lim_{N \to \infty} \mathbb{E}\big[ \Phi\big( \Psi^N_{\xi,W} \big) \big] = \lim_{N \to \infty} \int_\Omega \Phi\big( \Psi^N_{\xi(\omega),(W_t(\omega))_{t \in [0,T]}} \big) \, \mathbb{P}(d\omega) \\ &= \lim_{N \to \infty} \int_{[a,b]^d \times C([0,T],\mathbb{R}^m)} \Phi\big( \Psi^N_{x,w} \big) \, \big( (\xi, W)(\mathbb{P}) \big)(dx, dw) \\ &= \lim_{N \to \infty} \int_{[a,b]^d \times C([0,T],\mathbb{R}^m)} \Phi\big( \Psi^N_{x,w} \big) \, \big( (\xi(\mathbb{P})) \otimes (W(\mathbb{P})) \big)(dx, dw). \end{aligned} \quad (74) $$

Combining Fubini's theorem, (72), and (73) with Lebesgue's dominated convergence theorem therefore assures that

$$ \begin{aligned} \mathbb{E}\big[ \Phi((\mathbb{X}_t)_{t \in [0,T]}) \big] &= \lim_{N \to \infty} \int_{[a,b]^d} \bigg( \int_{C([0,T],\mathbb{R}^m)} \Phi\big( \Psi^N_{x,w} \big) \, (W(\mathbb{P}))(dw) \bigg) (\xi(\mathbb{P}))(dx) \\ &= \lim_{N \to \infty} \int_{[a,b]^d} \bigg( \int_\Omega \Phi\big( \Psi^N_{x,W} \big) \, d\mathbb{P} \bigg) (\xi(\mathbb{P}))(dx) = \lim_{N \to \infty} \int_{[a,b]^d} \mathbb{E}\big[ \Phi\big( \Psi^N_{x,W} \big) \big] \, (\xi(\mathbb{P}))(dx) \\ &= \int_{[a,b]^d} \lim_{N \to \infty} \mathbb{E}\big[ \Phi\big( \Psi^N_{x,W} \big) \big] \, (\xi(\mathbb{P}))(dx) = \int_{[a,b]^d} \mathbb{E}\big[ \Phi((X^x_t)_{t \in [0,T]}) \big] \, (\xi(\mathbb{P}))(dx) \\ &= \frac{1}{(b-a)^d} \int_{[a,b]^d} \mathbb{E}\big[ \Phi((X^x_t)_{t \in [0,T]}) \big] \, dx. \end{aligned} \quad (75) $$

This establishes item (v). The proof of Lemma 2.6 is thus completed.

Proposition 2.7. Let $d, m \in \mathbb{N}$, $T \in (0,\infty)$, $a \in \mathbb{R}$, $b \in (a,\infty)$, let $\mu \colon \mathbb{R}^d \to \mathbb{R}^d$ and $\sigma \colon \mathbb{R}^d \to \mathbb{R}^{d \times m}$ be globally Lipschitz continuous functions, let $\varphi \colon \mathbb{R}^d \to \mathbb{R}$ be a function, let $u = (u(t,x))_{(t,x) \in [0,T] \times \mathbb{R}^d} \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ be a function with at most polynomially
growing partial derivatives which satisfies for every $t \in [0,T]$, $x \in \mathbb{R}^d$ that $u(0,x) = \varphi(x)$ and

$$ \tfrac{\partial u}{\partial t}(t,x) = \tfrac{1}{2} \operatorname{Trace}_{\mathbb{R}^d}\!\big( \sigma(x) [\sigma(x)]^* (\operatorname{Hess}_x u)(t,x) \big) + \langle \mu(x), (\nabla_x u)(t,x) \rangle_{\mathbb{R}^d}, \quad (76) $$

let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space with a normal filtration $(\mathbb{F}_t)_{t \in [0,T]}$, let $W \colon [0,T] \times \Omega \to \mathbb{R}^m$ be a standard $(\mathbb{F}_t)_{t \in [0,T]}$-Brownian motion, let $\xi \colon \Omega \to [a,b]^d$ be a continuously uniformly distributed $\mathbb{F}_0/\mathcal{B}([a,b]^d)$-measurable random variable, and let $\mathbb{X} = (\mathbb{X}_t)_{t \in [0,T]} \colon [0,T] \times \Omega \to \mathbb{R}^d$ be an $(\mathbb{F}_t)_{t \in [0,T]}$-adapted stochastic process with continuous sample paths which satisfies that for every $t \in [0,T]$ it holds $\mathbb{P}$-a.s. that

$$ \mathbb{X}_t = \xi + \int_0^t \mu(\mathbb{X}_s) \, ds + \int_0^t \sigma(\mathbb{X}_s) \, dW_s. \quad (77) $$

Then

(i) it holds that the function $\varphi \colon \mathbb{R}^d \to \mathbb{R}$ is twice continuously differentiable with at most polynomially growing derivatives,

(ii) it holds that there exists a unique continuous function $U \colon [a,b]^d \to \mathbb{R}$ such that

$$ \mathbb{E}\big[ |\varphi(\mathbb{X}_T) - U(\xi)|^2 \big] = \inf_{v \in C([a,b]^d, \mathbb{R})} \mathbb{E}\big[ |\varphi(\mathbb{X}_T) - v(\xi)|^2 \big], \quad (78) $$

and

(iii) it holds for every $x \in [a,b]^d$ that $U(x) = u(T,x)$.

Proof of Proposition 2.7. Throughout this proof let $X^x = (X^x_t)_{t \in [0,T]} \colon [0,T] \times \Omega \to \mathbb{R}^d$, $x \in [a,b]^d$, be $(\mathbb{F}_t)_{t \in [0,T]}$-adapted stochastic processes with continuous sample paths

a) which satisfy that for every $t \in [0,T]$, $x \in [a,b]^d$ it holds $\mathbb{P}$-a.s. that

$$ X^x_t = x + \int_0^t \mu(X^x_s) \, ds + \int_0^t \sigma(X^x_s) \, dW_s \quad (79) $$

and

b) which satisfy that for every $\omega \in \Omega$ it holds that the function $[a,b]^d \ni x \mapsto X^x_T(\omega) \in \mathbb{R}^d$ is continuous

(cf., for example, Cox et al. [11, Theorem 3.5] and item (ii) in Lemma 2.6). Note that the assumption that $\forall \, x \in \mathbb{R}^d \colon u(0,x) = \varphi(x)$ and the assumption that $u \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ has at most polynomially growing partial derivatives establish item (i). Next note that item (i) and the fact that for every $p \in (0,\infty)$, $x \in [a,b]^d$ it holds that

$$ \sup_{t \in [0,T]} \mathbb{E}\big[ \| X^x_t \|_{\mathbb{R}^d}^p \big] < \infty \quad (80) $$
assure that for every $x \in [a,b]^d$ it holds that

$$ \mathbb{E}\big[ |\varphi(X^x_T)|^2 \big] < \infty. \quad (81) $$

Item (i), the assumption that for every $\omega \in \Omega$ it holds that the function $[a,b]^d \ni x \mapsto X^x_T(\omega) \in \mathbb{R}^d$ is continuous, and, e.g., Hutzenthaler et al. [35, Proposition 4.5] hence ensure that the function

$$ [a,b]^d \ni x \mapsto \mathbb{E}\big[ \varphi(X^x_T) \big] \in \mathbb{R} \quad (82) $$

is continuous. In the next step we combine the fact that for every $x \in [a,b]^d$ it holds that the function $\Omega \ni \omega \mapsto \varphi(X^x_T(\omega)) \in \mathbb{R}$ is $\mathcal{F}/\mathcal{B}(\mathbb{R})$-measurable, the fact that for every $\omega \in \Omega$ it holds that the function $[a,b]^d \ni x \mapsto \varphi(X^x_T(\omega)) \in \mathbb{R}$ is continuous, and Lemma 2.4 to obtain that the function

$$ [a,b]^d \times \Omega \ni (x,\omega) \mapsto \varphi(X^x_T(\omega)) \in \mathbb{R} \quad (83) $$

is $(\mathcal{B}([a,b]^d) \otimes \mathcal{F})/\mathcal{B}(\mathbb{R})$-measurable. Combining this, (81), (82), and Proposition 2.2 demonstrates

A) that there exists a unique continuous function $U \colon [a,b]^d \to \mathbb{R}$ which satisfies that

$$ \int_{[a,b]^d} \mathbb{E}\big[ |\varphi(X^x_T) - U(x)|^2 \big] \, dx = \inf_{v \in C([a,b]^d, \mathbb{R})} \bigg( \int_{[a,b]^d} \mathbb{E}\big[ |\varphi(X^x_T) - v(x)|^2 \big] \, dx \bigg) \quad (84) $$

and

B) that it holds for every $x \in [a,b]^d$ that

$$ U(x) = \mathbb{E}\big[ \varphi(X^x_T) \big]. \quad (85) $$

Next note that for every continuous function $V \colon [a,b]^d \to \mathbb{R}$ it holds that

$$ \sup_{x \in [a,b]^d} |V(x)| < \infty. \quad (86) $$

Item (i) hence implies that for every continuous function $V \colon [a,b]^d \to \mathbb{R}$ it holds that

$$ C([0,T], \mathbb{R}^d) \ni (z_t)_{t \in [0,T]} \mapsto |\varphi(z_T) - V(z_0)|^2 \in \mathbb{R} \quad (87) $$

is an at most polynomially growing continuous function. Combining Lemma 2.6 (with $\Phi = (C([0,T], \mathbb{R}^d) \ni (z_t)_{t \in [0,T]} \mapsto |\varphi(z_T) - V(z_0)|^2 \in \mathbb{R})$ for $V \in C([a,b]^d, \mathbb{R})$ in the notation of Lemma 2.6), (77), item (i), and (79) hence ensures that for every continuous function $V \colon [a,b]^d \to \mathbb{R}$ it holds that

$$ \mathbb{E}\big[ |\varphi(\mathbb{X}_T) - V(\xi)|^2 \big] = \frac{1}{(b-a)^d} \int_{[a,b]^d} \mathbb{E}\big[ |\varphi(X^x_T) - V(x)|^2 \big] \, dx. \quad (88) $$
Hence, we obtain that for every continuous function $V\colon [a,b]^d\to\mathbb{R}$ with $\mathbb{E}[|\varphi(X_T) - V(\xi)|^2] = \inf_{v\in C([a,b]^d,\mathbb{R})} \mathbb{E}[|\varphi(X_T) - v(\xi)|^2]$ it holds that
\[
\begin{aligned}
\int_{[a,b]^d} \mathbb{E}\big[|\varphi(X^x_T) - V(x)|^2\big]\,dx
&= (b-a)^d \left[ \frac{1}{(b-a)^d} \int_{[a,b]^d} \mathbb{E}\big[|\varphi(X^x_T) - V(x)|^2\big]\,dx \right] \\
&= (b-a)^d\, \Big( \mathbb{E}\big[|\varphi(X_T) - V(\xi)|^2\big] \Big)
= (b-a)^d \left( \inf_{v\in C([a,b]^d,\mathbb{R})} \mathbb{E}\big[|\varphi(X_T) - v(\xi)|^2\big] \right) \\
&= (b-a)^d \left( \inf_{v\in C([a,b]^d,\mathbb{R})} \left[ \frac{1}{(b-a)^d} \int_{[a,b]^d} \mathbb{E}\big[|\varphi(X^x_T) - v(x)|^2\big]\,dx \right] \right) \\
&= \inf_{v\in C([a,b]^d,\mathbb{R})} \left[ \int_{[a,b]^d} \mathbb{E}\big[|\varphi(X^x_T) - v(x)|^2\big]\,dx \right].
\end{aligned} \tag{89}
\]
Combining this with (84) proves that for every continuous function $V\colon [a,b]^d\to\mathbb{R}$ with $\mathbb{E}[|\varphi(X_T) - V(\xi)|^2] = \inf_{v\in C([a,b]^d,\mathbb{R})} \mathbb{E}[|\varphi(X_T) - v(\xi)|^2]$ it holds that
\[
U = V. \tag{90}
\]
Next observe that (88) and (84) demonstrate that
\[
\begin{aligned}
\mathbb{E}\big[|\varphi(X_T) - U(\xi)|^2\big]
&= \frac{1}{(b-a)^d} \int_{[a,b]^d} \mathbb{E}\big[|\varphi(X^x_T) - U(x)|^2\big]\,dx \\
&= \inf_{v\in C([a,b]^d,\mathbb{R})} \left[ \frac{1}{(b-a)^d} \int_{[a,b]^d} \mathbb{E}\big[|\varphi(X^x_T) - v(x)|^2\big]\,dx \right] \\
&= \inf_{v\in C([a,b]^d,\mathbb{R})} \mathbb{E}\big[|\varphi(X_T) - v(\xi)|^2\big].
\end{aligned} \tag{91}
\]
Combining this with (90) proves item (ii). Next note that (76), (85), and the Feynman-Kac formula (cf., for example, Hairer et al. [22, Corollary 4.17]) imply that for every $x\in[a,b]^d$ it holds that $U(x) = \mathbb{E}[\varphi(X^x_T)] = u(T,x)$. This establishes item (iii). The proof of Proposition 2.7 is thus completed.

In the next step we use Proposition 2.7 to obtain a minimization problem which is uniquely solved by the function $[a,b]^d \ni x\mapsto u(T,x)\in\mathbb{R}$. More specifically, let $\xi\colon \Omega\to[a,b]^d$ be a continuously uniformly distributed $\mathcal{F}_0/\mathcal{B}([a,b]^d)$-measurable random variable, and let $X\colon [0,T]\times\Omega\to\mathbb{R}^d$ be an $(\mathbb{F}_t)_{t\in[0,T]}$-adapted stochastic process with continuous sample paths which satisfies that for every $t\in[0,T]$ it holds $\mathbb{P}$-a.s. that
\[
X_t = \xi + \int_0^t \mu(X_s)\,ds + \int_0^t \sigma(X_s)\,dW_s. \tag{92}
\]
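The structural fact behind this construction — the $L^2$-minimizer over all functions of $\xi$ is the conditional expectation $x\mapsto \mathbb{E}[\varphi(X_T)\,|\,\xi = x]$ — is easy to illustrate numerically. The following toy check (an illustrative discrete $\xi$, not from the paper) verifies that perturbing the conditional mean can only increase the squared loss:

```python
import numpy as np

# Toy check (not from the paper): the L^2-minimizing function of xi is
# the conditional expectation x -> E[Y | xi = x].
rng = np.random.default_rng(0)
xi = rng.integers(0, 3, size=200_000)                   # xi uniform on {0, 1, 2}
y = xi.astype(float) ** 2 + rng.normal(size=xi.shape)   # Y = xi^2 + noise

# empirical conditional expectation E[Y | xi = x] for each x in {0, 1, 2}
cond_mean = np.array([y[xi == x].mean() for x in range(3)])

def loss(v):
    """Empirical analogue of E|Y - v(xi)|^2 for a function v given by its values on {0,1,2}."""
    return np.mean((y - v[xi]) ** 2)

# the per-group mean minimizes the per-group sum of squares, so any perturbation loses
base = loss(cond_mean)
for _ in range(20):
    assert base <= loss(cond_mean + rng.normal(scale=0.1, size=3))
```

Here the empirical conditional means are close to $(0, 1, 4)$, the values of $x^2$ on the support of $\xi$.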
Proposition 2.7 then guarantees that the function $[a,b]^d\ni x\mapsto u(T,x)\in\mathbb{R}$ is the unique global minimizer of the function
\[
C([a,b]^d,\mathbb{R}) \ni v \mapsto \mathbb{E}\big[|\varphi(X_T) - v(\xi)|^2\big] \in \mathbb{R}. \tag{93}
\]
In the following two subsections we derive an approximated minimization problem by discretizing the stochastic process $X\colon [0,T]\times\Omega\to\mathbb{R}^d$ (see Subsection 2.4 below) and by employing a deep neural network approximation for the function $\mathbb{R}^d\ni x\mapsto u(T,x)\in\mathbb{R}$ (see Subsection 2.5 below).

2.4 Discretization of the stochastic differential equation

In this subsection we use the Euler-Maruyama scheme (cf., for example, Kloeden & Platen [41] and Maruyama [44]) to temporally discretize the solution process $X$ of the SDE (92). More specifically, let $N\in\mathbb{N}$, let $t_0, t_1, \ldots, t_N \in [0,\infty)$ be real numbers which satisfy that
\[
0 = t_0 < t_1 < \ldots < t_N = T. \tag{94}
\]
Note that (92) implies that for every $n\in\{0,1,\ldots,N-1\}$ it holds $\mathbb{P}$-a.s. that
\[
X_{t_{n+1}} = X_{t_n} + \int_{t_n}^{t_{n+1}} \mu(X_s)\,ds + \int_{t_n}^{t_{n+1}} \sigma(X_s)\,dW_s. \tag{95}
\]
This suggests that for sufficiently small mesh size $\sup_{n\in\{0,1,\ldots,N-1\}} (t_{n+1}-t_n)$ it holds that
\[
X_{t_{n+1}} \approx X_{t_n} + \mu(X_{t_n})\,(t_{n+1}-t_n) + \sigma(X_{t_n})\,\big(W_{t_{n+1}} - W_{t_n}\big). \tag{96}
\]
Let $\mathcal{X}\colon \{0,1,\ldots,N\}\times\Omega\to\mathbb{R}^d$ be the stochastic process which satisfies for every $n\in\{0,1,\ldots,N-1\}$ that $\mathcal{X}_0 = \xi$ and
\[
\mathcal{X}_{n+1} = \mathcal{X}_n + \mu(\mathcal{X}_n)\,(t_{n+1}-t_n) + \sigma(\mathcal{X}_n)\,\big(W_{t_{n+1}} - W_{t_n}\big). \tag{97}
\]
Observe that (96) and (97) suggest, in turn, that for every $n\in\{0,1,2,\ldots,N\}$ it holds that
\[
\mathcal{X}_n \approx X_{t_n} \tag{98}
\]
(cf., for example, Theorem 2.8 below for a strong convergence result for the Euler-Maruyama scheme).

Theorem 2.8 (Strong convergence rate for the Euler-Maruyama scheme). Let $T\in(0,\infty)$, $d\in\mathbb{N}$, $p\in[2,\infty)$, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space with a normal filtration $(\mathbb{F}_t)_{t\in[0,T]}$, let $W\colon[0,T]\times\Omega\to\mathbb{R}^d$ be a standard $(\mathbb{F}_t)_{t\in[0,T]}$-Brownian motion, let $\xi\colon\Omega\to\mathbb{R}^d$ be a random variable which satisfies that $\mathbb{E}[\|\xi\|^p_{\mathbb{R}^d}] < \infty$, let $\mu\colon\mathbb{R}^d\to\mathbb{R}^d$ and $\sigma\colon\mathbb{R}^d\to\mathbb{R}^{d\times d}$ be Lipschitz
continuous functions, let $X\colon [0,T]\times\Omega\to\mathbb{R}^d$ be an $(\mathbb{F}_t)_{t\in[0,T]}$-adapted stochastic process with continuous sample paths which satisfies that for every $t\in[0,T]$ it holds $\mathbb{P}$-a.s. that
\[
X_t = \xi + \int_0^t \mu(X_s)\,ds + \int_0^t \sigma(X_s)\,dW_s, \tag{99}
\]
for every $N\in\mathbb{N}$ let $t^N_0, t^N_1, \ldots, t^N_N \in [0,T]$ be real numbers which satisfy that
\[
0 = t^N_0 < t^N_1 < \ldots < t^N_N = T, \tag{100}
\]
and for every $N\in\mathbb{N}$ let $\mathcal{X}^N\colon \{0,1,\ldots,N\}\times\Omega\to\mathbb{R}^d$ be the stochastic process which satisfies for every $n\in\{0,1,\ldots,N-1\}$ that $\mathcal{X}^N_0 = \xi$ and
\[
\mathcal{X}^N_{n+1} = \mathcal{X}^N_n + \mu(\mathcal{X}^N_n)\,(t^N_{n+1}-t^N_n) + \sigma(\mathcal{X}^N_n)\,\big(W_{t^N_{n+1}} - W_{t^N_n}\big). \tag{101}
\]
Then there exists a real number $C\in(0,\infty)$ such that for every $N\in\mathbb{N}$ it holds that
\[
\sup_{n\in\{0,1,\ldots,N\}} \Big( \mathbb{E}\big[\|X_{t^N_n} - \mathcal{X}^N_n\|^p_{\mathbb{R}^d}\big] \Big)^{1/p} \le C \Big[ \max_{n\in\{0,1,\ldots,N-1\}} |t^N_{n+1} - t^N_n| \Big]^{1/2}. \tag{102}
\]
The proof of Theorem 2.8 is well-known in the literature (cf., for instance, Kloeden & Platen [41], Milstein [46], Hofmann, Müller-Gronbach, & Ritter [30], Müller-Gronbach & Ritter [49], and the references mentioned therein).

2.5 Deep artificial neural network approximations

In this subsection we employ suitable approximations for the solution $\mathbb{R}^d\ni x\mapsto u(T,x)\in\mathbb{R}$ of the PDE (1) at time $T$. More specifically, let $\nu\in\mathbb{N}$ and let $U = (U(\theta,x))_{(\theta,x)\in\mathbb{R}^\nu\times\mathbb{R}^d}\colon \mathbb{R}^\nu\times\mathbb{R}^d\to\mathbb{R}$ be a continuous function. For every suitable $\theta\in\mathbb{R}^\nu$ and every $x\in[a,b]^d$ we think of $U(\theta,x)\in\mathbb{R}$ as an appropriate approximation
\[
U(\theta,x) \approx u(T,x) \tag{103}
\]
of $u(T,x)$. We suggest to choose the function $U\colon\mathbb{R}^\nu\times\mathbb{R}^d\to\mathbb{R}$ as a deep neural network (cf., for example, Bishop [6]). For instance, let $\mathbf{L}_d\colon\mathbb{R}^d\to\mathbb{R}^d$ be the function which satisfies for every $x = (x_1,x_2,\ldots,x_d)\in\mathbb{R}^d$ that
\[
\mathbf{L}_d(x) = \left( \frac{\exp(x_1)}{\exp(x_1)+1}, \frac{\exp(x_2)}{\exp(x_2)+1}, \ldots, \frac{\exp(x_d)}{\exp(x_d)+1} \right) \tag{104}
\]
(multidimensional version of the standard logistic function), for every $k,l\in\mathbb{N}$, $v\in\mathbb{N}_0 = \{0\}\cup\mathbb{N}$, $\theta = (\theta_1,\ldots,\theta_\nu)\in\mathbb{R}^\nu$ with $v + l(k+1) \le \nu$ let $A^{\theta,v}_{k,l}\colon\mathbb{R}^k\to\mathbb{R}^l$ be the function
which satisfies for every $x = (x_1,\ldots,x_k)\in\mathbb{R}^k$ that
\[
A^{\theta,v}_{k,l}(x) =
\begin{pmatrix}
\theta_{v+1} & \theta_{v+2} & \cdots & \theta_{v+k} \\
\theta_{v+k+1} & \theta_{v+k+2} & \cdots & \theta_{v+2k} \\
\theta_{v+2k+1} & \theta_{v+2k+2} & \cdots & \theta_{v+3k} \\
\vdots & \vdots & \ddots & \vdots \\
\theta_{v+(l-1)k+1} & \theta_{v+(l-1)k+2} & \cdots & \theta_{v+lk}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_k \end{pmatrix}
+
\begin{pmatrix} \theta_{v+kl+1} \\ \theta_{v+kl+2} \\ \theta_{v+kl+3} \\ \vdots \\ \theta_{v+kl+l} \end{pmatrix}, \tag{105}
\]
let $s\in\{3,4,5,6,\ldots\}$, assume that $(s-1)d(d+1) + d + 1 \le \nu$, and let $U\colon\mathbb{R}^\nu\times\mathbb{R}^d\to\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^\nu$, $x\in\mathbb{R}^d$ that
\[
U(\theta,x) = \big( A^{\theta,(s-1)d(d+1)}_{d,1} \circ \mathbf{L}_d \circ A^{\theta,(s-2)d(d+1)}_{d,d} \circ \cdots \circ \mathbf{L}_d \circ A^{\theta,d(d+1)}_{d,d} \circ \mathbf{L}_d \circ A^{\theta,0}_{d,d} \big)(x). \tag{106}
\]
The function $U\colon\mathbb{R}^\nu\times\mathbb{R}^d\to\mathbb{R}$ in (106) describes an artificial neural network with $s+1$ layers (1 input layer with $d$ neurons, $s-1$ hidden layers with $d$ neurons each, and 1 output layer with 1 neuron) and standard logistic functions as activation functions (cf., for instance, Bishop [6]).

2.6 Stochastic gradient descent-type minimization

As described in Subsection 2.5, for every suitable $\theta\in\mathbb{R}^\nu$ and every $x\in[a,b]^d$ we think of $U(\theta,x)\in\mathbb{R}$ as an appropriate approximation of $u(T,x)\in\mathbb{R}$. In this subsection we intend to find a suitable $\theta\in\mathbb{R}^\nu$ as an approximate minimizer of the function
\[
\mathbb{R}^\nu \ni \theta \mapsto \mathbb{E}\big[|\varphi(\mathcal{X}_N) - U(\theta,\xi)|^2\big] \in \mathbb{R}. \tag{107}
\]
To be more specific, we intend to find an approximate minimizer of the function in (107) through a stochastic gradient descent-type minimization algorithm (cf., for instance, Ruder [53, Section 4], Jentzen et al. [37], and the references mentioned therein). For this we approximate the derivative of the function in (107) by means of the Monte Carlo method.
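For concreteness, the parametrization in (104)–(106) — a flat parameter vector $\theta$ sliced into row-major weight matrices and bias vectors — can be sketched as a plain NumPy forward pass (an illustrative sketch, not the paper's TensorFlow implementation):

```python
import numpy as np

def logistic(x):
    # multidimensional standard logistic function L_d, cf. (104)
    return 1.0 / (1.0 + np.exp(-x))

def affine(theta, v, k, l, x):
    """A^{theta,v}_{k,l}(x): k*l weights read row-major from theta[v:], then l biases, cf. (105)."""
    w = theta[v:v + k * l].reshape(l, k)
    b = theta[v + k * l:v + k * l + l]
    return w @ x + b

def network(theta, x, d, s):
    """U(theta, x) in (106): s-1 hidden layers of width d with logistic activation, scalar output."""
    z = x
    for i in range(s - 1):                                   # hidden layers at offsets 0, d(d+1), ...
        z = logistic(affine(theta, i * d * (d + 1), d, d, z))
    return affine(theta, (s - 1) * d * (d + 1), d, 1, z)[0]  # linear output layer A_{d,1}

# usage: with all parameters zero, every hidden activation is logistic(0) = 0.5
# and the output is the final (zero) bias
d, s = 3, 4
nu = (s - 1) * d * (d + 1) + d + 1
out_zero = network(np.zeros(nu), np.ones(d), d, s)
```

Note that the offsets $0, d(d+1), \ldots, (s-1)d(d+1)$ match the superscripts in (106), and the total parameter count used is exactly $(s-1)d(d+1)+d+1$, the lower bound imposed on $\nu$.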
More precisely, let $\xi^{(m)}\colon\Omega\to[a,b]^d$, $m\in\mathbb{N}_0$, be independent continuously uniformly distributed $\mathcal{F}_0/\mathcal{B}([a,b]^d)$-measurable random variables, let $W^{(m)}\colon[0,T]\times\Omega\to\mathbb{R}^d$, $m\in\mathbb{N}_0$, be independent standard $(\mathbb{F}_t)_{t\in[0,T]}$-Brownian motions, for every $m\in\mathbb{N}_0$ let $\mathcal{X}^{(m)} = (\mathcal{X}^{(m)}_n)_{n\in\{0,1,\ldots,N\}}\colon\{0,1,\ldots,N\}\times\Omega\to\mathbb{R}^d$ be the stochastic process which satisfies for every $n\in\{0,1,\ldots,N-1\}$ that $\mathcal{X}^{(m)}_0 = \xi^{(m)}$ and
\[
\mathcal{X}^{(m)}_{n+1} = \mathcal{X}^{(m)}_n + \mu(\mathcal{X}^{(m)}_n)\,(t_{n+1}-t_n) + \sigma(\mathcal{X}^{(m)}_n)\,\big(W^{(m)}_{t_{n+1}} - W^{(m)}_{t_n}\big), \tag{108}
\]
let $\gamma\in(0,\infty)$, and let $\Theta\colon\mathbb{N}_0\times\Omega\to\mathbb{R}^\nu$ be a stochastic process which satisfies for every $m\in\mathbb{N}_0$ that
\[
\Theta_{m+1} = \Theta_m - 2\gamma\,\big(U(\Theta_m,\xi^{(m)}) - \varphi(\mathcal{X}^{(m)}_N)\big)\,(\nabla_\theta U)(\Theta_m,\xi^{(m)}). \tag{109}
\]
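Putting (108) and (109) together, one iteration consists of simulating a fresh Euler-Maruyama path started from a fresh uniform sample and taking one gradient step. The following sketch uses illustrative coefficient functions and a deliberately simplified model $U(\theta,x) = \theta_1 + \langle(\theta_2,\ldots,\theta_{d+1}),x\rangle$ that is linear in $\theta$, so that $\nabla_\theta U$ is explicit (the actual method uses the deep network of Subsection 2.5 together with backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, T, gamma, M = 2, 5, 1.0, 0.01, 30_000
dt = T / N
mu = lambda x: np.zeros(d)                # illustrative drift
sigma_coef = lambda x: 0.05 * np.eye(d)   # illustrative diffusion coefficient
phi = lambda x: x[0] + x[1]               # illustrative terminal condition

def U(theta, x):                          # simplified stand-in model, linear in theta
    return theta[0] + theta[1:] @ x

def grad_U(theta, x):                     # explicit gradient with respect to theta
    return np.concatenate(([1.0], x))

theta = np.zeros(d + 1)
for m in range(M):
    xi = rng.uniform(0.0, 1.0, size=d)            # xi^(m), uniform on [a,b]^d = [0,1]^d
    x = xi.copy()
    for n in range(N):                            # Euler-Maruyama recursion (108)
        dw = rng.normal(0.0, np.sqrt(dt), size=d)
        x = x + mu(x) * dt + sigma_coef(x) @ dw
    # stochastic gradient descent step (109)
    theta = theta - 2.0 * gamma * (U(theta, xi) - phi(x)) * grad_U(theta, xi)

# with zero drift, u(T, x) = E[phi(X_T^x)] = x_1 + x_2, so the fitted U should match it
test_pts = rng.uniform(0.0, 1.0, size=(100, d))
errs = [U(theta, x) - phi(x) for x in test_pts]
rmse = float(np.sqrt(np.mean(np.square(errs))))
```

For this toy problem the learned function comes close to $u(T,x) = x_1 + x_2$; the point of the sketch is only the shape of the loop, not the model class.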
Under appropriate hypotheses we think for every sufficiently large $m\in\mathbb{N}$ of the random variable $\Theta_m\colon\Omega\to\mathbb{R}^\nu$ as a suitable approximation of a local minimum point of the function (107) and we think for every sufficiently large $m\in\mathbb{N}$ of the random function $[a,b]^d\ni x\mapsto U(\Theta_m,x)\in\mathbb{R}$ as a suitable approximation of the function $[a,b]^d\ni x\mapsto u(T,x)\in\mathbb{R}$.

2.7 Description of the algorithm in a special case

In this subsection we give a description of the proposed approximation method in a special case, that is, we describe the proposed approximation method in the specific case where a particular neural network approximation is chosen and where the plain-vanilla stochastic gradient descent method with a constant learning rate is the employed stochastic minimization algorithm (cf. (109) above). For a more general description of the proposed approximation method we refer the reader to Subsection 2.8 below.

Framework 2.9. Let $T,\gamma\in(0,\infty)$, $a\in\mathbb{R}$, $b\in(a,\infty)$, $d,N\in\mathbb{N}$, $s\in\{3,4,5,\ldots\}$, let $\nu = sd(d+1)$, let $t_0,t_1,\ldots,t_N\in[0,T]$ be real numbers with
\[
0 = t_0 < t_1 < \ldots < t_N = T, \tag{110}
\]
let $\mu\colon\mathbb{R}^d\to\mathbb{R}^d$, $\sigma\colon\mathbb{R}^d\to\mathbb{R}^{d\times d}$, and $\varphi\colon\mathbb{R}^d\to\mathbb{R}$ be continuous functions, let $(\Omega,\mathcal{F},\mathbb{P},(\mathbb{F}_t)_{t\in[0,T]})$ be a filtered probability space, let $\xi^{(m)}\colon\Omega\to[a,b]^d$, $m\in\mathbb{N}_0$, be independent continuously uniformly distributed $\mathcal{F}_0/\mathcal{B}([a,b]^d)$-measurable random variables, let $W^{(m)}\colon[0,T]\times\Omega\to\mathbb{R}^d$, $m\in\mathbb{N}_0$, be i.i.d. standard $(\mathbb{F}_t)_{t\in[0,T]}$-Brownian motions, for every $m\in\mathbb{N}_0$ let $\mathcal{X}^{(m)}\colon\{0,1,\ldots,N\}\times\Omega\to\mathbb{R}^d$ be the stochastic process which satisfies for every $n\in\{0,1,\ldots,N-1\}$ that $\mathcal{X}^{(m)}_0 = \xi^{(m)}$ and
\[
\mathcal{X}^{(m)}_{n+1} = \mathcal{X}^{(m)}_n + \mu(\mathcal{X}^{(m)}_n)\,(t_{n+1}-t_n) + \sigma(\mathcal{X}^{(m)}_n)\,\big(W^{(m)}_{t_{n+1}} - W^{(m)}_{t_n}\big), \tag{111}
\]
let $\mathbf{L}_d\colon\mathbb{R}^d\to\mathbb{R}^d$ be the function which satisfies for every $x = (x_1,x_2,\ldots,x_d)\in\mathbb{R}^d$ that
\[
\mathbf{L}_d(x) = \left( \frac{\exp(x_1)}{\exp(x_1)+1}, \frac{\exp(x_2)}{\exp(x_2)+1}, \ldots, \frac{\exp(x_d)}{\exp(x_d)+1} \right), \tag{112}
\]
for every $k,l\in\mathbb{N}$, $v\in\mathbb{N}_0 = \{0\}\cup\mathbb{N}$, $\theta = (\theta_1,\ldots,\theta_\nu)\in\mathbb{R}^\nu$ with $v + l(k+1)\le\nu$ let $A^{\theta,v}_{k,l}\colon\mathbb{R}^k\to\mathbb{R}^l$ be the function which satisfies for every $x = (x_1,\ldots,x_k)\in\mathbb{R}^k$ that
\[
A^{\theta,v}_{k,l}(x) =
\begin{pmatrix}
\theta_{v+1} & \theta_{v+2} & \cdots & \theta_{v+k} \\
\theta_{v+k+1} & \theta_{v+k+2} & \cdots & \theta_{v+2k} \\
\theta_{v+2k+1} & \theta_{v+2k+2} & \cdots & \theta_{v+3k} \\
\vdots & \vdots & \ddots & \vdots \\
\theta_{v+(l-1)k+1} & \theta_{v+(l-1)k+2} & \cdots & \theta_{v+lk}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_k \end{pmatrix}
+
\begin{pmatrix} \theta_{v+kl+1} \\ \theta_{v+kl+2} \\ \theta_{v+kl+3} \\ \vdots \\ \theta_{v+kl+l} \end{pmatrix}, \tag{113}
\]
let $U\colon\mathbb{R}^\nu\times\mathbb{R}^d\to\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^\nu$, $x\in\mathbb{R}^d$ that
\[
U(\theta,x) = \big( A^{\theta,(s-1)d(d+1)}_{d,1} \circ \mathbf{L}_d \circ A^{\theta,(s-2)d(d+1)}_{d,d} \circ \cdots \circ \mathbf{L}_d \circ A^{\theta,d(d+1)}_{d,d} \circ \mathbf{L}_d \circ A^{\theta,0}_{d,d} \big)(x), \tag{114}
\]
and let $\Theta\colon\mathbb{N}_0\times\Omega\to\mathbb{R}^\nu$ be a stochastic process which satisfies for every $m\in\mathbb{N}_0$ that
\[
\Theta_{m+1} = \Theta_m - 2\gamma\,\big(U(\Theta_m,\xi^{(m)}) - \varphi(\mathcal{X}^{(m)}_N)\big)\,(\nabla_\theta U)(\Theta_m,\xi^{(m)}). \tag{115}
\]
Under appropriate hypotheses we think for every sufficiently large $m\in\mathbb{N}$ and every $x\in[a,b]^d$ of the random variable $U(\Theta_m,x)\colon\Omega\to\mathbb{R}$ in Framework 2.9 as a suitable approximation $U(\Theta_m,x)\approx u(T,x)$ of $u(T,x)\in\mathbb{R}$ where $u = (u(t,x))_{(t,x)\in[0,T]\times\mathbb{R}^d}\in C^{1,2}([0,T]\times\mathbb{R}^d,\mathbb{R})$ is a function with at most polynomially growing partial derivatives which satisfies for every $t\in[0,T]$, $x\in\mathbb{R}^d$ that $u(0,x) = \varphi(x)$ and
\[
\tfrac{\partial u}{\partial t}(t,x) = \tfrac{1}{2}\,\mathrm{Trace}_{\mathbb{R}^d}\!\big(\sigma(x)[\sigma(x)]^*(\mathrm{Hess}_x u)(t,x)\big) + \big\langle\mu(x),(\nabla_x u)(t,x)\big\rangle_{\mathbb{R}^d} \tag{116}
\]
(cf. (1) above).

2.8 Description of the algorithm in the general case

In this subsection we provide a general framework which covers the approximation method derived in the subsections above and which allows, in addition, to incorporate other minimization algorithms (cf., for example, Kingma & Ba [38], Ruder [53], E et al. [14], Han et al. [24], and Beck et al. [3]) than just the plain vanilla stochastic gradient descent method. The proposed approximation algorithm is an extension of the approximation algorithm in E et al. [14], Han et al. [24], and Beck et al. [3] in the special case of linear Kolmogorov partial differential equations.

Framework 2.10. Let $T\in(0,\infty)$, $N,d,\varrho,\nu,\varsigma\in\mathbb{N}$, let $H\colon[0,T]^2\times\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}^d$ and $\varphi\colon\mathbb{R}^d\to\mathbb{R}$ be functions, let $(\Omega,\mathcal{F},\mathbb{P},(\mathbb{F}_t)_{t\in[0,T]})$ be a filtered probability space, let $W^{m,j}\colon[0,T]\times\Omega\to\mathbb{R}^d$, $m\in\mathbb{N}_0$, $j\in\mathbb{N}$, be independent standard $(\mathbb{F}_t)_{t\in[0,T]}$-Brownian motions on $(\Omega,\mathcal{F},\mathbb{P})$, let $\xi^{m,j}\colon\Omega\to\mathbb{R}^d$, $m\in\mathbb{N}_0$, $j\in\mathbb{N}$, be i.i.d. $\mathcal{F}_0/\mathcal{B}(\mathbb{R}^d)$-measurable random variables, let $t_0,t_1,\ldots,t_N\in[0,T]$ be real numbers with $0 = t_0 < t_1 < \ldots < t_N = T$, for every $\theta\in\mathbb{R}^\nu$, $j\in\mathbb{N}$, $s\in\mathbb{R}^\varsigma$ let $U^{\theta,j,s}\colon\mathbb{R}^d\to\mathbb{R}$ be a function, for every $m\in\mathbb{N}_0$, $j\in\mathbb{N}$ let $\mathcal{X}^{m,j} = (\mathcal{X}^{m,j}_n)_{n\in\{0,1,\ldots,N\}}\colon\{0,1,\ldots,N\}\times\Omega\to\mathbb{R}^d$ be a stochastic process which satisfies for every $n\in\{0,1,\ldots,N-1\}$ that $\mathcal{X}^{m,j}_0 = \xi^{m,j}$ and
\[
\mathcal{X}^{m,j}_{n+1} = H\big(t_n, t_{n+1}, \mathcal{X}^{m,j}_n, W^{m,j}_{t_{n+1}} - W^{m,j}_{t_n}\big), \tag{117}
\]
let $(J_m)_{m\in\mathbb{N}_0}\subseteq\mathbb{N}$ be a sequence, for every $m\in\mathbb{N}_0$, $s\in\mathbb{R}^\varsigma$ let $\phi^{m,s}\colon\mathbb{R}^\nu\times\Omega\to\mathbb{R}$ be the function which satisfies for every $(\theta,\omega)\in\mathbb{R}^\nu\times\Omega$ that
\[
\phi^{m,s}(\theta,\omega) = \frac{1}{J_m} \sum_{j=1}^{J_m} \big| U^{\theta,j,s}\big(\xi^{m,j}(\omega)\big) - \varphi\big(\mathcal{X}^{m,j}_N(\omega)\big) \big|^2, \tag{118}
\]
26 for every m N, s R ς let Ψ m,s : R ν Ω R ν be a function which satisfies for every ω Ω, θ {η R ν : φ m,s (, ω): R ν R is differentiable at η} that Ψ m,s (θ, ω) = ( θ φ m,s )(θ, ω), (119) let S : R ς R ν (R d ) N R ς be a function, for every m N let Φ m : R ϱ R ν and Ψ m : R ϱ R ν R ϱ be functions, let Θ: N Ω R ν, S: N Ω R ς, and Ξ: N Ω R ϱ be stochastic processes which satisfy for every m N that S m+1 = S ( S m, Θ m, (X m,i N ) i N), Ξm+1 = Ψ m (Ξ m, Ψ m,s m+1 (Θ m )), (12) and Θ m+1 = Θ m Φ m (Ξ m+1 ). (121) Under appropriate hypotheses we think for every sufficiently large m N and every x [a, b] d of the random variable U Θm,1,Sm (x): Ω R in Framework 2.1 as a suitable approximation U Θm,1,Sm (x) u(t, x) of u(t, x) R where u = u(t, x) (t,x) [,T ] R d C 1,2 ([, T ] R d, R) is a function with at most polynomially growing partial derivatives which satisfies for every t [, T ], x R d that u(, x) = ϕ(x) and u (t, x) = 1 Trace ( t 2 R σ(x)[σ(x)] (Hess d x u)(t, x) ) + µ(x), ( x u)(t, x) R d, (122) where µ: R d R d and σ : R d R d d are sufficiently regular functions (cf. (1) above). 3 Examples In this section we test the proposed approximation algorithm (see Section 2 above) in the case of several examples of SDEs and Kolmogorov PDEs, respectively. In particular, in this section we apply the proposed approximation algorithm to the heat equation (cf. Subsection 3.2 below), to independent geometric Brownian motions (cf. Subsection 3.3 below), to the Black-Scholes model (cf. Subsection 3.4 below), to stochastic Lorenz equations (cf. Subsection 3.5 below), and to the Heston model (cf. Subsection 3.6 below). In the case of each of the examples below we employ the general approximation algorithm in Framework 2.1 above in conjunction with the Adam optimizer (cf. Kingma & Ba [38]) with mini-batches of size 8192 in each iteration step (see Subsection 3.1 below for a precise description). 
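In implementation terms, Adam is one particular choice of the abstract maps $\Psi_m$ (moment updates) and $\Phi_m$ (bias-corrected step) in (120)–(121). A minimal NumPy sketch of one Adam step, exercised on a hypothetical toy objective with an illustrative piecewise-constant learning-rate schedule (the numbers below are not the paper's):

```python
import numpy as np

def adam_step(theta, grad, x, y, m, gamma, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (cf. Kingma & Ba); (x, y) play the role of the moment state Xi_m."""
    x = beta1 * x + (1 - beta1) * grad          # first-moment estimate
    y = beta2 * y + (1 - beta2) * grad ** 2     # second-moment estimate
    x_hat = x / (1 - beta1 ** m)                # bias corrections (m counts from 1)
    y_hat = y / (1 - beta2 ** m)
    return theta - gamma * x_hat / (np.sqrt(y_hat) + eps), x, y

# illustrative usage on the toy objective f(theta) = |theta - 3|^2, with the
# learning rate dropped once after 1000 steps (hypothetical schedule)
theta, x, y = 0.0, 0.0, 0.0
for m in range(1, 3001):
    gamma = 1e-2 if m <= 1000 else 1e-3
    grad = 2.0 * (theta - 3.0)
    theta, x, y = adam_step(theta, grad, x, y, m, gamma)
```

In the paper's setting the gradient fed to such a step is the mini-batch gradient $\Psi^{m,\mathbb{S}_{m+1}}(\Theta_m)$ of the empirical loss (118) rather than an exact gradient.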
Moreover, we employ a fully-connected feedforward neural network with one input layer, two hidden layers, and one one-dimensional output layer in our implementations in the case of each of these examples. We also use batch normalization (cf. Ioffe & Szegedy [36]) just before the first linear transformation, just before each of the two nonlinear activation functions in front of the hidden layers, as well as just after the last linear transformation. For the two nonlinear activation functions we employ the multidimensional version of the function $\mathbb{R}\ni x\mapsto \tanh(x)\in(-1,1)$. All weights in the neural network are initialized by means of the Xavier initialization (cf. Glorot & Bengio [18]). All computations were
performed in single precision (float32) on a NVIDIA GeForce GTX 1080 GPU with 1974 MHz core clock and 8 GB GDDR5X memory with 10010 MHz clock rate. The underlying system consisted of an Intel Core i7-6800K CPU with 64 GB DDR4 memory running Tensorflow 1.5 on Ubuntu 16.04.

3.1 Setting

Framework 3.1. Assume Framework 2.10, let $\varepsilon = 10^{-8}$, $\beta_1 = \tfrac{9}{10}$, $\beta_2 = \tfrac{999}{1000}$, $(\gamma_m)_{m\in\mathbb{N}_0}\subseteq(0,\infty)$, let $\mathrm{Pow}_r\colon\mathbb{R}^\nu\to\mathbb{R}^\nu$, $r\in(0,\infty)$, be the functions which satisfy for every $r\in(0,\infty)$, $x = (x_1,\ldots,x_\nu)\in\mathbb{R}^\nu$ that
\[
\mathrm{Pow}_r(x) = \big(|x_1|^r, \ldots, |x_\nu|^r\big), \tag{123}
\]
assume for every $m\in\mathbb{N}_0$, $i\in\{0,1,\ldots,N\}$ that $J_m = 8192$, $t_i = \tfrac{iT}{N}$, $\varrho = 2\nu$, $T = 1$, $\gamma_m = 10^{-1}\mathbb{1}_{[0,250000]}(m) + 10^{-2}\mathbb{1}_{(250000,500000]}(m) + 10^{-3}\mathbb{1}_{(500000,\infty)}(m)$, and assume for every $m\in\mathbb{N}$, $x = (x_1,\ldots,x_\nu)$, $y = (y_1,\ldots,y_\nu)$, $\eta = (\eta_1,\ldots,\eta_\nu)\in\mathbb{R}^\nu$ that
\[
\Psi_m(x,y,\eta) = \big(\beta_1 x + (1-\beta_1)\eta,\; \beta_2 y + (1-\beta_2)\,\mathrm{Pow}_2(\eta)\big) \tag{124}
\]
and
\[
\Phi_m(x,y) = \left( \frac{\gamma_m\, x_1}{1-(\beta_1)^m} \left[ \sqrt{\frac{y_1}{1-(\beta_2)^m}} + \varepsilon \right]^{-1}, \ldots, \frac{\gamma_m\, x_\nu}{1-(\beta_1)^m} \left[ \sqrt{\frac{y_\nu}{1-(\beta_2)^m}} + \varepsilon \right]^{-1} \right). \tag{125}
\]
Equations (124) and (125) in Framework 3.1 describe the Adam optimizer (cf. Kingma & Ba [38], e.g., E et al. [24, (32)–(33) in Section 4.2 and (90)–(91) in Section 5.2], and line 84 in Python code 1 in Section 4 below).

3.2 Heat equation

In this subsection we apply the proposed approximation algorithm to the heat equation (see (126) below). Assume Framework 2.10, assume for every $s,t\in[0,T]$, $x,w\in\mathbb{R}^d$, $m\in\mathbb{N}_0$ that $N = 1$, $d = 100$, $\nu = d(2d) + (2d)^2 + 2d = 2d(3d+1)$, $\varphi(x) = \|x\|^2_{\mathbb{R}^d}$, $H(s,t,x,w) = x + \sqrt{2}\,\mathrm{Id}_{\mathbb{R}^d}\,w$, assume that $\xi^{0,1}\colon\Omega\to\mathbb{R}^d$ is continuously uniformly distributed on $[0,1]^d$, and let $u = (u(t,x))_{(t,x)\in[0,T]\times\mathbb{R}^d}\in C^{1,2}([0,T]\times\mathbb{R}^d,\mathbb{R})$ be an at most polynomially growing function which satisfies for every $t\in[0,T]$, $x\in\mathbb{R}^d$ that $u(0,x) = \varphi(x)$ and
\[
\big(\tfrac{\partial u}{\partial t}\big)(t,x) = (\Delta_x u)(t,x). \tag{126}
\]
Combining, e.g., Lemma 3.2 below with, e.g., Hairer et al. [22, Corollary 4.17 and Remark 4.1] shows that for every $t\in[0,T]$, $x\in\mathbb{R}^d$ it holds that
\[
u(t,x) = \|x\|^2_{\mathbb{R}^d} + 2td. \tag{127}
\]
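The closed-form solution (127) is straightforward to sanity-check by Monte Carlo: for the SDE $dX_t = \sqrt{2}\,dW_t$ one has $X_T = x + \sqrt{2}\,W_T$ and hence $\mathbb{E}[\|X_T\|^2] = \|x\|^2 + 2Td$. A quick check (dimension kept small here purely for speed; this is a verification sketch, not one of the paper's experiments):

```python
import numpy as np

# Monte Carlo check of (127): for dX_t = sqrt(2) dW_t and phi(x) = ||x||^2,
# u(T, x) = E[phi(X_T^x)] = ||x||^2 + 2*T*d.
rng = np.random.default_rng(0)
d, T, n = 10, 1.0, 200_000
x = rng.uniform(0.0, 1.0, size=d)                 # a point in [0,1]^d
w = rng.normal(0.0, np.sqrt(T), size=(n, d))      # W_T for n independent paths
mc = np.mean(np.sum((x + np.sqrt(2.0) * w) ** 2, axis=1))
exact = np.sum(x ** 2) + 2.0 * T * d
```

With $2\cdot 10^5$ paths the relative Monte Carlo error is well below one percent.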
More informationMathematical Methods for Neurosciences. ENS - Master MVA Paris 6 - Master Maths-Bio ( )
Mathematical Methods for Neurosciences. ENS - Master MVA Paris 6 - Master Maths-Bio (2014-2015) Etienne Tanré - Olivier Faugeras INRIA - Team Tosca October 22nd, 2014 E. Tanré (INRIA - Team Tosca) Mathematical
More informationOPTIMAL SOLUTIONS TO STOCHASTIC DIFFERENTIAL INCLUSIONS
APPLICATIONES MATHEMATICAE 29,4 (22), pp. 387 398 Mariusz Michta (Zielona Góra) OPTIMAL SOLUTIONS TO STOCHASTIC DIFFERENTIAL INCLUSIONS Abstract. A martingale problem approach is used first to analyze
More informationA random perturbation approach to some stochastic approximation algorithms in optimization.
A random perturbation approach to some stochastic approximation algorithms in optimization. Wenqing Hu. 1 (Presentation based on joint works with Chris Junchi Li 2, Weijie Su 3, Haoyi Xiong 4.) 1. Department
More informationSelected Exercises on Expectations and Some Probability Inequalities
Selected Exercises on Expectations and Some Probability Inequalities # If E(X 2 ) = and E X a > 0, then P( X λa) ( λ) 2 a 2 for 0 < λ
More informationLipschitz continuity for solutions of Hamilton-Jacobi equation with Ornstein-Uhlenbeck operator
Lipschitz continuity for solutions of Hamilton-Jacobi equation with Ornstein-Uhlenbeck operator Thi Tuyen Nguyen Ph.D student of University of Rennes 1 Joint work with: Prof. E. Chasseigne(University of
More informationOn Stopping Times and Impulse Control with Constraint
On Stopping Times and Impulse Control with Constraint Jose Luis Menaldi Based on joint papers with M. Robin (216, 217) Department of Mathematics Wayne State University Detroit, Michigan 4822, USA (e-mail:
More informationSolutions to the Exercises in Stochastic Analysis
Solutions to the Exercises in Stochastic Analysis Lecturer: Xue-Mei Li 1 Problem Sheet 1 In these solution I avoid using conditional expectations. But do try to give alternative proofs once we learnt conditional
More informationNumerical discretisations of stochastic wave equations
Numerical discretisations of stochastic wave equations David Cohen Matematik och matematisk statistik \ UMIT Research Lab, Umeå universitet Institut für Mathematik, Universität Innsbruck Joint works with
More informationApproximation of BSDEs using least-squares regression and Malliavin weights
Approximation of BSDEs using least-squares regression and Malliavin weights Plamen Turkedjiev (turkedji@math.hu-berlin.de) 3rd July, 2012 Joint work with Prof. Emmanuel Gobet (E cole Polytechnique) Plamen
More informationERRATA: Probabilistic Techniques in Analysis
ERRATA: Probabilistic Techniques in Analysis ERRATA 1 Updated April 25, 26 Page 3, line 13. A 1,..., A n are independent if P(A i1 A ij ) = P(A 1 ) P(A ij ) for every subset {i 1,..., i j } of {1,...,
More informationA D VA N C E D P R O B A B I L - I T Y
A N D R E W T U L L O C H A D VA N C E D P R O B A B I L - I T Y T R I N I T Y C O L L E G E T H E U N I V E R S I T Y O F C A M B R I D G E Contents 1 Conditional Expectation 5 1.1 Discrete Case 6 1.2
More informationChapter 4: Monte-Carlo Methods
Chapter 4: Monte-Carlo Methods A Monte-Carlo method is a technique for the numerical realization of a stochastic process by means of normally distributed random variables. In financial mathematics, it
More informationLarge Deviations Principles for McKean-Vlasov SDEs, Skeletons, Supports and the law of iterated logarithm
Large Deviations Principles for McKean-Vlasov SDEs, Skeletons, Supports and the law of iterated logarithm Gonçalo dos Reis University of Edinburgh (UK) & CMA/FCT/UNL (PT) jointly with: W. Salkeld, U. of
More informationNonlinear representation, backward SDEs, and application to the Principal-Agent problem
Nonlinear representation, backward SDEs, and application to the Principal-Agent problem Ecole Polytechnique, France April 4, 218 Outline The Principal-Agent problem Formulation 1 The Principal-Agent problem
More informationThe Pedestrian s Guide to Local Time
The Pedestrian s Guide to Local Time Tomas Björk, Department of Finance, Stockholm School of Economics, Box 651, SE-113 83 Stockholm, SWEDEN tomas.bjork@hhs.se November 19, 213 Preliminary version Comments
More informationConvergence at first and second order of some approximations of stochastic integrals
Convergence at first and second order of some approximations of stochastic integrals Bérard Bergery Blandine, Vallois Pierre IECN, Nancy-Université, CNRS, INRIA, Boulevard des Aiguillettes B.P. 239 F-5456
More informationRegularity of the density for the stochastic heat equation
Regularity of the density for the stochastic heat equation Carl Mueller 1 Department of Mathematics University of Rochester Rochester, NY 15627 USA email: cmlr@math.rochester.edu David Nualart 2 Department
More informationDeep Learning for Partial Differential Equations (PDEs)
Deep Learning for Partial Differential Equations (PDEs) Kailai Xu kailaix@stanford.edu Bella Shi bshi@stanford.edu Shuyi Yin syin3@stanford.edu Abstract Partial differential equations (PDEs) have been
More informationLarge deviations and averaging for systems of slow fast stochastic reaction diffusion equations.
Large deviations and averaging for systems of slow fast stochastic reaction diffusion equations. Wenqing Hu. 1 (Joint work with Michael Salins 2, Konstantinos Spiliopoulos 3.) 1. Department of Mathematics
More informationCourse: ESO-209 Home Work: 1 Instructor: Debasis Kundu
Home Work: 1 1. Describe the sample space when a coin is tossed (a) once, (b) three times, (c) n times, (d) an infinite number of times. 2. A coin is tossed until for the first time the same result appear
More informationLecture No 1 Introduction to Diffusion equations The heat equat
Lecture No 1 Introduction to Diffusion equations The heat equation Columbia University IAS summer program June, 2009 Outline of the lectures We will discuss some basic models of diffusion equations and
More informationMathematical Methods for Neurosciences. ENS - Master MVA Paris 6 - Master Maths-Bio ( )
Mathematical Methods for Neurosciences. ENS - Master MVA Paris 6 - Master Maths-Bio (2014-2015) Etienne Tanré - Olivier Faugeras INRIA - Team Tosca November 26th, 2014 E. Tanré (INRIA - Team Tosca) Mathematical
More informationPartial Differential Equations with Applications to Finance Seminar 1: Proving and applying Dynkin s formula
Partial Differential Equations with Applications to Finance Seminar 1: Proving and applying Dynkin s formula Group 4: Bertan Yilmaz, Richard Oti-Aboagye and Di Liu May, 15 Chapter 1 Proving Dynkin s formula
More informationOn a class of stochastic differential equations in a financial network model
1 On a class of stochastic differential equations in a financial network model Tomoyuki Ichiba Department of Statistics & Applied Probability, Center for Financial Mathematics and Actuarial Research, University
More informationSOLUTIONS OF SEMILINEAR WAVE EQUATION VIA STOCHASTIC CASCADES
Communications on Stochastic Analysis Vol. 4, No. 3 010) 45-431 Serials Publications www.serialspublications.com SOLUTIONS OF SEMILINEAR WAVE EQUATION VIA STOCHASTIC CASCADES YURI BAKHTIN* AND CARL MUELLER
More informationProperties of an infinite dimensional EDS system : the Muller s ratchet
Properties of an infinite dimensional EDS system : the Muller s ratchet LATP June 5, 2011 A ratchet source : wikipedia Plan 1 Introduction : The model of Haigh 2 3 Hypothesis (Biological) : The population
More informationRobustness for a Liouville type theorem in exterior domains
Robustness for a Liouville type theorem in exterior domains Juliette Bouhours 1 arxiv:1207.0329v3 [math.ap] 24 Oct 2014 1 UPMC Univ Paris 06, UMR 7598, Laboratoire Jacques-Louis Lions, F-75005, Paris,
More informationfor all f satisfying E[ f(x) ] <.
. Let (Ω, F, P ) be a probability space and D be a sub-σ-algebra of F. An (H, H)-valued random variable X is independent of D if and only if P ({X Γ} D) = P {X Γ}P (D) for all Γ H and D D. Prove that if
More informationQuantization of stochastic processes with applications on Euler-Maruyama schemes
UPTEC F15 063 Examensarbete 30 hp Oktober 2015 Quantization of stochastic processes with applications on Euler-Maruyama schemes Viktor Edward Abstract Quantization of stochastic processes with applications
More informationLECTURE 2: LOCAL TIME FOR BROWNIAN MOTION
LECTURE 2: LOCAL TIME FOR BROWNIAN MOTION We will define local time for one-dimensional Brownian motion, and deduce some of its properties. We will then use the generalized Ray-Knight theorem proved in
More informationLAN property for sde s with additive fractional noise and continuous time observation
LAN property for sde s with additive fractional noise and continuous time observation Eulalia Nualart (Universitat Pompeu Fabra, Barcelona) joint work with Samy Tindel (Purdue University) Vlad s 6th birthday,
More informationKolmogorov Equations and Markov Processes
Kolmogorov Equations and Markov Processes May 3, 013 1 Transition measures and functions Consider a stochastic process {X(t)} t 0 whose state space is a product of intervals contained in R n. We define
More informationA NOTE ON STOCHASTIC INTEGRALS AS L 2 -CURVES
A NOTE ON STOCHASTIC INTEGRALS AS L 2 -CURVES STEFAN TAPPE Abstract. In a work of van Gaans (25a) stochastic integrals are regarded as L 2 -curves. In Filipović and Tappe (28) we have shown the connection
More information1 Independent increments
Tel Aviv University, 2008 Brownian motion 1 1 Independent increments 1a Three convolution semigroups........... 1 1b Independent increments.............. 2 1c Continuous time................... 3 1d Bad
More informationStrong Solutions and a Bismut-Elworthy-Li Formula for Mean-F. Formula for Mean-Field SDE s with Irregular Drift
Strong Solutions and a Bismut-Elworthy-Li Formula for Mean-Field SDE s with Irregular Drift Thilo Meyer-Brandis University of Munich joint with M. Bauer, University of Munich Conference for the 1th Anniversary
More informationSome Terminology and Concepts that We will Use, But Not Emphasize (Section 6.2)
Some Terminology and Concepts that We will Use, But Not Emphasize (Section 6.2) Statistical analysis is based on probability theory. The fundamental object in probability theory is a probability space,
More information4 Expectation & the Lebesgue Theorems
STA 205: Probability & Measure Theory Robert L. Wolpert 4 Expectation & the Lebesgue Theorems Let X and {X n : n N} be random variables on a probability space (Ω,F,P). If X n (ω) X(ω) for each ω Ω, does
More informationUnbiased simulation of stochastic differential equations using parametrix expansions
Submitted to Bernoulli Unbiased simulation of stochastic differential equations using parametrix expansions PATRIK ANDERSSON and ARTURO KOHATSU-HIGA Department of Statistics, Uppsala University, Box 53,
More information