arxiv: v1 [math.pr] 25 Jul 2017

COUPLING AND A GENERALISED POLICY ITERATION ALGORITHM IN CONTINUOUS TIME

SAUL D. JACKA, ALEKSANDAR MIJATOVIĆ, AND DEJAN ŠIRAJ

Abstract. We analyse a version of the policy iteration algorithm for the discounted infinite-horizon problem for controlled multidimensional diffusion processes, where both the drift and the diffusion coefficient can be controlled. We prove that, under assumptions on the problem data, the payoffs generated by the algorithm converge monotonically to the value function and an accumulation point of the sequence of policies is an optimal policy. The algorithm is stated and analysed in continuous time and state, with discretisation featuring neither in the theorems nor the proofs. A key technical tool used to show that the algorithm is well-defined is the mirror coupling of Lindvall and Rogers.

1. Introduction

Howard's policy iteration algorithm (PIA) [13] is a well-known tool for solving control problems for Markov decision processes (see e.g. [4] for a recent survey of approximate policy iteration methods for finite state, discrete time, stochastic dynamic programming problems). The algorithm can be recast for general state-space continuous-time control problems, where allowed actions can be chosen from a Polish space. In this paper we investigate the convergence of the PIA for an infinite horizon discounted cost problem in the context of controlled diffusion processes in $\mathbb{R}^d$, where the control takes place in an arbitrary compact metric space. The main aim of the paper is to analyse the convergence of a sequence of policies and the corresponding payoff functions produced by the PIA under assumptions that are at least in principle verifiable in terms of the model data. Our control setting is similar to that of [1], where an ergodic cost criterion was considered.
The main differences, beyond the cost criterion, are as follows: (1) we allow the controller to modulate the drift as well as the diffusion coefficient; (2) we consider a generalised version of the PIA where an arbitrary scaling function can be used to simplify the algorithm; (3) we investigate the convergence not only of payoffs but also of policies produced by the PIA; (4) we work with Markov policies that are defined for every $x \in \mathbb{R}^d$, not almost everywhere, and obtain a locally uniform convergence of a subsequence to the optimal policy. This last point in this list is particularly important in our setting, as our aim is to design and analyse an algorithm that can (in principle at least) be used to construct an optimal policy. This requirement forces us to work in the context of classical solutions of PDEs, rather than relying on generalised solutions of the Poisson equation in appropriate Sobolev spaces. The latter approach, followed in [1], is based on the fact that there exists a canonical solution to the Poisson equation in $W^{2,p}_{\mathrm{loc}}(\mathbb{R}^d)$, see [2]. The analysis of the PIA can then be performed using Krylov's extension [15] of Itô's formula to functions in the Sobolev space $W^{2,p}_{\mathrm{loc}}(\mathbb{R}^d)$. Our method for solving the Poisson equation in the classical sense is based on the coupling of Lindvall and Rogers [18]. This coupling plays a crucial role in the proof of Proposition 1, guaranteeing that a payoff function for a locally Lipschitz Markov policy is the classical solution of the corresponding Poisson equation. The main technical contribution of the paper is the result in Lemma 7, which shows that the mirror coupling from [18] is successful with very high probability, when the diffusion processes are started sufficiently close together. Interestingly, the condition in [18] for the coupling to be successful is not satisfied in our setting in general. The proof of Lemma 7 is based on a local path-wise comparison of a time-change of the distance between the coupled diffusions and an appropriately chosen Bessel process.

Date: July 26, 2017. Mathematics Subject Classification. 93E20, 49L99, 60H30. Key words and phrases. Policy iteration algorithm, policy improvement, optimal stochastic control, controlled diffusion processes, discounted infinite horizon problem, mirror coupling. SD supported by the EPSRC grant EP/P377X/1; AM supported by the EPSRC grant EP/P3818/1 and the Programme on Data-Centric Engineering funded by Lloyd's Register Foundation; DŠ supported by the Slovene Human Resources Development and Scholarship Fund.

The convergence of the policies and payoffs in the PIA is obtained in several steps. First we show that the PIA always improves the payoff. Then we prove, using a diagonalisation argument and an Arzelà-Ascoli type result, that a subsequence of the policies produced by the PIA converges locally uniformly to a locally Lipschitz limiting policy. The final stage of the argument shows that this limiting policy is indeed an optimal policy with payoff equal to the pointwise limit of the payoffs produced by the PIA. These steps are detailed in Section 2 and proved in Section 5.

The literature on the PIA for Markov decision processes in various settings is extensive (see e.g. [7], [11], [12], [17], [2], [16], [19], [22], [23] and [24] and the references therein). Our approach is in some sense closest to the continuous time analysis in general state spaces in [7], where the convergence of the subsequence of the policies is established in the case of finite action space.
In [24] this restriction is removed, but the controlled processes considered do not include diffusions. A recent application of the PIA to impulse control in continuous time is given in [3].

Finally, as mentioned above, we observe that the PIA can be generalised by multiplying the expression to be minimised by an arbitrary positive scaling function that can depend both on the state and the control action (see (gPIA) below). A choice of the scaling function clearly influences the sequence of policies produced by the algorithm. In particular, in the one-dimensional case, the scaling function can be used to eliminate the second derivative of the payoff from the algorithm. This idea is described in Section 3. A numerical example of the PIA is reported in Section 4. In this example at least, the convergence of the PIA is very fast as the algorithm finds an optimal policy in fewer than six iterations.

2. Multidimensional controlled diffusion processes

Let $(A, d_A)$ be a compact metric space of available control actions and, for some $d, m \in \mathbb{N}$, let $\sigma : \mathbb{R}^d \times A \to \mathbb{R}^{d \times m}$ and $\mu : \mathbb{R}^d \times A \to \mathbb{R}^d$ be measurable functions. Let the set $\mathcal{A}(x)$ of admissible policies at $x \in \mathbb{R}^d$ consist of processes $\Pi = (\Pi_t)_{t \ge 0}$ satisfying the following: $\Pi$ is $A$-valued, adapted to a filtration $(\mathcal{F}_t)_{t \ge 0}$ and there exists an $(\mathcal{F}_t)$-adapted process $X^{\Pi,x} = (X^{\Pi,x}_t)_{t \ge 0}$ satisfying the SDE

$$X^{\Pi,x}_t = x + \int_0^t \sigma(X^{\Pi,x}_s, \Pi_s)\,dB_s + \int_0^t \mu(X^{\Pi,x}_s, \Pi_s)\,ds, \qquad t \ge 0, \tag{1}$$

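where $B = (B_t)_{t \ge 0}$ is an $m$-dimensional $(\mathcal{F}_t)$-Brownian motion. Note that the filtration $(\mathcal{F}_t)_{t \ge 0}$ (and indeed the entire filtered probability space) depends on the policy $\Pi$ in $\mathcal{A}(x)$. Pick measurable functions $\alpha : \mathbb{R}^d \times A \to \mathbb{R}$ and $f : \mathbb{R}^d \times A \to \mathbb{R}$. For any $x \in \mathbb{R}^d$ and $\Pi \in \mathcal{A}(x)$, define the payoff of the policy $\Pi$ by

$$V_\Pi(x) := \mathbb{E} \int_0^\infty e^{-\int_0^t \alpha(X^{\Pi,x}_s, \Pi_s)\,ds}\, f(X^{\Pi,x}_t, \Pi_t)\,dt.$$

Control problem. Construct the value function $V$, defined by

$$V(x) := \inf_{\Pi \in \mathcal{A}(x)} V_\Pi(x), \qquad x \in \mathbb{R}^d,$$

and, if it exists, an optimal control $\{\Pi^x \in \mathcal{A}(x) : x \in \mathbb{R}^d\}$, satisfying $V(x) = V_{\Pi^x}(x)$.

Note first that the problem is specified completely by the deterministic data $\sigma$, $\mu$, $\alpha$ and $f$. In order to define an algorithm that solves this problem, the functions $\sigma, \mu, \alpha, f$ are assumed to satisfy Assumption 1 below throughout this section.

Assumption 1. The functions $\sigma$, $\mu$, $\alpha$ and $f$ are bounded, and Lipschitz on compacts in $\mathbb{R}^d \times A$, i.e. for every compact set $K \subset \mathbb{R}^d \times A$ there exists a constant $C_K > 0$ such that

$$\|h(x,p) - h(y,r)\| \le C_K \big( \|x - y\|_2 + d_A(p,r) \big) \tag{2}$$

holds for every $(x,p), (y,r) \in K$, and $h \in \{\sigma, \mu, \alpha, f\}$. In addition, $\alpha(x,p) > \epsilon_0 > 0$ for all $(x,p) \in \mathbb{R}^d \times A$, and there exists $\lambda > 0$ such that

$$\langle \sigma(x,p)\sigma(x,p)^T v, v \rangle \ge \lambda |v|^2 \quad \text{for all } x \in \mathbb{R}^d,\ p \in A,\ v \in \mathbb{R}^d. \tag{3}$$

Remark 1. Here, and throughout the paper, $|\cdot|$ and $\langle \cdot, \cdot \rangle$ denote the Euclidean norm and inner product respectively. The norm $\|M\| := \sup\{|Mv|/|v| : v \in \mathbb{R}^m \setminus \{0\}\} = \sqrt{\lambda_{\max}(MM^T)}$, for a matrix $M \in \mathbb{R}^{d \times m}$, is used in (2) for $h = \sigma$, where $\lambda_{\max}(MM^T)$ is the largest eigenvalue of the non-negative definite matrix $MM^T \in \mathbb{R}^{d \times d}$ and $M^T \in \mathbb{R}^{m \times d}$ denotes the transpose of $M$.

Remark 2. The uniform ellipticity condition in (3) is the multidimensional analogue of the volatility being bounded away from $0$. Hence, for all $x \in \mathbb{R}^d$ and $p \in A$, the smallest eigenvalue of $\sigma(x,p)\sigma(x,p)^T \in \mathbb{R}^{d \times d}$ is at least of size $\lambda$ and, in particular, $m \ge d$ (cf. Remark 3 below).

A measurable function $\pi : \mathbb{R}^d \to A$ is a Markov policy (or synonymously Markov control) if for every $x \in \mathbb{R}^d$ there exists an $\mathbb{R}^d$-valued process $X^{\pi,x} = (X^{\pi,x}_t)_{t \ge 0}$ that satisfies the following SDE:

$$X^{\pi,x}_t = x + \int_0^t \sigma\big(X^{\pi,x}_s, \pi(X^{\pi,x}_s)\big)\,dB_s + \int_0^t \mu\big(X^{\pi,x}_s, \pi(X^{\pi,x}_s)\big)\,ds, \qquad t \ge 0. \tag{4}$$

Let $(\mathcal{F}_t)_{t \ge 0}$ be a filtration with respect to which $(X^{\pi,x}, B)$ is $(\mathcal{F}_t)$-adapted and $B$ is an $(\mathcal{F}_t)$-Brownian motion.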
Such a filtration $(\mathcal{F}_t)_{t \ge 0}$ exists by the definition of a solution of SDE (4), see e.g. [14, Def , p. 3]. Then $(\mathcal{F}_t)_{t \ge 0}$ can be taken to be the filtration in the definition of the policy $\pi(X^{\pi,x}) \in \mathcal{A}(x)$. Moreover, without loss of generality, we may assume that $(\mathcal{F}_t)_{t \ge 0}$ satisfies the usual conditions. For any function $\pi : \mathbb{R}^d \to A$ that is Lipschitz on compacts (i.e. locally Lipschitz) in $\mathbb{R}^d$, Assumption 1 implies (see e.g. [5, p. 45] and the references therein) that the SDE in (4) has a unique, strong non-exploding solution, thus making $\pi$ a Markov policy.

The payoff function of a locally Lipschitz Markov policy is a classical solution of a linear PDE, a fact key for (gPIA) to work. To state this formally, recall that $h : \mathbb{R}^d \to \mathbb{R}^k$ (for any $k \in \mathbb{N}$) is $(1/2)$-Hölder continuous

on a compact $D \subset \mathbb{R}^d$ if $\|h(x) - h(x')\| \le K_D |x - x'|^{1/2}$ holds for some constant $K_D > 0$ and all $x, x' \in D$. Streamline the notation for Markov policies as follows:

$$V_\pi(x) := V_{\pi(X^{\pi,x})}(x), \quad \text{and} \quad L_\pi h := \tfrac{1}{2} \operatorname{Tr}\big(\sigma_\pi^T\, Hh\, \sigma_\pi\big) + \mu_\pi^T \nabla h \quad \text{for } h \in C^2(\mathbb{R}^d), \tag{5}$$

$$\sigma_\pi(\cdot) := \sigma(\cdot, \pi(\cdot)), \quad \mu_\pi(\cdot) := \mu(\cdot, \pi(\cdot)), \quad \alpha_\pi(\cdot) := \alpha(\cdot, \pi(\cdot)), \quad f_\pi(\cdot) := f(\cdot, \pi(\cdot)),$$

where $Hh$ and $\nabla h$ are the Hessian and gradient of $h$, respectively, and $\operatorname{Tr}(M)$ denotes the trace of any matrix $M \in \mathbb{R}^{m \times m}$.

Proposition 1. Let Assumption 1 hold. For a locally Lipschitz Markov policy $\pi$ we have: $V_\pi \in C^2(\mathbb{R}^d)$ is the unique bounded solution of the Poisson equation $L_\pi V_\pi - \alpha_\pi V_\pi + f_\pi = 0$ and $HV_\pi$ is $(1/2)$-Hölder on compacts in $\mathbb{R}^d$.

Remark 3. The proof of Proposition 1, given in Section 5.2, depends crucially on the coupling in Lindvall and Rogers [18] of two $d$-dimensional diffusions via a coupling of the corresponding $d$-dimensional driving Brownian motions (see Lemma 7 below). Moreover, it is easy to see that the probability of coupling could in general be zero if the dimension $m$ of the Brownian noise is strictly less than $d$. In this case the controlled diffusions in $\mathbb{R}^d$, started at distinct points, could remain forever on disjoint submanifolds of positive co-dimension in $\mathbb{R}^d$ for any choice of control. In particular, Proposition 1 fails in the case $m < d$, as demonstrated by Example 1 below.

Example 1. Let $m = 1$, $d = 2$, $A = [-1, 1]$, $f(x_1, x_2, a) := |x_1 + x_2| + |x_1 - x_2| + a^2/2$, $\alpha(x_1, x_2, a) \equiv 1$, $\sigma(x_1, x_2, a) \equiv (1, 1)^T$ and $\mu(x_1, x_2, a) = a(1, 1)^T$. Then, for a constant policy $\pi \equiv a$ ($a \in A$), the controlled process started at $x \in \mathbb{R}^2$ is given by $X^{\pi,x}_t = x + (1,1)^T B_t + a(1,1)^T t$, for $t \ge 0$. In particular, for $a = 0$ and any $x = (x_1, x_2)^T$, we have $V_\pi(x) = |x_1 - x_2| + g(x_1 + x_2)$, where $g(y) := \mathbb{E}|y + 2B_{e_1}|$, $y \in \mathbb{R}$, and $e_1$ is an exponential random variable with mean $1$, independent of $B$. Since $B_{e_1}$ has a smooth density, it follows that $g$ is also smooth, implying that $V_\pi$ cannot satisfy the conclusion of Proposition 1.

Remark 4. As we are using the standard weak formulation of the control problem (i.e.
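the filtered probability space is not specified in advance), all that matters for a Markov policy $\pi$ is the law of the controlled process $X^{\pi,\cdot}$, which solves the martingale problem in (4). Moreover, this law is uniquely determined by the symmetric-matrix valued function $(x,a) \mapsto \sigma(x,a)\sigma(x,a)^T \in \mathbb{R}^{d \times d}$ (and of course the drift $(x,a) \mapsto \mu(x,a)$). Since the symmetric square root of $\sigma(x,a)\sigma(x,a)^T$ in $\mathbb{R}^{d \times d}$ satisfies Assumption 1, in the remainder of the paper we assume, without loss of generality, that the noise and the controlled process are of the same dimension, i.e. $d = m$ (cf. Remark 2).

Remark 5. For any locally Lipschitz Markov policy $\pi$, the process $X^{\pi,\cdot}$ is strong Markov. Hence [8, Thm. 1.7] implies that $V_\pi$ is in the domain $\mathcal{D}(\mathsf{A}_\pi)$ of the generator $\mathsf{A}_\pi$ of $X^{\pi,\cdot}$ and that the Poisson equation $\mathsf{A}_\pi V_\pi - \alpha_\pi V_\pi + f_\pi = 0$ holds. Recall that for a bounded continuous $g : \mathbb{R}^d \to \mathbb{R}$ in $\mathcal{D}(\mathsf{A}_\pi)$ the limit $\mathsf{A}_\pi g(x) := \lim_{t \downarrow 0} \big(\mathbb{E}\, g(X^{\pi,x}_t) - g(x)\big)/t$ exists for all $x \in \mathbb{R}^d$. Furthermore, if $g$ is also in $C^2(\mathbb{R}^d)$, it is known that $\mathsf{A}_\pi g = L_\pi g$. However, [8, Thm. 1.7] does not imply that $V_\pi$ is in $C^2(\mathbb{R}^d)$. The PDE in Proposition 1, key for (gPIA) to work, is established via the coupling argument in Section 5.2.

If a policy $\pi : \mathbb{R}^d \to A$ is constant (i.e. $\pi \equiv p \in A$), write $\sigma_p, \mu_p, \alpha_p, f_p, L_p$ and $V_p$ instead of $\sigma_\pi, \mu_\pi, \alpha_\pi, f_\pi, L_\pi$ and $V_\pi$, respectively. Let $S : \mathbb{R}^d \times A \to (0, \infty)$ be a continuous function and, for any $p \in A$, denote $S_p(x) := S(x, p)$. Under Assumption 1, the function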

$p \mapsto S_p(x)\big(L_p h(x) - \alpha_p(x) h(x) + f_p(x)\big)$, $p \in A$, is continuous for any $x \in \mathbb{R}^d$ and $h \in C^2(\mathbb{R}^d)$. Since $A$ is compact, there exists $I_h(x) \in A$, which minimises this function.

Assumption 2. For any $h \in C^2(\mathbb{R}^d)$, the function $I_h : \mathbb{R}^d \to A$ can be chosen to be locally Lipschitz on $\mathbb{R}^d$. The continuous scaling function $S$ satisfies $\epsilon_S < S < M_S$ for some $\epsilon_S, M_S \in (0, \infty)$.

Generalised Policy Iteration Algorithm
Input: $\sigma, \mu, \alpha, f, S$ satisfying Assumptions 1-2, constant policy $\pi_0$ and $N \in \mathbb{N}$.
for $n = 0$ to $N - 1$ do
    Compute $V_{\pi_n}$ from the PDE in Proposition 1.
    Choose the policy $\pi_{n+1}$ as follows:
    (gPIA)  $\pi_{n+1}(x) \in \operatorname{argmin}_{p \in A} \big\{ S_p(x)\big(L_p V_{\pi_n}(x) - \alpha_p(x) V_{\pi_n}(x) + f_p(x)\big) \big\}$, $x \in \mathbb{R}^d$.
end
Output: Approximation $V_{\pi_N}$ of the value function $V$.

Remark 6. By Assumption 2, the policy $\pi_{n+1}$ defined in (gPIA) is locally Lipschitz for all $n \le N - 1$. Hence, by Proposition 1, the algorithm is well defined.

Remark 7. In the classical case of (gPIA) we take $S \equiv 1$. A non-trivial scaling function $S$, which plays an important role in the one-dimensional context (see Section 3 below), makes the algorithm into a generalised Policy Iteration Algorithm. Theorem 2 requires only the positivity and continuity of $S$. The uniform bounds on $S$ in Assumption 2 are used only in the proof of Theorem 5.

(gPIA) always leads to an improved payoff (Theorem 2 is proved in Section 5.2 below):

Theorem 2. Under Assumptions 1-2, the inequality $V_{\pi_{n+1}} \le V_{\pi_n}$ holds on $\mathbb{R}^d$ for all $n \in \{0, \ldots, N-1\}$.

The sequence $\{V_{\pi_N}\}_{N \in \mathbb{N}}$, obtained by running (gPIA) from a given policy $\pi_0$ for each $N \in \mathbb{N}$, is non-increasing and bounded. Hence we may define

$$V_{\lim}(x) := \lim_{N \to \infty} V_{\pi_N}(x), \qquad x \in \mathbb{R}^d. \tag{6}$$

However, the sequence of the corresponding Markov policies $\{\pi_N\}_{N \in \mathbb{N}}$ need not converge and, even if it did, the limit may be discontinuous and hence not necessarily a Markov policy.

Remark 8. If the algorithm stops before $N$, i.e. $\pi_{n+1} = \pi_n$ for some $n < N$, then clearly $V_{\pi_n} = V_{\pi_N}$ and $\pi_N = \pi_n$. As this holds for any $N > n$, with (gPIA) started at a given $\pi_0$, we may proceed directly to the verification lemma (Theorem 5 below) to conclude that $V_{\pi_n}$ is the value function with an optimal policy $\pi_n$.
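As a rough numerical illustration of the loop above, the following Python sketch runs (gPIA) with $S \equiv 1$ for a one-dimensional problem on a bounded interval. Everything specific in it is an assumption made for illustration only: the model data ($\sigma \equiv 1$, $\mu(x,p) = p$, $\alpha \equiv 1$, $f(x,p) = x^2 + p^2$, $A = [-1, 1]$), the interval $(-10, 10)$ with boundary values $x^2$, the constant initial policy, and the replacement of the PDE solve by central finite differences on a grid. The function names are hypothetical. The algorithm in the text is formulated without any discretisation, so this is a sketch for experimentation rather than the method analysed in the paper.

```python
import numpy as np


def solve_payoff(policy, x):
    """Finite-difference solve of 0.5*V'' + p*V' - V + x^2 + p^2 = 0
    with Dirichlet boundary values V = x^2 at both end points."""
    n = len(x)
    h = x[1] - x[0]
    A = np.zeros((n, n))
    b = np.zeros(n)
    # Boundary rows: V(a) = a^2 and V(b) = b^2.
    A[0, 0] = 1.0
    b[0] = x[0] ** 2
    A[-1, -1] = 1.0
    b[-1] = x[-1] ** 2
    for i in range(1, n - 1):
        p = policy[i]
        A[i, i - 1] = 0.5 / h**2 - p / (2.0 * h)
        A[i, i] = -1.0 / h**2 - 1.0
        A[i, i + 1] = 0.5 / h**2 + p / (2.0 * h)
        b[i] = -(x[i] ** 2 + p**2)
    return np.linalg.solve(A, b)


def improve_policy(V, x):
    """Pointwise minimiser of p*V'(x) + p^2 over p in [-1, 1].
    The objective is convex in p, so clipping -V'/2 is exact."""
    dV = np.gradient(V, x[1] - x[0])  # central differences in the interior
    return np.clip(-dV / 2.0, -1.0, 1.0)


def gpia(n_iter=6, n_grid=201, initial_action=1.0):
    x = np.linspace(-10.0, 10.0, n_grid)
    policy = np.full(n_grid, initial_action)  # constant initial policy
    payoffs = []
    for _ in range(n_iter):
        V = solve_payoff(policy, x)       # policy evaluation step
        payoffs.append(V)
        policy = improve_policy(V, x)     # policy improvement step
    return x, payoffs, policy


if __name__ == "__main__":
    x, payoffs, policy = gpia()
    for k in range(len(payoffs) - 1):
        gap = np.max(np.abs(payoffs[k + 1] - payoffs[k]))
        print(f"iteration {k + 1}: max change in payoff = {gap:.3e}")
```

On this grid the discretised scheme is itself a policy iteration for a monotone linear system, so the computed payoffs should be non-increasing in the iteration index, mirroring Theorem 2.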
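Controlling the convergence of the policies requires the following additional assumptions. Introduce the set $\mathcal{S}_{B,K} := \{h \in C^2(\mathbb{R}^d) : |\nabla h(x)| < B_1,\ \|Hh(x)\| < B_2 \text{ for } x \in D_K\}$, where $D_K := \{x \in \mathbb{R}^d : |x| \le K\}$ is a ball of radius $K > 0$ and $B := (B_1, B_2) \in (0, \infty)^2$ are constants.

Assumption 3. For any $K > 0$, there exist constants $B_K$ and $C_K$ satisfying the following: if $h \in \mathcal{S}_{B_K, K}$, then $I_h$ (defined in Assumption 2) satisfies $d_A(I_h(x), I_h(y)) \le C_K |x - y|$ for all $x, y \in D_K$.

Assumption 4. For any $K > 0$, let $B_K, C_K$ be as in Assumption 3. Then for a locally Lipschitz Markov policy $\pi : \mathbb{R}^d \to A$, such that $d_A(\pi(x), \pi(y)) \le C_K |x - y|$, $x, y \in D_K$, the following holds: $V_\pi \in \mathcal{S}_{B_K, K}$.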

Remark 9. Non-trivial problem data that satisfy Assumptions 1-4 are described in Section 4 below. It is precisely these types of examples that motivated the form Assumptions 2-4 take. Assumption 1 is standard and Assumptions 2-3 concern only the deterministic data specifying the problem. Assumption 4 essentially states that $HV_\pi$ has a prescribed bound on the ball $D_K$ if the coefficients of the PDE in Proposition 1 have a prescribed Lipschitz constant. Schauder's boundary estimates for elliptic PDEs [1, p. 86] suggest that this requirement is both natural and feasible. In fact, Assumption 4 may follow from assumptions of the type 1-3 on the problem data. This is left for future research.

Proposition 3 and Theorems 4 and 5, proved in Section 5.2 below, show that (gPIA) converges.

Proposition 3. Let Assumptions 1-4 hold. Then there exists a subsequence of $\{\pi_N\}_{N \in \mathbb{N}}$ that converges uniformly on every compact subset of $\mathbb{R}^d$ to a locally Lipschitz Markov policy.

Let $\pi_{\lim} : \mathbb{R}^d \to A$ denote a locally Lipschitz Markov policy that is a locally uniform limit of a subsequence of $\{\pi_N\}_{N \in \mathbb{N}}$. By Proposition 1, $V_{\pi_{\lim}}$ is a well-defined function in $C^2(\mathbb{R}^d)$ that solves the corresponding Poisson equation. However, since $\pi_{\lim}$ clearly depends on its defining subsequence, so may $V_{\pi_{\lim}}$. Furthermore, $V_{\lim}$ may depend on the choice of $\pi_0$ in (gPIA). But this is not so, since $V_{\lim}$ equals both the value function $V$ and the payoff for the policy $\pi_{\lim}$.

Theorem 4. Under Assumptions 1-4, the equality $V_{\lim} = V_{\pi_{\lim}}$ holds on $\mathbb{R}^d$ for a policy $\pi_{\lim}$.

Theorem 5. Let Assumptions 1-4 hold. Then for every $x \in \mathbb{R}^d$ and $\Pi \in \mathcal{A}(x)$ the inequality $V_{\lim}(x) \le V_\Pi(x)$ holds. Hence $V_{\lim}$ equals the value function $V$, does not depend on the choice of $\pi_0$ in (gPIA) and $\pi_{\lim}$ is an optimal locally Lipschitz Markov policy for the control problem.

Remark 10. The key technical issue in the proof of Theorem 5 is that the policies in the convergent subsequence constructed in the proof of Proposition 3 are not improvements of their predecessors (cf. (gPIA)). The idea of the proof is to work with a convergent subsequence of the pairs of policies $\{(\pi_N, \pi_{N+1})\}_{N \in \mathbb{N}}$, where $\{\pi_N\}_{N \in \mathbb{N}}$ is produced by the (gPIA) (see Section 5.2 for details).

3. The one-dimensional case

There are two reasons for considering the one-dimensional control problem in its own right.

(A) The canonical choice for the scaling function $S := 1/\sigma^2$ simplifies the (gPIA) to

$$\pi_{n+1}(x) \in \operatorname{argmin}_{p \in A} \Big\{ \big(\mu(x,p) V_{\pi_n}'(x) - \alpha(x,p) V_{\pi_n}(x) + f(x,p)\big)/\sigma^2(x,p) \Big\}, \tag{7}$$

by removing the second derivative of the payoff function $V_{\pi_n}$ from the minimisation procedure. This reduction appears to make the numerical implementation of the (gPIA) converge extremely fast: in the example in Section 4.2 below the optimal payoff and policy are obtained in fewer than half a dozen iterations.

(B) It is natural to control the process $X^{\Pi,x}$ only up to its first exit from an interval $(a, b)$, where $a, b \in [-\infty, \infty]$, and generalise the payoff as follows:

$$V_\Pi(x) := \mathbb{E} \left[ \int_0^{\tau_a^b(X^{\Pi,x})} e^{-\int_0^t \alpha(X^{\Pi,x}_s, \Pi_s)\,ds} f(X^{\Pi,x}_t, \Pi_t)\,dt + e^{-\int_0^{\tau_a^b(X^{\Pi,x})} \alpha(X^{\Pi,x}_s, \Pi_s)\,ds}\, g\big(X^{\Pi,x}_{\tau_a^b(X^{\Pi,x})}\big) \right].$$

Here $\mu, \sigma, \alpha, f : (a, b) \times A \to \mathbb{R}$ are measurable with the same notational conventions as in Section 2 ($f_p(x) = f(x,p)$ etc.). Furthermore, $\Pi \in \mathcal{A}(x)$ if $X^{\Pi,x}$ follows SDE (1) on the stochastic interval

$[0, \tau_a^b(X^{\Pi,x}))$, where $\tau_a^b(X^{\Pi,x}) := \inf\{t \ge 0 ; X^{\Pi,x}_t \in \{a, b\}\}$ ($\inf \emptyset = \infty$), and $X^{\Pi,x}_t = X^{\Pi,x}_{\tau_a^b(X^{\Pi,x})}$ for $t \ge \tau_a^b(X^{\Pi,x})$, i.e. $\Pi_t$, $t \in [\tau_a^b(X^{\Pi,x}), \infty)$, are irrelevant for $V_\Pi(x)$. Pick an arbitrary function $g : \{a, b\} \cap \mathbb{R} \to \mathbb{R}$ and set the control problem as in Section 2 (with $\mathbb{R}^d$ substituted by $(a, b)$).

Remark 11. In Assumptions 1-4 we substitute $\mathbb{R}^d$ with $(a, b)$. In particular, inequality (3) in Assumption 1 takes the form $\sigma^2(x, p) \ge \lambda$ for all $x \in (a, b)$, $p \in A$. Assumption 1 hence implies the requirement on the scaling function $S = 1/\sigma^2$ in Assumption 2. In Assumptions 3-4, the family of closed balls $(D_K)_{K>0}$ in $\mathbb{R}^d$ is substituted with a family of compact intervals $(D_K)_{K>0}$ in $(a, b)$, such that $\bigcup_{K>0} D_K = (a, b)$ and, if $K' < K$, $D_{K'}$ is contained in the interior of $D_K$.

Remark 12. On the event $\{\tau_a^b(X^{\Pi,x}) = \infty\}$ we take $g\big(X^{\Pi,x}_{\tau_a^b(X^{\Pi,x})}\big) / \exp\big(\int_0^{\tau_a^b(X^{\Pi,x})} \alpha(X^{\Pi,x}_s, \Pi_s)\,ds\big) = 0$, since by Assumption 1 the integral is infinite. Note also that on this event $X^{\Pi,x}_{\tau_a^b(X^{\Pi,x})}$ may not be defined. Moreover, if $\{a, b\} \cap \mathbb{R} = \emptyset$, then by Assumption 1 we have $\tau_a^b(X^{\Pi,x}) = \infty$ a.s.

A measurable function $\pi : (a, b) \to A$ is a Markov policy if for every $x \in (a, b)$ there exists a process $X^{\pi,x} = (X^{\pi,x}_t)_{t \ge 0}$ satisfying SDE (4) on the stochastic interval $[0, \tau_a^b(X^{\pi,x}))$ and $X^{\pi,x}_t = X^{\pi,x}_{\tau_a^b(X^{\pi,x})}$ if $\tau_a^b(X^{\pi,x}) \le t < \infty$. If $\pi$ is a Markov policy, then $\pi(X^{\pi,x}) := (\pi(X^{\pi,x}_t))_{t \ge 0} \in \mathcal{A}(x)$ for every $x \in (a, b)$, where we pick arbitrary elements in $A$ for the values $\pi(a)$ and $\pi(b)$ if $a > -\infty$ and $b < \infty$, respectively. We use analogous notation to that in (5), e.g. $L_\pi h := \tfrac{1}{2}\sigma_\pi^2 h'' + \mu_\pi h'$ for any $h \in C^2((a, b))$. Any locally Lipschitz function $\pi : (a, b) \to A$ is a Markov policy, since the Engelbert-Schmidt conditions for the existence and uniqueness of the solution of the corresponding SDE (see e.g. [14, Sec. 5.5]) are satisfied by Assumption 1.

By substituting the state space $\mathbb{R}^d$ with the interval $(a, b)$ in Proposition 1, Theorem 2, Proposition 3 and Theorems 4 and 5 of Section 2, we obtain the results of the present section, which thus solve the control problem in the one-dimensional case. In the interest of brevity, we omit their statements. We stress that the main difference lies in the fact that the proofs, in Sections 5.3 and 5.4 below, rely on the theory of ODEs and scalar SDEs.
In particular, we need to prove that the payoff has a continuous extension to a finite boundary point of the state space.

4. Examples

4.1. Data satisfying Assumptions 1-4. We now describe a class of models that provably satisfies Assumptions 1-4. The main aim in the present section is not to be exhaustive, but merely to demonstrate that the form particularly of Assumptions 3-4 is natural in the context of control problems considered here. The example we give is in dimension one. But it is clear from the construction below that it can easily be generalised. Let $A := [-a, a]$, for some constant $a > 0$, and $\sigma, \mu, f, \alpha : \mathbb{R} \times [-a, a] \to \mathbb{R}$ be given by

$$\sigma(x, p) := \sigma_1(x), \quad \mu(x, p) := \mu_1(x) + p\mu_2, \quad f(x, p) := f_1(x) + f_2(p), \quad \alpha(x, p) \equiv \alpha_0, \tag{8}$$

where $\sigma_1, \mu_1, f_1 \in C^1(\mathbb{R})$, $f_2 \in C^2((-a, a))$ is convex and symmetric (i.e. $f_2(p) = f_2(-p)$ for all $p \in A$) and $\mu_2$ and $\alpha_0$ are constants. For any $h \in \{\sigma_1, \mu_1, f_1, f_2\}$ (resp. $h' \in \{\sigma_1', \mu_1', f_1', f_2'\}$) let the positive constant $C_h$ (resp. $C_{h'}$) satisfy $|h| \le C_h$ (resp. $|h'| \le C_{h'}$). In particular, we assume that the derivatives of $\sigma_1, \mu_1, f_1, f_2$ are bounded. Moreover we may (and do) take $C_{f_2'} := f_2'(a)$. Assume also that $\alpha_0 > 0$ and $\sigma_1^2 > \lambda > 0$, so that Assumption 1 is satisfied, and the scaling function $S \equiv 1$. Then the following proposition holds.

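Proposition 6. Assume $\alpha_0 > C_{\mu_1'} + |\mu_2|\big(2 + C_{f_1'}/C_{f_2'}\big)$ and $|\mu_2| B_2 < L_{f_2} := \inf_{p \in A} f_2''(p)$, where $B_2 := \big[2(C_{f_1} + C_{f_2}) + (C_{\mu_1} + a|\mu_2|) B_1\big]/\lambda$ and $B_1 := (C_{f_1'} + C_{f_2'})/(\alpha_0 - C_{\mu_1'} - |\mu_2|)$, then Assumptions 1-4 hold. Moreover, in Assumptions 3-4 we have $B_K = (B_1, B_2)$ and $C_K = 1$ for any $K > 0$.

Remark 13. It is clear that the assumptions in Proposition 6 define a non-empty subclass of models (8). Moreover, these assumptions are much stronger than what is required by our general Assumptions 3-4 since the proposition yields global (rather than local) bounds on the derivatives of the payoff function and the Lipschitz coefficients of the policies arising in (gPIA).

Proof. Pick $h \in C^2(\mathbb{R})$, such that $|h'(x)| < B_1$ and $|h''(x)| < B_2$ for all $x \in \mathbb{R}$. Then the function $I_h$ in Assumption 3 satisfies $I_h(x) = \operatorname{argmin}_{p \in A} \{p \mu_2 h'(x) + f_2(p)\}$. By assumption we have $|\mu_2 h'(x)| \le |\mu_2| B_1 < C_{f_2'} = f_2'(a)$, implying $I_h(x) = (f_2')^{-1}(-\mu_2 h'(x))$ for all $x \in \mathbb{R}$. Differentiate $I_h$ to obtain $|I_h'(x)| \le |\mu_2| B_2 / L_{f_2} < 1$, $x \in \mathbb{R}$, and note that Assumptions 2-3 follow.

We now establish Assumption 4. The idea is to start with any policy $\pi : \mathbb{R} \to A$ in $C^2(\mathbb{R})$, such that its derivative satisfies $|\pi'| \le 1$ on all of $\mathbb{R}$ (e.g. a constant policy), and apply stochastic flow of diffeomorphisms [21, Sec. V.1] to deduce the necessary regularity of the payoff function $V_\pi$. In the notation from (5), we have $|\mu_\pi'| \le C_{\mu_1'} + |\mu_2|$ and $|\sigma_\pi'| = |\sigma_1'| \le C_{\sigma_1'}$. Hence, for each $x \in \mathbb{R}$, the stochastic exponential $Y = (Y_t)_{t \in \mathbb{R}_+}$, given by

$$Y_t = 1 + \int_0^t \mu_\pi'(X^{\pi,x}_s) Y_s\,ds + \int_0^t \sigma_\pi'(X^{\pi,x}_s) Y_s\,dW_s,$$

exists (we suppress the dependence on $x$ and $\pi$ from the notation). Since the coefficients of SDE (4) are in $C^1(\mathbb{R})$ with bounded and locally Lipschitz first derivative, [21, Sec. V.1, Thm 49] implies that the flow of controlled processes $\{X^{\pi,x}\}_{x \in \mathbb{R}}$ may be constructed on the single probability space so that it is smooth in the initial condition $x$ with $\frac{\partial}{\partial x} X^{\pi,x}_t = Y_t$. The upshot here is that, by the argument in the proof of [9, Prop. 3.2], we obtain a stochastic representation for the derivative $\frac{\partial}{\partial x} \mathbb{E} f_\pi(X^{\pi,x}_t) = \mathbb{E}[Y_t f_\pi'(X^{\pi,x}_t)]$ for every $t \in \mathbb{R}_+$. Since $Y_t = M_t \exp\big(\int_0^t \mu_\pi'(X^{\pi,x}_s)\,ds\big)$, where the stochastic exponential $M = \mathcal{E}\big(\int_0^\cdot \sigma_\pi'(X^{\pi,x}_s)\,dW_s\big)$ is a true martingale by Novikov's condition, the following inequality holds: $\mathbb{E} Y_t \le \exp\big((C_{\mu_1'} + |\mu_2|)t\big)$.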
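Since $\alpha_0 > C_{\mu_1'} + |\mu_2|$ by assumption and the inequality $|f_\pi'| \le C_{f_1'} + C_{f_2'}$ holds, we have $\mathbb{E} \int_0^\infty e^{-\alpha_0 t} Y_t |f_\pi'(X^{\pi,x}_t)|\,dt < B_1$ for all $x \in \mathbb{R}$. Recall that $V_\pi(x) = \mathbb{E} \int_0^\infty e^{-\alpha_0 t} f_\pi(X^{\pi,x}_t)\,dt$. By [21, Sec. V.8, Thm 43], the family of random variables (indexed by $\delta \in (0, 1)$),

$$\frac{1}{\delta} \int_0^\infty e^{-\alpha_0 t} \big| f_\pi(X^{\pi,x+\delta}_t) - f_\pi(X^{\pi,x}_t) \big|\,dt \le \frac{1}{\delta} (C_{f_1'} + C_{f_2'}) \int_0^\infty e^{-\alpha_0 t} \big| X^{\pi,x+\delta}_t - X^{\pi,x}_t \big|\,dt,$$

is uniformly integrable. Hence $\lim_{\delta \to 0} (V_\pi(x + \delta) - V_\pi(x))/\delta$ takes the form

$$V_\pi'(x) = \mathbb{E} \int_0^\infty e^{-\alpha_0 t} Y_t f_\pi'(X^{\pi,x}_t)\,dt,$$

implying $|V_\pi'(x)| < B_1$ for all $x \in \mathbb{R}$. This inequality, the fact $\sigma_1^2 > \lambda$ and Proposition 1 imply $|V_\pi''| < B_2$, concluding the proof.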
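The explicit minimiser used at the start of the proof can be checked numerically. The sketch below assumes the hypothetical choice $f_2(p) = p^2$ (convex and symmetric, with $f_2'(p) = 2p$), so that $(f_2')^{-1}(y) = y/2$ and $I_h(x) = -\mu_2 h'(x)/2$ whenever $|\mu_2 h'(x)| < f_2'(a) = 2a$. It compares this formula with a brute-force minimisation over a grid in $A = [-a, a]$. The function names and the numerical values are illustrative only.

```python
import numpy as np


def minimiser_closed_form(dh, mu2):
    """I_h(x) = (f_2')^{-1}(-mu2 * h'(x)) for the assumed f_2(p) = p^2."""
    return -mu2 * dh / 2.0


def minimiser_brute_force(dh, mu2, a, n=200001):
    """Grid search for argmin over p in [-a, a] of p*mu2*h'(x) + f_2(p)."""
    p = np.linspace(-a, a, n)
    objective = p * mu2 * dh + p**2
    return p[np.argmin(objective)]


if __name__ == "__main__":
    a, mu2 = 1.0, 0.5
    # Values of h'(x) with |mu2 * h'(x)| < f_2'(a) = 2a, so the minimiser is interior.
    for dh in (-3.0, -1.0, 0.0, 2.0, 3.9):
        exact = minimiser_closed_form(dh, mu2)
        approx = minimiser_brute_force(dh, mu2, a)
        print(f"h'(x) = {dh:+.1f}: closed form {exact:+.5f}, grid search {approx:+.5f}")
```

When $|\mu_2 h'(x)|$ exceeds $f_2'(a)$ the minimiser sits at an end point of $A$ instead, which is the situation the first hypothesis of Proposition 6 is designed to exclude.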

Remark 14. The process $Y$ in the proof of Proposition 6 exists in the multidimensional setting, see [21, Sec. V.1, Thm 49]. Hence the same argument works in higher dimensions if we can deduce a bound on the Hessian of the payoff function from the PDE in Proposition 1.

4.2. Numerical example. Consider the one-dimensional control problem: $A = [-1, 1]$, $a = -10$, $b = 10$, $g(a) := a^2$, $g(b) := b^2$, $\sigma(x, p) := 1$, $\mu(x, p) := p$, $\alpha(x, p) := 1$ and $f(x, p) := x^2 + p^2$, which is in the class discussed in Section 4.1. Explicitly, we seek to compute $\inf_{\Pi \in \mathcal{A}(x)} V_\Pi(x)$ for every $x \in (a, b)$, where the payoff $V_\Pi(x)$ of a policy $\Pi$ is defined in Section 3. We implemented (gPIA), with the main step given by (7), in Matlab. The payoff function at each step is obtained as the solution to the differential equation from Proposition 1 with the boundary conditions given by the function $g$. The new policy at each step can be calculated explicitly (cf. the proof of Proposition 6 above). Figures 1 and 2 graph the payoff functions and the policies (colour coded). The initial policy $\pi_0 \equiv 1$ and its payoff correspond to the blue graphs.

Figure 1. The graphs of $V_{\pi_n}$ for $n \in \{0, 1, 2, 3, 4\}$.

Figure 2. The graphs of $\pi_n$ for $n \in \{0, 1, 2, 3, 4\}$.

The graphs suggest that convergence effectively occurs in just a few steps. Figures 3 and 4, containing the graphs of the differences of the consecutive payoffs and policies on the logarithmic scale, confirm this. In Figures 1 and 2 it seems that fewer graphs are presented than is stated in the captions. The reason for this is that the final few graphs coincide. Moreover, the policies only differ on a subinterval $(-2, 2)$, because outside of it they coincide as it is optimal to choose one of the boundary points of $A = [-1, 1]$. Finally, there is no numerical indication that the sequence of policies has more than one accumulation point as they appear to converge very fast indeed.

5. Proofs

5.1. Auxiliary results - the multidimensional case.

The reflection coupling of Lindvall and Rogers [18] and the continuity of the payoff $V_\pi$. We now establish the continuity of the payoff function for a locally Lipschitz Markov policy $\pi$ under Assumption 1.
The reflection coupling of Lindvall and Rogers [18] plays a crucial role in this. In fact, the continuity of $V_\pi$ hinges on the following property of the coupling in [18]: copies of $X^{\pi,x}$ and $X^{\pi,x'}$, started very close to each other, will meet with high probability before moving apart by a certain distance greater than $|x - x'|$ (see Lemma 7 below).
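The reflection that drives this coupling is a Householder matrix $H = I - 2uu^T$, where, in the construction of [18], $u$ is the unit vector in the direction $\sigma_\pi(x')^{-1}(x - x')$. The NumPy sketch below builds $H$ for a single pair of points and checks the three properties the coupling relies on: $H$ is symmetric, orthogonal, and maps $u$ to $-u$. The function name and the sample matrix and vector are hypothetical and chosen only for illustration.

```python
import numpy as np


def reflection_matrix(sigma_at_x_prime, y):
    """Return (H, u) with u = sigma(x')^{-1} y / |sigma(x')^{-1} y|
    and H = I - 2 u u^T, for a nonzero difference y = x - x'."""
    w = np.linalg.solve(sigma_at_x_prime, y)  # sigma(x')^{-1} y without forming the inverse
    u = w / np.linalg.norm(w)
    H = np.eye(len(y)) - 2.0 * np.outer(u, u)
    return H, u


if __name__ == "__main__":
    sigma = np.array([[2.0, 0.5],
                      [0.0, 1.0]])   # an invertible diffusion matrix (illustrative)
    y = np.array([0.3, -0.4])        # difference of the two starting points
    H, u = reflection_matrix(sigma, y)
    print("H is symmetric: ", np.allclose(H, H.T))
    print("H is orthogonal:", np.allclose(H @ H.T, np.eye(2)))
    print("H u equals -u:  ", np.allclose(H @ u, -u))
```

Orthogonality is the property that makes the reflected driving noise a Brownian motion again, by the Lévy characterisation.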

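Figure 3. The graphs of $\log |V_{\pi_{n+1}} - V_{\pi_n}|$ for $n \in \{0, 1, 2, 3, 4\}$.

Figure 4. The graphs of $\log |\pi_{n+1} - \pi_n|$ for $n \in \{0, 1, 2, 3, 4\}$.

We first show that the coupling from [18] can be applied to the diffusion $X^{\pi,\cdot}$. As explained in Remark 4, we may (and do) assume that the dimensions of the noise and the controlled process are equal, i.e. $d = m$. By Assumption 1 above $\sigma_\pi$ and $\mu_\pi$ are bounded and hence [18, Ass. (12)(ii)] holds. Inequality (3) in Assumption 1 implies that $\lambda_{\max}\big(\sigma_\pi^{-1} (\sigma_\pi^{-1})^T\big) \le 1/\lambda$. Hence, by Remark 1, we have $\|\sigma_\pi^{-1}\| \le 1/\sqrt{\lambda}$ and [18, Ass. (12)(ii)] also holds. The assumption in [18, (12)(i)] requires that $\sigma_\pi$ and $\mu_\pi$ are globally Lipschitz. But this assumption is only used in [18] as a guarantee that the corresponding SDE has a unique strong solution, which is the case in our setting under the locally Lipschitz condition in Assumption 1. Hence, for any $x, x' \in \mathbb{R}^d$, the coupling from [18] can be applied to construct the process $(X^{\pi,x}, X^{\pi,x'})$ so that $X^{\pi,x}$ follows SDE (4) and $X^{\pi,x'}$ satisfies

$$X^{\pi,x'}_t = x' + \int_0^t \mu_\pi(X^{\pi,x'}_s)\,ds + \int_0^t \sigma_\pi(X^{\pi,x'}_s) H_s\,dB_s, \qquad t \in [0, \rho_0(Y)),$$

where $\rho_0(Y) := \inf\{t \ge 0 : Y_t = 0\}$ ($\inf \emptyset := \infty$) is the coupling time, $Y := X^{\pi,x} - X^{\pi,x'}$, and

$$H_t := I - 2 u_t u_t^T, \quad \text{defined via} \quad u_t := \frac{\sigma_\pi(X^{\pi,x'}_t)^{-1} Y_t}{|\sigma_\pi(X^{\pi,x'}_t)^{-1} Y_t|} \quad \text{for } t \in [0, \rho_0(Y)), \tag{9}$$

is the reflection in $\mathbb{R}^d$ about the hyperplane orthogonal to the unit vector $u_t$. Moreover, we have $X^{\pi,x}_t = X^{\pi,x'}_t$ for all $t \in [\rho_0(Y), \infty)$. Note also that $H_t \in O(d)$ is an orthogonal matrix for $t \in [0, \rho_0(Y))$ and the process $B' = (B'_t)_{t \in \mathbb{R}_+}$, given by $B'_t := \int_0^t \big(I_{\{s < \rho_0(Y)\}} H_s + I_{\{s \ge \rho_0(Y)\}} \mathrm{Id}\big)\,dB_s$, is a Brownian motion by the Lévy characterisation theorem. Hence $X^{\pi,x'}$ satisfies the SDE $dX^{\pi,x'}_t = \sigma_\pi(X^{\pi,x'}_t)\,dB'_t + \mu_\pi(X^{\pi,x'}_t)\,dt$ with $X^{\pi,x'}_0 = x'$ (see [18, Sec. 3] for more details).

Lemma 7. Fix a locally Lipschitz Markov policy $\pi : \mathbb{R}^d \to A$ and $x \in \mathbb{R}^d$. Then for every $\epsilon \in (0, 1)$ there exists $\varphi \in (0, 1]$ with the property: $\forall \varphi' \in (0, \varphi)$ $\exists \varphi'' \in (0, \varphi')$ such that $\mathbb{P}(\rho_{\varphi'}(Y) < \rho_0(Y)) < \epsilon$ if $|x - x'| < \varphi''$, where $\rho_c(Y) := \inf\{t \ge 0 ; |Y_t| = c\}$ ($\inf \emptyset = \infty$) for any $c > 0$.

Remark 15. Note that the main assumption in [18, Thm. 1] is not satisfied in Lemma 7, as we have no assumption on the global variability of $\sigma_\pi$. Hence the coupling $(X^{\pi,x}, X^{\pi,x'})$ is not necessarily successful even if the starting points $x$ and $x'$ are very close to each other, i.e.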
possibly $\mathbb{P}(\rho_0(Y) < \infty) < 1$ even if $|Y_0| = |x - x'|$ is very close to zero. However, by Lemma 7, the

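coupling will occur with probability at least $1 - \epsilon$ before the diffusions are more than $\varphi' = \varphi'(\epsilon)$ away from each other, implying the continuity of $V_\pi$ (cf. Lemma 8 and Remark 16 below).

Proof. Let $S := |Y|^2$, $\delta := \sigma_\pi(X^{\pi,x}) - \sigma_\pi(X^{\pi,x'})$ and $\beta := \mu_\pi(X^{\pi,x}) - \mu_\pi(X^{\pi,x'})$. Define

$$\alpha_t := \sigma_\pi(X^{\pi,x}_t) - \sigma_\pi(X^{\pi,x'}_t) H_t \quad \text{and} \quad v_t := Y_t/|Y_t| \quad \text{for } t \in [0, \rho_0(Y)). \tag{10}$$

In this proof $x \in \mathbb{R}^d$ is fixed and $x' \in \mathbb{R}^d$ is arbitrary in the ball of radius one centred at $x$. Recall that $\nabla h(z) = 2z$ and $Hh(z) = 2I$ for $h(z) := |z|^2$, $z \in \mathbb{R}^d$, and apply Itô's lemma to $S$:

$$S_t = |x - x'|^2 + \int_0^t 2\sqrt{S_s}\, v_s^T \alpha_s\,dB_s + \int_0^t \Big( 2\sqrt{S_s}\, v_s^T \beta_s + \operatorname{Tr}(\alpha_s \alpha_s^T) \Big)\,ds, \qquad t \in [0, \rho_0(Y)). \tag{11}$$

Our task is to study the behaviour of $S$ when started very close to zero. To do this, we first establish the facts in (12) and (14) below, which in turn allow us to apply time-change and coupling techniques to prove the lemma. We start by proving the following:

$$0 \le \operatorname{Tr}(\alpha_t \alpha_t^T) - |v_t^T \alpha_t|^2 = \operatorname{Tr}(\delta_t \delta_t^T) - |v_t^T \delta_t|^2 \le M_x^2 |Y_t|^2 \quad \text{for } t \in [0, \rho_0(Y) \wedge \rho_1(Y)), \tag{12}$$

where $M_x > 1$ is a Lipschitz constant for $\sigma_\pi$ and $\mu_\pi$ in the ball around $x$ of radius one. The first inequality in (12) follows since the trace is the sum of the eigenvalues of $\alpha_t \alpha_t^T$, which are all non-negative, while $|v_t^T \alpha_t|^2$ is at most the largest eigenvalue. The second inequality follows since $\sigma_\pi$ is Lipschitz on any ball around $x$ and $|Y_t| < 1$ for $t < \rho_1(Y)$. To establish the equality in (12) note that, as $|v^T A|^2 = \operatorname{Tr}(A A^T v v^T) = \operatorname{Tr}(v v^T A A^T)$ for any $A \in \mathbb{R}^{d \times d}$, we have

$$\operatorname{Tr}(\alpha_t \alpha_t^T) - |v_t^T \alpha_t|^2 - \operatorname{Tr}(\delta_t \delta_t^T) + |v_t^T \delta_t|^2 = \operatorname{Tr}\big((I - v_t v_t^T)(\alpha_t \alpha_t^T - \delta_t \delta_t^T)\big).$$

Recall that $H_t^{-1} = H_t^T = H_t$. We therefore find

$$\begin{aligned} \alpha_t \alpha_t^T - \delta_t \delta_t^T &= \sigma_\pi(X^{\pi,x}_t)(I - H_t)\sigma_\pi(X^{\pi,x'}_t)^T + \sigma_\pi(X^{\pi,x'}_t)(I - H_t)\sigma_\pi(X^{\pi,x}_t)^T \\ &= 2\Big( v_t u_t^T \sigma_\pi(X^{\pi,x}_t)^T + \sigma_\pi(X^{\pi,x}_t) u_t v_t^T \Big) \frac{|Y_t|}{|\sigma_\pi(X^{\pi,x'}_t)^{-1} Y_t|}, \end{aligned} \tag{13}$$

where the second equality follows by definition (9) and the identity $v_t = Y_t/|Y_t|$. Hence (12) follows, since $(I - v_t v_t^T) v_t = 0$. Since

$$\operatorname{Tr}\big(v_t u_t^T \sigma_\pi(X^{\pi,x}_t)^T\big) = \big\langle \sigma_\pi(X^{\pi,x}_t)\sigma_\pi(X^{\pi,x'}_t)^{-1} v_t, v_t \big\rangle \frac{|Y_t|}{|\sigma_\pi(X^{\pi,x'}_t)^{-1} Y_t|}$$

holds for times $t \in [0, \rho_0(Y))$, equalities (12) and (13) yield:

$$|v_t^T \alpha_t|^2 \ge |v_t^T \alpha_t|^2 - |v_t^T \delta_t|^2 = 4 \big\langle \sigma_\pi(X^{\pi,x}_t)\sigma_\pi(X^{\pi,x'}_t)^{-1} v_t, v_t \big\rangle \frac{|Y_t|^2}{|\sigma_\pi(X^{\pi,x'}_t)^{-1} Y_t|^2}.$$

Inequality (3) in Assumption 1 implies $\|\sigma_\pi^{-1}\| \le 1/\sqrt{\lambda}$ (cf. the second paragraph of this section). Hence $|Y_t| / |\sigma_\pi(X^{\pi,x'}_t)^{-1} Y_t| \ge \sqrt{\lambda}$.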
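By the definition of $\delta$ above we get

$$|v_t^T \alpha_t|^2 \ge 4\lambda\big(1 + \langle \delta_t \sigma_\pi(X^{\pi,x'}_t)^{-1} v_t, v_t \rangle\big) \ge 4\lambda\big(1 - \|\delta_t \sigma_\pi(X^{\pi,x'}_t)^{-1}\|\big) \ge 4\lambda\big(1 - \|\delta_t\|/\sqrt{\lambda}\big).$$

For any $\epsilon \in (0, 1)$, define

$$\varphi := \min\big\{1,\ \epsilon\sqrt{\lambda}/M_x,\ \epsilon(1 - \epsilon)\lambda/M_x^2\big\},$$

where $M_x$ is as in (12) above. Then, if $|x - x'| < \varphi$ and $t \in [0, \rho_\varphi(Y))$, we have $|Y_t| < \varphi$ and hence $\|\delta_t\| \le M_x |Y_t| \le \epsilon\sqrt{\lambda}$. In particular, we get

$$|v_t^T \alpha_t|^2 \ge 4\lambda(1 - \epsilon) > 0 \quad \text{for any } t \in [0, T), \text{ where } T := \rho_0(Y) \wedge \rho_\varphi(Y). \tag{14}$$

Let $M > 0$ denote a global upper bound on $\|\sigma_\pi\|$ and $|\mu_\pi|$, which exists by Assumption 1. Since the inequalities $|v_t^T \alpha_t| \le \|\alpha_t\| \le \|\sigma_\pi(X^{\pi,x}_t)\| + \|\sigma_\pi(X^{\pi,x'}_t)\| \le 2M$ hold for all $t \in [0, \rho_0(Y))$, the increasing process $[N] = ([N]_t)_{t \in \mathbb{R}_+}$, given by $[N]_t := \int_0^t I_{\{s < \rho_0(Y)\}} |v_s^T \alpha_s|^2\,ds$, is well-defined and $[N]_t < \infty$ for every $t \in \mathbb{R}_+$. Hence $N = (N_t)_{t \in \mathbb{R}_+}$, given by $N_t := \int_0^t I_{\{s < \rho_0(Y)\}} v_s^T \alpha_s\,dB_s$,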

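is a well-defined local martingale with a quadratic variation process given by $[N]$. Let $\tau = (\tau_s)_{s \in \mathbb{R}_+}$ and $W = (W_s)_{s \in \mathbb{R}_+}$ be the Dambis Dubins-Schwarz (DDS) time-change and Brownian motion, respectively, for the local martingale $N$ (see [14, Thm 3.4.6, p. 174]). More precisely, let $\tau_s := \inf\{t \in \mathbb{R}_+ : [N]_t > s\}$ (with $\inf \emptyset = \infty$) be the inverse of $[N]$. Then $W$ satisfies $W_{[N]_t} = N_t$ for all $t \in \mathbb{R}_+$. Moreover it holds that $\tau_s < \infty$ for $s < [N]_\infty := \lim_{t \uparrow \infty} [N]_t$. If $[N]_\infty < \infty$ with positive probability, we have to extend the probability space to support $W$ (see e.g. [14, Prob , p. 175]). This extension however has no bearing on the coupling $(X^{\pi,x}, X^{\pi,x'})$.

Let $\hat\alpha_s := \alpha_{\tau_s}$, $\hat\delta_s := \delta_{\tau_s}$, $\hat\beta_s := \beta_{\tau_s}$, $\hat v_s := v_{\tau_s}$ and $\hat S_s := S_{\tau_s}$ for $s \in [0, [N]_{\rho_0(Y)})$, cf. (10) above. Assume $|x - x'| < \varphi$ and time-change the integrals in (11) (see [14, Prop , p. 176]) to get

$$\hat S_u = |x - x'|^2 + \int_0^u 2\sqrt{\hat S_s}\,dW_s + \int_0^u (1 + \nu_s)\,ds, \qquad \text{for any } u \in [0, [N]_T),$$

where $\nu_s := \big(2\sqrt{\hat S_s}\, \hat v_s^T \hat\beta_s + \operatorname{Tr}(\hat\delta_s \hat\delta_s^T) - |\hat v_s^T \hat\delta_s|^2\big) / |\hat v_s^T \hat\alpha_s|^2$ and $T$ is defined in (14). By (14) it holds that $|\hat v_s^T \hat\alpha_s|^2 \ge 4\lambda(1 - \epsilon)$ for all $s \in [0, [N]_T)$. Then (12) and the definition of $\hat\beta$, $\hat\delta$ and $\nu$ imply the inequalities $\nu_s < M_x^2 |Y_{\tau_s}|^2 / (\lambda(1 - \epsilon))$ for all $s \in [0, [N]_T)$. Any $\varphi' \in (0, \varphi)$ satisfies $\varphi' < \epsilon(1 - \epsilon)\lambda/M_x^2$ and $R := \rho_0(Y) \wedge \rho_{\varphi'}(Y) \le T$. Hence the Lipschitz property of $\sigma_\pi$ and $\mu_\pi$ on the ball of radius $\varphi'$ around $x$ implies

$$\nu_s < \epsilon \quad \text{for all } s \in [0, [N]_R). \tag{15}$$

The SDE $\bar S_t = |x - x'|^2 + 2\int_0^t \sqrt{\bar S_r}\,dW_r + (1 + \epsilon)t$, $t \in \mathbb{R}_+$, for the squared Bessel process of dimension $1 + \epsilon$ has a pathwise unique (and hence strong) solution $\bar S = (\bar S_t)_{t \in \mathbb{R}_+}$, see [6, App. A.3, p. 18]. Note that $\bar S$ is driven by the DDS Brownian motion $W$ introduced above. Hence the coupling $(\bar S, \hat S)$ on the stochastic interval $[0, [N]_R)$ allows us to compare the two processes pathwise. Assume $|x - x'| < \varphi'$. Then the following equality holds:

$$\sqrt{\bar S_t} - \sqrt{\hat S_t} = \frac{1}{2} \int_0^t \Big( \epsilon/\sqrt{\bar S_r} - \nu_r/\sqrt{\hat S_r} \Big)\,dr \qquad \text{for any } t \in [0, [N]_R).$$

Almost surely, the path of the process $\big(\sqrt{\bar S_t} - \sqrt{\hat S_t}\big)_{t \in [0, [N]_R)}$ is continuously differentiable and, by (15), its derivative is strictly positive at every zero of the path. Since the derivative is continuous, it must be strictly positive on a neighbourhood of each zero. This implies that the only zero is at $t = 0$ (i.e. $\bar S_0 = \hat S_0$), and it holds that

$$\bar S_t \ge \hat S_t \quad \text{for all } t \in [0, [N]_R). \tag{16}$$

We now conclude the proof of the lemma.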
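Assume as before that $|x - x'| < \varphi'$ and define $\Upsilon_{\varphi'}(\hat S) := \inf\{s \in [0, [N]_T) : \hat S_s = \varphi'^2\}$ (with $\inf \emptyset = \infty$). Note that the events $\{\Upsilon_{\varphi'}(\hat S) < \infty\}$ and $\{[N]_{\rho_{\varphi'}(Y)} < [N]_{\rho_0(Y)}\}$ coincide, since on either event we have $\Upsilon_{\varphi'}(\hat S) = [N]_R = [N]_{\rho_{\varphi'}(Y)}$. Hence,

$$\{\rho_{\varphi'}(Y) < \rho_0(Y)\} = \{\Upsilon_{\varphi'}(\hat S) < \infty\} \subset \{\bar S \text{ exits the interval } (0, \varphi'^2) \text{ at } \varphi'^2\}, \tag{17}$$

where the inclusion follows by (16). Recall that $s(z) = z^{(1 - \epsilon)/2}$, $z \in \mathbb{R}_+$, is a scale function of the diffusion $\bar S$. Hence $\mathbb{P}\big(\bar S \text{ exits the interval } (0, \varphi'^2) \text{ at } \varphi'^2\big) = s(|x - x'|^2)/s(\varphi'^2) = (|x - x'|/\varphi')^{1 - \epsilon}$. Define $\varphi'' := \epsilon^{1/(1 - \epsilon)} \varphi'$ and note that by (17) we have: $\mathbb{P}(\rho_{\varphi'}(Y) < \rho_0(Y)) < \epsilon$ for any $x' \in \mathbb{R}^d$ satisfying $|x - x'| < \varphi''$.

Lemma 8. Pick a locally Lipschitz Markov policy $\pi : \mathbb{R}^d \to A$ and let Assumption 1 hold. Then the corresponding payoff function in (5), $V_\pi : \mathbb{R}^d \to \mathbb{R}_+$, is continuous.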

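Proof. Fix $x\in\mathbb{R}^d$ and pick arbitrary $\varepsilon\in(0,1)$. By Assumption 1 there exist $\epsilon_0 > 0$, such that $\alpha_\pi\ge\epsilon_0$, and a constant $M > 1$ that simultaneously bounds $|\alpha_\pi|, |f_\pi| < M$ and is a Lipschitz constant on the ball of radius one around $x$ for $\alpha_\pi$ and $f_\pi$. Apply Lemma 7 to $x$, $\epsilon := \varepsilon\epsilon_0/(6M)$ and $\pi$ to obtain $\phi_0\in(0,1]$ such that $\forall\phi\in(0,\phi_0)$ $\exists\phi'\in(0,\phi)$ such that $P(\rho_\phi < \rho_0) < \epsilon$ for every $x'\in\mathbb{R}^d$ satisfying $|x-x'| < \phi'$ (here $\rho_\phi$, $\rho_0$ stand for $\rho_\phi(Y)$, $\rho_0(Y)$, resp.). Specifically, define

$$\phi := \min\Big\{\phi_0/2,\ \frac{\varepsilon}{3(1+M/\epsilon_0)M/\epsilon_0},\ \frac{\varepsilon e\epsilon_0}{3M^2}\Big\} \tag{18}$$

and fix $\phi'\in(0,\phi)$ such that the conclusion of Lemma 7 holds. Throughout this proof we use the notation and notions from Lemma 7. In particular, $(X^{\pi,x}, X^{\pi,x'})$ denotes the coupling of two controlled processes started at $(x,x')$ and we assume that $|x-x'| < \phi'$.

Recall that $V_\pi(x') = EF_\infty(X^{\pi,x'})$ for any $x'\in\mathbb{R}^d$, where $F_\infty(X^{\pi,x'})$ is given in (19). By decomposing the probability space into complementary events $\{\rho_\phi > \rho_0\}$ and $\{\rho_\phi < \rho_0\}$, we obtain the following inequality $|V_\pi(x) - V_\pi(x')| \le A + A' + A''$, where

$$A := E\Big[I_{\{\rho_\phi>\rho_0\}}\Big|\int_0^{\rho_0}\Big(e^{-\int_0^t\alpha_\pi(X_s^{\pi,x})ds}f_\pi(X_t^{\pi,x}) - e^{-\int_0^t\alpha_\pi(X_s^{\pi,x'})ds}f_\pi(X_t^{\pi,x'})\Big)dt\Big|\Big],$$

$$A' := E\Big[I_{\{\rho_\phi>\rho_0\}}\Big|\int_{\rho_0}^{\infty}\Big(e^{-\int_0^t\alpha_\pi(X_s^{\pi,x})ds}f_\pi(X_t^{\pi,x}) - e^{-\int_0^t\alpha_\pi(X_s^{\pi,x'})ds}f_\pi(X_t^{\pi,x'})\Big)dt\Big|\Big],$$

$$A'' := E\Big[I_{\{\rho_\phi<\rho_0\}}\Big|\int_0^{\infty}\Big(e^{-\int_0^t\alpha_\pi(X_s^{\pi,x})ds}f_\pi(X_t^{\pi,x}) - e^{-\int_0^t\alpha_\pi(X_s^{\pi,x'})ds}f_\pi(X_t^{\pi,x'})\Big)dt\Big|\Big].$$

Hence, by Lemma 7, we have $A'' \le P(\rho_\phi < \rho_0)\,2M/\epsilon_0 < \epsilon\,2M/\epsilon_0 = \varepsilon/3$. Since in the summands $A$ and $A'$ the coupling succeeds before the components of $(X^{\pi,x}, X^{\pi,x'})$ grow at least $\phi$ apart, we can control these terms using the local regularity of $\alpha_\pi$ and $f_\pi$.

Consider $A$. Add and subtract $e^{-\int_0^t\alpha_\pi(X_s^{\pi,x})ds}f_\pi(X_t^{\pi,x'})$ to obtain the bound:

$$A \le E\Big[I_{\{\rho_\phi>\rho_0\}}\int_0^{\rho_0}\Big(e^{-\epsilon_0 t}\big|f_\pi(X_t^{\pi,x}) - f_\pi(X_t^{\pi,x'})\big| + M\Big|e^{-\int_0^t\alpha_\pi(X_s^{\pi,x})ds} - e^{-\int_0^t\alpha_\pi(X_s^{\pi,x'})ds}\Big|\Big)dt\Big].$$

On the event $\{\rho_\phi > \rho_0\}$, for $t < \rho_\phi$ it holds that $|X_t^{\pi,x} - X_t^{\pi,x'}| < \phi$. Since $z\mapsto -e^{-z}$ has a positive derivative bounded above by one for $z\in\mathbb{R}_+$, $\alpha_\pi\ge\epsilon_0$ and both $f_\pi$ and $\alpha_\pi$ are Lipschitz with constant $M$ on the ball of radius $\phi$ around $x$, we get

$$A \le E\Big[I_{\{\rho_\phi>\rho_0\}}\int_0^{\rho_0}\Big(M\phi e^{-\epsilon_0 t} + Me^{-\epsilon_0 t}\int_0^t\big|\alpha_\pi(X_s^{\pi,x}) - \alpha_\pi(X_s^{\pi,x'})\big|ds\Big)dt\Big] < \phi\Big(\frac{M}{\epsilon_0} + \frac{M^2}{\epsilon_0^2}\Big) \le \frac{\varepsilon}{3},$$

where the last inequality follows from (18).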
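Furthermore, since $X_t^{\pi,x} = X_t^{\pi,x'}$ for all $t\ge\rho_0$, it holds that the following expectation equals $A'$:

$$E\Big[I_{\{\rho_\phi>\rho_0\}}\,e^{-\rho_0\epsilon_0}\Big|e^{-\int_0^{\rho_0}(\alpha_\pi(X_s^{\pi,x})-\epsilon_0)ds} - e^{-\int_0^{\rho_0}(\alpha_\pi(X_s^{\pi,x'})-\epsilon_0)ds}\Big|\,\Big|\int_{\rho_0}^{\infty}e^{-\int_{\rho_0}^{t}\alpha_\pi(X_s^{\pi,x})ds}f_\pi(X_t^{\pi,x})\,dt\Big|\Big].$$

Since $|f_\pi| < M$, $\alpha_\pi\ge\epsilon_0$, $|e^{-z} - e^{-y}|\le|z-y|$ for $z,y\in\mathbb{R}_+$ and $te^{-\epsilon_0 t}\le 1/(e\epsilon_0)$ for $t\in\mathbb{R}_+$, we find

$$A' \le E\Big[I_{\{\rho_\phi>\rho_0\}}Me^{-\rho_0\epsilon_0}\int_0^{\rho_0}\big|\alpha_\pi(X_s^{\pi,x}) - \alpha_\pi(X_s^{\pi,x'})\big|ds\Big] \le M^2\phi\,E\big[I_{\{\rho_\phi>\rho_0\}}e^{-\rho_0\epsilon_0}\rho_0\big] \le \frac{M^2\phi}{e\epsilon_0},$$

which is by (18) less than $\varepsilon/3$. Hence, for any $|x-x'| < \phi'$, we proved that $|V_\pi(x) - V_\pi(x')| < \varepsilon$, which concludes the proof of the lemma.

Remark 16. The proofs of Lemmas 7 and 8 show that if the locally Lipschitz property in Assumption 1 is substituted by the globally Lipschitz requirement, we can conclude that the payoff function $V_\pi$ is in fact uniformly continuous. However, the coupling from [18] may still not be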

successful, since the global Lipschitz condition controls globally the local variability of the coefficients. The coupling may fail because the assumption in [18, Thm. 1] constrains the global variability of $\sigma_\pi$. In fact, the idea of the proof of Lemma 7 can be used to construct an example where $P(\rho_0(Y) < \infty) < 1$ by bounding the norm of $|Y|^2$ from below by a squared Bessel process of dimension greater than two on an event of positive probability.

A version of the Ascoli-Arzelà Theorem. The following fact is key for proving the existence of the optimal strategy and showing that a subsequence of $\{\pi_N\}_{N\in\mathbb{N}}$ in gPIA converges to it.

Lemma 9. Let $(M_1,d_1)$ and $(M_2,d_2)$ be compact metric spaces, and for every $n\in\mathbb{N}$ let $f_n: M_1\to M_2$. If the sequence $\{f_n\}_{n\in\mathbb{N}}$ is equicontinuous, i.e.

$$\forall\epsilon>0\ \exists\delta>0\ \forall x,y\in M_1\ \forall n\in\mathbb{N}:\quad d_1(x,y)<\delta \implies d_2(f_n(x),f_n(y))<\epsilon,$$

then there exists a uniformly convergent subsequence $\{f_{n_k}\}_{k\in\mathbb{N}}$, i.e. $\exists f: M_1\to M_2$ such that for every $\epsilon>0$ there exists $N\in\mathbb{N}$ such that $\sup_{x\in M_1}d_2(f_{n_k}(x),f(x))<\epsilon$ for all $k\ge N$.

Proof. Let $B(x,1/m) := \{y\in M_1 : d_1(x,y)<1/m\}$ be a ball of radius $1/m$, $m\in\mathbb{N}$, centred at $x\in M_1$. Since $M_1$ is compact and metric, it is totally bounded: $\exists S_m\subseteq M_1$ finite satisfying $M_1 = \bigcup_{x\in S_m}B(x,1/m)$. Then $S := \bigcup_{m\in\mathbb{N}}S_m = \{x_n\in M_1;\ n\in\mathbb{N}\}$ is countable and dense in $M_1$.

We now apply the standard diagonalisation argument to find the subsequence in the lemma. Let $\iota_1:\mathbb{N}\to\mathbb{N}$ be an increasing function defining a subsequence $\{f_{\iota_1(n)}\}_{n\in\mathbb{N}}$ that converges at $x_1$, i.e. $\lim_{n\to\infty}f_{\iota_1(n)}(x_1)$ exists in $M_2$. Such a function $\iota_1$ exists since $M_2$ is compact. Assume now that we have constructed an increasing $\iota_k:\mathbb{N}\to\mathbb{N}$ such that $\{f_{\iota_k(n)}\}_{n\in\mathbb{N}}$ converges on the set $\{x_1,\ldots,x_k\}$ for some $k\in\mathbb{N}$. Then there exists an increasing $\iota:\mathbb{N}\to\mathbb{N}$ such that the sequence of functions $\{f_{\iota_{k+1}(n)}\}_{n\in\mathbb{N}}$, where $\iota_{k+1} := \iota_k\circ\iota$, converges at $x_{k+1}$ as well as on the set $\{x_1,\ldots,x_k\}$, as it is a subsequence of $\{f_{\iota_k(n)}\}_{n\in\mathbb{N}}$. Since $k\in\mathbb{N}$ was arbitrary, we have defined a sequence of subsequences of $\{f_n\}_{n\in\mathbb{N}}$, such that the $k$-th subsequence converges on $\{x_1,\ldots,x_k\}$. Consider the diagonal subsequence $\{f_{n_k}\}_{k\in\mathbb{N}}$, $f_{n_k} := f_{\iota_k(k)}$ for any $k\in\mathbb{N}$. By construction it converges on $S$.
We now prove that it is uniformly Cauchy, which implies uniform convergence since $M_2$ is complete. Pick any $\epsilon>0$. By equicontinuity $\exists m\in\mathbb{N}$ such that for any $k\in\mathbb{N}$ and $x,y\in M_1$ satisfying $d_1(x,y)<1/m$, it holds that $d_2(f_{n_k}(x),f_{n_k}(y))<\epsilon/3$. Furthermore, since $S_m$ is finite, $\exists N\in\mathbb{N}$ such that for all natural numbers $k_1,k_2\ge N$ we have $d_2(f_{n_{k_1}}(y),f_{n_{k_2}}(y))<\epsilon/3$ for all $y\in S_m$. Finally, for any $x\in M_1$ there exists $y\in S_m$ such that $d_1(x,y)<1/m$. Hence, for any $k_1,k_2\ge N$ it holds that

$$d_2(f_{n_{k_1}}(x),f_{n_{k_2}}(x)) \le d_2(f_{n_{k_1}}(x),f_{n_{k_1}}(y)) + d_2(f_{n_{k_1}}(y),f_{n_{k_2}}(y)) + d_2(f_{n_{k_2}}(y),f_{n_{k_2}}(x)) < \epsilon.$$

Since $x\in M_1$ was arbitrary, the lemma follows.

A uniformly integrable martingale. If the process $X^{\pi,x}$ in (4), controlled by a Markov policy $\pi$, exists for all $x\in\mathbb{R}^d$, then $X^{\pi,\cdot}$ is a strong Markov process [14, Thm 4.3, p. 322], since $\sigma$ and $\mu$ are bounded by Assumption 1. Define the additive functional $F(X^{\pi,x}) = (F_t(X^{\pi,x}))_{t\in[0,\infty]}$,

$$F_t(X^{\pi,x}) := \int_0^t e^{-\int_0^u\alpha_\pi(X_s^{\pi,x})ds}f_\pi(X_u^{\pi,x})\,du \quad\text{for } t\in[0,\infty]. \tag{19}$$

Remark 17. Note that $V_\pi(x) = EF_\infty(X^{\pi,x})$ and, by Assumption 1, the process $F(X^{\pi,x})$ is bounded by some constant $C>0$. Hence $|F_t(X^{\pi,x})| < C$ and $|V_\pi(x)| < C$.
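As a worked check on Remark 17, the derivation below makes the constant explicit under the bounds used in the proof of Lemma 8, namely $|f_\pi| < M$ and $\alpha_\pi \ge \epsilon_0 > 0$. Treating these bounds as holding on all of $\mathbb{R}^d$ is an assumption made here for illustration, since the precise form of Assumption 1 is not restated in this section.

```latex
\begin{align*}
|F_t(X^{\pi,x})|
  &\le \int_0^t e^{-\int_0^u \alpha_\pi(X_s^{\pi,x})\,ds}\,
        \bigl|f_\pi(X_u^{\pi,x})\bigr|\,du \\
  &\le \int_0^\infty e^{-\epsilon_0 u}\,M\,du
   \;=\; \frac{M}{\epsilon_0},
\qquad\text{so}\qquad
|V_\pi(x)| = \bigl|\mathbb{E}\,F_\infty(X^{\pi,x})\bigr| \le \frac{M}{\epsilon_0}.
\end{align*}
```

Under these assumptions any constant $C > M/\epsilon_0$ serves in Remark 17, uniformly in the policy $\pi$ and the starting point $x$.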

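Lemma 10. The following holds for every Markov policy $\pi$, $x\in\mathbb{R}^d$ and $(\mathcal{F}_t)$-stopping time $T$:

$$E\big[F_\infty(X^{\pi,x})\,\big|\,\mathcal{F}_T\big] = F_T(X^{\pi,x}) + I_{\{T<\infty\}}e^{-\int_0^T\alpha_\pi(X_s^{\pi,x})ds}V_\pi(X_T^{\pi,x}).$$

In particular, the process $M = (M_r)_{r\in[0,\infty]}$ is a uniformly integrable martingale, where $M_r := F_r(X^{\pi,x}) + I_{\{r<\infty\}}e^{-\int_0^r\alpha_\pi(X_s^{\pi,x})ds}V_\pi(X_r^{\pi,x})$.

Proof. The following calculation implies the lemma:

$$\begin{aligned}
E\big[F_\infty(X^{\pi,x})\,\big|\,\mathcal{F}_T\big]
&= F_T(X^{\pi,x}) + E\Big[I_{\{T<\infty\}}\int_0^\infty e^{-\int_0^{t+T}\alpha_\pi(X_s^{\pi,x})ds}f_\pi(X_{t+T}^{\pi,x})\,dt\,\Big|\,\mathcal{F}_T\Big] \\
&= F_T(X^{\pi,x}) + I_{\{T<\infty\}}e^{-\int_0^T\alpha_\pi(X_s^{\pi,x})ds}\,E\Big[\int_0^\infty e^{-\int_0^t\alpha_\pi(X_{s+T}^{\pi,x})ds}f_\pi(X_{t+T}^{\pi,x})\,dt\,\Big|\,\mathcal{F}_T\Big] \\
&= F_T(X^{\pi,x}) + I_{\{T<\infty\}}e^{-\int_0^T\alpha_\pi(X_s^{\pi,x})ds}V_\pi(X_T^{\pi,x}),
\end{aligned}$$

where we applied the strong Markov property of $X^{\pi,\cdot}$ in the last step.

Proofs of results in Section 2.

Proof of Proposition 1. Assume $m = d$, cf. Remark 4. It suffices to prove that the PDE holds on the ball $D := \{y\in\mathbb{R}^d : |y-x_0| < 1\}$ for any $x_0\in\mathbb{R}^d$. Fix $x\in D$ and define $\tau := \inf\{t\in\mathbb{R}_+ : X_t^{\pi,x}\in\partial D\}$ (with $\inf\emptyset = \infty$) to be the first time the process $X^{\pi,x}$ hits the boundary $\partial D := \{y\in\mathbb{R}^d : |y-x_0| = 1\}$ of $D$. Note that, by Assumption 1, we have $\tau<\infty$. Let $v\in C^2(D)\cap C(\bar D)$, where $\bar D := D\cup\partial D$, denote a solution of the boundary value problem $L_\pi v - \alpha_\pi v + f_\pi = 0$ in $D$, where $v = V_\pi$ on $\partial D$. Since $\pi$ is locally Lipschitz and (2) in Assumption 1 holds, the coefficients $\sigma_\pi, \mu_\pi, f_\pi, \alpha_\pi$ are $1/2$-Hölder (in fact Lipschitz) on $\bar D$. The boundary data $V_\pi|_{\partial D}$ is continuous by Lemma 8, $\alpha_\pi\ge 0$ and $\sigma_\pi$ satisfies (3). Hence, by [1, Thm 19, p. 87], the function $v$ exists, is unique and $Hv$ is $1/2$-Hölder.

Note that, for all $t\in[0,\infty]$, we have $X_{t\wedge\tau}^{\pi,x}\in\bar D$. Hence we can define

$$Y_t := F_{t\wedge\tau}(X^{\pi,x}) + e^{-\int_0^{t\wedge\tau}\alpha_\pi(X_r^{\pi,x})dr}\,v(X_{t\wedge\tau}^{\pi,x}) \quad\text{for } t\in[0,\infty], \tag{20}$$

where the process $F(X^{\pi,x})$ is given in (19) above. The process $Y = (Y_t)_{t\in[0,\infty]}$ is bounded by a constant and by definition converges almost surely, $\lim_{t\to\infty}Y_t = Y_\infty$. Since $v$ solves the boundary value problem above and $X^{\pi,x}$ satisfies SDE (4), Itô's formula on the stochastic interval $[0,\tau]\cap\mathbb{R}_+$ yields

$$Y_t = v(x) + \int_0^{t\wedge\tau}e^{-\int_0^s\alpha_\pi(X_r^{\pi,x})dr}\,(\nabla v^T\sigma_\pi)(X_s^{\pi,x})\,dB_s, \quad t\in[0,\infty],$$

making $Y$ into a local martingale. Since $Y$ is bounded, it is a uniformly integrable martingale satisfying $v(x) = Y_0 = E[Y_\infty]$.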
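Since $v = V_\pi$ on $\partial D$ and $X_\tau^{\pi,x}\in\partial D$, the definition of $Y$ in (20) and Lemma 10 (applied to the stopping time $T := \tau$) yield

$$Y_\infty = F_\tau(X^{\pi,x}) + e^{-\int_0^\tau\alpha_\pi(X_r^{\pi,x})dr}V_\pi(X_\tau^{\pi,x}) = E\big[F_\infty(X^{\pi,x})\,\big|\,\mathcal{F}_\tau\big],$$

implying $v(x) = EF_\infty(X^{\pi,x}) = V_\pi(x)$. The uniqueness follows similarly: let $v$ be another bounded solution of the Poisson equation on $\mathbb{R}^d$. Define the process $Y$ as in (20) with $\tau\equiv\infty$ and $t<\infty$. As above we have $v(x) = EY_t$ for all $t\in\mathbb{R}_+$. Then the DCT, applicable since $v$ is bounded, yields $v(x) = \lim_{t\to\infty}EY_t = V_\pi(x)$.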
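The boundary value problem in the proof above can be illustrated on a toy one-dimensional case. The sketch below is not the authors' method (the paper deliberately avoids discretisation); it assumes constant coefficients, no drift, the interval $(-1,1)$ and zero boundary data, so that $\tfrac12\sigma^2 v'' - \alpha v + f = 0$ has the closed-form solution $v(x) = (f/\alpha)\,(1 - \cosh(kx)/\cosh(k))$ with $k = \sqrt{2\alpha}/\sigma$. The names `solve_poisson_1d` and `exact_solution` are hypothetical.

```python
import math


def exact_solution(x, sigma, alpha, f):
    """Closed-form solution of 0.5*sigma^2 v'' - alpha v + f = 0, v(-1) = v(1) = 0."""
    k = math.sqrt(2.0 * alpha) / sigma
    return (f / alpha) * (1.0 - math.cosh(k * x) / math.cosh(k))


def solve_poisson_1d(sigma, alpha, f, n=200):
    """Second-order finite differences on a uniform grid of n intervals.

    Interior equation: off*v[i-1] + diag*v[i] + off*v[i+1] = -f,
    solved with the Thomas algorithm for tridiagonal systems.
    """
    h = 2.0 / n
    off = sigma ** 2 / (2.0 * h * h)
    diag = -2.0 * off - alpha
    m = n - 1  # number of interior nodes

    # Forward elimination.
    c_prime = [0.0] * m
    d_prime = [0.0] * m
    c_prime[0] = off / diag
    d_prime[0] = -f / diag
    for i in range(1, m):
        denom = diag - off * c_prime[i - 1]
        c_prime[i] = off / denom
        d_prime[i] = (-f - off * d_prime[i - 1]) / denom

    # Back substitution.
    v = [0.0] * m
    v[-1] = d_prime[-1]
    for i in range(m - 2, -1, -1):
        v[i] = d_prime[i] - c_prime[i] * v[i + 1]

    # Attach the zero boundary values at x = -1 and x = 1.
    return [0.0] + v + [0.0]
```

In this toy setting the solution is the discounted payoff of a Brownian motion with volatility `sigma`, running cost `f` and discount rate `alpha`, stopped on leaving $(-1,1)$; comparing the grid values with `exact_solution` gives a quick sanity check of the probabilistic representation $v = V_\pi$.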

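Proof of Theorem 2. Let $\pi_n$ and $\pi_{n+1}$ be as in gPIA, $n\in\mathbb{N}\cup\{0\}$. Define $Y = (Y_t)_{t\in\mathbb{R}_+}$ by

$$Y_t := F_t(X^{\pi_{n+1},x}) + e^{-\int_0^t\alpha_{\pi_{n+1}}(X_r^{\pi_{n+1},x})dr}\,V_{\pi_n}(X_t^{\pi_{n+1},x}), \quad t\in\mathbb{R}_+, \tag{21}$$

where $F(X^{\pi_{n+1},x})$ is given in (19) above. Define $\tau_m := \inf\{t\ge0 : |X_t^{\pi_{n+1},x}| = m\}$ for any fixed $m > |x|$ and note that $\tau_m<\infty$ by Assumption 1. Itô's formula, applicable by Proposition 1, yields

$$Y_{t\wedge\tau_m} = V_{\pi_n}(x) + M_t + \int_0^{t\wedge\tau_m}e^{-\int_0^s\alpha_{\pi_{n+1}}(X_r^{\pi_{n+1},x})dr}\,\big(f_{\pi_{n+1}} + L_{\pi_{n+1}}V_{\pi_n} - \alpha_{\pi_{n+1}}V_{\pi_n}\big)(X_s^{\pi_{n+1},x})\,ds,$$

where $M = (M_t)_{t\in\mathbb{R}_+}$, $M_t := \int_0^{t\wedge\tau_m}e^{-\int_0^s\alpha_{\pi_{n+1}}(X_r^{\pi_{n+1},x})dr}\,(\nabla V_{\pi_n}^T\sigma_{\pi_{n+1}})(X_s^{\pi_{n+1},x})\,dB_s$, is a local martingale. Since the functions $\sigma_{\pi_{n+1}}$ and $\nabla V_{\pi_n}$ are bounded on the ball $\{y\in\mathbb{R}^d : |y|\le m\}$ by Assumption 1 and Proposition 1, respectively, and $\alpha_{\pi_{n+1}} > \epsilon_0 > 0$, the quadratic variation of $M$ is bounded above by a constant. Hence $M$ is a uniformly integrable martingale. In particular, $EM_t = 0$ for all $t\in\mathbb{R}_+$. By gPIA and Proposition 1, we have

$$\big(f_{\pi_{n+1}} + L_{\pi_{n+1}}V_{\pi_n} - \alpha_{\pi_{n+1}}V_{\pi_n}\big)S_{\pi_{n+1}} \le \big(f_{\pi_n} + L_{\pi_n}V_{\pi_n} - \alpha_{\pi_n}V_{\pi_n}\big)S_{\pi_n} = 0 \quad\text{on } \mathbb{R}^d.$$

Since $S_{\pi_{n+1}} > 0$, we have $EY_{t\wedge\tau_m} \le V_{\pi_n}(x)$. Hence (21), Assumption 1 and the DCT, as $t\to\infty$, yield

$$V_{\pi_n}(x) \ge EF_{\tau_m}(X^{\pi_{n+1},x}) + E\Big[V_{\pi_n}(X_{\tau_m}^{\pi_{n+1},x})\,e^{-\int_0^{\tau_m}\alpha_{\pi_{n+1}}(X_r^{\pi_{n+1},x})dr}\Big].$$

Hence, by Remark 17, we have $V_{\pi_n}(x) \ge EF_{\tau_m}(X^{\pi_{n+1},x}) - C\,Ee^{-\epsilon_0\tau_m}$. Since $X^{\pi_{n+1},x}$ satisfies SDE (4) for all $t\in\mathbb{R}_+$, we have $\lim_{m\to\infty}\tau_m = \infty$. The DCT and Remark 17 yield

$$V_{\pi_{n+1}}(x) = EF_\infty(X^{\pi_{n+1},x}) = \lim_{m\to\infty}\big(EF_{\tau_m}(X^{\pi_{n+1},x}) - C\,Ee^{-\epsilon_0\tau_m}\big) \le V_{\pi_n}(x),$$

which concludes the proof.

Proof of Proposition 3. Run gPIA to produce a sequence of policies $\{\pi_N\}_{N\in\mathbb{N}}$, starting from a constant policy $\pi_0$. Fix an arbitrary $K>0$ and consider the restriction of this sequence onto the closed ball $D_K$. Since the Lipschitz constant of $\pi_0$ is equal to zero and hence smaller than $C_K$, Assumption 4 implies $V_{\pi_0}\in S_{B_K,K}$. Assumption 3 implies that the Lipschitz constant of $\pi_1$ is also at most $C_K$. Iterating this argument implies that all the policies in the sequence $\{\pi_N\}_{N\in\mathbb{N}}$ have the same Lipschitz constant on $D_K$, making it equicontinuous on $D_K$. By Lemma 9 above, there exists a subsequence that converges uniformly on $D_K$ to a function $\pi^0: D_K\to A$. Moreover, $\pi^0$ is also Lipschitz with a constant bounded above by $C_K$.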
Let $K_1 := 2K$ and repeat the argument above for $K_1$ and the subsequence of $\{\pi_N\}_{N\in\mathbb{N}}$ constructed in the previous paragraph. This yields a further subsequence of the policies that converges uniformly to a Lipschitz function $\pi^1: D_{K_1}\to A$ with the Lipschitz constant bounded above by $C_{K_1}$. Since the sequence we started with converges pointwise to $\pi^0$ on $D_K\subset D_{K_1}$, so must every subsequence of it. Hence it holds that $\pi^1(x) = \pi^0(x)$ for all $x\in D_K$. For $k\in\mathbb{N}$, let $K_k := 2K_{k-1}$ and construct inductively $\pi^k: D_{K_k}\to A$ as above. Then the function $\pi_{\lim}:\mathbb{R}^d\to A$, given by $\pi_{\lim}(x) := \pi^n(x)$ for any $n\in\mathbb{N}$ such that $x\in D_{K_n}$, is well-defined and locally Lipschitz. Let the policy $\pi_{n_k}:\mathbb{R}^d\to A$ be the $k$-th element of the convergent subsequence used to define $\pi^k: D_{K_k}\to A$. Then, by construction, the diagonal subsequence $\{\pi_{n_k}\}_{k\in\mathbb{N}}$ of $\{\pi_N\}_{N\in\mathbb{N}}$ converges uniformly to $\pi_{\lim}$ on $D_K$ for any $K>0$.

Proof of Theorem 4. Let $\{\pi_{n_k}\}_{k\in\mathbb{N}}$ be a subsequence of the output of gPIA, $\{\pi_N\}_{N\in\mathbb{N}}$, that converges locally uniformly to a policy $\pi_{\lim} = \lim_{k\to\infty}\pi_{n_k}$. By (6) and Theorem 2, $V_{\pi_{n_k}}\searrow V_{\lim}$

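as $k\to\infty$. Fix $K>0$ and let $\tau_K := \inf\{t\in\mathbb{R}_+ : X_t^{\pi_{\lim},x} - x\in\partial D_K\}$ be the first time $X^{\pi_{\lim},x}$ hits the boundary of the closed ball $x + D_K$ with radius $K$, centred at an arbitrary $x\in\mathbb{R}^d$. Pick $k\in\mathbb{N}$, $t\in\mathbb{R}_+$ and define

$$S_t^k := \int_0^t e^{-\int_0^s\alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})dr}f_{\pi_{n_k}}(X_s^{\pi_{\lim},x})\,ds + e^{-\int_0^t\alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})dr}V_{\pi_{n_k}}(X_t^{\pi_{\lim},x}).$$

Apply Itô's formula to the process $S^k = (S_t^k)_{t\in\mathbb{R}_+}$ on the stochastic interval $[0,\tau_K)$ to get

$$\begin{aligned}
S^k_{t\wedge\tau_K} = V_{\pi_{n_k}}(x) &+ \int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})dr}\,(\nabla V_{\pi_{n_k}}^T\sigma_{\pi_{\lim}})(X_s^{\pi_{\lim},x})\,dB_s \\
&+ \int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})dr}\,\big(f_{\pi_{n_k}} + L_{\pi_{\lim}}V_{\pi_{n_k}} - \alpha_{\pi_{n_k}}V_{\pi_{n_k}}\big)(X_s^{\pi_{\lim},x})\,ds.
\end{aligned}$$

Note that $\sigma_{\pi_{\lim}}$ and $\nabla V_{\pi_{n_k}}$ are bounded on $x + D_K$ by Assumption 1 and Proposition 1, respectively, and $\alpha_{\pi_{n_k}} > \epsilon_0 > 0$. Hence the quadratic variation of the stochastic integral is bounded, making it into a true martingale. This fact and the equality $\alpha_{\pi_{n_k}}V_{\pi_{n_k}} - f_{\pi_{n_k}} = L_{\pi_{n_k}}V_{\pi_{n_k}}$ (Prop. 1) yield

$$\begin{aligned}
ES^k_{t\wedge\tau_K} &= V_{\pi_{n_k}}(x) + E\int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})dr}\,\big(f_{\pi_{n_k}} + L_{\pi_{\lim}}V_{\pi_{n_k}} - \alpha_{\pi_{n_k}}V_{\pi_{n_k}}\big)(X_s^{\pi_{\lim},x})\,ds \\
&= V_{\pi_{n_k}}(x) + E\int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})dr}\,\big(L_{\pi_{\lim}}V_{\pi_{n_k}} - L_{\pi_{n_k}}V_{\pi_{n_k}}\big)(X_s^{\pi_{\lim},x})\,ds.
\end{aligned} \tag{22}$$

Note $[L_{\pi_{\lim}} - L_{\pi_{n_k}}]V_{\pi_{n_k}} = (\mu_{\pi_{\lim}} - \mu_{\pi_{n_k}})^T\nabla V_{\pi_{n_k}} + \tfrac12\mathrm{Tr}\big((\sigma_{\pi_{\lim}} + \sigma_{\pi_{n_k}})^T HV_{\pi_{n_k}}(\sigma_{\pi_{\lim}} - \sigma_{\pi_{n_k}})\big)$. Since, for every $k$, $V_{\pi_{n_k}}$ solves the corresponding Poisson equation in Proposition 1 and, by Assumption 1 and Remark 17, the family of functions $\{\sigma_{\pi_{n_k}}, \mu_{\pi_{n_k}}, \alpha_{\pi_{n_k}}, f_{\pi_{n_k}}, V_{\pi_{n_k}} : k\in\mathbb{N}\}$ is uniformly bounded on the ball $x + D_K$, the Schauder boundary estimates for elliptic PDEs [1, p. 86] imply that the sequences $\{\nabla V_{\pi_{n_k}}\}_{k\in\mathbb{N}}$ and $\{HV_{\pi_{n_k}}\}_{k\in\mathbb{N}}$ are also uniformly bounded on $x + D_K$. Since $\alpha_{\pi_{n_k}} > \epsilon_0 > 0$ for all $k\in\mathbb{N}$ and the limits $\lim_{k\to\infty}\mu_{\pi_{n_k}} = \mu_{\pi_{\lim}}$ and $\lim_{k\to\infty}\sigma_{\pi_{n_k}} = \sigma_{\pi_{\lim}}$ are uniform on $x + D_K$, the DCT and the equality in (22) imply $\lim_{k\to\infty}ES^k_{t\wedge\tau_K} = V_{\lim}(x)$. Hence, the definition of $S^k$ above, Assumption 1, Remark 17 and a further application of the DCT yield

$$V_{\lim}(x) = E\int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\pi_{\lim}}(X_r^{\pi_{\lim},x})dr}f_{\pi_{\lim}}(X_s^{\pi_{\lim},x})\,ds + E_{t\wedge\tau_K}, \tag{23}$$

where $E_{t\wedge\tau_K} := E\big[e^{-\int_0^{t\wedge\tau_K}\alpha_{\pi_{\lim}}(X_r^{\pi_{\lim},x})dr}\,V_{\lim}(X_{t\wedge\tau_K}^{\pi_{\lim},x})\big]$. By (6) and Remark 17, the inequality $|V_{\lim}(y)|\le C$ holds for all $y\in\mathbb{R}^d$. By Assumption 1 we hence get

$$\limsup_{t\wedge K\to\infty}|E_{t\wedge\tau_K}| \le C\limsup_{t\wedge K\to\infty}Ee^{-\epsilon_0(t\wedge\tau_K)} = 0,$$

since $\tau_K\to\infty$ as $K\to\infty$.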
The DCT applied to the first summand in (23), as $t\wedge K\to\infty$, yields the equality $V_{\lim}(x) = V_{\pi_{\lim}}(x)$. Since $x\in\mathbb{R}^d$ was arbitrary, the theorem follows.

Proof of Theorem 5. The second assertion in the theorem follows from the first one and Theorem 4. We now establish the first assertion of Theorem 5. Equip $A\times A$ with a product metric, e.g. $d((p_1,p_2),(a_1,a_2)) := \max\{d_A(a_1,p_1), d_A(a_2,p_2)\}$, and let $\{\pi_N\}_{N\in\mathbb{N}}$ be constructed by the gPIA. As in the proof of Proposition 3, $\{(\pi_{N+1},\pi_N):\mathbb{R}^d\to A\times A\}_{N\in\mathbb{N}}$ are Lipschitz on a closed ball $D_K$ of radius $K>0$ with the Lipschitz constant $C_K$, independent of $N$. Hence, as in the proof of Proposition 3, there exists a subsequence $\{(\pi_{1+n_k},\pi_{n_k})\}_{k\in\mathbb{N}}$ that converges uniformly on every compact subset of $\mathbb{R}^d$ to a locally Lipschitz function $(\pi'_{\lim},\pi_{\lim}):\mathbb{R}^d\to A\times A$.

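Pick any $x\in\mathbb{R}^d$, a policy $\Pi\in\mathcal{A}(x)$, $K>0$ and let $\tau_K := \inf\{t\in\mathbb{R}_+ : X_t^{\Pi,x} - x\in\partial D_K\}$ be the first time the controlled process $X^{\Pi,x}$ hits the boundary of the closed ball $x + D_K$ with radius $K$ centred at $x$. Since $\Pi_t\in A$ for all $t\in\mathbb{R}_+$, the gPIA implies the inequality

$$S_\Pi\big(f_\Pi + L_\Pi V_{\pi_{n_k}} - \alpha_\Pi V_{\pi_{n_k}}\big) \ge S_{\pi_{n_k+1}}\big(f_{\pi_{n_k+1}} + L_{\pi_{n_k+1}}V_{\pi_{n_k}} - \alpha_{\pi_{n_k+1}}V_{\pi_{n_k}}\big) \quad\text{on } \mathbb{R}^d. \tag{24}$$

Denote $\mathcal{L}_\pi h := L_\pi h - \alpha_\pi h + f_\pi$ for any policy $\pi$ and $h\in C^2(\mathbb{R}^d)$. Then, for $k\in\mathbb{N}$, we find that

$$\begin{aligned}
&E\Big[\int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\Pi_r}(X_r^{\Pi,x})dr}f_{\Pi_s}(X_s^{\Pi,x})\,ds + e^{-\int_0^{t\wedge\tau_K}\alpha_{\Pi_r}(X_r^{\Pi,x})dr}V_{\pi_{n_k}}(X_{t\wedge\tau_K}^{\Pi,x})\Big] \\
&\quad\overset{\text{Itô}}{=} V_{\pi_{n_k}}(x) + E\int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\Pi_r}(X_r^{\Pi,x})dr}\,\big(f_\Pi + L_\Pi V_{\pi_{n_k}} - \alpha_\Pi V_{\pi_{n_k}}\big)(X_s^{\Pi,x})\,ds \\
&\quad\ge V_{\pi_{n_k}}(x) + \frac{M_S}{\epsilon_S}\,E\int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\Pi_r}(X_r^{\Pi,x})dr}\,\big(\mathcal{L}_{\pi_{n_k+1}}V_{\pi_{n_k}}\big)(X_s^{\Pi,x})\,ds,
\end{aligned} \tag{25}$$

where the last inequality follows from Assumption 2 and inequality (24).

The next task is to take the limit as $k\to\infty$ on both sides of inequality (25). Since the sequence $\{\pi_{1+n_k}\}_{k\in\mathbb{N}}$ (resp. $\{\pi_{n_k}\}_{k\in\mathbb{N}}$) converges locally uniformly to the locally Lipschitz policy $\pi'_{\lim}$ (resp. $\pi_{\lim}$), Theorem 4 implies $V_{\pi'_{\lim}} = V_{\lim}$ (resp. $V_{\pi_{\lim}} = V_{\lim}$). Proposition 1 implies $\mathcal{L}_{\pi'_{\lim}}V_{\lim} = 0 = \mathcal{L}_{\pi_{\lim}}V_{\lim}$. Hence we can express

$$\mathcal{L}_{\pi_{n_k+1}}V_{\pi_{n_k}} = \mathcal{L}_{\pi_{n_k+1}}V_{\pi_{n_k}} - \mathcal{L}_{\pi'_{\lim}}V_{\pi_{n_k}} + \mathcal{L}_{\pi'_{\lim}}V_{\pi_{n_k}} - \mathcal{L}_{\pi'_{\lim}}V_{\lim}.$$

By the Schauder boundary estimates for elliptic PDEs [1, p. 86], the sequences $\{\nabla V_{\pi_{n_k}}\}_{k\in\mathbb{N}}$ and $\{HV_{\pi_{n_k}}\}_{k\in\mathbb{N}}$ are uniformly bounded on $x + D_K$. By Assumption 1 and Remark 17, the bounded sequence $\{(\sigma_{\pi_{n_k+1}}, \mu_{\pi_{n_k+1}}, \alpha_{\pi_{n_k+1}}, f_{\pi_{n_k+1}}, V_{\pi_{n_k+1}})\}_{k\in\mathbb{N}}$ tends to the limit $(\sigma_{\pi'_{\lim}}, \mu_{\pi'_{\lim}}, \alpha_{\pi'_{\lim}}, f_{\pi'_{\lim}}, V_{\pi'_{\lim}})$ uniformly on $x + D_K$ as $k\to\infty$. Hence, so does

$$\mathcal{L}_{\pi_{n_k+1}}V_{\pi_{n_k}} - \mathcal{L}_{\pi'_{\lim}}V_{\pi_{n_k}} = [L_{\pi_{n_k+1}} - L_{\pi'_{\lim}}]V_{\pi_{n_k}} - (\alpha_{\pi_{n_k+1}} - \alpha_{\pi'_{\lim}})V_{\pi_{n_k}} + f_{\pi_{n_k+1}} - f_{\pi'_{\lim}}. \tag{26}$$

By the elliptic version of Theorem 15 in [1, p. 8] applied to the family of PDEs $\mathcal{L}_{\pi_{n_k}}V_{\pi_{n_k}} = 0$, $k\in\mathbb{N}$, there exists a subsequence of $\{V_{\pi_{n_k}}\}_{k\in\mathbb{N}}$ (again denoted by $\{V_{\pi_{n_k}}\}_{k\in\mathbb{N}}$), such that the corresponding sequence $\{(V_{\pi_{n_k}}, \nabla V_{\pi_{n_k}}, HV_{\pi_{n_k}})\}_{k\in\mathbb{N}}$ converges uniformly on the closed ball $x + D_K$ to $(V_{\pi_{\lim}}, \nabla V_{\pi_{\lim}}, HV_{\pi_{\lim}}) = (V_{\lim}, \nabla V_{\lim}, HV_{\lim})$. Hence it follows that

$$\mathcal{L}_{\pi'_{\lim}}V_{\pi_{n_k}} - \mathcal{L}_{\pi'_{\lim}}V_{\lim} = L_{\pi'_{\lim}}(V_{\pi_{n_k}} - V_{\lim}) - \alpha_{\pi'_{\lim}}(V_{\pi_{n_k}} - V_{\lim}) \to 0 \quad\text{as } k\to\infty. \tag{27}$$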
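Equations (26) and (27) imply that $\mathcal{L}_{\pi_{n_k+1}}V_{\pi_{n_k}}\to0$ as $k\to\infty$ uniformly on the ball $x + D_K$. Apply the DCT to the right-hand side of (25) and Assumption 1 and Remark 17 to its left-hand side:

$$V_{\lim}(x) \le E\int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\Pi_r}(X_r^{\Pi,x})dr}f_{\Pi_s}(X_s^{\Pi,x})\,ds + C\,Ee^{-\epsilon_0(t\wedge\tau_K)}.$$

Since this inequality holds for all $K, t > 0$ and $\tau_K\to\infty$ as $K\to\infty$, the inequality $V_{\lim}(x)\le V_\Pi(x)$ follows by the DCT as $t\wedge K\to\infty$ (cf. the last paragraph of the proof of Theorem ).

Auxiliary results - the one-dimensional case. Throughout Sections 5.3 and 5.4, define $\tau_c^d(Z) := \inf\{t\ge0;\ Z_t\in\{c,d\}\}$ ($\inf\emptyset = \infty$) for any continuous stochastic process $(Z_t)_{t\in\mathbb{R}_+}$ in $\mathbb{R}$ and $c<d$.

Lemma 11. For any Markov policy $\pi:(a,b)\to A$, the payoff function $V_\pi:(a,b)\to\mathbb{R}$ can be continuously extended by defining $V_\pi(a) := g(a)$ if $a>-\infty$ and $V_\pi(b) := g(b)$ if $b<\infty$.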

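Proof. Let $\{x_n\}_{n\in\mathbb{N}}$ be a decreasing sequence in $(a,b)$ that converges to $a>-\infty$. We now prove that $\lim_{n\to\infty}V_\pi(x_n) = g(a)$ (the argument for $b$ is analogous). Pick arbitrary $\epsilon>0$. Since $\mu$ is bounded and $\sigma^2$ bounded and bounded away from $0$, a simple coupling argument yields that the process $X^{\pi,x_n}$ can be bounded by a Brownian motion with drift so that $P(\tau_a(X^{\pi,x_n}) > \epsilon) < \epsilon$ and $P(\tau_a(X^{\pi,x_n}) > \tau_b(X^{\pi,x_n})) < \epsilon$ hold for large $n\in\mathbb{N}$. Hence, there exists $n_0\in\mathbb{N}$ such that

$$P\big(\tau_a(X^{\pi,x_n}) > \epsilon\wedge\tau_b(X^{\pi,x_n})\big) \le P\big(\tau_a(X^{\pi,x_n}) > \epsilon\big) + P\big(\tau_a(X^{\pi,x_n}) > \tau_b(X^{\pi,x_n})\big) < 2\epsilon$$

for all $n\ge n_0$. Define the quantities

$$B_a^b := e^{-\int_0^{\tau_a^b(X^{\pi,x_n})}\alpha_\pi(X_s^{\pi,x_n})ds}\,g\big(X^{\pi,x_n}_{\tau_a^b(X^{\pi,x_n})}\big) - g(a) \quad\text{and}\quad A_a^b := \int_0^{\tau_a^b(X^{\pi,x_n})}e^{-\int_0^t\alpha_\pi(X_s^{\pi,x_n})ds}f_\pi(X_t^{\pi,x_n})\,dt$$

and the event $C := \{\tau_a(X^{\pi,x_n}) \le \epsilon\wedge\tau_b(X^{\pi,x_n})\}$. Then we have

$$|V_\pi(x_n) - g(a)| \le E\big[(|A_a^b| + |B_a^b|)I_{\Omega\setminus C}\big] + E\big[(|A_a^b| + |B_a^b|)I_C\big].$$

We now show that there exists $M>0$, which does not depend on $\epsilon$, such that $|V_\pi(x_n) - g(a)|$ is bounded above by $4M\epsilon$ for all $n\ge n_0$. The expectation on the event $\Omega\setminus C$, which has probability less than $2\epsilon$, is smaller than $2M\epsilon$ since $f, g$ are bounded and $\alpha\ge\epsilon_0>0$. On the event $C$ we have $\tau_a^b(X^{\pi,x_n})\le\epsilon$, which implies $E|A_a^b|I_C < M\epsilon$. On $C$ it holds that $X^{\pi,x_n}_{\tau_a^b(X^{\pi,x_n})} = a$. Hence, the elementary inequality $1 - e^{-x}\le x$ for $x\ge0$ yields an upper bound on $E|B_a^b|I_C$ of the form $|g(a)|\,E\int_0^\epsilon\alpha_\pi(X_s^{\pi,x_n})\,ds$. This concludes the proof.

Lemma 12 is the analogue of Lemma 10 with an analogous proof, which we omit for brevity.

Lemma 12. The following holds for every Markov policy $\pi$, $x\in(a,b)$ and stopping time $\rho$:

$$E\Big[F_{\tau_a^b(X^{\pi,x})}(X^{\pi,x}) + e^{-\int_0^{\tau_a^b(X^{\pi,x})}\alpha_\pi(X_s^{\pi,x})ds}\,g\big(X^{\pi,x}_{\tau_a^b(X^{\pi,x})}\big)I_{\{\tau_a^b(X^{\pi,x})<\infty\}}\,\Big|\,\mathcal{F}_\rho\Big] = M_\rho,$$

where $M_r := F_{r\wedge\tau_a^b(X^{\pi,x})}(X^{\pi,x}) + I_{\{r<\infty\}}e^{-\int_0^{r\wedge\tau_a^b(X^{\pi,x})}\alpha_\pi(X_s^{\pi,x})ds}\,V_\pi\big(X^{\pi,x}_{r\wedge\tau_a^b(X^{\pi,x})}\big)$, for $r\in[0,\infty]$. In particular, the process $M = (M_r)_{r\in[0,\infty]}$ is a uniformly integrable martingale.

Proofs of results in Section 3.

Proof of Proposition 1 in dimension one. Recall that Assumption 1 holds. We need to show that for any locally Lipschitz Markov policy $\pi:(a,b)\to A$ we have $V_\pi\in C^2((a,b))$ and $L_\pi V_\pi - \alpha_\pi V_\pi + f_\pi = 0$. Let $a < a'' < a' < x < b' < b'' < b$, and for any $c<d$ denote $\tau_c^d := \tau_c^d(X^{\pi,x})$.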
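Let $v\in C^2((a',b'))\cap C([a',b'])$ be the unique solution of the boundary value problem $L_\pi v - \alpha_\pi v + f_\pi = 0$, $v(a') = V_\pi(a')$, $v(b') = V_\pi(b')$, guaranteed to exist by Theorem 19 in [1, p. 87], which is applicable by Assumption 1. Let

$$S_t^{a',b'} := F_{t\wedge\tau_{a'}^{b'}}(X^{\pi,x}) + e^{-\int_0^{t\wedge\tau_{a'}^{b'}}\alpha_\pi(X_r^{\pi,x})dr}\,v\big(X^{\pi,x}_{t\wedge\tau_{a'}^{b'}}\big).$$

Then, by Itô's formula on $[0,\tau_{a'}^{b'}]$ and the definition of $v$, the process $S^{a',b'} = (S_t^{a',b'})_{t\in\mathbb{R}_+}$ satisfies

$$S_t^{a',b'} = v(x) + \int_0^{t\wedge\tau_{a'}^{b'}}e^{-\int_0^s\alpha_\pi(X_r^{\pi,x})dr}\,(\sigma_\pi v')(X_s^{\pi,x})\,dB_s.$$

Hence $S^{a',b'}$ is clearly a uniformly integrable martingale and the following equalities hold: $\lim_{t\to\infty}E|S_t^{a',b'} - S_\infty^{a',b'}| = 0$ and $v(x) = ES_\infty^{a',b'}$. Define $S^{a'',b''}$ by substituting $\tau_{a''}^{b''}$ in the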
