arXiv: v2 [cs.LG] 9 Nov 2017

Reinforcement Learning under Model Mismatch

Aurko Roy (1), Huan Xu (2), and Sebastian Pokutta (2)

(1) Google. Email: aurkor@google.com
(2) ISyE, Georgia Institute of Technology, Atlanta, GA, USA. Email: huan.xu@isye.gatech.edu
(2) ISyE, Georgia Institute of Technology, Atlanta, GA, USA. Email: sebastian.pokutta@isye.gatech.edu

November 10, 2017

Abstract

We study reinforcement learning under model misspecification, where we do not have access to the true environment but only to a reasonably close approximation to it. We address this problem by extending the framework of robust MDPs of [2, 17, 13] to the model-free Reinforcement Learning setting, where we do not have access to the model parameters, but can only sample states from it. We define robust versions of Q-learning, SARSA, and TD-learning and prove convergence to an approximately optimal robust policy and approximate value function respectively. We scale up the robust algorithms to large MDPs via function approximation and prove convergence under two different settings. We prove convergence of robust approximate policy iteration and robust approximate value iteration for linear architectures (under mild assumptions). We also define a robust loss function, the mean squared robust projected Bellman error, and give stochastic gradient descent algorithms that are guaranteed to converge to a local minimum.

1 Introduction

Reinforcement learning is concerned with learning a good policy for sequential decision making problems modeled as a Markov Decision Process (MDP), via interacting with the environment [22, 20]. In this work we address the problem of reinforcement learning from a misspecified model. As a motivating example, consider the scenario where the problem of interest is not directly accessible, but instead the agent can interact with a simulator whose dynamics is reasonably close to the true problem. Another plausible application is when the parameters of the model may evolve over time but can still be reasonably approximated by an MDP. To address this problem we use the framework of robust MDPs which was proposed by [2, 17, 13] to solve the planning problem under model misspecification.
The robust MDP framework considers a class of models and finds the robust optimal policy, which is a policy that performs best under the worst model. It was shown by [2, 17, 13] that the robust optimal policy satisfies the robust Bellman equation, which naturally leads to exact dynamic programming algorithms to find an optimal policy. However, this approach is model dependent and does not immediately generalize to the model-free case where the parameters of the model are unknown. Essentially, reinforcement learning is a model-free framework to solve the Bellman equation using samples. Therefore, to learn policies from misspecified models, we develop sample based methods to solve the robust Bellman equation. In particular, we develop robust versions of classical reinforcement learning algorithms such as Q-learning, SARSA, and TD-learning and prove convergence to an approximately optimal policy under mild assumptions on the discount factor. We also show that the nominal versions of these iterative algorithms converge to policies that may be arbitrarily worse compared to the optimal policy.

Footnote: Work done while at Georgia Tech.

We also scale up these robust algorithms to large scale MDPs via function approximation, where we prove convergence under two different settings. Under a technical assumption similar to [6, 26] we show convergence of robust approximate policy iteration and value iteration algorithms for linear architectures. We also study function approximation with nonlinear architectures, by defining an appropriate mean squared robust projected Bellman error (MSRPBE) loss function, which is a generalization of the mean squared projected Bellman error (MSPBE) loss function of [23, 24, 7]. We propose robust versions of stochastic gradient descent algorithms as in [23, 24, 7] and prove convergence to a local minimum under some assumptions for function approximation with arbitrary smooth functions.

Contribution. In summary we have the following contributions:

1. We extend the robust MDP framework of [2, 17, 13] to the model-free reinforcement learning setting. We then define robust versions of Q-learning, SARSA, and TD-learning and prove convergence to an approximately optimal robust policy.

2. We also provide robust reinforcement learning algorithms for the function approximation case and prove convergence of robust approximate policy iteration and value iteration algorithms for linear architectures. We also define the MSRPBE loss function, which contains the robust optimal policy as a local minimum, and we derive stochastic gradient descent algorithms to minimize this loss function as well as establish convergence to a local minimum in the case of function approximation by arbitrary smooth functions.

3. Finally, we demonstrate empirically the improvement in performance for the robust algorithms compared to their nominal counterparts. For this we used various Reinforcement Learning test environments from OpenAI [10] as a benchmark to assess the improvement in performance as well as to ensure reproducibility and consistency of our results.

Related Work. Recently, several approaches have been proposed to address model performance due to parameter uncertainty for Markov Decision Processes (MDPs). A Bayesian approach was proposed by [21] which requires perfect knowledge of the prior distribution on transition matrices. Other probabilistic and risk based settings were studied by [11, 28, 25] which propose various mechanisms to incorporate percentile risk into the model. A framework for robust MDPs was first proposed by [2, 17, 13] who consider the transition matrices to lie in some uncertainty set and proposed a dynamic programming algorithm to solve the robust MDP. Recent work by [26] extended the robust MDP framework to the function approximation setting where under a technical assumption the authors prove convergence to an optimal policy for linear architectures. Note that these algorithms for robust MDPs do not readily generalize to the model-free reinforcement learning setting where the parameters of the environment are not explicitly known.

For reinforcement learning in the non-robust model-free setting, several iterative algorithms such as Q-learning, TD-learning, and SARSA are known to converge to an optimal policy under mild assumptions, see [5] for a survey. Robustness in reinforcement learning for MDPs was studied by [15] who introduced a robust learning framework for learning with disturbances. Similarly, [18] also studied learning in the presence of an adversary who might apply disturbances to the system. However, for the algorithms proposed in [15, 18] no theoretical guarantees are known and there is only limited empirical evidence. Another recent work on robust reinforcement learning is [14], where the authors propose an online algorithm with certain transitions being stochastic and the others being adversarial, and the devised algorithm ensures low regret.

For the case of reinforcement learning with large MDPs using function approximations, theoretical guarantees for most TD-learning based algorithms are only known for linear architectures [3]. Recent work by [7] extended the results of [23, 24] and proved that a stochastic gradient descent algorithm minimizing the mean squared projected Bellman equation (MSPBE) loss function converges to a local minimum, even for nonlinear architectures. However, these algorithms do not apply to robust MDPs; in this work we extend these algorithms to the robust setting.

2 Preliminaries

We consider an infinite horizon Markov Decision Process (MDP) [20] with finite state space $\mathcal{X}$ of size $n$ and finite action space $\mathcal{A}$ of size $m$. At every time step $t$ the agent is in a state $i \in \mathcal{X}$ and can choose an action $a \in \mathcal{A}$ incurring a cost $c_t(i, a)$. We will make the standard assumption that future cost is discounted, see e.g., [22], with a discount factor $\vartheta < 1$ applied to future costs, i.e., $c_t(i, a) := \vartheta^t c(i, a)$, where $c(i, a)$ is a fixed constant independent of the time step $t$ for $i \in \mathcal{X}$ and $a \in \mathcal{A}$. The states transition according to probability transition matrices $\tau := \{P^a\}_{a \in \mathcal{A}}$ which depend only on the last taken action $a$. A policy of the agent is a sequence $\pi = (a_0, a_1, \dots)$, where every $a_t(i)$ corresponds to an action in $\mathcal{A}$ if the system is in state $i$ at time $t$. For every policy $\pi$, we have a corresponding value function $v_\pi \in \mathbb{R}^n$, where $v_\pi(i)$ for a state $i \in \mathcal{X}$ measures the expected cost of that state if the agent were to follow policy $\pi$. This can be expressed by the following recurrence relation
\[ v_\pi(i) := c(i, a_0(i)) + \vartheta\, \mathbb{E}_{j \sim \mathcal{X}}\left[ v_\pi(j) \right]. \tag{1} \]
The goal is to devise algorithms to learn an optimal policy $\pi^*$ that minimizes the expected total cost:

Definition 2.1 (Optimal policy). Given an MDP with state space $\mathcal{X}$, action space $\mathcal{A}$ and transition matrices $P^a$, let $\Pi$ be the strategy space of all possible policies. Then an optimal policy $\pi^*$ is one that minimizes the expected total cost, i.e.,
\[ \pi^* := \arg\min_{\pi \in \Pi} \mathbb{E}\left[ \sum_{t=0}^{\infty} \vartheta^t c(i_t, a_t(i_t)) \right]. \]

In the robust case we will assume as in [17, 13] that the transition matrices $P^a$ are not fixed and may come from some uncertainty region $\mathcal{P}^a$ and may be chosen adversarially by nature in future runs of the model. In this setting, [17, 13] prove the following robust analogue of the Bellman recursion. A policy of nature is a sequence $\tau := (P_0, P_1, \dots)$ where every $P_t(a) \in \mathcal{P}^a$ corresponds to a transition probability matrix chosen from $\mathcal{P}^a$. Let $\mathcal{T}$ denote the set of all such policies of nature. In other words, a policy $\tau \in \mathcal{T}$ of nature is a sequence of transition matrices that may be played by it in response to the actions of the agent. For any set $P \subseteq \mathbb{R}^n$ and vector $v \in \mathbb{R}^n$, let $\sigma_P(v) := \sup\{ p^\top v \mid p \in P \}$ be the support function of the set $P$. For a state $i \in \mathcal{X}$, let $\mathcal{P}_i^a$ be the projection of $\mathcal{P}^a$ onto the $i$-th row.

Theorem 2.2. We have the following perfect duality relation
\[ \min_{\pi \in \Pi} \max_{\tau \in \mathcal{T}} \mathbb{E}_\tau\left[ \sum_{t=0}^{\infty} \vartheta^t c(i_t, a_t(i_t)) \right] = \max_{\tau \in \mathcal{T}} \min_{\pi \in \Pi} \mathbb{E}_\tau\left[ \sum_{t=0}^{\infty} \vartheta^t c(i_t, a_t(i_t)) \right]. \tag{2} \]
The optimal value function $v_{\pi^*}$ corresponding to the optimal policy $\pi^*$ satisfies
\[ v_{\pi^*}(i) = \min_{a \in \mathcal{A}} \left( c(i, a) + \vartheta\, \sigma_{\mathcal{P}_i^a}(v_{\pi^*}) \right), \tag{3} \]
and $\pi^*$ can then be obtained in a greedy fashion, i.e., $a^*(i) \in \arg\min_{a \in \mathcal{A}} \left\{ c(i, a) + \vartheta\, \sigma_{\mathcal{P}_i^a}(v_{\pi^*}) \right\}$.

The main shortcoming of this approach is that it does not generalize to the model free case where the transition probabilities are not explicitly known but rather the agent can only sample states according to these probabilities. In the absence of this knowledge, we cannot compute the support functions of the uncertainty sets $\mathcal{P}_i^a$. On the other hand it is often easy to have a confidence region $\mathcal{U}_i^a$, e.g., a ball or an ellipsoid, corresponding to every state-action pair $i \in \mathcal{X}, a \in \mathcal{A}$ that quantifies our uncertainty in the simulation, with the uncertainty set $\mathcal{P}_i^a$ being the confidence region $\mathcal{U}_i^a$ centered around the unknown simulator probabilities. Formally, we define the uncertainty sets corresponding to every state-action pair in the following fashion.

[Figure 1: Example transition matrices shown within the probability simplex $\Delta_n$ with uncertainty sets being $\ell_2$ balls of fixed radius.]

Definition 2.3 (Uncertainty sets). Corresponding to every state-action pair $(i, a)$ we have a confidence region $\mathcal{U}_i^a$ so that the uncertainty region $\mathcal{P}_i^a$ of the probability transition matrix corresponding to $(i, a)$ is defined as
\[ \mathcal{P}_i^a := \{ x + p_i^a \mid x \in \mathcal{U}_i^a \}, \tag{4} \]
where $p_i^a$ is the unknown state transition probability vector from the state $i \in \mathcal{X}$ to every other state in $\mathcal{X}$ given action $a$ during the simulation.

As a simple example, we have the ellipsoid $\mathcal{U}_i^a := \{ x \mid x^\top A_i^a x \le 1, \ \sum_{i \in \mathcal{X}} x_i = 0 \}$ for some $n \times n$ psd matrix $A_i^a$ with the uncertainty set $\mathcal{P}_i^a$ being $\mathcal{P}_i^a := \{ x + p_i^a \mid x \in \mathcal{U}_i^a \}$, where $p_i^a$ is the unknown simulator state transition probability vector with which the agent transitioned to a new state during training. Note that while it may be easy to come up with good descriptions of the confidence region $\mathcal{U}_i^a$, the approach of [17, 13] breaks down since we have no knowledge of $p_i^a$ and merely observe the new state $j$ sampled from this distribution. See Figure 1 for an illustration with the confidence regions being an $\ell_2$ ball of fixed radius $r$.

In the following sections we develop robust versions of Q-learning, SARSA, and TD-learning which are guaranteed to converge to an approximately optimal policy that is robust with respect to this confidence region. The robust versions of these iterative algorithms involve an additional linear optimization step over the set $\mathcal{U}_i^a$, which in the case of $\mathcal{U}_i^a = \{ \|x\|_2 \le r \}$ simply corresponds to adding fixed noise during every update. In later sections we will extend it to the function approximation case where we study linear architectures as well as nonlinear architectures; in the latter case we derive new stochastic gradient descent algorithms for computing approximately robust policies.
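To make the extra linear optimization step concrete, the following is a minimal sketch of the support function over an l2 ball, which has a closed form. The function name `support_l2_ball` and the `zero_sum` flag are hypothetical names introduced here for illustration; the zero-sum variant assumes the constraint that the entries of x sum to zero, as in the ellipsoid example above.

```python
import math

def support_l2_ball(v, r, zero_sum=True):
    """Return sup{ x.v : ||x||_2 <= r } and, if zero_sum, also sum(x) = 0.

    Illustrative helper (not from the paper): without the zero-sum
    constraint the supremum is r * ||v||_2; with it, v is first centered,
    i.e. projected onto the subspace {x : sum(x) = 0}.
    """
    if zero_sum:
        mean = sum(v) / len(v)
        v = [vi - mean for vi in v]
    return r * math.sqrt(sum(vi * vi for vi in v))

# Hypothetical value vector for a 3-state MDP.
v = [1.0, 2.0, 4.0]
print(support_l2_ball(v, r=0.1, zero_sum=False))
print(support_l2_ball(v, r=0.1))
```

Under these assumptions the robust correction depends on the current value estimate only through a norm, so it costs one pass over the state space per update.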

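3 Robust exact dynamic programming algorithms

In this section we develop robust versions of exact dynamic programming algorithms such as Q-learning, SARSA, and TD-learning. These methods are suitable for small MDPs where the size $n$ of the state space is not too large. Note that the confidence region $\mathcal{U}_i^a$ must also be constrained to lie within the probability simplex $\Delta_n$, see Figure 1. However, since we do not have knowledge of the simulator probabilities $p_i^a$, we do not know how far away $p_i^a$ is from the boundary of $\Delta_n$, and so the algorithms will make use of a proxy confidence region $\hat{\mathcal{U}}_i^a$, where we drop the requirement of $\hat{\mathcal{U}}_i^a \subseteq \Delta_n$, to compute the robust optimal policies. With a suitable choice of step lengths and discount factors we can prove convergence to an approximately optimal $\mathcal{U}_i^a$-robust policy where the approximation depends on the difference between the unconstrained proxy region $\hat{\mathcal{U}}_i^a$ and the true confidence region $\mathcal{U}_i^a$. Below we give specific examples of possible choices for simple confidence regions.

1. Ellipsoid: Let $\{A_i^a\}_{i,a}$ be a sequence of $n \times n$ psd matrices. Then we can define the confidence region as
\[ \mathcal{U}_i^a := \left\{ x \;\middle|\; x^\top A_i^a x \le 1, \ \sum_{i \in \mathcal{X}} x_i = 0, \ -p_{ij}^a \le x_j \le 1 - p_{ij}^a, \ \forall j \in \mathcal{X} \right\}. \tag{5} \]
Note that $\mathcal{U}_i^a$ has some additional linear constraints so that the uncertainty set $\mathcal{P}_i^a := \{ p_i^a + x \mid x \in \mathcal{U}_i^a \}$ lies inside $\Delta_n$. Since we do not know $p_i^a$, we will make use of the proxy confidence region $\hat{\mathcal{U}}_i^a := \{ x \mid x^\top A_i^a x \le 1, \ \sum_{i \in \mathcal{X}} x_i = 0 \}$. In particular when $A_i^a = r^{-1} I_n$ for every $i \in \mathcal{X}, a \in \mathcal{A}$ then this corresponds to a spherical confidence interval of $[-r, r]$ in every direction. In other words, each uncertainty set $\mathcal{P}_i^a$ is an $\ell_2$ ball of radius $r$.

2. Parallelepiped: Let $\{B_i^a\}_{i,a}$ be a sequence of $n \times n$ invertible matrices. Then we can define the confidence region as
\[ \mathcal{U}_i^a := \left\{ x \;\middle|\; \| B_i^a x \|_1 \le 1, \ \sum_{i \in \mathcal{X}} x_i = 0, \ -p_{ij}^a \le x_j \le 1 - p_{ij}^a, \ \forall j \in \mathcal{X} \right\}. \tag{6} \]
As before, we will use the unconstrained parallelepiped $\hat{\mathcal{U}}_i^a$ without the $-p_{ij}^a \le x_j \le 1 - p_{ij}^a$ constraints as a proxy for $\mathcal{U}_i^a$, since we do not have knowledge of $p_i^a$. In particular if $B_i^a = D$ for a diagonal matrix $D$, then the proxy confidence region $\hat{\mathcal{U}}_i^a$ corresponds to a rectangle. In particular if every diagonal entry is $r$, then every uncertainty set $\mathcal{P}_i^a$ is an $\ell_1$ ball of radius $r$.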
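For the proxy regions above the support function can be written in closed form, which is what makes the extra optimization step cheap. The following worked formulas are a sketch under stated assumptions rather than statements from the text: $A$ is taken to be positive definite, $\mathbf{1}$ denotes the all-ones vector, and the radius parametrizations $A = r^{-2} I_n$ and $\|x\|_1 \le r$ are chosen here so that the radius is exactly $r$, which may differ from the normalization used in the examples above.

```latex
% Proxy ellipsoid {x : x^T A x <= 1, 1^T x = 0}, assuming A is positive definite.
\begin{align*}
\sigma_{\hat{\mathcal{U}}}(v)
  &= \sup\left\{ x^\top v \;:\; x^\top A x \le 1,\ \mathbf{1}^\top x = 0 \right\}
   = \sqrt{ v^\top A^{-1} v - \frac{(\mathbf{1}^\top A^{-1} v)^2}{\mathbf{1}^\top A^{-1} \mathbf{1}} }. \\
% Special case A = r^{-2} I_n (Euclidean ball of radius r):
\sigma_{\hat{\mathcal{U}}}(v)
  &= r \, \| v - \bar{v}\,\mathbf{1} \|_2,
  \qquad \bar{v} := \tfrac{1}{n} \mathbf{1}^\top v. \\
% Zero-sum l1 ball {x : ||x||_1 <= r, 1^T x = 0}; extreme points are (r/2)(e_j - e_k):
\sigma(v)
  &= \tfrac{r}{2} \left( \max_j v_j - \min_j v_j \right).
\end{align*}
```

The first identity follows by substituting $y = A^{1/2} x$ and projecting $A^{-1/2} v$ onto the orthogonal complement of $A^{-1/2} \mathbf{1}$; the last follows because a linear function attains its supremum at an extreme point of the feasible polytope.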
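3.1 Robust Q-learning

Let us recall the notion of a Q-factor of a state-action pair $(i, a)$ and a policy $\pi$, which in the non-robust setting is defined as
\[ Q(i, a) := c(i, a) + \mathbb{E}_{j \sim \mathcal{X}}\left[ v(j) \right], \tag{7} \]
where $v$ is the value function of the policy $\pi$. In other words, the Q-factor represents the expected cost if we start at state $i$, use the action $a$ and follow the policy $\pi$ subsequently. One may similarly define the robust Q-factors using a similar interpretation and the minimax characterization of Theorem 2.2. Let $Q^*$ denote the Q-factors of the optimal robust policy and let $v^* \in \mathbb{R}^n$ be its value function. Note that we may write the value function in terms of the Q-factors as $v^*(i) = \min_{a \in \mathcal{A}} Q^*(i, a)$. From Theorem 2.2 we have the following expression for $Q^*$:
\begin{align}
Q^*(i, a) &= c(i, a) + \vartheta\, \sigma_{\mathcal{P}_i^a}(v^*) \tag{8} \\
&= c(i, a) + \vartheta\, \sigma_{\mathcal{U}_i^a}(v^*) + \vartheta \sum_{j \in \mathcal{X}} p_{ij}^a \min_{a' \in \mathcal{A}} Q^*(j, a'), \tag{9}
\end{align}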

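where equation 9 follows from Definition 2.3. For an estimate $Q_t$ of $Q^*$, let $v_t \in \mathbb{R}^n$ be its value vector, i.e., $v_t(i) := \min_{a \in \mathcal{A}} Q_t(i, a)$. The robust Q-iteration is defined as:
\[ Q_t(i, a) := (1 - \gamma_t)\, Q_{t-1}(i, a) + \gamma_t \left( c(i, a) + \vartheta\, \sigma_{\hat{\mathcal{U}}_i^a}(v_{t-1}) + \vartheta \min_{a' \in \mathcal{A}} Q_{t-1}(j, a') \right), \tag{10} \]
where a state $j \in \mathcal{X}$ is sampled with the unknown transition probability $p_{ij}^a$ using the simulator. Note that the robust Q-iteration of equation 10 involves an additional linear optimization step to compute the support function $\sigma_{\hat{\mathcal{U}}_i^a}(v_t)$ of $v_t$ over the proxy confidence region $\hat{\mathcal{U}}_i^a$. We will prove that iterating equation 10 converges to an approximately optimal policy. The following definition introduces the notion of an $\varepsilon$-optimal policy, see e.g., [5]. The error factor $\varepsilon$ is also referred to as the amplification factor. We will treat the Q-factors as a $|\mathcal{X}| \times |\mathcal{A}|$ matrix in the definition so that its $\ell_\infty$ norm is defined as usual.

Definition 3.1 ($\varepsilon$-optimal policy). A policy $\pi$ with Q-factors $Q'$ is $\varepsilon$-optimal with respect to the optimal policy $\pi^*$ with corresponding Q-factors $Q^*$ if
\[ \| Q' - Q^* \|_\infty \le \varepsilon\, \| Q^* \|_\infty. \tag{11} \]

The following simple lemma allows us to decompose the optimization of a linear function over the proxy uncertainty set $\hat{\mathcal{P}}_i^a$ in terms of linear optimization over $\mathcal{P}_i^a$, $\mathcal{U}_i^a$, and $\hat{\mathcal{U}}_i^a$.

Lemma 3.2. Let $v \in \mathbb{R}^n$ be any vector and let $\beta_i^a := \max_{y \in \hat{\mathcal{U}}_i^a} \min_{x \in \mathcal{U}_i^a} \| y - x \|_1$. Then we have $\sigma_{\hat{\mathcal{P}}_i^a}(v) \le \sigma_{\mathcal{P}_i^a}(v) + \beta_i^a \| v \|_\infty$.

Proof. Note that every point $p$ in $\mathcal{P}_i^a$ is of the form $p_i^a + x$ for some $x \in \mathcal{U}_i^a$ and every point $q \in \hat{\mathcal{P}}_i^a$ is of the form $p_i^a + y$ for some $y \in \hat{\mathcal{U}}_i^a$, and this correspondence is one to one by definition. For any vector $v \in \mathbb{R}^n$ and pairs of points $p \in \mathcal{P}_i^a$ and $q \in \hat{\mathcal{P}}_i^a$ we have
\begin{align}
q^\top v &= p^\top v + (q - p)^\top v \tag{12} \\
&\le \sup_{p' \in \mathcal{P}_i^a} p'^\top v + (p_i^a + y - p_i^a - x)^\top v \tag{13} \\
&= \sigma_{\mathcal{P}_i^a}(v) + (y - x)^\top v. \tag{14}
\end{align}
Since this bound holds for every $x \in \mathcal{U}_i^a$, we may take the best such $x$:
\begin{align}
q^\top v &\le \sigma_{\mathcal{P}_i^a}(v) + (y - x)^\top v \tag{15} \\
&\le \sigma_{\mathcal{P}_i^a}(v) + \min_{x \in \mathcal{U}_i^a} (y - x)^\top v \tag{16} \\
&\le \sigma_{\mathcal{P}_i^a}(v) + \max_{y \in \hat{\mathcal{U}}_i^a} \min_{x \in \mathcal{U}_i^a} (y - x)^\top v \tag{17} \\
&\le \sigma_{\mathcal{P}_i^a}(v) + \max_{y \in \hat{\mathcal{U}}_i^a} \min_{x \in \mathcal{U}_i^a} \| y - x \|_1 \| v \|_\infty \tag{18} \\
&= \sigma_{\mathcal{P}_i^a}(v) + \beta_i^a \| v \|_\infty. \tag{19}
\end{align}
Since equation 19 holds for every $q \in \hat{\mathcal{P}}_i^a$, it follows that it also holds for the maximizer attaining $\sigma_{\hat{\mathcal{P}}_i^a}(v)$, so that
\[ \sigma_{\hat{\mathcal{P}}_i^a}(v) \le \sigma_{\mathcal{P}_i^a}(v) + \beta_i^a \| v \|_\infty. \tag{20} \]

The following theorem proves that under a suitable choice of step lengths $\gamma_t$ and discount factor $\vartheta$, the iteration of equation 10 converges to an $\varepsilon$-approximately optimal policy with respect to the confidence regions $\mathcal{U}_i^a$.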

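Theorem 3.3. Let the step lengths $\gamma_t$ of the Q-iteration algorithm be chosen such that $\sum_{t=0}^{\infty} \gamma_t = \infty$ and $\sum_{t=0}^{\infty} \gamma_t^2 < \infty$ and let the discount factor $\vartheta < 1$. Let $\beta_i^a$ be as in Lemma 3.2 and let $\beta := \max_{i \in \mathcal{X}, a \in \mathcal{A}} \beta_i^a$. If $\vartheta(1 + \beta) < 1$ then with probability 1 the iteration of equation 10 converges to an $\varepsilon$-optimal policy where $\varepsilon := \frac{\vartheta\beta}{1 - \vartheta(1 + \beta)}$.

Proof. Let $\hat{\mathcal{P}}_i^a$ be the proxy uncertainty set for state $i \in \mathcal{X}$ and $a \in \mathcal{A}$, i.e., $\hat{\mathcal{P}}_i^a := \{ x + p_i^a \mid x \in \hat{\mathcal{U}}_i^a \}$. We denote the value function of $Q$ by $v$. Let us define the following operator $H$ mapping Q-factors to Q-factors as follows:
\[ (HQ)(i, a) := c(i, a) + \vartheta\, \sigma_{\hat{\mathcal{P}}_i^a}(v). \tag{21} \]
We will first show that a solution $Q'$ to the equation $HQ = Q$ is an $\varepsilon$-optimal policy as in Definition 3.1, i.e., $\| Q' - Q^* \|_\infty \le \varepsilon \| Q^* \|_\infty$. Writing $v'$ and $v^*$ for the value functions of $Q'$ and $Q^*$, we have
\begin{align}
| Q'(i, a) - Q^*(i, a) | &= \left| (HQ')(i, a) - c(i, a) - \vartheta\, \sigma_{\mathcal{P}_i^a}(v^*) \right| \tag{22} \\
&= \vartheta \left| \sigma_{\hat{\mathcal{P}}_i^a}(v') - \sigma_{\mathcal{P}_i^a}(v^*) \right| \tag{23} \\
&\le \vartheta \max_{y \in \hat{\mathcal{U}}_i^a} \min_{x \in \mathcal{U}_i^a} \| y - x \|_1 \| Q' \|_\infty + \vartheta \left| \sigma_{\mathcal{P}_i^a}(v') - \sigma_{\mathcal{P}_i^a}(v^*) \right| \tag{24} \\
&\le \vartheta\beta \| Q' \|_\infty + \vartheta \left| \sigma_{\mathcal{P}_i^a}(v') - \sigma_{\mathcal{P}_i^a}(v^*) \right| \tag{25} \\
&\le \vartheta\beta \| Q' \|_\infty + \vartheta \left| \max_{q \in \mathcal{P}_i^a} \sum_{j \in \mathcal{X}} q_j \min_{a' \in \mathcal{A}} Q'(j, a') - \max_{q \in \mathcal{P}_i^a} \sum_{j \in \mathcal{X}} q_j \min_{a' \in \mathcal{A}} Q^*(j, a') \right| \tag{26} \\
&\le \vartheta\beta \| Q' \|_\infty + \vartheta \max_{q \in \mathcal{P}_i^a} \sum_{j \in \mathcal{X}} q_j \left| \min_{a' \in \mathcal{A}} Q'(j, a') - \min_{a' \in \mathcal{A}} Q^*(j, a') \right| \tag{27} \\
&\le \vartheta\beta \| Q' \|_\infty + \vartheta \max_{q \in \mathcal{P}_i^a} \sum_{j \in \mathcal{X}} q_j \max_{a' \in \mathcal{A}} \left| Q'(j, a') - Q^*(j, a') \right| \tag{28} \\
&\le \vartheta\beta \| Q' \|_\infty + \vartheta \max_{q \in \mathcal{P}_i^a} \sum_{j \in \mathcal{X}} q_j \| Q' - Q^* \|_\infty \tag{29} \\
&\le \vartheta\beta \| Q' \|_\infty + \vartheta \| Q' - Q^* \|_\infty, \tag{30}
\end{align}
where we used Lemma 3.2 to derive equation 24. Equation 30 implies that $\| Q' - Q^* \|_\infty \le \frac{\vartheta\beta}{1 - \vartheta} \| Q' \|_\infty$. If $\| Q' \|_\infty \le \| Q^* \|_\infty$ then we are done since $\frac{\vartheta\beta}{1 - \vartheta} \le \frac{\vartheta\beta}{1 - \vartheta(1 + \beta)}$. Otherwise assume that $\| Q' \|_\infty > \| Q^* \|_\infty$ and use the triangle inequality: $\| Q' \|_\infty - \| Q^* \|_\infty = \left| \| Q' \|_\infty - \| Q^* \|_\infty \right| \le \| Q' - Q^* \|_\infty$. This implies that
\[ \frac{1 - \vartheta}{\vartheta\beta} \| Q' - Q^* \|_\infty \le \| Q' - Q^* \|_\infty + \| Q^* \|_\infty, \tag{31} \]
from which it follows that $\| Q' - Q^* \|_\infty \le \varepsilon \| Q^* \|_\infty$ under the assumption that $\vartheta(1 + \beta) < 1$ as claimed.

The Q-iteration of equation 10 can then be reformulated in terms of the operator $H$ as
\[ Q_t(i, a) = (1 - \gamma_t)\, Q_{t-1}(i, a) + \gamma_t \left( (HQ_t)(i, a) + \eta_t(i, a) \right), \tag{32} \]
where $\eta_t(i, a) := \min_{a' \in \mathcal{A}} Q_t(j, a') - \mathbb{E}_{j \sim p_i^a}\left[ \min_{a' \in \mathcal{A}} Q_t(j, a') \right]$ and the expectation is over the states $j \in \mathcal{X}$ with the transition probability from state $i$ to state $j$ given by $p_{ij}^a$. Note that this is an example of a stochastic approximation algorithm as in [5] with noise parameter $\eta_t$. Let $\mathcal{F}_t$ denote the history of the algorithm until time $t$. Note that $\mathbb{E}_{j \sim p_i^a}\left[ \eta_t(i, a) \mid \mathcal{F}_t \right] = 0$ by definition and the variance is bounded by
\[ \mathbb{E}_{j \sim p_i^a}\left[ \eta_t(i, a)^2 \mid \mathcal{F}_t \right] \le K \left( 1 + \max_{j \in \mathcal{X}, a' \in \mathcal{A}} Q_t^2(j, a') \right). \tag{33} \]
Thus the noise term $\eta_t$ satisfies the zero conditional mean and bounded variance assumption (Assumption 4.3 in [5]). Therefore it remains to show that the operator $H$ is a contraction mapping to argue that iterating equation 10 converges to a solution $Q'$ of $HQ = Q$.

We will show that the operator $H$ is a contraction mapping with respect to the infinity norm $\| \cdot \|_\infty$. Let $Q$ and $Q'$ be two different Q-vectors with value functions $v$ and $v'$. If $\mathcal{U}_i^a$ is not necessarily the same as the unconstrained proxy set $\hat{\mathcal{U}}_i^a$ for some $i \in \mathcal{X}, a \in \mathcal{A}$, then we need the discount factor to satisfy $\vartheta(1 + \beta) < 1$ in order to ensure convergence. Intuitively, the discount factor should be small enough that the difference in the estimation due to the difference of the sets $\mathcal{U}_i^a$ and $\hat{\mathcal{U}}_i^a$ converges to 0 over time. In this case we show contraction for the operator $H$ as follows
\begin{align}
| (HQ)(i, a) - (HQ')(i, a) | &\le \vartheta \max_{q \in \hat{\mathcal{P}}_i^a} \left| \sum_{j \in \mathcal{X}} q_j \left( \min_{a' \in \mathcal{A}} Q(j, a') - \min_{a' \in \mathcal{A}} Q'(j, a') \right) \right| \tag{34} \\
&\le \vartheta \max_{q \in \hat{\mathcal{P}}_i^a} \sum_{j \in \mathcal{X}} q_j \max_{a' \in \mathcal{A}} \left| Q(j, a') - Q'(j, a') \right| \tag{35} \\
&\le \vartheta \max_{y \in \hat{\mathcal{U}}_i^a} \min_{x \in \mathcal{U}_i^a} \| y - x \|_1 \| Q - Q' \|_\infty + \vartheta \max_{q \in \mathcal{P}_i^a} \sum_{j \in \mathcal{X}} q_j \| Q - Q' \|_\infty \tag{36} \\
&\le \vartheta\beta \| Q - Q' \|_\infty + \vartheta \| Q - Q' \|_\infty \max_{q \in \mathcal{P}_i^a} \sum_{j \in \mathcal{X}} q_j \tag{37} \\
&\le \vartheta(\beta + 1) \| Q - Q' \|_\infty, \tag{38}
\end{align}
where we used Lemma 3.2 with the vector $v(j) := \max_{a' \in \mathcal{A}} | Q(j, a') - Q'(j, a') |$ to derive equation 36 and the fact that $\mathcal{P}_i^a \subseteq \Delta_n$ to conclude that $\max_{q \in \mathcal{P}_i^a} \sum_{j \in \mathcal{X}} q_j = 1$. Therefore if $\vartheta(1 + \beta) < 1$, then it follows that the operator $H$ is a norm contraction and thus the robust Q-iteration of equation 10 converges to a solution of $HQ = Q$ which is an $\varepsilon$-approximately optimal policy for $\varepsilon = \frac{\vartheta\beta}{1 - \vartheta(1 + \beta)}$, as was proved before.

Remark 3.4. If $\beta = 0$ then note that by Theorem 3.3, the robust Q-iterations converge to the exact optimal Q-factors since $\varepsilon = 0$. Since $\beta = \max_{i \in \mathcal{X}, a \in \mathcal{A}} \max_{y \in \hat{\mathcal{U}}_i^a} \min_{x \in \mathcal{U}_i^a} \| y - x \|_1$, it follows that $\beta = 0$ iff $\hat{\mathcal{U}}_i^a = \mathcal{U}_i^a$ for every $i \in \mathcal{X}, a \in \mathcal{A}$. This happens when the confidence region is small enough so that the simplex constraints $-p_{ij}^a \le x_j \le 1 - p_{ij}^a$ for all $j \in \mathcal{X}$ in the description of $\mathcal{P}_i^a$ become redundant for every $i \in \mathcal{X}, a \in \mathcal{A}$. Equivalently, every $p_i^a$ is far from the boundary of the simplex $\Delta_n$ compared to the size of the confidence region $\mathcal{U}_i^a$, see e.g., Figure 1.

Remark 3.5. Note that simply using the nominal Q-iteration without the $\sigma_{\hat{\mathcal{U}}_i^a}(v)$ term does not guarantee convergence to $Q^*$. Indeed, the nominal Q-iterations converge to Q-factors $Q'$ where $\| Q' - Q^* \|_\infty$ may be arbitrarily large. This follows easily from observing that $| Q'(i, a) - Q^*(i, a) | = | \sigma_{\hat{\mathcal{U}}_i^a}(v^*) |$, where $v^*$ is the value function of $Q^*$, and so
\[ \| Q' - Q^* \|_\infty = \max_{i \in \mathcal{X}, a \in \mathcal{A}} \left| \sigma_{\hat{\mathcal{U}}_i^a}(v^*) \right|, \tag{39} \]
which can be as high as $\| v^* \|_\infty = \| Q^* \|_\infty$. See Section 5 for an experimental demonstration of the difference in the policies learned by the robust and nominal algorithms.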
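The robust Q-iteration of equation 10 can be sketched in a few lines for a tabular problem. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the proxy region is taken to be a zero-sum l2 ball of radius `r`, state-action pairs are explored uniformly at random, the step size is one over the visit count of the pair, and the two-state MDP (`P`, `cost`, `sample_next`) is a hypothetical toy example.

```python
import math
import random

def sigma_l2(v, r):
    """Support function of the assumed proxy region {x : ||x||_2 <= r, sum(x) = 0}."""
    mean = sum(v) / len(v)
    return r * math.sqrt(sum((x - mean) ** 2 for x in v))

def robust_q_learning(cost, sample_next, n, m, theta, r, steps, seed=0):
    """Tabular robust Q-iteration in the spirit of equation 10 (costs are minimized)."""
    rng = random.Random(seed)
    Q = [[0.0] * m for _ in range(n)]
    visits = [[0] * m for _ in range(n)]
    for _ in range(steps):
        i, a = rng.randrange(n), rng.randrange(m)  # uniform exploration (assumption)
        j = sample_next(i, a, rng)                 # simulator draw j ~ p_i^a
        v = [min(row) for row in Q]                # v_{t-1}(k) = min_a Q_{t-1}(k, a)
        visits[i][a] += 1
        gamma = 1.0 / visits[i][a]                 # sum gamma = inf, sum gamma^2 < inf
        target = cost[i][a] + theta * sigma_l2(v, r) + theta * v[j]
        Q[i][a] = (1 - gamma) * Q[i][a] + gamma * target
    return Q

# Hypothetical two-state, two-action simulator: P[i][a][j] is only used to sample.
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
cost = [[1.0, 2.0], [0.5, 3.0]]

def sample_next(i, a, rng):
    return 0 if rng.random() < P[i][a][0] else 1

Q_nominal = robust_q_learning(cost, sample_next, 2, 2, theta=0.5, r=0.0, steps=20000)
Q_robust = robust_q_learning(cost, sample_next, 2, 2, theta=0.5, r=0.2, steps=20000)
print(Q_nominal)
print(Q_robust)
```

Setting `r=0.0` recovers the nominal Q-iteration, so the two calls give a direct comparison in the sense of Remark 3.5: with the same seed the robust Q-factors are never smaller than the nominal ones, since the support function term is nonnegative.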

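3.2 Robust SARSA

Recall that the update rule of SARSA is similar to the update rule for Q-learning except that instead of choosing the action $a' = \arg\min_{a' \in \mathcal{A}} Q_{t-1}(j, a')$, we choose the action $a'$ where with probability $\delta$, the action $a'$ is chosen uniformly at random from $\mathcal{A}$ and with probability $1 - \delta$, we have $a' = \arg\min_{a' \in \mathcal{A}} Q_{t-1}(j, a')$. Therefore, it is easy to modify the robust Q-iteration of equation 10 to give us the robust SARSA updates:
\[ Q_t(i, a) := (1 - \gamma_t)\, Q_{t-1}(i, a) + \gamma_t \left( c(i, a) + \vartheta\, \sigma_{\hat{\mathcal{U}}_i^a}(v_{t-1}) + \vartheta\, Q_{t-1}(j, a') \right). \tag{40} \]
In the exact dynamic programming setting, it has the same convergence guarantees as robust Q-learning and can be seen as a corollary of Theorem 3.3.

Corollary 3.6. Let the step lengths $\gamma_t$ be chosen such that $\sum_{t=0}^{\infty} \gamma_t = \infty$ and $\sum_{t=0}^{\infty} \gamma_t^2 < \infty$ and let the discount factor $\vartheta < 1$. Let $\beta_i^a$ be as in Lemma 3.2 and let $\beta := \max_{i \in \mathcal{X}, a \in \mathcal{A}} \beta_i^a$. If $\vartheta(1 + \beta) < 1$ then with probability 1 the iteration of equation 40 converges to an $\varepsilon$-optimal policy where $\varepsilon := \frac{\vartheta\beta}{1 - \vartheta(1 + \beta)}$. In particular if $\beta_i^a = \beta = 0$ so that the proxy confidence regions $\hat{\mathcal{U}}_i^a$ are the same as the true confidence regions $\mathcal{U}_i^a$, then the iteration 40 converges to the true optimum $Q^*$.

3.3 Robust TD-learning

Recall that TD-learning allows us to estimate the value function $v_\pi$ for a given policy $\pi$. In this section we will generalize the TD-learning algorithm to the robust case. The main idea behind TD-learning in the non-robust setting is the following Bellman equation
\[ v_\pi(i) := \mathbb{E}_{j \sim p_i^{\pi(i)}}\left[ c(i, \pi(i)) + v_\pi(j) \right]. \tag{41} \]
Consider a trajectory of the agent $(i_0, i_1, \dots)$, where $i_m$ denotes the state of the agent at time step $m$. For a time step $m$, define the temporal difference $d_m$ as
\[ d_m := c(i_m, \pi(i_m)) + \vartheta\, v_\pi(i_{m+1}) - v_\pi(i_m). \tag{42} \]
Let $\lambda \in (0, 1)$. The recurrence relation for TD($\lambda$) may be written in terms of the temporal difference $d_m$ as
\[ v_\pi(i_k) = \mathbb{E}\left[ \sum_{m=k}^{\infty} (\vartheta\lambda)^{m-k} d_m \right] + v_\pi(i_k). \tag{43} \]
The corresponding Robbins-Monro stochastic approximation algorithm with step size $\gamma_t$ for equation 43 is
\[ v_{t+1}(i_k) := v_t(i_k) + \gamma_t \left( \sum_{m=k}^{\infty} (\vartheta\lambda)^{m-k} d_m \right). \tag{44} \]
A more general variant of the TD($\lambda$) iterations uses eligibility coefficients $z_m(i)$ for every state $i \in \mathcal{X}$ and temporal difference vector $d_m$ in the update for equation 44
\[ v_{t+1}(i) := v_t(i) + \gamma_t \left( \sum_{m=k}^{\infty} z_m(i)\, d_m \right). \tag{45} \]
Let $i_m$ denote the state of the simulator at time step $m$.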
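The temporal differences of equation 42 and the TD($\lambda$) update of equation 44 can be sketched for a recorded trajectory as follows. This is a minimal non-robust illustration under stated assumptions: the infinite sum is truncated at the end of the finite trajectory, the update is applied once for every visited time step $k$, and the names `temporal_differences` and `td_lambda_update` as well as the toy trajectory are hypothetical.

```python
def temporal_differences(states, costs, v, theta):
    """d_m = c(i_m, pi(i_m)) + theta * v(i_{m+1}) - v(i_m), as in equation 42.

    states has one more entry than costs; costs[m] is the cost paid at step m.
    """
    return [costs[m] + theta * v[states[m + 1]] - v[states[m]]
            for m in range(len(states) - 1)]

def td_lambda_update(states, costs, v, theta, lam, gamma):
    """One pass of equation 44 over a finite trajectory (sum truncated at its end)."""
    d = temporal_differences(states, costs, v, theta)
    new_v = list(v)
    for k in range(len(d)):
        # Discounted sum of the temporal differences observed from step k onwards.
        total = sum((theta * lam) ** (m - k) * d[m] for m in range(k, len(d)))
        new_v[states[k]] += gamma * total
    return new_v

# Hypothetical trajectory 0 -> 1 -> 0 with costs 1 and 2, starting from v = 0.
updated = td_lambda_update([0, 1, 0], [1.0, 2.0], [0.0, 0.0],
                           theta=0.5, lam=0.5, gamma=0.1)
print(updated)
```

All temporal differences are computed from the value estimate held at the start of the pass, which mirrors the use of a fixed $v_t$ within one simulation.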
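For the discounted case, there are two possibilities for the eligibility vectors $z_m(i)$ leading to two different TD($\lambda$) iterations:

1. The every-visit TD($\lambda$) method, where the eligibility coefficients are
\[ z_m(i) := \begin{cases} \vartheta\lambda\, z_{m-1}(i) & \text{if } i_m \ne i, \\ \vartheta\lambda\, z_{m-1}(i) + 1 & \text{if } i_m = i. \end{cases} \]

2. The restart TD($\lambda$) method, where the eligibility coefficients are
\[ z_m(i) := \begin{cases} \vartheta\lambda\, z_{m-1}(i) & \text{if } i_m \ne i, \\ 1 & \text{if } i_m = i. \end{cases} \]

We make the following assumptions about the eligibility coefficients that are sufficient for a proof of convergence.

Assumption 3.7. The eligibility coefficients $z_m$ satisfy the following conditions:

1. $z_m(i) \ge 0$.
2. $z_{-1}(i) = 0$.
3. $z_m(i) \le \vartheta\, z_{m-1}(i)$ if $i \notin \{ i_0, i_1, \dots \}$.
4. The weight $z_m(i)$ given to the temporal difference $d_m$ should be chosen before this temporal difference is generated.

Note that the eligibility coefficients of both the every-visit and restart TD($\lambda$) iterations satisfy Assumption 3.7. In the robust setting, we are interested in estimating the robust value of a policy $\pi$, which from Theorem 2.2 we may express as
\[ v_\pi(i) := c(i, \pi(i)) + \vartheta \max_{p \in \mathcal{P}_i^{\pi(i)}} \mathbb{E}_{j \sim p}\left[ v_\pi(j) \right], \tag{46} \]
where the expectation is now computed over the probability vector $p$ chosen adversarially from the uncertainty region $\mathcal{P}_i^{\pi(i)}$. As in Section 3.1, we may decompose $\max_{p \in \mathcal{P}_i^{\pi(i)}} \mathbb{E}_{j \sim p}[v(j)] = \sigma_{\mathcal{P}_i^{\pi(i)}}(v)$ as
\[ \max_{p \in \mathcal{P}_i^{\pi(i)}} \mathbb{E}_{j \sim p}\left[ v(j) \right] = \sigma_{\mathcal{U}_i^{\pi(i)}}(v) + \mathbb{E}_{j \sim p_i^{\pi(i)}}\left[ v(j) \right], \tag{47} \]
where $p_i^{\pi(i)}$ is the transition probability of the agent during a simulation. For the remainder of this section, we will drop the subscript and just use $\mathbb{E}$ to denote expectation with respect to this transition probability $p_i^{\pi(i)}$.

Define a simulation to be a trajectory $\{ i_0, i_1, \dots, i_{N_t} \}$ of the agent, which is stopped according to a random stopping time $N_t$. Note that $N_t$ is a random variable for making stopping decisions that is not allowed to foresee the future. Let $\mathcal{F}_t$ denote the history of the algorithm up to the point where the $t$-th simulation is about to commence. Let $v_t$ be the estimate of the value function at the start of the $t$-th simulation. Let $\{ i_0, i_1, \dots, i_{N_t} \}$ be the trajectory of the agent during the $t$-th simulation with $i_0 = i$. During training, we generate several simulations of the agent and update the estimate of the robust value function using the robust temporal difference $\tilde{d}_m$, which is defined as
\begin{align}
\tilde{d}_m &:= d_m + \vartheta\, \sigma_{\hat{\mathcal{U}}_{i_m}^{\pi(i_m)}}(v_t) \tag{48} \\
&= c(i_m, \pi(i_m)) + \vartheta\, v_t(i_{m+1}) - v_t(i_m) + \vartheta\, \sigma_{\hat{\mathcal{U}}_{i_m}^{\pi(i_m)}}(v_t), \tag{49}
\end{align}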

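where $d_m$ is the usual temporal difference defined as before
\[ d_m := c(i_m, \pi(i_m)) + \vartheta\, v_t(i_{m+1}) - v_t(i_m). \tag{50} \]
The robust TD-update is now the usual TD-update, except that we use the robust temporal difference computed over the proxy confidence region:
\begin{align}
v_{t+1}(i) &:= v_t(i) + \gamma_t \sum_{m=0}^{N_t - 1} z_m(i)\, \tilde{d}_m \tag{51} \\
&= v_t(i) + \gamma_t \sum_{m=0}^{N_t - 1} z_m(i) \left( \vartheta\, \sigma_{\hat{\mathcal{U}}_{i_m}^{\pi(i_m)}}(v_t) + d_m \right). \tag{52}
\end{align}
We define an $\varepsilon$-approximate value function for a fixed policy $\pi$ in a way similar to the $\varepsilon$-optimal Q-factors as in Definition 3.1:

Definition 3.8 ($\varepsilon$-approximate value function). Given a policy $\pi$, we say that a vector $v' \in \mathbb{R}^n$ is an $\varepsilon$-approximation of $v_\pi$ if the following holds: $\| v' - v_\pi \|_\infty \le \varepsilon \| v_\pi \|_\infty$.

The following theorem guarantees convergence of the robust TD iteration of equation 51 to an approximate value function for $\pi$ under Assumption 3.7.

Theorem 3.9. Let $\beta_i^a$ be as in Lemma 3.2 and let $\beta := \max_{i \in \mathcal{X}, a \in \mathcal{A}} \beta_i^a$. Let $\rho := \max_{i \in \mathcal{X}} \sum_{m=0}^{\infty} z_m(i)$. If $\vartheta(1 + \rho\beta) < 1$ then the robust TD-iteration of equation 51 converges to an $\varepsilon$-approximate value function, where $\varepsilon := \frac{\vartheta\beta}{1 - \vartheta(1 + \rho\beta)}$. In particular if $\beta_i^a = \beta = 0$, i.e., the proxy confidence region $\hat{\mathcal{U}}_i^a$ is the same as the true confidence region $\mathcal{U}_i^a$, then the convergence is exact, i.e., $\varepsilon = 0$. Note that in the special case of regular TD($\lambda$) iterations, $\rho = \frac{\vartheta\lambda}{1 - \vartheta\lambda}$.

Proof. Let $\hat{\mathcal{P}}_i^a$ be the proxy uncertainty set for state $i \in \mathcal{X}$ and action $a \in \mathcal{A}$ as in the proof of Theorem 3.3, i.e., $\hat{\mathcal{P}}_i^a := \{ x + p_i^a \mid x \in \hat{\mathcal{U}}_i^a \}$. Let $I_t(i) := \{ m \mid i_m = i \}$ be the set of time indices at which the $t$-th simulation visits state $i$. We define
\[ \delta_t(i) := \max_{q_m \in \mathcal{P}_{i_m}^{\pi(i_m)}} \mathbb{E}_{i_m \sim q_m}\left[ \sum_{m \in I_t(i)} z_m(i) \;\middle|\; \mathcal{F}_t \right], \]
so that we may write the update of equation 51 as
\begin{align}
v_{t+1}(i) = {} & v_t(i) \left( 1 - \gamma_t \delta_t(i) \right) + \gamma_t \delta_t(i) \left( \frac{\mathbb{E}\left[ \sum_{m=0}^{N_t - 1} z_m(i)\, \tilde{d}_m \mid \mathcal{F}_t \right]}{\delta_t(i)} + v_t(i) \right) \tag{53} \\
& + \gamma_t \delta_t(i)\, \frac{\sum_{m=0}^{N_t - 1} z_m(i)\, \tilde{d}_m - \mathbb{E}\left[ \sum_{m=0}^{N_t - 1} z_m(i)\, \tilde{d}_m \mid \mathcal{F}_t \right]}{\delta_t(i)}. \tag{54}
\end{align}
Let us define the operator $\tilde{H}_t : \mathbb{R}^n \to \mathbb{R}^n$ corresponding to the $t$-th simulation as
\[ (\tilde{H}_t v)(i) := \frac{\mathbb{E}\left[ \sum_{m=0}^{N_t - 1} z_m(i) \left( c(i_m, \pi(i_m)) + \vartheta\, \sigma_{\hat{\mathcal{U}}_{i_m}^{\pi(i_m)}}(v) + \vartheta\, v(i_{m+1}) - v(i_m) \right) \;\middle|\; \mathcal{F}_t \right]}{\delta_t(i)} + v(i). \tag{55} \]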

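We claim as in the proof of Theorem 3.3 that a solution $v'$ to $\tilde{H}_t v = v$ must be an $\varepsilon$-approximation to $v_\pi$. Define the operator $H_t$ with the proxy confidence regions replaced by the true ones, i.e.,
\[ (H_t v)(i) := \frac{\mathbb{E}\left[ \sum_{m=0}^{N_t - 1} z_m(i) \left( c(i_m, \pi(i_m)) + \vartheta\, \sigma_{\mathcal{U}_{i_m}^{\pi(i_m)}}(v) + \vartheta\, v(i_{m+1}) - v(i_m) \right) \;\middle|\; \mathcal{F}_t \right]}{\delta_t(i)} + v(i). \tag{56} \]
Note that $H_t v_\pi = v_\pi$ for the robust value function $v_\pi$ since $c(i_m, \pi(i_m)) + \vartheta\, \sigma_{\mathcal{U}_{i_m}^{\pi(i_m)}}(v_\pi) + \vartheta\, v_\pi(i_{m+1}) - v_\pi(i_m) = 0$ for every $i_m \in \mathcal{X}$ by Theorem 2.2. Finally by Lemma 3.2 we have
\[ \sigma_{\hat{\mathcal{U}}_{i_m}^{\pi(i_m)}}(v) + \mathbb{E}\left[ v(i_m) \right] \le \sigma_{\mathcal{U}_{i_m}^{\pi(i_m)}}(v) + \mathbb{E}\left[ v(i_m) \right] + \beta \| v \|_\infty, \tag{57} \]
for any vector $v$, where the expectation is over the state $i_m \sim p_{i_{m-1}}^{\pi(i_{m-1})}$. Thus for any solution $v'$ to the equation $\tilde{H}_t v = v$, we have
\begin{align}
| v'(i) - v_\pi(i) | &= \left| (\tilde{H}_t v')(i) - v_\pi(i) \right| \tag{58} \\
&\le \left| (H_t v')(i) - v_\pi(i) \right| + \vartheta\beta \| v' \|_\infty\, \mathbb{E}\left[ \sum_{m=0}^{N_t - 1} z_m(i) \right] \tag{59} \\
&= \left| (H_t v')(i) - (H_t v_\pi)(i) \right| + \vartheta\beta \| v' \|_\infty\, \mathbb{E}\left[ \sum_{m=0}^{N_t - 1} z_m(i) \right] \tag{60} \\
&\le \vartheta \| v' - v_\pi \|_\infty + \vartheta\rho\beta \| v' \|_\infty, \tag{61}
\end{align}
where equation 61 follows from equation 56. Therefore the solution to $\tilde{H}_t v = v$ is an $\varepsilon$-approximation to $v_\pi$ for $\varepsilon = \frac{\vartheta\beta}{1 - \vartheta(1 + \rho\beta)}$ if $\vartheta(1 + \rho\beta) < 1$, as in the proof of Theorem 3.3. Note that the operator $\tilde{H}_t$ applied to the iterates $v_t$ is
\[ (\tilde{H}_t v_t)(i) = \frac{\mathbb{E}\left[ \sum_{m=0}^{N_t - 1} z_m(i)\, \tilde{d}_m \mid \mathcal{F}_t \right]}{\delta_t(i)} + v_t(i), \]
so that the update of equation 51 is a stochastic approximation algorithm of the form
\[ v_{t+1}(i) = (1 - \tilde{\gamma}_t)\, v_t(i) + \tilde{\gamma}_t \left( (\tilde{H}_t v_t)(i) + \eta_t(i) \right), \]
where $\tilde{\gamma}_t = \gamma_t \delta_t(i)$ and $\eta_t$ is a noise term with zero mean and is defined as
\[ \eta_t(i) := \frac{\sum_{m=0}^{N_t - 1} z_m(i)\, \tilde{d}_m - \mathbb{E}\left[ \sum_{m=0}^{N_t - 1} z_m(i)\, \tilde{d}_m \mid \mathcal{F}_t \right]}{\delta_t(i)}. \tag{62} \]
Note that by Lemma 5.1 of [5], the new step sizes satisfy $\sum_{t=0}^{\infty} \tilde{\gamma}_t = \infty$ and $\sum_{t=0}^{\infty} \tilde{\gamma}_t^2 < \infty$ if the original step size $\gamma_t$ satisfies the conditions $\sum_{t=0}^{\infty} \gamma_t = \infty$ and $\sum_{t=0}^{\infty} \gamma_t^2 < \infty$, since the conditions on the eligibility coefficients are unchanged. Note that the noise term also satisfies the bounded variance of Lemma 5.2 of [5] since any $q \in \mathcal{P}_i^{\pi(i)}$ still specifies a distribution as $\mathcal{P}_i^{\pi(i)} \subseteq \Delta_n$. Therefore, it remains to show that $\tilde{H}_t$ is a norm contraction with respect to the $\ell_\infty$ norm on $v$. Let us define the operator $A_t$ as
\[ (A_t v)(i) := \frac{\mathbb{E}\left[ \sum_{m=0}^{N_t - 1} z_m(i) \left( \vartheta\, \sigma_{\hat{\mathcal{U}}_{i_m}^{\pi(i_m)}}(v) + \vartheta\, v(i_{m+1}) - v(i_m) \right) \;\middle|\; \mathcal{F}_t \right]}{\delta_t(i)} + v(i) \tag{63} \]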

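and the expression $b_t(i) := \frac{\mathbb{E}\left[ \sum_{m=0}^{N_t - 1} z_m(i)\, c(i_m, \pi(i_m)) \mid \mathcal{F}_t \right]}{\delta_t(i)}$ so that $\tilde{H}_t v = A_t v + b_t$. We will show that $\| A_t v \|_\infty \le \alpha \| v \|_\infty$ for some $\alpha < 1$, from which the contraction on $\tilde{H}_t$ follows because for any vector $v \in \mathbb{R}^n$ and the $\varepsilon$-optimal value function $v' = \tilde{H}_t v'$ we have
\[ \| \tilde{H}_t v - v' \|_\infty = \| \tilde{H}_t v - \tilde{H}_t v' \|_\infty = \| A_t (v - v') \|_\infty \le \alpha \| v - v' \|_\infty. \tag{64} \]
Let us now analyze the expression for $A_t$. We will show that
\begin{align}
& \mathbb{E}\left[ \sum_{m=0}^{N_t - 1} z_m(i) \left( \vartheta\, v(i_{m+1}) - v(i_m) + \vartheta\, \sigma_{\hat{\mathcal{U}}_{i_m}^{\pi(i_m)}}(v) \right) + \sum_{m \in I_t(i)} z_m(i)\, v(i) \;\middle|\; \mathcal{F}_t \right] \tag{65} \\
& \quad \le \alpha \| v \|_\infty\, \mathbb{E}\left[ \sum_{m \in I_t(i)} z_m(i) \;\middle|\; \mathcal{F}_t \right]. \tag{66}
\end{align}
We first replace the $\sigma_{\hat{\mathcal{U}}_{i_m}^{\pi(i_m)}}$ term with $\sigma_{\mathcal{U}_{i_m}^{\pi(i_m)}}$ using Lemma 3.2 while incurring a $\rho\beta \| v \|_\infty$ penalty. Let us collect together the coefficients corresponding to $v(i_m)$ in the expression for the expectation:
\begin{align}
& \mathbb{E}\left[ \sum_{m=0}^{N_t - 1} z_m(i) \left( \vartheta\, v(i_{m+1}) - v(i_m) + \vartheta\, \sigma_{\mathcal{U}_{i_m}^{\pi(i_m)}}(v) \right) + \sum_{m \in I_t(i)} z_m(i)\, v(i) \;\middle|\; \mathcal{F}_t \right] + \vartheta\rho\beta \| v \|_\infty \tag{67} \\
& \quad \le \max_{q_m \in \mathcal{P}_{i_m}^{\pi(i_m)}} \mathbb{E}_{i_m \sim q_m}\left[ \sum_{m=0}^{N_t - 1} z_m(i) \left( \vartheta\, v(i_{m+1}) - v(i_m) \right) + \sum_{m \in I_t(i)} z_m(i)\, v(i) \;\middle|\; \mathcal{F}_t \right] + \vartheta\rho\beta \| v \|_\infty \tag{68} \\
& \quad = \max_{q_m \in \mathcal{P}_{i_m}^{\pi(i_m)}} \mathbb{E}_{i_m \sim q_m}\left[ \sum_{m=0}^{N_t} \left( \vartheta\, z_{m-1}(i) - z_m(i) \right) v(i_m) + \sum_{m \in I_t(i)} z_m(i)\, v(i) \;\middle|\; \mathcal{F}_t \right] + \vartheta\rho\beta \| v \|_\infty, \tag{69}
\end{align}
where we obtain inequality 68 by subsuming the $\sigma_{\mathcal{U}_{i_m}^{\pi(i_m)}}$ term within the expectation, since $\mathcal{P}_{i_m}^{\pi(i_m)}$ is now part of the simplex $\Delta_n$, and taking the worst possible distribution $q_m$. We also used the fact that $z_{-1}(i) = 0$ and $z_{N_t}(i) = 0$. Note that whenever $i_m \ne i$, the coefficient $\vartheta z_{m-1}(i) - z_m(i)$ of $v(i_m)$ is nonnegative, while whenever $i_m = i$, the coefficient $\vartheta z_{m-1}(i) - z_m(i) + z_m(i)$ is also nonnegative. Therefore, we may bound the right hand side of equation 67 as
\begin{align}
& \max_{q_m \in \mathcal{P}_{i_m}^{\pi(i_m)}} \mathbb{E}_{i_m \sim q_m}\left[ \sum_{m=0}^{N_t} \left( \vartheta\, z_{m-1}(i) - z_m(i) \right) v(i_m) + \sum_{m \in I_t(i)} z_m(i)\, v(i) \;\middle|\; \mathcal{F}_t \right] + \vartheta\rho\beta \| v \|_\infty \tag{70} \\
& \quad \le \max_{q_m \in \mathcal{P}_{i_m}^{\pi(i_m)}} \mathbb{E}_{i_m \sim q_m}\left[ \sum_{m=0}^{N_t} \left( \vartheta\, z_{m-1}(i) - z_m(i) \right) \| v \|_\infty + \sum_{m \in I_t(i)} z_m(i) \| v \|_\infty \;\middle|\; \mathcal{F}_t \right] + \vartheta\rho\beta \| v \|_\infty. \tag{71}
\end{align}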

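Let us now collect the terms corresponding to a fixed $z_m(i)$:
\begin{align}
& \max_{q_m \in \mathcal{P}_{i_m}^{\pi(i_m)}} \mathbb{E}_{i_m \sim q_m}\left[ \sum_{m=0}^{N_t} \left( \vartheta\, z_{m-1}(i) - z_m(i) \right) \| v \|_\infty + \sum_{m \in I_t(i)} z_m(i) \| v \|_\infty \;\middle|\; \mathcal{F}_t \right] + \vartheta\rho\beta \| v \|_\infty \tag{72} \\
& \quad = \| v \|_\infty \max_{q_m \in \mathcal{P}_{i_m}^{\pi(i_m)}} \mathbb{E}_{i_m \sim q_m}\left[ \sum_{m=0}^{N_t - 1} z_m(i) (\vartheta - 1) + \sum_{m \in I_t(i)} z_m(i) \;\middle|\; \mathcal{F}_t \right] + \vartheta\rho\beta \| v \|_\infty \tag{73} \\
& \quad \le \| v \|_\infty \max_{q_m \in \mathcal{P}_{i_m}^{\pi(i_m)}} \mathbb{E}_{i_m \sim q_m}\left[ \sum_{m \in I_t(i)} z_m(i) (\vartheta - 1) + \sum_{m \in I_t(i)} z_m(i) \;\middle|\; \mathcal{F}_t \right] + \vartheta\rho\beta \| v \|_\infty \tag{74} \\
& \quad \le \| v \|_\infty\, \vartheta (1 + \rho\beta)\, \mathbb{E}\left[ \sum_{m \in I_t(i)} z_m(i) \;\middle|\; \mathcal{F}_t \right], \tag{75}
\end{align}
where equation 74 follows since $\vartheta < 1$. Therefore setting $\alpha = \vartheta(1 + \rho\beta)$, our claim follows under the assumption that $\vartheta(1 + \rho\beta) < 1$.

4 Robust Reinforcement Learning with function approximation

In Section 3 we derived robust versions of exact dynamic programming algorithms such as Q-learning, SARSA, and TD-learning respectively. If the state space $\mathcal{X}$ of the MDP is large then it is prohibitive to maintain a lookup table entry for every state. A standard approach for large scale MDPs is to use the approximate dynamic programming (ADP) framework [19]. In this setting, the problem is parametrized by a smaller dimensional vector $\theta \in \mathbb{R}^d$ where $d \ll n = |\mathcal{X}|$. The natural generalizations of the Q-learning, SARSA, and TD-learning algorithms of Section 3 are via the projected Bellman equation, where we project back to the space spanned by all the parameters in $\theta \in \mathbb{R}^d$, since they are the value functions representable by the model. Convergence for these algorithms, even in the non-robust setting, is known only for linear architectures, see e.g., [3]. Recent work by [7] proposed stochastic gradient descent algorithms with convergence guarantees for smooth nonlinear function architectures, where the problem is framed in terms of minimizing a loss function. We give robust versions of both these approaches.

4.1 Robust approximations with linear architectures

In the approximate setting with linear architectures, we approximate the value function $v_\pi$ of a policy $\pi$ by $\Phi\theta$ where $\theta \in \mathbb{R}^d$ and $\Phi$ is an $n \times d$ feature matrix with rows $\phi(j)$ for every state $j \in \mathcal{X}$ representing its feature vector. Let $S$ be the span of the columns of $\Phi$, i.e., $S := \{ \Phi\theta \mid \theta \in \mathbb{R}^d \}$ is the set of representable value functions. Define the operator $T_\pi : \mathbb{R}^n \to \mathbb{R}^n$ as
\[ (T_\pi v)(i) := c(i, \pi(i)) + \vartheta \sum_{j \in \mathcal{X}} p_{ij}^{\pi(i)} v(j), \tag{76} \]
so that the true value function $v_\pi$ satisfies $T_\pi v_\pi = v_\pi$.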
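The operator of equation 76 and its fixed point can be illustrated on a small example. The sketch below is a minimal non-robust illustration with hypothetical names (`bellman_operator`, `evaluate_policy`) and a hypothetical two-state chain; it assumes the transition probabilities under the policy are known, which is exactly what the model-free setting lacks, so it is meant only to make the fixed-point property concrete.

```python
def bellman_operator(v, cost_pi, P_pi, theta):
    """(T_pi v)(i) = c(i, pi(i)) + theta * sum_j p_ij v(j), as in equation 76."""
    n = len(v)
    return [cost_pi[i] + theta * sum(P_pi[i][j] * v[j] for j in range(n))
            for i in range(n)]

def evaluate_policy(cost_pi, P_pi, theta, tol=1e-12, max_iter=10000):
    """Iterate T_pi from zero until successive iterates differ by less than tol."""
    v = [0.0] * len(cost_pi)
    for _ in range(max_iter):
        new_v = bellman_operator(v, cost_pi, P_pi, theta)
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v

# Hypothetical chain: state 1 is absorbing and free, state 0 costs 1 per step.
cost_pi = [1.0, 0.0]
P_pi = [[0.5, 0.5], [0.0, 1.0]]
v_pi = evaluate_policy(cost_pi, P_pi, theta=0.5)
print(v_pi)
```

Because the operator is a contraction with modulus `theta` in the maximum norm, the iteration converges geometrically; solving the two equations by hand gives $v(1) = 0$ and $v(0) = 1 + 0.25\, v(0)$, i.e. $v(0) = 4/3$.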
A natural approach towards estimating $v_\pi$ given a current estimate $\Phi\theta_t$ is to compute $T_\pi(\Phi\theta_t)$ and project it back to $S$ to get the next parameter $\theta_{t+1}$. The motivation behind such an iteration is the fact that the true value function is a fixed point of this operation if it belonged to the subspace $S$. This gives rise to the projected Bellman equation where the projection $\Pi$ is typically taken with respect to a weighted Euclidean norm $\| \cdot \|_\xi$, i.e., $\| x \|_\xi^2 = \sum_{i \in \mathcal{X}} \xi_i x_i^2$, where $\xi$ is some probability distribution over the states $\mathcal{X}$, see [3] for a survey.

In the model free case, where we do not have explicit knowledge of the transition probabilities, various methods like LSTD($\lambda$), LSPE($\lambda$), and TD($\lambda$) have been proposed, see e.g., [4, 9, 8, 16, 23, 24]. The key idea behind proving convergence for these methods is to show that the mapping $\Pi T_\pi$ is a contraction mapping with respect to the norm $\| \cdot \|_\xi$ for some distribution $\xi$ over the states $\mathcal{X}$. While the operator $T_\pi$ in the non-robust case is linear and is a contraction in the $\ell_\infty$ norm as in Section 3, the projection operator with respect to such norms is not guaranteed to be a contraction. However, it is known that if $\xi$ is the steady state distribution of the policy $\pi$ under evaluation, then $\Pi$ is non-expansive in $\| \cdot \|_\xi$ [5, 3]. Hence because of discounting, the mapping $\Pi T_\pi$ is a contraction.

We generalize these methods to the robust setting via the robust Bellman operators $T_\pi$ defined as
\[ (T_\pi v)(i) := c(i, \pi(i)) + \vartheta\, \sigma_{\mathcal{P}_i^{\pi(i)}}(v). \tag{77} \]
Since we do not have access to the simulator probabilities $p_i^a$, we will use a proxy set $\hat{\mathcal{P}}_i^a$ as in Section 3, with the proxy operator denoted by $\hat{T}_\pi$. While the iterative methods of the non-robust setting generalize via the robust operator $T_\pi$ and the robust projected Bellman equation $\Phi\theta = \Pi T_\pi(\Phi\theta)$, it is however not clear how to choose the distribution $\xi$ under which the projected operator $\Pi T_\pi$ is a contraction in order to show convergence. Let $\xi$ be the steady state distribution of the exploration policy $\hat{\pi}$ of the MDP with transition probability matrix $P^{\hat{\pi}}$, i.e., the policy with which the agent chooses its actions during the simulation. We make the following assumption on the discount factor $\vartheta$ as in [26].

Assumption 4.1. For every state $i \in \mathcal{X}$ and action $a \in \mathcal{A}$, there exists a constant $\alpha \in (0, 1)$ such that for any $p \in \mathcal{P}_i^a$ we have $\vartheta p_j \le \alpha P_{ij}^{\hat{\pi}}$ for every $j \in \mathcal{X}$.

Assumption 4.1 might appear artificially restrictive; however, it is necessary to prove that $\Pi T_\pi$ is a contraction. While [26] require this assumption for proving convergence of robust MDPs, a similar assumption is also required in proving convergence of off-policy Reinforcement Learning methods of [6] where the states are sampled from an exploration policy $\hat{\pi}$ which is not necessarily the same as the policy $\pi$ under evaluation. Note that in the robust setting, all methods are necessarily off-policy since the transition matrices are not fixed for a given policy.

The following lemma is a $\xi$-weighted Euclidean norm version of Lemma 3.2.

Lemma 4.2. Let $v \in \mathbb{R}^n$ be any vector and let $\beta_i^a := \max_{y \in \hat{\mathcal{U}}_i^a} \min_{x \in \mathcal{U}_i^a} \frac{\| y - x \|_\xi}{\xi_{\min}}$, where $\xi_{\min} := \min_{i \in \mathcal{X}} \xi_i$. Then we have
\[ \sigma_{\hat{\mathcal{P}}_i^a}(v) \le \sigma_{\mathcal{P}_i^a}(v) + \beta_i^a \| v \|_\xi. \tag{78} \]

Proof.
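Same as Lemma 3.2 except now we take Cauchy-Schwarz with respect to the weighted Euclidean norm $\|\cdot\|_\xi$ in the following manner:

$$a^\top b \le \frac{a^\top \Xi b}{\xi_{\min}} \le \frac{\|a\|_\xi \|b\|_\xi}{\xi_{\min}}. \tag{79}$$

The following theorem shows that the robust projected Bellman equation is a contraction under reasonable assumptions on the discount factor $\vartheta$.

Theorem 4.3. Let $\beta_i^a$ be as in Lemma 4.2 and let $\beta := \max_{i \in X} \beta_i^{\pi(i)}$. If the discount factor $\vartheta$ satisfies Assumption 4.1 for some $\alpha$ and $\alpha^2 + \vartheta^2\beta^2 < \frac{1}{2}$, then the operator $\hat{T}_\pi$ is a contraction with respect to $\|\cdot\|_\xi$. In other words, for any two $\theta, \theta' \in \mathbb{R}^d$, we have

$$\|\hat{T}_\pi(\Phi\theta) - \hat{T}_\pi(\Phi\theta')\|_\xi^2 \le 2(\alpha^2 + \vartheta^2\beta^2)\,\|\Phi\theta - \Phi\theta'\|_\xi^2 < \|\Phi\theta - \Phi\theta'\|_\xi^2. \tag{80}$$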

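If $\beta_i = \beta = 0$ so that $\hat{U}_i^{\pi(i)} = U_i^{\pi(i)}$, then we have a simpler contraction under the assumption that $\alpha < 1$, i.e.,

$$\|\hat{T}_\pi(\Phi\theta) - \hat{T}_\pi(\Phi\theta')\|_\xi \le \alpha\,\|\Phi\theta - \Phi\theta'\|_\xi < \|\Phi\theta - \Phi\theta'\|_\xi. \tag{81}$$

Proof. Consider two parameters $\theta$ and $\theta'$ in $\mathbb{R}^d$. Then we have

$$
\begin{aligned}
\|\hat{T}_\pi(\Phi\theta) - \hat{T}_\pi(\Phi\theta')\|_\xi^2
&= \sum_{i \in X} \xi_i \left( (\hat{T}_\pi\Phi\theta)(i) - (\hat{T}_\pi\Phi\theta')(i) \right)^2 && (82) \\
&= \vartheta^2 \sum_{i \in X} \xi_i \left( \sigma_{\hat{P}_i^{\pi(i)}}(\Phi\theta) - \sigma_{\hat{P}_i^{\pi(i)}}(\Phi\theta') \right)^2 && (83) \\
&= \vartheta^2 \sum_{i \in X} \xi_i \Big( \sup_{q \in \hat{P}_i^{\pi(i)}} q^\top\Phi\theta - \sup_{q' \in \hat{P}_i^{\pi(i)}} q'^\top\Phi\theta' \Big)^2 && (84) \\
&\le \vartheta^2 \sum_{i \in X} \xi_i \Big( \sup_{q \in \hat{P}_i^{\pi(i)}} q^\top(\Phi\theta - \Phi\theta') \Big)^2 && (85) \\
&\le \vartheta^2 \sum_{i \in X} \xi_i \Big( \sup_{q \in P_i^{\pi(i)}} q^\top(\Phi\theta - \Phi\theta') + \beta\,\|\Phi\theta - \Phi\theta'\|_\xi \Big)^2 && (86) \\
&\le \sum_{i \in X} \xi_i \Big( \alpha \sum_{j \in X} P^{\hat{\pi}}_{ij} \left| \phi(j)^\top\theta - \phi(j)^\top\theta' \right| + \vartheta\beta\,\|\Phi\theta - \Phi\theta'\|_\xi \Big)^2 && (87) \\
&\le 2 \sum_{i \in X} \xi_i\, \alpha^2 \sum_{j \in X} P^{\hat{\pi}}_{ij} \left( \phi(j)^\top\theta - \phi(j)^\top\theta' \right)^2 + 2\vartheta^2\beta^2\,\|\Phi\theta - \Phi\theta'\|_\xi^2 && (88) \\
&\le 2(\alpha^2 + \vartheta^2\beta^2)\,\|\Phi\theta - \Phi\theta'\|_\xi^2, && (89)
\end{aligned}
$$

where we used Lemma 4.2 and the definition of $\beta$ in line 86, Assumption 4.1 in line 87, the inequality $(a + b)^2 \le 2(a^2 + b^2)$ together with Jensen's inequality in line 88, and the fact that $\xi$ is the steady state distribution of $P^{\hat{\pi}}$, so that $\sum_{i \in X} \xi_i P^{\hat{\pi}}_{ij} = \xi_j$, in line 89. Note that if $\beta_i^{\pi(i)} = \beta = 0$ so that the proxy confidence region is the same as the true confidence region, then we have the simple upper bound of $\|\hat{T}_\pi(\Phi\theta) - \hat{T}_\pi(\Phi\theta')\|_\xi^2 \le \alpha^2 \|\Phi\theta - \Phi\theta'\|_\xi^2$ instead of $\|\hat{T}_\pi(\Phi\theta) - \hat{T}_\pi(\Phi\theta')\|_\xi^2 \le 2\alpha^2 \|\Phi\theta - \Phi\theta'\|_\xi^2$, since we do not have the cross term in equation 87 in this case.

The following corollary shows that the solution to the proxy projected Bellman equation converges to a solution that is not too far away from the true value function $v_\pi$.

Corollary 4.4. Let Assumption 4.1 hold and let $\beta$ be as in Theorem 4.3. Let $\tilde{v}_\pi$ be the fixed point of the projected Bellman equation for the proxy operator $\hat{T}_\pi$, i.e., $\Pi\hat{T}_\pi\tilde{v}_\pi = \tilde{v}_\pi$. Let $\hat{v}_\pi$ be the fixed point of the proxy operator $\hat{T}_\pi$, i.e., $\hat{T}_\pi\hat{v}_\pi = \hat{v}_\pi$. Let $v_\pi$ be the true value function of the policy $\pi$, i.e., $T_\pi v_\pi = v_\pi$. Then the following holds:

$$\|\tilde{v}_\pi - v_\pi\|_\xi \le \frac{\vartheta\beta\,\|v_\pi\|_\xi + \|\Pi v_\pi - v_\pi\|_\xi}{1 - \sqrt{2(\alpha^2 + \vartheta^2\beta^2)}}. \tag{90}$$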

In particular if $\beta_i = \beta = 0$, i.e., the proxy confidence region is actually the true confidence region, then the proxy projected Bellman equation has a solution satisfying $\|\tilde{v}_\pi - v_\pi\|_\xi \le \frac{\|\Pi v_\pi - v_\pi\|_\xi}{1 - \alpha}$.

Proof. We have the following expression:

$$
\begin{aligned}
\|\tilde{v}_\pi - v_\pi\|_\xi
&\le \|\tilde{v}_\pi - \Pi v_\pi\|_\xi + \|\Pi v_\pi - v_\pi\|_\xi && (91) \\
&= \|\Pi\hat{T}_\pi\tilde{v}_\pi - \Pi T_\pi v_\pi\|_\xi + \|\Pi v_\pi - v_\pi\|_\xi && (92) \\
&\le \|\Pi\hat{T}_\pi\tilde{v}_\pi - \Pi\hat{T}_\pi v_\pi\|_\xi + \vartheta\beta\,\|v_\pi\|_\xi + \|\Pi v_\pi - v_\pi\|_\xi && (93) \\
&\le \sqrt{2(\alpha^2 + \vartheta^2\beta^2)}\;\|\tilde{v}_\pi - v_\pi\|_\xi + \vartheta\beta\,\|v_\pi\|_\xi + \|\Pi v_\pi - v_\pi\|_\xi, && (95)
\end{aligned}
$$

where we used Lemma 4.2 to derive inequality 93 and Theorem 4.3 to conclude that $\|\Pi\hat{T}_\pi\tilde{v}_\pi - \Pi\hat{T}_\pi v_\pi\|_\xi \le \sqrt{2(\alpha^2 + \vartheta^2\beta^2)}\,\|\tilde{v}_\pi - v_\pi\|_\xi$. Rearranging gives the claimed bound. If $\beta_i^{\pi(i)} = \beta = 0$ so that the proxy confidence regions are the same as the true confidence regions, then we have $\alpha$ instead of $\sqrt{2(\alpha^2 + \vartheta^2\beta^2)}$ in the last equation due to Theorem 4.3.

Theorem 4.3 guarantees that the robust projected Bellman iterations of the LSTD($\lambda$), LSPE($\lambda$) and TD($\lambda$) methods converge, while Corollary 4.4 guarantees that the solution they converge to is not too far away from the true value function $v_\pi$. We refer the reader to [3] for more details on LSTD($\lambda$) and LSPE($\lambda$), since their proof of convergence is analogous to that of TD($\lambda$).

4.2 Robust stochastic gradient descent algorithms

While the TD($\lambda$)-learning algorithm with function approximation with linear architectures converges to $v_\pi$ if the states are sampled according to the policy $\pi$, it is known to be unstable if the states are sampled in an off-policy manner, i.e., in the terminology of the previous section, $\hat{\pi} \ne \pi$. This issue was addressed by [23, 24], who proposed a stochastic gradient descent based TD(0) algorithm that converges for linear architectures in the off-policy setting. This was further extended by [7] to approximations using arbitrary smooth functions, with convergence to a local optimum. In this section we show how to extend these off-policy methods to the robust setting with uncertain transitions. Note that this is an alternative approach to the requirement of Assumption 4.1, since under this assumption all off-policy methods would also converge. The main idea of [24] is to devise stochastic gradient algorithms to minimize the following loss function, called the mean square projected Bellman error (MSPBE), also studied in [1, 12]:
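$$\mathrm{MSPBE}(\theta) := \|v_\theta - \Pi T_\pi v_\theta\|_\xi^2. \tag{96}$$

Note that the loss function is 0 for a $\theta$ that satisfies the projected Bellman equation, $\Phi\theta = \Pi T_\pi(\Phi\theta)$. Consider a linear architecture as in Section 4.1 where $v_\theta := \Phi\theta$. Let $i \in X$ be a random state chosen with distribution $\xi$. Denote $\phi(i)$ by the shorthand $\phi$ and $\phi(i')$ by $\phi'$. Then it is easy to show that

$$\mathrm{MSPBE}(\theta) = \|v_\theta - \Pi T_\pi v_\theta\|_\xi^2 = E[d\phi]^\top E[\phi\phi^\top]^{-1} E[d\phi], \tag{97}$$

where the expectation is over the random state $i$ and $d$ is the temporal difference error for the transition $(i, i')$, i.e., $d := c(i, a) + \vartheta\theta^\top\phi' - \theta^\top\phi$, where the action $a$ and the new state $i'$ are chosen according to the exploration policy $\hat{\pi}$. The negative gradient of the MSPBE function is

$$
\begin{aligned}
-\tfrac{1}{2}\nabla\mathrm{MSPBE}(\theta) &= E\left[(\phi - \vartheta\phi')\phi^\top\right] w && (98) \\
&= E[d\phi] - \vartheta E\left[\phi'\phi^\top\right] w, && (99)
\end{aligned}
$$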

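where $w = E[\phi\phi^\top]^{-1} E[d\phi]$. Both $d$ and $w$ depend on $\theta$. Since the expectation is hard to compute exactly, [24] introduce a set of weights $w_k$ whose purpose is to estimate $w$ for a fixed $\theta$. Let $d_k$ denote the temporal difference error for the parameter $\theta_k$. The weights $w_k$ are then updated on a fast time scale as

$$w_{k+1} := w_k + \beta_k \left(d_k - \phi_k^\top w_k\right)\phi_k, \tag{100}$$

while the parameter $\theta_k$ is updated on a slower timescale in the following two possible manners:

$$
\begin{aligned}
\theta_{k+1} &:= \theta_k + \alpha_k \left(\phi_k - \vartheta\phi'_k\right)\left(\phi_k^\top w_k\right) && \text{(GTD2)} \quad (101) \\
\theta_{k+1} &:= \theta_k + \alpha_k d_k \phi_k - \vartheta\alpha_k \phi'_k \left(\phi_k^\top w_k\right) && \text{(TDC)} \quad (102)
\end{aligned}
$$

[7] extended this to the case of smooth nonlinear architectures, where the space $S := \{v_\theta \mid \theta \in \mathbb{R}^d\}$ spanned by all value functions $v_\theta$ is now a differentiable sub-manifold of $\mathbb{R}^n$ rather than a linear subspace. Projecting onto such nonlinear manifolds is a computationally hard problem, and to get around this [7] project instead onto the tangent plane at $\theta$, assuming the parameter $\theta$ changes very little in one step. This allows [7] to generalize the updates of equations 100 and 101 with an additional Hessian term $\nabla^2 v_\theta$ which vanishes if $v_\theta$ is linear in $\theta$. In the following sections we extend the stochastic gradient algorithms of [7, 23, 24] to the robust setting with uncertain transition matrices. Since the number $n$ of states is prohibitively large, we will make the simplifying assumption that $U_i^a = U$ and $\hat{U}_i^a = \hat{U}$ for the results of the following sections.

Robust stochastic gradient algorithms with linear architectures

In this section we extend the results of [24] to the robust setting, where we are interested in finding a solution to the robust projected Bellman equation $\Phi\theta = \Pi T_\pi(\Phi\theta)$, where $T_\pi$ is the robust Bellman operator of equation 77. Let $\hat{T}_\pi$ denote the proxy robust Bellman operator using the proxy uncertainty set $\hat{U}$ instead of $U$. A natural generalization of [24] is to introduce the following loss function, which we call the mean squared robust projected Bellman error (MSRPBE):

$$\mathrm{MSRPBE}(\theta) := \|v_\theta - \Pi\hat{T}_\pi v_\theta\|_\xi^2, \tag{103}$$

where the proxy robust Bellman operator $\hat{T}_\pi$ is used. Note that $\hat{T}_\pi$ is no longer truly linear in $\theta$ even for linear architectures $v_\theta = \Phi\theta$, as

$$
\begin{aligned}
(\hat{T}_\pi\Phi\theta)(i) &= c(i, \pi(i)) + \vartheta\,\sigma_{\hat{P}_i^{\pi(i)}}(\Phi\theta) && (104) \\
&= c(i, \pi(i)) + \vartheta\,\theta^\top\Phi^\top p_i^{\pi(i)} + \vartheta \sup_{q \in \Phi^\top\hat{U}} q^\top\theta, && (105)
\end{aligned}
$$

where $p_i^{\pi(i)}$ is the simulator transition probability vector.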
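However, under the assumption that $\hat{U}$ is a nicely behaved set such as a ball or an ellipsoid, so that changing $\theta$ in a small neighborhood does not lead to jumps in $\sigma_{\Phi^\top\hat{U}}(\theta)$, we may define the gradient $\nabla_\theta(\hat{T}_\pi\Phi\theta)(i)$ as

$$
\begin{aligned}
\nabla_\theta(\hat{T}_\pi\Phi\theta)(i) &:= \vartheta\,\Phi^\top p_i^{\pi(i)} + \vartheta \arg\max_{q \in \Phi^\top\hat{U}} q^\top\theta && (106) \\
&= \vartheta \arg\max_{q \in \Phi^\top\hat{P}_i^{\pi(i)}} q^\top\theta. && (107)
\end{aligned}
$$

Recall the robust temporal difference error $\tilde{d}_i$ for state $i$ with respect to the proxy set $\hat{U}$ as in equation 48:

$$\tilde{d}_i := c(i, \pi(i)) + \vartheta\, v_\theta(i') + \vartheta\,\sigma_{\hat{U}}(v_\theta) - v_\theta(i). \tag{108}$$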
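The robust temporal difference error of equation 108 can be sketched in a few lines once a concrete proxy set is fixed. The snippet below is a minimal illustration, not the authors' implementation: it assumes that $\hat{U}$ is the ellipsoid $\{q : q^\top A^{-1} q \le 1\}$ for a symmetric positive definite matrix `A` (this parametrization is an assumption made here), in which case the support function has the closed form $\sqrt{v^\top A v}$. The function names and argument layout are hypothetical.

```python
import numpy as np


def support_ellipsoid(v, A):
    """Support function of the ellipsoid {q : q^T A^{-1} q <= 1}.

    For symmetric positive definite A this equals sqrt(v^T A v).
    The ellipsoid parametrization is an assumption of this sketch.
    """
    return float(np.sqrt(v @ A @ v))


def robust_td_error(cost, phi, phi_next, theta, Phi, A, discount):
    """Robust TD error for a linear architecture v_theta = Phi @ theta.

    cost      : observed cost c(i, pi(i))
    phi       : feature vector of the current state i
    phi_next  : feature vector of the sampled next state i'
    Phi       : full n x d feature matrix (materialized here for clarity only)
    """
    v = Phi @ theta                      # value estimates for all states
    sigma = support_ellipsoid(v, A)      # worst-case shift over the proxy set
    return cost + discount * (phi_next @ theta) + discount * sigma - phi @ theta
```

Setting `A` to the zero matrix removes the uncertainty term and recovers the ordinary temporal difference error, which is a convenient sanity check.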

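Under the assumption that $E[\phi\phi^\top]$ is full rank, we may write the MSRPBE loss function in terms of the robust temporal difference errors $\tilde{d}$ of equation 48 as in [24]:

$$\mathrm{MSRPBE}(\theta) = E[\tilde{d}\phi]^\top E[\phi\phi^\top]^{-1} E[\tilde{d}\phi]. \tag{109}$$

Note that if $E[\phi\phi^\top]$ is full rank, then $\mathrm{MSRPBE}(\theta) = 0$ if and only if $E[\tilde{d}\phi] = 0$ because of equation 109. Define

$$\mu_P(\theta) := \nabla_\theta \max_{y \in P} y^\top v_\theta = \nabla_\theta \max_{y \in P} y^\top\Phi\theta = \Phi^\top \arg\max_{y \in P} y^\top\Phi\theta = \arg\max_{y \in \Phi^\top P} y^\top\theta \tag{110}$$

for any convex compact set $P \subset \mathbb{R}^n$, so that the gradient of the MSRPBE loss function can be written as

$$
\begin{aligned}
-\tfrac{1}{2}\nabla\mathrm{MSRPBE}(\theta) &= E\left[\left(\phi - \vartheta\mu_{\hat{U}}(\theta) - \vartheta\phi'\right)\phi^\top\right] E[\phi\phi^\top]^{-1} E[\tilde{d}\phi] && (111) \\
&= E\left[\left(\phi - \vartheta\mu_{\hat{U}}(\theta) - \vartheta\phi'\right)\phi^\top\right] w && (112) \\
&= E[\tilde{d}\phi] - \vartheta E\left[\phi'\phi^\top\right] w - \vartheta E\left[\mu_{\hat{U}}(\theta)\phi^\top\right] w, && (113)
\end{aligned}
$$

where $w = E[\phi\phi^\top]^{-1} E[\tilde{d}\phi]$ is the same as in equation 98 and [24]. Therefore, as in [24], we have an estimator $w_k$ for the weights $w$ for a fixed parameter $\theta_k$ as

$$w_{k+1} := w_k + \beta_k \left(\tilde{d}_k - \phi_k^\top w_k\right)\phi_k, \tag{114}$$

with the corresponding parameter $\theta_k$ being updated as

$$
\begin{aligned}
\theta_{k+1} &:= \theta_k + \alpha_k \left(\phi_k - \vartheta\mu_{\hat{U}}(\theta_k) - \vartheta\phi'_k\right)\left(\phi_k^\top w_k\right) && \text{(robust-GTD2)} \quad (115) \\
\theta_{k+1} &:= \theta_k + \alpha_k \tilde{d}_k \phi_k - \vartheta\alpha_k \left(\phi'_k + \mu_{\hat{U}}(\theta_k)\right)\left(\phi_k^\top w_k\right) && \text{(robust-TDC)} \quad (116)
\end{aligned}
$$

Run time analysis: Let $T_n(P)$ denote the time to optimize linear functions over the convex set $P$ for some $P \subset \mathbb{R}^n$. Note that the values $v_\theta(i)$ can be computed simply in $O(d)$ time. Thus the updates of robust-GTD2 and robust-TDC can be computed in $O(d + T_n(\hat{U}))$ time. In particular, if the set $\hat{U}$ is a simple set like an ellipsoid with associated matrix $A$, then the optimum value $\sigma_{\hat{U}}(v_\theta)$ is simply $\sqrt{\theta^\top\Phi^\top A\Phi\theta}$, where $\Phi$ is the feature matrix. In this case we only need to compute $\Phi^\top A\Phi$ once and store it for future use. However, note that this still takes time polynomial in $n$, which is undesirable for $n \gg d$. In this case, we need to make the assumption that there are good rank-$d$ approximations to $\hat{U}$, i.e., $A \approx BB^\top$ for some $n \times d$ matrix $B$. Thus the total run time for each update in this case is $O(d^2)$.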
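The two-timescale updates of equations 114 and 115 can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' code: it takes $\hat{U}$ to be a Euclidean ball of radius `radius` centered at the origin (so the maximizer is $y^\ast = r\,\Phi\theta / \|\Phi\theta\|_2$ and $\mu_{\hat{U}}(\theta) = \Phi^\top y^\ast$), and it materializes the full feature matrix `Phi`, which costs $O(nd)$ per step and is done only for readability. All function and parameter names are hypothetical.

```python
import numpy as np


def ball_argmax(v, radius):
    """Maximizer of y^T v over the Euclidean ball of the given radius."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    return radius * v / norm


def robust_gtd2_step(theta, w, phi, phi_next, cost, Phi,
                     radius, discount, alpha, beta):
    """One robust-GTD2 step (equations 114 and 115) for a ball-shaped proxy set.

    alpha is the slow step size for theta, beta the fast step size for w.
    """
    v = Phi @ theta
    y_star = ball_argmax(v, radius)
    sigma = y_star @ v                 # support function value sigma(v_theta)
    mu = Phi.T @ y_star                # gradient of sigma with respect to theta

    # Robust temporal difference error (equation 108).
    d = cost + discount * (phi_next @ theta) + discount * sigma - phi @ theta

    # Fast timescale: weight estimator (equation 114).
    w_new = w + beta * (d - phi @ w) * phi
    # Slow timescale: parameter update (equation 115), using the old w.
    theta_new = theta + alpha * (phi - discount * mu - discount * phi_next) * (phi @ w)
    return theta_new, w_new
```

With `radius=0.0` the step reduces to the plain GTD2 update of equations 100 and 101, which makes it easy to compare the robust and nominal variants on the same sampled transitions.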
If the uncertainty set is spherically symmetric, i.e., a ball, then the expression is simply $\|\Phi\theta\|_2$, and the robust temporal difference errors of equation 48 and the updates of equations 114 and 115 can be viewed simply as the regular updates of [23] with an added noise term.

Robust stochastic gradient algorithms with nonlinear architectures

In this section we generalize the results of the previous section and show how to extend the algorithms of equations 114 and 115 to the case when the value function $v_\theta$ is no longer a linear function of $\theta$. This also generalizes the results of [7] to the robust setting, with the corresponding robust analogues of nonlinear GTD2 and nonlinear TDC respectively. Let $M := \{v_\theta \mid \theta \in \mathbb{R}^d\}$ be the manifold spanned by all possible value functions and let $PM_\theta$ be the tangent plane of $M$ at $\theta$. Let $TM_\theta$ be the tangent space, i.e., the translation of $PM_\theta$ to the origin. In other words, $TM_\theta := \{\Phi_\theta u \mid u \in \mathbb{R}^d\}$, where $\Phi_\theta$ is an $n \times d$ matrix with entries

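$(\Phi_\theta)_{i,j} := \frac{\partial}{\partial\theta_j} v_\theta(i)$. Let $\Pi_\theta$ denote the projection with respect to the weighted Euclidean norm $\|\cdot\|_\xi$ onto the space $TM_\theta$, so that

$$\Pi_\theta = \Phi_\theta\left(\Phi_\theta^\top\Xi\Phi_\theta\right)^{-1}\Phi_\theta^\top\Xi, \tag{117}$$

where $\Xi$ is the $n \times n$ diagonal matrix with entries $\xi_i$ for $i \in X$ as in Section 4.1. The mean squared projected Bellman equation (MSPBE) loss function considered by [7] can then be defined as

$$\mathrm{MSPBE}(\theta) = \|v_\theta - \Pi_\theta T v_\theta\|_\xi^2, \tag{118}$$

where we now project to the tangent space $TM_\theta$. The robust version of the MSPBE loss function, the mean squared robust projected Bellman equation (MSRPBE) loss, can then be defined in terms of the robust Bellman operator over the proxy uncertainty set $\hat{U}$:

$$\mathrm{MSRPBE}(\theta) = \|v_\theta - \Pi_\theta\hat{T} v_\theta\|_\xi^2, \tag{119}$$

and under the assumption that $E[\nabla v_\theta(i)\nabla v_\theta(i)^\top]$ is non-singular, this may be expressed in terms of the robust temporal difference error $\tilde{d}$ of equation 48 as in [7] and equation 109:

$$\mathrm{MSRPBE}(\theta) = E[\tilde{d}\,\nabla v_\theta(i)]^\top E[\nabla v_\theta(i)\nabla v_\theta(i)^\top]^{-1} E[\tilde{d}\,\nabla v_\theta(i)], \tag{120}$$

where the expectation is over the states $i \in X$ drawn from the distribution $\xi$. Note that under the assumption that $E[\nabla v_\theta(i)\nabla v_\theta(i)^\top]$ is non-singular, it follows due to equation 120 that $\mathrm{MSRPBE}(\theta) = 0$ if and only if $E[\tilde{d}\,\nabla v_\theta(i)] = 0$. Since $v_\theta$ is no longer linear in $\theta$, we need to redefine the gradient $\mu$ of $\sigma$ for any convex, compact set $P$ as

$$\mu_P(\theta) := \nabla_\theta \max_{y \in P} y^\top v_\theta = \Phi_\theta^\top \arg\max_{y \in P} y^\top v_\theta, \tag{121}$$

where $\Phi_\theta(i) := \nabla v_\theta(i)$. The following lemma expresses the gradient $\nabla\mathrm{MSRPBE}(\theta)$ in terms of the robust temporal difference errors; see Theorem 1 of [7] for the non-robust version.

Lemma 4.5. Assume that $v_\theta(i)$ is twice differentiable with respect to $\theta$ for any $i \in X$ and that $W(\theta) := E[\nabla v_\theta(i)\nabla v_\theta(i)^\top]$ is non-singular in a neighborhood of $\theta$. Let $\phi := \nabla v_\theta(i)$ and define for any $u \in \mathbb{R}^d$

$$h(\theta, u) := -E\left[\left(\tilde{d} - \phi^\top u\right)\nabla^2 v_\theta(i)\, u\right]. \tag{122}$$

Then the gradient of MSRPBE with respect to $\theta$ can be expressed as

$$-\tfrac{1}{2}\nabla\mathrm{MSRPBE}(\theta) = E\left[\left(\phi - \vartheta\mu_{\hat{U}}(\theta) - \vartheta\phi'\right)\phi^\top w\right] + h(\theta, w), \tag{123}$$

where $w = E[\phi\phi^\top]^{-1} E[\tilde{d}\phi]$ as before.

Proof. The proof is similar to Theorem 1 of [7] by using $\mu_{\hat{U}}(\theta)$ as the gradient of $\sigma_{\hat{U}}(\theta)$.

Lemma 4.5 leads us to the following robust analogues of nonlinear GTD and nonlinear TDC. The update of the weight estimators $w_k$ is the same as in equation 114:

$$w_{k+1} := w_k + \beta_k \left(\tilde{d}_k - \phi_k^\top w_k\right)\phi_k, \tag{124}$$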

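with the parameters $\theta_k$ being updated on a slower timescale as

$$
\begin{aligned}
\theta_{k+1} &:= \Gamma\left(\theta_k + \alpha_k\left\{\left(\phi_k - \vartheta\phi'_k - \vartheta\mu_{\hat{U}}(\theta_k)\right)\left(\phi_k^\top w_k\right) - h_k\right\}\right) && \text{(robust-nonlinear-GTD2)} \quad (125) \\
\theta_{k+1} &:= \Gamma\left(\theta_k + \alpha_k\left\{\tilde{d}_k\phi_k - \left(\vartheta\phi'_k + \vartheta\mu_{\hat{U}}(\theta_k)\right)\left(\phi_k^\top w_k\right) - h_k\right\}\right) && \text{(robust-nonlinear-TDC)} \quad (126)
\end{aligned}
$$

where $h_k := \left(\tilde{d}_k - \phi_k^\top w_k\right)\nabla^2 v_{\theta_k}(i_k)\, w_k$ and $\Gamma$ is a projection into an appropriately chosen compact set $C$ with a smooth boundary as in [7]. As in [7], the main aim of the projection is to prevent the parameters from diverging in the early stages of the algorithm due to the nonlinearities in the algorithm. In practice, if $C$ is large enough that it contains the set of all possible solutions $\{\theta \mid E[\tilde{d}\,\nabla v_\theta(i)] = 0\}$, then it is quite likely that no projections will happen. However, we require the projection for the convergence analysis of the robust-nonlinear-GTD2 and robust-nonlinear-TDC algorithms; see the convergence analysis below.

Let $T_n(P)$ denote the time to optimize a linear function over the set $P \subset \mathbb{R}^n$. Then the run time is $O(d + T_n(\hat{U}))$. If $\hat{U}$ is an ellipsoid with associated matrix $A$, then an approximate optimum may be computed by sampling, if we have a rank-$d$ approximation to $A$, i.e., $A \approx BB^\top$ for some $n \times d$ matrix $B$. If $\hat{U}$ is spherically symmetric, then $\sigma_{\hat{U}}$ is simply $\|v_\theta\|_2$, so that the updates of equations 124 and 115 may be viewed as the regular updates of [7] with an added noise term.

Convergence analysis

In this section we provide a convergence analysis for the robust-nonlinear-GTD2 and robust-nonlinear-TDC algorithms of equations 124 and 125. Note that this also proves convergence of the robust-GTD2 and robust-TDC algorithms of equations 114 and 115 as a special case. Given the set $C$, let $\mathcal{C}(C)$ denote the space of all continuous functions $C \to \mathbb{R}^d$. Define as in [7] the function $\hat{\Gamma} : \mathcal{C}(C) \to \mathcal{C}(\mathbb{R}^d)$ by

$$\hat{\Gamma} f(\theta) := \lim_{\varepsilon \to 0} \frac{\Gamma\left(\theta + \varepsilon f(\theta)\right) - \theta}{\varepsilon}. \tag{127}$$

Since $\Gamma(\theta) = \arg\min_{\theta' \in C} \|\theta' - \theta\|$ and the boundary of $C$ is smooth, it follows that $\hat{\Gamma}$ is well defined. Let $C^\circ$ denote the interior of $C$ and $\partial C$ denote its boundary, so that $C^\circ = C \setminus \partial C$. If $\theta \in C^\circ$, then $\hat{\Gamma} v(\theta) = v(\theta)$; otherwise $\hat{\Gamma} v(\theta)$ is the projection of $v(\theta)$ to the tangent space of $\partial C$ at $\theta$. Consider the following ODE as in [7]:

$$\dot{\theta} = \hat{\Gamma}\left(-\tfrac{1}{2}\nabla\mathrm{MSRPBE}\right)(\theta), \qquad \theta(0) \in C, \tag{128}$$

and let $K$ be the set of all stable equilibria of equation 128. Note that the solution set $\{\theta \mid E[\tilde{d}\phi] = 0\} \subseteq K$.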
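The role of the projection $\Gamma$ in the updates above can be illustrated with a small sketch. The source only requires $C$ to be compact with a smooth boundary; choosing $C$ to be a Euclidean ball of radius `radius` centered at the origin is an assumption made here for illustration, and the helper names are hypothetical. The argument `direction` stands in for the bracketed stochastic update direction in equations 125 and 126.

```python
import numpy as np


def project_to_ball(theta, radius):
    """Gamma(theta): the closest point to theta in the ball {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(theta)
    if norm <= radius:
        return theta                     # interior points are left unchanged
    return theta * (radius / norm)       # boundary case: rescale onto the sphere


def projected_step(theta, direction, alpha, radius):
    """theta_{k+1} = Gamma(theta_k + alpha_k * direction)."""
    return project_to_ball(theta + alpha * direction, radius)
```

If the radius is large relative to the iterates, `project_to_ball` is the identity and the update coincides with the unprojected one, matching the remark that projections are unlikely to trigger when $C$ contains the solution set.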
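The following theorem shows that under the assumption of Lipschitz continuous gradients and suitable assumptions on the step lengths $\alpha_k$ and $\beta_k$ and the uncertainty set $\hat{U}$, the updates of equations 124 and 125 converge.

Theorem 4.6 (Convergence of robust-nonlinear-GTD2). Consider the robust nonlinear updates of equations 124 and 125 with step sizes that satisfy $\sum_{k=0}^\infty \alpha_k = \sum_{k=0}^\infty \beta_k = \infty$, $\sum_{k=0}^\infty \alpha_k^2, \sum_{k=0}^\infty \beta_k^2 < \infty$, and $\frac{\alpha_k}{\beta_k} \to 0$ as $k \to \infty$. Assume that for every $\theta$ we have that $E[\phi_\theta\phi_\theta^\top]$ is non-singular. Also assume that the matrix $\Phi_\theta$ of gradients of the value function, defined as $\Phi_\theta(i) := \nabla v_\theta(i)$, is Lipschitz continuous with constant $L$, i.e., $\|\Phi_\theta - \Phi_{\theta'}\|_2 \le L\|\theta - \theta'\|_2$. Then with probability 1, $\theta_k \to K$ as $k \to \infty$.

Proof. The argument is similar to the proof of Theorem 2 in [7]. The only thing we need to verify is the Lipschitz continuity of the robust version $\hat{g}(\theta_k, w_k)$ of the function $g(\theta_k, w_k)$ of [7], defined as

$$\hat{g}(\theta_k, w_k) := E\left[\left(\phi_k - \vartheta\phi'_k - \vartheta\mu_{\hat{U}}(\theta_k)\right)\phi_k^\top w_k - h_k \,\middle|\, \theta_k, w_k\right],$$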


where $g(\theta_k, w_k)$ is defined as

$$g(\theta_k, w_k) := E\left[\left(\phi_k - \vartheta\phi'_k\right)\phi_k^\top w_k - h_k \,\middle|\, \theta_k, w_k\right],$$

where $\phi'_k$ is the feature vector of the state the simulator transitions to from state $i_k$. Thus we only need to verify Lipschitz continuity of $\mu_{\hat{U}}(\theta)$. Let $y^\ast := \arg\max_{y \in \hat{U}} y^\top v_\theta$ and let $z^\ast := \arg\max_{z \in \hat{U}} z^\top v_{\theta'}$. Then

$$
\begin{aligned}
\|\mu_{\hat{U}}(\theta) - \mu_{\hat{U}}(\theta')\|_2 &= \|\Phi_\theta^\top y^\ast - \Phi_{\theta'}^\top z^\ast\|_2 \\
&\le \|\Phi_\theta^\top y^\ast - \Phi_{\theta'}^\top y^\ast\|_2 \\
&\le \|\Phi_\theta - \Phi_{\theta'}\|_2\, \|y^\ast\|_2 \\
&\le \|\Phi_\theta - \Phi_{\theta'}\|_2 \max_{y \in \hat{U}} \|y\|_2 \\
&\le L \max_{y \in \hat{U}} \|y\|_2\, \|\theta - \theta'\|_2.
\end{aligned}
$$

Therefore $\mu_{\hat{U}}(\theta)$ is Lipschitz continuous with constant $L \max_{y \in \hat{U}} \|y\|_2$.

Corollary 4.7. Under the same conditions as in Theorem 4.6, the robust-GTD2, robust-TDC and robust-nonlinear-TDC algorithms satisfy with probability 1 that $\theta_k \to K$ as $k \to \infty$.

Figure 2: Performance of robust models with different sizes of confidence regions on two environments. Left: FrozenLake-v0. Right: Acrobot-v1.

5 Experiments

We implemented robust versions of Q-learning, SARSA, and TD($\lambda$)-learning as described in Section 3 and evaluated their performance against the nominal algorithms using the OpenAI gym framework [10]. The environments considered for the exact dynamic programming algorithms are the text environments of FrozenLake-v0, FrozenLake8x8-v0, Taxi-v2, Roulette-v0, NChain-v0, as well as the control tasks of CartPole-v0, CartPole-v1, InvertedPendulum-v1, together with the continuous control tasks of MuJoCo [27]. To test the performance of the robust algorithms, we perturb the models slightly by choosing with a small probability $p$ a random state after every action. The size of the confidence region $U_i^a$ for the robust model is chosen by a 10-fold cross validation using line search. After the Q-table or the value functions are learned for the robust and the nominal algorithms, we evaluate their performance on the true environment. To compare the algorithms we compare both the cumulative reward as well as the tail distribution function (complementary cumulative distribution function) as in [26], which for every $a$ plots the probability that the algorithm earned a reward of at least $a$.
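The two evaluation ingredients described above, the random-state perturbation and the tail distribution function, can be sketched as follows. This is an illustrative sketch and not the authors' experimental code: the helper names are hypothetical, the replacement state is assumed to be drawn uniformly (the text only says "a random state"), and states are assumed to be integers in `range(n_states)` as in the tabular gym environments.

```python
import numpy as np


def perturb_next_state(next_state, n_states, p, rng):
    """With probability p, replace the simulator's next state by a random state.

    Uniform sampling over range(n_states) is an assumption of this sketch.
    """
    if rng.random() < p:
        return int(rng.integers(n_states))
    return next_state


def tail_distribution(rewards, thresholds):
    """Empirical tail distribution: for each threshold a, the fraction of
    episodes whose total reward is at least a."""
    rewards = np.asarray(rewards, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # Compare every threshold against every episode reward, then average.
    return (rewards[None, :] >= thresholds[:, None]).mean(axis=1)
```

A typical use would collect the per-episode rewards of the robust and the nominal policy on the true environment, evaluate `tail_distribution` on a shared grid of thresholds, and plot the two curves against each other.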

Figure 3: Tail distribution and cumulative rewards during transient and stationary phase of robust vs nominal Q-learning on FrozenLake8x8-v0 with p = 0.01.

Figure 4: Tail distribution and cumulative rewards during transient and stationary phase of robust vs nominal Q-learning on FrozenLake8x8-v0 with p = 0.1.

Note that there is a tradeoff in the performance of the robust algorithms versus the nominal algorithms in terms of the value $p$. As the value of $p$ increases, we expect the robust algorithm to gain an edge over the nominal ones as long as $\hat{U}$ is still within the simplex $\Delta_n$. Once we exceed the simplex $\Delta_n$, however, the robust algorithm decays in performance. This is due to the presence of the $\beta$ term in the convergence results, which is defined as

$$\beta := \max_{i \in X,\, a \in A}\; \max_{y \in \hat{U}_i^a}\; \min_{x \in U_i^a} \|y - x\|_1, \tag{135}$$

and it grows larger in proportion to how much the proxy confidence region $\hat{U}$ is outside $\Delta_n$. Note that while $\beta$ is 0, the robust algorithms converge to the exact Q-factor and value function, while the nominal algorithm does not. However, since large values of $\beta$ also lead to suboptimal convergence, we also expect poor performance for too large confidence regions, i.e., large values of $p$. Figure 2 depicts how the size of the confidence region affects the performance of the robust models. Note that the average score appears somewhat erratic as a function of the size of the uncertainty set; however, this is due to the small sample size used in the line search. See Figures 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12 for a comparison of the best robust model and the nominal model.

6 Acknowledgments

The authors would like to thank Guy Tennenholtz and the anonymous reviewers for helping improve the presentation of the paper.

References

[1] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89-129.

Figure 5: Tail distribution and cumulative rewards during transient and stationary phase of robust vs nominal Q-learning on FrozenLake-v0 with p = 0.1.

Figure 6: Tail distribution and cumulative rewards during transient and stationary phase of robust vs nominal Q-learning on CartPole-v0 with p = 0.001.

Figure 7: Tail distribution and cumulative rewards during transient and stationary phase of robust vs nominal Q-learning on CartPole-v0 with p = 0.01.

Figure 8: Tail distribution and cumulative rewards during transient and stationary phase of robust vs nominal Q-learning on CartPole-v0 with p = 0.3.

Figure 9: Tail distribution and cumulative rewards during transient and stationary phase of robust vs nominal Q-learning on CartPole-v1 with p = 0.1.

Figure 10: Tail distribution and cumulative rewards during transient and stationary phase of robust vs nominal Q-learning on CartPole-v1 with p = 0.3.

Figure 11: Tail distribution and cumulative rewards during transient and stationary phase of robust vs nominal Q-learning on Taxi-v2 with p = 0.1.

Figure 12: Tail distribution and cumulative rewards during transient and stationary phase of robust vs nominal Q-learning on InvertedPendulum-v1 with p =
