Sparse Markov Decision Processes with Causal Sparse Tsallis Entropy Regularization for Reinforcement Learning

Kyungjae Lee, Sungjoon Choi, and Songhwai Oh

arXiv v3 [cs.LG], 3 Oct 2017

Abstract—In this paper, a sparse Markov decision process (MDP) with a novel causal sparse Tsallis entropy regularization is proposed. The proposed policy regularization induces a sparse and multi-modal optimal policy distribution of a sparse MDP. A full mathematical analysis of the proposed sparse MDP is provided. We first analyze the optimality condition of a sparse MDP. Then, we propose a sparse value iteration method which solves a sparse MDP and prove its convergence and optimality using the Banach fixed-point theorem. The proposed sparse MDP is compared to a soft MDP, which utilizes causal entropy regularization. We show that the performance error of a sparse MDP has a constant bound, while the error of a soft MDP increases logarithmically with respect to the number of actions, where this performance error is caused by the introduced regularization term. In experiments, we apply sparse MDPs to reinforcement learning problems. The proposed method outperforms existing methods in terms of convergence speed and performance.

I. INTRODUCTION

Markov decision processes (MDPs) have been widely used as a mathematical framework to solve stochastic sequential decision problems, such as autonomous driving [1], path planning [2], and quadrotor control [3]. In general, the goal of an MDP is to find the optimal policy function which maximizes the expected return. The expected return is a performance measure of a policy function and is often defined as the expected sum of discounted rewards. An MDP is often used to formulate reinforcement learning (RL) [4], which aims to find the optimal policy without an explicit specification of the stochasticity of an environment, and inverse reinforcement learning (IRL) [5], whose goal is to search for a proper reward function that can explain the behavior of an expert who follows the underlying optimal policy. While the optimal solution of an MDP is a deterministic policy, it is not desirable to apply an MDP to problems with multiple optimal actions. From the perspective of RL, the knowledge of multiple optimal actions makes it possible to cope with unexpected situations. For example, suppose that an autonomous vehicle has multiple optimal routes to reach a given goal. If a traffic accident occurs on the currently selected optimal route, it is possible to avoid the accident by choosing another safe optimal route without additional computation of a new optimal route. For this reason, it is more desirable to learn all possible optimal actions in terms of the robustness of a policy function. From the perspective of IRL, since an expert often makes multiple decisions in the same situation, a deterministic policy has limitations in expressing the expert's behavior. For this reason, it is indispensable to model the policy function of an expert as a multi-modal distribution. These reasons give rise to the necessity of a multi-modal policy model.

K. Lee, S. Choi, and S. Oh are with the Department of Electrical and Computer Engineering and ASRI, Seoul National University, Seoul, Korea (e-mail: {kyungjae.lee, sungjoon.choi, songhwai.oh}@cpslab.snu.ac.kr).

In order to address the issues with a deterministic policy function, causal entropy regularization methods have been utilized [6]–[10]. This is mainly due to the fact that the optimal solution of an MDP with causal entropy regularization becomes a softmax distribution of the state-action value Q(s,a), i.e., π(a|s) ∝ exp(Q(s,a)), which is often referred to as a soft MDP [11]. While a softmax distribution has been widely used to model a stochastic policy, it has a weakness in modeling a policy function when the number of actions is large. In other words, a policy function modeled by a softmax distribution is prone to assign non-negligible probability mass to non-optimal actions even if the state-action values of these actions are dismissible.
This tendency gets worse as the number of actions increases, as demonstrated in Figure 1. In this paper, we propose a sparse MDP by presenting a novel causal sparse Tsallis entropy regularization method, which can be interpreted as a special case of the generalized Tsallis entropy [12]. The proposed regularization method has a unique property in that the resulting policy distribution becomes a sparse distribution. In other words, the supporting action set, which has non-zero probability mass, contains a sparse subset of the action space. We provide a full mathematical analysis of the proposed sparse MDP. We first derive the optimality condition of a sparse MDP, which is named the sparse Bellman equation. We show that the sparse Bellman equation is an approximation of the original Bellman equation. Interestingly, we further find a connection between the optimality condition of a sparse MDP and the probability simplex projection problem [13]. We present a sparse value iteration method for solving a sparse MDP problem, whose optimality and convergence are proven using the Banach fixed-point theorem [14]. We further analyze the performance gap of the expected return of the optimal policies obtained by a sparse MDP and a soft MDP compared to that of the original MDP. In particular, we prove that the performance gap between the proposed sparse MDP and the original MDP has a constant bound as the number of actions increases, whereas the performance gap between a soft MDP and the original MDP grows logarithmically. From this property, sparse MDPs have benefits over soft MDPs when it comes to solving problems in robotics with continuous action spaces.

To validate the effectiveness of sparse MDPs, we apply the proposed method to the exploration strategy and the update rule of Q-learning and compare it to the ε-greedy method and the softmax policy [9]. The proposed method is also compared to the deep deterministic policy gradient (DDPG) method [15], which is designed to operate in continuous action spaces without discretization. The proposed method shows state-of-the-art performance compared to other methods as the discretization level of an action space increases.

II. BACKGROUND

A. Markov Decision Processes

A Markov decision process (MDP) has been widely used to formulate sequential decision making problems. An MDP can be characterized by a tuple M = {S, F, A, d₀, T, γ, r}, where S is the state space, F is the corresponding feature space, A is the action space, d₀ is the distribution of the initial state, T(s'|s,a) is the transition probability from s ∈ S to s' ∈ S by taking a ∈ A, γ ∈ (0, 1) is a discount factor, and r is the reward function. The objective of an MDP is to find a policy which maximizes E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | π, d₀, T ], where a policy π is a mapping from the state space to the action space. For notational simplicity, we denote the expectation of the discounted summation of a function f(s,a), i.e., E[ Σ_{t=0}^∞ γ^t f(s_t, a_t) | π, d₀, T ], by E_π[ f(s,a) ], where f(s,a) is a function of state and action, such as the reward function r(s,a) or an indicator function 1{s = s'}. We also denote the expectation of the discounted summation of f(s,a) conditioned on the initial state, i.e., E[ Σ_{t=0}^∞ γ^t f(s_t, a_t) | π, s₀ = s, T ], by E_π[ f(s,a) | s₀ = s ]. Finding an optimal policy for an MDP can be formulated as follows:

maximize_π  E_π[ r(s_t, a_t) ]
subject to  Σ_a π(a|s) = 1 ∀s,  π(a|s) ≥ 0 ∀s,a.    (1)

The necessary condition for the optimal solution of (1) is called the Bellman equation. The Bellman equation is derived from the Bellman optimality principle as follows:

Q(s,a) = r(s,a) + γ Σ_{s'} V(s') T(s'|s,a),
V(s) = max_a Q(s,a),
π(s) = argmax_a Q(s,a),    (2)

where V(s) is the value function of s, which is the expected sum of discounted rewards when the initial state s is given, and Q(s,a) is the state-action value function, which is the expected sum of discounted rewards when the initial state s and action a are given. Note that the optimal solution is a deterministic function, which is referred to as a deterministic policy.

(a) Reward map and action values at state s. (b) Proposed policy model and value differences (darker is better). (c) Softmax policy model and value differences (darker is better).
Fig. 1: A two-dimensional multi-objective environment with point-mass dynamics. The state is a location and the action is a velocity bounded by [−3, 3] × [−3, 3]. (a) The left figure shows the reward map with four maxima as multiple objectives. The action space is discretized at two levels: 9 actions (low resolution) and 49 actions (high resolution). The middle (resp. right) figure shows the optimal action values at the state indicated by the red cross when the number of actions is 9 (resp. 49). (b) The first and third figures indicate the proposed policy distribution at the state, induced by the action values in (a). The second and fourth figures show maps of the performance difference between the proposed policy and the optimal policy at each state when the number of actions is 9 and 49, respectively; the larger the error, the brighter the color of the state. (c) All figures are obtained in the same way as (b) by replacing the proposed policy with the softmax policy. This example shows that the proposed policy model is less affected when the number of actions increases.

B. Entropy-Regularized Markov Decision Processes

In order to obtain a multi-modal policy function, an entropy-regularized MDP, also known as a soft MDP, has been widely used [8]–[11]. In soft MDPs, causal entropy regularization over π is introduced to obtain a multi-modal policy distribution. Since causal entropy regularization penalizes a deterministic distribution, it makes the optimal policy of a soft MDP a softmax distribution.
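The following short Python sketch (our illustration, not part of the paper; the one-dimensional action grid and the Gaussian-shaped action values are made up) shows how a softmax policy spreads its probability mass as the same action range is discretized more finely, which is the behavior illustrated in Figure 1.

```python
import numpy as np

def softmax_policy(q, alpha=1.0):
    """Softmax (Boltzmann) policy: pi(a) proportional to exp(q(a) / alpha)."""
    z = (q - q.max()) / alpha          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical 1-D action space in [-3, 3] with a smooth action-value landscape.
for n_actions in (9, 49, 199):
    a = np.linspace(-3.0, 3.0, n_actions)
    q = np.exp(-(a - 1.0) ** 2)        # made-up Q-values peaked at a = 1
    pi = softmax_policy(q, alpha=1.0)
    print(n_actions, pi.max())         # mass on the best action keeps shrinking
```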
A soft MDP is formulated as follows:

maximize_π  E_π[ r(s_t, a_t) ] + α H(π)
subject to  Σ_a π(a|s) = 1 ∀s,  π(a|s) ≥ 0 ∀s,a,

where H(π) ≜ E[ −Σ_{t=0}^∞ γ^t log π(a_t|s_t) | π, d₀, T ] is the γ-discounted causal entropy and α is a regularization coefficient. This problem has been extensively studied in [6], [8], [11].

In [11], the soft Bellman equation and the optimal policy distribution are derived from the Karush-Kuhn-Tucker (KKT) conditions as follows:

Q_soft(s,a) = r(s,a) + γ Σ_{s'} V_soft(s') T(s'|s,a),
V_soft(s) = α log Σ_a exp( Q_soft(s,a)/α ),
π(a|s) = exp( ( Q_soft(s,a) − V_soft(s) ) / α ),

where V_soft^π(s) = E_π[ r(s_t, a_t) − α log π(a_t|s_t) | s₀ = s ] and Q_soft^π(s,a) = E_π[ r(s_t, a_t) − α log π(a_t|s_t) | s₀ = s, a₀ = a ]. V_soft^π(s) is the soft value of π, indicating the expected sum of rewards including the entropy of the policy, obtained by starting at state s, and Q_soft^π(s,a) is the soft state-action value of π, which is the expected sum of rewards obtained by starting at state s and taking action a. Note that the optimal policy distribution is a softmax distribution. In [11], a soft value iteration method is also proposed and the optimality of soft value iteration is proved. By using causal entropy regularization, the optimal policy distribution of a soft MDP is able to represent a multi-modal distribution. However, causal entropy regularization has the effect of making the resulting policy of a soft MDP closer to a uniform distribution as the number of actions increases. To handle this issue, we propose a novel regularization method whose resulting policy distribution still has multiple modes as a stochastic policy, but whose performance loss is smaller than that of a softmax policy distribution.

III. SPARSE MARKOV DECISION PROCESSES

We propose a sparse Markov decision process by introducing a novel causal sparse Tsallis entropy regularizer:

W(π) ≜ E[ Σ_{t=0}^∞ γ^t (1/2)(1 − π(a_t|s_t)) | π, d₀, T ] = E_π[ (1/2)(1 − π(a|s)) ].

By adding αW(π) to the objective function of (1), we aim to solve the following optimization problem:

maximize_π  E_π[ r(s,a) ] + α W(π)
subject to  Σ_a π(a|s) = 1 ∀s,  π(a|s) ≥ 0 ∀s,a,    (3)

where α > 0 is a regularization coefficient. We first derive the sparse Bellman equation from the necessary conditions of (3). Then, by observing the connection between the sparse Bellman equation and the probability simplex projection, we show that the optimal policy becomes a sparsemax distribution, where the sparsity can be controlled by α. In addition, we present a sparse value iteration algorithm whose optimality is guaranteed using the Banach fixed-point theorem. The detailed derivations of the lemmas and theorems in this paper can be found in Appendix A.

A. Notation and Properties

We first introduce the notation and properties used in the paper. In Table I, all notations and definitions are summarized. The utility, value, and state visitation can be compactly expressed in terms of vectors and matrices as follows:

J_sp(π) = d₀ᵀ G_π r_π^sp,   J_soft(π) = d₀ᵀ G_π r_π^soft,
V_sp^π = G_π r_π^sp,   V_soft^π = G_π r_π^soft,
ρ_π = G_πᵀ d₀,

where xᵀ is the transpose of a vector x, G_π = (I − γ T_π)⁻¹, sp indicates the sparse MDP problem, and soft indicates the soft MDP problem.

B. Sparse Bellman Equation from Karush-Kuhn-Tucker Conditions

The sparse Bellman equation can be derived from the necessary conditions for an optimal solution of a sparse MDP. We carefully investigate the Karush-Kuhn-Tucker (KKT) conditions, which indicate necessary conditions for a solution to be optimal when some regularity conditions about the feasible set are satisfied. The feasible set of a sparse MDP satisfies the linearity constraint qualification [16], since the feasible set consists of linear affine functions. In this regard, the optimal solution of a sparse MDP necessarily satisfies the KKT conditions as follows.

Theorem 1. If a policy distribution π is the optimal solution of the sparse MDP (3), then π and the corresponding sparse value function V_sp necessarily satisfy the following equations for all state and action pairs:

Q_sp(s,a) = r(s,a) + γ Σ_{s'} V_sp(s') T(s'|s,a),
V_sp(s) = α [ (1/2) Σ_{a∈S(s)} ( (Q_sp(s,a)/α)² − τ(Q_sp(s,·)/α)² ) + 1/2 ],
π(a|s) = max( Q_sp(s,a)/α − τ(Q_sp(s,·)/α), 0 ),    (4)

where τ(Q_sp(s,·)/α) = ( Σ_{a∈S(s)} Q_sp(s,a)/α − 1 ) / |S(s)|, S(s) is the set of actions satisfying 1 + i Q_sp(s, a_(i))/α > Σ_{j=1}^{i} Q_sp(s, a_(j))/α, with a_(i) indicating the action with the ith largest action value Q_sp(s, a_(i)), and |S(s)| is the cardinality of S(s).

The full proof of Theorem 1 is provided in Appendix A-A. The proof relies on the KKT stationarity condition, i.e., the derivative of the Lagrangian objective function with respect to the policy becomes zero at the optimal solution.
From (4), it can be shown that the optimal solution obtained from the sparse MDP assigns zero probability to actions whose action values Q_sp(s,a) are below the threshold α τ(Q_sp(s,·)/α), and the optimal policy assigns positive probability to near-optimal actions in proportion to their action values, where the threshold τ(Q_sp(s,·)/α) determines the range of near-optimal actions. This property makes the optimal policy a sparse distribution and prevents the performance drop caused by assigning non-negligible positive probabilities to non-optimal actions, which often occurs in soft MDPs. From the definition of S(s) and τ, we can further observe an interesting connection between the sparse Bellman equation and the probability simplex projection problem [13].
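As a concrete reading of (4), the following Python sketch (our own reconstruction from the equations above, not code released with the paper) computes the supporting set S(s), the threshold τ, the sparsemax policy, and the sparse value for a single state from a made-up vector of action values.

```python
import numpy as np

def sparse_bellman_policy_value(q, alpha=1.0):
    """Sparsemax policy and sparse value for one state, following Eq. (4)."""
    z = np.asarray(q, dtype=float) / alpha
    z_sorted = np.sort(z)[::-1]                           # z_(1) >= z_(2) >= ...
    k = np.arange(1, len(z) + 1)
    support = 1.0 + k * z_sorted > np.cumsum(z_sorted)    # condition defining S(s)
    K = support.sum()                                     # cardinality of the supporting set
    tau = (np.cumsum(z_sorted)[K - 1] - 1.0) / K          # threshold tau(Q(s,.)/alpha)
    pi = np.maximum(z - tau, 0.0)                         # sparsemax distribution
    v = alpha * (0.5 * np.sum(z[pi > 0] ** 2 - tau ** 2) + 0.5)  # sparse value V_sp(s)
    return pi, v

q = np.array([2.0, 1.9, 0.5, -1.0])                       # made-up action values
pi, v = sparse_bellman_policy_value(q, alpha=1.0)
print(pi, pi.sum(), v)   # only near-optimal actions receive nonzero probability
```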

TABLE I: Notations and Properties

Utility: J_sp(π) = E_π[ r(s,a) + (α/2)(1 − π(a|s)) ] = d₀ᵀ V_sp^π   /   J_soft(π) = E_π[ r(s,a) − α log π(a|s) ] = d₀ᵀ V_soft^π
Value: V_sp^π(s) = E_π[ r(s,a) + (α/2)(1 − π(a|s)) | s₀ = s ]   /   V_soft^π(s) = E_π[ r(s,a) − α log π(a|s) | s₀ = s ]
Action value: Q_sp^π(s,a) = r(s,a) + γ Σ_{s'} V_sp^π(s') T(s'|s,a)   /   Q_soft^π(s,a) = r(s,a) + γ Σ_{s'} V_soft^π(s') T(s'|s,a)
Expected state reward: r_π^sp(s) = Σ_a π(a|s)[ r(s,a) + (α/2)(1 − π(a|s)) ]   /   r_π^soft(s) = Σ_a π(a|s)[ r(s,a) − α log π(a|s) ]
Policy regularization: W(π) = E_π[ (1/2)(1 − π(a|s)) ] = (1/2) Σ_{s,a} ρ_π(s,a)(1 − π(a|s))   /   H(π) = E_π[ −log π(a|s) ] = −Σ_{s,a} ρ_π(s,a) log π(a|s)
Max approximation: spmax(z) = (1/2) Σ_{i∈S(z)} ( z_i² − τ(z)² ) + 1/2   /   logsumexp(z) = log Σ_i exp(z_i)
Value iteration operator: U_sp(x)(s) = α spmax_a( (r(s,a) + γ Σ_{s'} x(s') T(s'|s,a))/α )   /   U_soft(x)(s) = α logsumexp_a( (r(s,a) + γ Σ_{s'} x(s') T(s'|s,a))/α )
State visitation: ρ_π(s) = E_π[ 1{s_t = s} ] = d₀(s) + γ Σ_{s', a'} T(s|s', a') π(a'|s') ρ_π(s')
State-action visitation: ρ_π(s, a) = E_π[ 1{s_t = s, a_t = a} ] = π(a|s) ρ_π(s)
Transition probability given π: T_π(s'|s) = Σ_a T(s'|s, a) π(a|s)

C. Probability Simplex Projection and the Sparsemax Operation

The probability simplex projection [13] is a well-known problem of projecting a d-dimensional vector onto the (d−1)-dimensional probability simplex in the Euclidean metric sense. A probability simplex projection problem is defined as follows:

minimize_p  ‖p − z‖²
subject to  Σ_{i=1}^{d} p_i = 1,  p_i ≥ 0, i = 1, ..., d,    (5)

where z is a given d-dimensional vector, d is the dimension of p and z, and p_i is the ith element of p. Let z_(i) be the ith largest element of z and supp(z) be the supporting set of the optimal solution, defined by supp(z) = { z_(i) : 1 + i z_(i) > Σ_{j=1}^{i} z_(j) }. It is a well-known fact that problem (5) has the closed-form solution

p_i(z) = max( z_i − τ(z), 0 ),

where i indicates the ith dimension, p_i(z) is the ith element of the optimal solution for a fixed z, and τ(z) = ( Σ_{z_(i)∈supp(z)} z_(i) − 1 ) / K with K = |supp(z)| [13], [17]. Interestingly, the optimal solution p(z), the threshold τ, and the supporting set supp(z) of (5) can be precisely matched to those of the sparse Bellman equation (4). From this observation, it can be shown that the optimal policy distribution of a sparse MDP is the projection of Q_sp(s,·)/α onto the probability simplex. Note that we refer to p(z) as the sparsemax distribution.

More surprisingly, V_sp can be represented as an approximation of the max operation derived from p(z). A differentiable approximation of the max operation is defined as follows:

spmax(z) ≜ (1/2) Σ_{z_i ∈ supp(z)} ( z_i² − τ(z)² ) + 1/2.    (6)

We call spmax(z) sparsemax. In [17], it is proven that spmax(z) is an indefinite integral of p(z), i.e., spmax(z) = ∫ p(z) dz + C, where C is a constant and, in our case, C = 1/2. We provide simple upper and lower bounds of spmax(z) with respect to max(z):

max(z) ≤ spmax(z) ≤ max(z) + (d − 1)/(2d).    (7)

The lower bound of sparsemax is shown in [17]; however, we provide another proof of the lower bound and a proof of the upper bound in Appendix A-B. The bounds (7) show that sparsemax is a bounded and smooth approximation of max and, from this fact, (4) can be interpreted as an approximation of the original Bellman equation (2). Using this notation, V_sp can be rewritten as V_sp(s) = α spmax_a( Q_sp(s,a)/α ).
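The sketch below (ours; the test vectors are random) evaluates spmax(z) directly from (6) and numerically checks the bound (7) for a few dimensions.

```python
import numpy as np

def spmax(z):
    """Sparsemax approximation of max(z), Eq. (6): 0.5 * sum_{i in S}(z_i^2 - tau^2) + 0.5."""
    z = np.asarray(z, dtype=float)
    zs = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    K = (1.0 + k * zs > np.cumsum(zs)).sum()
    tau = (np.cumsum(zs)[K - 1] - 1.0) / K
    return 0.5 * np.sum(zs[:K] ** 2 - tau ** 2) + 0.5

rng = np.random.default_rng(0)
for d in (2, 10, 100):
    z = rng.normal(size=d)
    gap = spmax(z) - z.max()
    # bound (7): max(z) <= spmax(z) <= max(z) + (d - 1) / (2d)
    assert 0.0 <= gap <= (d - 1) / (2.0 * d) + 1e-9
    print(d, round(gap, 4), (d - 1) / (2.0 * d))
```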

D. Supporting Set of the Sparse Optimal Policy

The supporting set S(s) of a sparse MDP is the set of actions with nonzero probability, and the cardinality of S(s) can be controlled by the regularization coefficient α, whereas the supporting set of a soft MDP is always the entire action space. In a sparse MDP, an action assigned a non-zero probability must satisfy the following inequality:

1 + i Q_sp(s, a_(i))/α > Σ_{j=1}^{i} Q_sp(s, a_(j))/α,    (8)

where a_(i) indicates the action with the ith largest action value. From this inequality, it can be shown that α controls the margin between the largest action value and the other values included in the supporting set. In other words, as α increases, the cardinality of the supporting set increases, since more action values satisfy (8). Conversely, as α decreases, the supporting set shrinks. In the extreme cases, if α goes to zero, only argmax_a Q_sp(s,a) is included in S(s), and if α goes to infinity, the entire action space is included in S(s). On the other hand, in a soft MDP, the supporting set of the softmax distribution cannot be controlled by the regularization coefficient, even though the sharpness of the softmax distribution can be adjusted. This property gives a sparse MDP an advantage over a soft MDP, since we can assign zero probability to non-optimal actions by controlling α.

E. Connection to the Tsallis Generalized Entropy

The notion of the Tsallis entropy was introduced by C. Tsallis as a general extension of entropy [12], and the Tsallis entropy has been widely used to describe thermodynamic systems and molecular motion. Surprisingly, the proposed regularization is closely related to a special case of the Tsallis entropy. The Tsallis entropy is defined as follows:

S_{q,k}(p) = (k/(q − 1)) ( 1 − Σ_i p_i^q ),

where p is a probability mass function, q is a parameter called the entropic index, and k is a positive real constant. Note that, as q → 1 with k = 1, S_{1,1}(p) becomes the Shannon entropy, i.e., −Σ_i p_i log(p_i). In [11], [18], it is shown that H(π) is an extension of S_{1,1}, since H(π) = E_π[ S_{1,1}(π(·|s)) ] = −Σ_{s,a} ρ_π(s,a) log π(a|s). We discover the connection between the Tsallis entropy and the proposed regularization when q = 2 and k = 1/2.

Theorem 2. The proposed policy regularization W(π) is an extension of the Tsallis entropy with parameters q = 2 and k = 1/2 to the causal-entropy setting, i.e., W(π) = E_π[ S_{2,1/2}(π(·|s)) ].

The proof is provided in Appendix A-D. From this theorem, W(π) can be interpreted as an extension of S_{2,1/2}(p) to the case of causally conditioned distributions, similarly to the causal entropy.

IV. SPARSE VALUE ITERATION

In this section, we propose an algorithm for solving the causal sparse Tsallis entropy regularized MDP problem. Similar to the original MDP and the soft MDP, a sparse version of value iteration can be induced from the sparse Bellman equation. We first define the sparse Bellman operator U_sp: R^|S| → R^|S|: for all s,

U_sp(x)(s) = α spmax_a( ( r(s,a) + γ Σ_{s'} x(s') T(s'|s,a) ) / α ),

where x is a vector in R^|S|, U_sp(x) is the resulting vector after applying U_sp to x, and U_sp(x)(s) is the element for state s in U_sp(x). Then, the sparse value iteration algorithm can be described simply as x_{i+1} = U_sp(x_i), where i is the number of iterations. In the following section, we show the convergence and optimality of the proposed sparse value iteration method.

A. Optimality of Sparse Value Iteration

In this section, we prove the convergence and optimality of the sparse value iteration method. We first show that U_sp has monotonic and discounting properties and, using those properties, we prove that U_sp is a contraction. Then, by the Banach fixed-point theorem, repeatedly applying U_sp from an arbitrary initial point always converges to the unique fixed point.

Lemma 1. U_sp is monotone: for x, y ∈ R^|S|, if x ≤ y, then U_sp(x) ≤ U_sp(y), where ≤ indicates an element-wise inequality.

Lemma 2. For any constant c ∈ R, U_sp(x + c·1) = U_sp(x) + γc·1, where 1 ∈ R^|S| is a vector of all ones.

The full proofs can be found in Appendix A-E. The proofs of Lemma 1 and Lemma 2 rely on the properties of the sparsemax operation. It is possible to prove that the sparse Bellman operator U_sp is a contraction using Lemma 1 and Lemma 2 as follows.

Lemma 3. U_sp is a γ-contraction mapping and has a unique fixed point, where γ is in (0, 1) by definition.

Using Lemma 1, Lemma 2, and Lemma 3, the optimality and convergence of sparse value iteration can be proven.

Theorem 3. Sparse value iteration converges to the optimal value of (3).

The proof can be found in Appendix A-E. Theorem 3 is proven using the uniqueness of the fixed point of U_sp and the sparse Bellman equation.
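The following Python sketch (our own illustration on a small randomly generated MDP; the reward and transition tensors are made up) implements the operator U_sp and iterates x_{i+1} = U_sp(x_i) until the iterates stop changing, which Lemma 3 and Theorem 3 guarantee to happen at the unique optimal value.

```python
import numpy as np

def spmax(z):
    """Sparsemax approximation of max(z) as in Eq. (6)."""
    zs = np.sort(np.asarray(z, dtype=float))[::-1]
    k = np.arange(1, len(zs) + 1)
    K = (1.0 + k * zs > np.cumsum(zs)).sum()
    tau = (np.cumsum(zs)[K - 1] - 1.0) / K
    return 0.5 * np.sum(zs[:K] ** 2 - tau ** 2) + 0.5

def sparse_value_iteration(r, T, gamma=0.9, alpha=1.0, n_iter=200):
    """Iterate x <- U_sp(x) on a tabular MDP with rewards r[s, a] and transitions T[s, a, s']."""
    n_states, _ = r.shape
    x = np.zeros(n_states)
    for _ in range(n_iter):
        q = r + gamma * T @ x          # q[s, a] = r(s, a) + gamma * sum_s' T(s'|s, a) x(s')
        x_new = np.array([alpha * spmax(q[s] / alpha) for s in range(n_states)])
        if np.max(np.abs(x_new - x)) < 1e-10:
            break
        x = x_new
    return x

# Small random MDP used only for illustration.
rng = np.random.default_rng(1)
n_s, n_a = 5, 4
r = rng.normal(size=(n_s, n_a))
T = rng.random((n_s, n_a, n_s))
T /= T.sum(axis=2, keepdims=True)      # normalize to valid transition probabilities
print(sparse_value_iteration(r, T))
```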
V. PERFORMANCE ERROR BOUNDS FOR SPARSE VALUE ITERATION

We prove bounds on the performance gap between the policy obtained by a sparse MDP and the policy obtained by the original MDP, where this performance error is caused by the regularization. The boundedness property (7) plays a crucial role in proving the error bound: the performance bound can be derived from the bounds of sparsemax. A similar approach can be applied to prove the error bound of a soft MDP, since the log-sum-exp function is also a bounded approximation of the max operation; a comparison of the log-sum-exp and sparsemax operations is provided in Appendix A-C. Before explaining the performance error bounds, we introduce two useful propositions which are employed to prove the performance error bounds of sparse MDPs and soft MDPs. We first prove an important fact which shows that the optimal values of sparse value iteration and soft value iteration are greater than that of the original MDP.

Algorithm 1 Sparse Deep Q-Learning
1: Initialize a prioritized replay memory M = ∅ and Q-network parameters θ and θ'
2: for i = 0 to N do
3:   Sample an initial state s₀ ~ d₀
4:   for t = 0 to T do
5:     Sample an action a_t ~ π(·|s_t), computed by (4) from Q(s_t, ·; θ)
6:     Execute a_t and observe the next state s_{t+1} and reward r_t
7:     Add the experience to the replay memory M with an initial importance weight w₀: M ← M ∪ {(s_t, a_t, r_t, s_{t+1}, w₀)}
8:     Sample a mini-batch B from M based on the importance weights
9:     Set the target value y_j of each (s_j, a_j, r_j, s_{j+1}, w_j) in B: y_j = r_j + γ α spmax_a( Q(s_{j+1}, a; θ')/α )
10:    Minimize Σ_j w_j ( y_j − Q(s_j, a_j; θ) )² using a gradient descent method
11:    Update the importance weights {w_j} based on the temporal-difference errors δ_j = y_j − Q(s_j, a_j; θ) [19]
12:  end for
13:  Update θ' ← θ every c iterations
14: end for

Lemma 4. Let U and U_soft be the Bellman operations of the original MDP and a soft MDP, respectively, such that, for a state s and x ∈ R^|S|,

U(x)(s) = max_a [ r(s,a) + γ Σ_{s'} x(s') T(s'|s,a) ],
U_soft(x)(s) = α log Σ_a exp( ( r(s,a) + γ Σ_{s'} x(s') T(s'|s,a) ) / α ).

Then the following inequalities hold for every integer n > 0: U^n x ≤ U_sp^n x and U^n x ≤ U_soft^n x, where U^n (resp. U_sp^n, U_soft^n) is the result after applying U (resp. U_sp, U_soft) n times. In addition, let x*, x*_sp, and x*_soft be the fixed points of U, U_sp, and U_soft, respectively. Then the following inequalities also hold: x* ≤ x*_sp and x* ≤ x*_soft.

The detailed proof is provided in Appendix A-F. Lemma 4 shows that the optimal values V*_sp and V*_soft, obtained by sparse value iteration and soft value iteration, are always greater than the original optimal value V*. Intuitively speaking, the reason for this inequality is the regularization term, i.e., αW(π) or αH(π), added to the objective function. Now, we discuss other useful properties of the proposed causal sparse Tsallis entropy regularization W(π) and the causal entropy regularization H(π).

Lemma 5. W(π) and H(π) have the following upper bounds:

W(π) ≤ (1/(1 − γ)) (|A| − 1)/(2|A|),   H(π) ≤ log|A| / (1 − γ),

where |A| is the cardinality of the action space A.

The proof is provided in Appendix A-F. Lemma 5 can be induced by extending the upper bounds of S_{2,1/2}(π(·|s)) and S_{1,1}(π(·|s)) to the causal sparse Tsallis entropy and the causal entropy. By using Lemma 4 and Lemma 5, the performance bounds for sparse MDPs and soft MDPs can be derived as follows.

Theorem 4. The following inequalities hold:

E_{π*}[ r(s,a) ] − (α/(1 − γ)) (|A| − 1)/(2|A|) ≤ E_{π*_sp}[ r(s,a) ] ≤ E_{π*}[ r(s,a) ],

where π* and π*_sp are the optimal policies obtained by the original MDP and the sparse MDP, respectively.

Theorem 5. The following inequalities hold:

E_{π*}[ r(s,a) ] − α log|A| / (1 − γ) ≤ E_{π*_soft}[ r(s,a) ] ≤ E_{π*}[ r(s,a) ],

where π* and π*_soft are the optimal policies obtained by the original MDP and the soft MDP, respectively.

The proofs of Theorem 4 and Theorem 5 can be found in Appendix A-F. These error bounds show that the expected return of the optimal policy of a sparse MDP always has a tighter error bound than that of a soft MDP. Moreover, the bound for the proposed sparse MDP converges to the constant α/(2(1 − γ)) as the number of actions increases, whereas the error bound of a soft MDP grows logarithmically. This property has a clear benefit when a sparse MDP is applied to robotic problems with continuous action spaces. To apply an MDP to a continuous action space, discretization of the action space is essential, and a fine discretization is required to obtain a solution closer to the underlying continuous optimal policy. Accordingly, the number of actions becomes larger as the level of discretization increases. In this case, a sparse MDP has an advantage over a soft MDP in that the performance error of a sparse MDP is bounded by a constant factor as the number of actions increases, whereas the performance error of the optimal policy of a soft MDP grows logarithmically.
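To get a feel for how the two bounds behave, the following short Python sketch (the values of α and γ are illustrative choices of ours, not taken from the paper) evaluates the sparse-MDP bound α(|A|−1)/(2|A|)/(1−γ) from Theorem 4 and the soft-MDP bound α·log|A|/(1−γ) from Theorem 5 as the number of actions grows.

```python
import numpy as np

alpha, gamma = 1.0, 0.9                     # illustrative regularization coefficient and discount
for n_actions in (2, 10, 100, 1000, 10000):
    sparse_bound = alpha * (n_actions - 1) / (2.0 * n_actions) / (1.0 - gamma)
    soft_bound = alpha * np.log(n_actions) / (1.0 - gamma)
    print(n_actions, round(sparse_bound, 3), round(soft_bound, 3))
# The sparse bound approaches alpha / (2 * (1 - gamma)) = 5.0, while the soft bound keeps growing.
```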
VI. SPARSE EXPLORATION AND UPDATE RULE FOR SPARSE DEEP Q-LEARNING

In this section, we first propose sparse Q-learning and further extend it to sparse deep Q-learning, where the sparsemax policy and the sparse Bellman equation are employed as the exploration method and the update rule, respectively. Sparse Q-learning is a model-free method to solve the proposed sparse MDP without knowledge of the transition probabilities. In other words, when the transition probability T(s'|s,a) is unknown but sampling from T is possible, sparse Q-learning estimates the optimal Q of the sparse MDP using sampling, just as Q-learning finds an approximation of the optimal Q of the conventional MDP.

(a) Performance bound. (b) Supporting set comparison.
Fig. 2: (a) The performance gap is calculated as the absolute value of the difference between the performance of a sparse MDP (or a soft MDP) and the performance of the original MDP. (b) The ratio of the number of supporting actions to the total number of actions is shown. The action space of the unicycle dynamics is discretized into 25 actions.

Similar to Q-learning, the update equation of sparse Q-learning is derived from the sparse Bellman equation:

Q_{i+1}(s_i, a_i) ← Q_i(s_i, a_i) + η(i) [ r(s_i, a_i) + γ α spmax_a( Q_i(s_{i+1}, a)/α ) − Q_i(s_i, a_i) ],

where i indicates the number of iterations and η(i) is a learning rate. If the learning rate η(i) satisfies Σ_{i=0}^∞ η(i) = ∞ and Σ_{i=0}^∞ η(i)² < ∞, then, as the number of samples increases to infinity, sparse Q-learning converges to the optimal solution of the sparse MDP. The proof of the convergence and optimality of sparse Q-learning is the same as that of standard Q-learning [20].

The proposed sparse Q-learning can easily be extended to sparse deep Q-learning by using a deep neural network as an estimator of the sparse Q value. In each iteration, sparse deep Q-learning performs a gradient descent step to minimize the squared loss ( y − Q(s, a; θ) )², where θ denotes the parameters of the Q-network. Here, y is the target value defined as follows:

y = r(s, a) + γ α spmax_a( Q(s', a; θ')/α ),

where s' is the next state sampled by taking action a at state s and θ' indicates the target network parameters. Moreover, we employ the sparsemax policy as the exploration strategy, where the policy distribution is computed by (4) with action values estimated by the deep Q-network. The sparsemax policy excludes the actions whose estimated action values are too low to be worth re-exploring by assigning them zero probability mass. The effectiveness of sparsemax exploration is investigated in Section VII. For stable convergence of the Q-network, we utilize double Q-learning [21], where the parameters θ for obtaining the policy and the parameters θ' for computing the target value are separated, and θ' is updated to θ every predetermined number of iterations. In other words, double Q-learning prevents instability of deep Q-learning by slowly updating the target value. Prioritized experience replay [19] is also applied, where the optimization of the network proceeds in consideration of the importance of each experience. The whole process of sparse deep Q-learning is summarized in Algorithm 1.
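A minimal tabular version of this procedure is sketched below in Python (our own illustration; it assumes a Gym-style environment whose reset() returns a discrete state index and whose step(a) returns (next_state, reward, done, info), and the hyperparameters are placeholders). The deep variant in Algorithm 1 replaces the table with a Q-network and adds a target network and prioritized replay.

```python
import numpy as np

def sparsemax_policy(q, alpha=1.0):
    """Sparsemax exploration policy of Eq. (4) and the value alpha * spmax(q / alpha)."""
    z = np.asarray(q, dtype=float) / alpha
    zs = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    K = (1.0 + k * zs > np.cumsum(zs)).sum()
    tau = (np.cumsum(zs)[K - 1] - 1.0) / K
    pi = np.maximum(z - tau, 0.0)
    v = alpha * (0.5 * (pi @ z) + 0.5 * tau + 0.5)   # equals alpha * spmax(q / alpha)
    return pi, v

def sparse_q_learning(env, n_states, n_actions, alpha=1.0, gamma=0.99,
                      lr=0.1, n_episodes=500, horizon=200):
    """Tabular sparse Q-learning with sparsemax exploration and the sparse Bellman update."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()                                    # assumed to return a state index
        for _ in range(horizon):
            pi, _ = sparsemax_policy(Q[s], alpha)
            a = np.random.choice(n_actions, p=pi / pi.sum())   # sparsemax exploration
            s_next, r, done, _ = env.step(a)               # assumed Gym-style step
            _, v_next = sparsemax_policy(Q[s_next], alpha)
            Q[s, a] += lr * (r + gamma * v_next - Q[s, a]) # sparse Bellman update
            s = s_next
            if done:
                break
    return Q
```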
VII. EXPERIMENTS

We first verify Theorem 4, Theorem 5, and the effect of (8) in simulation. For the verification of Theorem 4 and Theorem 5, we measure the performance in terms of the expected return while increasing the number of actions |A|. For the verification of the effect of (8), the cardinalities of the supporting sets of the optimal policies of the sparse and soft MDPs are compared at different values of α. To investigate the effectiveness of the proposed method, we test sparsemax exploration and the sparse Bellman update rule on reinforcement learning with continuous action spaces. To apply Q-learning to a continuous action space, a fine discretization is necessary to obtain a solution closer to the original continuous optimal policy. As the level of discretization increases, the number of actions to be explored becomes larger. In this regard, an efficient exploration method is required to obtain high performance. We compare our method to other exploration methods with respect to the convergence speed and the expected sum of rewards. We further check the effect of the update rule.

A. Experiments on Performance Bounds and Supporting Set

To verify our theorems about the performance error bounds, we create a transition model T by discretizing unicycle dynamics defined in continuous state and action spaces and solve the original MDP, soft MDP, and sparse MDP under a predefined reward while increasing the discretization level of the action space. The reward function is defined as a linear combination of two squared exponential functions, i.e., r(x) = exp(−‖x − x_g‖²/σ_g) − exp(−‖x − x_a‖²/σ_a), where x is the location of the unicycle, x_g is a goal point, x_a is a point to avoid, and σ_g and σ_a are scale parameters. The reward function is designed to let an agent navigate toward x_g while avoiding x_a. The absolute value of the difference between the expected return of the original MDP and that of the sparse MDP (or soft MDP) is measured. As shown in Figure 2(a), the performance gap of the sparse MDP converges to a constant bound while the performance gap of the soft MDP grows logarithmically. Note that the performance gaps of the sparse MDP and the soft MDP are always smaller than their error bounds. The supporting set experiments are conducted using the discretized unicycle dynamics. The cardinalities of the supporting sets of the optimal policies are measured while α is varied. In Figure 2(b), while the ratio of the supporting set for the soft MDP changes from 0.79 to 1.00, the ratio for the sparse MDP changes from 0.4 to 0.99, demonstrating the sparseness of the proposed sparse MDP compared to the soft MDP.

B. Reinforcement Learning in Continuous Action Spaces

We test our method in MuJoCo [22], a physics-based simulator, using two problems with continuous action spaces: Inverted Pendulum and Reacher. The action space is discretized to apply Q-learning to the continuous action space, and experiments are conducted with four different discretization levels to validate the effectiveness of sparsemax exploration and the sparse Bellman update rule.
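As a small illustration of this discretization step (our own sketch; the bounds and grid sizes are examples rather than the exact values used in the experiments), a low-dimensional continuous action box can be turned into a finite action set by taking the Cartesian product of per-dimension grids.

```python
import numpy as np
from itertools import product

def discretize_action_space(low, high, bins_per_dim):
    """Return an array of discrete actions covering a box [low, high] per dimension."""
    grids = [np.linspace(l, h, n) for l, h, n in zip(low, high, bins_per_dim)]
    return np.array(list(product(*grids)))

# Example: 2-D action in [-1, 1]^2 at two resolutions (3x3 = 9 and 7x7 = 49 actions).
coarse = discretize_action_space([-1.0, -1.0], [1.0, 1.0], [3, 3])
fine = discretize_action_space([-1.0, -1.0], [1.0, 1.0], [7, 7])
print(len(coarse), len(fine))   # the action set grows quickly with the resolution
```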

We compare the sparsemax exploration method to the ε-greedy method and softmax exploration [10], and we further compare the sparse Bellman update rule to the original Bellman update rule [20] and the soft Bellman update rule [11]. In addition, three different regularization coefficient settings are experimented. In total, we test 27 combinations of variants of deep Q-learning by combining three exploration methods, three update rules, and three regularization coefficients. The deep deterministic policy gradient (DDPG) method [15], which operates in a continuous action space without discretization of the action space, is also compared. Hence, a total of 28 algorithms are tested.¹ Results are shown in Figure 3 and Figure 4 for the inverted pendulum and reacher, respectively, where only the top five algorithms are plotted and each point in the graphs is obtained by averaging the values from three independent runs with different random seeds. Results of all 28 algorithms are provided in Appendix B. A Q-network with two hidden layers is used for the inverted pendulum problem and a network with four hidden layers is used for the reacher problem. Each Q-learning algorithm utilizes the same network topology.

¹To test DDPG, we used the code from OpenAI available at github.com/openai/baselines.

(a) Expected return. (b) Required episodes.
Fig. 3: Inverted pendulum problem. Algorithms are named <exploration method>+<update rule>+<α>. (a) The average performance of each algorithm after 3000 episodes. The performance of DDPG is out of scale. (b) The average number of episodes required to reach the threshold value 980.

(a) Expected return. (b) Required episodes.
Fig. 4: Reacher problem. (a) The average performance of each algorithm after 10000 episodes. (b) The average number of episodes required to reach the threshold value −6.

For the inverted pendulum, since the problem is easier than the reacher problem, most of the top five algorithms converge to the maximum return of 1000 at each discretization level, as shown in Figure 3. Four of the top five algorithms utilize the proposed sparsemax exploration; only one of the top five methods utilizes softmax exploration. In Figure 3(b), the number of episodes required to reach a near-optimal return, 980, is shown. Sparsemax exploration requires fewer episodes to obtain a near-optimal value than ε-greedy and softmax exploration. For the reacher problem, the algorithms with sparsemax exploration slightly outperform the ε-greedy method, and softmax exploration is not included in the top five, as shown in Figure 4. In terms of the number of required episodes, sparsemax exploration outperforms the ε-greedy method as shown in Figure 4(b), where we set the threshold return to be −6. DDPG shows poor performance in both problems since the number of sampled episodes is insufficient. In this regard, deep Q-learning with sparsemax exploration outperforms DDPG with a smaller number of episodes. From these experiments, it can be seen that the sparsemax exploration method has an advantage over softmax exploration, the ε-greedy method, and DDPG with respect to the number of episodes required to reach optimal performance.

VIII. CONCLUSION

In this paper, we have proposed a new MDP with a novel causal sparse Tsallis entropy regularization, which induces a sparse and multi-modal optimal policy distribution. In addition, we have provided a full mathematical analysis of the proposed sparse MDP: the optimality condition of a sparse MDP given as the sparse Bellman equation, sparse value iteration with its convergence and optimality properties, and the performance bounds between the proposed MDP and the original MDP. We have also proven that the performance gap of a sparse MDP is strictly smaller than that of a soft MDP. In experiments, we have verified that the theoretical performance gaps of the sparse MDP and the soft MDP from the original MDP are correct.
We have applied the sparsemax policy and the sparse Bellman equation to deep Q-learning as the exploration strategy and the update rule, respectively, and have shown that the proposed exploration method achieves significantly better performance compared to ε-greedy, softmax exploration, and DDPG as the number of actions increases. From the analysis and experiments, we have demonstrated that the proposed sparse MDP can be an efficient alternative for problems with a large number of possible actions and even continuous action spaces.

APPENDIX A

A. Sparse Bellman Equation from Karush-Kuhn-Tucker Conditions

The following proof explains the optimality condition of the sparse MDP derived from the Karush-Kuhn-Tucker (KKT) conditions.

Proof of Theorem 1: The KKT conditions of (3) are as follows:

Σ_a π(a|s) − 1 = 0,  π(a|s) ≥ 0    (9)
λ(s,a) ≥ 0    (10)
λ(s,a) π(a|s) = 0    (11)
∂L(π, c, λ)/∂π(a|s) = 0    (12)

where c(s) and λ(s,a) are Lagrangian multipliers for the equality and inequality constraints, respectively; (9) is the feasibility of the primal variables, (10) is the feasibility of the dual variables, (11) is complementary slackness, and (12) is the stationarity condition. The Lagrangian function of (3) is written as follows:

L(π, c, λ) = −J_sp(π) + Σ_s c(s) ( Σ_a π(a|s) − 1 ) − Σ_{s,a} λ(s,a) π(a|s),

where the maximization of (3) is changed into a minimization problem, i.e., min_π −J_sp(π). First, the derivative of J_sp(π) can be obtained by using the chain rule:

∂J_sp(π)/∂π(a|s) = ρ_π(s) [ r(s,a) + α(1/2 − π(a|s)) + γ Σ_{s'} V_sp^π(s') T(s'|s,a) ]
                 = ρ_π(s) [ Q_sp^π(s,a) + α/2 − α π(a|s) ].

Here, the partial derivative of the Lagrangian is obtained as follows:

∂L(π, c, λ)/∂π(a|s) = −ρ_π(s) [ Q_sp^π(s,a) + α/2 − α π(a|s) ] + c(s) − λ(s,a) = 0.

First, consider π(a|s) > 0, where the corresponding Lagrangian multiplier λ(s,a) is zero due to complementary slackness. By summing the stationarity condition over the actions with positive probability, the Lagrangian multiplier c(s) can be obtained as follows:

0 = Σ_{a: π(a|s)>0} [ −ρ_π(s) ( Q_sp^π(s,a) + α/2 − α π(a|s) ) + c(s) ]
⇒ c(s) = ρ_π(s) [ ( Σ_{a: π(a|s)>0} Q_sp^π(s,a) − α ) / K + α/2 ],

where K is the number of positive elements of π(·|s). By replacing c(s) with this result, the optimal policy distribution is induced as follows:

π(a|s) = Q_sp^π(s,a)/α − ( Σ_{a': π(a'|s)>0} Q_sp^π(s,a')/α − 1 ) / K.

Since this equation is derived under the assumption that π(a|s) is positive, for π(a|s) > 0 the following condition must be fulfilled:

Q_sp^π(s,a)/α > ( Σ_{a': π(a'|s)>0} Q_sp^π(s,a')/α − 1 ) / K.

We denote the supporting set by S(s) = { a : Q_sp^π(s,a)/α > τ(Q_sp^π(s,·)/α) }, which contains the actions whose action values are larger than the threshold τ(Q_sp^π(s,·)/α) = ( Σ_{a'∈S(s)} Q_sp^π(s,a')/α − 1 ) / |S(s)|. Using this notation, the optimal policy distribution can be rewritten as follows:

π(a|s) = max( Q_sp^π(s,a)/α − τ(Q_sp^π(s,·)/α), 0 ).

By substituting π(a|s) back into the definition of the value function, the following optimality equation for V_sp is induced:

V_sp(s) = Σ_a π(a|s) [ Q_sp(s,a) + (α/2)(1 − π(a|s)) ]
        = α [ (1/2) Σ_{a∈S(s)} ( (Q_sp(s,a)/α)² − τ(Q_sp(s,·)/α)² ) + 1/2 ].

To summarize, we obtain the sparse Bellman equation as follows:

Q_sp(s,a) = r(s,a) + γ Σ_{s'} V_sp(s') T(s'|s,a),
V_sp(s) = α [ (1/2) Σ_{a∈S(s)} ( (Q_sp(s,a)/α)² − τ(Q_sp(s,·)/α)² ) + 1/2 ],
π(a|s) = max( Q_sp(s,a)/α − τ(Q_sp(s,·)/α), 0 ).
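The closed-form solution derived above can be sanity-checked numerically. The following Python sketch (ours; it assumes NumPy and SciPy are available) compares the sparsemax distribution obtained from the KKT derivation with the Euclidean projection onto the probability simplex computed by a general-purpose constrained optimizer.

```python
import numpy as np
from scipy.optimize import minimize

def sparsemax(z):
    """Closed-form projection of z onto the probability simplex (the KKT solution above)."""
    zs = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    K = (1.0 + k * zs > np.cumsum(zs)).sum()
    tau = (np.cumsum(zs)[K - 1] - 1.0) / K
    return np.maximum(z - tau, 0.0)

rng = np.random.default_rng(3)
z = rng.normal(size=6)                        # stands in for Q(s, .) / alpha
closed_form = sparsemax(z)

# Solve min_p ||p - z||^2 s.t. sum(p) = 1, p >= 0 with a generic solver.
res = minimize(lambda p: np.sum((p - z) ** 2),
               x0=np.full(len(z), 1.0 / len(z)),
               method="SLSQP",
               bounds=[(0.0, None)] * len(z),
               constraints={"type": "eq", "fun": lambda p: p.sum() - 1.0})
print(np.max(np.abs(closed_form - res.x)))    # should be close to zero
```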

B. Upper and Lower Bounds for the Sparsemax Operation

In this section, we prove the lower and upper bounds of spmax(z) defined in (6). We would like to mention that the proof of the lower bound of (7) is provided in [17]; however, we find another interesting way to prove (7) by using the Cauchy-Schwarz inequality and the nonnegativity of quadratic terms. We first prove max(z) ≤ spmax(z) and next prove spmax(z) ≤ max(z) + (d−1)/(2d). Without loss of generality, we assume that α = 1; the original inequalities can be obtained simply by replacing z with z/α.

Lower Bound of the Sparsemax Operation. For all z ∈ R^d, max(z) ≤ spmax(z) holds.

Proof: We prove that, for all z, spmax(z) − z_(1) ≥ 0, where z_(1) = max(z) by definition. The proof is done by simply rearranging the terms in (6). Writing K = |supp(z)| and τ(z) = ( Σ_{i=1}^{K} z_(i) − 1 ) / K,

spmax(z) − z_(1) = (1/2) Σ_{i=1}^{K} ( z_(i)² − τ(z)² ) + 1/2 − z_(1).

Expanding τ(z) and completing the square, this difference can be rearranged into the sum of three terms, where the first and third terms are squares and hence always nonnegative. The second term is also nonnegative by the Cauchy-Schwarz inequality |pᵀq| ≤ ‖p‖‖q‖: letting z_{1:K} = [z_(1), ..., z_(K)] and setting p = z_{1:K} + 1 and q = 1, where 1 is a K-dimensional vector of ones, it can be shown that the second term is nonnegative. Therefore, spmax(z) − z_(1) is always nonnegative for all z, since all three terms are nonnegative, completing the proof.

Now, we prove the upper bound of the sparsemax operation.

Upper Bound of the Sparsemax Operation. For all z ∈ R^d, spmax(z) ≤ max(z) + (d−1)/(2d) holds.

Proof: First, we decompose the summation in (6) into two terms as follows:

spmax(z) = (1/2) Σ_{i∈supp(z)} ( z_i² − τ(z)² ) + 1/2
         = (1/2) Σ_i p_i(z) ( z_i + τ(z) ) + 1/2
         = (1/2) Σ_i p_i(z) z_i + (1/2) τ(z) + 1/2,

where p_i(z) = max( z_i − τ(z), 0 ) is the optimal solution of the simplex projection problem (5) and Σ_i p_i(z) = 1 by definition. Now, we use the fact that, for every p on the d-dimensional probability simplex, Σ_i p_i z_i ≤ max(z) for all z ∈ R^d. Since p(z) is on the probability simplex, and since τ(z) = (1/K) Σ_{i∈supp(z)} z_i − 1/K is, up to the constant −1/K, a weighted sum of z under the uniform weights 1/K over supp(z), which also lie on the probability simplex, the following inequality is induced:

spmax(z) = (1/2) Σ_i p_i(z) z_i + (1/2) ( (1/K) Σ_{i∈supp(z)} z_i ) + (1/2)(1 − 1/K)
         ≤ (1/2) max(z) + (1/2) max(z) + (K − 1)/(2K)
         ≤ max(z) + (d − 1)/(2d),

where K = |supp(z)| ≤ d by the definition of supp(z). Therefore, spmax(z) ≤ max(z) + (d−1)/(2d) holds.

C. Comparison to Log-Sum-Exp

We explain the error bound for the log-sum-exp operation and compare it to the bound of the sparsemax operation. The log-sum-exp operation has the widely known bound

max(z) ≤ logsumexp(z) ≤ max(z) + log(d).

We would like to note that sparsemax has a tighter bound than log-sum-exp, as (d−1)/(2d) ≤ log(d) is satisfied for all d > 1. Intuitively, the approximation error of log-sum-exp increases as the dimension of the input space increases, whereas the approximation error of sparsemax approaches 1/2 as the dimension of the input space goes to infinity. This fact plays a crucial role in comparing the performance error bounds of the sparse MDP and the soft MDP.

D. Causal Sparse Tsallis Entropy

The following proof shows that W(π) is equivalent to the discounted expected sum of a special case of the Tsallis entropy with q = 2 and k = 1/2.

Proof of Theorem 2: The proof is done simply by rewriting our regularization as follows:

W(π) = E[ Σ_{t=0}^∞ γ^t (1/2)(1 − π(a_t|s_t)) | π, d₀, T ]
     = (1/2) Σ_{s,a} E[ Σ_{t=0}^∞ γ^t 1{s_t = s, a_t = a} | π, d₀, T ] (1 − π(a|s))
     = (1/2) Σ_{s,a} ρ_π(s,a) (1 − π(a|s))
     = (1/2) Σ_s ρ_π(s) Σ_a π(a|s)(1 − π(a|s))
     = Σ_s ρ_π(s) S_{2,1/2}(π(·|s))
     = E_π[ S_{2,1/2}(π(·|s)) ].

E. Convergence and Optimality of Sparse Value Iteration

In this section, the monotonicity, discounting property, and contraction of the sparse Bellman operator U_sp are proved.

Proof of Lemma 1: In [17], the monotonicity of (6) is proved. Then, the monotonicity of U_sp can be proved using (6). Let x and y be given such that x ≤ y. Then

r(s,a) + γ Σ_{s'} x(s') T(s'|s,a) ≤ r(s,a) + γ Σ_{s'} y(s') T(s'|s,a),

where T(s'|s,a) is a transition probability, which is always nonnegative. Since the sparsemax operation is monotone, the following inequality is induced:

α spmax_a( (r(s,a) + γ Σ_{s'} x(s') T(s'|s,a))/α ) ≤ α spmax_a( (r(s,a) + γ Σ_{s'} y(s') T(s'|s,a))/α ).

Finally, we obtain U_sp(x) ≤ U_sp(y).

Proof of Lemma 2: In [17], it is shown that, for c ∈ R and x ∈ R^d, spmax(x + c·1) = spmax(x) + c. Using this property,

U_sp(x + c·1)(s) = α spmax_a( (r(s,a) + γ Σ_{s'} (x(s') + c) T(s'|s,a))/α )
                = α spmax_a( (r(s,a) + γ Σ_{s'} x(s') T(s'|s,a))/α + γc/α )
                = α spmax_a( (r(s,a) + γ Σ_{s'} x(s') T(s'|s,a))/α ) + γc
                = U_sp(x)(s) + γc.

Hence U_sp(x + c·1) = U_sp(x) + γc·1.

Proof of Lemma 3: First, we prove that U_sp is a γ-contraction mapping with respect to the max norm d_∞(x, y) = max_s |x(s) − y(s)|. Without loss of generality, the proof is given for a general mapping φ: R^|S| → R^|S| with the discounting and monotonicity properties. Let d_∞(x, y) = M. Then y − M·1 ≤ x ≤ y + M·1 is satisfied. By the monotonicity and discounting properties, the following inequality between the mappings φ(x) and φ(y) is established:

φ(y) − γM·1 ≤ φ(x) ≤ φ(y) + γM·1,

where γ is the discounting factor of φ. From this inequality, d_∞(φ(x), φ(y)) ≤ γM = γ d_∞(x, y) with γ ∈ (0, 1). Therefore, φ is a γ-contraction mapping, and in our case U_sp is a γ-contraction mapping. As R^|S| with d_∞ is a non-empty complete metric space, by the Banach fixed-point theorem the γ-contraction mapping U_sp has a unique fixed point.

Using Lemma 1, Lemma 2, and Lemma 3, we can prove the convergence and optimality of sparse value iteration.

Proof of Theorem 3: Sparse value iteration converges to a fixed point of U_sp by the contraction property. Let x* be the fixed point of U_sp; by the definition of U_sp, x* is the point that satisfies the sparse Bellman equation, i.e., x* = U_sp(x*). Hence, by Theorem 1, x* satisfies the necessary condition for an optimal solution. By the Banach fixed-point theorem, x* is the unique point that satisfies this necessary condition; in particular, x* = U_sp(x*) is precisely equivalent to the sparse Bellman equation, so there is no other point that satisfies the sparse Bellman equation. Therefore, x* is the optimal value of the sparse MDP.

F. Performance Error Bounds for Sparse Value Iteration

In this section, we prove the performance error bounds for sparse value iteration and soft value iteration. We first show that the optimal values of the sparse MDP and the soft MDP are greater than that of the original MDP.

Proof of Lemma 4: We first prove the inequality for the sparse Bellman operation, U^n x ≤ U_sp^n x, by mathematical induction. When n = 1, the inequality is proven as follows:

max_a ( r(s,a) + γ Σ_{s'} x(s') T(s'|s,a) ) ≤ α spmax_a( (r(s,a) + γ Σ_{s'} x(s') T(s'|s,a))/α ),

since max(z) ≤ spmax(z). Therefore, Ux ≤ U_sp x. For some positive integer k, let us assume that U^k x ≤ U_sp^k x holds for every x ∈ R^|S|. Then, when n = k + 1,

U^{k+1} x = U^k (Ux) ≤ U_sp^k (Ux) ≤ U_sp^k (U_sp x) = U_sp^{k+1} x,

where the first inequality follows from the induction hypothesis and the second follows from Ux ≤ U_sp x together with the monotonicity of U_sp. Therefore, by mathematical induction, U^n x ≤ U_sp^n x holds for every positive integer n. The inequality between the fixed points of U and U_sp is obtained by letting n → ∞: x* ≤ x*_sp, where * indicates the fixed point. The above argument also holds when U_sp and sparsemax are replaced with U_soft and the log-sum-exp operation, respectively.

Before showing the performance error bounds, the upper bounds of W(π) and H(π) are proved first.

Proof of Lemma 5: For W(π),

W(π) = (1/2) Σ_s ρ_π(s) Σ_a π(a|s)(1 − π(a|s)) ≤ (1/2) Σ_s ρ_π(s) (|A| − 1)/|A| = (1/(1 − γ)) (|A| − 1)/(2|A|),

where the inequality Σ_a π(a|s)(1 − π(a|s)) ≤ (|A| − 1)/|A| can be obtained by finding the point where the derivative of Σ_a x_a(1 − x_a) over the probability simplex is zero, i.e., the uniform distribution, and Σ_s ρ_π(s) = 1/(1 − γ). Similarly, for H(π),

H(π) = −Σ_s ρ_π(s) Σ_a π(a|s) log π(a|s) ≤ Σ_s ρ_π(s) log|A| = log|A| / (1 − γ),

where the inequality −Σ_a π(a|s) log π(a|s) ≤ log|A| can also be obtained by finding the point where the derivative of −Σ_a x_a log x_a over the probability simplex is zero, i.e., the uniform distribution.

Using Lemma 4 and Lemma 5, the error bounds of sparse and soft value iteration can be proved.

Proof of Theorem 4: Let π* be the optimal policy of the original MDP, where the problem is defined as max_π E_π[ r(s,a) ]. Then

E_{π*_sp}[ r(s,a) ] ≤ max_π E_π[ r(s,a) ] = E_{π*}[ r(s,a) ].

The right-side inequality is by the definition of optimality. Before proving the left-side inequality, we first derive the following inequality from Lemma 4:

V* ≤ V*_sp,    (13)

where * indicates the optimal value. Since the fixed points of U and U_sp are the optimal values of the original MDP and the sparse MDP, respectively, (13) can be derived from Lemma 4.

The left-side inequality is proved using (13) as follows:

E_{π*}[ r(s,a) ] = d₀ᵀ V* ≤ d₀ᵀ V*_sp = J_sp(π*_sp) = E_{π*_sp}[ r(s,a) ] + α W(π*_sp)
               ≤ E_{π*_sp}[ r(s,a) ] + (α/(1 − γ)) (|A| − 1)/(2|A|),

where the last inequality follows from Lemma 5.

Proof of Theorem 5: Let π* be the optimal policy of the original MDP, defined as max_π E_π[ r(s,a) ]. The right-side inequality is by the definition of optimality:

E_{π*_soft}[ r(s,a) ] ≤ max_π E_π[ r(s,a) ] = E_{π*}[ r(s,a) ].

Before proving the left-side inequality, we first derive the following inequality from Lemma 4:

V* ≤ V*_soft,    (14)

where * indicates the optimal value. Then the proof of the left-side inequality is done by using (14) as follows:

E_{π*}[ r(s,a) ] = d₀ᵀ V* ≤ d₀ᵀ V*_soft = J_soft(π*_soft) = E_{π*_soft}[ r(s,a) ] + α H(π*_soft)
               ≤ E_{π*_soft}[ r(s,a) ] + α log|A| / (1 − γ),

where the last inequality follows from Lemma 5.

APPENDIX B

In this section, we present the full experimental results of reinforcement learning with continuous action spaces. We performed experiments on Inverted Pendulum and Reacher, and 28 algorithms are tested, including our sparse exploration method and sparse Bellman update rule.

REFERENCES

[1] S. Brechtel, T. Gindele, and R. Dillmann, "Probabilistic decision-making under uncertainty for autonomous driving using continuous POMDPs," in 17th International Conference on Intelligent Transportation Systems, October 2014.
[2] S. Ragi and E. K. P. Chong, "UAV path planning in a dynamic environment via partially observable Markov decision process," IEEE Trans. Aerospace and Electronic Systems, vol. 49, no. 4, 2013.
[3] J. Hwangbo, I. Sa, R. Siegwart, and M. Hutter, "Control of a quadrotor with reinforcement learning," IEEE Robotics and Automation Letters, vol. 2, no. 4, 2017.
[4] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," International Journal of Robotics Research, vol. 32, no. 11, 2013.
[5] A. Y. Ng and S. J. Russell, "Algorithms for inverse reinforcement learning," in Proc. of the 17th International Conference on Machine Learning, June 2000.
[6] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, "Reinforcement learning with deep energy-based policies," in Proc. of the 34th International Conference on Machine Learning, August 2017.
[7] N. Heess, D. Silver, and Y. W. Teh, "Actor-critic reinforcement learning with energy-based policies," in Proc. of the Tenth European Workshop on Reinforcement Learning, June 2012.
[8] J. Schulman, P. Abbeel, and X. Chen, "Equivalence between policy gradients and soft Q-learning," arXiv preprint, 2017.
[9] M. Tokic and G. Palm, "Value-difference based exploration: Adaptive control between epsilon-greedy and softmax," in KI 2011: Advances in Artificial Intelligence, 34th Annual German Conference on AI, October 2011.
[10] P. Vamplew, R. Dazeley, and C. Foale, "Softmax exploration strategies for multiobjective reinforcement learning," Neurocomputing, vol. 263, 2017.
[11] M. Bloem and N. Bambos, "Infinite time horizon maximum causal entropy inverse reinforcement learning," in 53rd IEEE Conference on Decision and Control, December 2014.
[12] C. Tsallis, "Possible generalization of Boltzmann-Gibbs statistics," Journal of Statistical Physics, vol. 52, 1988.
[13] W. Wang and M. A. Carreira-Perpinán, "Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application," arXiv preprint arXiv:1309.1541, 2013.
[14] D. R. Smart, Fixed Point Theorems. CUP Archive, 1980, vol. 66.
[15] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint, 2015.
[16] J. Ye, "Constraint qualifications and necessary optimality conditions for optimization problems with variational inequality constraints," SIAM Journal on Optimization, vol. 10, no. 4, 2000.
[17] A. Martins and R. Astudillo, "From softmax to sparsemax: A sparse model of attention and multi-label classification," in International Conference on Machine Learning, June 2016.
[18] B. D. Ziebart, "Modeling purposeful adaptive behavior with the principle of maximum causal entropy," Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, PA, USA, 2010.
[19] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," arXiv preprint arXiv:1511.05952, 2015.
[20] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279-292, 1992.
[21] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proc. of the Thirtieth AAAI Conference on Artificial Intelligence, February 2016.
[22] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in International Conference on Intelligent Robots and Systems, October 2012.

TABLE II: Expected return on Inverted Pendulum for each number of actions and on average, for the 27 deep Q-learning variants (exploration method + update rule + regularization coefficient) and DDPG. The top five performances are marked in bold.

TABLE III: The number of episodes required to reach the threshold return, 980, for each number of actions, for the 27 deep Q-learning variants.

TABLE IV: Expected return on Reacher for each number of actions and on average, for the 27 deep Q-learning variants and DDPG. The top five performances are marked in bold.

TABLE V: The number of episodes required to reach the threshold return, -6, for each number of actions, for the 27 deep Q-learning variants.


More information

THE EXISTENCE-UNIQUENESS THEOREM FOR FIRST-ORDER DIFFERENTIAL EQUATIONS.

THE EXISTENCE-UNIQUENESS THEOREM FOR FIRST-ORDER DIFFERENTIAL EQUATIONS. THE EXISTENCE-UNIQUENESS THEOREM FOR FIRST-ORDER DIFFERENTIAL EQUATIONS RADON ROSBOROUGH https://intuitiveexplntionscom/picrd-lindelof-theorem/ This document is proof of the existence-uniqueness theorem

More information

Analysis of Variance and Design of Experiments-II

Analysis of Variance and Design of Experiments-II Anlyi of Vrince nd Deign of Experiment-II MODULE VI LECTURE - 7 SPLIT-PLOT AND STRIP-PLOT DESIGNS Dr. Shlbh Deprtment of Mthemtic & Sttitic Indin Intitute of Technology Knpur Anlyi of covrince ith one

More information

Research Article Generalized Hyers-Ulam Stability of the Second-Order Linear Differential Equations

Research Article Generalized Hyers-Ulam Stability of the Second-Order Linear Differential Equations Hindwi Publihing Corportion Journl of Applied Mthemtic Volume 011, Article ID 813137, 10 pge doi:10.1155/011/813137 Reerch Article Generlized Hyer-Ulm Stbility of the Second-Order Liner Differentil Eqution

More information

On the Adders with Minimum Tests

On the Adders with Minimum Tests Proceeding of the 5th Ain Tet Sympoium (ATS '97) On the Adder with Minimum Tet Seiji Kjihr nd Tutomu So Dept. of Computer Science nd Electronic, Kyuhu Intitute of Technology Atrct Thi pper conider two

More information

LINEAR STOCHASTIC DIFFERENTIAL EQUATIONS WITH ANTICIPATING INITIAL CONDITIONS

LINEAR STOCHASTIC DIFFERENTIAL EQUATIONS WITH ANTICIPATING INITIAL CONDITIONS Communiction on Stochtic Anlyi Vol. 7, No. 2 213 245-253 Seril Publiction www.erilpubliction.com LINEA STOCHASTIC DIFFEENTIAL EQUATIONS WITH ANTICIPATING INITIAL CONDITIONS NAJESS KHALIFA, HUI-HSIUNG KUO,

More information

19 Optimal behavior: Game theory

19 Optimal behavior: Game theory Intro. to Artificil Intelligence: Dle Schuurmns, Relu Ptrscu 1 19 Optiml behvior: Gme theory Adversril stte dynmics hve to ccount for worst cse Compute policy π : S A tht mximizes minimum rewrd Let S (,

More information

Package ifs. R topics documented: August 21, Version Title Iterated Function Systems. Author S. M. Iacus.

Package ifs. R topics documented: August 21, Version Title Iterated Function Systems. Author S. M. Iacus. Pckge if Augut 21, 2015 Verion 0.1.5 Title Iterted Function Sytem Author S. M. Icu Dte 2015-08-21 Mintiner S. M. Icu Iterted Function Sytem Etimtor. Licene GPL (>= 2) NeedCompiltion

More information

The ifs Package. December 28, 2005

The ifs Package. December 28, 2005 The if Pckge December 28, 2005 Verion 0.1-1 Title Iterted Function Sytem Author S. M. Icu Mintiner S. M. Icu Iterted Function Sytem Licene GPL Verion 2 or lter. R topic documented:

More information

APPROXIMATE INTEGRATION

APPROXIMATE INTEGRATION APPROXIMATE INTEGRATION. Introduction We hve seen tht there re functions whose nti-derivtives cnnot be expressed in closed form. For these resons ny definite integrl involving these integrnds cnnot be

More information

Advanced Calculus: MATH 410 Notes on Integrals and Integrability Professor David Levermore 17 October 2004

Advanced Calculus: MATH 410 Notes on Integrals and Integrability Professor David Levermore 17 October 2004 Advnced Clculus: MATH 410 Notes on Integrls nd Integrbility Professor Dvid Levermore 17 October 2004 1. Definite Integrls In this section we revisit the definite integrl tht you were introduced to when

More information

Notes on length and conformal metrics

Notes on length and conformal metrics Notes on length nd conforml metrics We recll how to mesure the Eucliden distnce of n rc in the plne. Let α : [, b] R 2 be smooth (C ) rc. Tht is α(t) (x(t), y(t)) where x(t) nd y(t) re smooth rel vlued

More information

Review of Calculus, cont d

Review of Calculus, cont d Jim Lmbers MAT 460 Fll Semester 2009-10 Lecture 3 Notes These notes correspond to Section 1.1 in the text. Review of Clculus, cont d Riemnn Sums nd the Definite Integrl There re mny cses in which some

More information

Excerpted Section. Consider the stochastic diffusion without Poisson jumps governed by the stochastic differential equation (SDE)

Excerpted Section. Consider the stochastic diffusion without Poisson jumps governed by the stochastic differential equation (SDE) ? > ) 1 Technique in Computtionl Stochtic Dynmic Progrmming Floyd B. Hnon niverity of Illinoi t Chicgo Chicgo, Illinoi 60607-705 Excerpted Section A. MARKOV CHAI APPROXIMATIO Another pproch to finite difference

More information

Accelerator Physics. G. A. Krafft Jefferson Lab Old Dominion University Lecture 5

Accelerator Physics. G. A. Krafft Jefferson Lab Old Dominion University Lecture 5 Accelertor Phyic G. A. Krfft Jefferon L Old Dominion Univerity Lecture 5 ODU Accelertor Phyic Spring 15 Inhomogeneou Hill Eqution Fundmentl trnvere eqution of motion in prticle ccelertor for mll devition

More information

1 Online Learning and Regret Minimization

1 Online Learning and Regret Minimization 2.997 Decision-Mking in Lrge-Scle Systems My 10 MIT, Spring 2004 Hndout #29 Lecture Note 24 1 Online Lerning nd Regret Minimiztion In this lecture, we consider the problem of sequentil decision mking in

More information

MORE FUNCTION GRAPHING; OPTIMIZATION. (Last edited October 28, 2013 at 11:09pm.)

MORE FUNCTION GRAPHING; OPTIMIZATION. (Last edited October 28, 2013 at 11:09pm.) MORE FUNCTION GRAPHING; OPTIMIZATION FRI, OCT 25, 203 (Lst edited October 28, 203 t :09pm.) Exercise. Let n be n rbitrry positive integer. Give n exmple of function with exctly n verticl symptotes. Give

More information

The Regulated and Riemann Integrals

The Regulated and Riemann Integrals Chpter 1 The Regulted nd Riemnn Integrls 1.1 Introduction We will consider severl different pproches to defining the definite integrl f(x) dx of function f(x). These definitions will ll ssign the sme vlue

More information

arxiv: v1 [stat.ml] 9 Aug 2016

arxiv: v1 [stat.ml] 9 Aug 2016 On Lower Bounds for Regret in Reinforcement Lerning In Osbnd Stnford University, Google DeepMind iosbnd@stnford.edu Benjmin Vn Roy Stnford University bvr@stnford.edu rxiv:1608.02732v1 [stt.ml 9 Aug 2016

More information

Duality # Second iteration for HW problem. Recall our LP example problem we have been working on, in equality form, is given below.

Duality # Second iteration for HW problem. Recall our LP example problem we have been working on, in equality form, is given below. Dulity #. Second itertion for HW problem Recll our LP emple problem we hve been working on, in equlity form, is given below.,,,, 8 m F which, when written in slightly different form, is 8 F Recll tht we

More information

NUMERICAL INTEGRATION. The inverse process to differentiation in calculus is integration. Mathematically, integration is represented by.

NUMERICAL INTEGRATION. The inverse process to differentiation in calculus is integration. Mathematically, integration is represented by. NUMERICAL INTEGRATION 1 Introduction The inverse process to differentition in clculus is integrtion. Mthemticlly, integrtion is represented by f(x) dx which stnds for the integrl of the function f(x) with

More information

1.9 C 2 inner variations

1.9 C 2 inner variations 46 CHAPTER 1. INDIRECT METHODS 1.9 C 2 inner vritions So fr, we hve restricted ttention to liner vritions. These re vritions of the form vx; ǫ = ux + ǫφx where φ is in some liner perturbtion clss P, for

More information

Scientific notation is a way of expressing really big numbers or really small numbers.

Scientific notation is a way of expressing really big numbers or really small numbers. Scientific Nottion (Stndrd form) Scientific nottion is wy of expressing relly big numbers or relly smll numbers. It is most often used in scientific clcultions where the nlysis must be very precise. Scientific

More information

COUNTING DESCENTS, RISES, AND LEVELS, WITH PRESCRIBED FIRST ELEMENT, IN WORDS

COUNTING DESCENTS, RISES, AND LEVELS, WITH PRESCRIBED FIRST ELEMENT, IN WORDS COUNTING DESCENTS, RISES, AND LEVELS, WITH PRESCRIBED FIRST ELEMENT, IN WORDS Sergey Kitev The Mthemtic Intitute, Reykvik Univerity, IS-03 Reykvik, Icelnd ergey@rui Toufik Mnour Deprtment of Mthemtic,

More information

Review of basic calculus

Review of basic calculus Review of bsic clculus This brief review reclls some of the most importnt concepts, definitions, nd theorems from bsic clculus. It is not intended to tech bsic clculus from scrtch. If ny of the items below

More information

Module 6: LINEAR TRANSFORMATIONS

Module 6: LINEAR TRANSFORMATIONS Module 6: LINEAR TRANSFORMATIONS. Trnsformtions nd mtrices Trnsformtions re generliztions of functions. A vector x in some set S n is mpped into m nother vector y T( x). A trnsformtion is liner if, for

More information

Zero-Sum Magic Graphs and Their Null Sets

Zero-Sum Magic Graphs and Their Null Sets Zero-Sum Mgic Grphs nd Their Null Sets Ebrhim Slehi Deprtment of Mthemticl Sciences University of Nevd Ls Vegs Ls Vegs, NV 89154-4020. ebrhim.slehi@unlv.edu Abstrct For ny h N, grph G = (V, E) is sid to

More information

New Expansion and Infinite Series

New Expansion and Infinite Series Interntionl Mthemticl Forum, Vol. 9, 204, no. 22, 06-073 HIKARI Ltd, www.m-hikri.com http://dx.doi.org/0.2988/imf.204.4502 New Expnsion nd Infinite Series Diyun Zhng College of Computer Nnjing University

More information

Oracular Partially Observable Markov Decision Processes: A Very Special Case

Oracular Partially Observable Markov Decision Processes: A Very Special Case Orculr Prtilly Obervble Mrkov Deciion Procee: A Very Specil Ce Nichol Armtrong-Crew nd Mnuel Veloo Robotic Intitute, Crnegie Mellon Univerity {nrmtro,veloo}@c.cmu.edu Abtrct We introduce the Orculr Prtilly

More information

Acceptance Sampling by Attributes

Acceptance Sampling by Attributes Introduction Acceptnce Smpling by Attributes Acceptnce smpling is concerned with inspection nd decision mking regrding products. Three spects of smpling re importnt: o Involves rndom smpling of n entire

More information

A Fast and Reliable Policy Improvement Algorithm

A Fast and Reliable Policy Improvement Algorithm A Fst nd Relible Policy Improvement Algorithm Ysin Abbsi-Ydkori Peter L. Brtlett Stephen J. Wright Queenslnd University of Technology UC Berkeley nd QUT University of Wisconsin-Mdison Abstrct We introduce

More information

Best Approximation in the 2-norm

Best Approximation in the 2-norm Jim Lmbers MAT 77 Fll Semester 1-11 Lecture 1 Notes These notes correspond to Sections 9. nd 9.3 in the text. Best Approximtion in the -norm Suppose tht we wish to obtin function f n (x) tht is liner combintion

More information

Multi-Armed Bandits: Non-adaptive and Adaptive Sampling

Multi-Armed Bandits: Non-adaptive and Adaptive Sampling CSE 547/Stt 548: Mchine Lerning for Big Dt Lecture Multi-Armed Bndits: Non-dptive nd Adptive Smpling Instructor: Shm Kkde 1 The (stochstic) multi-rmed bndit problem The bsic prdigm is s follows: K Independent

More information

Decision Networks. CS 188: Artificial Intelligence Fall Example: Decision Networks. Decision Networks. Decisions as Outcome Trees

Decision Networks. CS 188: Artificial Intelligence Fall Example: Decision Networks. Decision Networks. Decisions as Outcome Trees CS 188: Artificil Intelligence Fll 2011 Decision Networks ME: choose the ction which mximizes the expected utility given the evidence mbrell Lecture 17: Decision Digrms 10/27/2011 Cn directly opertionlize

More information

New data structures to reduce data size and search time

New data structures to reduce data size and search time New dt structures to reduce dt size nd serch time Tsuneo Kuwbr Deprtment of Informtion Sciences, Fculty of Science, Kngw University, Hirtsuk-shi, Jpn FIT2018 1D-1, No2, pp1-4 Copyright (c)2018 by The Institute

More information

positive definite (symmetric with positive eigenvalues) positive semi definite (symmetric with nonnegative eigenvalues)

positive definite (symmetric with positive eigenvalues) positive semi definite (symmetric with nonnegative eigenvalues) Chter Liner Qudrtic Regultor Problem inimize the cot function J given by J x' Qx u' Ru dt R > Q oitive definite ymmetric with oitive eigenvlue oitive emi definite ymmetric with nonnegtive eigenvlue ubject

More information

Lecture 14: Quadrature

Lecture 14: Quadrature Lecture 14: Qudrture This lecture is concerned with the evlution of integrls fx)dx 1) over finite intervl [, b] The integrnd fx) is ssumed to be rel-vlues nd smooth The pproximtion of n integrl by numericl

More information

Generation of Lyapunov Functions by Neural Networks

Generation of Lyapunov Functions by Neural Networks WCE 28, July 2-4, 28, London, U.K. Genertion of Lypunov Functions by Neurl Networks Nvid Noroozi, Pknoosh Krimghee, Ftemeh Sfei, nd Hmed Jvdi Abstrct Lypunov function is generlly obtined bsed on tril nd

More information

Lecture notes. Fundamental inequalities: techniques and applications

Lecture notes. Fundamental inequalities: techniques and applications Lecture notes Fundmentl inequlities: techniques nd pplictions Mnh Hong Duong Mthemtics Institute, University of Wrwick Emil: m.h.duong@wrwick.c.uk Februry 8, 207 2 Abstrct Inequlities re ubiquitous in

More information

Low-order simultaneous stabilization of linear bicycle models at different forward speeds

Low-order simultaneous stabilization of linear bicycle models at different forward speeds 203 Americn Control Conference (ACC) Whington, DC, USA, June 7-9, 203 Low-order imultneou tbiliztion of liner bicycle model t different forwrd peed A. N. Gündeş nd A. Nnngud 2 Abtrct Liner model of bicycle

More information

Variational Techniques for Sturm-Liouville Eigenvalue Problems

Variational Techniques for Sturm-Liouville Eigenvalue Problems Vritionl Techniques for Sturm-Liouville Eigenvlue Problems Vlerie Cormni Deprtment of Mthemtics nd Sttistics University of Nebrsk, Lincoln Lincoln, NE 68588 Emil: vcormni@mth.unl.edu Rolf Ryhm Deprtment

More information

STOCHASTIC REGULAR LANGUAGE: A MATHEMATICAL MODEL FOR THE LANGUAGE OF SEQUENTIAL ACTIONS FOR DECISION MAKING UNDER UNCERTAINTY

STOCHASTIC REGULAR LANGUAGE: A MATHEMATICAL MODEL FOR THE LANGUAGE OF SEQUENTIAL ACTIONS FOR DECISION MAKING UNDER UNCERTAINTY Interntionl Journl of Mthemtic nd Computer Appliction Reerch (IJMCAR) ISSN 49-6955 Vol. 3, Iue, Mr 3, -8 TJPRC Pvt. Ltd. STOCHASTIC REGULAR LANGUAGE: A MATHEMATICAL MODEL FOR THE LANGUAGE OF SEQUENTIAL

More information

Uncertain Dynamic Systems on Time Scales

Uncertain Dynamic Systems on Time Scales Journl of Uncertin Sytem Vol.9, No.1, pp.17-30, 2015 Online t: www.ju.org.uk Uncertin Dynmic Sytem on Time Scle Umber Abb Hhmi, Vile Lupulecu, Ghu ur Rhmn Abdu Slm School of Mthemticl Science, GCU Lhore

More information

SUMMER KNOWHOW STUDY AND LEARNING CENTRE

SUMMER KNOWHOW STUDY AND LEARNING CENTRE SUMMER KNOWHOW STUDY AND LEARNING CENTRE Indices & Logrithms 2 Contents Indices.2 Frctionl Indices.4 Logrithms 6 Exponentil equtions. Simplifying Surds 13 Opertions on Surds..16 Scientific Nottion..18

More information

221B Lecture Notes WKB Method

221B Lecture Notes WKB Method Clssicl Limit B Lecture Notes WKB Method Hmilton Jcobi Eqution We strt from the Schrödinger eqution for single prticle in potentil i h t ψ x, t = [ ] h m + V x ψ x, t. We cn rewrite this eqution by using

More information

SPACE VECTOR PULSE- WIDTH-MODULATED (SV-PWM) INVERTERS

SPACE VECTOR PULSE- WIDTH-MODULATED (SV-PWM) INVERTERS CHAPTER 7 SPACE VECTOR PULSE- WIDTH-MODULATED (SV-PWM) INVERTERS 7-1 INTRODUCTION In Chpter 5, we briefly icue current-regulte PWM inverter uing current-hyterei control, in which the witching frequency

More information

Chapters 4 & 5 Integrals & Applications

Chapters 4 & 5 Integrals & Applications Contents Chpters 4 & 5 Integrls & Applictions Motivtion to Chpters 4 & 5 2 Chpter 4 3 Ares nd Distnces 3. VIDEO - Ares Under Functions............................................ 3.2 VIDEO - Applictions

More information

VSS CONTROL OF STRIP STEERING FOR HOT ROLLING MILLS. M.Okada, K.Murayama, Y.Anabuki, Y.Hayashi

VSS CONTROL OF STRIP STEERING FOR HOT ROLLING MILLS. M.Okada, K.Murayama, Y.Anabuki, Y.Hayashi V ONTROL OF TRIP TEERING FOR OT ROLLING MILL M.Okd.Murym Y.Anbuki Y.yhi Wet Jpn Work (urhiki Ditrict) JFE teel orportion wkidori -chome Mizuhim urhiki 7-85 Jpn Abtrct: trip teering i one of the mot eriou

More information

The use of a so called graphing calculator or programmable calculator is not permitted. Simple scientific calculators are allowed.

The use of a so called graphing calculator or programmable calculator is not permitted. Simple scientific calculators are allowed. ERASMUS UNIVERSITY ROTTERDAM Informtion concerning the Entrnce exmintion Mthemtics level 1 for Interntionl Bchelor in Communiction nd Medi Generl informtion Avilble time: 2 hours 30 minutes. The exmintion

More information

Improper Integrals. Type I Improper Integrals How do we evaluate an integral such as

Improper Integrals. Type I Improper Integrals How do we evaluate an integral such as Improper Integrls Two different types of integrls cn qulify s improper. The first type of improper integrl (which we will refer to s Type I) involves evluting n integrl over n infinite region. In the grph

More information

A New Grey-rough Set Model Based on Interval-Valued Grey Sets

A New Grey-rough Set Model Based on Interval-Valued Grey Sets Proceedings of the 009 IEEE Interntionl Conference on Systems Mn nd Cybernetics Sn ntonio TX US - October 009 New Grey-rough Set Model sed on Intervl-Vlued Grey Sets Wu Shunxing Deprtment of utomtion Ximen

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 8, August ISSN

International Journal of Scientific & Engineering Research, Volume 4, Issue 8, August ISSN Interntionl Journl of Scientific & Engineering Reerc Volume Iue 8 ugut- 68 ISSN 9-558 n Inventory Moel wit llowble Sortge Uing rpezoil Fuzzy Number P. Prvti He & ocite Profeor eprtment of Mtemtic ui- E

More information

A sequence is a list of numbers in a specific order. A series is a sum of the terms of a sequence.

A sequence is a list of numbers in a specific order. A series is a sum of the terms of a sequence. Core Module Revision Sheet The C exm is hour 30 minutes long nd is in two sections. Section A (36 mrks) 8 0 short questions worth no more thn 5 mrks ech. Section B (36 mrks) 3 questions worth mrks ech.

More information

1 The Riemann Integral

1 The Riemann Integral The Riemnn Integrl. An exmple leding to the notion of integrl (res) We know how to find (i.e. define) the re of rectngle (bse height), tringle ( (sum of res of tringles). But how do we find/define n re

More information

MATH34032: Green s Functions, Integral Equations and the Calculus of Variations 1

MATH34032: Green s Functions, Integral Equations and the Calculus of Variations 1 MATH34032: Green s Functions, Integrl Equtions nd the Clculus of Vritions 1 Section 1 Function spces nd opertors Here we gives some brief detils nd definitions, prticulrly relting to opertors. For further

More information

AMATH 731: Applied Functional Analysis Fall Additional notes on Fréchet derivatives

AMATH 731: Applied Functional Analysis Fall Additional notes on Fréchet derivatives AMATH 731: Applied Functionl Anlysis Fll 214 Additionl notes on Fréchet derivtives (To ccompny Section 3.1 of the AMATH 731 Course Notes) Let X,Y be normed liner spces. The Fréchet derivtive of n opertor

More information

Analytical Methods Exam: Preparatory Exercises

Analytical Methods Exam: Preparatory Exercises Anlyticl Methods Exm: Preprtory Exercises Question. Wht does it men tht (X, F, µ) is mesure spce? Show tht µ is monotone, tht is: if E F re mesurble sets then µ(e) µ(f). Question. Discuss if ech of the

More information

Coalgebra, Lecture 15: Equations for Deterministic Automata

Coalgebra, Lecture 15: Equations for Deterministic Automata Colger, Lecture 15: Equtions for Deterministic Automt Julin Slmnc (nd Jurrin Rot) Decemer 19, 2016 In this lecture, we will study the concept of equtions for deterministic utomt. The notes re self contined

More information

Math 1B, lecture 4: Error bounds for numerical methods

Math 1B, lecture 4: Error bounds for numerical methods Mth B, lecture 4: Error bounds for numericl methods Nthn Pflueger 4 September 0 Introduction The five numericl methods descried in the previous lecture ll operte by the sme principle: they pproximte the

More information

Jim Lambers MAT 169 Fall Semester Lecture 4 Notes

Jim Lambers MAT 169 Fall Semester Lecture 4 Notes Jim Lmbers MAT 169 Fll Semester 2009-10 Lecture 4 Notes These notes correspond to Section 8.2 in the text. Series Wht is Series? An infinte series, usully referred to simply s series, is n sum of ll of

More information

Information Leakage as a Model for Quality of Anonymity Networks

Information Leakage as a Model for Quality of Anonymity Networks Clevelnd tte Univerity Enggedcholrhip@CU Electricl Engineering & Computer cience Fculty Publiction Electricl Engineering & Computer cience Deprtment 4-2009 Informtion Lekge Model for Qulity of Anonymity

More information

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature CMDA 4604: Intermedite Topics in Mthemticl Modeling Lecture 19: Interpoltion nd Qudrture In this lecture we mke brief diversion into the res of interpoltion nd qudrture. Given function f C[, b], we sy

More information